Recently I’ve had the opportunity to dig deeply into the CMMI Data Management Maturity model. Since its release, the DMM model has emerged as the de facto standard data management maturity framework (I’ve listed other frameworks at the end of this post).
I’m deeply impressed by the completeness and polish of the DMM model as a comprehensive catalog of processes required for effective data management. Even after decades in the business the broad scope and business focus of the model changed the way I think about data management.
Over the past year I’ve reviewed what seem like countless plans for enterprise data warehouses. The plans address real problems in the organizations involved: the organization needs better data to recognize trends and react faster to opportunities and challenges; business measures and analyses are unavailable because data in source systems is inconsistent, incomplete, erroneous, or contains current values but no history; and so on.
The plans detail source system data and its integration into a central data hub. But the ones I’m referring to don’t tell how the data will be delivered, or portray a specific vision of how the data is to drive business value. Instead, their business case rests on what I’ll call the “railroad hypothesis”. No one could have predicted how the railroads enabled development of the West, so the improved data infrastructure will create order of magnitude improvements in ability to access, share, and utilize data, from which order of magnitude business benefits will follow.* All too often these plans just build bridges to nowhere. Continue reading →
There’s an unfortunate and rather rude saying about assumptions that I’ve found popular among IT folks I’ve worked with. I say unfortunate because, to me, assumptions that are recognized early and handled the right way are a key to successful projects. Technical players who use assumptions well can help set projects on the right path long before they go astray.
Sometimes on waterfall and hybrid projects technical players are asked to estimate work early, before requirements are complete. My instinctive reaction is not to provide an ungrounded estimate, but that’s not helpful. The way to handle this uncomfortable uncertainty is to fill out the unknowns with assumptions: detailed, realistic statements that provide grounding for your estimate. Continue reading →
A quick Google search seems to reveal if you manage People, Process, and Technology you’ve got everything covered. That’s simply not the case. Data is separate and distinct from the things it describes — namely people, processes, and technologies — and organizations must separately and intentionally manage it.
The data management message seems a tough one to deliver effectively. Data management interest groups have hammered at it for years, but a sometimes preachy and jargon laden approach relying on data quality train wreck stories hasn’t generally loosened corporate purse strings. Yes, financial companies’ data-first successes in the 1990s paved the way for the ’00s dot com juggernauts, whose market capitalization stems largely from innovative data management. Yet, we still have huge personal data breaches at some of our most trusted companies, and data scientists spend the bulk of their valuable time acquiring, cleaning, and integrating poorly organized data.
The first steps are often the hardest, so here’s a short, no jargon, big picture guide to getting started with effective data management in three steps:
There’s consensus among data quality experts that, generally speaking data quality is pretty much bad (here, here, and here). Data quality approaches generally focus on profiling, managing, and correcting data after it is already in the system. This makes sense in a data science or warehousing context, which is often where quality problems surface. To quote William McKnight at the first of those sources:
“Data quality is no longer the domain of just the data warehouse. It is accepted as an enterprise responsibility. If we have the tools, experiences, and best practices, why, then, do we continue to struggle with the problem of data quality?”
So if the data quality problem is Garbage In Garbage Out (GIGO), then I would think that it would be easy to find data quality guidelines for app dev, and that those guidelines would be lightweight and helpful to those projects. Based on my research there are few to none such sources (please add them to the comments if you find otherwise).
So, all that said here’s my cut at app dev data quality guidelines by project activity: Continue reading →
As you’ve read on this site and many others, the database world is well into a transition from a relational focus to a focus on non-relational tools. While the relational approach underpins most organizations’ data management cycles, I’d venture to say that all have a big chunk of big data, NoSQL, unstructured data, and more in their five-year plans, and that chunk is what’s getting most of the executive “mind share”, to use the vernacular.
Some are well along the way in their big data learning adventure, but others haven’t started yet. One thing about this IT revolution is that there’s no shortage of highly accessible training options. But several people have complained to me about the sheer quantity of options, not to mention the sheer number of new words the novice needs to learn in order to figure out what the heck big data is.
So here’s a very short list of training options accessible to the IT professional who is a rank big data beginner, starting with a very brief classification of the tools that I hope provides a some context. Continue reading →
His point is, since big data applications are often off the beaten IT path, big data professionals must solve “problems that companies don’t even know they have – as their insights highlight bottlenecks or inefficiencies in the production, marketing or delivery processes,” often with “data which does not fit comfortably into tables and charts, such as human speech and writing.” Continue reading →
Logical data modeling is one of my tools of choice in business analysis and requirements definition. That’s not particularly unusual – the BABOK (Business Analysis Body of Knowledge) recognizes the Entity-Relationship Diagram (ERD) as a business analysis tool, and for many organizations it’s a non-optional part of requirements document templates.
In practice, however, data models in requirements packages often include many-to-many relationships. I’ve heard experienced data modelers advocate this practice, and it unfortunately seems consistent with the “just enough, just in time” approach associated with agile culture.
In my experience unresolved M:M relationships indicate equally unresolved business questions. The result: schedule delays and budget overruns as missed requirements are built back in to the design, or the familiar “that’s not what we wanted” reaction during User Acceptance Testing (UAT). Continue reading →
At the very first TDWI Conference, Duane Hufford described a phenomenon he called “embedded data”, now more commonly called “overloaded data”, where two or more concepts are stuffed into a single data field (“Metadata Repositories,” TDWI Conference 1995). He described and portrayed in graphics three types of overloaded data. Almost 20 years later, overloaded data remains rampant but Mr Hufford’s ideas, presented below with updated examples, are unfortunately not widely discussed.
Overloaded data breeds in areas not exposed to sound data management techniques for one reason or the other. Big data acquisition typically loads data uncleansed, shifting the burden of unpacking overloaded fields to the receiver (pity the poor data scientist spending 70% of her time acquiring and cleaning data!)
One might refer to non-overloaded data as “atomic”. Beyond making data harder to use, overloaded data requires more code to manage than atomic data (see why in the sections below) so by extension it increases IT costs.
Here’s a field guide to three different types of overloaded data, associated risks, and how to avoid them: Continue reading →