There’s consensus among data quality experts that, generally speaking data quality is pretty much bad (here, here, and here). Data quality approaches generally focus on profiling, managing, and correcting data after it is already in the system. This makes sense in a data science or warehousing context, which is often where quality problems surface. To quote William McKnight at the first of those sources:
“Data quality is no longer the domain of just the data warehouse. It is accepted as an enterprise responsibility. If we have the tools, experiences, and best practices, why, then, do we continue to struggle with the problem of data quality?”
So if the data quality problem is Garbage In Garbage Out (GIGO), then I would think that it would be easy to find data quality guidelines for app dev, and that those guidelines would be lightweight and helpful to those projects. Based on my research there are few to none such sources (please add them to the comments if you find otherwise).
So, all that said here’s my cut at app dev data quality guidelines by project activity: Continue reading →
As you’ve read on this site and many others, the database world is well into a transition from a relational focus to a focus on non-relational tools. While the relational approach underpins most organizations’ data management cycles, I’d venture to say that all have a big chunk of big data, NoSQL, unstructured data, and more in their five-year plans, and that chunk is what’s getting most of the executive “mind share”, to use the vernacular.
Some are well along the way in their big data learning adventure, but others haven’t started yet. One thing about this IT revolution is that there’s no shortage of highly accessible training options. But several people have complained to me about the sheer quantity of options, not to mention the sheer number of new words the novice needs to learn in order to figure out what the heck big data is.
So here’s a very short list of training options accessible to the IT professional who is a rank big data beginner, starting with a very brief classification of the tools that I hope provides a some context. Continue reading →
At the very first TDWI Conference, Duane Hufford described a phenomenon he called “embedded data”, now more commonly called “overloaded data”, where two or more concepts are stuffed into a single data field (“Metadata Repositories,” TDWI Conference 1995). He described and portrayed in graphics three types of overloaded data. Almost 20 years later, overloaded data remains rampant but Mr Hufford’s ideas, presented below with updated examples, are unfortunately not widely discussed.
Overloaded data breeds in areas not exposed to sound data management techniques for one reason or the other. Big data acquisition typically loads data uncleansed, shifting the burden of unpacking overloaded fields to the receiver (pity the poor data scientist spending 70% of her time acquiring and cleaning data!)
One might refer to non-overloaded data as “atomic”. Beyond making data harder to use, overloaded data requires more code to manage than atomic data (see why in the sections below) so by extension it increases IT costs.
Here’s a field guide to three different types of overloaded data, associated risks, and how to avoid them: Continue reading →
Recently there was a great post at Dzone recounting how one “tech savvy startup” moved away from its NoSQL database management system to a relational one. The writer, Matt Butcher, plays out the reasons under these main points:
The well-publicized problems with healthcare.gov are disturbing, especially when we remember they might result in many continuing without health insurance. But it seemed a step in the right direction when recent a news report differentiated between “front end” and “back end” problems. The back end problems were data issues, like a married applicant with two kids being sent to an insurer’s systems as a man with three wives.
Coincidently, I recently responded to a questionnaire about health care data. I’ve paraphrased the questions and my responses below. Perhaps the views of someone who’s spent a lot of time in the health care engine room might provide some useful perspective. Continue reading →
Recently I read a thoughtful post
at the PASS Business Analytics Conference site discussing how different the world is now for database professionals. Author Chris Webb focuses on the data science side in this post. His analysis made me think of the challenges and opportunities “big data” serves up to relational database designers.
To me these challenges are fundamental. Big Data and NoSQL bring lots of what we know about data elements, inherent data design, and data management into question. I think considering these elements closely leads to a sensible to-do list for relational database professionals. Continue reading →
As important as it is, data modeling has always had a geeky, faintly impractical tinge to some. I’ve seen application development projects proceed with a suboptimal, “good enough”, model. The resulting systems might otherwise be well-architected, but sometimes strange vulnerabilities emerge that track directly to data design flaws.
Recently I saw an example where a “good enough” data design, similar to the one pictured, enabled a significant application bug.
One common theme in recent tectonic shifts in information technology is data management. Analyzing customer responses may require combing through unstructured emails and tweets. Timely analysis of web interactions may demand a big data solution. Deployment of data visualization tools to users may dictate redesign of warehouses and marts. The data architect is a key player in harnessing and capitalizing on new data technologies. Continue reading →
In some presentations, I assert that top-down data modeling should result in not only a business-consistent model but also a pretty well normalized model.
One of the basic concepts behind normalization is functional dependency. In layperson’s terms, functional dependency means separating entities from each other and putting attributes into the obviously correct entity. For example, a business person knows that item color doesn’t belong in the order table because it describes the item, not the order. Everyone knows that the order isn’t green! Continue reading →