His point is, since big data applications are often off the beaten IT path, big data professionals must solve “problems that companies don’t even know they have – as their insights highlight bottlenecks or inefficiencies in the production, marketing or delivery processes,” often with “data which does not fit comfortably into tables and charts, such as human speech and writing.” Continue reading →
At the very first TDWI Conference, Duane Hufford described a phenomenon he called “embedded data”, now more commonly called “overloaded data”, where two or more concepts are stuffed into a single data field (“Metadata Repositories,” TDWI Conference 1995). He described and portrayed in graphics three types of overloaded data. Almost 20 years later, overloaded data remains rampant, but Mr Hufford’s ideas, presented below with updated examples, are unfortunately not widely discussed.
Overloaded data breeds in areas that, for one reason or another, have not been exposed to sound data management techniques. Big data acquisition typically loads data uncleansed, shifting the burden of unpacking overloaded fields to the receiver (pity the poor data scientist spending 70% of her time acquiring and cleaning data!).
One might refer to non-overloaded data as “atomic”. Beyond making data harder to use, overloaded data requires more code to manage than atomic data (see why in the sections below), so by extension it increases IT costs.
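To make the extra-code point concrete, here is a minimal Python sketch of what a receiver has to write just to get at the concepts hidden in one overloaded field. Everything in it is made up for illustration: the `RED-LG-0042` code format, the `Product` class and the `unpack_product_code` function are assumptions of mine, not examples from Mr Hufford’s presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Atomic form: one concept per field."""
    color: str
    size: str
    serial: int


def unpack_product_code(code: str) -> Product:
    """Split a hypothetical overloaded code such as 'RED-LG-0042'.

    The color-size-serial layout is an assumption for this sketch.
    """
    parts = code.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected product code format: {code!r}")
    color, size, serial = parts
    if not serial.isdigit():
        raise ValueError(f"serial is not numeric in: {code!r}")
    return Product(color=color.upper(), size=size.upper(), serial=int(serial))
```

With atomic columns none of this parsing or validation would exist. With the overloaded field, every consumer either repeats it or depends on a shared copy of it.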
Here’s a field guide to three different types of overloaded data, associated risks, and how to avoid them: Continue reading →
Recently there was a great post at DZone recounting how one “tech savvy startup” moved away from its NoSQL database management system to a relational one. The writer, Matt Butcher, lays out the reasons under these main points:
Application developers and business people accessing relational databases need data dictionaries in order to properly load or query a database. The data dictionary provides a source of information about the model for those without model access, including entity/table and attribute/column definitions, datatypes, primary keys, relationships among tables, and so on. The data dictionary also provides data modelers with a useful cross reference that improves modeling productivity.
It is particularly useful for the dictionary to be a filterable/sortable Excel document, but out of the box ERwin, one of the leading data modeling tools, includes a notably inflexible reporting capability. Luckily, it is possible to directly query the ERwin “metamodel”. However, I found the ERwin documentation a bit hard to decipher and not quite accurate. Hopefully this post will save modelers some steps in figuring out how to query the metamodel.
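The ERwin metamodel queries themselves are not part of this excerpt, and I won’t guess at them here. As a stand-in, the sketch below shows the general shape of a data dictionary extract using SQLite’s own catalog (`sqlite_master` and `PRAGMA table_info`) rather than ERwin. The function names and the output columns are my own choices, and the result is written as CSV text, which Excel can open, filter and sort.

```python
import csv
import io
import sqlite3

FIELDS = ["table", "column", "datatype", "nullable", "primary_key"]


def build_data_dictionary(conn: sqlite3.Connection) -> list:
    """Return one row per column, read from SQLite's catalog (not ERwin)."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, column, datatype, notnull, _default, pk in info:
            rows.append({
                "table": table,
                "column": column,
                "datatype": datatype,
                "nullable": not notnull,
                "primary_key": pk > 0,
            })
    return rows


def to_csv(rows: list) -> str:
    """Render the dictionary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A fuller dictionary would also carry definitions and relationships, which live in the modeling tool rather than in the database catalog.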
The data integration process is traditionally thought of in three steps: extract, transform, and load (ETL). Putting aside the often-discussed order of their execution, “extract” means pulling data out of a source system, “transform” means validating the source data and converting it to the desired standard (e.g. yards to meters), and “load” means storing the data at the destination.
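The three steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the source and destination are in-memory lists, and the record fields `id`, `length_yd` and `length_m` are hypothetical names. The one fixed fact is the conversion factor, since a yard is defined as exactly 0.9144 meters.

```python
YARDS_TO_METERS = 0.9144  # exact, by definition of the international yard


def extract(source):
    """Pull raw records out of the source system (here, an in-memory list)."""
    return list(source)


def transform(records):
    """Validate each record and convert its length from yards to meters."""
    clean = []
    for record in records:
        yards = record.get("length_yd")
        if not isinstance(yards, (int, float)) or yards < 0:
            # Invalid row: skipped here; a real pipeline would log or quarantine it.
            continue
        clean.append({"id": record["id"], "length_m": yards * YARDS_TO_METERS})
    return clean


def load(records, destination):
    """Store the transformed records at the destination; return the row count."""
    destination.extend(records)
    return len(records)
```

Chaining them as `load(transform(extract(source)), destination)` gives the classic E-T-L order. Running the conversion after loading instead is the reordering that the “often-discussed” debate is about.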
An additional step, data “enrichment”, has recently emerged, offering significant improvement in business value of integrated data. Applying it effectively requires a foundation of sound data management practices. Continue reading →
The well-publicized problems with healthcare.gov are disturbing, especially when we remember they might result in many people continuing without health insurance. But it seemed a step in the right direction when a recent news report differentiated between “front end” and “back end” problems. The back end problems were data issues, like a married applicant with two kids being sent to an insurer’s systems as a man with three wives.
Coincidentally, I recently responded to a questionnaire about health care data. I’ve paraphrased the questions and my responses below. Perhaps the views of someone who’s spent a lot of time in the health care engine room might provide some useful perspective. Continue reading →
A technique for reporting requirements has emerged as the de facto standard in the business intelligence community. The technique, which emerged in the mid-2000s, is new enough to be as yet unacknowledged by the requirements analysis powers that be. David Loshin describes how it works in this 2007 post:
Start with a business question about how to monitor a business process using a metric, like “How many widgets have been shipped by size each week by warehouse?” Continue reading →
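One way to read the example question is as a measure (widgets shipped) broken out by three dimensions (size, week and warehouse). The Python sketch below answers it under that reading. The shipment record layout, the field names and the use of ISO week numbers are my assumptions, not part of Mr Loshin’s post.

```python
from collections import Counter
from datetime import date


def shipped_by_week_warehouse_size(shipments):
    """Total shipped quantity per (ISO year, ISO week, warehouse, size)."""
    totals = Counter()
    for shipment in shipments:
        iso = shipment["ship_date"].isocalendar()  # (year, week, weekday)
        key = (iso[0], iso[1], shipment["warehouse"], shipment["size"])
        totals[key] += shipment["quantity"]
    return totals


# Hypothetical sample data for illustration only.
sample_shipments = [
    {"ship_date": date(2024, 1, 1), "warehouse": "East", "size": "L", "quantity": 10},
    {"ship_date": date(2024, 1, 3), "warehouse": "East", "size": "L", "quantity": 5},
    {"ship_date": date(2024, 1, 8), "warehouse": "East", "size": "L", "quantity": 7},
    {"ship_date": date(2024, 1, 8), "warehouse": "West", "size": "S", "quantity": 2},
]
```

Each “by” in the question becomes part of the grouping key, and the metric becomes the value being summed, which is why a question phrased this way translates so directly into a report.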
I recently stumbled upon one of The Martin Agency’s hilarious Geico caveman ads and wondered, rather geekily, why they didn’t do one about data analysis. I think if a caveman suddenly arrived in the 2010s, he or she would see parallels between his or her life and the activities of today’s knowledge worker. When I thought it through, it seemed obvious that knowledge workers need to be more like farmers and less like hunter/gatherers if they want to achieve the full potential of business intelligence.
I hold a strong prejudice that IT paradigms are useful for about 30 years. The PC was dominant from 1980 to 2010, “online” mainframe systems from 1970 to 2000, and so on. If that’s the case then time’s up for Bill Inmon’s data warehousing framework. So far no widely held pattern has emerged to help us envision data management in today’s big data, mobile BI, end-user visualization, predictive analytics world, but at their recent Business Technology conference, Forrester Research took a swing at it by presenting their 2009 “hub and spoke” organizational strategy as a data management vision. Continue reading →