The term “trust” implies absolutes, and that’s a good thing for relationships and art. However, in the business of data management, framing trust in data in true or false terms puts data governance at odds with good practice. A more nuanced view that recognizes the usefulness of not-fully-trusted data can bring vitality and relevance to data governance, and help it drive rather than restrict business results.
The Wikipedia entry — for many a first introduction to data governance — cites Bob Seiner’s definition: “Data governance is the formal execution and enforcement of authority over the management of data and data related assets.” The entry is accurate and useful, but words like “trust”, “financial misstatement”, and “adverse event” lead the reader to focus on the risk management role of governance.
However, the other role of data governance is to help make data available, useful, and understood. That means sometimes making data that’s not fully trusted available and easy to use.
That’s not nearly as strange as it may sound. For example, every month the Bureau of Labor Statistics releases admittedly incomplete statistics.* The agency releases monthly employment figures to great fanfare, and then few notice as the stats are revised across the two following months. Employment numbers are based on surveys of hiring organizations, some of which are late turning in their forms. Even though the initial number isn’t right, and is often revised significantly, it is apparently “good enough”, featuring prominently on news outlets as a key economic indicator.
Similarly, there’s a conflict going on in the data management world between “fast data” and well-governed data. Recently, many advanced solutions provide real-time analyses on platforms like Kafka and Storm that are capable of managing very high-volume data streams. The same applies to real-time operations on NoSQL DBMS platforms like Cassandra. Certainly it’s true that
“Regardless of data volume, you still need to do data management. In big data discussions, there’s often Hadoop data and social media data, along with whatever data you have internally. Regardless of all that, you still need to integrate the data. You still need to cleanse and rationalize it somehow.”
However, if you always emphasize integration and cleanliness then the data may no longer be fast enough to provide business value. Here are some examples:
- Eventual Consistency: This post explains how designing for complete and immediate consistency of data in different databases is misguided. The presenter, from Netflix, talks through examples from Netflix and Amazon, then describes how much the banking system relies on “eventual consistency”. When you cash a check, you get your money, even though there’s still a chance that the check will bounce. The bank accepts the transaction in spite of the risk that it is invalid, and applies after-the-fact corrections if needed (like returned-check fees).
- Solicitation: It’s a good bet that there’s mail waiting in your mailbox today that’s addressed to the previous occupant, misspells your name, or simply doesn’t interest you at all. As much as we the solicitees would appreciate it if marketers would clean up their prospect databases, it’s just not worth it to them when they make a great return on a very small response rate.
- Data Science: Data analysts apply the scientific method, which means using their imaginations to generate and test hypotheses. In many industries they are under pressure to work quickly in order to keep up with increasingly fast-paced competition. Their sources include scrubbed, warehoused data, but more likely than not they’re also pulling in data that hasn’t yet been cleansed and verified. Maybe they’re correlating automated feeds from the plant floor with customer quality complaints, or political survey data against sales figures, and so on. Idea generation is always faster than incorporation of new data into a governance framework, so ungoverned data must be available to data scientists or their research will be hamstrung.
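The eventual-consistency pattern in the check-cashing example above can be sketched in a few lines: accept the transaction optimistically, then reconcile after the fact. This is an illustrative toy (the `Account` class, method names, and the $25 fee are invented for the example, not taken from the post being discussed):

```python
from dataclasses import dataclass, field

@dataclass
class Account:
    balance: float = 0.0
    pending: dict = field(default_factory=dict)  # check_id -> amount

    def deposit_check(self, check_id: str, amount: float) -> None:
        # Optimistically credit the funds before the check clears --
        # the customer gets their money right away.
        self.balance += amount
        self.pending[check_id] = amount

    def settle(self, check_id: str, cleared: bool, fee: float = 25.0) -> None:
        amount = self.pending.pop(check_id)
        if not cleared:
            # After-the-fact correction: reverse the credit, charge a fee.
            self.balance -= amount + fee

acct = Account(balance=100.0)
acct.deposit_check("chk-1", 50.0)
print(acct.balance)                  # 150.0 -- usable before the check clears
acct.settle("chk-1", cleared=False)
print(acct.balance)                  # 75.0 -- corrected once the check bounced
```

The point is that the system is briefly “wrong” on purpose: availability now, correction later, exactly the trade the bank makes.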
Clearly there are many instances where data must be of the highest quality and reliability, like security systems and regulatory reporting. But trying to make sure all data is clean and verified can grind some business processes to a halt. Data governance teams that take that kind of absolutist approach quickly become irrelevant as business teams find other ways to solve their problems.
By contrast, data governance teams that adopt a nuanced approach, understanding how different business processes use data and customizing quality standards to business needs, help drive business results by making the right data available.
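One way to picture “customizing quality standards based on business needs” is as different rule sets applied to the same record depending on who will use it. The rule names, thresholds, and uses below are hypothetical, invented purely to illustrate the idea:

```python
# Per-use quality bars: the same record can be fit for one purpose
# and unfit for another. All names here are illustrative assumptions.

RULES = {
    "regulatory_reporting": ["has_id", "has_amount", "amount_verified"],
    "marketing_mailer":     ["has_address"],
    "data_science":         [],  # exploratory work accepts raw records
}

CHECKS = {
    "has_id":          lambda r: bool(r.get("id")),
    "has_amount":      lambda r: r.get("amount") is not None,
    "amount_verified": lambda r: r.get("verified", False),
    "has_address":     lambda r: bool(r.get("address")),
}

def fit_for_use(record: dict, use: str) -> bool:
    """A record is 'good enough' if it passes every rule for this use."""
    return all(CHECKS[rule](record) for rule in RULES[use])

record = {"id": "A1", "amount": 9.99, "address": "10 Main St"}
print(fit_for_use(record, "regulatory_reporting"))  # False -- not yet verified
print(fit_for_use(record, "marketing_mailer"))      # True
print(fit_for_use(record, "data_science"))          # True
```

The record that fails the regulatory bar still drives the mailer and the analysis, which is exactly the “right data available” outcome a nuanced governance team is after.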
* Why does the establishment survey have revisions? The establishment survey revises published estimates to improve its data series by incorporating additional information that was not available at the time of the initial publication of the estimates. The establishment survey revises its initial monthly estimates twice, in the immediately succeeding 2 months, to incorporate additional sample receipts from respondents in the survey and recalculated seasonal adjustment factors. For more information on the monthly revisions, please visit www.bls.gov/ces/cesrevinfo.htm. On an annual basis, the establishment survey incorporates a benchmark revision that re-anchors estimates to nearly complete employment counts available from unemployment insurance tax records. The benchmark helps to control for sampling and modeling errors in the estimates. For more information on the annual benchmark revision, please visit www.bls.gov/web/empsit/cesbmart.htm.