Bob Lambert

Data Quality, Evolved

Data quality doesn’t have to be a train wreck. Increased regulatory scrutiny, NoSQL performance gains, and the needs of data scientists are quietly changing views and approaches toward data quality. The result: a pathway to optimism and data quality improvement.

Here’s how you can get on the new and improved data quality train in each of those three areas:

Regulatory Scrutiny: Assess and Publicize Data Quality

At the recent EDW17 conference, Bob Schmidt presented an innovative approach to measuring and publicizing the quality of financial data. Mr. Schmidt is a data steward at Wells Fargo and author of the excellent Data Modeling for Information Professionals (insightful, funny, and, even after 19 years and the big data revolution, still a useful introduction to object and data principles).

The first step of his two-step process is to focus responsibility for data quality: identify who is accountable for the quality of, for example, mortgage data. Accountabilities might differ across the data life cycle (that is, for defining, collecting, presenting, and so on). What matters is that those accountabilities are known and unambiguous.
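
A minimal sketch of what “known and unambiguous” accountability might look like in practice; the domain, life-cycle stages, and role names here are hypothetical illustrations, not taken from Mr. Schmidt’s presentation.

```python
# Hypothetical accountability register: for each data domain and life-cycle
# stage, exactly one named role is accountable. Entries are illustrative.
ACCOUNTABILITY = {
    ("mortgage", "define"):  "Mortgage Product Data Steward",
    ("mortgage", "collect"): "Loan Origination Operations Lead",
    ("mortgage", "present"): "Consumer Lending Reporting Manager",
}

def accountable_for(domain: str, stage: str) -> str:
    """Return the single accountable role, or fail loudly if none is assigned."""
    try:
        return ACCOUNTABILITY[(domain, stage)]
    except KeyError:
        raise LookupError(f"No accountable role assigned for {domain}/{stage}") from None

if __name__ == "__main__":
    print(accountable_for("mortgage", "collect"))
```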

The next step is to measure data quality and publish the results. The key here is to designate a data assessment team expert in “assays” of highly utilized, business-critical information. Assessment efforts would be rightsized by focusing on the most-used or highest-value data, and accuracy would be measured with selective audits. For example, stored mortgage data might be compared with randomly selected images of original mortgage documents.
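
As a rough illustration of the “assay” idea, here is a sketch of a selective audit: draw a random sample of stored records, compare each against the value re-keyed from the original document image, and report an accuracy rate. The sample size, field values, and comparison source are assumptions for illustration only.

```python
import random

def audit_accuracy(stored_records, source_of_truth, sample_size=50, seed=42):
    """Estimate accuracy of a stored data element by auditing a random sample.

    stored_records:  dict of record_id -> stored value (e.g., mortgage interest rate)
    source_of_truth: callable returning the value re-keyed from the original
                     document image for a given record_id
    """
    rng = random.Random(seed)
    sample = rng.sample(sorted(stored_records), k=min(sample_size, len(stored_records)))
    matches = sum(1 for rid in sample if stored_records[rid] == source_of_truth(rid))
    return matches / len(sample)

# Illustrative use: pretend the document images agree with storage except one record.
stored = {f"LN{i:04d}": 4.25 for i in range(200)}
stored["LN0007"] = 4.52   # a keying error
accuracy = audit_accuracy(stored, lambda rid: 4.25, sample_size=50)
print(f"Sampled accuracy: {accuracy:.1%}")
```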

As to feasibility, Mr. Schmidt presented an example showing that two mid-level FTE analysts, supported by impact analysis tools like Teleran iSight, could run a reasonable data quality program covering 300 or so data element assessments per year. Is a reasonable understanding of data quality risk worth two FTEs for organizations in finance, healthcare, or manufacturing, for example? I would think so.
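
A quick back-of-the-envelope check on that workload: only the two analysts and 300 assessments come from the example; the working-weeks figure is my assumption.

```python
# Back-of-the-envelope capacity check for the two-analyst program.
# The 46 working weeks per analyst is an assumed figure, not from the talk.
analysts = 2
assessments_per_year = 300
working_weeks = 46

per_analyst_per_year = assessments_per_year / analysts       # 150.0
per_analyst_per_week = per_analyst_per_year / working_weeks  # ~3.3
print(f"{per_analyst_per_year:.0f} assessments per analyst per year, "
      f"about {per_analyst_per_week:.1f} per week")
```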

Apply NoSQL Data Quality Solutions in the Cloud

Sure, it sounds like a bunch of buzzwords, but in a big data world data quality is a big data problem. A company called Reltio offers a Cassandra-based polyglot solution that lets organizations upload reference and transactional data from all sources to a secure cloud data store, enrich proprietary data with third-party reference data, and make that data available to operational and analytical applications.

Much of Reltio’s publicity, like this article from Ajay Khanna, emphasizes analytic breakthroughs enabled by well-organized reference data on a modern database platform. However, standard operational processes also benefit from the data quality gains that this approach enables.

Many organizations suffer real customer and supplier friction due to differences in master data among different applications. Reltio and products like it offer the ability to interface those applications to a service that provides a single location for managing reference and master data. So rather than many focal points for maintaining, say, information about your largest customer, there’s a single place for making sure it is correct.
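
The operational payoff is easiest to see as a “golden record” lookup: rather than each application keeping its own copy of customer master data, every application asks one master data service. The service URL, endpoint, and response fields below are hypothetical and are not Reltio’s actual API.

```python
import requests

MDM_BASE_URL = "https://mdm.example.com/api"   # hypothetical master data service

def get_customer_golden_record(customer_id: str) -> dict:
    """Fetch the single authoritative ("golden") customer record.

    Ordering, billing, and analytics applications all call this instead of
    maintaining their own copies of customer master data.
    """
    resp = requests.get(f"{MDM_BASE_URL}/customers/{customer_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()   # e.g., {"id": ..., "legal_name": ..., "billing_address": ...}

# Hypothetical usage from an order-entry application:
# customer = get_customer_golden_record("CUST-000123")
# ship_to = customer["billing_address"]
```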

Implementing a master data management (MDM) platform does come with risk. For example, any significant architecture change is also a cultural change, and any organization managing sensitive data must carefully evaluate cloud options. On the upside, improved data quality brings significant gains, including breakthroughs in service consistency and analytics potential.

Serve Data Scientists’ Unique Data Quality Needs

To Harald Smith of Trillium, data quality is different in the big data world. Operational systems, and the kind of analytics that Mr. Khanna describes, imply a restrictive approach to data quality. That’s not a bad thing: in those types of applications the customer’s address is either right or wrong, and there’s a penalty to pay for sending his or her package to the wrong one.

However, data scientists work in a different world, one in which an attempt to correct data as it flows into a data lake might reduce its analytical utility. For example, consider a data store of reactions to a product that includes tweets, blog posts, product reviews, and the like. An operational perspective might lead us to restrict incoming data to verified purchasers, but a data scientist tracking product buzz might be just as interested in non-purchasers, since reactions from the entire internet echo chamber are part of the signal.

In the case of data science, Mr. Smith says, the trick is not to deliver data quality by excluding data but to provide metadata that gives data scientists context. In his words, “We can feed our profiling or rule-based information in so that if you look across the metadata landscape or the business semantic layer, you can get some insight into the quality measure as well.” In effect, data quality is customized for each analysis, with different outliers included depending on the needs of each hypothesis.
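
One way to read that advice: instead of filtering records at ingestion, attach profiling results as metadata and let each analysis choose its own cut. The record shape, rule names, and product below are assumptions for illustration, not Trillium’s implementation.

```python
# Attach quality metadata at ingestion instead of dropping records, so each
# analysis can apply its own quality cut. Fields and rules are illustrative.
def profile(record: dict) -> dict:
    record["quality"] = {
        "verified_purchaser": bool(record.get("order_id")),
        "has_product_mention": "acme widget" in record.get("text", "").lower(),
        "source": record.get("source", "unknown"),
    }
    return record

reactions = [
    {"source": "review", "order_id": "A12", "text": "My Acme Widget broke in a week."},
    {"source": "tweet",  "order_id": None,  "text": "Everyone is talking about the Acme Widget!"},
]
lake = [profile(r) for r in reactions]

# Analysis 1: warranty-claim risk -- verified purchasers only.
claims = [r for r in lake if r["quality"]["verified_purchaser"]]

# Analysis 2: product buzz -- every mention, purchaser or not.
buzz = [r for r in lake if r["quality"]["has_product_mention"]]

print(len(claims), len(buzz))   # 1 2
```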


Although much of our language around data quality is the same as it ever was, regulatory pressures, NoSQL opportunities, and the needs of data scientists are generating new approaches that are truly evolving the discipline.
