What is Data Quality anyway? If you are a data professional, I’m sure someone from outside our field has asked you that question, and if you’re like me you’ve fallen into the trap of answering in data-speak.
I’d guess that, for my listener, the experience was like hearing a customer service rep who has just turned down a simple request justify the refusal by describing byzantine company policies.
There’s a ton of great writing available on data quality, and I in no way mean to disparage it or its value in the field. But in that writing I’ve yet to find a concise and compelling definition that’s useful to non-data professionals. I’ll review one or two prevailing definitions and then offer one that could help us unlock real data quality improvements.
The first non-ad result of my Google search, an excellent data quality overview, provides this:
“Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context. The quality of data is determined by factors such as accuracy, completeness, reliability, relevance and how up to date it is. As data has become more intricately linked with the operations of organizations, the emphasis on data quality has gained greater attention.”
That’s a solid and useful definition for us insiders, but hardly an effective opening to a data quality elevator speech. By covering all the bases and emphasizing that data quality is a “perception or assessment,” it unintentionally sounds evasive.
Malcolm Chisholm offers an excellent, tightly reasoned blog post that starts with Joseph M. Juran’s take: “… data to be of high quality if they are fit for their intended uses in operations, decision-making and planning.”
Mr. Chisholm follows with a definition from Larry English that focuses on meeting user expectations but omits accuracy. The blog post then proceeds to an enlightening exploration of the nature of data quality, concluding that “what we’re seeing with these definitions is that they are somewhat difficult to understand and to put to use.” Agreed.
In a world where we’re crusading for more commitment to data quality improvement, and building business cases for real dollars to support increases in data maturity, we need to be direct, concise, and unequivocal. Here’s my suggestion for a data quality definition for non-data people:
Quality data is complete, correct, and timely for its intended purpose.
Using this direct statement, the data professional takes responsibility for quality, and asserts that a partnership with the stakeholder is needed to define “intended purpose”. Moreover, it uses the rule of three to capture the stakeholders’ interest. Now that you’ve captured their attention, you can drill down to the other dimensions of data quality that they need to know about, in groups of three of course.
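For the data professionals in the room, here is one way the definition might be turned into something checkable. This is only a sketch under my own assumptions: the `Purpose` class, the `assess` function, and the field names are hypothetical, and the point is simply that "intended purpose" is an input the stakeholder supplies, while complete, correct, and timely are the three answers the data team reports back.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Purpose:
    """What a stakeholder needs from a record (hypothetical structure)."""
    required_fields: tuple[str, ...]      # fields that must be populated
    is_correct: Callable[[dict], bool]    # stakeholder-defined correctness rule
    max_age: timedelta                    # how stale the data may be


def assess(record: dict, loaded_at: datetime, now: datetime,
           purpose: Purpose) -> dict[str, bool]:
    """Report the three dimensions of the definition for one record."""
    return {
        # Complete: every required field has a non-empty value.
        "complete": all(record.get(f) not in (None, "")
                        for f in purpose.required_fields),
        # Correct: the record passes the rule agreed with the stakeholder.
        "correct": purpose.is_correct(record),
        # Timely: the record is no older than the purpose allows.
        "timely": now - loaded_at <= purpose.max_age,
    }
```

The same record can pass for one purpose and fail for another, which is why the conversation with the stakeholder has to come first.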
If you are like most of us, you are in no position to deliver perfect data quality. Most of us rely on source data that’s incomplete, incorrect, and late, and we’re often held responsible for poor data quality that we believe we have no power to correct. Perhaps that’s the reason for so many conditional definitions. It’s time to flip that script.
If you as the data professional acknowledge that data quality as we’ve defined it is concrete and achievable, then you can map out steps toward defining your situation and how to improve it. If prospect emails in your marketing system often have the value firstname.lastname@example.org, then can you work with the web team to add a validation? If your feed from the purchasing system always arrives late, then can your team build its own extract?
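To make the email example concrete, here is a minimal sketch of the kind of validation you might propose to the web team. The function name, the placeholder list, and the deliberately loose shape check are all my assumptions, not a description of any particular system; the reserved domains are the ones RFC 2606 sets aside for documentation and testing.

```python
import re

# Hypothetical placeholder addresses; the real list would come from
# profiling your own marketing data.
PLACEHOLDER_EMAILS = {"firstname.lastname@example.org"}

# Domains reserved for documentation and testing (RFC 2606).
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}

# Deliberately loose shape check: something@something.something
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_plausible_prospect_email(email: str) -> bool:
    """Return False for malformed, placeholder, or reserved-domain addresses."""
    candidate = email.strip().lower()
    if not EMAIL_SHAPE.match(candidate):
        return False
    if candidate in PLACEHOLDER_EMAILS:
        return False
    domain = candidate.rsplit("@", 1)[1]
    return domain not in RESERVED_DOMAINS
```

A check like this does not prove an address is real, but it stops the most common junk values at the point of entry, which is usually cheaper than cleaning them up downstream.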
Agile methods lend themselves to this kind of steady evolution. On one team, our data on patient-to-physician relationships was notoriously unreliable. Over 18 months of one-month iterations, we made small improvements: working with data source teams to make them aware of our challenges, improving our data extraction processes, blending in data from alternative sources, and training our users on how and to whom to report issues. While none of the improvements was particularly significant in itself, over time the quality of our patient-to-physician data improved substantially.
In such cases, partnership with stakeholders is key. Defining quality “for its intended purpose” implies that stakeholders communicate that “purpose”, or goal, in detail, and provide priorities for quality improvements. By owning the goal of data quality improvements, stakeholders become advocates for funding them. Suddenly, instead of taking the blame for poor quality, the data team shares responsibility with stakeholders for improving it.
Data quality has long been an elusive target that’s difficult to define. Maybe progress will require cutting through that complexity and, paradoxically, drawing stakeholders in by unequivocally taking responsibility for poor data quality.