Data quality improvements follow specific, clear leadership from the top. Project leaders count data quality among project goals when senior management encourages them to do so with unequivocal incentives, a common business vocabulary, shared understanding of data quality principles, and general agreement on the objects of interest to the business and their key characteristics.
Poor data quality costs businesses about “$15 million per year in losses,” according to Gartner. As Tendü Yoğurtçu puts it, “artificial intelligence (AI) and machine learning algorithms are only as effective as the data they use.” Data scientists understand the difficulties well: they spend over 70% of their time on data preparation.
Recent studies report that data entry typos are the largest source of poor data quality (here and here). My experience says otherwise. From what I’ve seen, operational data is generally good, and data errors only appear when data changes context. In this post I’ll detail why data quality is management’s responsibility, and why data quality will remain poor until leadership makes it a priority.
Operational data is generally good
Like most people, I purchase things online, use online banking, and do a lot of other business over the internet. At the sites I use regularly, the data about me is correct: I can make purchases online and the site charges the right credit card, the item arrives at my doorstep, and notifications land in my email inbox. My online accounts at banks, credit card companies, insurance companies, the host of this website, and so on are likewise correct.
Similarly, in my work as a consulting data analyst across various industries, the individual records I’ve reviewed over the years have been generally accurate: tobacco shipments, financial and insurance transactions, railcar maintenance records, and many others.
Of course I have seen exceptions. In one case, a system supporting real estate transactions collected, but didn’t validate, transaction codes that were useful to the back office but not to those working directly with customers. As you would expect, almost all of the resulting records carried the default value. Another common example: how many websites require entry of an email address for access to some feature or information, but then allow entry of abc@def.com?
Still, operational data is always good enough for the transaction it supports. The real estate transaction completed every time with the default values. Millions of people access web resources every day with abc@def.com.
Data errors appear when data changes context
In the real estate example, the default transaction codes were irrelevant to operational participants, but accurate values would have saved the people reporting revenue to state regulators many hours of work. The email address abc@def.com lets the website user proceed to the next step, but it is useless to the marketing team assembling a prospect list.
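To make that concrete, here’s a minimal sketch in Python. The format check and the blocklist of placeholder domains are my own stand-ins, not anything a particular site actually runs; the point is that a well-formed placeholder sails through the operational check but fails the downstream one.

```python
import re

# Minimal sketch: a hypothetical signup-style check versus a hypothetical
# downstream check. The blocklist of placeholder domains is assumed, not real.
EMAIL_FORMAT = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PLACEHOLDER_DOMAINS = {"def.com", "example.com", "test.com"}  # assumed list

def valid_for_signup(email: str) -> bool:
    """The operational check: is the address well formed?"""
    return bool(EMAIL_FORMAT.match(email))

def useful_for_marketing(email: str) -> bool:
    """The downstream check: well formed and not an obvious placeholder."""
    domain = email.rsplit("@", 1)[-1].lower()
    return valid_for_signup(email) and domain not in PLACEHOLDER_DOMAINS

print(valid_for_signup("abc@def.com"))      # True: good enough to proceed
print(useful_for_marketing("abc@def.com"))  # False: useless as a prospect
```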
But context challenges arise even more often when data is correct at its operational source. I once worked with a database column that could hold either an employee’s name or an animal’s age. This came about through a merger, where the integration team merged data from the acquired outfit into the acquiring company’s database. Rather than create new data structures for the new concepts, the team squeezed the acquired agricultural data into existing tables.
While the values in the column were operationally accurate, overloading two concepts into one column added IT cost, because every consumer needed extra code to parse the field, and business cost, because of the risk of misinterpretation.
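Here’s a rough sketch of what that extra code looks like; the disambiguation rule and sample values are hypothetical, but typical of what consumers of such a column end up writing.

```python
# Minimal sketch (hypothetical values and rule) of the extra parsing the
# overloaded column forces on every consumer: the same field sometimes holds
# an employee's name and sometimes an animal's age.
def interpret(value: str):
    """Guess which business concept a raw value represents."""
    if value.strip().isdigit():
        return ("animal_age", int(value))
    return ("employee_name", value.strip())

print(interpret("Pat Smith"))  # ('employee_name', 'Pat Smith')
print(interpret("7"))          # ('animal_age', 7)
# Any age keyed in as a word, or a name that happens to be numeric, is
# silently assigned to the wrong concept: the business cost described above.
print(interpret("seven"))      # ('employee_name', 'seven'), i.e. misread
```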
In both of these examples, data that works perfectly in the narrower operational context is unsuitable or costly in the larger context, and the quality deficiencies appear only when data changes context.
Data quality is management’s responsibility
In one of the studies referenced above, 89% of survey participants cited “human error” or “too many data sources” as the “biggest contributors to lack of data quality in 2018.” To me, cases like the two above contradict that finding.
For the code value in the real estate system, why did the custom-built system supply a default code at all? And if the system’s outputs were so important to the finance team, why weren’t appropriate cross validations implemented? Furthermore, bonuses for the agents operating the system were based on volume processed, without regard to data quality.
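For illustration only, here’s roughly what such a cross validation might look like. The code list, default value, and field name are all hypothetical.

```python
# Minimal sketch of the missing cross validation. The code list, default
# value, and field name are hypothetical.
VALID_REVENUE_CODES = {"R100", "R200", "R300"}  # assumed approved codes
DEFAULT_CODE = "R000"                           # the system-supplied default

def validate_transaction(record: dict) -> list[str]:
    """Return data quality errors; an empty list means the record is clean."""
    errors = []
    code = record.get("revenue_code", DEFAULT_CODE)
    if code == DEFAULT_CODE:
        errors.append("revenue_code left at default; a real code is required")
    elif code not in VALID_REVENUE_CODES:
        errors.append(f"revenue_code {code!r} is not an approved code")
    return errors

print(validate_transaction({"revenue_code": "R000"}))  # flags the default
print(validate_transaction({"revenue_code": "R200"}))  # [] (clean)
```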
In the case of the overloaded column in the merged database, why didn’t the integration team design the database to maintain a one-to-one relationship between data field and business concept? I understand that integration of acquired entities is often far too much work in too little time, but stuffing data into a table where it doesn’t fit isn’t always the fastest option. At a minimum, it might have been just as fast to copy in the acquired company’s tables wholesale, temporarily integrate the data in views, and iteratively migrate to a more permanent solution for this core business application after the merger: not optimal, but quick, and with at least a path to a better design.
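Here’s a minimal sketch of that interim approach using SQLite, with hypothetical table and column names. Copying the acquired table in wholesale and integrating it in a view keeps one column per business concept while buying time for a proper redesign.

```python
import sqlite3

# Minimal sketch of the interim approach, using SQLite and hypothetical table
# names: copy the acquired table in as-is, then integrate it in a view so
# each column still maps to exactly one business concept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Acquiring company's existing table.
    CREATE TABLE employee (id INTEGER PRIMARY KEY, employee_name TEXT);

    -- Acquired company's table, copied in wholesale rather than squeezed
    -- into the employee table.
    CREATE TABLE acq_animal (id INTEGER PRIMARY KEY, animal_age INTEGER);

    -- Temporary integration view: one column per concept, no overloading,
    -- and a clear path to a permanent design after the merger settles.
    CREATE VIEW people_and_animals AS
        SELECT id, employee_name, NULL AS animal_age FROM employee
        UNION ALL
        SELECT id, NULL AS employee_name, animal_age FROM acq_animal;
""")

conn.execute("INSERT INTO employee VALUES (1, 'Pat Smith')")
conn.execute("INSERT INTO acq_animal VALUES (1, 7)")
for row in conn.execute("SELECT * FROM people_and_animals"):
    print(row)  # (1, 'Pat Smith', None) and (1, None, 7)
```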
At first glance these two cases seem to result from “human error” and “too many data sources”, but not if you seek root causes. The real estate system suffered not from bad data entry, but from inadequate data entry validation. And the overloaded column in the merged database wasn’t caused by “too many data sources”; the root cause was poor integration of the two sources involved.
But let’s pause before we blame the requirements analysts and project managers. Those project teams were likely balancing scope, schedule, and quality to the best of their abilities, and data quality simply wasn’t a priority. Data quality is management’s responsibility, but the managers who hold that responsibility must be able to drive priorities.
Data quality remains poor until leadership makes it a priority
On the corporate website of one Fortune 500 company, a site search for “data+quality” returned three hits, while “compliance” returned 1,241 and “information+security” returned 52 (results included job postings). On another company’s site there was exactly one hit for data quality, 158 for compliance, and 184 for information security. Data quality simply isn’t on the map in the executive suites of many organizations, and as a result people in those organizations have no incentive to value it.
Effective data quality leadership is specific, clear, and unequivocal, just as policies for security, regulatory compliance, inclusion, and so on are. A data quality standard must include these key elements:
- A standard business vocabulary that defines commonly used terms for the organization, as described here.
- A statement of key data quality principles: what data quality is, why the organization cares about it, and how each associate helps ensure it.
- Identification of the primary “things”, or business objects, the organization manages, and definition of characteristics of those things that the organization must know. For example, Amazon manages products for sale, sellers, customers, employees, warehouses, and so on. It keeps track of product prices, seller products, customer addresses, etc.
- A mechanism for ensuring that applications and databases are consistent with the business vocabulary and organization-level objects and characteristics.
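As a sketch of what that last element could look like in practice, here’s a toy consistency check. The glossary entries and schema contents are hypothetical, and a real version would read the schema from the database catalog.

```python
# Toy sketch of such a mechanism: compare the organization-level vocabulary
# against what a database actually implements. Glossary entries and schema
# contents are hypothetical; in practice the schema side would be read from
# the database catalog.
GLOSSARY = {
    "customer": {"customer_id", "name", "address", "email"},
    "product": {"product_id", "name", "price"},
}

ACTUAL_SCHEMA = {
    "customer": {"customer_id", "name", "addr", "email"},
    "product": {"product_id", "name", "price", "colour_code"},
}

def vocabulary_gaps(glossary: dict, schema: dict) -> list[str]:
    """List attributes the vocabulary defines but the schema lacks, and
    columns the schema carries that the vocabulary never defines."""
    findings = []
    for obj, attrs in glossary.items():
        columns = schema.get(obj, set())
        for missing in sorted(attrs - columns):
            findings.append(f"{obj}.{missing}: in vocabulary, absent from schema")
        for extra in sorted(columns - attrs):
            findings.append(f"{obj}.{extra}: in schema, undefined in vocabulary")
    return findings

for finding in vocabulary_gaps(GLOSSARY, ACTUAL_SCHEMA):
    print(finding)
```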
One might object that executives shouldn’t be mired in the detail of business glossaries, conceptual data models, and governance structures. But what could be more appropriate than an executive defining exactly what an organization manages? Moreover, as a matter of setting data quality as an organizational priority, less is more. A relatively concise vocabulary of 50 key terms, 10 key business objects with 10 attributes each, and inclusion of data quality in annual performance goals would go far toward orienting an organization to improve its data quality.
With specific data quality guidance and prioritization, teams know that the revenue code in the real estate system must be valid, and that overloading nurse data with horse data violates data quality principles. Until leadership makes data quality that kind of priority, it will remain poor.