Thoughts on Healthcare Data Quality

The well-publicized problems with Healthcare.gov are disturbing, especially when we remember they might result in many continuing without health insurance. But it seemed a step in the right direction when a recent news report differentiated between “front end” and “back end” problems. The back end problems were data issues, like a married applicant with two kids being sent to an insurer’s systems as a man with three wives.

Coincidentally, I recently responded to a questionnaire about health care data. I’ve paraphrased the questions and my responses below. Perhaps the views of someone who’s spent a lot of time in the health care engine room might provide some useful perspective.

In reading through the questions and my responses, picture a care management system that pulls patient, provider, and visit data from several different insurance companies and sends letters with helpful advice to patients with certain conditions.

1) What kinds of problems exist in collecting patient-centric data? What data quality issues are present?

In context, this question asked how well data from the patient-provider encounter (health care jargon for doctor visit) was recorded, including patient enrollment, data from the encounter, the diagnosis, and even matching patient encounter records to the right providers.

As a “back end” developer, most often in an insurance company, my job is to integrate data from several different sources into a single database. Based on my experience, I’d characterize 5 to 10% of the records as incomplete or inconsistent in some way. A smaller percentage of the records have errors significant to the mission of whatever system I’m involved in building.

Quality problems might include:

  • Missing data for a given patient
  • Data entered incorrectly
  • Duplicate patient data due to key data entered differently at separate care locations.  Redundant data could also be entered multiple times at the same location, if data validation controls are not in place

The result of the latter might be multiple records for the same person or data for different people being “matched” into the same record incorrectly.
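
As a rough illustration, an integration process might flag missing data and possible duplicates along the lines of the Python sketch below. The field names (member_id, last_name, first_name, birth_date) and the matching key are assumptions made up for this example, not taken from any real insurer feed, and the matching logic is deliberately naive.

```python
from collections import defaultdict

# Hypothetical field names; real feeds will differ by insurer.
REQUIRED_FIELDS = ("member_id", "last_name", "first_name", "birth_date")


def missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def match_key(record):
    """Build a crude matching key from name and birth date.

    Normalizing case and whitespace catches only the simplest entry
    differences. This is an assumption for illustration, not a real
    patient-matching algorithm.
    """
    return (
        str(record.get("last_name") or "").strip().lower(),
        str(record.get("first_name") or "").strip().lower(),
        str(record.get("birth_date") or "").strip(),
    )


def possible_duplicates(records):
    """Group complete records by match key; return groups with more than one."""
    groups = defaultdict(list)
    for record in records:
        if not missing_fields(record):  # incomplete records can't be keyed reliably
            groups[match_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Made-up sample records: two spellings of one person, one incomplete record.
records = [
    {"member_id": "A1", "last_name": "Smith", "first_name": "Pat", "birth_date": "1970-01-02"},
    {"member_id": "B7", "last_name": "SMITH ", "first_name": "pat", "birth_date": "1970-01-02"},
    {"member_id": "C3", "last_name": "Jones", "first_name": "", "birth_date": "1985-06-30"},
]

incomplete = [r["member_id"] for r in records if missing_fields(r)]
duplicates = [[r["member_id"] for r in group] for group in possible_duplicates(records)]
print(incomplete)  # ['C3']
print(duplicates)  # [['A1', 'B7']]
```

Note that a key this loose cuts both ways: it finds the same person entered two ways, but it would also group two different people who happen to share a name and birth date, which is exactly the incorrect "match" described above. For that reason the sketch only reports candidate groups and leaves the merge decision to a person or a stricter rule.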

These data quality issues cause errors when systems use patient data to drive transaction processing. Using letters related to a specific condition – say, diabetes – as an example:

  • Missing data might cause a patient with diabetes to be omitted.
  • Invalid duplication of a patient record might result in duplicate letters for the same person.
  • Merging patient records incorrectly might cause someone without the given condition to receive a letter.

2) What steps are taken to improve upon the data or the collection methods?

From the perspective of a data integration professional without any influence on the point of data entry, the options are these:

  1. Detect and exclude invalid data,
  2. Include all data regardless of validity, and
  3. Interpolate or estimate correct values.

The first two options are the most common. In most business applications the third option is frowned upon because it changes source data without any way of knowing the actual value.
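
To make the trade-off concrete, here is a minimal Python sketch of the three options applied to one field. Everything in it is hypothetical: the Policy names, the age validity rule (0 to 120), and the use of a median as the estimate are assumptions chosen for the example, not a recommendation.

```python
from enum import Enum
from statistics import median


class Policy(Enum):
    EXCLUDE = "exclude"          # option 1: drop invalid records
    INCLUDE_ALL = "include_all"  # option 2: pass everything through
    ESTIMATE = "estimate"        # option 3: replace invalid values


def is_valid_age(age):
    """Hypothetical validity rule: age is a number from 0 to 120."""
    return isinstance(age, (int, float)) and 0 <= age <= 120


def apply_policy(records, policy):
    """Return the records a target system would receive under a policy."""
    if policy is Policy.INCLUDE_ALL:
        return [dict(r) for r in records]

    valid = [dict(r) for r in records if is_valid_age(r["age"])]
    if policy is Policy.EXCLUDE:
        return valid

    # ESTIMATE: substitute the median of the valid ages and flag the record,
    # so downstream users can tell an estimate from a source value.
    # Assumes at least one valid age exists; median() raises otherwise.
    estimate = median(r["age"] for r in valid)
    result = []
    for r in records:
        row = dict(r)
        row["age_estimated"] = not is_valid_age(r["age"])
        if row["age_estimated"]:
            row["age"] = estimate
        result.append(row)
    return result


# Made-up sample data: record 2 has an impossible age.
records = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 3, "age": 52}]

excluded = apply_policy(records, Policy.EXCLUDE)
estimated = apply_policy(records, Policy.ESTIMATE)
print([r["id"] for r in excluded])  # [1, 3]
print(estimated[1]["age"])          # 43.0
```

The estimate for record 2 looks plausible, which is the problem: without the age_estimated flag, nothing downstream would reveal that 43.0 never came from the source system.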

3) What are the negative impacts of dealing with dirty-data?

A business process cannot be 100% accurate if its source data isn’t 100% accurate. Each data quality measure introduced to the data integration process increases data integration cost and increases data transmission lag to the target system.

4) What efforts are in place to reduce the negative impacts? What kinds of processing or architecture helped make the difference?

Business process design and system improvements that introduce measures to correct data at the point of entry are the best way to ensure quality data.

I believe the health care industry has a lot of potential improvement here. Why do I have to fill out a 10-page form for error-prone manual entry at the doctor’s office when the insurance company they share data with has all of my detailed information available for download? Of course those kinds of solutions quickly raise Big Brother type questions, but for better or worse any insured person’s health care details are replicated across many provider and insurer systems. Our records are already at risk without addition of this one very useful interface.

After entry, the data passes through the labyrinth of back end systems of large providers and insurance companies. Different systems, or even different locations, over time evolve different business rules and data definitions. It is critical that those developing integration processes understand the business aspects of all different sources and conform them to a common standard for proper interpretation for the particular target business process. Projects that skimp on business analysis and source data research pay the price as many of their implicit assumptions turn out to be wrong. Maybe that’s what’s happened at some of the back-end systems served by Healthcare.gov.

It’s possible to correct data after the fact. For example, data analysts on one project for a large health insurer tracked data errors on reports back to the incorrect data in source systems, and recommended the corrections needed. The 12-month effort resulted in significant data quality improvement after diligent work by the 10 or so skilled data analysts on the team.

6 thoughts on “Thoughts on Healthcare Data Quality”

  1. Bob

    I received this offline comment to this post: “My bet is that the project was recorded as “green” at every stage in the development cycle. In line with your position, if some of the tasks had been appropriately “yellow-lighted”, the people in charge (whoever they were) could have made better decisions.” Thanks for your message – as a frequent yellow-lighter I couldn’t agree more!

  2. Bob

    Responding to this post, a graduate student asked me questions that I thought might be of interest to others involved with healthcare data. Here are the questions and my responses:

    1. What factors in your opinion define quality in healthcare clinical data?
    Assumptions: Clinical data is biometric data gathered by medical professionals or by machines (e.g. thermometer, X-ray, heart monitor).
    Response: Criteria for measuring data quality are well documented (one graphical example here: …). To me, the primary factor that defines data quality is the business process used to collect it – data quality is directly proportional to the quality of the collecting business process. Data quality must be an identified goal of the business process, and process design must feature methods of collecting and evaluating data quality.

    2. If there were a model to define quality in healthcare big data, what aspects should the model address?
    Assumptions: Big data is defined as representing a sufficient quantity of instances of a given business event – say, billions – such that it is impractical or even impossible to meaningfully process data about the events one-by-one.
    Response: Due to the scale of big data, its users lose ability to evaluate it by some of the traditional quality criteria. For example, it is impossible at big data scale to evaluate accuracy of records with valid values that are independent of each other. Therefore, a quality model for big data must focus on validity rather than accuracy, and should exclude instances with out-of-bounds or inconsistent values.
    While traditional systems prevent entry of invalid data into the database, the typical model for big data is to load all data, valid or invalid, and to apply quality processes upon extract to applications that use the data.
    Finally, big data imposes serious security challenges, especially in the health care arena. Depending on the application, one should anonymize or eliminate personally identifying data. If that can’t be done, then security and access controls should be carefully designed, applied, and monitored.

    3. How in your opinion could one enhance a quality model designed for traditional database (small data) to encompass big data?
    Assumptions: See big data assumption for 2, above.
    Response: From the perspective of data quality, their very scale hamstrings big data systems. One does not enhance a traditional quality model to work with big data, but rather pares down quality measures because they are impractical at big data scale. In addition, the two approaches are incompatible because traditional database processing involves keeping bad data out, while big data systems bring all data in and require a caveat emptor approach on the part of applications accessing the data.

    4. What did big data introduce to healthcare organizations?
    I have not personally seen a big data application in a health care organization. While I have worked with databases with billions of claims, they were operational systems operating on a traditional DBMS.

    5. What tools could be used to analyze, study, or observe big data related to healthcare?
    Big data analysis tools are emerging (e.g. Datameer) that are perhaps a bit more accessible than the analysis languages most often used (R, SAS, etc.). However, the key to being able to use big data in analysis applications is design of the big data store, as described in this recent post about Cassandra:

    6. What organizations could offer useful resources regarding healthcare and big data?
    The best sources of information on big data utilization are big data pioneers, companies who have made big data core to their business processes. A second valuable source, albeit not totally objective, is vendors of big data tools and consultants with experience.
    Big data is still new, and working with big data often means stringing together many immature products at version 0.x or 1.x. It also imposes a risk in procuring services due to the many firms looking for their first foothold in the field.

    7. In your country, or to your knowledge, what organization is considered a good example of handling big data in general, whose approach we could study and adapt to healthcare big data?
    Netflix and Google stand out as examples.

    8. If a quality model is to be certified, what organization or individual could review it?
    I am not aware of an organization that certifies healthcare data quality processes. In researching this you might start with HIMSS:

    9. To study big data and how to manage its quality, what is the advised approach in your opinion?
    I recommend learning from those who are successfully applying big data techniques as a core of their business, and research organizations who are pioneering big data techniques. Many provide detailed public resources, for example:

  3. Afnan

    Thank you, Mr. Lambert.
    I was a bit surprised that big data are being processed on traditional DBMS.

    What about quality performance indicators (QPIs)? Are they being used in hospitals to measure the hospital’s data quality? Have they improved the data quality, and if so by what percent? Should each hospital have its own indicators, or should they be standardized (unified) as one?


  4. Bob

    Thanks for your reply. Yes, it is interesting that in this case SQL Server was used as a source for an analytics process against a very large membership/claims database. The capability was developed about 10 years ago before big data techniques emerged and is now finely tuned to handle tens of millions of members and about 100 times the number of claims.
    I would like to be able to answer your question about QPIs, but I don’t have experience in the clinical area. My experience in healthcare has focused on membership and claims at insurance companies and, many years ago, on aggregating clinical data in a data warehouse. That said, I would think that increased state management of national healthcare systems would promote the emergence of standard QPIs.
