Beware the devils in the details of data integration

Excerpt from Illusions, Allusions – Let’s Get Real about Database Design, InfoManagement Direct, October 4, 2002

Excerpt from "Illusions, Allusions – Let’s Get Real about Database Design", October 4, 2002

Much of today’s IT application development – custom or off-the-shelf – involves integrating data from legacy systems, third- party software products and external data sources such as demographics or mail lists.  More often than not, data integration is unexpectedly complex, either due to data quality issues or the nature of the data integration itself.

Here are some typical examples:

  • One ERP package uses the same table for both Sales Quotes and Sales Orders. Columns that mean one thing for Quotes mean quite something else Orders. One team extracting data from this ERP package continually mixed up, for example, Date Received on the Quote with Date Prepared for the Order. The designer who blindly copies data from input systems can propagate these issues. In this case, the correct solution is to extract the two documents into separate tables in the destination system, making each column describe either a quote or an order, not both.
  • Marketing databases often store data purchased from several third parties on the same set of customers. These sources usually include overlapping columns with different values. For the same customer, different sources might store different values for the person’s address, credit scores or even name. It is sometimes important to preserve all of the columns from all of the sources and to maintain the information on where the data came from as well as what its value was. This can result in a messy database design, where columns again carry dual meaning: their value and their source.
  • Codes from legacy databases tend to evolve into complex forms, embedding more and more information into a single field. This is perhaps a natural reaction to the slow evolution of the system relative to changes in business, as users shoehorn information into the system that it was not designed to store. For instance, in a legacy system a one- character code might classify customers by “customer category,” with values 1 for small business, 2 for mid-size, and 3 for Fortune 5000. Users might add codes 4, 5 and 6 for corresponding values for aerospace customers, then 7 for federal government, and so on. The database designer must know the data well to extract each embedded concept into a different destination column.

When data integration is part of a project, expect complexity and leave room in interface development estimates for devils in the details of source system analysis and integration design.

Leave a Reply

Your email address will not be published. Required fields are marked *