One More Species of Overloaded Data

A while back I wrote the post A Field Guide to Overloaded Data, which publicized the work of Duane Hufford, who examined different types of overloaded data during the 1990s. Over the years his classifications of overloaded data effectively categorized data anomalies I encountered in the wild.

That is until recently, when a colleague encountered an array of file names in a single SQL Server column. This instance didn’t fit into the three categories detailed in the earlier post, so I’m presenting it here. I’ve also added it to the the original post.

Bundling

Definition: Bundled data is a situation where a single column in a table contains an array of values.

Example from the wild: We found the list of file names in a single varchar column in a system that tracks file transfers. A quick web search for “sql column contains an array” returns many examples and how-to instructions. Arrays are a valid type in Google Big Query and Spark SQL. This construction will likely become more common as more companies adopt modern database technology and associated methods.

Impacts:

  • Bundled data requires programming to unpack. For example, in our case we wanted to report on individual file status in Tableau. Tableau didn’t have processing tools to easily separate the bundled file names, so we had to request an extract table that divided the values into separate rows.
  • If bundling becomes widespread in operational systems, reporting from these databases will require extract operations to divide bundled columns into separate rows, increasing development costs and maintenance overhead unless those unbundling operations are integrated into database management systems that support array datatypes.

Eradication and prevention: Organizations can reduce the impact of bundled columns by including stakeholders who require relational data in system design. Rather than leaving interfacing and reporting requirements to the end, and tacking on reporting functionality after completion of core components, thinking early about outbound data flows and data presentation can ensure techniques like bundling are limited to areas where they will have less impact.

Leave a Reply

Your email address will not be published. Required fields are marked *