Data scientists spend most of their time doing data integration rather than gathering insights. In my interview with data scientist Yan Li, she said that data collection and prep takes at least 70% of her time. Obviously, there’s a lot of integration work to do on data that’s new to analytics efforts, but not every analysis uses brand new data. Organizations can improve analytics efficiency by staging commonly used data, already integrated, for data science.
For years, large organizations have supported data warehouses, but prevailing data warehousing practices often fail when faced with “big data” volume and velocity. Still, warehousing teams in large organizations can pre-integrate frequently used internal data. Examples include reference and master data, production and sales records, and so on.
However, traditional Business Intelligence/Data Warehousing (BI/DW) techniques are often ill-suited for data science. Established data layer teams can serve data scientists well by adhering to two key principles:
Business Intelligence and Data Science are different. “The purpose of Business Intelligence is to support better business decision making.” On the other hand, “data science is the extraction of knowledge from data.”* BI works in the realm of the known unknown: values of understood key performance measures and business metrics. Data science, by contrast, works with the unknown unknown, using the scientific method to first intuit testable hypotheses, then test them against data to gather new insights.
Warehoused data tends to be structured and sometimes modified to support BI. Forrester describes a three-tier “layer cake” consisting of data load, data warehouse, and business intelligence reporting, but the data load complex (also called ETL) can consist of many layers itself. ETL often omits fields deemed irrelevant to measures of interest. Values might be corrected or even altered for consistency with master data, reference data, or data from other sources. Financial records might be consolidated, summarized, or “restated” to reflect current conditions.
Such processing is reasonable in BI because the target is known: data in the warehouse must serve needed business measures. However, the nature of data science means that its needs are unpredictable. In general, those integrating data for data scientists should expose as much information as possible by:
- Translating to standard values when possible, while also exposing original values even if inconsistent with reference and master data. For example, if a customer survey allows entry of invalid product codes that could reasonably be mapped to valid ones, integrated data should expose both to data scientists.
- Providing full detail: If the ETL stream summarizes sales into purchases by store by day, then find a way to store each individual transaction. This type of detail storage is a great first business case for organizations wanting to get on board with Big Data technologies.
- Retaining original semantics: In cases where financial transactions are reprocessed to reflect new legal, regulatory, or other conditions, retain the originals as well as the reprocessed versions.
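The first principle above, exposing both original and standardized values, can be sketched in a few lines. This is a hypothetical illustration, not any particular ETL tool’s API: the field names, the `REFERENCE_CODES` mapping, and the survey record are all invented for the example.

```python
# Hypothetical reference/master data: raw survey codes -> standard product codes.
REFERENCE_CODES = {"WIDGET-1": "W001", "WIDGET-2": "W002"}

def integrate_survey_row(raw_row: dict) -> dict:
    """Standardize the product code while retaining the original value.

    Rather than overwriting the entered code, emit both the raw value and
    the mapped standard value so data scientists can choose which to use.
    """
    raw_code = raw_row["product_code"]
    return {
        **raw_row,
        "product_code_raw": raw_code,                        # original, as entered
        "product_code_std": REFERENCE_CODES.get(raw_code),   # None if unmappable
    }

row = integrate_survey_row({"product_code": "WIDGET-1", "rating": 4})
```

The design choice is the point: a BI-style load would silently correct (or drop) the invalid code, while a data-science-friendly load keeps both columns so that nothing entered at the source is lost.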
Show Source Data
A standard technique in ETL is to land files directly from data sources to “staging areas”. This has a number of advantages, including rapid load of incoming data and ability to rerun failed load processes from original source data.
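A minimal sketch of that landing step, assuming file-based extracts (the function name and timestamped naming convention are illustrative, not a standard):

```python
import shutil
import time
from pathlib import Path

def land_to_staging(source_file: Path, staging_dir: Path) -> Path:
    """Copy a source extract into the staging area unchanged.

    The file is kept byte-for-byte identical and tagged with its arrival
    time, so failed downstream loads can be rerun from the original data.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d%H%M%S")
    target = staging_dir / f"{stamp}_{source_file.name}"
    shutil.copy2(source_file, target)  # preserves contents and file metadata
    return target
```

Because nothing is transformed at this step, the landed files are exactly the raw source data that the next section argues data scientists should be allowed to see.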
Staging areas are rarely designed for user access, but they can be a gold mine for data scientists. Providing access to staging is an easy first step for warehousing teams seeking to serve data science, one that can buy time for the integration changes noted above.
It’s true that exposing raw source data runs counter to the objective of helping data scientists capitalize on pre-integrated information. Still, providing staging files shows that the warehousing team is transparent and helpful to data science, opening up every possible opportunity for improving the organization’s speed to insight.