Information Management recently sent around their pick of the best IM blog articles of 2009. Among them was a reaction by Forrester’s James Kobielus to Bill Inmon’s “incineration of a straw man concept that he refers to as ‘virtual data warehousing (DW).’”
According to Mr. Inmon, virtual data warehousing reminds him of the carnival game Whac-A-Mole. He says “just when you think this incredibly inane idea has died and just when someone has delivered what should have been a deathly blow, out it pops again from another hole.” Mr. Inmon’s post gives only a very informal definition of virtual DW (remember, he says he’s whacked this mole before). As I interpret it, he’s talking about a system built after a decision to avoid the expense of building a data warehouse by simply having a query engine pull the data from wherever it lives. Mr. Inmon argues that a query accessing diverse databases would leave data integration to the user, and that there’s no guarantee two users would integrate the data the same way. He also cites the risk of inefficient queries against a virtual database. Finally, on the assumption that the query is trolling operations-focused databases, he says the source data would be “tuned” to operational rather than informational specifications for history retention and completeness.
Mr. Inmon’s ideas drew quick reaction from Mr. Kobielus and Neil Raden. Each in his own measured way stresses that integration can be compatible with distributed architectures, and that there is a DW solution architected for efficiency that includes effective data integration from diverse sources: the Federated Data Warehouse.
Experience and emerging tools reinforce their point. According to a colleague at CapTech, for smaller organizations “you can deal with this issue using a BI tool with a metadata layer that has joins predefined: the data integration is done by the BI metadata modeler.” Another CapTech’er cites mashups as a potential quick-and-dirty approach; see “7 Mashups Every Company Needs” for examples.
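To make the metadata-layer idea concrete, here is a minimal sketch in Python. It is not any particular BI tool's API: the `PREDEFINED_JOINS` dictionary, the `build_query` helper, and the `orders` and `customers` tables are all hypothetical names invented for illustration. The point is only that the join condition is defined once by the modeler, so every user who asks for orders by customer gets the same integration.

```python
import sqlite3

# Hypothetical metadata layer: the BI modeler defines each join exactly once.
PREDEFINED_JOINS = {
    ("orders", "customers"): "orders.customer_id = customers.id",
}

def build_query(fact, dim, columns):
    """Build a SELECT that uses only the join the modeler has approved."""
    condition = PREDEFINED_JOINS[(fact, dim)]  # KeyError if no join is defined
    return f"SELECT {', '.join(columns)} FROM {fact} JOIN {dim} ON {condition}"

# Illustrative data standing in for two source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

sql = build_query("orders", "customers", ["customers.name", "orders.amount"])
rows = conn.execute(sql + " ORDER BY orders.id").fetchall()
print(rows)
```

A user who requests a table pair with no predefined join gets an error rather than an ad hoc join, which is one simple way to avoid two users integrating the same data differently.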
A well-architected federated warehouse certainly can integrate and deliver data, maintain history, and enable a “single version of the truth”, perhaps in a more timely manner than a “traditional” DW architecture. On this question the devil is in the specifics of the situation. It is difficult to argue one way or another out of the context of a real project in a real organization.
However, even though it certainly has a technical side, data integration is first a business activity. Sometimes when we apply terms like “semantic rationalization” to software components, we in IT start believing we can actually build a machine that does what is needed to rationalize data semantics, like figuring out the corporate definition of a customer. Of course all we can do in IT is build the empty shell. The real work happens when business people from the departments whose data is being integrated sit down and decide how they are going to define “staff member”, “customer”, and so on. Only business professionals can say, for example, whether they want to include contractors in staffing reports or whether the term “customer” includes homebuyers under contract but not yet closed.
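A small sketch shows why the definition matters more than the machinery. The `Buyer` record and the `count_customers` function below are hypothetical, and the `include_unclosed` flag stands in for the business decision about homebuyers under contract. The code is the empty shell; the same data yields a different customer count depending on which answer the business gives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buyer:
    name: str
    under_contract: bool
    closed: bool

def count_customers(buyers, include_unclosed):
    """Count customers under a given business definition.

    include_unclosed is the business decision: do homebuyers who are
    under contract but have not yet closed count as customers?
    """
    return sum(
        1
        for b in buyers
        if b.closed or (include_unclosed and b.under_contract)
    )

# Illustrative data only.
buyers = [
    Buyer("Adams", under_contract=True, closed=True),
    Buyer("Baker", under_contract=True, closed=False),
    Buyer("Clark", under_contract=False, closed=False),
]

sales_view = count_customers(buyers, include_unclosed=True)
finance_view = count_customers(buyers, include_unclosed=False)
print(sales_view, finance_view)
```

Neither count is wrong in a technical sense. Until the departments agree on one definition, no integration tool can choose between them.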
Integration tools that support data warehouses, whether centralized or federated, are only as good as the business consensus behind them. The consensus behind integrated data is arguably more rewarding to the business than the tools, because consensus on critical objects and events brings non-IT-specific improvements such as fewer repetitive and conflicting business processes, fewer communication breakdowns caused by terminology disconnects, and more.
To me the beauty of the Inmon DW model is that it provides a mechanism that can assist an organization in evolving toward improved information maturity. Organizations achieve some benefit by simply integrating data into a single data warehouse. However, the data warehouse also makes source data quality problems obvious and blatantly reveals differences in data meaning from one operational source to another. So the warehouse delivers some benefit early and also shows how much better it would be if data were integrated. It therefore becomes a tool for identifying, assessing, prioritizing, and motivating correction of data deficiencies.
For organizations not so far along on the maturity curve, the additional complexity of the federated warehouse tends to obscure this data quality feedback loop. Federation based on drawing from operational sources integrates data from a set of different databases built toward different architectural goals. On the other hand, the logical data model for the enterprise warehouse is the enterprise data model, and its architectural objective is to integrate enterprise data to provide a single source of truth. Therefore, the enterprise data warehouse provides an architectural focal point for integration. It isolates responsibility for improving data integration crisply at either the source or the warehouse and, within the framework of solid information management strategy, management, and facilitation, motivates diverse business players to work toward a consensus definition of enterprise data.
Federation, or virtual data warehousing if you will, can be the best strategy for the mature organization that has already integrated business data to a consistent enterprise view. For the rest of us, the single centralized warehouse with its unambiguous architectural goals and borders seems the shortest distance to achieving the business benefits of data integration.