Recently there has been a long, and very interesting, discussion of do-it-yourself versus third-party metadata tools on LinkedIn’s TDWI BI and DW discussion forum (membership required to follow the link). I’ve followed the thread but haven’t commented, though I suppose I contributed indirectly when Information Management kindly published my article on DIY metadata.
The discussion is extremely informative, presenting the views of a variety of knowledgeable professionals in different situations, and describing successful and sometimes not-so-successful efforts to solve the essential metadata challenge: how to document what information is locked up in databases.
The conversation has mostly been cast in terms of build versus buy. That said, I haven’t seen a post citing a successful enterprise-scale effort, and no tool emerged as a candidate for a comprehensive solution (please correct me if I’ve missed something). As one commenter put it:
- “An ETL tool will track the movement of the data.
- “A data quality tool would tell you the types of problems.
- “A reporting tool might tell you who uses them.
- “A modeling tool might tell you business definitions.
- “The DBMS will give you sizing data.”
He added that these diverse tools still lack essential capabilities, notably metadata sharing, problem reporting, and integration with planning software.
A few writers moved on to the obvious question: what are you doing this for, and is there a payoff? Justin Hay points to a nice blog article expressing the goals of metadata management in the bluntest practical terms, including the questions below (a sketch of how a simple repository might answer a couple of them follows the list):
- “Where does the data on that report come from?
- “How much of the requirement has been completed? At what stage is it now?
- “If a data element is removed from scope, what reports are affected?
- “If a report is removed from scope what data elements aren’t needed any longer?
- “How much overlap in data is happening between projects?
- “Where does the integrated data appear in the data marts?
- “Will the data model accommodate the source data?
- “Of the required data elements, how many show integrity issues through profiling? (Do we have to manually cross-reference them?)”
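A home-grown repository doesn’t need to be elaborate to start answering questions like these. Here is a minimal sketch in Python, using invented element and report names, of a source-to-target mapping held as plain data and queried for impact analysis; it illustrates the idea, not any particular tool.

```python
# Minimal sketch: a source-to-target mapping kept as plain data, plus two
# impact-analysis queries over it. Element and report names are invented.

mappings = [
    {"source": "CRM.CUSTOMER", "element": "customer_name",   "report": "Sales Summary"},
    {"source": "CRM.CUSTOMER", "element": "customer_region", "report": "Sales Summary"},
    {"source": "ORDERS.ORDER", "element": "order_total",     "report": "Sales Summary"},
    {"source": "ORDERS.ORDER", "element": "order_total",     "report": "Revenue Trend"},
]

def reports_affected(element):
    """If this data element is removed from scope, which reports are affected?"""
    return sorted({m["report"] for m in mappings if m["element"] == element})

def elements_no_longer_needed(report):
    """If this report is removed from scope, which elements are no longer needed?"""
    still_needed = {m["element"] for m in mappings if m["report"] != report}
    in_report = {m["element"] for m in mappings if m["report"] == report}
    return sorted(in_report - still_needed)

print(reports_affected("order_total"))             # ['Revenue Trend', 'Sales Summary']
print(elements_no_longer_needed("Revenue Trend"))  # [] -- order_total is still used elsewhere
```

Real repositories add sources-of-record, transformation rules, and ownership, but the payoff comes from exactly this kind of traversal.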
Not knowing the answers to these questions can cost money, most often in the form of additional analysis effort. That was why I built the prototype described in the Information Management article. I was on a team integrating data from diverse systems with overlapping content, the project was subject to frequent requirements changes, and the quality of data from our sources varied widely. We were challenged to keep our Excel mapping spreadsheets consistent and up to date, and believed a metadata solution would save time and effort. The project was cancelled for unrelated reasons shortly after we completed the prototype, so we never had the opportunity to put it to the test.
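The article describes the prototype itself; purely for illustration, here is a rough sketch of the kind of consistency check we wanted to automate, assuming the mappings live in Excel workbooks with SOURCE_TABLE, SOURCE_COLUMN, TARGET_TABLE, and TARGET_COLUMN columns (our actual spreadsheets were laid out differently).

```python
# Rough sketch (not the actual prototype): cross-check source-to-target
# mapping workbooks for conflicting entries. Folder and column names are assumptions.

import glob
import pandas as pd

frames = []
for path in glob.glob("mappings/*.xlsx"):   # hypothetical folder of mapping workbooks
    df = pd.read_excel(path)                # expects SOURCE_TABLE, SOURCE_COLUMN,
    df["workbook"] = path                   # TARGET_TABLE, TARGET_COLUMN columns
    frames.append(df)

all_maps = pd.concat(frames, ignore_index=True)

# Flag target columns mapped from more than one source -- the kind of
# inconsistency that creeps in when several analysts edit spreadsheets.
conflicts = (
    all_maps.groupby(["TARGET_TABLE", "TARGET_COLUMN"])
    .filter(lambda g: len(g[["SOURCE_TABLE", "SOURCE_COLUMN"]].drop_duplicates()) > 1)
)
print(conflicts.sort_values(["TARGET_TABLE", "TARGET_COLUMN"]))
```

Even a script this small beats eyeballing a dozen spreadsheets when requirements change weekly.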
On the other hand, metadata management can be a hard sell when it seems unnecessary at the project level. The next project I worked on built a data mart loaded from a single data warehouse. Warehouse-to-mart ETL restructured the normalized source data into dimensional mart structures, but data sourcing was pretty obvious: after a short learning curve, even source-to-target mapping documents were superfluous to a reasonably tech-savvy analyst. It would have been difficult to gain approval for the cycles a metadata solution required on that project, however valuable it might be from an enterprise perspective.
Even if enterprise metadata management has yet to take hold, point solutions can still be valuable. Unless a metadata management tool comes as an add-on to a package already in-house, like SQL Server or Informatica, I favor home-grown solutions over purchased point solutions. Building your own allows for a minimal point solution that precisely addresses the need, and it develops a deep understanding of metadata among team members.
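To give a sense of how small such a point solution can be: if your sources happen to sit in SQL Server, something along these lines (the connection string and output file are placeholders, not a recommendation) dumps a starter data dictionary that the team can annotate with business definitions.

```python
# Minimal sketch of a home-grown point solution: dump a data dictionary from
# SQL Server's INFORMATION_SCHEMA into a CSV for the team to annotate.
# Connection string and output path are placeholders.

import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mymart;Trusted_Connection=yes;"
)
rows = conn.execute(
    """
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
    FROM INFORMATION_SCHEMA.COLUMNS
    ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION
    """
).fetchall()

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["schema", "table", "column", "type", "nullable", "business_definition"])
    for r in rows:
        writer.writerow(list(r) + [""])  # blank column for analysts to fill in definitions
```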