Data and Wine?
Saturday, November 14th, 2009Great together, check this out:

Great together, check this out:

In a recent very thoughtful post on data quality, Paul Erb plays out an analogy comparing data users with Don Quixote and data quality professionals with Sancho Panza, then reverses the analogy to cleverly coin the “Sancho Panza” test of data quality professionals. He encourages data quality professionals promoting the critical role of data quality to apply a what would Sancho say test to ensure that they are aligned with the needs and interests of data consumers.
Here’s Paul’s description of the Sancho Panza test:
Think of Don Quixote [DQ] as the data-quality specialist or even the data management specialist or software vendor, bringing to the world his specialist’s perspective and vocabulary and enthusiasm, influenced by the books he’s read, visioning everyday business practices, with his value added, as goldmines for the organization. Meanwhile Sancho Panza represents the person who does a practical job every day, who knows what works around here and what doesn’t.
I advocate to Data Quality (let’s call it DQ) consultants that they listen to this Sancho Panza, and consider themselves as Don Quixote. Sancho doesn’t know much about data, but he knows what he likes… He’s open to listening, but slow to change, and he’ll tell you what he thinks.
Paul’s article reminded me that as a child I thought the problem with Don Quixote was that he tilted at windmills and attempted to ambush acting troupes because of his bad eyesight. Of course this is not the case, but to me it provides a relevant perspective on data quality in many organizations.
Here’s the problem I’ve seen play out on a number of IT application projects:
In these situations the business case was incorrect because it did not account for all of the costs of data integration. I’ve seen projects weather steps four and five well, but often discovery of previously unseen data complexity starts a disruptive chain of events. (Sadly for the project manager, such situations are often seen as a failure of project management and corrected accordingly, but that’s a topic for another post.)
In my view the root cause of unforeseen data complexity on projects is the lack of a data constituency in current IT. It is only recently that success of companies like Google and Amazon have motivated emergence of data as a key business resource in the collective consciousness. Famous success stories notwithstanding (see this link), there are relatively few senior IT managers with data quality backgrounds. Conversely, many rose through the ranks of the infrastructure, application development, or business (process) analysis groups.
It will be a while before, for example, a Mobil CIO’s predecessor jobs include definition of a metadata repository or elimination of multipurpose data, but in the meantime here’s what we can do: add a business case to the application lifecycle as the last step in requirements. Stop the project when the real costs are known, recalculate the cost/benefit, and ask the sponsors if the project should continue. Give Sancho (in this case the project team) a chance to speak to the reality of the situation, and hand to Don Quixote (project sponsors) the eyeglasses of in-depth visibility into real costs. If the decision is to move ahead with the project, then all share the same vision and the sponsors have endorsed the actual project, not the fuzzy image from earlier on that might have been a windmill.
It seems I’m frequently in conversations about using null to represent a business value. To paraphrase, say there are credit and cash customers, and there’s a suggestion to set “Customer_Type” to “C” for credit and null for cash. To data and database professionals this is obviously a bad idea, but it’s not obvious from a business point of view.
In a database null means that there is literally no value, or the value is indeterminate. Null is not the same as zero or blank. When a database operation involves nulls the result can be difficult to predict for someone not practiced in SQL. In many cases the answer is null. For example, 1+0=0 but 1+null=null. In plain English, what you’re asking the DBMS to do in the latter case is to add 1 to [I don't know what], and of course 1+[I don't know what] equals [I don't know what].
So, if you use null to represent a business value then you might not get the results you’re looking for when you try to get business answers out of your database. For example, say “C” represents credit customers and null represents cash customers, and you have 2 cash and 1 credit customers. In SQL Server if you use a Count function to tally all of your cash customers the answer isn’t 2, it is null.
That’s one example of why it’s not a good idea to try to represent a business fact with a null value. It doesn’t make business sense and in this case the DBMS, correctly, won’t make sense of it for you.
To be clear, whether or not a given database column permits null values is an entirely different question, best left to database designers. For example, a database table might record which patient occupies which hospital bed. It may be reasonable and correct to assign a null patient ID if the bed is currently available. However, there are alternative methods of representing this situation, and the database designer should be free to choose the right alternative taking into account the specifics of the application under development.
While many organizations understand the value of managing the information resource, for many others information management remains abstract and difficult to define. In an effort to make it concrete here’s a hypothetical proposal to provide an Enterprise Information Architect for a hypothetical organization that really needs one.
Today: inconsistent data of uncertain quality blurs enterprise view and restricts planning
Today managers, planners, and analysts lack the information required to run the organization as a single enterprise rather than a collection of diverse units.
Solution: enable sound planning and management by identifying data assets and setting processes to manage them
Empower an Enterprise Information Architect to lead an effort that (1) identifies data that describes the organization, (2) defines how to integrate and improve quality of that data, and (3) improves the ability of information technology to maintain data quality.
(1) Lead definition of an Enterprise Information Architecture identifying information required to manage the organization as a single integrated enterprise, and data quality standards that ensure that data supports enterprise goals.
(2) Working with senior business managers, put in place a program of data quality improvement that plans and executes specific measures and sustained commitment to improving data quality in business processes and IT applications
- System of origination is the application that provides the entry point of a given data object to the organization.
- System of record is the application that is the source of record for the data object.
(3) Working with senior IT managers define and put in place standards for application requirements definition, data management, and metadata management to
Much of today’s IT application development – custom or off-the-shelf – involves integrating data from legacy systems, third- party software products and external data sources such as demographics or mail lists. More often than not, data integration is unexpectedly complex, either due to data quality issues or the nature of the data integration itself.
Here are some typical examples:
When data integration is part of a project, expect complexity and leave room in interface development estimates for devils in the details of source system analysis and integration design.
Everyone involved in managing or improving a business process should understand data modeling. For real. And almost everyone is in a position to improve a business process by understanding the current one and making suggestions to improve it. Understanding a business process means understanding business objects, events, the relations among them, and the business rules that govern the process, and that’s what data modeling is all about.
Sure, data modeling is the first step to designing a database, but that’s just a coincidence. A well designed database is well designed both because it’s efficient and because it matches business needs.
The first step in data modeling is understanding entities. An entity is like a business object: examples may include customer, order, product, patient, blogger, post, or whatever. The next step is to understand relationships among the entities, like a customer may place many orders or a post is written by a blogger. Then the data modeler thinks about the attributes of the entities, like the name of the blogger, the price of the product, and so on. The attributes and relationships are where the business rules come from. Examples of rules may be “a post is written by exactly one blogger” or “every order must have a shipping address.” It’s not really a sequential step by step thing, more like a series of really interesting brainstorm sessions, but you get the idea.
Data modeling can get really complex, especially when it includes enough detail to generate an actual database, but that’s beside the point. We’re talking about clear thinking about business things, events, relationships, and rules. The point is that this kind of thinking can enable a business person to understand better the things, events, and rules of a business, and then to define more rational processes based on that understanding.
Today a lot of the problems that data modeling helps reveal relate to overcomplicated org charts and artificial complexities of legacy information systems. Often the business evolves around the system, and it takes clear thinking to untangle the process spaghetti that results.
I worked with one company that organized itself by four different product line channels, served by matrixed support functions like accounts receivable and claims. Worked pretty well, except when you wanted to know all of your contacts with a given customer, for example. The same customer could have had a different record for each channel, then more information was socked away in the matrixed support functions. Furthermore, the customer records were all laid out differently, and had different addresses, contact information, billing instructions, and so on.
Maybe I’m oversimplifying, but shouldn’t one customer have one file with the same information that everyone uses? It seems to me that if applied in the real world this data-oriented perspective could really make things simpler and more cost effective.