Posts Tagged ‘Database Design’

Use conceptual data modeling in requirements definition

Friday, July 16th, 2010

I’ve often thought that conceptual data modeling was an underused tool in the arsenal available to requirements analysts, and in a recent conversation I found that many were surprised that it would be used in the requirements phase at all.  Checking the Business Analysis Body of Knowledge (BABOK) I found data modeling listed among the tools available to requirements analysts to “to describe the concepts relevant to a domain, the relationships between those concepts, and information associated with them.”  There’s also Steve Hoberman’s excellent book on the topic, Data Modeling for the Business, an introduction to data modeling aimed at a business audience.

Data modeling has long been one of my requirements analysis tools of choice for custom operational applications.  To me, using data modeling techniques in requirements analysis reduces errors by improving requirements completeness, consistency, and communication, and provides unique continuity between analysis and design.   As David Elliott told me in conversation, “development of a data model uncovers many opportunities for clarification of existing requirements, or uncovering of additional detail.  At the very least, it confirms to one’s business customer that the BSA understands and can graphically demonstrate many business rules and relationships.”

I’ll hasten to add a these caveats.  (1) Perhaps strangely, conceptual data modeling is not useful in the same way in requirements for informational systems like data warehouses and marts (I’ll save that discussion for another post).  (2) Requirements definition for commercial off-the-shelf (COTS) applications follows a different methodology in which data modeling might be less applicable.  (3) This post is not about database design, but rather about use of conceptual data modeling as a tool for organizing and validating requirements.

Conceptual vs. logical data modeling

It is easy to see why in practice there are varying definitions of the different types of data models.  The Wikipedia entry on data modeling reflects the standard terminology based on the ANSI four-level database architecture, but features a confusing diagram that to me blurs the distinction between conceptual and logical models. The entries on Logical Data Model and Conceptual Data Model make them sound like the same thing: implementation-independent representations of business data.  Then, the entry on Database Design contradicts them by stating that the logical model “contains all the needed logical and physical design choices and physical storage parameters needed to … create a database.”

For this post I’ll follow the definitions offered in Simison and Witt’s Data Modeling Essentials:

  • Conceptual data modeling identifies a set of data structures that will meet requirements, focusing on business and not on technical or DBMS-specific concerns
  • Subsequent logical data modeling maps the conceptual model to structures supported by the particular DBMS, finalizing the design in DBMS-appropriate constructs but not yet optimizing for performance, which comes next in physical modeling

Use of data modeling in requirements definition

Conceptual data modeling is hardly an outlier technique in requirements definition:

  • Perhaps in reaction to problems experienced by adopters of Structured techniques in the 80s, data modeling was the cornerstone analytical technique in Clive Finkelstein’s and James Martin’s widely-adopted Information Engineering methodology.
  • The BABOK includes class modeling, data modeling’s object-oriented cousin, in its chapter on data modeling.  Class modeling is a core technique of object-oriented analysis.
  • Scott Ambler’s Agile Modeling site offers conceptual (or “slim”) data modeling as an option in the initial envisioning stage.
  • Informally searching requirements definition templates available on the web, I found that about a third recommend including conceptual data models.

Benefits of data modeling in requirements analysis

The BABOK separates requirements gathering from requirements analysis, defining requirements analysis as an essential step to organize, prioritize, and validate elicited requirements. Elicited requirements are the business objectives of the system. The analysis step organizes those objectives in a way that both makes sense to the business and guides subsequent application design.  Conceptual data modeling in this stage helps ensure requirements completeness, consistency, and communications:

Completeness: In my experience most requirements analysis is process-based, and the most common tool the “swim lane” activity diagram.  While such techniques are essential for understanding complex processes, they can miss requirements that aren’t directly involved in the process itself.  For example, a complex process might reference federal, state, and local tax rates by zip code.  Analysts who are heads-down in defining the process might neglect the need for at least annual refresh of the tax rate tables.  Data modelers thinking in terms of business objects and events and their life cycles would be less likely to miss that one.  This kind of review is formalized in the “CRUD Matrix” a table identifying which business activities create, read, update, or delete which business entities.

Consistency: Another challenge with process-oriented techniques is, for large systems, the risk of inconsistency in definition of business objects and events.  For example, I worked on requirements definition of a specialized order processing system.  Separate sub-teams defined field and headquarters processes, and as a result there were incompatible definitions of critical concepts like “customer” and “order”.  Time pressures made it difficult for the two sub-teams to work together to make their work consistent.  A separate data modeling sub-team can provide a reference point for object and event definitions and promote consistency between separate process analysis teams (on COTS installations the product database itself serves as the reference data model for separate process definition teams).

Communications: Data management professional Peter Carr recounted to me his experience as a consultant on a large project: “the Conceptual Diagram helped us think about all the current state situations, and broadly about the relationships between entities in the organization.  It helped us to ask questions of the business when they were looking to enhance or build new systems to solve business requirements.  Paraphrasing executive colleague Rich Hartt, “the enterprise data model is like a piece of art, it provides a picture into the business that offers new insight through its drawing and interpretation’.  He went on to tell the key business leaders that the enterprise data model will continually be changing, but that it helped them gain understanding of their business in a different way than written business rules”.

For relational database applications, data modeling applies the same conceptual tool throughout the development cycle.  A conceptual data model used to define the problem domain uses the same structure and symbols as a physical database design, although of course it uses fewer.  On the other hand, activity diagrams and data flow diagrams are fundamentally different in nature from the software that they describe.  In effect, process designers need to translate analysis artifacts into a different language.  Logical and physical data modelers use exactly the same language as the business analysts who complete the conceptual data model.

My colleague Grayson Gorman cites “Poorly Defined / Missed Requirements” as a key contributor to IT project failure.  In my view making data modeling a more prevalent part of requirements definition could help by improving requirements completeness, consistency, and communication with business participants, and promote a seamless transition from requirements to design.

SQL Saturday #30, Richmond Virginia, April 10, 2010

Friday, April 9th, 2010

Thanks to all who attended my presentations at SQL Saturday on April 10.  Here are the materials from my two presentations:

- The Business End of Data Modeling (2.5m powerpoint presentation)

- Normalize Metadata For Data Integration Analysis (5.5m full version, zip including presentation and code samples)

- Normalize Metadata For Data Integration Analysis (small) (2m reduced size version, graphics removed from ppt file)

Here are some quick notes for those looking to run the Metadata prototype:

The prototype metadata database includes SQL Server 2008 data definition language and data manipulation language (DDL and DML) needed to create the database and populate it with tables and columns from Microsoft’s AdventureWorksDW sample database. It also includes a sample requirements spreadsheet and source-to-target map, and SSIS jobs to load the spreadsheets to corresponding metadata tables. These define fictional requirements and mappings to populate the AdventureWorksDW FACTInternetSales table from tables in the AdventureWorks sample database.

AdventureWorks and AdventureWorksDW are available here: http://msftdbprodsamples.codeplex.com/Wikipage (accessed 4/14/2010)

Cloud databases and business/IT alignment

Sunday, August 23rd, 2009

Today, the foundation of most of our custom-built systems is a relational dbms.  While development frameworks vary, they overwhelmingly access and maintain data in relational tables and columns.  As I write I routinely save this post in a MySQL database, and at work I tend SQL Server applications.  Millions of others develop, use, and extract analytical data from thousands of SQL Server, DB2, and Oracle applications, on servers and networks maintained in-house by in-house administrators.

Some claim that the relational dbms may be out of style very soon.  Cool new “cloud computing” and “SaaS” apps and services  delivered over the internet seem to be popping up everywhere – just look at Salesforce.com, the well-established Customer Relations Manager vendor, and the many cloud-based PC backup sites.  As part of that trend, Amazon, Google, Microsoft and others offer database services over the internet that don’t look much like relational dbms’s.  Some supporters of the cloud-db options seek alternatives to the standard relational DBMS (note this widely read article).  Of these, many are OO developers.  There’s a fundamental dissonance between OO and relational approaches, requiring an intermediate object/relational mapping (ORM) layer for OO systems to operate effectively with relational DBMSs.  Many of the new cloud-db options are open source, lightly structured data services provided via the internet, capable of storing and delivering large data stores for high availability, fast response applications.

The convenient thing about relational databases is that they pretty much match the business view of data, and therefore give business people and developers common ground.  A well thought out relational data model is one way to express the inherent structure of business rules (see this previous post).  A relational model at the back end of a custom-built system means that both developers and business people can talk about the real guts of a system in ways that make sense to both, like this:

  • Developer to business person: “Should we allow a part_order to include items from only one division”
  • Business person to developer: “After a call from our shipping department, I ran a query on the part_order table and found a that there is a part_order with null shipper_phone_number. I thought it was a required column, what’s up?

How’s it going to be when those comments don’t reflect the underlying structure of the database?  Today’s cloud db offerings vary in structure, but tend to favor highly efficient and flexible models like name-value pairs, and avoid the overhead required by semantic layers like the relational model.  According to the MongoDB site, “by reducing transactional semantics the db provides, one can solve an interesting set of problems where performance is very important.”

In such databases the structure of the data will be hidden from business people; there will be no shared business/IT view.  Rather than talking with business people about the actual database structure we’ll talk about its custom abstraction, and when things go poorly with performance and functionality the developer will in effect say “trust me on this one” to the business person rather than explaining what’s up.

For a long time cloud databases will be another option alongside the relational model, but the more prominent cloud databases become the more difficult it will be for developers and business people to communicate about business data in IT applications, and it could be a serious challenge for developers to learn to cross that communications gap without the bridge provided by the relational model.

SQL Server Row Level Security @ Richmond Code Camp 2009.1

Monday, April 27th, 2009

—–

Update 10 January 2010: Thanks to Gints Plivna for observing that we had not posted the slides to this presentation, here they are: Pretty Good Row Level Security Slides.  – Bob

—–

Thanks to those who attended Saturday’s Microsoft Code Camp (see http://richmondcodecamp.org/).  Here are materials for the presentation “Pretty Good Row Level Security” which I did with Nic Morel, my fellow CapTech Ventures Lead Consultant.

Summary:

The presentation reviewed a security solution that limits access at the data level, leverages simple database protocols, requires minimal administration resources, applies to both users and application developers, and overcomes some of the vulnerabilities of default approaches like having the application use a “system” user id.

The solution presents a three layer security architecture:

1. Base tables are accessible only to DBAs.

2. A security layer including cross-reference tables relating user ids to key data like department or sales territory, and table-valued functions that accept user id as a parameter and deliver data from base tables with a join to the security cross reference tables.

3. A data access layer exposed to users consisting of views that access the table-valued functions, providing the current user id as a parameter.  End users and developers only have access to this data access layer.

Email me if you’d like the powerpoint (a little big to attach here, even without the accent graphics), and I’ve pasted in the SQL scripts as a comment on this post.  The second two examples run against the SQL Server AdventureWorks database that ships with 2005 Developer Edition.

No business value in nulls

Sunday, April 5th, 2009

It seems I’m frequently in conversations about using null to represent a business value.  To paraphrase, say there are credit and cash customers, and there’s a suggestion to set “Customer_Type” to “C” for credit and null for cash.  To data and database professionals this is obviously a bad idea, but it’s not obvious from a business point of view.

In a database null means that there is literally no value, or the value is indeterminate.  Null is not the same as zero or blank.  When a database operation involves nulls the result can be difficult to predict for someone not practiced in SQL.  In many cases the answer is null.  For example, 1+0=0 but 1+null=null.  In plain English, what you’re asking the DBMS to do in the latter case is to add 1 to [I don't know what], and of course 1+[I don't know what] equals [I don't know what].

So, if you use null to represent a business value then you might not get the results you’re looking for when you try to get business answers out of your database.  For example, say “C” represents credit customers and null represents cash customers, and you have 2 cash and 1 credit customers.   In SQL Server if you use a Count function to tally all of your cash customers the answer isn’t 2, it is null.

That’s one example of why it’s not a good idea to try to represent a business fact with a null value.  It doesn’t make business sense and in this case the DBMS, correctly, won’t make sense of it for you.

To be clear, whether or not a given database column permits null values is an entirely different question, best left to database designers.  For example, a database table might record which patient occupies which hospital bed.  It may be reasonable and correct to assign a null patient ID if the bed is currently available.  However, there are alternative methods of representing this situation, and the database designer should be free to choose the right alternative taking into account the specifics of the application under development.

Grow your own row-level security

Wednesday, March 18th, 2009
Dr. Dobbs Portal

Excerpt from "Protecting Your Data with Row Level Security for SQL Server Databases," March 17, 2009

Data security is not optional in today’s business environment. High-visibility hacking and fraud, Sarbanes-Oxley, HIPAA regulations, and the Patriot Act all reinforce the need to present the right data to the right users and prevent the wrong ones from gaining access. Typically, “row level security” (RLS) is one requirement: to allow or permit access to particular users based on data in a particular database row. SQL Server does not provide built-in row level security.

In an article at Dr. Dobbs I present a way to build your own SQL Server row level security that restricts user access to data based on data in the row, without changing content of business tables, without affecting application or presentation developers, and regardless of how users access the data. Here’s my example application: How to can I add security to my existing orders database that will limit managers to departments they manage — and departments that report to those they manage — regardless of how users get to the tables and what queries and reports are developed against the database?

There are a number of ideas and code samples for hand-built RLS solutions ideas available on the web.  My favorite is A Fairly Capable Authorization Sub-System with Row-Level Security Capabilities (AFCAS) by Kemal Erdogan.  His solution is based on a lookup table, similar to the one I present. His solution is better than many others because it doesn’t involve business table changes, like adding a security code to each row, but still leaves tables unsecured if users access tables directly rather than through a given application.

The essence of the approach is to (1) use cross reference tables linking userid (or Active Directory group id) to business data values, and (2) protecting data tables behind table-valued functions that accept userid as a parameter and join it to the cross reference tables, returning only those rows that the user is permitted to see.

Using this solution can reduce security admin overhead and enable users to get to what they need without putting secure data into the wrong hands.

Someone’s integrating your data

Thursday, March 12th, 2009

Here’s a little-recognized fact about data integration: if you run a business or any sizable chunk of one, someone is integrating your data.

In my professional life I have on occasion suggested data integration efforts.  Sometimes my suggestions have been accepted and sometimes not.  As an IT professional I understand that different managers have different priorities, and in a given business situation sometimes other things are more important for example than having a single, consistent source for all customer records, or making sure production data matches financial data.

But as a customer?  That’s different.

A couple of years ago I bought a laptop from a company renowned for quality and customer service.  For the first weeks the computer was all it was cracked up to be, but then it cracked up.  The screen developed a mysterious flicker.  After a few diagnostic conversations they replaced the main logic board.  The problem recurred a few months later, and this time the company traded the lemon for a new computer.

All was well, but here we encounter the first data integration problem: they said I needed to call a different number to have my three-year service agreement transferred, which I did.

Months later I called service for a minor problem, and they had no record of the service agreement for the computer.  My warranty was still connected to the lemon.  After about an hour on the phone this company’s outstanding support staff came up with a more than satisfactory solution.

Even so, this company’s service records weren’t integrated with its warranty records.  In this case data integration happened because of my insistence and the service staff’s creativity.  The cost?  Only considering my last encounter, three service professionals were tied up for about an hour, and I’ll think twice before I buy again from this company.

It seems the choice is either pay now to integrate so all applications work from consistent data or pay later by having staff, customers, and suppliers do it on a case-by-case basis.

Beware the devils in the details of data integration

Sunday, March 1st, 2009
Excerpt from Illusions, Allusions – Let’s Get Real about Database Design, InfoManagement Direct, October 4, 2002

Excerpt from "Illusions, Allusions – Let’s Get Real about Database Design", October 4, 2002

Much of today’s IT application development – custom or off-the-shelf – involves integrating data from legacy systems, third- party software products and external data sources such as demographics or mail lists.  More often than not, data integration is unexpectedly complex, either due to data quality issues or the nature of the data integration itself.

Here are some typical examples:

  • One ERP package uses the same table for both Sales Quotes and Sales Orders. Columns that mean one thing for Quotes mean quite something else Orders. One team extracting data from this ERP package continually mixed up, for example, Date Received on the Quote with Date Prepared for the Order. The designer who blindly copies data from input systems can propagate these issues. In this case, the correct solution is to extract the two documents into separate tables in the destination system, making each column describe either a quote or an order, not both.
  • Marketing databases often store data purchased from several third parties on the same set of customers. These sources usually include overlapping columns with different values. For the same customer, different sources might store different values for the person’s address, credit scores or even name. It is sometimes important to preserve all of the columns from all of the sources and to maintain the information on where the data came from as well as what its value was. This can result in a messy database design, where columns again carry dual meaning: their value and their source.
  • Codes from legacy databases tend to evolve into complex forms, embedding more and more information into a single field. This is perhaps a natural reaction to the slow evolution of the system relative to changes in business, as users shoehorn information into the system that it was not designed to store. For instance, in a legacy system a one- character code might classify customers by “customer category,” with values 1 for small business, 2 for mid-size, and 3 for Fortune 5000. Users might add codes 4, 5 and 6 for corresponding values for aerospace customers, then 7 for federal government, and so on. The database designer must know the data well to extract each embedded concept into a different destination column.

When data integration is part of a project, expect complexity and leave room in interface development estimates for devils in the details of source system analysis and integration design.