To me, development projects fail or succeed in the first few weeks. Once a project starts off in the wrong direction, momentum and expectations tend to prevent a return to the proper path. With today’s wealth of database options each addressing exciting new possibilities, the right choice for the application’s data foundation plays a large part in steering a project to success.
At this year’s Enterprise Data World conference, William Brooks showed the relations among different data modeling approaches, in effect detailing how to derive nine different model types from a detailed conceptual entity-relationship model. Mr. Brooks’ presentation hinted at a way to frame your data direction correctly early in a project, setting the stage for success.
According to his presentation, called “Symmetry in Modeling Approaches”, the different model types (relational, graph, dimensional, JSON, XML, and so on) all represent different perspectives on the same data relationships. Each suits a different application: dimensional for reporting applications, data vault for data warehouses, graph databases for multi-layered search, and so on. However, if properly constructed, they all map back in predictable and specific ways to a normalized entity-relationship model.
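To make the idea of "different perspectives on the same data relationships" concrete, here is a minimal sketch. It is not taken from Mr. Brooks' presentation; the entity names, sample rows, and the `to_document` and `to_graph_edges` helpers are all hypothetical. It starts from a small normalized set of entities and derives a nested document view and a graph edge list from the same foreign keys.

```python
# Hypothetical normalized ER data: each dict is one entity, keyed by primary key.
customers = {1: {"name": "Ada"}}
payment_methods = {10: {"customer_id": 1, "kind": "card"}}
transactions = {
    100: {"payment_method_id": 10, "amount": 25.0},
    101: {"payment_method_id": 10, "amount": 990.0},
}


def to_document(customer_id):
    """Document view: nest payment methods and transactions under a customer."""
    doc = {"customer_id": customer_id, **customers[customer_id], "payment_methods": []}
    for pm_id, pm in payment_methods.items():
        if pm["customer_id"] != customer_id:
            continue
        txns = [
            {"transaction_id": t_id, "amount": t["amount"]}
            for t_id, t in transactions.items()
            if t["payment_method_id"] == pm_id
        ]
        doc["payment_methods"].append(
            {"payment_method_id": pm_id, "kind": pm["kind"], "transactions": txns}
        )
    return doc


def to_graph_edges():
    """Graph view: every foreign key becomes a labeled edge between two nodes."""
    edges = []
    for pm_id, pm in payment_methods.items():
        edges.append((("customer", pm["customer_id"]), "OWNS", ("payment_method", pm_id)))
    for t_id, t in transactions.items():
        edges.append((("payment_method", t["payment_method_id"]), "PAID", ("transaction", t_id)))
    return edges


document_view = to_document(1)
graph_view = to_graph_edges()
```

Both views are derived mechanically from the same relationships, which is the sense in which a well-constructed document or graph model maps back to the normalized one.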
I and others write that ER modeling should be integral to requirements definition, but Mr. Brooks’ presentation implies that ER modeling can also serve as the basis for application architecture.
Early requirements definition should follow parallel, but closely intertwined, data, functional, and non-functional paths. Taking a fraud detection application development project as an example, one workstream defines what data is required to identify suspicious anomalies, another defines how to gather, analyze, and report anomalies, and a third defines required speed, scale, reliability, availability, security, and so on. Each track draws from and informs the others. For example, resolution of many-to-many ER relationships informs functional requirements by revealing additional business objects to manage. Discovery of previously unrecognized business processes reveals the need for new data. A requirement for protection of personal information might drive the need for enhanced data about the usage patterns of the fraud detection system itself, and for user management processes beyond the organization’s status quo.
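As an illustration of how resolving a many-to-many relationship surfaces a new business object, consider a sketch in which a customer can hold several payment methods and a payment method can have several holders. The `PaymentMethodHolder` entity, its `role` and `authorized_on` attributes, and the sample rows are assumptions made for this example, not part of any real project.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class PaymentMethod:
    payment_method_id: int
    kind: str


@dataclass(frozen=True)
class PaymentMethodHolder:
    """Associative entity that resolves Customer <-> PaymentMethod (many-to-many).

    Once it exists, it carries its own attributes and becomes a business
    object that someone has to create, approve, and retire.
    """
    customer_id: int
    payment_method_id: int
    role: str
    authorized_on: date


# Hypothetical sample rows.
holders = [
    PaymentMethodHolder(1, 10, "primary", date(2020, 1, 15)),
    PaymentMethodHolder(2, 10, "authorized user", date(2021, 6, 1)),
    PaymentMethodHolder(1, 11, "primary", date(2022, 3, 9)),
]


def customers_for(payment_method_id):
    """Return the sorted ids of every customer linked to one payment method."""
    return sorted(h.customer_id for h in holders if h.payment_method_id == payment_method_id)
```

In this sketch the data track has produced something the functional track must now handle: a process for authorizing and removing holders.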
With this three-pronged requirements approach, recognizable shapes tend to emerge early, enabling teams to set up for success with grounded technical choices. For our fraud detection example, we might find customer, transaction, and payment method entities, with five million customers, three billion transactions per year, and 10 million payment methods. We might require notification of potential fraud before the loss event for 90% of threat occurrences. And finally, early detection might require analysis of unstructured data, like loss notices and adjuster reports for insurance fraud detection, or “email, desktop documents, internet logs, phone calls, text messages, and social media messages” in the case of financial fraud.
The early shapes that emerge drive sound architectural decisions, like choosing the right data framework for the application. In our fraud example, the big-picture requirements that emerged were large data volume (around three billion transactions per year at about 1,000 bytes each is roughly three terabytes per year, plus massive unstructured data), rapid response for early fraud detection and reporting, and high security for customer and employee personal data.
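The volume estimate above is simple arithmetic, and a back-of-the-envelope script makes it easy to rerun as the numbers change. This sketch uses only the figures from the example, with decimal terabytes (10^12 bytes); the function name `yearly_volume_terabytes` is made up for illustration, and the unstructured data is left out because the example gives no size for it.

```python
TRANSACTIONS_PER_YEAR = 3_000_000_000  # from the fraud detection example
BYTES_PER_TRANSACTION = 1_000          # rough per-row size assumed in the example
BYTES_PER_TERABYTE = 10**12            # decimal terabyte


def yearly_volume_terabytes(rows_per_year, bytes_per_row):
    """Estimate raw structured data volume per year, in terabytes."""
    return rows_per_year * bytes_per_row / BYTES_PER_TERABYTE


structured_tb = yearly_volume_terabytes(TRANSACTIONS_PER_YEAR, BYTES_PER_TRANSACTION)
print(f"Structured transaction data: about {structured_tb:.1f} TB per year")
```

The result covers raw transaction rows only; indexes, replicas, and the unstructured sources would add to it.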
These requirements allow us to narrow data platform selection from today’s enormous and confusing array of alternatives. Volume and unstructured data requirements eliminate ER or dimensional platforms from the picture. The system’s operational (as opposed to informational) nature and performance requirements steer away from data vault, analytical structures, raw data lake, anchor model, or key/value pair structures, all intended for either lake/warehouse storage or less time-sensitive analytics.
On the other hand, graph or document (XML, JSON) structures that support unstructured or semi-structured data, high volume, and high performance seem to fit within our example’s high-level requirements. Armed with that information, the team can limit the search to platforms like MongoDB, Cassandra, or one of their alternatives, or one of the many graph database offerings.
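The elimination reasoning in the last two paragraphs can be written down as a small filter. This is only a sketch of the argument, not a selection tool: the `RULED_OUT_BY` table encodes the characterizations made in this article rather than vendor benchmarks, and the requirement labels are invented for the example.

```python
# Which high-level requirement rules out each model type, as argued in the text.
RULED_OUT_BY = {
    "relational (ER)": {"high_volume_unstructured"},
    "dimensional": {"high_volume_unstructured"},
    "data vault": {"operational_low_latency"},
    "analytical structures": {"operational_low_latency"},
    "raw data lake": {"operational_low_latency"},
    "anchor model": {"operational_low_latency"},
    "key/value": {"operational_low_latency"},
    "graph": set(),
    "document (XML, JSON)": set(),
}


def shortlist(requirements):
    """Keep the model types that none of the given requirements rule out."""
    return [
        model
        for model, blockers in RULED_OUT_BY.items()
        if not (blockers & requirements)
    ]


# Hypothetical labels for the fraud detection example's big-picture requirements.
fraud_requirements = {
    "high_volume_unstructured",
    "operational_low_latency",
    "high_security",
}

candidates = shortlist(fraud_requirements)
print(candidates)
```

Writing the table out makes the assumptions explicit, so a team that disagrees with one characterization can change that entry and see how the shortlist shifts.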
Many projects start with narrow data platform choices imposed by organization standards. For example, I once worked on a fraud detection app in which high business expectations were dampened by an early decision to implement in a clearly unsuitable RDBMS. Similarly, many organizations seek one-size-fits-all database solutions, applying a uniform 1990s approach to diverse current-century needs (here, here, and here). An organization large enough to develop rather than purchase applications needs a variety of data platforms to capitalize on today’s variety of processing options.
Successful modern database projects start with a sound understanding of data, process, and non-functional requirements, detailed enough to make broad-brush architecture decisions, including the right data platform for the job. That initial step can set the project on a path to success.