Data Vault modeling concepts through real-world examples
The two predominant modeling techniques for applications are
(1) fully normalized models (based on normalization principles evolved in the 70’s) usually following 2NF for batch-based databases and 3NF for DML-intensive hot OLTP databases
(2) read-optimized data warehouse models known as Star Schema or Dimensional model (based on Kimball’s methodology)
All other techniques (for instance, top-down data warehouse design known as Inmon methodology, snowflake models etc.) are a combination or slight variation of the above two.
Data Vault modeling solves a problem that these two techniques doesn’t solve. And that problem is rigidity. It is better explained with an example.
Consider a subscription business (that sells whatever — any product). Imagine two entities (ie, tables) Customers and Subscription. Each customer signs up with an email-id and has exactly one subscription
Let’s imagine the company has 5 types of subscriptions varying in cost and levels of service offered. Each subscription has many customers in them. So:
Customer to Subscription is one-to-one relationship.
Subscription to Customer is one-to-many relationship.
After a few years business found that there is revenue opportunity in allowing a customer to purchase multiple subscriptions, for instance when a customer wants a second subscription for their family member, but under the same account (email-id). So now the data model has to change such that:
Customer to Subscription is one-to-many relationship.
Subscription to Customer continues to be one-to-many relationship.
Hence this has become a many-to-many relationship.
What used to look like this:
should now look like this:
The second data model introduces a bridge table that allows the new subscription model. But this design change is costly: when data model changes all applications that write to and read from the database has to be changed. ETLs that load the database need to be rewritten. This could easily become a multi-month project for complex data models.
Data Vault as a solution
The problem here is that the 3NF model (or 2NF based on the potential attributes on those tables) is rigidly expressed in database schema for specific business rules. When these rules change, model has to change. Essentially this means the relationship between data could change over time and the model should accommodate it without forcing the applications to re-write their code.
What if the bridge table (or associative table, or relationship table) that resolved the new many-to-many relationship was always present there? It adds an overhead for sure (additional joins for example), but it will work for one-to-one, one-to-many and many-to-many relationships. At its core this is the idea of data vault modeling.
It introduces links between business entities such that changes in rules doesn’t require changes in software.
The core entities of a business — Customer and Subscription for instance — are called Hubs. Hubs only hold business keys, no descriptive attributes (for example, no customer name or address). Descriptive attributes like name and address go to Satellite tables. Link tables hold the relationship between hubs. For instance the bridge table in many-to-many relationship is a Link. Descriptive attributes pertaining to a Link (eg: subscription start date by a customer) go to the Link’s Satellite tables.
Because of the presence of Link tables, changes in business rules not only is flexible (no design change, no application rewrite), the Link also tracks business changes (using, for example, subscription start date or other relevant attributes). This can be audited either for internal reporting or for legal purposes.
Other use cases for Data Vaults
The above example shows how data relationships are explicitly tracked through data records as opposed to foreign keys. This is one use case but DV models can be useful in other areas, just two of them given below.
(1) Data warehouses
In data warehouses, the Links could represent transactions that go into fact tables in an otherwise dimensionally modeled warehouse. Unlike fact tables, Links in DV only hold the keys from Hubs (which equate to dimension tables). All measures and descriptive attributes of the transaction record goes to Satellites of the Link. Similarly all descriptive attributes of Hubs go to their own Satellites.
Essentially the attributes that make up a data warehouse (measures and dimensions) sits on Satellite tables. The Type-2 or Type-4 changes are implemented on Satellites.
The challenge here is that this is not in a read-optimized format, requiring multiple joins. For analytics, data vault warehouses usually have a reporting layer or data mart on top of the Hub-Link-Satellite raw data. This means that if the sole purpose of a data warehouse is analysis of data and decision support, data vault is the wrong solution.
(2) Problem of multiple sources of truth
Imagine the above subscription business selling products that can have different definitions or descriptions between different vendors/resellers of the same product. You get data from these vendors, one of them has ProductA, a different vendor calls it ProdA, yet another calls it product-A.
Your application will have to interact with these vendors in their vernacular. But your reporting system should resolve this difference and show the sales of these products as the sales of a single item.
A data vault style model can accumulate each vendor’s data in its own Hub and use a Link to resolve the dependencies. In the above example, a Link table could be used to load one record each for ProductA, ProdA and product-A, and all of them set to product_id = 100 which would be an inorganic key that ETL generates. Obviously it’s upon us (not the vendors) to resolve the matching products to one item through matching techniques (similar to the record-linkage techniques used in master data systems).
Data Vaults are neither write-optimized (as in OLTP 3NF databases) nor read-optimized (as in OLAP or dimensional models), and this is its main drawback. Unless you have a specific need for flexibility and/or auditing, it isn’t worth implementing.
DV is usually implemented in data warehouses. This is because data warehouses are huge vertically scaled-up systems that can afford the lack of write-optimization — write-intensive applications usually have their own little database. To solve for read-optimization we are already used to building data marts in warehouses which can be built on top of DVs as well.