How to think like a Data Modeler

Raj Samuel
3 min read · Jan 29, 2021

I think of modeling as an art: something that requires intuition, albeit with scientific methods behind it. There is more than one way to model a business. It's easy to criticize a data model after the fact. Part of our technical maturity is knowing that the specific nature of the universe at the time of modeling (the budget, the pressure, the people you interact with, the timelines, the blind spots, and so on) probably favored that particular design.

OLTP vs OLAP
If you have only modeled data for read-intensive workloads, as in dimensionally modeled data warehouses, you haven't had any fun at all. You're focused on a read-optimized model. At the other extreme, designing a hot OLTP system (a global reservation system, for example) isn't much fun either. Your focus is on a write-optimized model.

There's a third type of beast, expected to have fast writes (including updates) and fast reads, usually found in Operational Data Stores. You can't break data down to 3NF for faster writes (reads would pay for the joins), and you can't combine disparate entities for faster reads (writes would pay for the duplication). You have to strike a balance (challenging), and you'd better be an expert on the database product you're using (Stack Overflow expertise won't help).
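As a minimal sketch of that tension (the table and column names here are hypothetical, not from any particular system), compare a write-optimized pair of tables with a read-optimized one:

```sql
-- Write-optimized (normalized): the address lives in one place,
-- so changing it is a single-row update.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    address     TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
);

-- Read-optimized (denormalized): the join is pre-paid, but every
-- address change now has to touch many order rows.
CREATE TABLE orders_wide (
    order_id         INTEGER PRIMARY KEY,
    customer_id      INTEGER NOT NULL,
    customer_address TEXT NOT NULL   -- duplicated from customers
);
```

An ODS usually lands somewhere between these two, and exactly where depends on how your database product stores and indexes data.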

Core themes of modeling

Data Modeling at first seems like a bit of black magic accomplished through certain rules: relationship types, hierarchies, normalization rules, and so on. The kind of rules you'd expect to be asked about in an interview. But over time these rules take a back seat, and the modeling experience distills into a few core themes. To me, these are the three themes:

Understanding granularity

To consume data from an entity, or to make design changes to it, one should be clear about the grain at which data is represented in the table.

In an Accounts table the grain is the account. The natural key would be something like Account ID. If you want to see data at the customer's granularity, you're probably rolling accounts up to the customer level with a SQL GROUP BY, because a customer can have more than one account. So the question becomes: should you represent data at the customer level or at the account level?

The answer depends on the motive for designing that table. Who wants this table? What's their goal? Most often we design entities at the most granular level (account, in this example) and then let the users roll it up to higher levels (customer, region, etc.) through SQL or through pre-built aggregate tables.
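As a minimal sketch of that roll-up (the table and columns are hypothetical):

```sql
-- Account grain: one row per account; a customer can own several.
CREATE TABLE accounts (
    account_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    balance     NUMERIC(12,2) NOT NULL
);

-- Rolling the account grain up to the customer grain at query time.
SELECT customer_id,
       COUNT(*)     AS account_count,
       SUM(balance) AS total_balance
FROM   accounts
GROUP  BY customer_id;
```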

Most importantly, determining granularity drives other modeling decisions: should this be an identifying relationship or a non-identifying one? Is the cardinality one-to-many or one-to-one? And so on.
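To make those two relationship types concrete, here is a rough sketch (names are illustrative): in an identifying relationship the child's grain includes the parent, so the parent's key becomes part of the child's primary key; in a non-identifying one it stays an ordinary foreign key.

```sql
-- Identifying: an order line has no identity apart from its order,
-- so the grain is (order, line) and the parent key joins the PK.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY
);

CREATE TABLE order_lines (
    order_id INTEGER NOT NULL REFERENCES orders (order_id),
    line_no  INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no)
);

-- Non-identifying: an account references a customer but keeps its
-- own identity; the foreign key stays out of the primary key.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY
);

CREATE TABLE accounts (
    account_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
);
```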

Appreciation for abstraction

This, fortunately, is something all engineers and technically inclined people can appreciate. It's the same idea as the wheel: inventing it was great, but what really made it worthwhile was making four of them and putting them under a box, so that now you have a moving cart. The cart is an abstraction. You can reuse it for any purpose, to carry people or things; anyone can use it; and when the wheels wear out, you just make another four and you're done.

Abstraction, to me, means covering more use cases with the same simple design, without having to implement specific rules for each. Some examples of abstraction in modeling are the party model, self-referencing relationships (the employee-manager hierarchy), ragged hierarchies, and so on.
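A minimal sketch of the self-referencing case, with column names of my own choosing: one table covers the whole reporting hierarchy, however deep it grows.

```sql
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    manager_id  INTEGER REFERENCES employees (employee_id)  -- NULL at the top
);

-- Walking the hierarchy with a standard recursive query.
WITH RECURSIVE org_chart AS (
    SELECT employee_id, name, manager_id, 1 AS depth
    FROM   employees
    WHERE  manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.name, e.manager_id, o.depth + 1
    FROM   employees e
    JOIN   org_chart o ON e.manager_id = o.employee_id
)
SELECT * FROM org_chart;
```

The same table absorbs reorganizations and new management layers without any schema change, which is the point of the abstraction.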

For the love of SQL… think in sets

Accessibility of the data model, i.e. the ease of SQL for its consumers, is what makes a model successful in the long run when all other things (meeting read-write expectations, meeting business rules, etc.) are equal. A data modeler has to be a SQL programmer first.

The quirky thing about SQL is that it requires us to think in sets. SQL manipulates data by following set-theoretic patterns: you describe the set of rows you want, not the procedure for visiting them one at a time. Object-oriented or procedural access to data in the general programming world is different from this. Understanding the difference is essential in the business of data.
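A small illustration, assuming the hypothetical accounts table from earlier plus an overdrawn flag of my own invention:

```sql
-- Procedural thinking (pseudocode): visit one row at a time.
--   for each row in accounts:
--       if row.balance < 0 then set row.overdrawn = true
--
-- Set thinking: describe the whole set in one declarative statement
-- and let the engine decide how to iterate.
UPDATE accounts
SET    overdrawn = TRUE
WHERE  balance < 0;
```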

Unlike specific modeling rules, these three are very simple themes based on first principles. Great products come out of simple design; the implementation may get hairy and tangled, but it should begin from simple themes.

With that in mind, these three questions are a good starting point for a modeling exercise:

1. What’s the granularity of data in each business entity?

Determines the primary key.

2. How much do we solve for DML anomalies?

Determines the normalization and the number of tables (see the sketch after this list).

3. Is the consumer read-intensive or write-intensive?

Influences #2.
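As a sketch of the DML anomaly behind question 2, reusing the hypothetical orders_wide table from earlier: duplication turns one logical change into many physical ones.

```sql
-- The customer's address is repeated on every order row, so one
-- change of address must touch them all; miss a row and the data
-- contradicts itself.
UPDATE orders_wide
SET    customer_address = '42 New Street'
WHERE  customer_id = 1001;

-- Normalizing the address into a customers table (question 2) makes
-- this a single-row update, at the cost of a join on reads (question 3).
```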

