Design patterns every data engineer should know

Raj Samuel
13 min read · Jan 22, 2022


Zero copy data transfer

Zero copy is an optimization technique that minimizes context switches and data copying between the kernel space and user space of the operating system, thereby reducing memory usage as well. Although this is a hidden mechanism relevant only to APIs, a visible example is Snowflake Data Warehouse’s data cloning.

Data cloning and data sharing in Snowflake don’t involve any copying of data under the hood. They are done by pointing the new, cloned object (eg. a table) to the same storage area as the original.

Snowflake stores data in a set of system-defined micro-partitions. When either the original object or the cloned object is updated, Snowflake’s default behavior (cloned or not) is to copy the micro-partition containing the affected rows into a new micro-partition and version it. The changed object points to the latest micro-partition and the other object continues to point to the older version.
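
For example, cloning a large table in Snowflake is a metadata-only operation and completes almost instantly. A minimal sketch in Snowflake SQL, with made-up table and column names:

-- the clone points to the same micro-partitions as the original
CREATE TABLE customers_dev CLONE customers;

-- a later update copies only the affected micro-partitions into new versions
UPDATE customers_dev SET cust_city = 'Chicago' WHERE cust_id = 1001;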

Message Queues for n-way data transfer

Message Queues are often seen as a solution to address asynchronous communication between systems, ie, A makes a request (passes data) to B and instead of waiting for B to respond (synchronously), A continues with other work. Here MQ acts as the request-response queue between A and B.

But regardless of synchrony, MQs can be a good solution when you have simultaneous multi-way data transfers. Consider four systems A, B, C, D passing data and making requests to each other. Each of these four entities can send/receive requests to/from the remaining three, creating a convoluted web of dependencies. When A, B, C, D are disparate systems in an enterprise with separate ownership, a mediator like an MQ makes the communication reliable and easier to design.

Messaging systems are indeed solving a data and application integration problem as is well described here.

REST vs GraphQL data transfer

REST is a web services pattern that apparently everyone follows but allegedly rarely gets right. What it means is still being actively debated, sort of like a supreme court justice “interpreting” the constitution in a court of law. The general agreement seems to be that RESTful web services have two properties:

  • they allow the application to change state (ie, they modify the app’s data) without being state-aware. For example, a RESTful user-authentication service will always return the entire user record, not just a few columns specific to the requested state change.
  • they have an endpoint to fetch specific data. For example, the endpoint for customer data cannot be used to fetch account data.

These two properties result in a few well-known problems, listed below, that led to the development of GraphQL:

  • over-fetching due to lack of schema-awareness (the whole record or nothing)
  • under-fetching because each endpoint serves a specific domain of data (requiring multiple round trips to the DB)
  • data binding via endpoints, leading to slower-than-expected UI iterations (the UI can be built only as fast as services can build endpoints)

GraphQL introduces a schema definition layer that helps web services make schema-aware calls to the database. The schema is defined in GraphQL’s own schema definition language (queries and responses are JSON-like), and it is obviously state-aware at this point. You can choose specific data elements (instead of the whole record) and you can serve multiple domains of data from a single endpoint (made possible by the GraphQL library internally invoking multiple resolvers for database calls). This also means that schema specifics are decoupled from the UI at the GraphQL layer, reducing data-binding problems.

NoSQL vs SQL

NoSQL is no longer touted as the SQL replacement. Instead, it has become the ideal solution for niche problem domains like search, caching, graphs, streaming etc. Horizontal scaling or native support for JSON shouldn’t be the primary drivers for choosing NoSQL anymore, as these are solved problems in many modern relational databases (see Spanner, SingleStore/MemSQL, and the numerous DB products that use the RocksDB engine).

The major caveat of using NoSQL is the lack of an expressive data model for understanding and analyzing data (through well-known interfaces like SQL). Hence the argument that it should only be used to solve a specific and narrow problem domain.

Command Query Responsibility Segregation (CQRS)

CQRS has often struck me as a rather strange pattern because not many business problems require this solution. Command (ie, write) and Query (ie, read) are segregated into different data models of the same business function, so that database reads and writes can be done independently. Martin Fowler has stated his concerns about the applicability of this pattern.

A practical use of this pattern can be observed in Notion’s backend migration. During the migration from a monolithic Postgres DB to a sharded Postgres DB, all read requests continued to hit the monolith while write requests were routed to both the monolith and the new sharded DB as double-writes. Essentially the Q of CQRS (ie, reads) was segregated from the sharded DB for the duration of the migration and data validation.

Data locality

If all the data that needs to be processed is co-located, the need for reaching out to data goes away, thus speeding up data processing. However, data locality often has to be contrived. Some examples:

  • SQL joins in databases require two tables to be matched by a common key column and brought together, a resource-intensive process for large tables. Data warehouses solve this problem by storing data together in wider tables that would otherwise be normalized into many tables (at the cost of redundancy and potential DML anomalies).
  • NoSQL document databases, lacking efficient join mechanisms, allow embedding one document (ie, a record) inside another so that entities that would otherwise have to be joined are read together.
  • Many databases let you co-locate frequently joined tables on disk by simply specifying the option you want on the CREATE TABLE command (see the Oracle sketch after this list).
  • Business Intelligence tools run faster when data is made available on the tool’s server in a native format (eg: twbx or hyper files for Tableau, pbix files for Power BI etc.) rather than fetched from a database.
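
As an illustration of the third point, Oracle can physically co-locate rows of frequently joined tables in a cluster. A minimal sketch with made-up table and column names:

-- define a cluster keyed on the join column, plus its required index
CREATE CLUSTER cust_cluster (cust_id NUMBER(10));
CREATE INDEX cust_cluster_idx ON CLUSTER cust_cluster;

-- rows of both tables that share a cust_id are stored in the same data blocks
CREATE TABLE customer (cust_id NUMBER(10), cust_name VARCHAR2(100))
CLUSTER cust_cluster (cust_id);

CREATE TABLE account (acct_id NUMBER(10), cust_id NUMBER(10), tx_date DATE)
CLUSTER cust_cluster (cust_id);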

Database-as-API vs ORM

DB-as-API is a pattern in relational databases where data is exposed through a well-defined interface that sits on the database. Applications call out to this interface to fetch data. The interface is usually a stored procedure with parameters and return types agreed upon with the consumer applications, or it can be something as simple as a view or materialized view.
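
A minimal sketch of such an interface in Oracle PL/SQL, with made-up table, column, and parameter names:

-- applications call this procedure instead of writing their own SQL
CREATE OR REPLACE PROCEDURE get_customer (
    p_cust_id IN  NUMBER,
    p_result  OUT SYS_REFCURSOR
) AS
BEGIN
    OPEN p_result FOR
        SELECT cust_id, cust_name, cust_city
        FROM customer
        WHERE cust_id = p_cust_id;
END get_customer;
/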

Using this pattern (as opposed to applications firing random SQLs) helps with separation of concerns:

  • The database team can isolate structural changes in the DB from rippling into applications. Applications are only aware of the interface: the stored procedure definition or the view definition.
  • It prevents malformed SQL from non-database developers and the consequent performance issues
  • It allows independent testing of data without having to resort to API validation tools
  • It prevents SQL injection attacks from dynamic SQL built by applications

When this is not possible, applications should use an Object Relational Mapper (ORM) that usually comes with programming language frameworks (eg. Hibernate ORM for Java). ORMs provide an object-oriented interface to the set-theoretic relational data model. Instead of forming SQL in string variables within application code, ORMs let you simply specify the database objects and filters, internally converting them to SQL.

Automated data pipelines

Most enterprise data teams have a manual, DBA-controlled code migration method, which is perfectly fine. But it’s very much possible to fully automate how data is integrated between various applications and how database code is moved up to production. The poor man’s way (ie, using open source software, but still very effective) is to use a combination of these tools:

  1. git as the code repository and as the collaborative development platform
  2. Liquibase or Flyway for deploying code to, and executing it on, the database (see the changelog sketch after this list)
  3. A scheduler like Airflow for (a) scheduling the database deployments mentioned above, (b) scheduling data integration jobs, and (c) moving code from the dev git branch all the way up to the prod branch; a, b, and c together act as a pipeline for an enterprise’s data integration
  4. Alternatively, git services like GitLab, Bitbucket etc. come with a pipeline mechanism that uses a yml-based config to coordinate 1 & 2, but they lack scheduling features
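
As a sketch of item 2, Liquibase can track plain SQL files as changelogs (its “formatted SQL” style). The changeset ID and DDL below are made up:

--liquibase formatted sql

--changeset data_team:add-cust-city
ALTER TABLE customer ADD (cust_city VARCHAR2(50));
--rollback ALTER TABLE customer DROP COLUMN cust_city;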

Beyond automation, this setup gives the opportunity to deploy code to production continuously (ie, speed-to-market) in a structured (ie, process rigor) and coordinated way (ie, without conflicting with other development projects) — the paradigm known as CI/CD or Continuous Integration/Deployment.

CTEs to resolve granularity in SQL

What a large and complex SQL statement does can be generalized into three things:

  1. business logic (eg. if a customer’s enrollment_date is null then pick min(transaction_date) for that customer)
  2. bring different business logic calculations to the grain of the overall SQL (eg. if the overall SQL is supposed to deliver one record per customer, any calculation at the customer’s account level needs to be grouped by so that account rows are rolled up to customer level)
  3. combine such rolled-up pieces of data elements horizontally (ie, column by column) using SQL joins (to use the above example, this step would combine all customer-related columns; the step above already made sure that the final result set would only have one row per customer)

The inefficient and lazy way to write this would be to implement 1 & 2 in the SELECT clause. This is the beginner’s way of writing SQL, but it won’t scale with large data. The FROM clause is where that logic should go. Oracle introduced in-line views in earlier versions for this purpose: you can create a subquery in the FROM clause (ie, an in-line view) and join that subquery to the rest of the tables in the FROM clause.

It is still possible to do that in Oracle and other database products, but the cleaner and more readable way is to write Common Table Expressions, or CTEs. In Oracle, it’s done using a WITH clause written above the main SELECT. So steps 1 & 2 are implemented inside a CTE, and then step 3 by joining the CTE to the tables in the main FROM clause. Each piece of business logic with its own granularity calculation should be written in a separate CTE, and each of them joined in the main FROM clause.

An overly simplified example with two CTEs, namely cust_enroll and cust_address, is given below using Oracle SQL syntax. As the business logic inside the CTEs gets complex, the SQL will stay readable and performant.

WITH
cust_enroll as
(select cust_id, min(tx_date) as tx_date from account group by cust_id),

cust_address as
(select cust_id, max(cust_city) as cust_city from address group by cust_id)

SELECT
c.cust_name, ce.tx_date, ca.cust_city
FROM
Customer c, cust_enroll ce, cust_address ca
WHERE
c.cust_id = ce.cust_id (+)
AND c.cust_id = ca.cust_id (+)

Fact table subsetting

Data warehouse literature often talks about pre-computed aggregate tables, but this concept can be extended to subsetting transactional data. If a common access pattern for a large fact table is to fetch a subset of fact records based on a well-defined business criterion (eg. fetch only transactions that are approved), you could create a second fact table that contains only that subset.

A problem then might be that the number of records in the subset varies from time to time based on the subsetting condition. For instance, a transaction from last week that satisfied the subsetting condition (eg. approved transactions) may not satisfy that condition as of today (eg. approvals are rescinded). This brings us back to applying an SQL filter on the larger fact table to fetch all appropriate records.

But this problem can be solved by recreating the subset as a separate table in every batch. In this case the SQL filter (for subsetting approved transactions) only needs to run once per batch for the purpose of recreating the subset. User queries can directly hit the subset without any filters. To make access easier, the same parent-child relationship that exists between the larger fact table and the dimensions should also exist between the smaller fact and the dimensions.
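
A sketch of the batch rebuild, with assumed table and column names:

-- rebuild the subset once per batch; user queries hit fact_tx_approved directly
TRUNCATE TABLE fact_tx_approved;

INSERT INTO fact_tx_approved
SELECT * FROM fact_tx
WHERE approval_status = 'APPROVED';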

A note on updates. In the above example of the transaction fact, if rescinding an approval is handled through an update of a flag column, there are bigger problems lurking. Data warehouse updates are a monster to deal with. See the section below about fast database writes.

Also note that data redundancy (eg. multiple fact tables) is an expected property of a data warehouse and not a bug.

Fast database writes

Inserting into a database table might involve seeking an empty space within the storage allocated for that table, or it could be a fast append to the end of the table. But in the case of updates and deletes, it is always a seek plus a write, known as an in-place write. In-place writes are a performance problem in VLDBs that the world’s best engineers haven’t solved yet.

The quick-and-dirty way to handle updates (or deletes) is the same way it’s done everywhere from Google to Salesforce: insert a new record with a pointer to the old, now-obsolete record. The pointer is usually a column in the new record that carries the natural key of the old record; the marker for a dead record is generally referred to as a tombstone in database literature. Tombstones are how deletes are handled in LSM trees, a data structure used by the fastest database engines in the world, from Spanner to Cassandra to MyRocks.
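
A relational sketch of the same idea, with assumed table and column names: instead of updating a row in place, append a new version and let readers pick the latest.

-- the "update" is an append of a new version that supersedes version 1
INSERT INTO fact_tx (tx_id, tx_version, tx_amount)
VALUES (1001, 2, 250.00);

-- readers resolve the current version at query time
SELECT *
FROM fact_tx f
WHERE tx_version = (SELECT MAX(tx_version) FROM fact_tx WHERE tx_id = f.tx_id);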

Secondly, in systems where deletes are rare (like warehouses), there is little chance of creating holes within a table’s storage. To take advantage of that and to make writes faster, we can blindly append past the table’s high-water mark instead of doing a normal insert. In Oracle this means adding a hint to the INSERT command (among a few other things).
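
In Oracle terms, a direct-path (append) insert looks like the sketch below; the table names are made up. The data is written above the high-water mark, and a COMMIT is required before the table can be queried again in the same session.

INSERT /*+ APPEND */ INTO fact_tx
SELECT * FROM stg_tx;

COMMIT;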

Lastly, in systems that don’t have hot transactions (again, like warehouses where loads are batch-based), generating redo logs is an unnecessary burden on writes. Database recovery should purely be based on backups and mirrors. Disabling log generation will make writes run faster.
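
Again in Oracle terms, and assuming recovery is handled by backups, a table can be marked NOLOGGING so that direct-path loads generate minimal redo; take a fresh backup afterwards, since those changes are not recoverable from the redo log.

ALTER TABLE fact_tx NOLOGGING;

INSERT /*+ APPEND */ INTO fact_tx
SELECT * FROM stg_tx;

COMMIT;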

Parallel loads in star schema models

Data warehouse loads need to follow the pattern of loading dimensions (parent tables) first and then fact tables (child tables) to avoid foreign key violations. Besides, the fact table doesn’t know what the new foreign keys are before the dimensions create those new primary keys.

But dimensions and facts can be loaded in parallel if you use a hash of the natural key as the primary key of dimensions (eg. a hash of customer_id). Obviously the foreign key constraints need to be disabled before the load. The load process (ETL) calculates the hash of each record’s natural key as it is loaded into dimensions and facts. After the load finishes, the constraints (which were created on the hash key column) can be enabled back.
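
A sketch in Oracle syntax, assuming made-up staging tables stg_customer and stg_tx: both loads derive the same key independently from the natural key, so neither has to wait for the other.

-- dimension load
INSERT INTO dim_customer (cust_key, cust_id, cust_name)
SELECT STANDARD_HASH(cust_id, 'MD5'), cust_id, cust_name
FROM stg_customer;

-- fact load, runnable in parallel with the dimension load
INSERT INTO fact_tx (cust_key, tx_id, tx_amount)
SELECT STANDARD_HASH(cust_id, 'MD5'), tx_id, tx_amount
FROM stg_tx;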

If a rare constraint violation occurs when they are enabled, having a load_datetime column helps with clean-up.

A caveat is that Type 2 SCD can’t be implemented with this design because the hash key (primary key) will repeat for the same natural key as history rows are added. This wouldn’t have been an issue if the primary key were an auto-incremented surrogate key instead. Not to worry: this can be resolved by implementing Type 4 SCD, which means history goes to a separate table.

In products where foreign key constraints aren’t enforced (eg. Redshift), these ideas are still valid, except that there won’t be a constraint violation to warn you of data integrity issues.

Data modeling for faster go-to-market

Given that data models are essentially rigid database schemas that incorporate business logic through the relationships and cardinality of tables, how loose or tight the model is can greatly impact how fast you can respond to business changes. A highly normalized database with too many tables is less flexible to change but has much higher data integrity. A loosely normalized database is flexible, but depending on how far toward that extreme you go, it might diminish your ability to run analytics or to make sense of the data.

Having seen both extremes, I think a middle ground is warranted but only if you can show a conscious bias towards data integrity (and hence normalization).

A pattern to accommodate schema-level changes without normalizing (for at least some of the data domains) is to create a classification value structure. Classification values are used in all scientific and engineering fields from anthropology to data security in order to organize nomenclature (taxonomy). It’s up to the data architect to decide what data classifications go into the classification value table. Another name for this type of structure is an Entity-Attribute-Value (EAV) table.

As an example, you could begin modeling, let’s say, an insurance claim process by normalizing the transactional and master data entities of claims. To avoid a rigid, write-optimized 3NF model, a few classifications (eg. claim_type) could be stored in a classification table. At the minimum this table will have columns for:

  • classification type or attribute type (with values containing column names of data classifications — eg. “claim_type”)
  • classification value or attribute value (with values containing the data value of classifications — eg. for claim_type the values might be “automobile”, “home”, “life” stored in 3 rows)

And optionally columns for (but not limited to):

  • classification name or entity name
  • classification sub-type
  • classification language in multi-lingual countries
  • classification value order (with integer values 1, 2, 3 etc. to indicate some pre-determined order of values like “automobile”, “house”, “life”, either for display purpose in a UI or for internal processing order) etc.

This pattern alleviates the need for extensive normalization and also allows faster database changes for data classifications that don’t need to be modeled in the normal way.
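
A minimal sketch of such a classification table in Oracle syntax, with made-up names, using the claim_type example:

CREATE TABLE classification_value (
    classification_type  VARCHAR2(50)  NOT NULL,  -- eg. 'claim_type'
    classification_value VARCHAR2(100) NOT NULL,  -- eg. 'automobile'
    value_order          NUMBER(3)                -- optional display/processing order
);

INSERT INTO classification_value VALUES ('claim_type', 'automobile', 1);
INSERT INTO classification_value VALUES ('claim_type', 'home', 2);
INSERT INTO classification_value VALUES ('claim_type', 'life', 3);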

Slotted Counter pattern

Databases in operational systems are state machines, in the sense that they capture the state of the operations (applications) at each point in time. Recall that part of the RESTful pattern was about how services handle this statefulness. When an application tries to keep track of an event’s occurrences (state), it is common to increment an integer counter column in a database table. This can result in contention if multiple services are accessing the same record to increment the counter.

The slotted counter pattern splits that one record into multiple records to avoid contention. If the table’s primary key is event_id, the pattern is to alter the granularity to event_id + slot_id, so that for each unique event_id there is more than one slot_id, thereby reducing the contention on each event_id. An illustration of this with SQL examples is provided in this great blog.
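
A MySQL-flavored sketch with made-up table and column names, assuming 16 slots per event:

CREATE TABLE event_counter (
    event_id BIGINT NOT NULL,
    slot_id  INT    NOT NULL,
    cnt      BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (event_id, slot_id)
);

-- each writer increments a random slot, spreading row-level lock contention
INSERT INTO event_counter (event_id, slot_id, cnt)
VALUES (42, FLOOR(RAND() * 16), 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

-- readers sum across the slots to get the true count
SELECT event_id, SUM(cnt)
FROM event_counter
WHERE event_id = 42
GROUP BY event_id;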

Data Lake vs Data Mesh

The data lake made some sense as the internet became the predominant business medium. The amount of data businesses accumulate grew beyond the extent to which all of it could be modeled into a data warehouse. Data lakes accumulate enterprise-wide data that is not necessarily modeled or structured, and catalogue it in a lightweight metadata model using Hive or a Presto-type engine. Whether any enterprises get value out of this or not is an important question, and not one I can answer.

Data Mesh is the “new thing” that gets blustered around by writers like me — the same old wine that data federation and data virtualization folks have been trying to make money off of (and never really did), but in a whole new bottle called data mesh. The key idea is to leave data where it is (instead of integrating into a data warehouse or data lake) and run federated analytics.

I am sure both of these have their places where they are useful, but my cautious take is to not drink the kool-aid.


Raj Samuel

I write because I forget. (PS: if you take what I wrote and post it as your own please try not to edit it and post rubbish. CTRL+C, CTRL+V is your friend.)