NoSQL Modeling Considerations

Raj Samuel
5 min readFeb 15, 2018

--

The term “NoSQL”, originally “NOSQL”, originated over a meetup organized by Johan Oskarsson in Bay Area in 2009. Here are some arguments for preferring NoSQL over Relational databases.

  1. Volume, Velocity, Variety: NoSQL databases originated to handle huge data volume (think web-scale, google-scale, facebook-scale, amazon-scale), they can handle data in high velocity (think speed of processing, think data streams), and can handle a variety of data (think unstructured data or raw data like images). And it can handle all the three together.
  2. CAP theorem: NoSQL is one step closer to breaking CAP theorem, though it doesn’t quite break it. Consistency is pegged at eventually consistent models while providing Availability. Google Cloud Spanner is worth a look — it’s strongly consistent, globally available and natively supports SQLs and JOINs.
  3. Horizontal scaling: NoSQL lets you connect cheap commodity computers (and not necessarily powerful servers) to your existing network of data processing computers to add processing power — this is called horizontal scaling. These computers can be added, removed or let fail. The underlying programming model for distributed data processing — map-reduce — is built for this kind of scaling. This is traditionally achieved with vertical scaling, although in practice we resort to SMP and MPP servers.
  4. Schema-less: Enormous amount of data might benefit from a less rigid structure. A rigid schema forces you to normalize, resulting in JOINs to fetch data. If a few records can be allowed to have an extra column for instance, application development and database performance could benefit. This is called schema-less behavior.
  5. JOINs vs Locality: This is what much of this story is going to talk about. NoSQL works around having to join different datasets, by putting them together locally in a record. We will get to this shortly.
  6. Multi-model: More than inherent faculties of NoSQL, multi-model nature is the result of product maturity in this space. Today’s NoSQL products are multi-purpose. The same product can act as Key-Value stores, Document databases, Graph or RDF models, Messaging or Streaming systems etc.

A number of these advantages come at a cost: blowing some of the ACID properties. Let’s now see how locality plays a major role in modeling document databases. In the context of document databases, a document is a record or a row.

Locality is achieved through referencing or embedding. Referencing is what we do in relational models by enforcing referential integrity through a foreign key. Except, NoSQL products don’t enforce a constraint for referential integrity — it’s up to the developer to enforce it. Embedding just embeds the data that you would otherwise reference, essentially duplicating the data.

As a rule of thumb (and not as a hard and fast rule), follow these guidelines for modeling relationship cardinality.

1:1 → Embed, because data isn’t redundant. Just one document embedded in another. Here’s an example record in JSON format from a document database.

{id:123, class:”Husband”, name:”John Doe”, wife:{id:789, class:”Wife”, name:”Mary Poppins”}}

Husband:Wife is 1:1 so we chose to embed Wife inside Husband record.

1:few → Embed, if the embedded document doesn’t grow drastically or doesn’t mutate heavily. If it does, there is some tedious maintenance on the embedded document, so reference instead. Note I mentioned embedded document and not child document. This is because a child or a parent can be embedded — it’s flexible that way. More on that shortly.

1:Many → Reference, especially when the referenced data grows or mutates.

Many:Many → Partial embed, but this is a “it depends” kinda situation. Some prefer to implement this as a reference. What I prefer is a hybrid approach that I call partial embed. But let’s first clear a few things out of the way.

Since JOINs aren’t inherently supported in most NoSQL databases, do not implement M:M relationships using a third bridge table with references to the tables participating in M:M. Let’s see an example.

Cardinality of Author:Book is M:M. Here are some sample documents in JSON.

{id:123, author:”J.K. Rowling”, nationality:”British”}

{id:124, author:”Stephen King”, nationality:”American”}

{id:890, title:”Harry Potter”, edition: 1}

{id:891, title:”Book by JKR & King”, edition: 1}

If you were to design a bridge (junction) table, its documents would look like this.

{id:123, id:890}, {id:123, id:891}, {id: 124, id: 891}

Any fetch operation will need two queries two resolve this relationship. In relational parlance, joining tables will let you do this with one query. To solve this, embed the frequently accessed attributes of one document (book) in the other (author) as seen below.

{id:123, author:”J.K. Rowling”, nationality:”British”, book:

[{id:890, title:”Harry Potter”},

{id:891, title:”Book by JKR & King”}]}

Note that we have selectively embedded some attributes of books, ideally the most frequently used attribute, in this case title. We also have id’s 890 & 891, so we have a reference if we need to refer other attributes of book. Partially embed a frequently accessed attribute to get the benefit of locality. A suitable candidate for partial embedding is a frequently accessed natural key which cannot be used for data partitioning or for primary keys.

Locality considerations for Embedding and Referencing

Where do we implement locality? Is child embedded in parent the best way or parent in child? Thinking from an ER Modeling perspective, child embedded in parent seems logical. But let’s explore this.

Case 1. Referencing

Child in parent:

Add references to child inside parent document when child document doesn’t mutate. This is usually implemented as an array of references to the child record. If child records under a parent tend to grow in number or reduce in number, there is a maintenance hassle. As an example consider Publisher:Book. Publisher can publish many books so Publisher is the parent.

{id: 123, publisher: “O’Reilly”, book:[98,87,76]}

{id: 98, book: “Oracle Manual”}

{id:87, book: “AWS Reference”}

Parent in child:

Add references to parent inside child document when number of child documents (books in the above example) are expected to grow or be deleted often. Arrays aren’t needed here because a child has only one parent (except for M:M).

{id: 123, publisher: “O’Reilly”}

{id: 98, book: “Oracle Manual”, publisher_id: 123}

{id:87, book: “AWS Reference”, publisher_id: 123}

Case 2. Embedding

Child in parent:

Embed child inside parent document when number of child document won’t grow in an unbound fashion. For example, in a Blog:Comment data model there is no limit to comments (child) posted to a blog post (parent). It’s a good practice to not embed comments inside blog records in this case. Also if we expect a lot of updates on child records it might be better not to embed them in parent.

Parent in child:

Embed parent inside child document when the child has high cardinality data. With low cardinality child documents we will make parents redundant through embedding.

Consider all the above observations in the backdrop of data access patterns. For instance, if the application heavily access child documents and occasionally parents, we have a good reason (traversal cost for one) to keep child documents independently without embedding in parent. There are product specific features that should be considered as well. Some products allow real references, not just an id pointer. Some allow a variety of index types. Most of them support sharding for distributing data volume. It goes without saying that product features influence design decisions.

--

--

Raj Samuel
Raj Samuel

Written by Raj Samuel

I write because I forget. (PS: if you take what I wrote and post it as your own please try not to edit it and post rubbish. CTRL+C, CTRL+V is your friend.)

No responses yet