Database multi-tenancy, continued...
I consider Salesforce the temple of multi-tenancy. The now-famous talk by Craig Weissman (former CTO of Salesforce) on Salesforce’s implementation of multi-tenancy is a must-watch. Salesforce’s data layer is a frugal multi-tenant architecture built on a plain old Oracle database.
No NoSQL. No map-reduce. No RDD.
An award-winning multi-tenant PaaS built on an Oracle database, years before Oracle thought of getting into the cloud business. That — is — AWESOME!
Although my personal belief is that they will eventually move away from this Oracle-centric setup (if they haven’t already), it’s worth learning how database multi-tenancy is achieved from the bare bones up. Here are the tenets of database multi-tenancy.
Horizontal scaling of database instances
Horizontal scaling of the database instance is not a new concept. Oracle RAC (Real Application Clusters) is the classic example. You spin up as many database instances as you want on SMP or MPP servers, and all of these instances serve the same single database (we’ll get to the instance-versus-database distinction in a moment). The idea is high availability (HA) more than multi-tenancy, but HA is critical for multi-tenancy. If an instance fails, none of your tenants, not even the ones being served by the failed instance, will notice; their sessions fail over to the other running instances.
To clear things up: in the Oracle world, an instance is simply the name for the in-memory structures (the SGA, PGA, library cache, query plans and so on that sit in RAM) and the processes that form the database engine. A database is the set of physical files (data files, redo log files, temp files) that hold user tables, catalog tables, disk-based sort segments and the rest.
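To make the failover idea concrete, here is a minimal sketch using the python-oracledb driver: a connect descriptor that load-balances new connections across two RAC instances and fails over if one dies. The host names, service name and credentials are made-up placeholders, not anything Salesforce-specific.

```python
# Minimal sketch: connecting to a two-node RAC through a single service name.
# Host names, service name and credentials are hypothetical placeholders.
import oracledb

# LOAD_BALANCE spreads new connections across instances;
# FAILOVER lets the driver retry the surviving node if one instance dies.
dsn = (
    "(DESCRIPTION="
    "(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)"
    "(ADDRESS=(PROTOCOL=TCP)(HOST=racnode1.example.com)(PORT=1521))"
    "(ADDRESS=(PROTOCOL=TCP)(HOST=racnode2.example.com)(PORT=1521)))"
    "(CONNECT_DATA=(SERVICE_NAME=tenantsvc)))"
)

with oracledb.connect(user="app_user", password="app_pw", dsn=dsn) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT sysdate FROM dual")
        print(cur.fetchone())
```

The point is that the application only ever knows the service name; which instance actually serves the session is the cluster's problem.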
Data partitioning
Partitioning of tables is often misinterpreted as a database performance tuning technique. Just partition a very large table and you have the gift of the performance unicorn! Nope. Partitioning is a way to optimize data access patterns, if designed wisely.
Say you have a very large table with hundreds of millions, or billions, of records. If the dataset is high-cardinality and you still need to partition it, you’re left with a rather limited option: hash the column (or columns) with the highest cardinality into a handful of partitions, so that rows spread uniformly with minimal skew. The trouble is, your access patterns may not guarantee that a query will find its records within one and only one partition. Accessing multiple partitions for a single fetch operation results in what we call broadcasting, and that is not good news.
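A toy sketch of that problem, in plain Python rather than any particular database engine: rows are routed by hashing a high-cardinality key, so a query that filters on anything other than that exact key has to visit every partition.

```python
# Toy model of hash partitioning: rows spread evenly by hashing a
# high-cardinality key, but any query that doesn't filter on that exact key
# has to scan every partition (a "broadcast").
NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def insert(row):
    # Route each row by hashing its high-cardinality key (order_id here).
    partitions[hash(row["order_id"]) % NUM_PARTITIONS].append(row)

def fetch_by_order_id(order_id):
    # Point lookup: the hash tells us the one partition to read.
    p = partitions[hash(order_id) % NUM_PARTITIONS]
    return [r for r in p if r["order_id"] == order_id]

def fetch_by_customer(customer):
    # Filter on a different column: every partition must be scanned.
    return [r for p in partitions for r in p if r["customer"] == customer]

for i in range(1000):
    insert({"order_id": i, "customer": f"cust{i % 10}"})

print(fetch_by_order_id(42))            # touches exactly one partition
print(len(fetch_by_customer("cust3")))  # touches all four partitions
```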
However, if your query access patterns are:
- limited within specific ranges of data (queries based on date ranges, or queries for a specific customer out of a few customers, for example), and
- not disproportionately skewed across those ranges (if one range has 1 billion records and another has 100K, that’s a bad skew),
then you have a better shot at partitioning on that pattern (a date-range partition, or a partition per customer). Now your access patterns are optimized: there is little or no broadcasting across partitions.
If your users’ data access patterns enable you to design table partitions such that a group of users access only the partition relevant to them (and another group of users access only their partition, and so on), you have the proverbial “happy path” to multi-tenancy through partitioning.
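Continuing the toy model above, partitioning per tenant (a hypothetical tenant_id list partition) means each tenant’s queries prune down to a single partition.

```python
# Toy model of list partitioning per tenant: each tenant's rows live in
# their own partition, so a tenant-scoped query reads exactly one partition.
from collections import defaultdict

tenant_partitions = defaultdict(list)

def insert(row):
    # Route by tenant rather than by hash: one partition per tenant.
    tenant_partitions[row["tenant_id"]].append(row)

def fetch_for_tenant(tenant_id, since):
    # Partition pruning: only this tenant's partition is touched.
    # ISO date strings compare correctly as plain strings.
    return [r for r in tenant_partitions[tenant_id] if r["created"] >= since]

insert({"tenant_id": "acme", "created": "2020-01-05", "amount": 100})
insert({"tenant_id": "acme", "created": "2020-02-01", "amount": 250})
insert({"tenant_id": "globex", "created": "2020-01-20", "amount": 75})

print(fetch_for_tenant("acme", since="2020-01-01"))  # never sees globex data
```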
The DML nightmare
DML is dirty. UPDATEs have to keep the old image of the value. DELETEs have to keep the entire image of the deleted record. To make this a little less unwieldy, Salesforce’s custom internal implementation on top of Oracle does not DELETE records immediately; it just flags them for deletion. A housekeeping script then physically deletes the flagged records periodically as a batch job.
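Here is a hedged sketch of that flag-now, purge-later pattern. The table and column names are made up, and SQLite is used only to keep the example self-contained and runnable.

```python
# Sketch of the "flag now, purge later" pattern: DELETEs become cheap UPDATEs,
# and a periodic housekeeping batch removes the flagged rows for real.
# Table/column names are illustrative; SQLite keeps the example runnable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO orders (id, tenant) VALUES (?, ?)",
                 [(1, "acme"), (2, "acme"), (3, "globex")])

def soft_delete(order_id):
    # User-facing "delete": just flip a flag, no expensive row removal.
    conn.execute("UPDATE orders SET is_deleted = 1 WHERE id = ?", (order_id,))

def housekeeping_purge():
    # Background batch: physically remove everything flagged for deletion.
    conn.execute("DELETE FROM orders WHERE is_deleted = 1")
    conn.commit()

soft_delete(2)
# Application queries always filter out flagged rows.
print(conn.execute("SELECT id FROM orders WHERE is_deleted = 0").fetchall())
housekeeping_purge()
```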
Redshift, a cloud-based data warehouse, has a similar behavior built in: a DELETE only marks rows for deletion, and a later VACUUM reclaims the space. The idea is to relieve the strain on multi-tenancy by cheating the otherwise stringent database isolation controls: do the dirty DML work in the background instead of throwing it in the user’s face.
Index — the boon and the bane.
Indexing is a predominant tuning technique which, once again, should be based on query access patterns. There are three challenges with indexing on multi-tenant systems (and elsewhere):
- When you have a fat table (hundreds of attributes) with large indexing requirements, indexing is not a scalable option.
- When you have a long table (millions or billions of records), B-tree indexes only perform well when the requested data volume is less than roughly 10% of the total and the index is built on a high-cardinality column (a rough illustration follows this list).
- Indexes themselves are not flexible creatures built for multi-tenancy. They are rigid, schema-bound structures; they can lock up data and freeze the user.
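For a rough feel of that rule of thumb, here is a back-of-the-envelope check in Python. The 10% threshold is just the heuristic quoted above, not a hard law, and real query planners weigh many more factors.

```python
# Rough heuristic from the rule of thumb above: an index lookup only pays off
# when the query is selective enough; otherwise a full scan is usually cheaper.
def index_likely_helps(total_rows, matching_rows, threshold=0.10):
    selectivity = matching_rows / total_rows
    return selectivity < threshold

# One tenant's rows in a billion-row table: very selective, an index helps.
print(index_likely_helps(total_rows=1_000_000_000, matching_rows=200_000))      # True
# Pulling half the table: the index just adds random I/O on top of a scan.
print(index_likely_helps(total_rows=1_000_000_000, matching_rows=500_000_000))  # False
```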
Some of the new entrants in the database space are worth a look. HarperDB, a relatively new entrant, inherently indexes every attribute in a table. In fact, the database does not have the notion of heap data for tables plus additional B-tree structures for indexes. Every attribute lives in an index, no matter how many attributes you have, and the database engine takes care of fetching the pieces, assembling them record by record and delivering the result to you.
AWS Redshift, despite being a relational database, is a notable exception when it comes to indexes. Instead of indexes, Redshift uses something called zone maps. It works like this: you designate an attribute in your table as the sort key. Redshift sorts the data on that key before saving it in data blocks on disk (a block is 1MB, considerably larger than in most relational databases on the market) and internally creates a zone map that tracks the lowest and highest sort-key values in each block. The query processor just looks at the zone map to see whether a 1MB block can be skipped entirely (because the zone map holds the minimum and maximum value for that block).
In other words, a b-tree index specializes in seeking; a zone map specializes in skipping.
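Here is a toy sketch of the skipping idea. The block size is shrunk to a handful of rows for readability (real Redshift blocks are 1MB of columnar data), and the data is assumed to be already sorted on the sort key.

```python
# Toy model of a zone map: data is sorted on the sort key and written in blocks;
# the zone map records each block's min/max sort-key value so whole blocks can
# be skipped without being read. Block size here is rows, not the real 1MB.
ROWS_PER_BLOCK = 4

rows = sorted(range(1, 21))  # pretend sort-key values 1..20, already sorted
blocks = [rows[i:i + ROWS_PER_BLOCK] for i in range(0, len(rows), ROWS_PER_BLOCK)]
zone_map = [(b[0], b[-1]) for b in blocks]  # (min, max) per block

def query(lo, hi):
    hits, scanned = [], 0
    for (mn, mx), block in zip(zone_map, blocks):
        if mx < lo or mn > hi:
            continue            # skip: zone map proves no rows can match
        scanned += 1            # only overlapping blocks are actually read
        hits += [v for v in block if lo <= v <= hi]
    return hits, scanned

hits, scanned = query(6, 9)
print(hits, f"scanned {scanned} of {len(blocks)} blocks")
```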
Read replicas and multi-master topology
We looked at scaling database instances (RAC). We also looked at scaling a single table (partitioning). What if we could scale out an entire database — the physical files encompassing the entire set of tables in it? Read replicas let us do that. Each replica is a siloed copy of the entire database and may have its own URL endpoint. Depending on which database platform you use, you may have to write your own code to propagate changes across replicas. Managed database services in the cloud (DynamoDB, for example) do this funky work for you.
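From the application’s side, the read-replica idea can be as simple as routing statements by type. The endpoint names below are hypothetical, and how changes propagate from the primary to the replicas still depends on your platform, as noted above.

```python
# Sketch: application-side routing for read replicas. Endpoint names are
# made up; change propagation from primary to replicas is handled elsewhere.
import random

PRIMARY = "db-primary.example.com"
READ_REPLICAS = ["db-replica-1.example.com", "db-replica-2.example.com"]

def endpoint_for(statement):
    # Writes must hit the primary; reads can be spread across replicas.
    is_read = statement.lstrip().upper().startswith("SELECT")
    return random.choice(READ_REPLICAS) if is_read else PRIMARY

print(endpoint_for("SELECT * FROM orders WHERE tenant = 'acme'"))
print(endpoint_for("INSERT INTO orders VALUES (1, 'acme')"))
```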
Take this one step further and you get read-write replicas, aka multi-master databases. It just means that users can not only read from, but also write to, any of the replicas; they are no longer just “read” replicas. The challenge is again propagating changes, except this time it’s more complex. Any replica can change, and any change must immediately go to all the other replicas. Chaos!
AWS DynamoDB has a mechanism to tackle this chaos. DynamoDB Streams is a continuously running change log that captures every write to a table and keeps each record for 24 hours. Streams is a free service (as of now). Combine it with AWS Lambda (a serverless compute service), or another database scripting tool, and for every change in your replica you can trigger an equivalent change in another replica.
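As a hedged sketch of that pattern: a Lambda function subscribed to the source table’s stream replays each change into a second table. The target table name and region are assumptions, the stream is assumed to include new images, and error handling is omitted.

```python
# Sketch of change propagation with DynamoDB Streams + Lambda.
# Target table name/region are hypothetical; assumes the stream view type
# includes NEW_IMAGE; retries and error handling are omitted for brevity.
import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
target = boto3.resource("dynamodb", region_name="us-west-2").Table("orders_replica")

def handler(event, context):
    # Each stream record describes one change captured on the source table.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage arrives in DynamoDB's typed JSON; convert to plain Python.
            image = record["dynamodb"]["NewImage"]
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            target.put_item(Item=item)
        elif record["eventName"] == "REMOVE":
            keys = record["dynamodb"]["Keys"]
            key = {k: deserializer.deserialize(v) for k, v in keys.items()}
            target.delete_item(Key=key)
```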
In general, database multi-tenancy is not a packaged COTS solution. There is no “standard industry architecture” to follow. It’s a collection of techniques that need to be applied wisely.