Data Warehouse performance lessons

Specifications and environment for the warehouse mentioned here:

  1. The DW runs on Oracle hosted on AWS RDS with 8 CPUs, 32GB RAM, 3000 GB SSD with 4000 burstable IOPS on a single Availability Zone.
  2. ETL is written in PL/SQL stored procedures.
  3. 80% of the data is read from staging tables replicated from transactional databases via AWS DMS. A few are files based loads read through Oracle external tables that point to AWS S3 objects.
  4. Daily batch loads only, no real-time loads or intermediate updates between batches. No holidays, weekends, runs 365 days.
  5. The biggest FACT table holds upwards of 200 million records with close to 0.1% daily growth.
  6. Consumers range from Tableau reporting to web services. The major consumer needs only 2 years worth of data. With other business requirements applied, this amounts to only less than a tenth of the total data volume of DW.

Onto implementation and performance

  1. DW is dimensionally modeled with no snowflakes. Data is thus de-normalized for faster reads but still modeled around the idea of having fact tables for each business processes and linking those processes through dimensions.
  2. Not having to wait for DBA’s kindness was an early wish given our organizational structure. We addressed this using a two-pronged approach: (a) CI/CD and (b) managed cloud infrastructure (AWS RDS), taking DBA out of both DB changes and infrastructure adminstration, but both required some strong-arming from us. CI/CD was an unfamiliar turf for our database developers, and moving data to public cloud was an unfamiliar turf for our legal team. Both ideas were met with some opposition but dealt with successfully.
  3. Being a B2B data warehouse gave us the opportunity to use database server’s capacity for ETL without having to use any ETL tools other than stored procedures. If it were a B2C database we might have had to run ETLs separately due to HA requirements.
  4. Most reads on DW are bulk reads. Little to no ad hoc reports or low volume querying meant that we didn’t gain much from database indexes. We tried them, both bitmap and btree indexes, and ended up removing them.
  5. The biggest architectural boon to DW was that Tableau, it’s main consumer, runs off of local Tableau extracts for majority of reports (twbx/tde files) that are extracted from DW once a day after DW gets loaded. Most reports we sell (we sell analytics) are syndicated reports that favor this type of architecture that doesn’t need direct database connection. Tableau performs better through extracts as well.
  6. DML updates are a nightmare on a VLDB. On the source system a subset of the transactional data goes through constant deletes and inserts and that subset is kept in a separate table. Updating the state of this table on the larger fact table on DW turned out to be unwieldy. We optimized the updates in the best possible ways but in the end it came down to having to recreate the fact table each time we update instead of running an update statement. To solve the performance of update, we created a snapshot fact table with the subset of data that gets changed at source. It would be a snapshot of the changed data at source, no updates or DML, and gets re-created everyday. The original fact remains intact with just new record inserts.
  7. Tableau requires only the last 2 years worth of data but the fact table has millions of records from the beginning of business about 15 years ago. We partitioned the fact table (as well as the snapshot table and another dimension having a 1:1 relationship with fact) by range partitioning by date to form monthly partitions.
  8. To further ease things for reporting, we created a 2 year snapshot table that gets recreated every day. This table is list partitioned by an identifier that Tableau uses as filter to create separate extracts for each customer (akin to customer_id).
  9. Daily loads to DW employs CDC in a staged fashion - pick the new records, stage it, process it on a temporary table, throw the records to the end of high water mark of target fact (using /* + insert-append */ hint on Oracle). There are no deletes on fact tables creating holes, so this approach doesn’t affect query performance.
  10. DW generates hash keys as DW keys. The natural key is passed to an Oracle hash function which outputs a unique hash value for the surrogate key that links dimensions and facts through foreign key relationships. This hashing approach helps with loading dimensions and facts in parallel. This is critical because we have a “junk” dimension that’s as big as the fact table with a 1:1 relationship. Forcing a sequential order for parent-child relationship would cost double as much time.
  11. A fallout of hashed keys is that these primary keys cannot track slowly changing dimensions (SCD) through the popular Type 3 updates. If you add a second record to indicate change of an attribute value, the hash key based on that natural key will be duplicated on the second record and hence violate primary key constraint on hash key. An auto incrementing surrogate key wouldn’t have caused this but doesn’t serve our intent to have parallel loads. So we implemented Type 4 updates where history is tracked in a separate table. The windfall is that queries are now easier to write because dimension only has the latest data and we don’t have a lot of demand for historical dimensions.
  12. The largest tables have parallel DML enabled. Oracle will open multiple sessions to load data in parallel.
  13. A fallback of hashing and non-sequential loading of parent-child tables is that foreign keys have to disabled so load doesn’t error out in the middle due to a foreign key violation. They get enabled at the end. If there’s a violation at the end, you have the full picture of that day’s load before taking action. There are date fields to indicate what records got loaded that day if a clean up ensues.
  14. All tables are created with no logging option and the database runs on archive log mode. Combined with /*append*/ hint, none of the DMLs produce any logging info. This reduces IO activity considerably and speeds up ETL. If hell breaks loose, database recovery is fortunately easier on RDS via snapshots. And our ETL is designed to pick up data starting from day 1 or from the day it was last run, without code changes. In any case, a data warehouse in general shouldn’t need point-in-time recovery — it’s a querying system, not a hot transactional system.

I write because I forget.