Lessons learnt building a CI/CD data warehouse
CI/CD, aka Continuous Integration/Continuous Deployment, in the context of databases is not a very popular topic. The reason could be that enterprise databases still revolve around 40-year-old relational database technology (for good reasons, most of the time) and the people working on them are still “schema-bound”. Old dogs and new tricks.
Data Warehouse architecture
The data warehouse runs on Oracle 18c and is hosted on AWS RDS. It’s dimensionally modelled, with no snowflakes. Some of the dimensions are built to serve as a reference data source for the enterprise. This write-up won’t get into the database design part of the DW.
The DW, hosted on AWS RDS, and the SDLC servers, hosted in Docker containers, both sit inside an AWS VPC in a private subnet. Users connect to the SDLC tools (Airflow DAGs and DBFit test cases) via a browser or Windows RDP, but only after establishing an SSH tunnel to securely reach the private subnet. A tunnel is required for the DW connection as well.
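For illustration, here is a minimal sketch of such a tunnel opened from Python with the sshtunnel package; the bastion host, key path and RDS endpoint are placeholders rather than our actual setup, and a plain ssh -L port forward from a terminal achieves the same thing.

# Minimal sketch: forward a local port through a bastion host to the Oracle
# listener on the private RDS instance. All names and paths are placeholders.
from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    ("bastion.example.com", 22),                    # public jump host in the VPC
    ssh_username="ec2-user",
    ssh_pkey="~/.ssh/dw_dev.pem",
    remote_bind_address=("dw-rds.internal", 1521),  # private RDS endpoint
    local_bind_address=("127.0.0.1", 1521),
)
tunnel.start()
print("Point SQL clients at 127.0.0.1:%d" % tunnel.local_bind_port)
# ...work through the tunnel (SQL client, browser via a second forward, etc.)...
tunnel.stop()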
SDLC Design
Schema bound development
Each developer and tester does their work in their own database schema, so user John Doe works within his JDOE schema. There is NEVER any work outside of their own schema. Every user schema has a copy of the entire DW, and hence the entire DW is their playground, but their own version of it.
Code integration
Each developer/tester has their own EC2 virtual machine. This machine runs a Docker container which hosts Airflow, Liquibase and DBFit. Liquibase automatically deploys code to the database, DBFit runs automated tests for both ETL and data quality, and Airflow schedules the ETL pipelines as well as code deployment and automated tests.
Every user has the same container image on their EC2. When an ETL developer has built PL/SQL code in their schema on the DW, they test it in that same schema, either directly or by triggering the Airflow DAG on their EC2 server. To trigger it from Airflow, the new code has to be copied from their schema to a new file in the local git repo on their EC2 server.
Once code has been tested, git is used to merge the changed files to the master branch. The master branch points to a schema dedicated to integration, where everyone’s code gets integrated and run.
Code deployment
Code deployment is done through Liquibase, which is triggered by an Airflow task that runs before the ETL task. Liquibase deploys code sequentially, by default in alphabetical order of the file names. It keeps track of migrations to each schema through two changelog tables (DATABASECHANGELOG and DATABASECHANGELOGLOCK) maintained in those schemas, and it migrates only the changes made since the last migration.
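To make the bookkeeping concrete, the sketch below (using the python-oracledb driver, with placeholder credentials) lists what Liquibase has recorded in a schema’s DATABASECHANGELOG table; it is purely illustrative, and any SQL client would show the same thing.

# Illustrative: see which changesets Liquibase has already applied to a schema.
# Anything present in the changelog files but missing here is deployed next run.
import oracledb

conn = oracledb.connect(user="JDOE", password="***", dsn="dw-host:1521/DWPDB")
with conn.cursor() as cur:
    cur.execute(
        "select id, author, filename, dateexecuted "
        "from databasechangelog order by orderexecuted"
    )
    for changeset_id, author, filename, executed in cur:
        print(changeset_id, author, filename, executed)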
The DW refreshes every day with new data. The Airflow instance on the integration server (inside the third yellowish Docker container in the pic) runs every day. One of the tasks in its DAG pulls the latest PL/SQL ETL code from the git master branch and saves it to the local repository. The following task triggers Liquibase to deploy the code to the DW’s integration schema. The next task runs the ETL, which now includes the latest code.
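As a rough sketch of what that daily DAG could look like (assuming Airflow 2.x-style imports and Liquibase 3.x-style flags), the example below chains the git pull, Liquibase deployment, ETL run and DBFit tests described here and in the Automated tests section; the DAG id, paths, connection strings, ETL package and DBFit wrapper script are all placeholders, not our actual code.

# A rough sketch of the integration server's daily DAG, not our actual code.
# DAG id, paths, credentials, the ETL package and the DBFit wrapper are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dw_daily_build",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull the latest PL/SQL and DBFit files from the master branch
    pull_code = BashOperator(
        task_id="pull_master",
        bash_command="cd /opt/dw-repo && git checkout master && git pull",
    )

    # Let Liquibase apply any changesets not yet recorded in the integration
    # schema's DATABASECHANGELOG table
    deploy_code = BashOperator(
        task_id="liquibase_update",
        bash_command=(
            "cd /opt/dw-repo && liquibase "
            "--changeLogFile=changelog.xml "
            "--url=jdbc:oracle:thin:@//dw-host:1521/DWPDB "
            "--username=INTEGRATION --password=$LB_PASS update"
        ),
    )

    # Run the PL/SQL ETL in the integration schema (hypothetical package)
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command=(
            "echo 'exec etl_pkg.load_all;' | "
            "sqlplus -s INTEGRATION/$DB_PASS@//dw-host:1521/DWPDB"
        ),
    )

    # Run the DBFit suite against the freshly loaded data (hypothetical wrapper)
    run_tests = BashOperator(
        task_id="dbfit_tests",
        bash_command="/opt/dbfit/run_suite.sh DwRegressionSuite",
    )

    pull_code >> deploy_code >> run_etl >> run_tests

Since Airflow skips downstream tasks when an upstream task fails (the default trigger rule), a failed deployment would also stop that day’s ETL and tests.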
The Airflow instance on each user’s machine (inside the first two yellowish Docker containers in the pic) simply deploys code using Liquibase and runs the ETL. It doesn’t pull from git master. Users get the updated code from the master branch only when they are ready to merge others’ changes with their own, usually after they have built and tested their own changes. This git process is explained in Step 4 below.
Automated tests
DBFit tests run right after the ETL load in the DAG. The same steps used for development apply to testing as well. You can create test cases in DBFit through the browser. A test case is a query, a few queries, or a query together with its expected output value. These get saved as a file on the EC2 instance. From this point the tester follows the same process as a developer to integrate test cases into the master branch.
CI/CD enables us to productionize changes faster, but with great power comes great responsibility. The speed means that code churn is high, and so is the chance of introducing bugs. We originally felt that the daily automated tests were overkill that added to the batch time and cut into SLAs. But without them, the speed of the CI/CD process would have introduced a lot of bugs that would only unravel when it’s too late.
Code integration through git
We used a git service called BitBucket. Our tasks originate in Jira. The Jira-assigned task identifier is used to name each feature branch on git. For example, a Jira task to create the ETL for FACT_CLAIM could be assigned the identifier CLAIM-DW-150, where CLAIM-DW is the name of the Jira project and 150 is the sequential number of the task.
Trigger the Airflow DAG to test the local copy of your code on the EC2 file system. This will run your ETL changes, test cases, etc. Once you’re satisfied with it, start with git. (When you join a project there is a one-time step to clone your project’s git repository to your local machine, in our case the EC2 instance.)
1. Each Jira task should be a separate branch on git. Committing directly to master is doable, but it removes one step along the integration path and leads to errors on master more frequently. Create a branch with the Jira task name:
git checkout -b CLAIM-DW-150
2. Do changes on the local git repo.
To make sure we are sitting on the right branch on local git:
git status
To add a new file to local git from local filesystem:
git add fact_load.sql
or
git add . (to add all new files)
git status (validate)
git commit -m "add comments here"
3. Back up changes to the remote branch:
git push -u origin CLAIM-DW-150
At this point you can keep adding more changes and more files to the branch if you wish, then back those up to the remote branch again.
4. Integrate changes from others into yours:
git rebase origin
git pull
In general this is simply called a pull (not to be confused with BitBucket’s Pull Request, which is the merge to master in Step 6). If there are conflicts between your code and others’ changes, git will warn you about them. Merging to master in Step 6 cannot happen without resolving these conflicts. Most often, conflict resolution requires manually editing the files to resolve the conflicts and merging them using a separate new branch other than the current one. The current branch is deemed useless at that point and can be ignored or deleted.
5. After changes from others have been integrated, back that up to the remote branch one last time:
git push -u origin CLAIM-DW-150
6. Merge the backed-up changes on the remote branch to master. This step is generally called a Merge Request, or in BitBucket terminology a Pull Request. It’s easier to do this through the BitBucket UI in the browser:
a. Click Create Pull Request
b. Ensure the changes are going from your branch to target branch (CLAIM-DW-150 to master)
c. Assign reviewer(s)
d. Let the reviewers do their job and add any comments (you will be notified). If changes need to be made, go back to Step 2 and follow along.
e. Reviewers Approve and Merge to master.
Bird’s eye view
- There’s no single data warehouse on the dev server; there are many, one in each user schema, reflecting each feature under development.
- The integration schema presents the most consistent release of the development data warehouse.
- What’s on the integration schema is propagated to each developer’s/QA’s schema, but the process allows them to choose when to accept it.
- Over the course of development and QA sprints, the integration schema gets stabilized through the above process.
- When the time is up, the master branch that points to integration can be promoted to higher environments, each with its own git branch, finally ending on a production DB server with a production DW schema.
Closing thoughts
A larger task/feature can be split into multiple feature branches on git instead of just one. We use the format CLAIM-DW-150-a, CLAIM-DW-150-b and so on.
Peer review is built into the git process. Nobody throws code over the fence.
There are separate branches for prod-fix (pre-prod) and production environments, which essentially point to their own schemas on a production DB server. These git branches have gatekeepers from DevOps who review and merge code migrations.
Liquibase’s default ordering of files during deployment can create some headaches. We prefixed the filenames with a sequential number to control this behavior, for example 00021_disable_fact_fk.sql.
Automated tests might seem like overkill, but they’re not. They are a necessary overhead on the batch to rein in any bugs introduced by the high speed of the CI/CD SDLC process.
We spent a few weeks trying to set up the SDLC containers on a Kubernetes cluster. Despite having an existing Rancher setup and help from our web services team, this proved to be overkill. Big NO.
Getting database developers and testers familiar with an automated CI/CD process was a bit of a challenge, myself included. But avoiding manual changes in higher environments (integration, prod) was a big process improvement.