Database Durability and Recovery overview
Data durability and instance recovery are hotter issues in OLTP than in OLAP/analytical/DWH systems, where writes are more often batch-based. However, we solved this in OLTP decades ago using a technique known as write-ahead logging (WAL).
Many of the exciting developments in database systems in the last 15 years have been in analytical systems, through advances in two areas:
- distributed data processing: horizontal-scaling solutions using MapReduce and RDDs, and vertical-scaling solutions that (over the last few decades) evolved into SMP, MPP, RAID, and SAN.
- alternative data structures (graphs*, documents, key-value stores**, columnar storage).
*Graph databases may be used in OLTP as well, but they are best used when the number of relationships is extremely high and queries have to traverse multiple relationship hierarchies to answer simple questions.
**Document/key-value stores may be used in OLTP, but ACID isn’t a priority for most commercial flavors (though it must be, for OLTP). The legendary Michael Stonebraker proposes in his Redbook that these will be taken over by traditional RDBMSs with JSON support.
This post is a study of the less exciting and already solved techniques around WAL.
Thought of naively, durability is the ability to write committed transactions to disk without data loss. When a database has to flush data to disk (at commit, to make it durable), it first writes this data plus some metadata to log files, before it writes to the actual data files on disk. This is called write-ahead logging, or WAL.
In most databases these log files are called redo log files. The data files are written later, when the OS is ready to flush them to disk or when the database decides to do so. It might seem counter-intuitive to go to the trouble of writing to redo log files first instead of writing directly to the data files, if durability is our aim.
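The write-ahead discipline just described can be sketched in a few lines. This is an illustrative toy, not any particular database's implementation; the record format, file name, and `commit` function are invented for the example.

```python
import json
import os

# Toy write-ahead log: commit() appends the change record to the log and
# fsyncs it to disk *before* any data file is touched.
def commit(log_path, change):
    record = json.dumps(change) + "\n"
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record.encode())
        os.fsync(fd)   # blocking: returns only once the record is on disk
    finally:
        os.close(fd)
    # Only after this point may the data files be updated, lazily.

commit("redo.log", {"op": "update", "key": "acct42", "value": 100})
```

The data-file write can happen much later; as long as the log record is durable, the transaction survives a crash.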
Why don’t we just write(-ahead) into data files instead of log files?
It turns out people certainly have asked this question already, and have even put the idea into successful products, but there are other factors at play that make WAL the intuitive choice:
1. Writing to data files may not be a single-pass write, for example when indexes are present. Databases also offer many data structures for the tables themselves. Oracle has heap tables (the default), index-organized tables, clustered tables, and many more. So writing to data files needs special consideration (and hence takes time).
2. “Data organization” matters in data files for the sake of read performance. For example, hard disk latency is minimal when reads are sequential as opposed to random, due to the cost of disk seeks, so data belonging to a table or partition should be co-located whenever possible.
3. Data files need to do “in-place” writes for updates and deletes, as well as for inserts, unless the user forces an append (through hints, for example). A redo log is a time-ordered, append-only file. Even deletes and updates are appended as new records, with metadata indicating which operation each is, so that older versions of a record can be ignored. Appending without any regard to what is already in the file makes a redo write a low-latency operation.
In addition, data may be organized in logical sections in the file for faster access. For example, Oracle knows a table is located in a specific segment (a logical section) in a specific data file on disk. The tablespace (another logical structure) maps between data files and segments. Segments are organized into extents which are further divided into database blocks.
Redo log files do not have to meet any of these conditions. The only additional information redo has to track is metadata about the transaction being written itself: in the case of Oracle, a System Change Number (a 48-bit integer stored in the SGA), the operation being performed (update/delete), the block location of this data in the data file, and some flags.
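Such a record header can be packed into a small, fixed binary layout. The field widths below (a 48-bit SCN, a one-byte opcode, file and block numbers, flags) loosely follow the description above and are illustrative only; they are not Oracle's actual on-disk format.

```python
import struct

# Illustrative redo record header: 48-bit SCN, opcode, data file number,
# block number, flags. "<" means little-endian with no padding (17 bytes).
HEADER = struct.Struct("<6sBIIH")

OP_UPDATE, OP_DELETE = 1, 2

def pack_header(scn, op, file_no, block_no, flags=0):
    scn_bytes = scn.to_bytes(6, "little")      # SCN fits in 48 bits
    return HEADER.pack(scn_bytes, op, file_no, block_no, flags)

def unpack_header(raw):
    scn_bytes, op, file_no, block_no, flags = HEADER.unpack(raw)
    return int.from_bytes(scn_bytes, "little"), op, file_no, block_no, flags

raw = pack_header(scn=1_234_567, op=OP_UPDATE, file_no=4, block_no=9182)
print(unpack_header(raw))   # (1234567, 1, 4, 9182, 0)
```

A fixed-width header like this keeps redo writes cheap: the log writer can compute record sizes without inspecting any table structure.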
Although WAL takes priority over the data files themselves, data is flushed out to data files as often as possible. This ensures there is only so much data to recover from the redo log in case of an instance failure. The difference between what is in the redo log and what has reached the data files is what a recovery operation has to replay.
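A crash-recovery pass can therefore be sketched as replaying only the log records newer than the last checkpoint. The record shape and checkpoint tracking here are invented for the example.

```python
# Recovery sketch: the data-file image reflects everything up to the
# checkpoint SCN; redo records after it are replayed on top of it.
def recover(data, redo_log, checkpoint_scn):
    for scn, op, key, value in redo_log:
        if scn <= checkpoint_scn:
            continue                  # already persisted in the data files
        if op == "update":
            data[key] = value
        elif op == "delete":
            data.pop(key, None)
    return data

data = {"a": 1, "b": 2}               # data-file state as of checkpoint SCN 100
log = [(90, "update", "a", 1), (101, "update", "a", 5), (102, "delete", "b", None)]
print(recover(data, log, checkpoint_scn=100))   # {'a': 5}
```

The more aggressively the database checkpoints, the shorter this replay loop runs after a crash.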
Why do we need Undo data then?
The purpose of undo is to “quickly” roll back operations while transactions are in progress, if the user requests it. Enabling this from redo would make redo structures a multipurpose vehicle requiring a different and rather complex data organization (more in line with the data files’ organization). This is counter-productive to the primary motive of redo, which is to log data as quickly as possible in case of an instance crash, as mentioned before.
In essence, undo is a storage structure in line with the organization of actual data in data files. It facilitates rollbacks.
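Undo can be sketched as keeping before-images per transaction, so a rollback simply restores them in reverse order. The class and method names are illustrative, not any database's actual structures.

```python
# Undo sketch: before changing a row, stash its before-image so the
# transaction can roll back without consulting the redo log at all.
class Transaction:
    def __init__(self, data):
        self.data = data
        self.undo = []                  # stack of (key, before-image)

    def update(self, key, value):
        self.undo.append((key, self.data.get(key)))
        self.data[key] = value

    def rollback(self):
        while self.undo:                # restore in reverse order
            key, before = self.undo.pop()
            if before is None:
                self.data.pop(key, None)   # row didn't exist before
            else:
                self.data[key] = before

data = {"balance": 100}
txn = Transaction(data)
txn.update("balance", 40)
txn.rollback()
print(data)                             # {'balance': 100}
```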
Asynchronous and Synchronous I/O in WAL
How I/O happens in redo logs is a subject of debate and a source of contradictory claims (for closed-source systems like Oracle).
Kernel operations in I/O
Applications have two ways to write data to a disk.
- Request that the kernel do direct I/O to disk, thereby handing the data directly to the disk controller.
- Request asynchronous I/O, which stores the data in an OS buffer known as the page cache.
Direct I/O system calls are synchronous, that is, they are blocking calls. They return control to the application only after the data has been placed on disk. In asynchronous I/O system calls, the data is first placed in the kernel’s page cache and control is immediately relinquished to the application.
Placing data in memory (as opposed to disk) is fast, so the call is non-blocking, hence asynchronous. The database has no control over exactly when data in the page cache will be flushed to disk in the case of asynchronous I/O; that flush happens in kernel space, not user space. To make sure the data is written to disk immediately, under the database's control, it has to be a synchronous I/O call.
Although direct I/O sounds like a faster way to flush data to disk, since it’s a blocking operation, the database has to wait until the costly disk operation is completed. The database process can’t get back to serving foreground requests while the system call is performing direct I/O.
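On Linux, the difference shows up in how the file is opened. A sketch, using Python's `os` wrappers over the system calls: a plain write returns as soon as the data is in the page cache, while `O_SYNC` makes each write block until the data is on stable storage. (`O_DIRECT`, which additionally bypasses the page cache, imposes buffer-alignment requirements and is omitted here.)

```python
import os

payload = b"redo bytes\n"

# Buffered (asynchronous) write: write() returns once the data is in the
# kernel's page cache; the kernel flushes it to disk at its leisure.
fd = os.open("buffered.log", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, payload)          # fast: a memory copy into the page cache
os.close(fd)

# Synchronous write: O_SYNC makes write() block until the data (and the
# metadata needed to retrieve it) has reached stable storage.
fd = os.open("sync.log", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, payload)          # slow: waits for the disk
os.close(fd)
```

The payload ends up in both files either way; what differs is when the calling process learns the data is durable, and how long it is blocked.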
Asynchronous writing is the recommended way of writing to disk in Linux unless there’s a good reason not to. A lot of documentation on this subject suggests that this is how most databases do it, including Postgres and Oracle.
However, kernel traces on Oracle sometimes show the O_DIRECT flag used on the system calls that write data. This means the kernel’s page cache is being skipped for direct I/O by DBWR, Oracle’s process that writes to data files.
WAL is a mix of synchronous and asynchronous I/O
A number of databases (Oracle, MySQL, Postgres) take the middle ground. The database writes to the page cache and immediately gets on with other tasks in the foreground. The kernel asynchronously notifies the database process of the completion of the disk flush through callbacks.
When the database reaches a point where it must flush data from the page cache to disk, for example when a checkpoint occurs, it can issue an fsync() system call (on Linux). fsync() is often misinterpreted as an asynchronous call because it deals with the page cache, but it’s really a synchronous, blocking call that returns control only after the data is saved on disk.
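This middle ground can be sketched as follows: many cheap buffered writes accumulate in the page cache, and a single blocking fsync() at the checkpoint forces them all to disk at once. File name and sizes are invented for the example.

```python
import os

# Checkpoint sketch: buffered writes land in the page cache and return
# immediately; one blocking fsync() then forces all of them to disk.
fd = os.open("datafile.db", os.O_WRONLY | os.O_CREAT, 0o644)
for block in range(100):
    os.write(fd, b"x" * 512)   # buffered: returns after the memory copy
os.fsync(fd)                   # blocks until all 100 blocks are on disk
os.close(fd)
```

This amortizes the cost of the slow disk operation over many writes, which is exactly why databases prefer it to one synchronous call per write.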
The undocumented anatomy of Oracle’s redo suggests further complexity. LGWR, the process that writes redo to disk, performs a direct synchronous write (using the O_DIRECT flag on the Linux system call), skipping the page cache, when the amount of redo data at hand is less than the maximum number of bytes the platform’s I/O port can handle in one I/O request.
When the redo data exceeds the port-defined maximum, it resorts to asynchronous I/O. When a checkpoint occurs, LGWR or its thread issues fsync() to flush the kernel’s page cache to disk and polls the kernel for completion of the I/O. LGWR doesn’t confirm a commit to the user until the data is flushed to disk.
When the disk is on a SAN, Oracle doesn’t wait for the actual disk write. It assumes durability as soon as the data is transferred to the disk array’s cache.
Summary
Write-ahead logging is the predominant way of achieving durability and post-crash data recovery. Both Linux and Windows support asynchronous writes, and that is the recommended way to go. But databases use alternative approaches to write redo, for example direct I/O, with the aim of flushing as much data to disk as possible so that an instance crash (a power failure, for example) won’t wipe out committed transactions.
The difference between what’s in the redo log files and what’s in the data files is the data that needs to be recovered and written to the data files. This is usually less than a second’s worth of data, but that could mean many transactions in a hot OLTP system.
Note that databases still hold their own cache in user space in addition to the kernel’s page cache, which in many cases duplicates it (Oracle’s buffer cache in the SGA, for example). However, without this memory overhead, the database would have less control over serving queries from multiple users and over issuing direct I/O calls.