The disaggregated write-ahead log

The traditional way replicated systems are architected is to physically co-locate the write-ahead log (WAL) on the nodes where the state is being maintained. Then a consensus protocol like Paxos or Raft is used to make sure the log on each replica agrees. With many classical databases that have been around a while, instead of consensus there may be a configuration-driven primary/backup scheme.
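
To make the converged layout concrete, here is a minimal sketch (all types and names invented for illustration) of a replica that appends each command to its locally co-located WAL before applying it to its state machine; a consensus protocol such as Paxos or Raft would decide which entries make it into that log and in what order:

```go
package main

import "fmt"

// Entry is one record in the local write-ahead log.
type Entry struct {
	Index int
	Cmd   string
}

// Replica co-locates its WAL with the state it maintains. In a real system,
// a consensus protocol decides which entries reach the log and in what
// order across replicas; here we just append locally for illustration.
type Replica struct {
	wal   []Entry           // local, co-located write-ahead log
	state map[string]string // the state machine the log feeds
}

func NewReplica() *Replica {
	return &Replica{state: make(map[string]string)}
}

// Apply durably logs the command first, then mutates state --
// the classic "log before you apply" discipline.
func (r *Replica) Apply(key, value string) {
	entry := Entry{Index: len(r.wal) + 1, Cmd: key + "=" + value}
	r.wal = append(r.wal, entry) // in practice: fsync'd to local disk
	r.state[key] = value
}

func main() {
	r := NewReplica()
	r.Apply("user:1", "alice")
	r.Apply("user:2", "bob")
	fmt.Println("wal:", r.wal)
	fmt.Println("state:", r.state)
}
```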

Logically, this converged approach is not that different even with the physical decoupling of compute and storage pioneered by hyperscalers. At Google, locally attached disks are rarely relied upon – you must use Colossus, the current generation of the Google File System and "the secret scaling superpower behind Google's storage infrastructure". All the data systems used internally at Google and those made available as cloud services like Bigtable and Spanner store logs and state in Colossus.

Colossus is a zonal service, so these systems still need to implement replication protocols to make them resilient to a single zone or region's unavailability. This is apparent if you have dealt with Bigtable replication, which is async between zonal clusters. Other services like Spanner expose a higher-level abstraction of a regional or multi-regional deployment.

AWS went in a rather different direction. This 2014 re:Invent talk describes the evolution of their distributed systems that led them to embrace a transactional journal primitive. It is a multi-zone replicated service, and now "powers some of the biggest AWS databases and retail services such as DynamoDB, Amazon S3, Amazon QLDB, Amazon SQS, Amazon SNS, and many others" (source).

As a building block, this represents incredible leverage for AWS. MemoryDB, which debuted in 2021, stands out to me. You get a highly durable Redis usable as a primary data store, for a pretty small write-latency tradeoff of single-digit milliseconds. Meanwhile, Redis Labs has a Raft module that is still experimental. How did AWS do it? See for yourself:

AWS MemoryDB WAL data flow

Outside of AWS, Meta engineers, in their blog on using Raft for MySQL replication, speculate about a follow-up: "Another idea that we are considering is to disentangle the log from the state machine (the database) into a disaggregated log setup. This will allow the team to manage the concerns of the log and replication separately from the concerns of the database storage and SQL execution engine."

At sub-hyper-scale too, we can find examples of systems successfully leveraging disaggregated WALs:

  • Fauna, the serverless document-relational hybrid database, runs a Calvin-inspired transaction protocol on top of a partitioned replicated log, delivering strict serializability with multi-region replication.

  • Pulsar and Pravega are streaming data solutions relying on BookKeeper for record storage, together with ZooKeeper to coordinate since BK eschews ordering.

  • Neon, the serverless Postgres system inspired by AWS Aurora (of "the log is the database"), built a multi-tenant WAL service called Safekeeper and has shared insights on its choice of Paxos for it.

Neon's WAL data flow

Even though pretty much every distributed data system stands to benefit from replicated logs being a commodity, the only commodity in that diagram today is object storage. The scope of composable data systems needs to extend to replicated logs on this journey of the "Great Decoupling", lest it turn into the "Great Divide" between the building blocks available to hyperscalers vs outsiders.

S3 revolutionized how slow, immutable data is stored and accessed. Where is the counterpart for fast data streams? S3 has been called the gold standard for serverless, and has even been proposed as the universal infrastructure backend. I agree on the former; on the latter, I would argue that object stores are not comprehensive enough, as they do not solve for cheap small writes, low-latency IO, or coordination requirements.

I believe that if distributed systems hackers had the equivalent of S3 for logs – truly serverless API, single-digit-milliseconds to tail and append, for a practically unlimited number of them with bottomless storage and elastic throughput – we would unlock a ton of design innovation.

Maybe it can even be called S2, the Stream Store. Hypothetical S2 can take care of replicating within and between regions and even across clouds, so long as you pay the speed of light tax. Hypothetical S2 does a bit more to simplify the layers above – it makes leadership above the log convenient with leases and fenced writes.
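
To make that concrete, here is a sketch of the kind of API shape I have in mind; the `Stream` interface, its methods, and the in-memory stand-in are all hypothetical, and the example only shows fenced appends (a lease would additionally bound how long a fencing token stays valid):

```go
package main

import (
	"errors"
	"fmt"
)

// Stream is a hypothetical shape for an S2-like log API: append records,
// read a range, and fence out stale writers. None of these names are real.
type Stream interface {
	// Append adds records if token matches the stream's current fencing token.
	Append(token string, records ...string) (nextSeqNum int, err error)
	// Fence installs a new fencing token, invalidating older writers.
	Fence(newToken string) error
	// Read returns records in [from, to).
	Read(from, to int) []string
}

var ErrFenced = errors.New("append rejected: stale fencing token")

// memStream is an in-memory stand-in so the sketch runs; durability and
// replication are exactly the parts the hypothetical service would own.
type memStream struct {
	records []string
	token   string
}

func (s *memStream) Append(token string, records ...string) (int, error) {
	if token != s.token {
		return 0, ErrFenced
	}
	s.records = append(s.records, records...)
	return len(s.records), nil
}

func (s *memStream) Fence(newToken string) error {
	s.token = newToken
	return nil
}

func (s *memStream) Read(from, to int) []string {
	return s.records[from:to]
}

func main() {
	var log Stream = &memStream{token: "leader-1"}

	// Leader 1 appends under its token.
	log.Append("leader-1", "put x=1")

	// A new leader takes over and fences out leader 1.
	log.Fence("leader-2")

	// Leader 1's delayed append is now rejected, so the log stays consistent.
	if _, err := log.Append("leader-1", "put x=2"); err != nil {
		fmt.Println(err)
	}

	n, _ := log.Append("leader-2", "put x=3")
	fmt.Println("records:", log.Read(0, n))
}
```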

Building a durable execution engine? There's your fast side effect journal.
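
For instance, a durable execution runtime could journal each side effect's result under a step ID and, on replay after a crash, serve journaled steps from the log instead of re-executing them. A toy sketch, with an in-memory journal standing in for the hypothetical stream and all names made up:

```go
package main

import "fmt"

// journal stands in for a per-workflow stream: an ordered, durable record
// of every side effect's result, keyed by step ID.
type journal struct {
	results map[string]string
}

// step runs fn at most once per id: if the result was already journaled,
// it is returned without re-executing the side effect.
func (j *journal) step(id string, fn func() string) string {
	if res, ok := j.results[id]; ok {
		fmt.Println("replayed", id, "from journal")
		return res
	}
	res := fn()         // perform the side effect (API call, payment, email, ...)
	j.results[id] = res // in practice: appended to the durable log first
	return res
}

func main() {
	j := &journal{results: map[string]string{}}

	run := func() {
		j.step("charge-card", func() string { return "charge-ok" })
		j.step("send-email", func() string { return "email-sent" })
	}

	run() // first execution performs the side effects
	run() // "crash and replay": both steps come back from the journal
}
```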

Distributed transactions? Use a coordinator log for 2PC.
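
One way that could look: the coordinator appends its commit/abort decision to a durable log before notifying participants, so a recovering coordinator can replay the log to resolve in-doubt transactions. In this sketch the decision log is an in-memory stand-in and all names are made up:

```go
package main

import "fmt"

// decision is one record in the coordinator's log.
type decision struct {
	TxID    string
	Outcome string // "commit" or "abort"
}

// coordinator runs two-phase commit, using an append-only decision log
// (a stand-in for a durable stream) as its source of truth.
type coordinator struct {
	log []decision
}

// commit collects votes, durably logs the outcome, then notifies
// participants. The log write is the commit point.
func (c *coordinator) commit(txID string, votes []bool) string {
	outcome := "commit"
	for _, yes := range votes {
		if !yes {
			outcome = "abort"
			break
		}
	}
	// Phase 2a: append the decision before acting on it.
	c.log = append(c.log, decision{TxID: txID, Outcome: outcome})
	// Phase 2b: tell participants (omitted); they can be re-told after a crash.
	return outcome
}

// replayDecisions re-reads the decision log to resolve in-doubt transactions.
func (c *coordinator) replayDecisions() {
	for _, d := range c.log {
		fmt.Println("re-deliver", d.Outcome, "for", d.TxID)
	}
}

func main() {
	c := &coordinator{}
	fmt.Println(c.commit("tx-1", []bool{true, true}))  // all prepared -> commit
	fmt.Println(c.commit("tx-2", []bool{true, false})) // one no vote -> abort
	c.replayDecisions()
}
```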

Need to quickly acknowledge writes while preparing index chunks persisted in S3? S2 can be your trimmable, durable buffer.
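
The shape of that pattern, sketched with in-memory stand-ins for both the stream and the object store (all names hypothetical): acknowledge each write as soon as it is appended, periodically flush an index chunk to object storage, then trim the stream up to the flushed sequence number.

```go
package main

import "fmt"

// buffer acknowledges writes once they are appended to a durable stream
// (stand-in: a slice), and trims the stream after the data is repackaged
// into an index chunk and persisted to object storage (stand-in: a map).
type buffer struct {
	stream      []string          // stand-in for the durable, trimmable log
	trimmedUpTo int               // sequence number the log has been trimmed to
	objectStore map[string]string // stand-in for S3: chunk name -> contents
}

// write appends and acknowledges immediately; durability comes from the log.
func (b *buffer) write(record string) int {
	b.stream = append(b.stream, record)
	return b.trimmedUpTo + len(b.stream) // sequence number of this record
}

// flush packs everything buffered so far into one chunk, persists it to the
// object store, then trims the log so it stays small.
func (b *buffer) flush() {
	if len(b.stream) == 0 {
		return
	}
	chunk := fmt.Sprintf("chunk-%06d", b.trimmedUpTo+len(b.stream))
	b.objectStore[chunk] = fmt.Sprint(b.stream)
	b.trimmedUpTo += len(b.stream)
	b.stream = nil // trim: the log no longer needs these records
}

func main() {
	b := &buffer{objectStore: map[string]string{}}
	b.write("doc-1 terms...")
	b.write("doc-2 terms...")
	b.flush()
	b.write("doc-3 terms...")
	fmt.Println("object store:", b.objectStore)
	fmt.Println("still buffered in log:", b.stream, "trimmed up to seq", b.trimmedUpTo)
}
```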

Unifying streaming and queuing, disklessly? Pulsar charted the path; let's take it to the next level without the JVM clusters.

Ingestion pipeline for a multi-tenant data store? Granular streams seem perfect: model a log per tenant-use-case pair.

Event sourcing? S2 for the log and S3 for state snapshots.
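
A sketch of that split, with both the event log and the snapshot faked in memory (hypothetical names throughout): recovery loads the latest snapshot, then replays only the events appended after the snapshot's sequence number.

```go
package main

import "fmt"

// event is one record in the event log (stand-in for the durable stream).
type event struct {
	Account string
	Amount  int
}

// snapshot is what would live in S3: materialized balances plus the
// sequence number of the last event folded into them.
type snapshot struct {
	Balances map[string]int
	Seq      int
}

// rebuild loads the snapshot and replays only the events appended after it.
func rebuild(snap snapshot, log []event) map[string]int {
	balances := map[string]int{}
	for acct, amt := range snap.Balances {
		balances[acct] = amt
	}
	for _, e := range log[snap.Seq:] { // tail of the log, after the snapshot
		balances[e.Account] += e.Amount
	}
	return balances
}

func main() {
	// The full event history, as it would sit in the durable stream.
	log := []event{{"alice", 10}, {"bob", 5}, {"alice", 1}, {"bob", 2}}

	// A snapshot taken after the first two events, as it would sit in S3.
	snap := snapshot{Balances: map[string]int{"alice": 10, "bob": 5}, Seq: 2}

	fmt.Println(rebuild(snap, log)) // map[alice:11 bob:7]
}
```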

What would S2 enable for you? It's okay to dream!


Update: I am working on making S2 real!

Proprietary systems are described based on public info; let me know if I got something wrong and I'd be happy to fix it!