<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[unofficial blog]]></title><description><![CDATA[unofficial blog]]></description><link>https://blog.schmizz.net</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1699415497815/R986WzqG_.png</url><title>unofficial blog</title><link>https://blog.schmizz.net</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 15:46:31 GMT</lastBuildDate><atom:link href="https://blog.schmizz.net/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Designing serverless stream storage]]></title><description><![CDATA[I have alluded to "hypothetical S2" (Stream Store), a true counterpart to S3 for data in motion. As I work on making S2 real, I wanted to share the design and how it shaped up.
Vision
Unlimited streams
A pain point with most streaming data solutions ...]]></description><link>https://blog.schmizz.net/designing-serverless-stream-storage</link><guid isPermaLink="true">https://blog.schmizz.net/designing-serverless-stream-storage</guid><category><![CDATA[streaming]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Databases]]></category><category><![CDATA[System Architecture]]></category><dc:creator><![CDATA[Shikhar Bhushan]]></dc:creator><pubDate>Sun, 18 Feb 2024 15:42:46 GMT</pubDate><content:encoded><![CDATA[<p>I have alluded to "<a target="_blank" href="https://blog.schmizz.net/disaggregated-wal">hypothetical S2</a>" (<strong>S</strong>tream <strong>S</strong>tore), a true counterpart to S3 for data in motion. As I work on making <a target="_blank" href="https://s2.dev/">S2</a> real, I wanted to share the design and how it shaped up.</p>
<h3 id="heading-vision">Vision</h3>
<h4 id="heading-unlimited-streams">Unlimited streams</h4>
<p>A pain point with most streaming data solutions is how they limit the number of ordered streams ('partitions'), typically to a few thousand, and/or associate large fixed costs with each of them. This can lead to an impedance mismatch when it may be natural to model distinct streams for fine-grained entities, like users or listings.</p>
<p>Let's imagine building the ingest pipeline for a multi-tenant vector database. It would make sense to have a stream per index. You get some buzz and usage on your free tier blows up: suddenly you are scrambling to shard into multiple clusters. Maybe you start to question why you have to contend with clusters at all.</p>
<h4 id="heading-low-latency">Low latency</h4>
<p>IO to a local SSD is sub-millisecond, as is the typical latency in crossing availability zones. If a cloud database can offer single-digit millisecond response times, any slower for sequential data is a shame.</p>
<p>There are certainly use-cases where sub-second or even sub-10-second for "near-real-time" is good enough, but online systems should rightfully balk. There can be lots of services involved in a modern tech stack, and the foundational primitives need to minimize latency. Beyond obviously online usage where latency affects user experience, imagine emitting an audit event at the end of a Lambda function – waiting on an acknowledgment to retry failures can be a time slice you are paying compute for.</p>
<h4 id="heading-elastic-to-high-throughputs">Elastic to high throughputs</h4>
<p>Cloud VMs can easily offer many GiBps in and out (<a target="_blank" href="https://blog.enfabrica.net/the-next-step-in-high-performance-distributed-computing-systems-4f98f13064ac">networking is fast</a>), and not too far behind are local SSDs wielded well (<a target="_blank" href="https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a">storage is fast</a>). You may experience dissonance checking out cloud streaming solutions that throttle ordered streams to a measly 1-10 MiBps.</p>
<p>When a busy database node can push hundreds of MiBps in a converged setup, <a target="_blank" href="https://blog.schmizz.net/disaggregated-wal">disaggregating its write-ahead log</a> is unworkable with such low throughput limits. Partitioning can be a viable strategy in other cases, but inconvenient and inefficient if you end up having to impose order later with stream processing.</p>
<h4 id="heading-truly-serverless">Truly serverless</h4>
<p>Serverless represents an ideal developer experience for an infrastructure product, elevating the level of abstraction to the workload rather than the physical resources powering it. S2 aspires to be <a target="_blank" href="https://erikbern.com/2021/04/19/software-infrastructure-2.0-a-wishlist.html">truly serverless</a>, with usage-based transparent pricing, and no faux-serverless <a target="_blank" href="https://www.gomomento.com/blog/fighting-off-fake-serverless-bandits-with-the-true-definition-of-serverless">bandit tricks</a>.</p>
<p>Some vendors seem to look at a serverless offering as a way to go "downmarket" or upsell dedicated resources beyond some scaling wall – and yet we have S3, <a target="_blank" href="https://press.aboutamazon.com/2006/3/amazon-web-services-launches">the OG</a>, serving everyone from Netflix to tiny startups. Maybe S2 will offer dedicated single-tenant cells, but it should not be for lack of scalability or resource isolation in its default environment.</p>
<h4 id="heading-enable-the-next-generation-of-data-systems">Enable the next generation of data systems</h4>
<p>When data is moving reliably, there is a durable stream involved. Streams deserve to be a cloud storage primitive. To me this calls for intuitive, easy to integrate APIs over HTTP, like JSON and gRPC. The specifics will evolve, but S2 should always strive for simplicity.</p>
<p>Beyond core data operations like <code>append</code> / <code>pull</code> / <code>trim</code>, leases with fencing are a powerful abstraction for disaggregation. I certainly want it, in plotting a Kafka-on-S2 layer! I believe the best way to bring S2 to Kafka clients would be with an open-source wire-compatible layer that can be self-hosted, and is also made available as a serverless endpoint. The code can serve as an example of how to power distributed data systems with S2, with some of the neat features from the Kafka world like keyed compaction being natively integrated.</p>
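<p>To make the shape of these operations concrete, here is a minimal in-memory sketch; the real S2 API is still taking shape, so every name and signature below is an assumption, not the actual interface:</p>

```python
# Hypothetical sketch of the core data operations (append / pull / trim)
# against an in-memory stand-in. Illustrative only; not the real S2 API.

class Stream:
    def __init__(self):
        self.records = {}   # sequence number -> record bytes
        self.next_seq = 0

    def append(self, record):
        seq = self.next_seq
        self.records[seq] = record
        self.next_seq += 1
        return seq   # durable sequence number acknowledged to the writer

    def pull(self, from_seq):
        # All records at or after from_seq, in order.
        return [(s, r) for s, r in sorted(self.records.items()) if s >= from_seq]

    def trim(self, up_to_seq):
        # Drop records below up_to_seq, e.g. once state is snapshotted.
        self.records = {s: r for s, r in self.records.items() if s >= up_to_seq}

s = Stream()
s.append(b"a"); s.append(b"b"); s.append(b"c")
s.trim(1)                                    # seq 0 is released
assert s.pull(0) == [(1, b"b"), (2, b"c")]   # ordering survives the trim
```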
<h3 id="heading-architecture">Architecture</h3>
<p>The S2 vision presents quite an engineering challenge, and the table stakes are high as well: highly available, durable, and secure.</p>
<p>I initially discounted object storage as a primary source of durability, because of the unacceptably high latency of hundreds of milliseconds for writes. The engineering investment and <a target="_blank" href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">bandwidth costs</a> involved in manually implementing replicated durability seemed necessary.</p>
<p>That changed with the <a target="_blank" href="https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/">announcement of S3 Express</a> at the Nov 2023 AWS re:Invent, promising single-digit millisecond IOs albeit as a "One Zone" zonal service. I also learned about <a target="_blank" href="https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-block-blob-premium">Azure <em>Premium</em> Blob Storage</a>, which is able to offer similarly low latencies regionally. Of the 3 major clouds, this leaves Google Cloud who certainly have SSD-based <a target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">Colossus</a> clusters internally, so I see it as a matter of time before they catch up with a "GCS Express". Suddenly, the future for <strong>fast</strong> cloud-provider-native object storage was looking bright.</p>
<p>With some performance testing, I determined that an IO size sweet spot for S3 Standard is 4 MiB, and the cost model for S3 Express already nudges one towards 512 KiB operations. As I reflected on previous experiments with local NVMe SSDs using Direct IO, a storage hierarchy was firming up. Direct IO forced me to issue 4K-multiple writes with great attention to driving concurrency, and larger sizes like 64 KiB helped with throughput. S3 was more forgiving on precise sizing of writes, but the higher latencies demanded amping up concurrency for a longer pipeline – you may recognize this as <a target="_blank" href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a>.</p>
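<p>The pipeline-depth arithmetic falls straight out of Little's law (L = λ·W): the number of writes that must be in flight equals the operation rate times the per-operation latency. A quick sketch with the ballpark write sizes and latencies discussed here (illustrative figures, not measurements):</p>

```python
# Little's law (L = lambda * W): in-flight operations needed equals
# arrival rate times latency. Figures below are the ballpark write
# sizes and latencies discussed in this post, purely illustrative.

def inflight_ops(target_mibps, write_size_kib, latency_ms):
    ops_per_sec = (target_mibps * 1024) / write_size_kib   # lambda
    return ops_per_sec * (latency_ms / 1000.0)             # L = lambda * W

# Sustaining 1 GiB/s of appends:
local_ssd = inflight_ops(1024, 64, 0.5)      # ~8 writes in flight
s3_express = inflight_ops(1024, 512, 20)     # ~41 writes in flight
s3_standard = inflight_ops(1024, 4096, 400)  # ~102 writes in flight
```

<p>The higher-latency tiers demand wider pipelines even with much larger writes, which is exactly the "longer pipeline" pressure described above.</p>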
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Write size</td><td>Latency</td><td>Sticker price (1 GiB regional durability)</td></tr>
</thead>
<tbody>
<tr>
<td><em>Local SSD</em></td><td>64 KiB</td><td>sub-millisecond</td><td>$0.04 on bandwidth (albeit large discounts at scale!)</td></tr>
<tr>
<td><em>S3 Express</em></td><td>512 KiB</td><td>20 milliseconds</td><td>$0.01536 all-inclusive</td></tr>
<tr>
<td><em>S3 Standard</em></td><td>4 MiB</td><td>400 milliseconds</td><td>$0.00128 all-inclusive</td></tr>
</tbody>
</table>
</div><p>Operating across this full price-performance spectrum seemed within reach, with the right abstractions. Enabling disaggregated systems was one thing; S2 needed to internally disaggregate storage as well. By building a quorum protocol for a zonal service like S3 Express to achieve regional durability and availability, there was an exciting path to also leverage a similar mechanism for a purpose-built storage tier on local disks.</p>
<p>S2 calls this abstraction the chunk store, and it ended up with a surprisingly small interface. Chunks can be composed of records from many streams – multiplexing helps achieve the optimal write size. An easily verifiable protocol performs quorum writes for durability, and cheaply minimizes the "window of uncertainty" chunk range which will require quorum reads in case of recovery. It would be accurate to characterize the sequence of chunks as a replicated write-ahead log.</p>
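<p>The quorum idea can be sketched in a few lines; the actual chunk store interface is private, so the types and names below are stand-ins (imagine one bucket per availability zone):</p>

```python
# Illustrative sketch of quorum-writing a chunk across zonal stores,
# e.g. one S3 Express bucket per availability zone. Not the real S2
# chunk store interface; this only shows the quorum mechanism.
from concurrent.futures import ThreadPoolExecutor

class ZonalStore:
    """Stand-in for a single-zone object store."""
    def __init__(self):
        self.chunks = {}
    def put(self, chunk_id, data):
        self.chunks[chunk_id] = data
        return True

def quorum_write(stores, chunk_id, data, quorum):
    # Issue the write to every zone concurrently; the chunk counts as
    # durable once a quorum of zones has acknowledged it.
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = [pool.submit(s.put, chunk_id, data) for s in stores]
        acks = sum(1 for f in futures if f.result())
    return acks >= quorum

stores = [ZonalStore() for _ in range(3)]
assert quorum_write(stores, chunk_id=0, data=b"records...", quorum=2)
```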
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708119035895/4df7874b-c94b-4614-aa08-32b27c2c392b.png" alt class="image--center mx-auto" /></p>
<p>A channel is diskless, serving real-time reads (i.e. the last N seconds of records) from memory, and works against a specific chunk store type. Of course, memory is a limited resource that a high throughput channel would quickly churn through – which is why channels are continuously, asynchronously offloading per-stream data segments to object storage, and garbage collecting chunks. Higher time-to-first-byte for a lagging reader is not a concern, and comes with a nice win on scalability of read throughput.</p>
<p>Elasticity arises from streams being represented as a chain of lightweight "streamlets", with stream metadata storage centralized. This aspect of the design is inspired by <a target="_blank" href="https://www.usenix.org/conference/osdi20/presentation/balakrishnan">Delos</a>, which calls log epochs "loglets" and innovated with the idea of virtualizing consensus. The control plane can use a simple protocol for live-migrating streams between channels in response to changing workloads – and even reconfigure streams to different types of chunk storage!</p>
<p>To unlock an unlimited number of streams, metadata storage needs to be extremely scalable. It stands behind a minimal interface which can be implemented efficiently against fast cloud-provider-native databases like AWS DynamoDB, GCP Spanner, and Azure CosmosDB, which tend to have even higher availability SLAs than object storage. Metadata reads are largely cacheable, with care to ensure safety in case of stale reads.</p>
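<p>The stale-read safety can be sketched with a versioned compare-and-set, the kind of operation these databases expose as conditional writes; the schema here is illustrative only:</p>

```python
# Sketch of versioned metadata guarded by compare-and-set. A cached
# (possibly stale) read stays safe because any update based on it
# fails the version check and forces a re-read. Illustrative only;
# not the real S2 metadata interface.

class MetadataStore:
    def __init__(self):
        self.data = {}   # key -> (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        version, _ = self.data.get(key, (0, None))
        if version != expected_version:
            return False   # lost the race: caller re-reads and retries
        self.data[key] = (version + 1, value)
        return True

store = MetadataStore()
v, _ = store.get("stream-7")
assert store.compare_and_set("stream-7", v, {"channel": "a"})
assert not store.compare_and_set("stream-7", v, {"channel": "b"})  # stale
```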
<p>This kind of decoupled, multi-tenant architecture is key to delivering on a serverless experience. The calculus on physically separating components is very different from a piece of software intended for users to run, where single-binary-minimal-dependencies tends to provide the best experience.</p>
<p>It was fun to hack on a prototype for S2 that can work against both S3 Standard and S3 Express "Three Zone" as chunk stores, with the performance and cost profile I was shooting for. The latency with S3X3Z is <a target="_blank" href="https://github.com/awslabs/aws-sdk-rust/issues/992#issuecomment-1870538735">not quite</a> single-digit milliseconds, but close enough that it makes sense to defer building a native chunk store.</p>
<p>I glossed over many interesting details, and a lot is currently up in the air. I look forward to sharing more as the work continues!</p>
<p>Now begins the journey of building a great team and delivering on the vision for S2 with a preview launch targeted for later this year. If you are intrigued and want to be an early user or perhaps even collaborate as a design partner, please sign up on the <a target="_blank" href="https://docs.google.com/forms/d/e/1FAIpQLSd8zyYhYEyS9q25h7okSwtfAkq2DB3p8a5gBNLajFpuh2bTYA/viewform?usp=sf_link">waitlist</a>.</p>
]]></content:encoded></item><item><title><![CDATA[The disaggregated write-ahead log]]></title><description><![CDATA[The traditional way replicated systems are architected is to physically co-locate the write-ahead log (WAL) on the nodes where the state is being maintained. Then a consensus protocol like Paxos or Raft is used to make sure the log on each replica ag...]]></description><link>https://blog.schmizz.net/disaggregated-wal</link><guid isPermaLink="true">https://blog.schmizz.net/disaggregated-wal</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Distributed Database]]></category><category><![CDATA[streaming]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Shikhar Bhushan]]></dc:creator><pubDate>Wed, 08 Nov 2023 16:44:40 GMT</pubDate><content:encoded><![CDATA[<p>The traditional way replicated systems are architected is to physically co-locate the <a target="_blank" href="https://en.wikipedia.org/wiki/Write-ahead_logging">write-ahead log</a> (WAL) on the nodes where the state is being maintained. Then a consensus protocol like Paxos or Raft is used to make sure the log on each replica agrees. With many classical databases that have been around a while, instead of consensus there may be a configuration-driven primary/backup scheme.</p>
<p>This converged approach is not <em>that</em> different logically even with the physical decoupling of compute and storage as pioneered by hyperscalers. At Google, locally attached disks are rarely relied upon – you must use Colossus, the current generation of the <a target="_blank" href="https://research.google/pubs/pub51/">Google File System</a> and "<a target="_blank" href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">the secret scaling superpower behind Google's storage infrastructure</a>". All the data systems used internally at Google and those made available as cloud services like Bigtable and Spanner store logs <em>and</em> state in Colossus.</p>
<p>Colossus is a <strong>zonal service</strong>, so these systems still need to implement replication protocols to make them resilient to a single zone or region's unavailability. This is apparent if you have dealt with <a target="_blank" href="https://cloud.google.com/bigtable/docs/replication-settings">Bigtable replication</a>, which is async between zonal clusters. Other services like <a target="_blank" href="https://cloud.google.com/spanner/docs/instance-configurations">Spanner</a> offer a higher-level perspective of a regional or multi-regional deployment.</p>
<p>AWS went in a rather different direction. This <a target="_blank" href="https://www.youtube.com/watch?v=QVvFVwyElLY&amp;t=2650s">2014 re:Invent talk</a> describes their evolution with distributed systems that had led them to embrace a <em>transactional journal</em> primitive. It is a <strong>multi-zone replicated service</strong>, and now "powers some of the biggest AWS databases and retail services such as DynamoDB, Amazon S3, Amazon QLDB, Amazon SQS, Amazon SNS, and many others" (<a target="_blank" href="https://www.linkedin.com/pulse/day-life-thomas-price/">source</a>).</p>
<p>As a building block, this represents incredible leverage for AWS. <a target="_blank" href="https://aws.amazon.com/memorydb/">MemoryDB</a> which debuted in 2021 stands out to me. You get a highly durable Redis usable as a primary data store, for a pretty small write latency tradeoff of single-digit milliseconds. Meanwhile, Redis Labs has a <a target="_blank" href="https://github.com/RedisLabs/redisraft">Raft module</a> which is still experimental. How did AWS do it? See for yourself:</p>
<p><a target="_blank" href="https://aws.amazon.com/memorydb/"><img src="https://d1.awsstatic.com/product-page-diagram_Amazon-MemoryDB-for-Redis.38339d976bd9b151350f496469d4d54b21173523.png" alt="AWS MemoryDB WAL data flow" /></a></p>
<p>Outside of AWS, Meta engineers in their blog on <a target="_blank" href="https://engineering.fb.com/2023/05/16/data-infrastructure/mysql-raft-meta/">using Raft for MySQL replication</a> speculate on a follow-up: "Another idea that we are considering is to disentangle the log from the state machine (the database) into a disaggregated log setup. This will allow the team to manage the concerns of the log and replication separately from the concerns of the database storage and SQL execution engine."</p>
<p>At sub-hyper-scale too, we can find examples of systems successfully leveraging disaggregated WALs:</p>
<ul>
<li><p><a target="_blank" href="https://go.fauna.com/guide/architectural-overview">Fauna</a>, the serverless document-relational hybrid database, runs a <a target="_blank" href="https://blog.acolyer.org/2019/03/29/calvin-fast-distributed-transactions-for-partitioned-database-systems/">Calvin-inspired</a> transaction protocol on top of a partitioned replicated log, delivering strict serializability with multi-region replication.</p>
</li>
<li><p><a target="_blank" href="https://pulsar.apache.org/docs/3.1.x/concepts-architecture-overview/">Pulsar</a> and <a target="_blank" href="https://cncf.pravega.io/docs/latest/pravega-concepts/#architecture">Pravega</a> are streaming data solutions relying on <a target="_blank" href="https://bookkeeper.apache.org/">BookKeeper</a> for record storage, together with <a target="_blank" href="https://zookeeper.apache.org/">ZooKeeper</a> to coordinate since BK eschews ordering.</p>
</li>
<li><p><a target="_blank" href="https://neon.tech/docs/introduction/architecture-overview">Neon</a>, the serverless Postgres system inspired by AWS Aurora (of <a target="_blank" href="https://www.amazon.science/publications/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases">"the log is the database"</a>), built a multi-tenant WAL service called Safekeeper and have shared insights on their <a target="_blank" href="https://neon.tech/blog/paxos">choice of Paxos</a> for it.</p>
</li>
</ul>
<p><a target="_blank" href="https://neon.tech/blog/paxos"><img src="https://neon.tech/_next/image?url=https%3A%2F%2Fneon-hwp.dreamhosters.com%2Fwp-content%2Fuploads%2F2022%2F08%2Fdiagram-3.jpeg&amp;w=3840&amp;q=85" alt="Neon's WAL data flow" /></a></p>
<p>Even though pretty much every distributed data system stands to benefit from replicated logs being a commodity, the only commodity in that diagram today is object storage. The scope of <a target="_blank" href="https://wesmckinney.com/blog/looking-back-15-years/#conclusions-and-looking-ahead">composable data systems</a> needs to extend to replicated logs on this journey of the "Great Decoupling", lest it turn into the "Great Divide" between the building blocks available to hyperscalers vs outsiders.</p>
<p>S3 revolutionized how slow, immutable data is stored and accessed. Where is the counterpart for fast data streams? S3 has been called the <a target="_blank" href="https://www.youtube.com/watch?v=vLlSoHqTTN8&amp;t=149s">gold standard for serverless</a>, and has even been proposed as the <a target="_blank" href="https://medium.com/innovationendeavors/s3-as-the-universal-infrastructure-backend-a104a8cc6991">universal infrastructure backend</a>. I agree on the former, and on the latter I would argue that object stores are not comprehensive enough, as they do not solve for cheap small writes, low latency IO, or coordination requirements.</p>
<p>I believe that if distributed systems hackers had the equivalent of S3 for logs – truly serverless API, single-digit-milliseconds to tail and append, for a practically unlimited number of them with bottomless storage and elastic throughput – we would unlock a ton of design innovation.</p>
<p>Maybe it can even be called S2, the Stream Store. Hypothetical S2 can take care of replicating within and between regions and even across clouds, so long as you pay the speed of light tax. Hypothetical S2 does a bit more to simplify the layers above – it makes <a target="_blank" href="https://maheshba.bitbucket.io/blog/2023/05/06/Leadership.html">leadership above the log</a> convenient with leases and fenced writes.</p>
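<p>The fencing idea can be sketched in a few lines: every new leader takes a lease at a higher epoch, and the log rejects appends bearing an older one, so a paused "zombie" leader cannot corrupt state after failover. Names here are illustrative, not a real S2 API:</p>

```python
# Illustrative sketch of leases with fenced writes. Each acquire_lease
# bumps a monotonically increasing epoch; appends carrying a stale
# epoch are rejected by the log itself. Hypothetical names throughout.

class FencedLog:
    def __init__(self):
        self.entries = []
        self.epoch = 0

    def acquire_lease(self):
        self.epoch += 1
        return self.epoch          # fencing token for the new leader

    def append(self, token, entry):
        if token < self.epoch:
            return False           # stale leader: write fenced off
        self.entries.append(entry)
        return True

log = FencedLog()
a = log.acquire_lease()
assert log.append(a, "a1")
b = log.acquire_lease()            # failover to a new leader
assert log.append(b, "b1")
assert not log.append(a, "a2")     # the old leader is rejected
assert log.entries == ["a1", "b1"]
```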
<p>Building a durable execution engine? There's your fast side effect journal.</p>
<p>Distributed transactions? Use a coordinator log for <a target="_blank" href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol">2PC</a>.</p>
<p>Need to quickly acknowledge writes while preparing index chunks persisted in S3? S2 can be your trimmable, durable buffer.</p>
<p>Unifying streaming and queuing, disklessly? Pulsar charted the path; let's take it to the next level without the JVM clusters.</p>
<p>Ingestion pipeline for a multi-tenant data store? Granular streams seem perfect, model a log per tenant-use-case.</p>
<p>Event sourcing? S2 for the log and S3 for state snapshots.</p>
<p>What would S2 enable for you? It's okay to dream!</p>
<hr />
<p><em>Update: I am working on making</em> <a target="_blank" href="https://s2.dev/"><em>S2</em></a> <em>real!</em></p>
<p><em>Proprietary systems are described based on public info, let me know if I got something wrong and I'd be happy to fix it!</em></p>
]]></content:encoded></item></channel></rss>