Designing serverless stream storage

I have alluded to "hypothetical S2" (Stream Store), a true counterpart to S3 for data in motion. As I work on making S2 real, I wanted to share the design and how it shaped up.

Vision

Unlimited streams

A pain point with most streaming data solutions is how they limit the number of ordered streams ('partitions'), typically to a few thousand, and/or associate large fixed costs with each of them. This can lead to an impedance mismatch when it may be natural to model distinct streams for fine-grained entities, like users or listings.

Let's imagine building the ingest pipeline for a multi-tenant vector database. It would make sense to have a stream per index. You get some buzz and usage on your free tier blows up: suddenly you are scrambling to shard into multiple clusters. Maybe you start to question why you have to contend with clusters at all.

Low latency

IO to a local SSD is sub-millisecond, as is the typical latency in crossing availability zones. If a cloud database can offer single-digit millisecond response times, anything slower for sequential data is a shame.

There are certainly use-cases where sub-second or even sub-10-second "near-real-time" latency is good enough, but online systems should rightfully balk. A modern tech stack can involve a lot of services, and the foundational primitives need to minimize latency. Beyond obviously online usage where latency affects user experience, imagine emitting an audit event at the end of a Lambda function – waiting on an acknowledgment so you can retry failures is a time slice you are paying compute for.

Elastic to high throughputs

Cloud VMs can easily offer many GiBps in and out (networking is fast), and not too far behind are local SSDs wielded well (storage is fast). You may experience dissonance checking out cloud streaming solutions that throttle ordered streams to a measly 1-10 MiBps.

When a busy database node can push hundreds of MiBps in a converged setup, disaggregating its write-ahead log is unworkable with such low throughput limits. Partitioning can be a viable strategy in other cases, but it is inconvenient and inefficient if you end up having to impose order later with stream processing.

Truly serverless

Serverless represents an ideal developer experience for an infrastructure product, elevating the level of abstraction to the workload rather than the physical resources powering it. S2 aspires to be truly serverless, with usage-based transparent pricing, and no faux-serverless bandit tricks.

Some vendors seem to look at a serverless offering as a way to go "downmarket" or upsell dedicated resources beyond some scaling wall – and yet we have S3 the OG serving everyone from Netflix to tiny startups. Maybe S2 will offer dedicated single-tenant cells, but it should not be for lack of scalability or resource isolation in its default environment.

Enable the next generation of data systems

When data is moving reliably, there is a durable stream involved. Streams deserve to be a cloud storage primitive. To me this calls for intuitive, easy-to-integrate APIs over HTTP, like JSON and gRPC. The specifics will evolve, but S2 should always strive for simplicity.

Beyond core data operations like append / pull / trim, leases with fencing are a powerful abstraction for disaggregation. I certainly want it myself, in plotting a Kafka-on-S2 layer! I believe the best way to bring S2 to Kafka clients would be an open-source wire-compatible layer that can be self-hosted, and is also made available as a serverless endpoint. The code can serve as an example of how to power distributed data systems with S2, with some of the neat features from the Kafka world, like keyed compaction, natively integrated.
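
To make fencing concrete, here is a minimal Rust sketch of the semantics a leaseholder relies on (all names hypothetical, not S2's actual API): the stream remembers the highest fencing token it has seen and rejects appends that carry an older one, so a new leaseholder can safely fence out a zombie writer.

```rust
// A minimal sketch of fencing semantics, with hypothetical names: the
// stream tracks the highest fencing token it has seen, and appends that
// carry an older token are rejected.

#[derive(Debug, PartialEq)]
enum AppendError {
    Fenced, // the caller's token is older than the stream's current token
}

struct Stream {
    fencing_token: u64,
    records: Vec<Vec<u8>>,
}

impl Stream {
    fn append(&mut self, token: u64, record: Vec<u8>) -> Result<u64, AppendError> {
        if token < self.fencing_token {
            return Err(AppendError::Fenced);
        }
        self.fencing_token = token; // a newer token bumps the fence
        self.records.push(record);
        Ok((self.records.len() - 1) as u64) // sequence number assigned to the record
    }
}

fn main() {
    let mut stream = Stream { fencing_token: 0, records: Vec::new() };
    assert_eq!(stream.append(1, b"from leaseholder A".to_vec()), Ok(0));
    // Leaseholder B acquires the lease with a higher token...
    assert_eq!(stream.append(2, b"from leaseholder B".to_vec()), Ok(1));
    // ...and any straggling writes from A are now fenced out.
    assert_eq!(stream.append(1, b"zombie write".to_vec()), Err(AppendError::Fenced));
}
```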

Architecture

The S2 vision presents quite an engineering challenge, and the table stakes are high as well: highly available, durable, and secure.

I initially discounted object storage as a primary source of durability because of the unacceptably high latency of hundreds of milliseconds for writes. The engineering investment and bandwidth costs of manually implementing replicated durability seemed necessary.

That changed with the announcement of S3 Express at AWS re:Invent in November 2023, promising single-digit millisecond IOs, albeit as a "One Zone" zonal service. I also learned about Azure Premium Blob Storage, which is able to offer similarly low latencies regionally. Of the three major clouds, that leaves Google Cloud, who certainly have SSD-based Colossus clusters internally, so I see it as a matter of time before they catch up with a "GCS Express". Suddenly, the future for fast cloud-provider-native object storage was looking bright.

With some performance testing, I determined that an IO size sweet spot for S3 Standard is 4 MiB, and the cost model for S3 Express already nudges one towards 512 KiB operations. As I reflected on previous experiments with local NVMe SSDs using Direct IO, a storage hierarchy was firming up. Direct IO forced me to issue 4K-multiple writes with great attention to driving concurrency, and larger sizes like 64 KiB helped with throughput. S3 was more forgiving on precise sizing of writes, but the higher latencies demanded amping up concurrency for a longer pipeline – you may recognize this as Little's law.

| | Write size | Latency | Sticker price (1 GiB regional durability) |
| --- | --- | --- | --- |
| Local SSD | 64 KiB | sub-millisecond | $0.04 on bandwidth (albeit large discounts at scale!) |
| S3 Express | 512 KiB | 20 milliseconds | $0.01536 all-inclusive |
| S3 Standard | 4 MiB | 400 milliseconds | $0.00128 all-inclusive |
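
Little's law (L = λW) makes the concurrency requirement concrete: the number of writes that must be in flight equals the arrival rate times the per-write latency. A back-of-the-envelope sketch using the numbers from the table:

```rust
// Back-of-the-envelope Little's law (L = λW): how many writes must be in
// flight to sustain a target throughput at a given per-write latency,
// using the write sizes and latencies from the table above.

fn writes_in_flight(throughput_bps: f64, write_size_bytes: f64, latency_secs: f64) -> f64 {
    let writes_per_sec = throughput_bps / write_size_bytes; // arrival rate λ
    writes_per_sec * latency_secs // L = λW
}

fn main() {
    const KIB: f64 = 1024.0;
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    // 1 GiBps of 512 KiB writes at ~20 ms each needs ~41 writes in flight:
    println!("S3 Express: {:.0}", writes_in_flight(GIB, 512.0 * KIB, 0.020));
    // The same throughput with 4 MiB writes at ~400 ms each needs ~102:
    println!("S3 Standard: {:.0}", writes_in_flight(GIB, 4.0 * 1024.0 * KIB, 0.400));
}
```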

Operating across this full price-performance spectrum seemed within reach, with the right abstractions. Enabling disaggregated systems was one thing; S2 needed to internally disaggregate storage as well. By building a quorum protocol for a zonal service like S3 Express to achieve regional durability and availability, there was an exciting path to also leverage a similar mechanism for a purpose-built storage tier on local disks.

S2 calls this abstraction the chunk store, and it ended up with a surprisingly small interface. Chunks can comprise records from many streams – multiplexing helps achieve the optimal write size. An easily verifiable protocol performs quorum writes for durability, and cheaply minimizes the "window of uncertainty" chunk range that will require quorum reads in case of recovery. It would be accurate to characterize the sequence of chunks as a replicated write-ahead log.
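
As a sketch of what such a small interface and its quorum condition could look like, with hypothetical names rather than S2's actual API:

```rust
// A sketch of the chunk store interface and its quorum condition, with
// hypothetical names (not S2's actual API). Chunks multiplex records from
// many streams, and the sequence of chunks forms the write-ahead log.

trait ChunkStore {
    /// Durably write a chunk at the next position and return that position.
    fn append(&mut self, chunk: &[u8]) -> Result<u64, ()>;
    /// Read back a chunk; positions inside the "window of uncertainty"
    /// require a quorum read during recovery.
    fn read(&self, position: u64) -> Result<Vec<u8>, ()>;
    /// Garbage-collect chunks below `position` once their per-stream
    /// segments have been offloaded to object storage.
    fn truncate(&mut self, position: u64) -> Result<(), ()>;
}

/// One zonal backing store, e.g. an S3 Express bucket in a single AZ.
trait ZonalStore {
    fn put(&mut self, position: u64, chunk: &[u8]) -> Result<(), ()>;
}

/// The quorum condition behind a regional append: acknowledge once a
/// majority of zones holds the chunk. A real implementation would issue
/// the writes concurrently; a sequential loop keeps the sketch short.
fn quorum_append<Z: ZonalStore>(zones: &mut [Z], position: u64, chunk: &[u8]) -> bool {
    let needed = zones.len() / 2 + 1;
    let mut acks = 0;
    for zone in zones.iter_mut() {
        if zone.put(position, chunk).is_ok() {
            acks += 1;
            if acks >= needed {
                return true; // durable across a majority of zones
            }
        }
    }
    false // quorum not reached; the append cannot be acknowledged
}
```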

A channel is diskless, serving real-time reads (i.e. the last N seconds of records) from memory, and works against a specific chunk store type. Of course, memory is a limited resource that a high-throughput channel would quickly churn through – which is why channels continuously and asynchronously offload per-stream data segments to object storage, garbage-collecting chunks as they go. Higher time-to-first-byte for a lagging reader is not a concern, and this comes with a nice win on the scalability of read throughput.
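
A sketch of how a channel could route reads under these assumptions, again with hypothetical names: real-time readers are served from the in-memory tail, while lagging readers fall through to segments already offloaded to object storage.

```rust
// A sketch of the channel read path (hypothetical names): recent records
// are served from memory; anything older has been offloaded per-stream to
// object storage and is read from there instead.

use std::collections::VecDeque;

enum Read {
    FromMemory(Vec<u8>),
    FromObjectStorage { segment_for: u64 }, // redirect to an offloaded segment
    NotYetWritten,
}

struct Channel {
    tail: VecDeque<(u64, Vec<u8>)>, // (sequence number, record) pairs in memory
    first_tail_seq: u64,            // everything before this has been offloaded
}

impl Channel {
    fn read(&self, seq_num: u64) -> Read {
        if seq_num >= self.first_tail_seq {
            // Real-time read: serve directly from the in-memory tail.
            match self.tail.iter().find(|(s, _)| *s == seq_num) {
                Some((_, record)) => Read::FromMemory(record.clone()),
                None => Read::NotYetWritten,
            }
        } else {
            // Lagging read: higher time-to-first-byte, but read throughput
            // now scales with object storage rather than channel memory.
            Read::FromObjectStorage { segment_for: seq_num }
        }
    }
}
```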

Elasticity arises from streams being represented as a chain of lightweight "streamlets", with stream metadata storage centralized. This aspect of the design is inspired by Delos, which calls log epochs "loglets" and innovated with the idea of virtualizing consensus. The control plane can use a simple protocol for live-migrating streams between channels in response to changing workloads – and even reconfigure streams to different types of chunk storage!
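
A hypothetical sketch of how a streamlet chain could be represented (Delos's actual data structures differ): each streamlet pins a range of the stream to a channel and a chunk store type, and reconfiguration seals the active streamlet and appends a new one.

```rust
// A sketch of stream metadata as a chain of streamlets (hypothetical
// names). Only the last streamlet is unsealed; live migration or a change
// of chunk store type is just a seal-and-extend of the chain.

enum ChunkStoreKind {
    LocalSsd,
    S3Express,
    S3Standard,
}

struct Streamlet {
    start_seq: u64,        // first sequence number covered by this streamlet
    end_seq: Option<u64>,  // None while this streamlet is active (unsealed)
    channel_id: u64,       // the channel currently serving this range
    chunk_store: ChunkStoreKind,
}

struct StreamMetadata {
    chain: Vec<Streamlet>, // ordered; only the last entry can be unsealed
}

impl StreamMetadata {
    /// Seal the active streamlet at `seq` and extend the chain, e.g. to
    /// live-migrate the stream to another channel or chunk store type.
    fn reconfigure(&mut self, seq: u64, channel_id: u64, chunk_store: ChunkStoreKind) {
        if let Some(last) = self.chain.last_mut() {
            last.end_seq = Some(seq);
        }
        self.chain.push(Streamlet { start_seq: seq, end_seq: None, channel_id, chunk_store });
    }
}
```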

To unlock an unlimited number of streams, metadata storage needs to be extremely scalable. It stands behind a minimal interface which can be implemented efficiently against fast cloud-provider-native databases like AWS DynamoDB, GCP Spanner, and Azure CosmosDB, which tend to have even higher availability SLAs than object storage. Metadata reads are largely cacheable, with care to ensure safety in case of stale reads.
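
One minimal shape that maps cleanly onto those databases is versioned gets plus conditional puts (e.g. DynamoDB conditional writes, or optimistic concurrency with ETags in CosmosDB). A hypothetical sketch: the version check is what keeps stale cached reads safe, since a write based on an out-of-date version simply fails.

```rust
// A sketch of a minimal metadata interface (hypothetical names). Reads are
// cacheable; writes are compare-and-set on a version, so acting on a stale
// cached read fails safely rather than clobbering newer metadata.

trait MetadataStore {
    type Error;

    /// Read a key along with its current version; results are cacheable.
    fn get(&self, key: &[u8]) -> Result<Option<(Vec<u8>, u64)>, Self::Error>;

    /// Write `value` only if the stored version still equals
    /// `expected_version`, returning the new version on success. This maps
    /// onto, e.g., a DynamoDB conditional write.
    fn put_if_version(
        &mut self,
        key: &[u8],
        value: Vec<u8>,
        expected_version: u64,
    ) -> Result<u64, Self::Error>;
}
```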

This kind of decoupled, multi-tenant architecture is key to delivering on a serverless experience. The calculus on physically separating components is very different from a piece of software intended for users to run, where single-binary-minimal-dependencies tends to provide the best experience.

It was fun to hack on a prototype for S2 that can work against both S3 Standard and S3 Express "Three Zone" as chunk stores, with the performance and cost profile I was shooting for. The latency with S3X3Z is not quite single-digit milliseconds, but it is close enough that it makes sense to defer building a native chunk store.

I glossed over many interesting details, and a lot is currently up in the air. I look forward to sharing more as the work continues!

Now begins the journey of building a great team and delivering on the vision for S2, with a preview launch targeted for later this year. If you are intrigued and want to be an early user, or perhaps even collaborate as a design partner, please sign up on the waitlist.