Replies: 2 comments, 1 reply
- This looks great! Still forming thoughts, but some initial ones:
- I feel like we need to define a few more things for this proposal:
The Cloudflare Developer Platform provides a serverless runtime and a bunch of data stores: KV, Durable Objects, Cache, R2, D1 and Analytics Engine. Each of these data stores provides some mechanism for writing data, then reading it back. Specifically, KV, Durable Objects, Cache and R2 are all key-value stores: you can put/get/delete and (ignoring Cache) list keys.
For implementing these, Miniflare has a common key-value storage interface that supports putting/getting/deleting keys with metadata and expiry, then listing them based on `prefix`, `start` and `end` filters, with `cursor`, `limit` and `reverse` pagination, and `delimiter` grouping. This storage interface worked well in the early days of Miniflare, but as more complex data storage solutions have been added to Workers (particularly R2 and D1 😅), it's started to show some rough edges. In particular:

- `startAfter` filtering (needed by R2's `list`) has to be implemented with `start` and an increased limit, which is difficult to mix with `cursor`-based pagination.
- Keys are persisted as file paths, with special handling for characters like `:` and `/`. This means you can't store `a/b` and `a` at the same time (KV keys with `/` with kvPersist throws `EISDIR` #167), and keys are case-insensitive when the underlying file-system is ([BUG] DO storage keys are case-sensitive on the edge, but Miniflare uses case-insensitive key matching with persistence #247). Multiple keys can also map to the same file name on disk. 😬
- Simultaneous `put`s to the same key can result in malformed data (#530).

Given that we can make breaking changes to the persistence format with the major version bump to Miniflare 3, I think it's time to rethink storage.
Requirements
KV
- `get`: read value as stream if not expired, including expiration and metadata
- `put`: write value stream, expiration and metadata
- `delete`: delete value and metadata
- `list`: list non-expired keys, filtering by `prefix`, paginating by `cursor` and `limit`, returning key names, expiration and metadata

Our KV implementation also needs to support read-only Workers Sites namespaces, backed by an arbitrary directory, with glob-style include/exclude rules.
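To make this concrete, here's a rough TypeScript sketch of the KV operations above. The names and types are illustrative only, not a proposed API.

```ts
// Illustrative sketch of the KV operations described above
interface KVEntry {
  key: string;
  expiration?: number; // seconds since epoch
  metadata?: unknown;
}

interface KVListOptions {
  prefix?: string;
  cursor?: string;
  limit?: number;
}

interface KVListResult {
  keys: KVEntry[];
  cursor?: string; // present if there are more non-expired keys to list
}

interface KVStore {
  get(key: string): Promise<(KVEntry & { value: ReadableStream<Uint8Array> }) | null>;
  put(key: string, value: ReadableStream<Uint8Array>, options?: Omit<KVEntry, "key">): Promise<void>;
  delete(key: string): Promise<void>;
  list(options?: KVListOptions): Promise<KVListResult>;
}
```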
Cache
- `match`: read body as stream if not expired, including status and headers, optionally read a single range or multiple ranges as multipart
- `put`: write body stream, status and headers
- `delete`: delete body, status and headers, returning whether we actually deleted anything
R2
- `head`: read metadata
- `get`: read value stream, including metadata, optionally read a single range or only return value if conditional succeeds
- `put`: write value stream and metadata, optionally only if conditional succeeds and checksums match
- `delete`: delete value and metadata
- `list`: lists keys with metadata, filtering by `prefix` and `startAfter`, paginating by `cursor` and `limit`, grouping by `delimiter`
- `createMultipartUpload`: write metadata, returning new upload ID
- `uploadPart`: write part stream, returning etag
- `completeMultipartUpload`: read parts' metadata, write entry with pointers to parts, mark upload ID completed
- `abortMultipartUpload`: mark upload ID aborted

D1
Requires exclusive access to an SQLite database
Durable Objects
Persistence implemented entirely in `workerd`
Proposal
Instead of having a single store for both metadata and large blobs, I propose we split these up.
Given the variety of queries required by each data store (especially R2's `list`), and D1's hard requirement on SQLite, using SQLite for the metadata store seems like a good idea. This also gives us the transactional updates we're looking for with things like multipart uploads.
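As an illustration of how these queries map onto SQL (the table and column names are made up, and `better-sqlite3` is just a convenient stand-in driver here, not necessarily what we'd use):

```ts
import Database from "better-sqlite3";

// Hypothetical metadata table for a single KV namespace
const db = new Database("kv-namespace.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS _mf_entries (
    key        TEXT PRIMARY KEY,
    blob_id    TEXT NOT NULL,  -- pointer into the blob store
    expiration INTEGER,        -- NULL = never expires
    metadata   TEXT            -- JSON-serialised user metadata
  );
`);

// list(): prefix filtering with cursor/limit pagination
// (ignoring LIKE wildcard escaping for brevity)
const page = db
  .prepare(
    `SELECT key, expiration, metadata FROM _mf_entries
     WHERE key LIKE ? || '%' AND key > ? AND (expiration IS NULL OR expiration > ?)
     ORDER BY key LIMIT ?`
  )
  .all("section:", "" /* cursor: last key of the previous page */, Math.floor(Date.now() / 1000), 100);
```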
We could then implement our own simple blob store, supporting multi-ranged, streaming reads. Multiple ranges with multipart responses are required by Cache, and streaming reads seem like a good idea for the large objects R2 can support.
In-memory and file-system backed implementations would be provided for both stores. For file-system backed stores, a root directory containing the store's data should be provided. We should validate that no file-system store in the same Miniflare instance is rooted inside another file-system store's root directory.
We may also want to provide a simple expiring key-value-metadata store abstraction on top of these, for use with KV and Cache.
In the future, we may implement Miniflare's simulators in `workerd` as opposed to Node.js. SQLite may be implemented on top of Durable Objects, and we could use `workerd`'s `DiskDirectory` services for blob storage.

SQLite
We should create a new SQLite database for each KV namespace, R2 bucket, D1 database, etc.
For the in-memory implementation, we should use SQLite's built-in `:memory:` databases.

Blob Store
This should provide an interface like:
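A rough sketch, with names and signatures inferred from the description below rather than final:

```ts
// Illustrative sketch of the blob store interface
type BlobId = string; // opaque, un-guessable identifier

interface Range {
  start: number; // inclusive
  end: number;   // inclusive
}

interface MultipartReadableStream {
  multipartContentType: string; // "multipart/byteranges; boundary=..."
  body: ReadableStream<Uint8Array>;
}

interface BlobStore {
  // Optional single range: returns the (partial) blob, or null if the ID can't be found
  get(id: BlobId, range?: Range): Promise<ReadableStream<Uint8Array> | null>;
  // Multiple ranges: always returns a multipart/byteranges body, even with 0 or 1 ranges
  get(id: BlobId, ranges: Range[]): Promise<MultipartReadableStream | null>;

  // Writes a new immutable blob, returning its un-guessable ID
  put(stream: ReadableStream<Uint8Array>): Promise<BlobId>;

  // Deletes the blob; in-flight streaming gets complete first
  delete(id: BlobId): Promise<void>;
}
```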
`BlobId`s are opaque, un-guessable identifiers. Blobs can be deleted, but are otherwise immutable. Using immutable blobs makes it possible to perform atomic updates with the SQLite metadata store. No other operations will be able to interact with the blob until it's committed to the metadata store, because they won't be able to guess the ID, and we don't allow listing blobs. For example, if we put a blob in the store, then fail to insert the blob ID into the SQLite database for some reason during a transaction (e.g. `onlyIf` condition failed), no other operations can read that blob because the ID is lost (we'll just background-delete the blob in this case).
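A rough sketch of the atomic-update pattern this enables, reusing the `BlobStore` sketch above; the table name, columns and `better-sqlite3` usage are assumptions for illustration only:

```ts
import Database from "better-sqlite3";

declare const blobs: BlobStore; // the blob store sketched above

const db = new Database("r2-bucket.db");
db.exec(
  "CREATE TABLE IF NOT EXISTS _mf_objects (key TEXT PRIMARY KEY, blob_id TEXT NOT NULL)"
);

// Hypothetical atomic put: write the blob first, then commit its ID to SQLite.
// If the transaction throws (e.g. an onlyIf-style condition fails), the new blob's
// ID was never published, so it can safely be deleted in the background.
async function putObject(key: string, value: ReadableStream<Uint8Array>): Promise<void> {
  const blobId = await blobs.put(value);
  try {
    const previous = db.transaction(() => {
      const row = db.prepare("SELECT blob_id FROM _mf_objects WHERE key = ?").get(key) as
        | { blob_id: string }
        | undefined;
      // ...check onlyIf-style preconditions here, throwing if they fail...
      db.prepare("INSERT OR REPLACE INTO _mf_objects (key, blob_id) VALUES (?, ?)").run(key, blobId);
      return row?.blob_id;
    })();
    // Any previous blob for this key is now unreachable: background-delete it
    if (previous !== undefined) void blobs.delete(previous);
  } catch (e) {
    void blobs.delete(blobId); // the new blob's ID was never committed
    throw e;
  }
}
```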
Whilst entire `BlobId`s must be un-guessable, they may still contain identifiable information. One advantage of Miniflare's existing file-system storage is that keys show up as files, grouped by namespace, in your IDE. This makes it easy to inspect written data, and to see when storage operations are succeeding. By including a `userId` (e.g. KV key, Cache URL, R2 object name) in `BlobId`s along with some randomness, blobs written to disk will be user-identifiable and inspectable. We'll want to encode `userId`s to be file-system safe and not contain directory separators to avoid #167. This encoding must be one-to-one to avoid issues like #247, and ideally would preserve file extensions so images open in an image viewer, for example. We'll also want to mark blobs as read-only files to maintain immutability. This is especially important given we'll want to store object sizes in SQLite for R2 (we don't want to stat files when listing), and a modified blob could put the system in an inconsistent state.
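A hypothetical way of constructing such IDs, just to illustrate the shape (the exact encoding is an open question):

```ts
import { randomBytes } from "node:crypto";
import { extname } from "node:path";

// Hypothetical BlobId construction: user-identifiable prefix + random suffix.
// encodeURIComponent escapes "/" and other unsafe characters, so the ID never
// contains directory separators (#167). A real encoding would also need to be
// one-to-one on case-insensitive file systems (#247), which this isn't.
function newBlobId(userId: string): string {
  const safe = encodeURIComponent(userId);
  const random = randomBytes(16).toString("hex"); // makes the full ID un-guessable
  return `${safe}_${random}${extname(userId)}`;   // keep the extension so files open nicely
}

newBlobId("blog/cat.png"); // e.g. "blog%2Fcat.png_3b0c44298fc1c149afbf4c8996fb9242.png"
```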
`Range`s are inclusive, so `Range` HTTP headers easily map to them. `get` has two overloads: one accepts an optional single range and returns a `ReadableStream`, whereas the other accepts multiple ranges and returns a `MultipartReadableStream` with a `multipart/byteranges` `Content-Type` header. This multiple-range overload will always return a `multipart/byteranges` body, even with zero or one ranges. The caller should decide when this overload makes sense, as some data stores like R2 only support single-ranged reads. `get` will return `null` when a blob with the specified ID can't be found.
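For example, a Cache `match` with multiple ranges might use the multi-range overload like this (assuming the `BlobStore` sketch above):

```ts
// Hypothetical multi-range read for Cache, using the BlobStore sketch above
declare const blobs: BlobStore;
declare const blobId: BlobId;

const ranges: Range[] = [
  { start: 0, end: 99 },
  { start: 200, end: 299 },
];
const result = await blobs.get(blobId, ranges);
if (result !== null) {
  // multipart/byteranges responses use status 206 Partial Content
  new Response(result.body, {
    status: 206,
    headers: { "Content-Type": result.multipartContentType },
  });
}
```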
This interface makes it possible to `delete` a blob whilst performing a streaming `get`. On the file-system, `delete` will be implemented using `fs.unlink`. This has the behaviour of `unlink(2)`, specifically...

This means the file will only be deleted once the streaming `get` has finished, so we don't have a problem here. This is the behaviour on Windows too.
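A quick Node.js illustration of why this works (the blob path is made up):

```ts
import { open, unlink } from "node:fs/promises";

const path = ".mf/blobs/example-blob"; // illustrative path
const handle = await open(path, "r");  // open the blob before deleting it
const stream = handle.createReadStream();
await unlink(path);                     // removes the name immediately...
for await (const chunk of stream) {
  // ...but the data stays readable until the handle/stream is closed
}
```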