Skip to content

Concerns and Decisions

Alan Gutierrez edited this page Feb 27, 2012 · 4 revisions

Generally I design my software by listing my concerns and then making decisions. These usually form a journal, but lately, I've simply taken to creating a concerns and decisions document such as this one. Feel free email me at alan@prettyrobots.com to discuss your concerns and decisions regarding Strata.

Medic

To complete the project, I'll need a medic, some form of recovery.

  • Everything is an append.
  • Files are renamed in an order.
  • The medic needs to know the difference between a branch and a leaf.
  • The left most leaf is always 1. The root is always 0.

Medic recoveries are simple, really. We write to file. If we die during a split, then we can recover by playing things forward.

We write out files named .0, .1, .2. We then move them into place. We won't know that .0 is written. We really need to touch a file, call it relink.lock. Then we move the temporary files, remove relink.lock.

Recover is rough, though. We can read through the files. But, hey, monkey balls. I am fucking tired. Night-night.

After restarting my consideration of Medic, I've come to this.

Currently, Strata is syncing constantly, so the only place where strata could fail, would be in the middle of one or more appends to leaf pages. When rewriting pages, the writes are journaled. New pages are written, the old files unlinked, and the new files are renamed into place.

I'm tired... Rewrite please.

Unfortunately, there is no good way to determine which pages.

Healing file system corruption is out of scope for Strata. That is, if a chunk of a page goes missing because the hard drive is failing, Strata does not have a method to detect the failure, other than to eventually visit the page.

Chances are good that Strata will detect a corrupt page, since JSON is somewhat fragile, and a missing closing bracket will cause a parse error. Any read of the b-tree can potentially produce a corruption exception.

Thus, there is a severity when a corrupted page is detected. When it is the last record on a page, it is assumed that this is a failed write. It could be vicious truncation, however, and there would be no good way to detect otherwise. Again, this would indicate file system corruption. If a bad line is found before the last line, that is a sure sign of file system corruption. Thus, the error levels are bad and very bad, however if you're not aware of any file system corruption, than bad probably only means a single lost record.

Strata is a database primitive. Probably the best way to aliviate the problems with bad and very bad would be to keep a write ahead log, which could simply be another Strata b-tree, where you write operations you indend to perform, then you perform them. In time, you purge old operations. This would allow you to start your database by checking that the last operations went through. You can navigate to their pages and check for corruption, which may indicate that it is simply a bad append. Afterward you know that any error that is bad is actually very bad.

Finally, we can have a simple way to check the append. We can write the size of the page on disk to file, which would be a poor man's write ahead log. Or, better still, when we get our page to insert, we can write its size then, to our journal, before we append, but this requires fsyncing a lot. It gets confusing, the trade-offs.

The Medic will be able to rewrite a page, reading in the objects, inserting them back into the tree. The medic will roll forward any rewrites.

In any case, if we find a bad block, we can run through the leaves, re-reading our records, writing them out again.

Strata will benefit from software RAID, hardware RAID, and regular backups.

What would Strata do? Pause every once in a while and visit each page, seeing if it can load it? Reading every page into memory? B-trees aren't about reading everything into memory. When would it occour? At startup? Every once in a while? You can move some of this to a journal and then link to it.

Note that the line oriented nature makes medic effective.

The decision is that we detect corruption by looking for bad JSON. Errors within the scope of Strata include a failed append, or else failure during a balance operation. The Medic will be able to roll forward, or else roll back, a failed append.

We might have been in the middle of a balance that never occured. Ah!

We also use the meta data of the files. We have an open file that lets us know when. Then when we recover, if there is an open file, we look for files that are newer, and then check those files for balances and failed appends. That is the poor mans way. We can touch that file prior to a balance, maybe, or maybe not, because then you'd have to make sure that you had some sort of global lock, which actually, would not be that hard to engineer, get a lock on the root element, right? No. You'd have to descend. You'd have to have an exclusive lock count, easy, then lock the root. Not hard.

Duh! Okay, Medic remembered.

SHA1 Checksums

Considered adding SHA1 checkums on each line. Skipped it for now. Would like to find a checksum implementation that can loaded and resumed. If there is a hard shutdown, we can checksum a few lines to detect corruption.

However, JSON is a fagile format, so it is quite likely that a corrupt line of JSON will not parse correctly. The absence of of the closing ]\n can be detected prior to attempting to parse. If that is missing or the parse raises an exception, then we know that we have a corrupt line.

Caching

Reading about 1 GB limits in V8 where performance degrades greatly as the stop-the-world garbage collector runs constantly.

Clone this wiki locally