Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/manifest: encode/decode range keys in manifest #1541

Merged
merged 1 commit into from
Mar 11, 2022

Conversation

nicktrav
Copy link
Contributor

@nicktrav nicktrav commented Feb 26, 2022

Currently, only point key bounds for a table are present in a manifest
entry. With the addition of range keys, the manifest needs to track the
point and range key bounds, in addition to providing a mechanism to
infer the overall bounds for the table.

Adopt the following encoding scheme to incorporate range keys into the
manifest:

  • if the table only contains point keys, the existing encoding applies -
    a new file entry is written with tagNewFile4 and the smallest and
    largest point keys are used as the smallest and largest bounds for the
    table.

  • else, the table contains range keys, and the new file is written with
    a new tag, tagNewFile4RangeKeys that extends tagNewFile4:

    • a "marker" byte is written that indicates if the table also has
      point keys, and how the bounds can be used to reconstruct the
      overall table bounds. In increasing order of bit significance:

      • if the table contains point keys, 1, else 0
      • if the smallest key is a point key, 1, else 0
      • if the largest key is a point key, 1, else 0
    • the smallest/largest pairs are written - first the point key bounds
      (if present), then the range key bounds.

  • any custom tags are written (e.g. table marked for compaction, etc.).

This new encoding scheme uses no additional space in the manifest if
there are only point keys, and remains backwards compatible with the
existing manifest format. If range keys are present, each new file
requires two additional bytes (a varint with value 1 or 2 for the
smallest/largest keypair count, plus a byte for the bounds marker) on
top of what it takes to encode the smallest/largest point key bounds.

This approach also has the advantage of encoding enough information to
be able to fully reconstruct the overall table bounds at decode time,
rather than waiting until a Compare function is available to perform
the comparisons.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nicktrav nicktrav mentioned this pull request Feb 26, 2022
29 tasks
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: great work! I like the fact that the format lets us instantiate a FileMetadata without a Compare.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jbowens, @nicktrav, and @sumeerbhola)


internal/manifest/version_edit.go, line 58 at r1 (raw file):

	// Pebble tags.
	tagNewFileRangeKeys = 300

Nit: maybe we should call this tagNewFile4RangeKeys to be clearer that it extends tagNewFile4?

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


internal/manifest/version.go, line 248 at r1 (raw file):

)

// boundsMarker returns a marker byte whose bits encode as the following:

Note to self: after talking with @sumeerbhola, using two bits seems like a better way to go than half a byte :) It gives us 6 bits that we could potentially use in the future, plus bitmasks are more standard than half byte masks :)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


internal/manifest/version.go, line 248 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Note to self: after talking with @sumeerbhola, using two bits seems like a better way to go than half a byte :) It gives us 6 bits that we could potentially use in the future, plus bitmasks are more standard than half byte masks :)

Maybe it's over optimizing, but could we use one of those extra bits for the key-pairs count?

I'm thinking one byte with three used bits:

  • maskContainsPointKeys— set if the file contains range keys
  • maskSmallest1 if the smallest comes from the point bound, 0 if it comes from the range bound. Required to be 0 if b & maskContainsPointKeys == 0
  • maskLargest1 if the largest comes from the point bound, 0 if it comes from the range bound. Required to be 0 if b & maskContainsPointKeys == 0

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


internal/manifest/version.go, line 248 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Maybe it's over optimizing, but could we use one of those extra bits for the key-pairs count?

I'm thinking one byte with three used bits:

  • maskContainsPointKeys— set if the file contains range keys
  • maskSmallest1 if the smallest comes from the point bound, 0 if it comes from the range bound. Required to be 0 if b & maskContainsPointKeys == 0
  • maskLargest1 if the largest comes from the point bound, 0 if it comes from the range bound. Required to be 0 if b & maskContainsPointKeys == 0

Love it.

@nicktrav nicktrav force-pushed the nickt.manifest-range-keys branch 2 times, most recently from 0808cb4 to 54f768c Compare March 1, 2022 04:10
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, all discussions resolved (waiting on @itsbilal and @sumeerbhola)


internal/manifest/version_edit.go, line 58 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: maybe we should call this tagNewFile4RangeKeys to be clearer that it extends tagNewFile4?

Done.

@nicktrav
Copy link
Contributor Author

nicktrav commented Mar 1, 2022

Going to keep open until we make the 22.1 branch cut.

@nicktrav nicktrav force-pushed the nickt.manifest-range-keys branch 3 times, most recently from 23a2bf3 to 47b0226 Compare March 10, 2022 16:29
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 5 files at r3.
Reviewable status: 3 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @nicktrav)


internal/manifest/version.go, line 267 at r3 (raw file):

// - if the table contains point keys
// - if the table's smallest key is a point key
// - if the table's largest key is a point key

The PR description needs to be updated for this revised scheme.


internal/manifest/version_edit.go, line 58 at r3 (raw file):

	// Pebble tags.
	tagNewFile4RangeKeys = 300

I can't remember which tags belong to the same "space", i.e., which ones can't collide. And are all these encoded as unsigned varints?
I am wondering since it isn't clear to me why the value 300 was chosen, instead of 104?

May be a good time to add code comments for these tags.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version.go, line 267 at r3 (raw file):

Previously, sumeerbhola wrote…

The PR description needs to be updated for this revised scheme.

Good catch. Update the PR description. The commit message was already updated.


internal/manifest/version_edit.go, line 58 at r3 (raw file):

And are all these encoded as unsigned varints?

I believe this to be true.

isn't clear to me why the value 300 was chosen, instead of 104?

No particular reason for 300 - we have the 1xx values for RocksDB tags that we inherited, and 2xx for column families. I was a little wary of using something in those ranges for fear of incompatibilities with stores being migrated from Cockroach to Pebble, but maybe that ship has sailed already. That said, I just took another look at the corresponding code in RocksDB, and their latest tag is still 103, so I guess we could use 104, and even revert the name of the tag identifier to tagNewFile5, as it's literally the "fifth new file format" (in the LevelDB -> RocksDB -> Pebble lineage).

I'll make the suggested changes, but happy to revert. @jbowens - any thoughts here?

May be a good time to add code comments for these tags.

Agree. It's also time we documented the encoding. I'll make sure this gets rolled in.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version_edit.go, line 58 at r3 (raw file):

Cockroach to Pebble

Woops. RocksDB to Pebble.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the 22.1 branch is cut, this can land.

Reviewable status: 2 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r3, 1 of 1 files at r4.
Reviewable status: 4 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version_edit.go, line 58 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Cockroach to Pebble

Woops. RocksDB to Pebble.

I am fine if we want to be conservative wrt the tag numbering, as long as we document our reasoning.

I think a single byte unsigned varint can encode [0, 128).

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: 4 of 5 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


-- commits, line 18 at r4:
nit: s/tagNewFile4RangeKeys/tagNewFile5/


-- commits, line 36 at r4:
nit: this sentence about 'two additional bytes' is out of date now, right?


internal/manifest/version_edit.go, line 58 at r3 (raw file):

Previously, sumeerbhola wrote…

I am fine if we want to be conservative wrt the tag numbering, as long as we document our reasoning.

I think a single byte unsigned varint can encode [0, 128).

104 sounds good to me


internal/manifest/version_test.go, line 463 at r4 (raw file):

		}
		fmt.Fprintf(&b, "%s\n", m.DebugString(base.DefaultFormatter, true))
		fmt.Fprintf(&b, "  bounds: (smallest=%s,largest=%s) (0x%02x)\n", smallest, largest, bounds)

nit: could use %08b to format the bounds marker so that it's easier to interpret as a bit field

Currently, only point key bounds for a table are present in a manifest
entry. With the addition of range keys, the manifest needs to track the
point and range key bounds, in addition to providing a mechanism to
infer the overall bounds for the table.

Adopt the following encoding scheme to incorporate range keys into the
manifest:

- if the table only contains point keys, the existing encoding applies -
  a new file entry is written with `tagNewFile4` and the smallest and
  largest point keys are used as the smallest and largest bounds for the
  table.

- else, the table contains range keys, and the new file is written with
  a new tag, `tagNewFile5` that extends `tagNewFile4`:

  - a "marker" byte is written that indicates if the table also has
    point keys, and how the bounds can be used to reconstruct the
    overall table bounds. In increasing order of bit significance:

    - if the table contains point keys, `1`, else `0`
    - if the smallest key is a point key, `1`, else `0`
    - if the largest key is a point key, `1`, else `0`

  - the smallest/largest pairs are written - first the point key bounds
    (if present), then the range key bounds.

- any custom tags are written (e.g. table marked for compaction, etc.).

This new encoding scheme uses no additional space in the manifest if
there are only point keys, and remains backwards compatible with the
existing manifest format. If range keys are present, each new file
requires a single additional byte that encodes whether the table
contains point keys in addition to how to reconstruct the overall bounds
from the point and/or range keys in the table.

This approach also has the advantage of encoding enough information to
be able to fully reconstruct the overall table bounds at decode time,
rather than waiting until a `Compare` function is available to perform
the comparisons.
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


-- commits, line 18 at r4:

Previously, jbowens (Jackson Owens) wrote…

nit: s/tagNewFile4RangeKeys/tagNewFile5/

Done.


-- commits, line 36 at r4:

Previously, jbowens (Jackson Owens) wrote…

nit: this sentence about 'two additional bytes' is out of date now, right?

Done.


internal/manifest/version_edit.go, line 58 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

104 sounds good to me

👍


internal/manifest/version_test.go, line 463 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: could use %08b to format the bounds marker so that it's easier to interpret as a bit field

Good call. Done.

@nicktrav
Copy link
Contributor Author

TFTRs!

@nicktrav nicktrav merged commit 56c5aeb into cockroachdb:master Mar 11, 2022
@nicktrav nicktrav deleted the nickt.manifest-range-keys branch March 11, 2022 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants