
Support for numbered filesets #1720

Merged: justinjc merged 13 commits into master from juchan/seek-mgr on Jun 21, 2019

Conversation

@justinjc (Collaborator) commented Jun 10, 2019:

What this PR does / why we need it:

With cold flushes, we need to support having multiple filesets per block. This PR introduces volume numbers to filesets so that new filesets will be written with an index as part of the filename. In order to facilitate a smooth migration, there is additional logic for gracefully handling legacy (non-indexed) filesets.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE
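As context for the change described above, here is a minimal sketch of the naming difference between legacy and indexed filesets. The helper names mirror filesetPathFromTimeLegacy and filesetPathFromTimeAndIndex discussed in the review below, but the directory layout and exact filename format here are assumptions for illustration, not the actual m3db implementation.

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// Legacy (non-indexed) fileset path, e.g. .../fileset-<blockStartNanos>-data.db.
// The format string is illustrative only.
func filesetPathFromTimeLegacy(prefix string, t time.Time, suffix string) string {
	return filepath.Join(prefix, fmt.Sprintf("fileset-%d-%s.db", t.UnixNano(), suffix))
}

// Indexed fileset path carrying a volume number,
// e.g. .../fileset-<blockStartNanos>-<volume>-data.db.
func filesetPathFromTimeAndIndex(prefix string, t time.Time, index int, suffix string) string {
	return filepath.Join(prefix, fmt.Sprintf("fileset-%d-%d-%s.db", t.UnixNano(), index, suffix))
}

func main() {
	blockStart := time.Unix(0, 1560384000000000000)
	fmt.Println(filesetPathFromTimeLegacy("/var/lib/m3db/data", blockStart, "data"))
	fmt.Println(filesetPathFromTimeAndIndex("/var/lib/m3db/data", blockStart, 0, "data"))
}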

justinjc changed the title from "Support for numbered filesets" to "[WIP] Support for numbered filesets" on Jun 10, 2019
Resolved review comments on: src/dbnode/persist/fs/retriever.go (2), src/dbnode/persist/fs/seek_manager.go, src/dbnode/storage/shard.go
@richardartoul (Contributor) left a comment:

Could you add a sanity check to the seeker/reader when they read the info file to make sure that the volume number in the info file matches what we expect it to be? It would give us some extra assurance that our logic is right.
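A rough illustration of the kind of check being suggested; the info struct field and variable names here are hypothetical, not the actual seeker/reader types:

// After decoding the info file, verify that its volume matches the volume
// the seeker/reader was asked to open.
if info.VolumeIndex != expectedVolumeIndex {
	return fmt.Errorf(
		"info file volume mismatch: expected %d, got %d",
		expectedVolumeIndex, info.VolumeIndex,
	)
}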

Resolved review comments on: src/dbnode/persist/fs/files.go (3)
// and error if neither exist.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	_, err := os.Stat(path)
Contributor:

You need to handle the os.IsNotExist error separately from other types of errors: https://stackoverflow.com/questions/12518876/how-to-check-if-a-file-exists-in-go

I.e.:

if os.IsNotExist(err) {
    return false, nil
}
if err != nil {
    return false, err
}
...

@justinjc (Collaborator, Author):

Yeah, I saw that, but I'd argue that we don't really care about the other err cases. Either one of the two types of files exists, or neither of them does, in which case that's the error we propagate down.

Contributor:

Yeah, I guess it's fine because of how you're using the function (there is an implicit assumption that a file should exist, regardless of whether it's legacy or not). Maybe just add that assumption as a comment above the function.

@justinjc (Collaborator, Author):

Will do 👍
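Putting this thread together, a sketch of how the helper might read with the assumption documented, using the path helpers quoted above (an illustration, not necessarily the merged code):

// isFirstVolumeLegacy returns whether the first fileset volume for the given
// block is legacy, i.e. not indexed. It assumes that either an indexed or a
// legacy file exists; if neither does, an error is returned.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	// Check for a volume-0 (indexed) fileset first.
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	if _, err := os.Stat(path); err == nil {
		return false, nil
	}

	// Fall back to the legacy (non-indexed) naming scheme.
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	if _, err := os.Stat(legacyPath); err == nil {
		return true, nil
	}

	// Neither file exists; surface an error to the caller.
	return false, fmt.Errorf("first fileset volume not found for prefix=%s, suffix=%s", prefix, suffix)
}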

Resolved review comments on: src/dbnode/persist/fs/files.go, src/dbnode/persist/fs/types.go, src/dbnode/storage/cleanup.go, src/dbnode/storage/shard.go (3)
codecov bot commented Jun 14, 2019:

Codecov Report

Merging #1720 into master will increase coverage by <.1%.
The diff coverage is 88%.


@@           Coverage Diff            @@
##           master   #1720     +/-   ##
========================================
+ Coverage    71.9%   71.9%   +<.1%     
========================================
  Files         982     982             
  Lines       82097   82216    +119     
========================================
+ Hits        59093   59189     +96     
- Misses      19107   19124     +17     
- Partials     3897    3903      +6
Flag         Coverage    Δ
#aggregator  82.4% <ø>   (ø) ⬆️
#cluster     85.7% <ø>   (ø) ⬆️
#collector   63.9% <ø>   (ø) ⬆️
#dbnode      80% <88%>   (-0.1%) ⬇️
#m3em        73.2% <ø>   (ø) ⬆️
#m3ninx      74.1% <ø>   (ø) ⬆️
#m3nsch      51.1% <ø>   (ø) ⬆️
#metrics     17.6% <ø>   (ø) ⬆️
#msg         74.7% <ø>   (ø) ⬆️
#query       66.3% <ø>   (ø) ⬆️
#x           85.2% <ø>   (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c897aa1...0190701.

justinjc changed the title from "[WIP] Support for numbered filesets" to "Support for numbered filesets" on Jun 17, 2019
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	_, err = os.Stat(legacyPath)
	if err == nil {
		return true, nil
@robskillington (Collaborator) commented Jun 18, 2019:

Do you also need to check/verify for a completed checkpoint file? (not sure who relies on this path).

@justinjc (Collaborator, Author):

This is a good point. This function should be used solely to figure out whether the first volume is legacy or not before we actually use the filesets, so checking for a complete checkpoint file should come after this function is called. I'll add a comment.

Contributor:

I assume it's fine because this is only used by the reader to determine what code path to follow, and if the checkpoint file is not complete the reader will figure that out pretty quickly and error out?

@justinjc (Collaborator, Author):

Yep, correct.

// files does not use the volume index provided in the prepareOpts.
// Instead, the new volume index is determined by looking at what files
// exist on disk. This means that there can never be a conflict when
// trying to write new snapshot files.
Collaborator:

Kind of strange, but why do snapshots use one mechanism (i.e. increment the new volume index by looking at files on disk) while fileset volumes use a different one? Richie alluded that it was "hard", but it would be interesting to hear what the difficulty was with just making them use similar/same code paths.

@richardartoul (Contributor) commented Jun 18, 2019:

Yeah, I kind of did a poor job explaining this. There are a couple of subtle complexities that we encountered, but basically we realized that we need to maintain some kind of relationship between the flush version and the volume number for a variety of reasons. The one that immediately comes to mind is that the SeekerManager does a lot of checks to see whether specific fileset files exist. In order to do that efficiently it needs to be able to go from namespace/shard/blockStart/volume -> expected filename.

Once we decided that we wanted to maintain that relationship the "simplest" thing is to ensure that the flush version and the volume number are always the same. Technically we could do the +1 logic and still maintain this guarantee but that would make it implicit instead of explicit and a brittle assumption. Basically instead of having two sources of truth (flush version in memory and volume numbers on disk) we only have one and if they're out of sync for any reason we'll at least get errors as opposed to silent successes.

So we decided to just tell the PersistManager / writer which volume we intended to create and that has some nice properties like if our logic is wrong the PersistManager can error out and tell us that the volume already exists (as opposed to it just incrementing by 1 and subtly breaking our assumptions).

This is also nice because it forces users of the PersistManager to think about what volume they're writing and whether or not thats valid.

Since we took this route we do need to bootstrap the flush version back into the shard, but that's a trivial change since the shard already calls ForEachInfoFile, so all we need to do is pull the volume number out of the info metadata and into the shard memory state.

@justinjc Do you remember any other details? We spent a few hours working through all this and I feel like I may be missing some of the other reasons we couldn't do the incrementing.

@robskillington Thinking about this now though, it seems like it might be good to add a check in the persist manager that the volume being written is at least as high as the highest existing complete fileset file. I.e. if complete versions of 1, 2, 3 exist, then writing 3 (if deleteIfExist=true) and 4 should be allowed, but writing 2 should not. What do you think @justinjc? Could also add it to your list of follow ups.
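A rough sketch of the explicit-volume behaviour described above, including where the proposed follow-up check would slot in; the function name and options are illustrative, not the real PersistManager API:

// prepareFlushVolume fails loudly if the caller-specified volume already
// exists on disk, rather than silently bumping to the next free index.
func prepareFlushVolume(
	prefix string,
	blockStart time.Time,
	volume int,
	deleteIfExists bool,
	suffix string,
) error {
	path := filesetPathFromTimeAndIndex(prefix, blockStart, volume, suffix)
	_, err := os.Stat(path)
	if err == nil && !deleteIfExists {
		return fmt.Errorf("fileset volume %d already exists: %s", volume, path)
	}
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	// The suggested follow-up would additionally verify here that volume is at
	// least as high as the highest complete volume already on disk for this block.
	return nil
}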

Contributor:

We could also support a flag in the PersistManager that basically tells it to figure out the volume number itself which we would probably need if we wanted to allow incremental peer bootstrapping and flushing to occur in parallel, but not sure we want to go with that approach long term (see my other comment below)

Contributor:

@justinjc just pinging your thoughts on that potential follow up

@justinjc (Collaborator, Author):

Thanks for the explanation here @richardartoul, you summed it up pretty well. In short, we wanted an efficient way to figure out what the correct volume is for a particular namespace/shard/block, and having to go to disk, parse filenames, and then order them each time seemed significantly slower.

I'm not a big fan of the persist manager having the volume number check. The check would similarly involve looking for files in the specified directory, parsing file names, and ordering them, which amounts to additional logic that's essentially double-checking that other logic isn't buggy. Also, from a separation of concerns point of view, I don't think the persist manager should need to know that fileset volumes are always increasing. Its job should just be to persist the things you tell it to persist. It should be the caller of the persist manager that ensures the volume number it passes is correct.

DeleteIfExists bool
// This volume index is only used when preparing for a flush fileset type.
// When opening a snapshot, the new volume index is determined by looking
// at what files exist on disk.
Collaborator:

Again, it does seem strange that they don't both just use the same strategy.

Contributor:

Yeah, agreed it's a little weird. Justin's initial implementation did do the +1 strategy, but we ran into some sharp edges with that approach that pushed us in the other direction.

@robskillington (Collaborator) left a comment:

Looking mostly good!

// bootstrap, then the bootstrap volume index will be >0. In order
// to support this, bootstrapping code will need to incorporate
// merging logic and flush version/volume index will need to be
// synchronized between processes.
@robskillington (Collaborator) commented Jun 18, 2019:

Just a thought: I assume if we ever want to do repairs we might also need to support volume index >0 for bootstrapping. (If an "online" peer bootstrapping is the process that repairs).

Contributor:

@robskillington The designed repair feature won't use the peers bootstrapper at all, right? Re: incremental bootstrap, what I've been thinking about recently is to eventually move away from having incremental in the peers bootstrapper and instead combine cold flushing + streaming bootstrap to get the same effect.

The alternative approach (where we support index > 0 and then basically do merging from there) would also work, but I think it would make allowing M3DB to flush while it's bootstrapping really difficult (which is a place I think we want to get to long term), because there would have to be coordination of the flush versions.

@richardartoul (Contributor) left a comment:

LGTM aside from the nits.

Let's try and get a stamp from @robskillington too.

Resolved review comment on: src/dbnode/persist/fs/files.go
// delimiters found, to be used in conjunction with the intComponentFromFilename
// function to extract filename components. This function is deliberately
// optimized for speed and lack of allocations, since filename parsing is known
// to account for a significant proportion of system resources.
Contributor:

"since allocation-heavy filename parsing can quickly become the large source of allocations in the entire system, especially when namespaces with long retentions are configured."

func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
Contributor:

niiiice

// delimiters. Our only use cases for this involve extracting numeric
// components, so this function assumes this and returns the component as an
// int64.
func intComponentFromFilename(
Contributor:

How do you feel about "componentAtIndex" as the name? The current name gave me the wrong impression until I read the comment

@justinjc (Collaborator, Author):

Sure. I'll call it intComponentAtIndex because I do want to say that it deals with ints only.

	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	var start int
Contributor:

Can you do var start = 0? That would have given me a nudge in the right direction to understand how you're using the positions.

@justinjc (Collaborator, Author):

Yeah makes sense.
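For reference, a minimal sketch of how these two helpers could fit together, assuming '-' as the delimiter and a small fixed maxDelimNum; this illustrates the allocation-free parsing approach discussed above, not the actual implementation:

package main

import (
	"fmt"
	"strconv"
)

const maxDelimNum = 4

// delimiterPositions records the positions of up to maxDelimNum '-' delimiters
// in baseFilename without allocating, returning the fixed-size array of
// positions and the number of delimiters found.
func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
	var (
		positions [maxDelimNum]int
		found     int
	)
	for i := 0; i < len(baseFilename) && found < maxDelimNum; i++ {
		if baseFilename[i] == '-' {
			positions[found] = i
			found++
		}
	}
	return positions, found
}

// intComponentAtIndex extracts the numeric component between the
// (componentPos-1)-th and componentPos-th delimiters (component 0 precedes
// the first delimiter) and returns it as an int64.
func intComponentAtIndex(
	baseFilename string,
	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	if componentPos < 0 || componentPos >= maxDelimNum {
		return 0, fmt.Errorf("component position %d out of range", componentPos)
	}
	var start = 0
	if componentPos > 0 {
		start = delimPos[componentPos-1] + 1
	}
	end := delimPos[componentPos]
	if start > end || end > len(baseFilename) {
		return 0, fmt.Errorf("no component at position %d in %q", componentPos, baseFilename)
	}
	return strconv.ParseInt(baseFilename[start:end], 10, 64)
}

func main() {
	// e.g. "fileset-1560384000000000000-0-data.db": component 1 is the block
	// start and component 2 is the volume index (this layout is an assumption).
	name := "fileset-1560384000000000000-0-data.db"
	delims, _ := delimiterPositions(name)
	volume, err := intComponentAtIndex(name, 2, delims)
	fmt.Println(volume, err)
}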

// when we attempt to flush and a fileset already exists unless there is
// racing competing processes.
// filesets exist at bootstrap time so we should never encounter a time
// when we attempt to flush and a fileset that already exists unless
Contributor:

This extra "that" seems wrong.

@justinjc (Collaborator, Author):

Reworded this.

// there are racing competing processes.
Contributor:

"there is a bug in the code" or something

@justinjc (Collaborator, Author):

Haha sure.

@@ -2006,7 +2019,7 @@ func (s *dbShard) ColdFlush(

// After writing the full block successfully, update the cold version
// in the flush state.
nextVersion := s.RetrievableBlockColdVersion(startTime) + 1
nextVersion := coldVersion + 1
Contributor:

Yeah so there is +1 logic here and in the Merge() function. Would prefer we figure out the next version here and then tell the Merge() function to use that

@justinjc (Collaborator, Author):

Sure.

blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
for _, datafile := range filesets {
Contributor:

Maybe add a comment saying this is safe because cleanup and flushing never happen in parallel, so the snapshot will never get stale.

Also, we have BlockStatesSnapshot() and FlushState(); do those two duplicate each other?

@justinjc (Collaborator, Author):

Sure I'll add a comment. Also, since versions are always increasing, even if we get stale snapshots, we just don't clean up some files that can be cleaned (but we'll clean them up eventually next time round).

BlockStatesSnapshot gets the flush states for all block starts, while FlushState is for a specific block start. Wouldn't want to loop through all blocks and keep acquiring and releasing locks (mentioned in a comment above).

// locks in a tight loop below.
blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
Contributor:

I'd probably call this toDelete or something

Contributor:

Also, this is sketching me out lol, can we just allocate a new slice? This is only once per shard so it's not so bad.

@justinjc (Collaborator, Author):

Alllllright.
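For reference, a sketch of the two variants discussed in this thread; shouldDelete and fileSetType are placeholders for the real block-state check and element type, not names from this PR:

// Variant 1: filter in place, reusing the backing array of filesets.
// No allocation, but it mutates the input slice's backing storage.
toDeleteInPlace := filesets[:0]
for _, datafile := range filesets {
	if shouldDelete(datafile) {
		toDeleteInPlace = append(toDeleteInPlace, datafile)
	}
}

// Variant 2: allocate a fresh slice, as suggested above. One allocation per
// shard per cleanup pass, and the input slice is left untouched.
toDeleteAlloc := make([]fileSetType, 0, len(filesets))
for _, datafile := range filesets {
	if shouldDelete(datafile) {
		toDeleteAlloc = append(toDeleteAlloc, datafile)
	}
}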

justinjc merged commit 2eec9d4 into master on Jun 21, 2019
justinjc deleted the juchan/seek-mgr branch on June 21, 2019 at 17:23