
Support for numbered filesets #1720

Merged: justinjc merged 13 commits into master from juchan/seek-mgr on Jun 21, 2019

Conversation

@justinjc (Collaborator) commented Jun 10, 2019:

What this PR does / why we need it:

With cold flushes, we need to support having multiple filesets per block. This PR introduces volume numbers to filesets so that new filesets will be written with an index as part of the filename. In order to facilitate a smooth migration, there is additional logic for gracefully handling legacy (non-indexed) filesets.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE
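As context for the change described above, here is a minimal sketch of the naming difference between legacy and indexed filesets. The helper names mirror filesetPathFromTimeLegacy and filesetPathFromTimeAndIndex discussed in the review below, but the directory layout and exact filename format here are assumptions for illustration, not the actual m3db implementation.

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// Legacy (non-indexed) fileset path, e.g. .../fileset-<blockStartNanos>-data.db.
// The format string is illustrative only.
func filesetPathFromTimeLegacy(prefix string, t time.Time, suffix string) string {
	return filepath.Join(prefix, fmt.Sprintf("fileset-%d-%s.db", t.UnixNano(), suffix))
}

// Indexed fileset path carrying a volume number,
// e.g. .../fileset-<blockStartNanos>-<volume>-data.db.
func filesetPathFromTimeAndIndex(prefix string, t time.Time, index int, suffix string) string {
	return filepath.Join(prefix, fmt.Sprintf("fileset-%d-%d-%s.db", t.UnixNano(), index, suffix))
}

func main() {
	blockStart := time.Unix(0, 1560384000000000000)
	fmt.Println(filesetPathFromTimeLegacy("/var/lib/m3db/data", blockStart, "data"))
	fmt.Println(filesetPathFromTimeAndIndex("/var/lib/m3db/data", blockStart, 0, "data"))
}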

justinjc changed the title from "Support for numbered filesets" to "[WIP] Support for numbered filesets" on Jun 10, 2019
Resolved review comments on: src/dbnode/persist/fs/retriever.go (2), src/dbnode/persist/fs/seek_manager.go, src/dbnode/storage/shard.go
@richardartoul (Contributor) left a comment:

Could you add a sanity check to the seeker/reader when they read the info file to make sure that the volume number in the info file matches what we expect it to be? It would give us some extra assurance that our logic is right.
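A rough illustration of the kind of check being suggested; the info struct field and variable names here are hypothetical, not the actual seeker/reader types:

// After decoding the info file, verify that its volume matches the volume
// the seeker/reader was asked to open.
if info.VolumeIndex != expectedVolumeIndex {
	return fmt.Errorf(
		"info file volume mismatch: expected %d, got %d",
		expectedVolumeIndex, info.VolumeIndex,
	)
}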

Resolved review comments on: src/dbnode/persist/fs/files.go (3)
// and error if neither exist.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	_, err := os.Stat(path)
Contributor:

You need to handle the os.IsNotExist error separately from other types of errors: https://stackoverflow.com/questions/12518876/how-to-check-if-a-file-exists-in-go

I.e.:

if os.IsNotExist(err) {
    return false, nil
}
if err != nil {
    return false, err
}
...

@justinjc (Collaborator, Author):

Yeah, I saw that, but I'd argue that we don't really care about the other err cases. Either one of the two types of files exists, or neither of them does, in which case that's the error we propagate down.

Contributor:

Yeah, I guess it's fine because of how you're using the function (there is an implicit assumption that a file should exist, regardless of whether it's legacy or not). Maybe just add that assumption as a comment above the function.

@justinjc (Collaborator, Author):

Will do 👍
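Putting this thread together, a sketch of how the helper might read with the assumption documented, using the path helpers quoted above (an illustration, not necessarily the merged code):

// isFirstVolumeLegacy returns whether the first fileset volume for the given
// block is legacy, i.e. not indexed. It assumes that either an indexed or a
// legacy file exists; if neither does, an error is returned.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	// Check for a volume-0 (indexed) fileset first.
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	if _, err := os.Stat(path); err == nil {
		return false, nil
	}

	// Fall back to the legacy (non-indexed) naming scheme.
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	if _, err := os.Stat(legacyPath); err == nil {
		return true, nil
	}

	// Neither file exists; surface an error to the caller.
	return false, fmt.Errorf("first fileset volume not found for prefix=%s, suffix=%s", prefix, suffix)
}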

Resolved review comments on: src/dbnode/persist/fs/files.go, src/dbnode/persist/fs/types.go, src/dbnode/storage/cleanup.go, src/dbnode/storage/shard.go (3)
codecov bot commented Jun 14, 2019:

Codecov Report

Merging #1720 into master will increase coverage by <.1%.
The diff coverage is 88%.


@@           Coverage Diff            @@
##           master   #1720     +/-   ##
========================================
+ Coverage    71.9%   71.9%   +<.1%     
========================================
  Files         982     982             
  Lines       82097   82216    +119     
========================================
+ Hits        59093   59189     +96     
- Misses      19107   19124     +17     
- Partials     3897    3903      +6
Flag         Coverage    Δ
#aggregator  82.4% <ø>   (ø) ⬆️
#cluster     85.7% <ø>   (ø) ⬆️
#collector   63.9% <ø>   (ø) ⬆️
#dbnode      80% <88%>   (-0.1%) ⬇️
#m3em        73.2% <ø>   (ø) ⬆️
#m3ninx      74.1% <ø>   (ø) ⬆️
#m3nsch      51.1% <ø>   (ø) ⬆️
#metrics     17.6% <ø>   (ø) ⬆️
#msg         74.7% <ø>   (ø) ⬆️
#query       66.3% <ø>   (ø) ⬆️
#x           85.2% <ø>   (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c897aa1...0190701.

justinjc changed the title from "[WIP] Support for numbered filesets" to "Support for numbered filesets" on Jun 17, 2019
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	_, err = os.Stat(legacyPath)
	if err == nil {
		return true, nil
@robskillington (Collaborator) commented Jun 18, 2019:

Do you also need to check/verify for a completed checkpoint file? (not sure who relies on this path).

@justinjc (Collaborator, Author):

This is a good point. This function should be used solely to figure out whether the first volume is legacy or not before we actually use the filesets, so checking for a complete checkpoint file should come after this function is called. I'll add a comment.

Contributor:

I assume it's fine because this is only used by the reader to determine what code path to follow, and if the checkpoint file is not complete the reader will figure that out pretty quickly and error out?

@justinjc (Collaborator, Author):

Yep, correct.

// files does not use the volume index provided in the prepareOpts.
// Instead, the new volume index is determined by looking at what files
// exist on disk. This means that there can never be a conflict when
// trying to write new snapshot files.
Collaborator:

Kind of strange, but why do snapshots use one mechanism (i.e. increment the new volume index by looking at files on disk) while fileset volumes use a different one? Richie alluded that it was "hard", but it would be interesting to hear what the difficulty was with just making them use similar/same code paths.

@richardartoul (Contributor) commented Jun 18, 2019:

Yeah, I kind of did a poor job explaining this. There are a couple of subtle complexities that we encountered, but basically we realized that we need to maintain some kind of relationship between the flush version and the volume number for a variety of reasons. The one that immediately comes to mind is that the SeekerManager does a lot of checks to see whether specific fileset files exist. In order to do that efficiently it needs to be able to go from namespace/shard/blockStart/volume -> expected filename.

Once we decided that we wanted to maintain that relationship the "simplest" thing is to ensure that the flush version and the volume number are always the same. Technically we could do the +1 logic and still maintain this guarantee but that would make it implicit instead of explicit and a brittle assumption. Basically instead of having two sources of truth (flush version in memory and volume numbers on disk) we only have one and if they're out of sync for any reason we'll at least get errors as opposed to silent successes.

So we decided to just tell the PersistManager / writer which volume we intended to create and that has some nice properties like if our logic is wrong the PersistManager can error out and tell us that the volume already exists (as opposed to it just incrementing by 1 and subtly breaking our assumptions).

This is also nice because it forces users of the PersistManager to think about what volume they're writing and whether or not thats valid.

Since we took this route we do need to bootstrap the flush version back into the shard, but that's a trivial change since the shard already calls ForEachInfoFile, so all we need to do is pull the volume number out of the info metadata and into the shard memory state.

@justinjc Do you remember any other details? We spent a few hours working through all this and I feel like I may be missing some of the other reasons we couldn't do the incrementing.

@robskillington Thinking about this now though, it seems like it might be good to add a check in the persist manager that the volume being written is at least as high as the highest existing complete fileset file. I.e. if complete versions of 1, 2, 3 exist, then writing 3 (if deleteIfExist=true) and 4 should be allowed, but writing 2 should not. What do you think @justinjc? Could also add it to your list of follow ups.
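A rough sketch of the explicit-volume behaviour described above, including where the proposed follow-up check would slot in; the function name and options are illustrative, not the real PersistManager API:

// prepareFlushVolume fails loudly if the caller-specified volume already
// exists on disk, rather than silently bumping to the next free index.
func prepareFlushVolume(
	prefix string,
	blockStart time.Time,
	volume int,
	deleteIfExists bool,
	suffix string,
) error {
	path := filesetPathFromTimeAndIndex(prefix, blockStart, volume, suffix)
	_, err := os.Stat(path)
	if err == nil && !deleteIfExists {
		return fmt.Errorf("fileset volume %d already exists: %s", volume, path)
	}
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	// The suggested follow-up would additionally verify here that volume is at
	// least as high as the highest complete volume already on disk for this block.
	return nil
}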

Contributor:

We could also support a flag in the PersistManager that basically tells it to figure out the volume number itself which we would probably need if we wanted to allow incremental peer bootstrapping and flushing to occur in parallel, but not sure we want to go with that approach long term (see my other comment below)

Contributor:

@justinjc just pinging your thoughts on that potential follow up

@justinjc (Collaborator, Author):

Thanks for the explanation here @richardartoul, you summed it up pretty well. In short, we wanted an efficient way to figure out what the correct volume is for a particular namespace/shard/block, and having to go to disk, parse filenames, and then order them each time seemed significantly slower.

I'm not a big fan of the persist manager having the volume number check. The check would similarly involve looking for files in the specified directory, parsing file names, and ordering them, which amounts to additional logic that's essentially double-checking that other logic isn't buggy. Also, from a separation of concerns point of view, I don't think the persist manager should need to know that fileset volumes are always increasing. Its job should just be to persist the things you tell it to persist. It should be the caller of the persist manager that ensures the volume number it passes is correct.

DeleteIfExists bool
// This volume index is only used when preparing for a flush fileset type.
// When opening a snapshot, the new volume index is determined by looking
// at what files exist on disk.
Collaborator:

Again, it does seem strange that they don't both just use the same strategy.

Contributor:

Yeah, agreed it's a little weird. Justin's initial implementation did do the +1 strategy, but we ran into some sharp edges with that approach that pushed us in the other direction.

@robskillington (Collaborator) left a comment:

Looking mostly good!

// bootstrap, then the bootstrap volume index will be >0. In order
// to support this, bootstrapping code will need to incorporate
// merging logic and flush version/volume index will need to be
// synchronized between processes.
@robskillington (Collaborator) commented Jun 18, 2019:

Just a thought: I assume if we ever want to do repairs we might also need to support volume index >0 for bootstrapping. (If an "online" peer bootstrapping is the process that repairs).

Contributor:

@robskillington The designed repair feature won't use the peers bootstrapper at all, right? Re: incremental bootstrap, what I've been thinking about recently is to eventually move away from having incremental in the peers bootstrapper and instead combine cold flushing + streaming bootstrap to get the same effect.

The alternative approach (where we support index > 0 and then basically do merging from there) would also work, but I think it would make allowing M3DB to flush while it's bootstrapping really difficult (which is a place I think we want to get to long term), because there would have to be coordination of the flush versions.

@richardartoul (Contributor) left a comment:

LGTM aside from the nits.

Let's try and get a stamp from @robskillington too.

Resolved review comment on: src/dbnode/persist/fs/files.go
// delimiters found, to be used in conjunction with the intComponentFromFilename
// function to extract filename components. This function is deliberately
// optimized for speed and lack of allocations, since filename parsing is known
// to account for a significant proportion of system resources.
Contributor:

"since allocation-heavy filename parsing can quickly become the large source of allocations in the entire system, especially when namespaces with long retentions are configured."

func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
Contributor:

niiiice

// delimiters. Our only use cases for this involve extracting numeric
// components, so this function assumes this and returns the component as an
// int64.
func intComponentFromFilename(
Contributor:

How do you feel about "componentAtIndex" as the name? The current name gave me the wrong impression until I read the comment

@justinjc (Collaborator, Author):

Sure. I'll call it intComponentAtIndex because I do want to say that it deals with ints only.

	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	var start int
Contributor:

Can you do var start = 0? That would have given me a nudge in the right direction to understand how you're using the positions.

@justinjc (Collaborator, Author):

Yeah makes sense.
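For reference, a minimal sketch of how these two helpers could fit together, assuming '-' as the delimiter and a small fixed maxDelimNum; this illustrates the allocation-free parsing approach discussed above, not the actual implementation:

package main

import (
	"fmt"
	"strconv"
)

const maxDelimNum = 4

// delimiterPositions records the positions of up to maxDelimNum '-' delimiters
// in baseFilename without allocating, returning the fixed-size array of
// positions and the number of delimiters found.
func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
	var (
		positions [maxDelimNum]int
		found     int
	)
	for i := 0; i < len(baseFilename) && found < maxDelimNum; i++ {
		if baseFilename[i] == '-' {
			positions[found] = i
			found++
		}
	}
	return positions, found
}

// intComponentAtIndex extracts the numeric component between the
// (componentPos-1)-th and componentPos-th delimiters (component 0 precedes
// the first delimiter) and returns it as an int64.
func intComponentAtIndex(
	baseFilename string,
	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	if componentPos < 0 || componentPos >= maxDelimNum {
		return 0, fmt.Errorf("component position %d out of range", componentPos)
	}
	var start = 0
	if componentPos > 0 {
		start = delimPos[componentPos-1] + 1
	}
	end := delimPos[componentPos]
	if start > end || end > len(baseFilename) {
		return 0, fmt.Errorf("no component at position %d in %q", componentPos, baseFilename)
	}
	return strconv.ParseInt(baseFilename[start:end], 10, 64)
}

func main() {
	// e.g. "fileset-1560384000000000000-0-data.db": component 1 is the block
	// start and component 2 is the volume index (this layout is an assumption).
	name := "fileset-1560384000000000000-0-data.db"
	delims, _ := delimiterPositions(name)
	volume, err := intComponentAtIndex(name, 2, delims)
	fmt.Println(volume, err)
}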

// when we attempt to flush and a fileset already exists unless there is
// racing competing processes.
// filesets exist at bootstrap time so we should never encounter a time
// when we attempt to flush and a fileset that already exists unless
Contributor:

This extra "that" seems wrong.

@justinjc (Collaborator, Author):

Reworded this.

// there are racing competing processes.
Contributor:

"there is a bug in the code" or something

@justinjc (Collaborator, Author):

Haha sure.

@@ -2006,7 +2019,7 @@ func (s *dbShard) ColdFlush(

// After writing the full block successfully, update the cold version
// in the flush state.
nextVersion := s.RetrievableBlockColdVersion(startTime) + 1
nextVersion := coldVersion + 1
Contributor:

Yeah so there is +1 logic here and in the Merge() function. Would prefer we figure out the next version here and then tell the Merge() function to use that

@justinjc (Collaborator, Author):

Sure.

blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
for _, datafile := range filesets {
Contributor:

Maybe add a comment saying this is safe because cleanup and flushing never happen in parallel, so the snapshot will never get stale.

Also, we have BlockStatesSnapshot() and FlushState(); do those two duplicate each other?

@justinjc (Collaborator, Author):

Sure I'll add a comment. Also, since versions are always increasing, even if we get stale snapshots, we just don't clean up some files that can be cleaned (but we'll clean them up eventually next time round).

BlockStatesSnapshot gets the flush states for all block starts, while FlushState is for a specific block start. Wouldn't want to loop through all blocks and keep acquiring and releasing locks (mentioned in a comment above).

// locks in a tight loop below.
blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
Contributor:

I'd probably call this toDelete or something

Contributor:

Also, this is sketching me out lol, can we just allocate a new slice? This is only once per shard so it's not so bad.

@justinjc (Collaborator, Author):

Alllllright.
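For reference, a sketch of the two variants discussed in this thread; shouldDelete and fileSetType are placeholders for the real block-state check and element type, not names from this PR:

// Variant 1: filter in place, reusing the backing array of filesets.
// No allocation, but it mutates the input slice's backing storage.
toDeleteInPlace := filesets[:0]
for _, datafile := range filesets {
	if shouldDelete(datafile) {
		toDeleteInPlace = append(toDeleteInPlace, datafile)
	}
}

// Variant 2: allocate a fresh slice, as suggested above. One allocation per
// shard per cleanup pass, and the input slice is left untouched.
toDeleteAlloc := make([]fileSetType, 0, len(filesets))
for _, datafile := range filesets {
	if shouldDelete(datafile) {
		toDeleteAlloc = append(toDeleteAlloc, datafile)
	}
}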

justinjc merged commit 2eec9d4 into master on Jun 21, 2019
justinjc deleted the juchan/seek-mgr branch on June 21, 2019 at 17:23