
Allow to prewarm the cache for searchable snapshot shards #55322

Merged — tlrx merged 11 commits into elastic:master on Apr 24, 2020

Conversation

@tlrx (Member) commented Apr 16, 2020:

This pull request adds a way to prewarm the cache for searchable snapshot shard files.

It relies on a new index setting named index.store.snapshot.cache.load.eagerly (defaults to false) that can be passed when mounting a snapshot as an index. This setting is detected during the pre-recovery step, before the snapshot files are exposed to the other components of the system. The prewarmCache() method of the SearchableSnapshotDirectory instance is executed, which builds the list of all parts of snapshot files that need to be prefetched into the cache (excluding the files whose content is stored in the metadata hash and the ones explicitly excluded by the excluded_file_types setting).
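
As a rough illustration, the selection step could look like the following minimal sketch (imports elided); isExcludedByFileType() and enqueuePrefetch() are hypothetical helpers, while indexFiles(), metadata().hashEqualsContents(), physicalName() and numberOfParts() come from the existing snapshot file model:

    // Hypothetical sketch of the selection performed by prewarmCache()
    for (BlobStoreIndexShardSnapshot.FileInfo file : snapshot.indexFiles()) {
        if (file.metadata().hashEqualsContents()) {
            continue; // file content is fully stored in the metadata hash, nothing to fetch
        }
        if (isExcludedByFileType(file.physicalName())) {
            continue; // excluded through the excluded_file_types setting
        }
        for (int part = 0; part < file.numberOfParts(); part++) {
            enqueuePrefetch(file, part); // each part is a unit of warming work
        }
    }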

The parts are then prefetched into the cache in parallel using the SNAPSHOT thread pool. If a snapshot file is composed of multiple parts (or chunks), the parts can potentially be downloaded and written to the cache concurrently. The implementation relies on a new prefetchPart() method added to the CachedBlobContainerIndexInput class. This method fetches a complete part of a file (or the whole file if the snapshot file is composed of a single part) in order to write it to the cache. This is possible because CacheFile has been modified to work with configurable cache range sizes depending on the IOContext the IndexInput has been opened with.
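
As a minimal sketch, one warming task could look as follows, assuming the names above; this version warms a single file's parts sequentially on one SNAPSHOT worker (parts of the same file may also be warmed concurrently), and error handling and stats recording are elided:

    // Hypothetical warming task for one file, submitted to the SNAPSHOT pool
    threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
        try (IndexInput input = openInput(file, CachedBlobContainerIndexInput.CACHE_WARMING_CONTEXT)) {
            for (int part = 0; part < file.numberOfParts(); part++) {
                // clone() gives each part read its own state; clones of an
                // IndexInput do not need to be closed themselves
                final IndexInput clone = input.clone();
                ((CachedBlobContainerIndexInput) clone).prefetchPart(part);
            }
        } catch (IOException e) {
            // a failed prefetch is not fatal: the range is fetched lazily on first read
        }
    });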

When the IndexInput is opened with the specific CACHE_WARMING_CONTEXT context, the file is cached on disk using large ranges of bytes aligned on the beginning and the end of each part (or chunk) of the file. With any other context, the file is cached on disk using the normal cache range size defined through the range_size setting. This implementation allows the existing cache eviction mechanism to be reused if something goes wrong when reading or writing a part. It also simplifies the logic if the recovering shard is closed while prewarming the cache.
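
The range computation this implies is plain alignment arithmetic; the sketch below uses illustrative names, where the effective rangeSize would be the part size under CACHE_WARMING_CONTEXT and the range_size setting otherwise:

    // Align a position down to the start of its range, and compute the range end
    // clamped to the file length
    static long rangeStart(long position, long rangeSize) {
        return (position / rangeSize) * rangeSize;
    }

    static long rangeEnd(long position, long rangeSize, long fileLength) {
        return Math.min(rangeStart(position, rangeSize) + rangeSize, fileLength);
    }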

@tlrx added the >enhancement, :Distributed Coordination/Snapshot/Restore, v8.0.0 and v7.8.0 labels Apr 16, 2020
@tlrx tlrx requested review from ywelsch and DaveCTurner April 16, 2020 15:57

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@@ -115,8 +135,10 @@ public SearchableSnapshotDirectory(
        this.cacheDir = Objects.requireNonNull(cacheDir);
        this.closed = new AtomicBoolean(false);
        this.useCache = SNAPSHOT_CACHE_ENABLED_SETTING.get(indexSettings);
        this.loadCacheEagerly = useCache ? SNAPSHOT_CACHE_LOAD_EAGERLY_SETTING.get(indexSettings) : false;

tlrx (Member, Author) commented:

Better name suggestions are welcome

        boolean alreadyLoaded = this.loaded;
        if (alreadyLoaded == false) {
            synchronized (this) {
                alreadyLoaded = this.loaded;
                if (alreadyLoaded == false) {
                    this.blobContainer = blobContainerSupplier.get();
                    this.snapshot = snapshotSupplier.get();
                    if (loadCacheEagerly) {
                        prewarmCache();

tlrx (Member, Author) commented:

This method blocks until the cache is fully prewarmed. It must be done before loaded is set to true so that other components of the system do not trigger caching on this directory's files.

Contributor commented:

Is that strictly necessary? I would prefer to initiate the prewarming here, but at the same time allow the shard routing to move to started state as quickly as possible.

tlrx (Member, Author) commented:

This is not strictly necessary but a misunderstanding on my part. We discussed this and I updated the PR so that cache warming now runs concurrently with the recovery.


final BlobStoreIndexShardSnapshot.FileInfo fileInfo = fileInfo(name);
private IndexInput openInput(final BlobStoreIndexShardSnapshot.FileInfo fileInfo, final IOContext context) {

tlrx (Member, Author) commented:

Splitting this method into two allows opening an IndexInput even if the snapshot is not marked as loaded yet.

Contributor commented:

I think this is no longer necessary? The private openInput method is only called in one place.

tlrx (Member, Author) commented:

Indeed - I pushed c8a1c6b to remove the method.

        );
        final long startTimeInNanos = statsCurrentTimeNanosSupplier.getAsLong();
        try {
            final IndexInput input = openInput(file, CachedBlobContainerIndexInput.CACHE_WARMING_CONTEXT);

tlrx (Member, Author) commented:

This method uses an IndexInput with a specific IOContext to prewarm the cache for the given Lucene file. The IndexInput will be cloned for each part to write in cache later and closed once all parts are processed.

@@ -61,12 +61,11 @@ protected void closeInternal() {
     @Nullable // if evicted, or there are no listeners
     private volatile FileChannel channel;

-    public CacheFile(String description, long length, Path file, int rangeSize) {
+    public CacheFile(String description, long length, Path file) {

tlrx (Member, Author) commented:

We included the rangeSize in the CacheFile to compute the range to fetch given a specific position, but we were never asserting that the fetched ranges really matched the size.

@@ -144,6 +208,7 @@ private void writeCacheFile(FileChannel fc, long start, long end) throws IOExcep
         final long length = end - start;
         final byte[] copyBuffer = new byte[Math.toIntExact(Math.min(COPY_BUFFER_SIZE, length))];
         logger.trace(() -> new ParameterizedMessage("writing range [{}-{}] to cache file [{}]", start, end, cacheFileReference));
+        assert assertRangeOfBytesAlignment(start, end);

tlrx (Member, Author) commented:

This asserts the size of the ranges written to the cache, depending on the IOContext.

tlrx (Member, Author) commented:

We can't really assert the size of the ranges now that warming runs concurrently. This has been removed.

@tlrx changed the title from "Allow to prewarm that cache for searchable snapshot shards" to "Allow to prewarm the cache for searchable snapshot shards" Apr 17, 2020
    @Override
    protected void doRun() throws Exception {
        CheckedRunnable<Exception> loader;
        while (isOpen && (loader = queue.poll(0L, TimeUnit.MILLISECONDS)) != null) {

Contributor commented:

I see that this is taking the same approach as we use for uploading snapshot files (BlobStoreRepository). I would prefer not to hold onto workers for such a long time, as it can block the snapshot thread pool for a long time (cc: @original-brownbear).
In both cases (also the one in BlobStoreRepository), I would prefer for the worker to process one file, then enqueue another task to the thread pool to pick up the next piece of work. This allows other operations to make progress as well, instead of waiting for a long time in the snapshot queue.

tlrx (Member, Author) commented:

I agree, that makes sense Yannick.

Member commented:

I think for the uploading side the downside of

> This allows other operations to make progress as well

is that it causes the index commits to be held on to for a suboptimally long time. That's why the approach of fully monopolizing the pool was consciously chosen for uploads there.

Contributor commented:

That's something to be better controlled at the SnapshotShardsService level then, though. It could limit the number of shards to be snapshotted concurrently by lazily enqueuing there.

@tlrx (Member, Author) commented Apr 21, 2020:

Please hold off on reviews - Yannick and I discussed several points on this PR yesterday and I'll address them.

@tlrx tlrx added the WIP label Apr 21, 2020

@tlrx (Member, Author) commented Apr 23, 2020:

I've updated this PR so that cache warming is no longer blocking and now runs concurrently with the shard recovery. Allowing concurrent reads of different chunk sizes required removing the assertions on the number and the length of gaps to be written to the cache (which I think is OK, as David planned to improve this). I also introduced a dedicated thread pool for cache warming, as suggested by Yannick. This thread pool is sized larger than the default snapshot thread pool.

I quickly ran some benchmarks and compared the results to regular full restores. Depending on the snapshot to restore, this change runs from 10% to 50% faster than a regular restore. This makes sense now that contention has been reduced in #55662 and a custom thread pool is used for warming.
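
For reference, a dedicated pool can be registered by a plugin roughly as sketched below; the pool name matches the searchable_snapshots pool discussed in this PR, but the scaling sizes and keep-alive shown here are assumptions, not the committed values:

    // Hypothetical executor registration in the plugin (imports elided)
    @Override
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        return Collections.singletonList(
            new ScalingExecutorBuilder(
                "searchable_snapshots",
                0,                                             // core pool size
                2 * EsExecutors.allocatedProcessors(settings), // max pool size
                TimeValue.timeValueSeconds(30)                 // keep-alive
            )
        );
    }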

@tlrx (Member, Author) commented Apr 24, 2020:

ML-related failure.

@elasticmachine run elasticsearch-ci/2

@ywelsch ywelsch self-requested a review April 24, 2020 08:11
@tlrx tlrx removed the WIP label Apr 24, 2020

@tlrx (Member, Author) commented Apr 24, 2020:

@DaveCTurner sorry for the delay. This is ready for review.

@DaveCTurner (Contributor) left a review:

Cunning use of an IOContext, makes sense to me. I left some small comments but nothing major.

@@ -99,6 +103,11 @@
         true,
         Setting.Property.IndexScope
     );
+    public static final Setting<Boolean> SNAPSHOT_CACHE_LOAD_EAGERLY_SETTING = Setting.boolSetting(
+        "index.store.snapshot.cache.load.eagerly",

Contributor commented:

Suggest keeping the terminology consistent around "warming", how about index.store.snapshot.cache.prewarm.enabled?

tlrx (Member, Author) commented:

Much better name, thanks. I pushed 4ee96af.
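
The renamed setting presumably mirrors the original definition; the following is an assumed shape based on the excerpt above, not the exact committed code:

    // Assumed definition after the rename in 4ee96af; defaults to false per the PR description
    public static final Setting<Boolean> SNAPSHOT_CACHE_PREWARM_ENABLED_SETTING = Setting.boolSetting(
        "index.store.snapshot.cache.prewarm.enabled",
        false,
        Setting.Property.IndexScope
    );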

@@ -129,7 +129,7 @@ protected InputStream openSlice(long slice) throws IOException {
         }
     }

-    protected final boolean assertCurrentThreadMayAccessBlobStore() {
+    protected boolean assertCurrentThreadMayAccessBlobStore() {

Contributor commented:

I think we can relax the assertion here to permit the searchable_snapshots threadpool to access the repo, rather than overloading it only in CachedBlobContainerIndexInput.

tlrx (Member, Author) commented:

Ok. I pushed 0469c7b
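
The relaxed assertion could look roughly like this; the exact set of permitted pools and the thread-name check are assumptions based on the suggestion above:

    // Hypothetical relaxed assertion permitting the searchable_snapshots pool
    protected boolean assertCurrentThreadMayAccessBlobStore() {
        final String threadName = Thread.currentThread().getName();
        assert threadName.contains("[snapshot]") || threadName.contains("[searchable_snapshots]")
            : "current thread [" + threadName + "] may not access the blob store";
        return true;
    }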



@tlrx tlrx requested a review from DaveCTurner April 24, 2020 14:11

@tlrx (Member, Author) commented Apr 24, 2020:

Thanks @DaveCTurner, I've applied your feedback. This is ready for another round.

@DaveCTurner (Contributor) left a review:

Two more tiny comments regarding the threadpool, but otherwise LGTM. Great work @tlrx.

@tlrx tlrx merged commit bd40d06 into elastic:master Apr 24, 2020
@tlrx tlrx deleted the load-cache-eagerly branch April 24, 2020 15:47

@tlrx (Member, Author) commented Apr 24, 2020:

Thanks David

tlrx added a commit that referenced this pull request May 5, 2020:

Today the cache prewarming introduced in #55322 works by enqueuing all the file parts to warm in the searchable_snapshots thread pool at once. In order to make this fairer among concurrent warmings, this commit starts workers that concurrently poll file parts to warm from a queue, warm the part, and then immediately schedule another warming execution. This should leave more room for concurrent shard warming to sneak in and be executed.

Relates #55322
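
A minimal sketch of that worker pattern, with assumed queue and executor wiring:

    // Hypothetical "poll one part, warm it, re-schedule" worker; a failed part is
    // tolerated because it will simply be fetched lazily on first read
    void warmNext(BlockingQueue<CheckedRunnable<Exception>> queue, Executor executor) {
        final CheckedRunnable<Exception> task = queue.poll();
        if (task == null) {
            return; // nothing left to warm
        }
        executor.execute(() -> {
            try {
                task.run(); // warms a single file part
            } catch (Exception e) {
                // log and continue
            }
            warmNext(queue, executor); // let other shards' warming tasks interleave
        });
    }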
tlrx added a commit to tlrx/elasticsearch that referenced this pull request May 5, 2020, with the same commit message (relates elastic#55322).
tlrx added a commit that referenced this pull request May 5, 2020, with the same commit message (relates #55322).