[Searchable Snapshot] Design file caching mechanism for block based files #4964
I'll be working on this.
Are we building this with the assumption that the searcher node is only going to be used for remote indices? If not, then we need to think through more complex use cases where the same node hosts both a local (hot) index and a remote index with a cache, which means we need mechanisms to control storage usage for hot versus cached data.
Here are my findings for the caching mechanism.

Cache Scope

The file system cache will be a data-node-local cache. A single data node might have multiple data volumes, multiple indices and shards, and possibly multiple caches. We need to decide what the Volume:Cache mapping should be (1:1, 1:M, ...) and what the Cache:Index mapping should be. I think the most flexible and highly configurable interface for defining caches is to define static named caches (maybe later exposed via a dynamic API) in the yml file like:
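A hypothetical sketch of what such named-cache definitions could look like; the setting prefix, cache names, and values are illustrative, not an agreed-upon schema:

```yml
node.file_cache:
  warm_cache:
    path: /mnt/disk1/cache
    size: 20%          # fixed percentage of the volume
  cold_cache:
    path: /mnt/disk2/cache
    size: 100gb        # or an absolute byte value
```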
or it can be flattened as follows (because a map Setting is not supported):
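The same hypothetical settings in flattened form:

```yml
node.file_cache.warm_cache.path: /mnt/disk1/cache
node.file_cache.warm_cache.size: 20%
node.file_cache.cold_cache.path: /mnt/disk2/cache
node.file_cache.cold_cache.size: 100gb
```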
and when we restore a remote-stored index, the restore API can be modified to select the cache:
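A hedged sketch of such a restore call; the `file_cache` parameter is hypothetical and only illustrates binding the restored index to a named cache:

```json
POST /_snapshot/my-repo/my-snapshot/_restore
{
  "indices": "my-index",
  "storage_type": "remote_snapshot",
  "file_cache": "warm_cache"
}
```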
This way users can decide how many caches are needed, which volume is used by each cache, whether to share cache space with local hot indices, which indices use which cache, and how fast and big each cache is. This interface basically allows the user to decide whether a node's role is hot, cold, warm, or mixed. For now I don't see a need to support a single cache backed by multiple volumes / data paths. The disadvantage of this approach is that it's harder to define default behavior (when no cache is defined), especially with the size reservation logic.

Other options
Cache Size and Reservation

Depending on the cache scope decision, a cache might share space with hot Lucene index data. Hot shards might grow, get relocated, etc., so either the cache size must be dynamic (a best effort to use space not utilized by hot shards, where hot shards get priority on the disk and the shard relocation logic needs access to the cache to make it shrink itself, which adds complexity), or the cache size should be fixed and reserved. I think the latter is simpler: reserving a fixed size keeps the components more decoupled. For now, I think defining the cache as a fixed percentage of the volume is easier to use and more adaptive for users than an absolute byte value. The cache disk reservation logic will impact all of:
Reservation Logic

When creating the cache, validations are needed to make sure:
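For example, a percentage-based reservation could be validated against the volume's actual capacity before the cache is created. A minimal Java sketch with hypothetical names:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

final class CacheReservation {
    // Returns the number of bytes to reserve, or throws if the volume
    // cannot accommodate the reservation alongside existing (hot) data.
    static long reserve(Path volume, double cacheFraction) throws IOException {
        FileStore store = Files.getFileStore(volume);
        long requested = (long) (store.getTotalSpace() * cacheFraction);
        if (requested > store.getUsableSpace()) {
            throw new IllegalStateException(
                "cache reservation of " + requested + " bytes exceeds free space on " + volume);
        }
        return requested;
    }
}
```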
Default Behavior

Given that it's easier to implement a fixed-size cache with reservation logic, and that (depending on the cache scope decision) not all data nodes might host a remote-stored index, it does not make sense to include a reserved file system cache by default, because that would be a waste of resources. To have caching enabled by default, we would need to implement a dynamically sized cache. So for default behavior, these are the options I can think of:
Stats to publish, and at what scope?

Since the cache is scoped per node, we can publish stats at the node level or at the named-cache level. The named-cache level is more granular, at the cost of more expensive stats to maintain and transmit, and we could do both. For now it's easier to have stats per named cache. Stats to publish:
How to build the LRU Cache?

We should not build a cache from scratch for this. Any standard on-heap LRU cache with listeners and a weigher should work for implementing the file system caching. Basically, the on-heap cache will hold a reference to a file on disk, as below:
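A minimal Java sketch of such an entry (names are illustrative, not the actual implementation): the on-heap cache only stores lightweight references to blocks already materialized on disk, so the bytes themselves stay off-heap.

```java
import java.nio.file.Path;

final class FileSystemCacheEntry {
    final Path blockPath;    // where the downloaded block lives on the data volume
    final long sizeInBytes;  // reported to the cache's weigher for size accounting

    FileSystemCacheEntry(Path blockPath, long sizeInBytes) {
        this.blockPath = blockPath;
        this.sizeInBytes = sizeInBytes;
    }
}
```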
Looking at the current LRU cache in our code base, it looks like it could work here.

Minimum Viable Product

If we define a dedicated role for remote store readers (and change the hot shard allocation decider to exclude these nodes), then no cache size reservation logic is needed at all, since all data volumes will be used exclusively by the cache. But this does not solve the concern about default out-of-the-box behavior, since the user needs to explicitly set the node role to remote-reader only. So depending on the effort, I think that implementing cache reservation logic is more flexible for users and a better future investment.
Task Breakdown:
Open Questions
Thanks @aabukhalil! I'm still digging into a lot of this, but just wanted to point out an existing issue that documents the poor performance of the existing cache in OpenSearch.
Path forward for the open questions here.

File Cache - Design Considerations

1. Overview

Searchable snapshots introduced indices which can be queried without downloading all the segment files onto the node, instead fetching parts of the segment files on demand as the query dictates. These parts, also known as blocks, are currently re-downloaded on every call; they can instead be cached on the local node where the query is being served. The design for searchable snapshot file caching has been detailed in this issue. There are certain design considerations and decisions around cache scope, space reservation, and cache recovery that are answered as part of this document.

2. Problem Statement

The file cache design involves decisions around cache sizing, cache scope, reservation logic, and cache recovery that need to be finalized per the constraints of the node, the cluster, and the searchable snapshot feature -
3. Goals

The file caching mechanism for searchable snapshots should be able to satisfy the above constraints for the following use cases -
There are two major scenarios for which the design will provide answers -
4. Glossary
5. Proposed Solution

The principle applied to the solution is to keep it simple while remaining open for future extensibility. The solutions for the various design decisions are described below.

5.1 Cache Scope
Pros:
Cons:
Code considerations:
5.2 Cache Size
5.2.1 Reservation logic
Code considerations:
5.3 Cache Path
Code considerations:
6. Other Approaches

6.1 Named Caches

The named cache approach extends the default cache approach described above, giving users more flexibility to define multiple caches with different volumes and names.
6.1.2 Cons
6.2 Dynamic Cache

A dynamic cache would allow the cache to be resized to accommodate the changing nature of hot shards on hybrid nodes, where both remote and local shards are assigned to the same node. This approach will require some real-world data before dynamic resizing can be implemented. The priority of hot shards and the space available on other nodes would need to be examined, similar to the allocator approach, with cache resizing taking place only if no other options are available; this would require priority-based weights in the decision making.

7. Backwards compatibility

The change is not breaking but an enhancement, and can be released in a minor version.
Following up on the cache design, based on the research already done in #4964 (comment) and some suggestions from the discussions on #5641. These are the specs needed by the LRUCache implementation:
So far we have evaluated the Guava/Caffeine caches, but they don't support custom eviction logic (preventing eviction of entries that have active usage) and they are not open for extension. That's why #5641 contains a modified version of the Guava/Caffeine cache. Apache JCS was suggested in #5641 and it looks promising because it is more open for extension than modification, but I still haven't found a straightforward, perfectly fitting implementation for our cache usage. For now I think the path forward to unblock this is to proceed with the current custom cache implementation and then figure out whether another LRU cache can be made to work here. This is tracked in #6225.
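For illustration, a minimal sketch of the "active usage" requirement: readers take a reference before using a cached block and release it when done, and the eviction path only considers entries whose reference count has dropped to zero. The names and structure are hypothetical, not the #5641 implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedEntry<V> {
    final V value;
    private final AtomicInteger refCount = new AtomicInteger();

    RefCountedEntry(V value) { this.value = value; }

    V acquire() {            // called by a reader before use
        refCount.incrementAndGet();
        return value;
    }

    void release() {         // called by a reader when done
        refCount.decrementAndGet();
    }

    boolean evictable() {    // the eviction path skips entries still in use
        return refCount.get() == 0;
    }
}
```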
This is partially supported by pinning the entry so that it is skipped for evaluation. That is done by using a weight of zero, since zero-weight entries are exempt from size-based eviction. The problem with customizing this to inspect the candidate via a callback is that it turns an O(1) evaluation into an O(n) scan. That can result in a memory leak if no victim is found, and it may waste CPU cycles as reads/writes trigger new maintenance cycles that try to honor the threshold. An extension point could be error prone, and while experts might be careful, most users would not be. The two popular caches that kind of have this feature are Ehcache's EvictionAdvisor (a soft no; it will evict an entry regardless of the advice) and Coherence's EvictionApprover (a hard no, which could lead to heap exhaustion). Neither offers high performance, and the latter has switched to Caffeine going forward. The evaluation approach has come up a few times, such as by Apache Solr, but usually in terms of old code that the authors do not want to change. I haven't found a satisfactory API that avoids footguns, which is okay for internal code where the team can be pragmatic, but very problematic for external ones. If you can use pinning (explicitly or implicitly) then the implementation logic will be obvious, it will be easier to debug, and there will be fewer surprises. Otherwise, if you need the evaluation approach, I am happy to discuss it for Caffeine, but I am doubtful that we will find a solution that we'll support.
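A minimal Caffeine sketch of the zero-weight pinning approach described above; `FileBlock` and its methods are hypothetical stand-ins. Note that Caffeine evaluates weights when an entry is written, so flipping a pin flag requires re-putting the entry:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

final class PinnedBlockCache {
    // Hypothetical value type; isPinned()/sizeInBytes() stand in for real state.
    interface FileBlock {
        boolean isPinned();
        long sizeInBytes();
    }

    static Cache<String, FileBlock> build(long maxBytes) {
        return Caffeine.newBuilder()
            .maximumWeight(maxBytes)
            // Weight 0 exempts the entry from size-based eviction (pinning).
            .weigher((String path, FileBlock block) ->
                block.isPinned() ? 0 : (int) Math.min(block.sizeInBytes(), Integer.MAX_VALUE))
            .build();
    }
}
```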
Thanks @ben-manes for the help!! We will use your advice in our next evaluation.
Closing as #5641 has been merged, though follow-up tasks have been created.
Currently searchable snapshots download Lucene files using a chunking approach to only download the data that is needed to service a query. It should use a node-level LRU cache that will use up to a configurable amount of local disk space to avoid re-downloading the same parts of frequently-accessed files. All shards on the node should share the same logical cache, meaning that if one shard is queried exclusively then it should use up to the entire cache space configured for the node.
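A hypothetical sketch of such a node-level setting in the yml config (the eventual setting name and default may differ):

```yml
node.search.cache.size: 50gb
```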
Open questions: