
[Searchable Snapshot] Design file caching mechanism for block based files #4964

Closed
Tracked by #5087
andrross opened this issue Oct 28, 2022 · 11 comments
Labels: discuss, enhancement, Indexing & Search

@andrross (Member)

Currently, searchable snapshots download Lucene files using a chunking approach so that only the data needed to service a query is fetched. This feature should use a node-level LRU cache that consumes up to a configurable amount of local disk space, to avoid re-downloading the same parts of frequently accessed files. All shards on the node should share the same logical cache, meaning that if one shard is queried exclusively, it can use up to the entire cache space configured for the node.
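For illustration only, here is a minimal sketch of such a byte-weighted, node-level LRU. The BlockKey type and the in-heap byte[] payloads are stand-ins; the real cache would hold references to blocks stored on local disk:

import java.util.LinkedHashMap;

// Hypothetical sketch of a shared, byte-budgeted LRU over downloaded file blocks.
class BlockLruCache {
    record BlockKey(String fileName, long offset) {}

    private final long maxBytes; // the configurable per-node cache budget
    private long usedBytes = 0;

    // accessOrder=true makes iteration start from the least-recently-used entry
    private final LinkedHashMap<BlockKey, byte[]> blocks = new LinkedHashMap<>(16, 0.75f, true);

    BlockLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    synchronized byte[] get(BlockKey key) {
        return blocks.get(key);
    }

    // Re-puts of an existing key are ignored here for brevity
    synchronized void put(BlockKey key, byte[] data) {
        usedBytes += data.length;
        blocks.put(key, data);
        // Evict least-recently-used blocks until the byte budget is honored
        var iterator = blocks.entrySet().iterator();
        while (usedBytes > maxBytes && iterator.hasNext()) {
            usedBytes -= iterator.next().getValue().length;
            iterator.remove();
        }
    }
}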

Open questions:

  • How is the cache size configured for a node? Is there a reasonable default if no configuration is provided?
  • Where on disk should the data be cached? i.e. inside the same directory structure as the rest of the index data? Or are there use cases where the cache would want to be a dedicated disk or mount that would require a separate top-level directory?
  • How should a node report cache statistics and/or utilization?
@aabukhalil (Contributor)

I'll be working on this

@anasalkouz (Member)

Are we building this with the assumption that the searcher node will only be used for remote indices? If not, we need to think through more complex use cases where the same node hosts both a local (hot) index and a remote index with a cache, which means we need mechanisms to control the storage usage for hot versus cached data.

@aabukhalil (Contributor)

Here are my findings for the caching mechanism.
The path.data setting accepts a list, so a data node can be statically configured to use multiple data paths and/or volumes.

Cache Scope

The file system cache will be a data-node-local cache. A single data node might have multiple data volumes, multiple indices and shards, and possibly multiple caches. We need to decide what the Volume:Cache mapping would be (1:1, 1:M, ..) and the Cache:Index mapping.

I think the most flexible and highly configurable interface for defining caches is a static named cache (perhaps later exposed via a dynamic API) in the yml file, like:

remote_store_cache:
  cache_1_name:
    path: /path/where/to/put/cache1
    reserved_percent: 30
  cache_2_name:
    path: /path/where/to/put/cache2
    reserved_percent: 30

or it can be flattened as follows (since a map Setting is not supported):

named_caches: [cache_1_name, cache_2_name]
named_cache.cache_1_name.path:
named_cache.cache_1_name.reserved_percent:
named_cache.cache_2_name.path:
named_cache.cache_2_name.reserved_percent:

and when we restore a remote-backed index, modify the API /_snapshot/test_repository/test_snapshot/_restore to accept a cache_name:

{
    "indices": "nyc_taxis",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored_$1",
    "storage_type": "remote_snapshot",
    "cache_name": "named_defined_cache" // if not provided, nothing will be cached ?
}

This way users can decide how many caches they need, which volume each cache uses, whether to share space with local hot indices, which indices use which cache, and how fast and large each cache is. This interface effectively lets the user decide whether a node's role is hot, cold, warm, or mixed. For now I don't see a need to support a single cache backed by multiple volumes / data paths. The disadvantage of this approach is that it's harder to define default behavior (when no cache is defined), especially with the size reservation logic.
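If we go with the flattened keys, they could likely be registered through the existing affix-setting mechanism. A sketch under that assumption (setting names mirror the proposal above, and the defaults are illustrative):

import java.util.List;
import java.util.function.Function;
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.AffixSetting;
import org.opensearch.common.settings.Setting.Property;

// Sketch only: register the flattened named_cache.* keys as affix settings,
// since map-valued settings are not supported.
public class NamedCacheSettings {
    public static final Setting<List<String>> NAMED_CACHES =
        Setting.listSetting("named_caches", List.of(), Function.identity(), Property.NodeScope);

    public static final AffixSetting<String> CACHE_PATH =
        Setting.affixKeySetting("named_cache.", "path",
            key -> Setting.simpleString(key, Property.NodeScope));

    public static final AffixSetting<Integer> CACHE_RESERVED_PERCENT =
        Setting.affixKeySetting("named_cache.", "reserved_percent",
            key -> Setting.intSetting(key, 30, 0, 100, Property.NodeScope));
}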

Other options

  • Cache per Index: stored at the same data path as the index; more convenient and requiring less configuration, but it's hard to satisfy cases where a shared cache is needed, or when the index name is not known ahead of time, or in future stages when index metadata will not exist on the node. It's better to have a dedicated directory for the cache, which makes data cleaning easier (more details below). Maintaining an LRU cache per index might also carry too much overhead. Multiple caches might be able to share the same quota of disk usage, but that is complicated to implement and leaves many open questions.
  • Single Cache per Node: might require implementing a cache backed by multiple data paths, is less configurable, and all indices would share the same cache. The complexity of implementing multiple named caches is only slightly higher than this single cache, so I think it's okay to implement named caches in the first iteration.

Cache Size and Reservation

Depending on the cache scope decision, a cache might share space with hot Lucene index data. Hot shards might grow, get relocated, etc., so either the cache size must be dynamic (best-effort use of space not utilized by hot shards, with hot shards having priority on the disk; shard relocation logic would need access to the cache to make it shrink, which adds complexity), or the cache size should be fixed and reserved. I think reserving a fixed size is simpler and keeps the components more decoupled.

For now, I think defining the cache as a fixed percentage of a volume is easier to use and more adaptive for users than an absolute byte value.
When a cache is defined as, e.g., 30% of a volume or X bytes, the cache may occupy anywhere in [0, X) bytes on disk at any time before it starts evicting and replacing files; it does not occupy those bytes up front.

The cache disk reservation logic will impact all of the following:

  • NodesStats, as fs stats need to take the cache's reserved size into account
  • DiskThresholdDecider: once the monitor service has the reservation logic injected, it should work out of the box
  • MonitorService needs to take the cache's reserved size into account
  • org.opensearch.cluster.ClusterInfo::reservedSpace
  • org.opensearch.cluster.InternalClusterInfoService::buildShardLevelInfo
  • org.opensearch.index.store.StoreStats#reservedSize
    so that org.opensearch.cluster.routing.allocation.DiskThresholdMonitor and org.opensearch.cluster.routing.allocation.decider.DiskThresholdDecider take the new disk reservation logic into account

Reservation Logic

When creating the cache, validations are needed to ensure:

  • (cache defined reserved size - current cache phantom files + Lucene index file size) stays within cluster.routing.allocation.disk.watermark.high, cluster.routing.allocation.disk.watermark.low, and so on (a minimal sketch of this check follows the list)
  • When the cache is initiated during the node start procedure, walk the cache directory's files and count their sizes
  • For now we don't need to handle the case where the cache data path has changed and the old path has ghost files
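A minimal sketch of the first validation, assuming the high watermark has already been resolved to a required-free-bytes value for the volume (the real computation lives in the disk threshold components listed earlier):

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: reject a cache reservation that would push the volume past
// the high watermark. All byte values are supplied by hypothetical callers.
static void validateCacheReservation(Path cachePath, long reservedCacheBytes,
        long existingCacheFileBytes, long highWatermarkRequiredFreeBytes) throws IOException {
    FileStore store = Files.getFileStore(cachePath);
    // Only the not-yet-materialized part of the reservation consumes new space;
    // Lucene index files already on disk are reflected in getUsableSpace()
    long additionalBytes = Math.max(0, reservedCacheBytes - existingCacheFileBytes);
    if (store.getUsableSpace() - additionalBytes < highWatermarkRequiredFreeBytes) {
        throw new IllegalArgumentException(
            "cache reservation of " + reservedCacheBytes + " bytes would breach the high disk watermark");
    }
}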

Default Behavior

Given that a fixed-size cache with reservation logic is easier to implement, and that (depending on the cache scope decision) not all data nodes will have a remote-backed index, it does not make sense to reserve a file system cache by default, because that would waste resources. To have caching enabled by default, we would need to implement a dynamically sized cache.

So for default behavior, these are the options I can think of:

  • a dynamically sized cache, which might be complex; if feasible, it could be enabled by default
  • a reserved fixed-size cache by default, which is unsuitable when users have no remote store indices or don't need caching at all
  • no cache by default, with users explicitly defining caches. Having a node-level cache effectively introduces a new node type, a cacheable reader: a reader node with an option to attach a cache.

Stats: what to publish, how, and at what scope?

Since the cache is scoped per node, we can publish stats at the node level or the named-cache level. The named-cache level is more granular, at the cost of more expensive stats to maintain and transmit, and we could do both. For now, it's easier to have stats per named cache.

Stats to publish:
org.opensearch.common.cache.Cache.CacheStats (hits, misses, evictions), plus utilized bytes (the total bytes of cached content) and the reserved size per named cache, published to node stats.

org.opensearch.node.NodeService#stats should publish these, and org.opensearch.monitor.fs.FsService#stats() should contain the named cache stats, or we introduce a new field in org.opensearch.action.admin.cluster.node.stats.NodeStats if it does not make sense to include the remote store cache in fs stats.
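For illustration, the per-named-cache stats payload could be shaped roughly like this (field names are hypothetical, not an existing OpenSearch class):

// Hypothetical stats holder; the real implementation would also implement
// the serialization plumbing used by node stats.
public record NamedCacheStats(
    String cacheName,
    long hits,
    long misses,
    long evictions,
    long utilizedBytes, // bytes of cached file content currently on disk
    long reservedBytes  // configured reservation for this cache
) {}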

How to build the LRU Cache?

We should not build a cache from scratch for this. Any standard on-heap LRU cache with removal listeners and a weigher should work for implementing the file system caching. Basically, the on-heap cache will hold only a reference to the file on disk, as below:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

// Heap entry that only references a file on disk; the file bytes stay on disk
static class FileOnDisk {
    final int sizeOnDisk;

    FileOnDisk(int sizeOnDisk) {
        this.sizeOnDisk = sizeOnDisk;
    }
}

Cache<String, FileOnDisk> theCache = CacheBuilder.newBuilder()
    .maximumWeight(10) // maximum total weight, i.e. the cache size in bytes (illustrative value)
    .weigher((Weigher<String, FileOnDisk>) (key, value) -> value.sizeOnDisk) // weigh each entry by its on-disk size
    .build();
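Since evicted entries must also be removed from disk, a removal listener would perform the cleanup. A sketch under the same Guava assumption, treating the key as the cached file's path (an illustrative choice, not a settled design):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.Weigher;

// On eviction, delete the backing file so that disk usage tracks the cache weight
Cache<String, FileOnDisk> cacheWithCleanup = CacheBuilder.newBuilder()
    .maximumWeight(10) // cache size in bytes (illustrative value, as above)
    .weigher((Weigher<String, FileOnDisk>) (key, value) -> value.sizeOnDisk)
    .removalListener((RemovalListener<String, FileOnDisk>) notification -> {
        try {
            Files.deleteIfExists(Path.of(notification.getKey()));
        } catch (IOException e) {
            // A failed deletion becomes a "phantom file" that recovery must sweep up
        }
    })
    .build();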

Looking at the current LRU cache in our code base, it looks like org.opensearch.common.cache has the functionality we need, and we could use it. But it appears to be an outdated version of the Guava Cache, with a lot of locking to maintain the LRU order. So, as an improvement for this remote store feature and for every other user of that cache, maybe we should upgrade to the latest Guava Cache or its successor, Caffeine, which is used by many data stores such as Solr, Cassandra, and Druid.

Minimum Viable Product

If we define a dedicated role for remote store readers (and change the hot shard allocation decider to exclude these nodes), then there is no need for cache size reservation logic at all, since all data volumes will be used exclusively by the cache. But this does not solve the concern about default out-of-the-box behavior, since users need to explicitly set the node role to remote reader only.

So, depending on the effort, I think implementing the cache reservation logic is more flexible for users and a better future investment. A rough sketch of the allocation decider mentioned above follows.
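If we did introduce a dedicated remote-reader role, the exclusion could be expressed as an allocation decider along these lines (the role and shard-type checks are hypothetical placeholders):

import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

// Sketch: keep hot (local) shards off dedicated remote-reader nodes.
// isDedicatedRemoteReader() and isRemoteSnapshotShard() are hypothetical helpers
// standing in for the real role and index-settings checks.
public class RemoteReaderAllocationDecider extends AllocationDecider {
    public static final String NAME = "remote_reader";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (isDedicatedRemoteReader(node) && !isRemoteSnapshotShard(shardRouting, allocation)) {
            return allocation.decision(Decision.NO, NAME,
                "hot shards are not allowed on dedicated remote reader nodes");
        }
        return allocation.decision(Decision.YES, NAME, "shard is allowed on this node");
    }

    private boolean isDedicatedRemoteReader(RoutingNode node) {
        return false; // hypothetical: inspect the node's roles
    }

    private boolean isRemoteSnapshotShard(ShardRouting shard, RoutingAllocation allocation) {
        return false; // hypothetical: inspect the index's storage_type setting
    }
}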

@aabukhalil (Contributor)

aabukhalil commented Nov 11, 2022

Task Breakdown:

@aabukhalil (Contributor)

aabukhalil commented Nov 11, 2022

Open Questions

  • The cache must not de-reference and delete a file while it is in use by any IndexInput (many IndexInputs may be created and backed by the same file); a minimal reference-counting sketch follows this list. What happens if the cache is under high-throughput concurrent search load, where the currently held files exceed the cache capacity and an IndexInput is needed to serve a request? What if disk size limits are reached? Should these requests block until there is more room in the cache? If so, how do we prevent deadlocks where multiple search requests hold partial IndexInputs and all wait for room? Should we keep the current implementation of no caching, using ByteBufferIndexInput, and fall back to it when the cache is full? What circuit breaker mechanism can we use here?
  • If we implement the named cache mechanism, how do we handle multiple data nodes having inconsistent cache names? Should the coordinator re-route queries based on cache name? If so, the cache name would effectively act as a role itself. Or should the cluster fail if it detects inconsistent static settings between nodes, or do we simply leave it as the user's responsibility?
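To make the reference-counting constraint in the first question concrete, a minimal sketch (names are illustrative):

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: eviction must skip entries whose refCount > 0, because
// live IndexInputs still read from the underlying on-disk file.
class CachedFileEntry {
    private final AtomicInteger refCount = new AtomicInteger();

    void incRef() { refCount.incrementAndGet(); }   // called when an IndexInput opens the file

    void decRef() { refCount.decrementAndGet(); }   // called when an IndexInput is closed

    boolean isEvictable() { return refCount.get() == 0; } // consulted by the cache during eviction
}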

@andrross (Member, Author)

Thanks @aabukhalil! I'm still digging into a lot of this, but I just wanted to point out an existing issue that documents the poor performance of the existing cache in OpenSearch.

@kotwanikunal (Member)

kotwanikunal commented Feb 2, 2023

Path forward for the open questions here.

File Cache - Design Considerations

1. Overview

Searchable snapshots introduced indices which can be queried without downloading all the segment files onto the node, instead fetching parts of the segment files on demand as the query dictates. These parts, also known as blocks, are currently re-downloaded on every call and could instead be cached on the local node where the query is being served.

The design for searchable snapshot file caching has been detailed in this issue. There are certain design considerations and decisions around cache scope, space reservation, and cache recovery that are answered as part of this document.

2. Problem Statement

The file cache has open design decisions around cache sizing, cache scope, reservation logic, and cache recovery that need to be finalized within the constraints of the node, the cluster, and the searchable snapshot feature -

  • The cache needs to use the storage made available by the paths within the node environment
  • The cache needs to be local to the node
  • The cache should be recoverable on node bootstrapping and should not have any file/resource leaks

3. Goals

The file caching mechanism for searchable snapshots should be able to satisfy the above constraints for the following use cases -

  1. Perform cache operations (reservation, recovery etc.) on a remote-only shard node
  2. Perform cache operations on a hybrid node (a node which supports both local and remote shards)

There are two major scenarios for which the design will provide answers -

  1. Customer adds a remote-only shards node to the cluster
  2. Customer adds the search role to an existing node within the cluster

4. Glossary

  • Cache Scope: In the context of this document, cache scope refers to the relationship between the defined cache and the node, and between the cache and the index.
  • Cache size: The fixed size of the cache in bytes which will be utilized on disk for multiple indices as a part of searchable snapshots.
  • Cache reservation logic: The logic that will be introduced to accommodate reservation of space on the disk for searcher cache.
  • Cache recovery: The process that runs once the node is restarted/recovered/bootstrapped to ensure that the searcher cache goes back to tracking the files available locally on storage and adds them back to the cache.
  • Phantom Files: Files which are unused and no longer tracked by the cache, but were created by the cache and are consuming disk storage.

5. Proposed solution

The guiding principle for these solutions is to keep things simple while remaining open to future extensibility. The solutions for the various design decisions are described below -

5.1 Cache Scope

  • Have a single cache per node with a hierarchical structure that contains the cache files organized per index.
  • There will be a single cache location per node, initialized when the search role is added to the node; this covers scenario 1 and scenario 2 described above.
  • This can act as the default cache if and when we expand to named caches in the future.

(Diagram omitted: the hierarchical per-node cache directory, with cache files organized per index.)

Pros:

  • Simple, easy to maintain
  • The user does not need to define a cache or a named cache; adding the search role to the node initiates the cache generation process.
  • API changes needed for the named cache approach (described below) can be avoided, as the user does not need to name the cache
  • The cache restore process is simpler, and phantom file issues can be rectified with ease.

Cons:

  • Does not provide users the flexibility to use different volumes for cache.

Code considerations:

  • We need to either reuse the existing UltraWarm mechanism or define a new mechanism for restoring the cache (a recovery-walk sketch follows this list).
  • In case of shard movement, we need to clear out the data for the relocated shard.
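For the restore step, one straightforward option is a directory walk on bootstrap that re-registers every surviving block file with its on-disk size. A sketch using java.nio (the registerInCache callback is hypothetical):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.function.BiConsumer;

// Walk the cache root (the per-index hierarchy described above) and hand every
// surviving block file back to the cache with its on-disk size.
static void recoverCache(Path cacheRoot, BiConsumer<Path, Long> registerInCache) throws IOException {
    Files.walkFileTree(cacheRoot, new SimpleFileVisitor<>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            registerInCache.accept(file, attrs.size());
            return FileVisitResult.CONTINUE;
        }
    });
}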

5.2 Cache Size

  • Introduce a new setting (searcher.cache.sizeinbytes or searcher.cache.reserved_disk_percent) that is initialized to a default value when a node starts with the search role and no explicit configuration exists.
  • This behavior of having a cache by default for search nodes (maybe a few hundred MBs) needs to be in place, since remote searchable snapshots always require some local storage
  • The user can also define an explicit cache size/disk percentage in the yaml config file.
  • The default cache storage will be restricted to a single data path (the first path on the list) and will utilize the reservation logic defined as follows.

5.2.1 Reservation logic

  • The reservation logic will work as follows (a minimal sketch follows the cases) -
    • Case A: The user has not defined an explicit cache size but the node is a search-capable node
      • Reserve a few hundred MBs for the cache (TBD)
    • Case B: The user defines the cache size using the defined property and the node is search capable
      • Reserve the defined size for the cache, ensuring the disk watermark thresholds are not triggered.
    • Case C: The node is not search capable
      • No-op in this scenario
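The three cases reduce to a small decision, sketched below (names and the default size are placeholders; the actual default is still TBD):

// Illustrative resolution of the reserved cache size for a node
static long resolveCacheReservationBytes(boolean searchCapable, Long configuredBytes) {
    if (!searchCapable) {
        return 0L;                     // Case C: no-op on non-search nodes
    }
    if (configuredBytes != null) {
        return configuredBytes;        // Case B: honor the explicit setting (watermark checks happen elsewhere)
    }
    return 100L * 1024 * 1024;         // Case A: small default reservation, e.g. ~100 MB (TBD)
}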

Code considerations:

  • Reservation logic kicks in as soon as the search role is discovered, and is cleaned up when the role no longer exists on the node.
  • Documentation needs to be updated to mention the space reserved for the cache when the search role is added (maybe a few hundred MBs)
  • FsProbe/FsService/MonitorService need to be updated to keep track of the cache size metric
  • DiskThresholdDecider/Monitor need to account for the cache size in shard management on a hybrid node

5.3 Cache Path

  • The node will have a default caching location under the data path, with the structure laid out above, relative to the node path passed into the node environment.
  • We will introduce a new property (searcher.cache.path) which can be used to set up the root cache folder with the hierarchy described above.

Code considerations:

  • The cache folder might need to move if the user defines the property and restarts the node/cluster (either move the previous files to the new path or delete the files at the default location)

6. Other Approaches

6.1 Named Caches

The named cache approach extends the default cache approach described above. It gives users more flexibility in defining multiple caches with different volumes and names.
Described in detail here.
6.1.1 Pros

  • Gives the users additional flexibility and benefits in terms of cache selection

6.1.2 Cons

  • We will need to update the snapshot restore API to add support for cache selection
  • The phantom file issue can be exacerbated
  • We will need to report multiple metrics for each defined cache, which might not be needed at this point.

6.2 Dynamic Cache

A dynamic cache would allow resizing to accommodate the changing nature of hot shards on hybrid nodes, where both remote and local shards are assigned to a node.
In essence, the cache can grow up to the point where it does not affect the shards, and starts shrinking and dynamically resizing once it reaches the watermark capacities.

This approach will require some real-world data before dynamic resizing can be implemented. The priority of hot shards and the space available on other nodes will need to be considered, similar to the allocator approach, with cache resizing taking place only if no other options are available; this will require priority-based weights in the decision making.
Described in detail here.

7. Backwards compatibility

The change is not breaking but an enhancement, and can be released in a minor version.

@aabukhalil (Contributor)

aabukhalil commented Feb 8, 2023

Following up on the cache design, based on the research already done in #4964 (comment) and some suggestions from discussions on #5641.

These are the specs needed for the LRUCache implementation (a rough sketch of the implied interface follows the list):

  • The cache must have a defined, fixed maximum capacity. Capacity in this cache context represents IndexInput size on disk, not the number of objects
  • The cache needs to maintain its capacity; when the current size exceeds capacity, it should delete IndexInputs both from memory and from local disk
  • The cache has to support custom object sizes / a custom weigher so that the cache size is computed from IndexInput size
  • Since a cache entry (IndexInput) might still be in use, cache eviction should skip an IndexInput while it is still in use. IndexInput usage is tracked by a reference count that the cache is aware of
  • A RemovalListener is needed
  • Cache entries are tracked in heap memory and the number of cache entries (IndexInputs) is not bounded. We need a heap circuit breaker for the cache to protect JVM health, or we need to bound the number of entries in the cache.
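Taken together, the specs imply a cache shape roughly like the following (a sketch, not the actual API in #5641):

// Rough interface implied by the specs above; illustrative only
interface RefCountedFileCache<K, V> {
    V get(K key);              // lookup; moves the entry to most-recently-used
    void put(K key, V value);  // insert; may trigger weight-based eviction of unreferenced entries
    void incRef(K key);        // pin: the entry must not be evicted while in use
    void decRef(K key);        // unpin: the entry becomes evictable again
    long usage();              // current total weight (bytes on disk), bounded by the fixed capacity
    void setRemovalListener(java.util.function.Consumer<V> onRemoval); // cleanup hook for evicted entries
}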

So far we have evaluated the Guava and Caffeine caches, but they don't support custom eviction logic (preventing eviction of entries with active usage) and they are not open for extension. That's why #5641 contains a modified version of the Guava/Caffeine cache.

Apache JCS was suggested in #5641 and it looks promising because it is more open for extension rather than modification, but I'm still not able to find a straightforward, perfectly fitting implementation for our cache usage.

For now, I think the path forward to unblock this is to proceed with the current custom cache implementation and then figure out whether another LRU cache can be adopted here. This is tracked in #6225.

@ben-manes

> Since a cache entry (IndexInput) might still be in use, cache eviction should skip an IndexInput while it is still in use. IndexInput usage is tracked by a reference count that the cache is aware of
> ...
> So far we have evaluated the Guava and Caffeine caches, but they don't support custom eviction logic (preventing eviction of entries with active usage) and they are not open for extension.

This is partially supported by pinning the entry so that it is skipped for evaluation. That is done by using a Map.compute to mark the entry so that it is evaluated to a zero weight and an infinite expiration. This is equivalent to simply moving the entry into a secondary unbounded map so that it is no longer managed by the cache. The api limitation is that the cache has to be notified that the entry is pinned and won't evaluate a custom predicate during the selection process.

The problem with customizing this to inspect the candidate via a callback is that it turns an O(1) evaluation into an O(n) scan. That can result in a memory leak if no victim is found, and may waste CPU cycles as reads/writes trigger new maintenance cycles that try to honor the threshold. An extension point could be error prone, and while experts might be careful, most users would not be. The two popular caches that have something like this feature are Ehcache's EvictionAdvisor (a soft no; it will evict an entry regardless of the advice) and Coherence's EvictionApprover (a hard no, which could lead to heap exhaustion). Neither offers high performance, and the latter has switched to Caffeine going forward.

The evaluation approach has come up a few times, such as with Apache Solr, but usually in terms of old code that the authors do not want to change. I haven't found a satisfactory api that avoids footguns, which is okay for internal code where the team can be pragmatic, but very problematic for external ones. If you can use pinning (explicitly or implicitly) then the implementation logic will be obvious, it will be easier to debug, and there will be fewer surprises. Otherwise, if you need the evaluation approach then I am happy to discuss it for Caffeine, but I am doubtful that we will find a solution that we'll support.
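For reference, a minimal sketch of the pinning approach described above, using Caffeine (the FileEntry type is illustrative):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch: a pinned entry re-weighs to zero, so the size-based policy skips it
class PinnableFileCache {
    static final class FileEntry {
        final int sizeOnDisk;
        volatile boolean pinned;

        FileEntry(int sizeOnDisk) { this.sizeOnDisk = sizeOnDisk; }
    }

    private final Cache<String, FileEntry> cache = Caffeine.newBuilder()
        .maximumWeight(10_000_000) // illustrative byte budget
        .weigher((String key, FileEntry value) -> value.pinned ? 0 : value.sizeOnDisk)
        .build();

    // Map.compute re-writes the entry, which makes Caffeine re-evaluate its weight
    void setPinned(String key, boolean pinned) {
        cache.asMap().compute(key, (k, v) -> {
            if (v != null) v.pinned = pinned;
            return v;
        });
    }
}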

@aabukhalil (Contributor)

Thanks @ben-manes for the help!! We will use your advice in our next evaluation.

@andrross (Member, Author)

Closing as #5641 has been merged, though follow up tasks have been created.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Searchable Snapshots Feb 10, 2023