
[Searchable Snapshot] Design file caching mechanism for block based files #4964

Closed
Tracked by #5087
andrross opened this issue Oct 28, 2022 · 11 comments
Labels: discuss, enhancement, Indexing & Search

@andrross (Member)

Currently, searchable snapshots download Lucene files using a chunking approach so that only the data needed to service a query is fetched. This feature should use a node-level LRU cache that consumes up to a configurable amount of local disk space, to avoid re-downloading the same parts of frequently accessed files. All shards on the node should share the same logical cache, meaning that if one shard is queried exclusively, it can use up to the entire cache space configured for the node.
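For illustration only, here is a minimal sketch of such a byte-weighted, node-level LRU. The BlockKey type and the in-heap byte[] payloads are stand-ins; the real cache would hold references to blocks stored on local disk:

import java.util.LinkedHashMap;

// Hypothetical sketch of a shared, byte-budgeted LRU over downloaded file blocks.
class BlockLruCache {
    record BlockKey(String fileName, long offset) {}

    private final long maxBytes; // the configurable per-node cache budget
    private long usedBytes = 0;

    // accessOrder=true makes iteration start from the least-recently-used entry
    private final LinkedHashMap<BlockKey, byte[]> blocks = new LinkedHashMap<>(16, 0.75f, true);

    BlockLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    synchronized byte[] get(BlockKey key) {
        return blocks.get(key);
    }

    // Re-puts of an existing key are ignored here for brevity
    synchronized void put(BlockKey key, byte[] data) {
        usedBytes += data.length;
        blocks.put(key, data);
        // Evict least-recently-used blocks until the byte budget is honored
        var iterator = blocks.entrySet().iterator();
        while (usedBytes > maxBytes && iterator.hasNext()) {
            usedBytes -= iterator.next().getValue().length;
            iterator.remove();
        }
    }
}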

Open questions:

  • How is the cache size configured for a node? Is there a reasonable default if no configuration is provided?
  • Where on disk should the data be cached? i.e. inside the same directory structure as the rest of the index data? Or are there use cases where the cache would want to be a dedicated disk or mount that would require a separate top-level directory?
  • How should a node report cache statistics and/or utilization?
@aabukhalil (Contributor)

I'll be working on this

@anasalkouz (Member)

Are we building this with the assumption that the searcher node will only be used for remote indices? If not, we need to think through more complex use cases where the same node hosts both a local (hot) index and a remote index with a cache, which means we need mechanisms to control the storage usage for hot versus cached data.

@aabukhalil (Contributor)

Here are my findings for the caching mechanism.
The path.data setting accepts a list, so a data node can be statically configured to use multiple data paths and/or volumes.

Cache Scope

The file system cache will be a data-node-local cache. A single data node might have multiple data volumes, multiple indices and shards, and possibly multiple caches. We need to decide what the Volume:Cache mapping would be (1:1, 1:M, ..) and the Cache:Index mapping.

I think the most flexible and highly configurable interface for defining caches is a static named cache (perhaps later exposed via a dynamic API) in the yml file, like:

remote_store_cache:
  cache_1_name:
    path: /path/where/to/put/cache1
    reserved_percent: 30
  cache_2_name:
    path: /path/where/to/put/cache2
    reserved_percent: 30

or it can be flattened as follows (since a map Setting is not supported):

named_caches: [cache_1_name, cache_2_name]
named_cache.cache_1_name.path:
named_cache.cache_1_name.reserved_percent:
named_cache.cache_2_name.path:
named_cache.cache_2_name.reserved_percent:

and when we restore a remote-backed index, modify the API /_snapshot/test_repository/test_snapshot/_restore to accept a cache_name:

{
    "indices": "nyc_taxis",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored_$1",
    "storage_type": "remote_snapshot",
    "cache_name": "named_defined_cache" // if not provided, nothing will be cached ?
}

This way users can decide how many caches they need, which volume each cache uses, whether to share space with local hot indices, which indices use which cache, and how fast and large each cache is. This interface effectively lets the user decide whether a node's role is hot, cold, warm, or mixed. For now I don't see a need to support a single cache backed by multiple volumes / data paths. The disadvantage of this approach is that it's harder to define default behavior (when no cache is defined), especially with the size reservation logic.
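If we go with the flattened keys, they could likely be registered through the existing affix-setting mechanism. A sketch under that assumption (setting names mirror the proposal above, and the defaults are illustrative):

import java.util.List;
import java.util.function.Function;
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.AffixSetting;
import org.opensearch.common.settings.Setting.Property;

// Sketch only: register the flattened named_cache.* keys as affix settings,
// since map-valued settings are not supported.
public class NamedCacheSettings {
    public static final Setting<List<String>> NAMED_CACHES =
        Setting.listSetting("named_caches", List.of(), Function.identity(), Property.NodeScope);

    public static final AffixSetting<String> CACHE_PATH =
        Setting.affixKeySetting("named_cache.", "path",
            key -> Setting.simpleString(key, Property.NodeScope));

    public static final AffixSetting<Integer> CACHE_RESERVED_PERCENT =
        Setting.affixKeySetting("named_cache.", "reserved_percent",
            key -> Setting.intSetting(key, 30, 0, 100, Property.NodeScope));
}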

Other options

  • Cache per Index: stored at the same data path as the index; more convenient and requiring less configuration, but it's hard to satisfy cases where a shared cache is needed, or when the index name is not known ahead of time, or in future stages when index metadata will not exist on the node. It's better to have a dedicated directory for the cache, which makes data cleaning easier (more details below). Maintaining an LRU cache per index might also carry too much overhead. Multiple caches might be able to share the same quota of disk usage, but that is complicated to implement and leaves many open questions.
  • Single Cache per Node: might require implementing a cache backed by multiple data paths, is less configurable, and all indices would share the same cache. The complexity of implementing multiple named caches is only slightly higher than this single cache, so I think it's okay to implement named caches in the first iteration.

Cache Size and Reservation

Depending on the cache scope decision, a cache might share space with hot Lucene index data. Hot shards might grow, get relocated, etc., so either the cache size must be dynamic (best-effort use of space not utilized by hot shards, with hot shards having priority on the disk; shard relocation logic would need access to the cache to make it shrink, which adds complexity), or the cache size should be fixed and reserved. I think reserving a fixed size is simpler and keeps the components more decoupled.

For now, I think defining the cache as a fixed percentage of a volume is easier to use and more adaptive for users than an absolute byte value.
When a cache is defined as, e.g., 30% of a volume or X bytes, the cache may occupy anywhere in [0, X) bytes on disk at any time before it starts evicting and replacing files; it does not occupy those bytes up front.

The cache disk reservation logic will impact all of the following:

  • NodesStats, as fs stats need to take the cache's reserved size into account
  • DiskThresholdDecider: once the monitor service has the reservation logic injected, it should work out of the box
  • MonitorService needs to take the cache's reserved size into account
  • org.opensearch.cluster.ClusterInfo::reservedSpace
  • org.opensearch.cluster.InternalClusterInfoService::buildShardLevelInfo
  • org.opensearch.index.store.StoreStats#reservedSize
    so that org.opensearch.cluster.routing.allocation.DiskThresholdMonitor and org.opensearch.cluster.routing.allocation.decider.DiskThresholdDecider take the new disk reservation logic into account

Reservation Logic

When creating the cache, validations are needed to ensure:

  • (cache defined reserved size - current cache phantom files + Lucene index file size) stays within cluster.routing.allocation.disk.watermark.high, cluster.routing.allocation.disk.watermark.low, and so on (a minimal sketch of this check follows the list)
  • When the cache is initiated during the node start procedure, walk the cache directory's files and count their sizes
  • For now we don't need to handle the case where the cache data path has changed and the old path has ghost files
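A minimal sketch of the first validation, assuming the high watermark has already been resolved to a required-free-bytes value for the volume (the real computation lives in the disk threshold components listed earlier):

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: reject a cache reservation that would push the volume past
// the high watermark. All byte values are supplied by hypothetical callers.
static void validateCacheReservation(Path cachePath, long reservedCacheBytes,
        long existingCacheFileBytes, long highWatermarkRequiredFreeBytes) throws IOException {
    FileStore store = Files.getFileStore(cachePath);
    // Only the not-yet-materialized part of the reservation consumes new space;
    // Lucene index files already on disk are reflected in getUsableSpace()
    long additionalBytes = Math.max(0, reservedCacheBytes - existingCacheFileBytes);
    if (store.getUsableSpace() - additionalBytes < highWatermarkRequiredFreeBytes) {
        throw new IllegalArgumentException(
            "cache reservation of " + reservedCacheBytes + " bytes would breach the high disk watermark");
    }
}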

Default Behavior

Given that a fixed-size cache with reservation logic is easier to implement, and that (depending on the cache scope decision) not all data nodes will have a remote-backed index, it does not make sense to reserve a file system cache by default, because that would waste resources. To have caching enabled by default, we would need to implement a dynamically sized cache.

So for default behavior, these are the options I can think of:

  • a dynamically sized cache, which might be complex; if feasible, it could be enabled by default
  • a reserved fixed-size cache by default, which is unsuitable when users have no remote store indices or don't need caching at all
  • no cache by default, with users explicitly defining caches. Having a node-level cache effectively introduces a new node type, a cacheable reader: a reader node with an option to attach a cache.

Stats: what to publish, how, and at what scope?

Since the cache is scoped per node, we can publish stats at the node level or the named-cache level. The named-cache level is more granular, at the cost of more expensive stats to maintain and transmit, and we could do both. For now, it's easier to have stats per named cache.

Stats to publish:
org.opensearch.common.cache.Cache.CacheStats (hits, misses, evictions), plus utilized bytes (the total bytes of cached content) and the reserved size per named cache, published to node stats.

org.opensearch.node.NodeService#stats should publish these, and org.opensearch.monitor.fs.FsService#stats() should contain the named cache stats, or we introduce a new field in org.opensearch.action.admin.cluster.node.stats.NodeStats if it does not make sense to include the remote store cache in fs stats.
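For illustration, the per-named-cache stats payload could be shaped roughly like this (field names are hypothetical, not an existing OpenSearch class):

// Hypothetical stats holder; the real implementation would also implement
// the serialization plumbing used by node stats.
public record NamedCacheStats(
    String cacheName,
    long hits,
    long misses,
    long evictions,
    long utilizedBytes, // bytes of cached file content currently on disk
    long reservedBytes  // configured reservation for this cache
) {}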

How to build the LRU Cache?

We should not build a cache from scratch for this. Any standard on-heap LRU cache with removal listeners and a weigher should work for implementing the file system caching. Basically, the on-heap cache will hold only a reference to the file on disk, as below:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

// Heap entry that only references a file on disk; the file bytes stay on disk
static class FileOnDisk {
    final int sizeOnDisk;

    FileOnDisk(int sizeOnDisk) {
        this.sizeOnDisk = sizeOnDisk;
    }
}

Cache<String, FileOnDisk> theCache = CacheBuilder.newBuilder()
    .maximumWeight(10) // maximum total weight, i.e. the cache size in bytes (illustrative value)
    .weigher((Weigher<String, FileOnDisk>) (key, value) -> value.sizeOnDisk) // weigh each entry by its on-disk size
    .build();
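Since evicted entries must also be removed from disk, a removal listener would perform the cleanup. A sketch under the same Guava assumption, treating the key as the cached file's path (an illustrative choice, not a settled design):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.Weigher;

// On eviction, delete the backing file so that disk usage tracks the cache weight
Cache<String, FileOnDisk> cacheWithCleanup = CacheBuilder.newBuilder()
    .maximumWeight(10) // cache size in bytes (illustrative value, as above)
    .weigher((Weigher<String, FileOnDisk>) (key, value) -> value.sizeOnDisk)
    .removalListener((RemovalListener<String, FileOnDisk>) notification -> {
        try {
            Files.deleteIfExists(Path.of(notification.getKey()));
        } catch (IOException e) {
            // A failed deletion becomes a "phantom file" that recovery must sweep up
        }
    })
    .build();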

Looking at the current LRU cache in our code base, it looks like org.opensearch.common.cache has the functionality we need, and we could use it. But it appears to be an outdated version of the Guava Cache, with a lot of locking to maintain the LRU order. So, as an improvement for this remote store feature and for every other user of that cache, maybe we should upgrade to the latest Guava Cache or its successor, Caffeine, which is used by many data stores such as Solr, Cassandra, and Druid.

Minimum Viable Product

If we define a dedicated role for remote store readers (and change the hot shard allocation decider to exclude these nodes), then there is no need for cache size reservation logic at all, since all data volumes will be used exclusively by the cache. But this does not solve the concern about default out-of-the-box behavior, since users need to explicitly set the node role to remote reader only.

So, depending on the effort, I think implementing the cache reservation logic is more flexible for users and a better future investment. A rough sketch of the allocation decider mentioned above follows.
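If we did introduce a dedicated remote-reader role, the exclusion could be expressed as an allocation decider along these lines (the role and shard-type checks are hypothetical placeholders):

import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

// Sketch: keep hot (local) shards off dedicated remote-reader nodes.
// isDedicatedRemoteReader() and isRemoteSnapshotShard() are hypothetical helpers
// standing in for the real role and index-settings checks.
public class RemoteReaderAllocationDecider extends AllocationDecider {
    public static final String NAME = "remote_reader";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (isDedicatedRemoteReader(node) && !isRemoteSnapshotShard(shardRouting, allocation)) {
            return allocation.decision(Decision.NO, NAME,
                "hot shards are not allowed on dedicated remote reader nodes");
        }
        return allocation.decision(Decision.YES, NAME, "shard is allowed on this node");
    }

    private boolean isDedicatedRemoteReader(RoutingNode node) {
        return false; // hypothetical: inspect the node's roles
    }

    private boolean isRemoteSnapshotShard(ShardRouting shard, RoutingAllocation allocation) {
        return false; // hypothetical: inspect the index's storage_type setting
    }
}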

@aabukhalil (Contributor)

aabukhalil commented Nov 11, 2022

Task Breakdown:

@aabukhalil (Contributor)

aabukhalil commented Nov 11, 2022

Open Questions

  • The cache must not de-reference and delete a file while it is in use by any IndexInput (many IndexInputs may be created and backed by the same file); a minimal reference-counting sketch follows this list. What happens if the cache is under high-throughput concurrent search load, where the currently held files exceed the cache capacity and an IndexInput is needed to serve a request? What if disk size limits are reached? Should these requests block until there is more room in the cache? If so, how do we prevent deadlocks where multiple search requests hold partial IndexInputs and all wait for room? Should we keep the current implementation of no caching, using ByteBufferIndexInput, and fall back to it when the cache is full? What circuit breaker mechanism can we use here?
  • If we implement the named cache mechanism, how do we handle multiple data nodes having inconsistent cache names? Should the coordinator re-route queries based on cache name? If so, the cache name would effectively act as a role itself. Or should the cluster fail if it detects inconsistent static settings between nodes, or do we simply leave it as the user's responsibility?
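To make the reference-counting constraint in the first question concrete, a minimal sketch (names are illustrative):

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: eviction must skip entries whose refCount > 0, because
// live IndexInputs still read from the underlying on-disk file.
class CachedFileEntry {
    private final AtomicInteger refCount = new AtomicInteger();

    void incRef() { refCount.incrementAndGet(); }   // called when an IndexInput opens the file

    void decRef() { refCount.decrementAndGet(); }   // called when an IndexInput is closed

    boolean isEvictable() { return refCount.get() == 0; } // consulted by the cache during eviction
}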

@andrross (Member, Author)

Thanks @aabukhalil! I'm still digging into a lot of this, but I just wanted to point out an existing issue that documents the poor performance of the existing cache in OpenSearch.

@kotwanikunal (Member)

kotwanikunal commented Feb 2, 2023

Path forward for the open questions here.

File Cache - Design Considerations

1. Overview

Searchable snapshots introduced indices which can be queried without downloading all the segment files onto the node, instead fetching parts of the segment files on demand as the query dictates. These parts, also known as blocks, are currently re-downloaded on every call and could instead be cached on the local node where the query is being served.

The design for searchable snapshot file caching has been detailed in this issue. There are certain design considerations and decisions around cache scope, space reservation, and cache recovery that are answered as part of this document.

2. Problem Statement

The file cache has open design decisions around cache sizing, cache scope, reservation logic, and cache recovery that need to be finalized within the constraints of the node, the cluster, and the searchable snapshot feature -

  • The cache needs to use the storage made available by the paths within the node environment
  • The cache needs to be local to the node
  • The cache should be recoverable on node bootstrapping and should not have any file/resource leaks

3. Goals

The file caching mechanism for searchable snapshots should be able to satisfy the above constraints for the following use cases -

  1. Perform cache operations (reservation, recovery etc.) on a remote-only shard node
  2. Perform cache operations on a hybrid node (a node which supports both local and remote shards)

There are two major scenarios for which the design will provide answers -

  1. Customer adds a remote-only shards node to the cluster
  2. Customer adds the search role to an existing node within the cluster

4. Glossary

  • Cache Scope: In the context of this document, cache scope refers to the relationship between the defined cache and the node, and between the cache and the index.
  • Cache size: The fixed size of the cache in bytes which will be utilized on disk for multiple indices as a part of searchable snapshots.
  • Cache reservation logic: The logic that will be introduced to accommodate reservation of space on the disk for searcher cache.
  • Cache recovery: The process that runs once the node is restarted/recovered/bootstrapped to ensure that the searcher cache goes back to tracking the files available locally on storage and adds them back to the cache.
  • Phantom Files: Files which are unused and no longer tracked by the cache, but were created by the cache and are consuming disk storage.

5. Proposed solution

The guiding principle for these solutions is to keep things simple while remaining open to future extensibility. The solutions for the various design decisions are described below -

5.1 Cache Scope

  • Have a single cache per node with a hierarchical structure that contains the cache files organized per index.
  • There will be a single cache location per node, initialized when the search role is added to the node; this covers scenario 1 and scenario 2 described above.
  • This can act as the default cache if and when we expand to named caches in the future.

(Diagram omitted: the hierarchical per-node cache directory, with cache files organized per index.)

Pros:

  • Simple, easy to maintain
  • The user does not need to define a cache or a named cache; adding the search role to the node initiates the cache generation process.
  • API changes needed for the named cache approach (described below) can be avoided, as the user does not need to name the cache
  • The cache restore process is simpler, and phantom file issues can be rectified with ease.

Cons:

  • Does not provide users the flexibility to use different volumes for cache.

Code considerations:

  • We need to either reuse the existing UltraWarm mechanism or define a new mechanism for restoring the cache (a recovery-walk sketch follows this list).
  • In case of shard movement, we need to clear out the data for the relocated shard.
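For the restore step, one straightforward option is a directory walk on bootstrap that re-registers every surviving block file with its on-disk size. A sketch using java.nio (the registerInCache callback is hypothetical):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.function.BiConsumer;

// Walk the cache root (the per-index hierarchy described above) and hand every
// surviving block file back to the cache with its on-disk size.
static void recoverCache(Path cacheRoot, BiConsumer<Path, Long> registerInCache) throws IOException {
    Files.walkFileTree(cacheRoot, new SimpleFileVisitor<>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            registerInCache.accept(file, attrs.size());
            return FileVisitResult.CONTINUE;
        }
    });
}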

5.2 Cache Size

  • Introduce a new setting (searcher.cache.sizeinbytes or searcher.cache.reserved_disk_percent) that is initialized to a default value when a node starts with the search role and no explicit configuration exists.
  • This behavior of having a cache by default for search nodes (maybe a few hundred MBs) needs to be in place, since remote searchable snapshots always require some local storage
  • The user can also define an explicit cache size/disk percentage in the yaml config file.
  • The default cache storage will be restricted to a single data path (the first path on the list) and will utilize the reservation logic defined as follows.

5.2.1 Reservation logic

  • The reservation logic will work as follows (a minimal sketch follows the cases) -
    • Case A: The user has not defined an explicit cache size but the node is a search-capable node
      • Reserve a few hundred MBs for the cache (TBD)
    • Case B: The user defines the cache size using the defined property and the node is search capable
      • Reserve the defined size for the cache, ensuring the disk watermark thresholds are not triggered.
    • Case C: The node is not search capable
      • No-op in this scenario
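The three cases reduce to a small decision, sketched below (names and the default size are placeholders; the actual default is still TBD):

// Illustrative resolution of the reserved cache size for a node
static long resolveCacheReservationBytes(boolean searchCapable, Long configuredBytes) {
    if (!searchCapable) {
        return 0L;                     // Case C: no-op on non-search nodes
    }
    if (configuredBytes != null) {
        return configuredBytes;        // Case B: honor the explicit setting (watermark checks happen elsewhere)
    }
    return 100L * 1024 * 1024;         // Case A: small default reservation, e.g. ~100 MB (TBD)
}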

Code considerations:

  • Reservation logic kicks in as soon as the search role is discovered, and is cleaned up when the role no longer exists on the node.
  • Documentation needs to be updated to mention the space reserved for the cache when the search role is added (maybe a few hundred MBs)
  • FsProbe/FsService/MonitorService need to be updated to keep track of the cache size metric
  • DiskThresholdDecider/Monitor need to account for the cache size in shard management on a hybrid node

5.3 Cache Path

  • The node will have a default caching location under the data path, with the structure laid out above, relative to the node path passed into the node environment.
  • We will introduce a new property (searcher.cache.path) which can be used to set up the root cache folder with the hierarchy described above.

Code considerations:

  • The cache folder might need to move if the user defines the property and restarts the node/cluster (either move the previous files to the new path or delete the files at the default location)

6. Other Approaches

6.1 Named Caches

The named cache approach extends the default cache approach described above. It gives users more flexibility in defining multiple caches with different volumes and names.
Described in detail here.
6.1.1 Pros

  • Gives the users additional flexibility and benefits in terms of cache selection

6.1.2 Cons

  • We will need to update the snapshot restore API to add support for cache selection
  • The phantom file issue can be exacerbated
  • We will need to report multiple metrics for each defined cache, which might not be needed at this point.

6.2 Dynamic Cache

A dynamic cache would allow resizing to accommodate the changing nature of hot shards on hybrid nodes, where both remote and local shards are assigned to a node.
In essence, the cache can grow up to the point where it does not affect the shards, and starts shrinking and dynamically resizing once it reaches the watermark capacities.

This approach will require some real-world data before dynamic resizing can be implemented. The priority of hot shards and the space available on other nodes will need to be considered, similar to the allocator approach, with cache resizing taking place only if no other options are available; this will require priority-based weights in the decision making.
Described in detail here.

7. Backwards compatibility

The change is not breaking but an enhancement, and can be released in a minor version.

@aabukhalil (Contributor)

aabukhalil commented Feb 8, 2023

Following up on the cache design, based on the research already done in #4964 (comment) and some suggestions from discussions on #5641.

These are the specs needed for the LRUCache implementation (a rough sketch of the implied interface follows the list):

  • The cache must have a defined, fixed maximum capacity. Capacity in this cache context represents IndexInput size on disk, not the number of objects
  • The cache needs to maintain its capacity; when the current size exceeds capacity, it should delete IndexInputs both from memory and from local disk
  • The cache has to support custom object sizes / a custom weigher so that the cache size is computed from IndexInput size
  • Since a cache entry (IndexInput) might still be in use, cache eviction should skip an IndexInput while it is still in use. IndexInput usage is tracked by a reference count that the cache is aware of
  • A RemovalListener is needed
  • Cache entries are tracked in heap memory and the number of cache entries (IndexInputs) is not bounded. We need a heap circuit breaker for the cache to protect JVM health, or we need to bound the number of entries in the cache.
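Taken together, the specs imply a cache shape roughly like the following (a sketch, not the actual API in #5641):

// Rough interface implied by the specs above; illustrative only
interface RefCountedFileCache<K, V> {
    V get(K key);              // lookup; moves the entry to most-recently-used
    void put(K key, V value);  // insert; may trigger weight-based eviction of unreferenced entries
    void incRef(K key);        // pin: the entry must not be evicted while in use
    void decRef(K key);        // unpin: the entry becomes evictable again
    long usage();              // current total weight (bytes on disk), bounded by the fixed capacity
    void setRemovalListener(java.util.function.Consumer<V> onRemoval); // cleanup hook for evicted entries
}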

So far we have evaluated the Guava and Caffeine caches, but they don't support custom eviction logic (preventing eviction of entries with active usage) and they are not open for extension. That's why #5641 contains a modified version of the Guava/Caffeine cache.

Apache JCS was suggested in #5641 and it looks promising because it is more open for extension rather than modification, but I'm still not able to find a straightforward, perfectly fitting implementation for our cache usage.

For now, I think the path forward to unblock this is to proceed with the current custom cache implementation and then figure out whether another LRU cache can be adopted here. This is tracked in #6225.

@ben-manes

> Since a cache entry (IndexInput) might still be in use, cache eviction should skip an IndexInput while it is still in use. IndexInput usage is tracked by a reference count that the cache is aware of
> ...
> So far we have evaluated the Guava and Caffeine caches, but they don't support custom eviction logic (preventing eviction of entries with active usage) and they are not open for extension.

This is partially supported by pinning the entry so that it is skipped for evaluation. That is done by using a Map.compute to mark the entry so that it is evaluated to a zero weight and an infinite expiration. This is equivalent to simply moving the entry into a secondary unbounded map so that it is no longer managed by the cache. The api limitation is that the cache has to be notified that the entry is pinned and won't evaluate a custom predicate during the selection process.

The problem with customizing this to inspect the candidate via a callback is that it turns an O(1) evaluation into an O(n) scan. That can result in a memory leak if no victim is found, and may waste CPU cycles as reads/writes trigger new maintenance cycles that try to honor the threshold. An extension point could be error prone, and while experts might be careful, most users would not be. The two popular caches that have something like this feature are Ehcache's EvictionAdvisor (a soft no; it will evict an entry regardless of the advice) and Coherence's EvictionApprover (a hard no, which could lead to heap exhaustion). Neither offers high performance, and the latter has switched to Caffeine going forward.

The evaluation approach has come up a few times, such as with Apache Solr, but usually in terms of old code that the authors do not want to change. I haven't found a satisfactory api that avoids footguns, which is okay for internal code where the team can be pragmatic, but very problematic for external ones. If you can use pinning (explicitly or implicitly) then the implementation logic will be obvious, it will be easier to debug, and there will be fewer surprises. Otherwise, if you need the evaluation approach then I am happy to discuss it for Caffeine, but I am doubtful that we will find a solution that we'll support.
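For reference, a minimal sketch of the pinning approach described above, using Caffeine (the FileEntry type is illustrative):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch: a pinned entry re-weighs to zero, so the size-based policy skips it
class PinnableFileCache {
    static final class FileEntry {
        final int sizeOnDisk;
        volatile boolean pinned;

        FileEntry(int sizeOnDisk) { this.sizeOnDisk = sizeOnDisk; }
    }

    private final Cache<String, FileEntry> cache = Caffeine.newBuilder()
        .maximumWeight(10_000_000) // illustrative byte budget
        .weigher((String key, FileEntry value) -> value.pinned ? 0 : value.sizeOnDisk)
        .build();

    // Map.compute re-writes the entry, which makes Caffeine re-evaluate its weight
    void setPinned(String key, boolean pinned) {
        cache.asMap().compute(key, (k, v) -> {
            if (v != null) v.pinned = pinned;
            return v;
        });
    }
}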

@aabukhalil (Contributor)

Thanks @ben-manes for the help!! We will use your advice in our next evaluation.

@andrross (Member, Author)

Closing as #5641 has been merged, though follow up tasks have been created.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Searchable Snapshots Feb 10, 2023