Relax Lucene Index Upgrade Policy to Allow Safe Upgrades Across Multiple Major Versions #13797

Open
markrmiller opened this issue Sep 16, 2024 · 2 comments

Comments

@markrmiller
Member

markrmiller commented Sep 16, 2024

Description

TLDR: Relax index upgrade policy across major versions to only be as strict as necessary.

Here is an attempted summary of a recent discussion about this.

Currently, Lucene's policy requires a full reindex when upgrading across more than one major version, which can create significant friction for users with large indexes. We propose relaxing this policy to allow upgrades across multiple major versions when it is safe to do so. The goal is to give users more flexibility without compromising data integrity.

Proposed Changes:

  • Modify Upgrade Policy: Allow upgrades across multiple major versions, replacing the existing restriction with a configurable MIN_SUPPORTED_MAJOR version in Version.java.

  • Controlled Version Bumping: Bump MIN_SUPPORTED_MAJOR only when necessary, due to index format changes that prevent safe upgrades (e.g., changes to norms encoding).

  • Improved Documentation: Clearly document which versions can be safely upgraded to the current version without reindexing.

  • Retain Reindexing When Necessary: Ensure that reindexing is still required when necessary to maintain correctness or prevent the propagation of corruption.

Benefits:

  • Reduces friction and operational overhead for users with large indexes.

  • Facilitates more frequent major releases by reducing mandatory reindexing.

  • Maintains safety and integrity by reindexing only when required.

Implementation Plan:

  • Modify Version.java to use a configurable MIN_SUPPORTED_MAJOR.

  • Update the index upgrade logic to check against MIN_SUPPORTED_MAJOR rather than just the previous major version.

  • Enhance documentation to provide clear guidelines on safe upgrade paths and scenarios requiring reindexing.
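As a rough illustration of the difference between the two checks, here is a minimal standalone sketch. It does not use Lucene's actual Version.java; the class name, method names, and the values 23 and 20 are assumptions chosen only to match the example versions used in the note further down.

```java
/** Hypothetical sketch of the two policies; not Lucene's real Version.java. */
public final class UpgradePolicySketch {
    /** Assumed for illustration: the major version of the running library. */
    public static final int LATEST_MAJOR = 23;

    /**
     * Assumed for illustration: the oldest major whose indexes can still be
     * upgraded safely. Under the proposal this is bumped only when a format
     * change (e.g. a norms encoding change) makes older indexes unsafe.
     */
    public static final int MIN_SUPPORTED_MAJOR = 20;

    /** Existing-style rule: only the current and previous major are accepted. */
    public static boolean allowedByPreviousMajorRule(int indexMajor) {
        return indexMajor >= LATEST_MAJOR - 1 && indexMajor <= LATEST_MAJOR;
    }

    /** Proposed-style rule: anything at or above MIN_SUPPORTED_MAJOR is accepted. */
    public static boolean allowedByMinSupportedRule(int indexMajor) {
        return indexMajor >= MIN_SUPPORTED_MAJOR && indexMajor <= LATEST_MAJOR;
    }

    public static void main(String[] args) {
        for (int major = 19; major <= LATEST_MAJOR; major++) {
            System.out.printf(
                "index major %d: previous-major rule=%b, MIN_SUPPORTED_MAJOR rule=%b%n",
                major,
                allowedByPreviousMajorRule(major),
                allowedByMinSupportedRule(major));
        }
    }
}
```

The only behavioral difference is the lower bound: a fixed offset from the latest major in the first rule, an explicitly maintained constant in the second.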

Request for Feedback: We welcome feedback from the community on this proposal, especially regarding its potential impact, implementation details, and any concerns about safety and backward compatibility.

  • Note: Upgrading from Lucene 20 to Lucene 23 would require first going from 20 to 21, then from 21 to 22, and then from 22 to 23, unless a change in one of those versions prevented that step, in which case a reindex would be required.
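
The stepwise path in the note can be sketched as a small helper. This is a hypothetical illustration, not an existing Lucene API: `upgradePath` and the `breakingMajors` set (majors whose release introduced a format change that cannot be upgraded across) are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; nothing here is part of Lucene. */
public final class UpgradePathSketch {
    /**
     * Returns the majors to step through, one at a time, to get from
     * {@code from} to {@code to}. Returns an empty Optional when any step
     * lands on a major in {@code breakingMajors}, meaning a reindex is needed.
     */
    public static Optional<List<Integer>> upgradePath(
            int from, int to, Set<Integer> breakingMajors) {
        if (from > to) {
            throw new IllegalArgumentException("cannot downgrade: " + from + " -> " + to);
        }
        List<Integer> steps = new ArrayList<>();
        for (int major = from + 1; major <= to; major++) {
            if (breakingMajors.contains(major)) {
                return Optional.empty(); // this step is unsafe: reindex instead
            }
            steps.add(major);
        }
        return Optional.of(steps);
    }

    public static void main(String[] args) {
        // No breaking changes between 20 and 23: step through 21, 22, 23.
        System.out.println(upgradePath(20, 23, Set.of()));
        // Assume (for illustration only) that 22 changed the format incompatibly.
        System.out.println(upgradePath(20, 23, Set.of(22)));
    }
}
```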
@jpountz
Contributor

jpountz commented Sep 18, 2024

I have had many discussions on this topic of file format bw compat over the years, because users would ideally like to think of their indexes as never expiring. If this is the problem that should be solved, then there are two main options that I can think of:

  • increasing backward compatibility of already written data,
  • performing a periodic transparent background reindexing.

I have developed a preference for the second option. It is cheap in hardware costs when you compare the cost of storing an index for ~3 years (which is about the duration of our backward compatibility window) with the cost of reindexing the same index. And it comes with the great benefit that it can also be taken as an opportunity to index data in a more modern way (e.g. switching from trie fields to points in Lucene 6, switching scoring factors from doc values to FeatureField in Lucene 8, enabling vector search in addition to lexical search in Lucene 9, enabling sparse indexing in Lucene 10, etc.).

The way I'm thinking of it is that you would create a point-in-time view of your index, reindex it into a new index, stop the world while you replay the operations made since the point-in-time view was taken and swap pointers from the old index to the new index, and finally remove the old index. Given the orchestration required, it's probably best solved on top of Lucene (in Solr, Elasticsearch, or luceneserver), but we could look into adding tooling for this in Lucene?
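
To make the sequence above concrete, here is a minimal in-memory sketch. It is only a model under loud assumptions: plain maps stand in for indexes, an in-memory list stands in for the operation log, the `modernize` function stands in for "indexing in a more modern way", and none of the names correspond to Lucene, Solr, or Elasticsearch APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory model; maps stand in for real indexes. */
public final class BackgroundReindexSketch {
    /** One logged write; a null doc means "delete this id". */
    private static final class Op {
        final String id;
        final String doc;
        Op(String id, String doc) { this.id = id; this.doc = doc; }
    }

    private Map<String, String> live = new HashMap<>();
    // Never truncated in this sketch; a real system would need to trim it.
    private final List<Op> opLog = new ArrayList<>();

    public synchronized void write(String id, String doc) {
        apply(live, id, doc);
        opLog.add(new Op(id, doc));
    }

    public synchronized String get(String id) {
        return live.get(id);
    }

    private static void apply(Map<String, String> index, String id, String doc) {
        if (doc == null) index.remove(id); else index.put(id, doc);
    }

    public void reindex(UnaryOperator<String> modernize) {
        Map<String, String> snapshot;
        int replayFrom;
        synchronized (this) {
            // 1. Point-in-time view, plus the log position it corresponds to.
            snapshot = new HashMap<>(live);
            replayFrom = opLog.size();
        }
        // 2. Reindex into a new index without blocking writers.
        Map<String, String> fresh = new HashMap<>();
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            fresh.put(e.getKey(), modernize.apply(e.getValue()));
        }
        synchronized (this) {
            // 3. Stop the world: replay writes made since the snapshot.
            for (Op op : opLog.subList(replayFrom, opLog.size())) {
                apply(fresh, op.id, op.doc == null ? null : modernize.apply(op.doc));
            }
            // 4. Swap to the new index; the old one becomes garbage.
            live = fresh;
        }
    }
}
```

The stop-the-world window only covers the replay and the swap, so its length depends on how many writes arrived during step 2, not on the size of the index.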

That said, I think there are benefits to your suggestion of decoupling major versions from backward compatibility; I would just use it to make it easier for us to do more frequent major versions without shortening our backward compatibility window, rather than to increase our backward compatibility window?

@markrmiller
Member Author

Thank you, Adrien, for your thoughtful response and for sharing your expertise on this topic. Your insights are valuable, and I'd like to address a few points and seek some clarification.

First, I want to emphasize that the two approaches we're discussing - relaxing the upgrade policy and implementing background reindexing - are not mutually exclusive. Both have merit and could potentially be implemented to serve different use cases and user needs.

Relaxed Upgrade Policy: This approach aims to reduce friction for upgrades by allowing them across multiple major versions when safe to do so.

Background Reindexing: This method, as you've outlined, provides a path for long-term index modernization and feature adoption.

I'd like to clarify that our original proposal isn't about extending the backward compatibility window. Rather, it's about allowing index upgrades as long as backward compatibility hasn't been broken - essentially making the upgrade check only as strict as necessary. This doesn't change any promises about the backward compatibility window itself.

Could you elaborate on your concerns about extending the backward compatibility window? While that's not our intention, understanding these concerns could be useful.

Given that these approaches serve different purposes and timeframes, I believe there's value in considering both:

  • The relaxed upgrade policy could provide immediate benefits with relatively low development and operational costs.
  • The background reindexing solution offers long-term benefits for feature adoption and index modernization, albeit with higher development and operational costs.

Implementing both could provide flexibility for users with different needs and resources. Users could benefit from easier upgrades in the short term while having a path to adopt new features when they're ready.

Questions

  • Could you share more about your concerns regarding the relaxed upgrade policy? Are there specific technical or operational issues you foresee?
  • Do you see any conflicts or problems with implementing both approaches?
  • Would you be open to a phased approach, where we implement the relaxed upgrade policy first and then work on tooling for background reindexing?
