-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove patching for doc blocks. #12741
Conversation
We are still keeping PFOR for positions only. This is a partial revert of apache#69 which brings back ForDeltaUtil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me in general. Can you also split Lucene90PostingsFormat
into a Lucene90PostingsFormat
that is read-only and a Lucene90RWPostingsFormat
that is only available for testing? You can check out Lucene95RWHnswVectorsFormat
for a recent example of how file formats get split into a read-only implementation and a test-only read-write implementation.
} | ||
|
||
/** Skip a sequence of 128 longs. */ | ||
void skip(DataInput in) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we don't need this method as it's only used for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, removed in latest commit.
if (bitsPerValue == 0) { | ||
prefixSumOfOnes(longs, base); | ||
} else { | ||
forUtil.decodeAndPrefixSum(bitsPerValue, in, base, longs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we inline this other method into this class? It's a bit awkward to have the prefix sum logic in ForUtil
rather than ForDeltaUtil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh maybe it's for convenience because this other class is generated and not this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its the convenience + otherwise we would have to duplicate about 650 lines of code from ForUtil
. (all the decode1 -> decode24)
lucene/core/src/java/org/apache/lucene/codecs/lucene99/PForUtil.java
Outdated
Show resolved
Hide resolved
Also: * Change to Changes.txt * Removal of dead code which was only used in unit tests * Removal of test code from PForUtil
Thanks for the suggestion, I added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @slow-J ! I left some minor comments about additional 90
-> 99
refactoring.
...kward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90PostingsFormat.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsFormat.java
Outdated
Show resolved
Hide resolved
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Thanks for tackling this / persisting @slow-J, especially the glorious fun experience of having to "bump" the Codec version ;) A nice rite-of-passage in this Lucene world! |
Thanks @mikemccand and yes, the codec version bump is the majority of this change :D |
For reference, I'm interested in taking advantage of the fact we're changing the codec anyway to look into other smaller changes, like switching tail postings from vints to group-varint, or better alignign blocks and skip lists so that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @slow-J -- what a big change this turned out to be.
I left some minor comments that can be resolve later. I think given how many files this is touching we should merge it sooner rather than later... I'll merge later today if there are no concerns otherwise.
* <li>SkipDatum --> DocSkip, DocFPSkip, <PosFPSkip, PosBlockOffset, PayLength?, | ||
* PayFPSkip?>?, ImpactLength, <CompetitiveFreqDelta, CompetitiveNormDelta?> | ||
* <sup>ImpactCount</sup>, SkipChildLevelPointer? | ||
* <li>PackedDocDeltaBlock, PackedFreqBlock --> {@link PackedInts PackedInts} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm maybe separate out these two to clarify that the PackedDocDeltaBlock
does not using patching, but the PackedFreqBlock
does?
* <dd><b>Frequencies and Skip Data</b> | ||
* <p>The .doc file contains the lists of documents which contain each term, along with the | ||
* frequency of the term in that document (except when frequencies are omitted: {@link | ||
* IndexOptions#DOCS}). It also saves skip data to the beginning of each packed or VInt block, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I had thought skip data was saved at the end of each term's postings? And, the skip data is not stored per block, but rather once for the entire postings list?
(This is a pre-existing issue -- we can fix it separately).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened PR for the 2 javadoc comments: #12776
* Change Postings back to using FOR in Lucene99PostingsFormat We are still keeping PFOR for positions only. This is a partial revert of #69 which brings back ForDeltaUtil. * fix merge commit * Add forgotten forDeltaUtil calls to reader * Addressing comments: adding Lucene90RWPostingsFormat + more Also: * Change to Changes.txt * Removal of dead code which was only used in unit tests * Removal of test code from PForUtil * Changes.txt edit in right place now * Apply suggestions from code review: `90 -> 99 refactoring` Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com> * Remove decodeTo32 from ForUtil and regenerate --------- Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Thanks Mike and all reviewers! |
Thank you @slow-J! |
Addressing the last comments from apache#12741
IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741
Clean-up from adding the Lucene99PostingsFormat in apache#12741 These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90
IndexDiskUsageAnalyzer and IndexDiskUsageAnalyzerTests, as well as CompletionFieldMapper, CompletionFieldMapperTests and CompletionStatsCacheTests need adjusting after apache/lucene#12741 , to refer to the latest postings format. KuromojiTokenizerFactory needs adjusting after apache/lucene#12390
Nightly benchmarks just caught up this change, it's no obvious that there is a speedup. |
FYI this great view could be easier to see the impact of changes in single day for all tasks. It seems some count tasks get a bit happy with little p-value. |
I think that it's a little hard to tell with 1 datapoint due to noise, it seems to be trending upwards in the |
Thanks both, I pushed an annotation, it should show up tomorrow. I hah high expectations based on preliminary results from #12696 (comment) where |
I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat). Other benchmark variables for transparency:
and I am still seeing a speed, although the |
Is this |
Should have specified, it's |
Addressing the last comments from #12741
Addressing the last comments from #12741
I wanted to give my $0.02 on this. I am not convinced that a 2% change on a benchmark warrants a 6.2k SLoC addition to such an important codebase. I think the differences in terms of performance between FOR and PFOR vary a lot across benchmarks and are heavily dependent on what your index looks like, how big it is. I would even argue that the space savings PFOR was bringing in (about 5%) might make a bigger difference in terms of performance depending on the size of the index and your hardware. |
We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.
Closes #12696
Starting this as a draft PR since creating the
Lucene99PostingsFormat
brings a lot of change.Also pending some more benchmarking.