Remove patching for doc blocks. #12741

Merged
merged 8 commits into from
Nov 6, 2023

Conversation

slow-J
Contributor

@slow-J slow-J commented Oct 31, 2023

We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.

Closes #12696

Starting this as a draft PR since creating the Lucene99PostingsFormat brings a lot of change.

Also pending some more benchmarking.

Contributor

@jpountz jpountz left a comment


It looks good to me in general. Can you also split Lucene90PostingsFormat into a Lucene90PostingsFormat that is read-only and a Lucene90RWPostingsFormat that is only available for testing? You can check out Lucene95RWHnswVectorsFormat for a recent example of how file formats get split into a read-only implementation and a test-only read-write implementation.

lucene/CHANGES.txt (outdated, resolved)
}

/** Skip a sequence of 128 longs. */
void skip(DataInput in) throws IOException {
Contributor

It looks like we don't need this method as it's only used for tests.

Contributor Author

Thanks, removed in latest commit.

if (bitsPerValue == 0) {
prefixSumOfOnes(longs, base);
} else {
forUtil.decodeAndPrefixSum(bitsPerValue, in, base, longs);
Contributor

Should we inline this other method into this class? It's a bit awkward to have the prefix sum logic in ForUtil rather than ForDeltaUtil?

Contributor

Oh maybe it's for convenience because this other class is generated and not this one?

Contributor Author

@slow-J slow-J Oct 31, 2023

I think it's the convenience; otherwise we would have to duplicate about 650 lines of code from ForUtil (all the decode1 -> decode24 methods).
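
For readers following the thread, here is a minimal sketch of the prefix-sum step being discussed. It is not the actual ForUtil/ForDeltaUtil code: the class name and the `prefixSum` helper are hypothetical, and it assumes that `bitsPerValue == 0` marks a block in which every delta equals 1, as the name `prefixSumOfOnes` suggests.

```java
import java.util.Arrays;

class DeltaPrefixSumSketch {

  /** Fast path (assumed meaning of bitsPerValue == 0): all deltas are 1. */
  static void prefixSumOfOnes(long[] longs, long base) {
    for (int i = 0; i < longs.length; i++) {
      longs[i] = base + i + 1;
    }
  }

  /** General path: turn already-decoded deltas into absolute values, in place. */
  static void prefixSum(long[] deltas, long base) {
    long sum = base;
    for (int i = 0; i < deltas.length; i++) {
      sum += deltas[i];
      deltas[i] = sum;
    }
  }

  public static void main(String[] args) {
    long[] deltas = {3, 1, 4, 1, 5};
    prefixSum(deltas, 10);
    System.out.println(Arrays.toString(deltas)); // [13, 14, 18, 19, 24]

    long[] dense = new long[4];
    prefixSumOfOnes(dense, 10);
    System.out.println(Arrays.toString(dense)); // [11, 12, 13, 14]
  }
}
```

In the real code the bit-unpacking and the prefix sum are fused in one generated method, which is presumably why moving the logic would mean duplicating the generated decode methods.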

Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil
@slow-J
Contributor Author

slow-J commented Oct 31, 2023

Thanks for the suggestion, I added Lucene90RWPostingsFormat in latest commit and made Lucene90PostingsFormat read-only.

Contributor

@gf2121 gf2121 left a comment

Thanks @slow-J ! I left some minor comments about additional 90 -> 99 refactoring.

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
@slow-J
Contributor Author

slow-J commented Nov 1, 2023

> Thanks @slow-J ! I left some minor comments about additional 90 -> 99 refactoring.

Thanks @gf2121 , committed all the suggestions.

@mikemccand
Member

Thanks for tackling this / persisting @slow-J, especially the glorious fun experience of having to "bump" the Codec version ;) A nice rite-of-passage in this Lucene world!

@slow-J
Contributor Author

slow-J commented Nov 2, 2023

Thanks @mikemccand and yes, the codec version bump is the majority of this change :D

@slow-J slow-J marked this pull request as ready for review November 2, 2023 12:05
@jpountz
Contributor

jpountz commented Nov 3, 2023

For reference, I'm interested in taking advantage of the fact that we're changing the codec anyway to look into other smaller changes, like switching tail postings from vints to group-varint, or better aligning blocks and skip lists so that BlockDocsEnum#advance doesn't need to check whether docBufferUpto == BLOCK_SIZE to decode a new block and could do it directly under the target > nextSkipDoc check.
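
As an illustration of the group-varint idea mentioned above, here is a minimal sketch under stated assumptions. It is not Lucene code: the class, method names, and byte layout (one flag byte holding four 2-bit lengths, followed by the little-endian bytes of each value) are hypothetical choices that show the general technique.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class GroupVarintSketch {

  /** Number of bytes (1..4) needed to hold v, treating v as unsigned. */
  static int numBytes(int v) {
    int bits = 32 - Integer.numberOfLeadingZeros(v);
    return Math.max(1, (bits + 7) / 8);
  }

  /** Encodes exactly four ints: one flag byte, then 1..4 bytes per value. */
  static byte[] encode(int[] four) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      flag |= (numBytes(four[i]) - 1) << (2 * i); // 2 bits per value
    }
    out.write(flag);
    for (int v : four) {
      int n = numBytes(v);
      for (int j = 0; j < n; j++) {
        out.write(v >>> (8 * j)); // low byte first
      }
    }
    return out.toByteArray();
  }

  /** Decodes four ints; lengths come from the flag byte, not per-byte checks. */
  static int[] decode(byte[] in) {
    int flag = in[0] & 0xFF;
    int pos = 1;
    int[] vals = new int[4];
    for (int i = 0; i < 4; i++) {
      int n = ((flag >>> (2 * i)) & 3) + 1;
      int v = 0;
      for (int j = 0; j < n; j++) {
        v |= (in[pos++] & 0xFF) << (8 * j);
      }
      vals[i] = v;
    }
    return vals;
  }

  public static void main(String[] args) {
    int[] tail = {1, 300, 70000, 5};
    byte[] enc = encode(tail);
    System.out.println(enc.length); // 8
    System.out.println(Arrays.toString(decode(enc))); // [1, 300, 70000, 5]
  }
}
```

The appeal over plain vints is usually decode speed rather than size: the length of every value is known up front from the flag byte, so the decoder avoids a continuation-bit test on each byte.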

Member

@mikemccand mikemccand left a comment

Thank you @slow-J -- what a big change this turned out to be.

I left some minor comments that can be resolved later. I think given how many files this is touching we should merge it sooner rather than later... I'll merge later today if there are no concerns otherwise.

* <li>SkipDatum --&gt; DocSkip, DocFPSkip, &lt;PosFPSkip, PosBlockOffset, PayLength?,
* PayFPSkip?&gt;?, ImpactLength, &lt;CompetitiveFreqDelta, CompetitiveNormDelta?&gt;
* <sup>ImpactCount</sup>, SkipChildLevelPointer?
* <li>PackedDocDeltaBlock, PackedFreqBlock --&gt; {@link PackedInts PackedInts}
Member

Hmm maybe separate out these two to clarify that the PackedDocDeltaBlock does not use patching, but the PackedFreqBlock does?
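
To make the patching distinction concrete, here is a minimal sketch of how the per-block bit width might be chosen with and without patching. It is not the ForUtil/PForUtil implementation: the class and method names are hypothetical, values are assumed non-negative, and the exception budget of 7 is an illustrative parameter.

```java
import java.util.Arrays;

class ForVsPforSketch {

  /** Bits needed to represent a non-negative value (0 for the value 0). */
  static int bitsRequired(long v) {
    return 64 - Long.numberOfLeadingZeros(v);
  }

  /** FOR: every value in the block is packed at the width of the largest one. */
  static int forBitsPerValue(long[] block) {
    long or = 0;
    for (long v : block) {
      or |= v;
    }
    return bitsRequired(or);
  }

  /**
   * PFOR-style: pick a width that fits all but the maxExceptions largest
   * values; those outliers would be stored separately as patches.
   */
  static int pforBitsPerValue(long[] block, int maxExceptions) {
    long[] sorted = block.clone();
    Arrays.sort(sorted);
    int idx = Math.max(0, sorted.length - 1 - maxExceptions);
    return bitsRequired(sorted[idx]);
  }

  public static void main(String[] args) {
    long[] block = new long[128];
    Arrays.fill(block, 5);
    block[17] = 1000; // a single outlier
    System.out.println(forBitsPerValue(block)); // 10
    System.out.println(pforBitsPerValue(block, 7)); // 3
  }
}
```

With one outlier in a block of 128, plain FOR spends 10 bits on every value while the patched variant spends 3 bits plus the cost of the patch. The trade-off is the extra patching work at decode time.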

* <dd><b>Frequencies and Skip Data</b>
* <p>The .doc file contains the lists of documents which contain each term, along with the
* frequency of the term in that document (except when frequencies are omitted: {@link
* IndexOptions#DOCS}). It also saves skip data to the beginning of each packed or VInt block,
Member

Huh, I had thought skip data was saved at the end of each term's postings? And, the skip data is not stored per block, but rather once for the entire postings list?

(This is a pre-existing issue -- we can fix it separately).

Contributor Author

Opened PR for the 2 javadoc comments: #12776

@mikemccand mikemccand merged commit 8ae598b into apache:main Nov 6, 2023
4 checks passed
mikemccand pushed a commit that referenced this pull request Nov 6, 2023
* Change Postings back to using FOR in Lucene99PostingsFormat

We are still keeping PFOR for positions only.
This is a partial revert of #69 which brings back ForDeltaUtil.

* fix merge commit

* Add forgotten forDeltaUtil calls to reader

* Addressing comments: adding Lucene90RWPostingsFormat + more

Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil

* Changes.txt edit in right place now

* Apply suggestions from code review: `90 -> 99 refactoring`

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>

* Remove decodeTo32 from ForUtil and regenerate

---------

Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
@slow-J slow-J deleted the remove_patch_postings branch November 6, 2023 16:58
@slow-J
Contributor Author

slow-J commented Nov 6, 2023

Thanks Mike and all reviewers!

@mikemccand
Member

Thank you @slow-J!

javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 7, 2023
IndexDiskUsageAnalyzer needs adjusting after apache/lucene#12741
slow-J added a commit to slow-J/lucene that referenced this pull request Nov 8, 2023
Clean-up from adding the Lucene99PostingsFormat in apache#12741
These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90
mikemccand pushed a commit that referenced this pull request Nov 8, 2023
…2781)

Clean-up from adding the Lucene99PostingsFormat in #12741
These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90
javanna added a commit to elastic/elasticsearch that referenced this pull request Nov 9, 2023
IndexDiskUsageAnalyzer and IndexDiskUsageAnalyzerTests, as well as CompletionFieldMapper, CompletionFieldMapperTests and CompletionStatsCacheTests need adjusting after apache/lucene#12741 , to refer to the latest postings format.
KuromojiTokenizerFactory needs adjusting after apache/lucene#12390
@jpountz
Contributor

jpountz commented Nov 10, 2023

Nightly benchmarks just caught up with this change; it's not obvious that there is a speedup.

@gf2121
Contributor

gf2121 commented Nov 10, 2023

FYI, this great view makes it easier to see the impact of a single day's changes across all tasks. It seems some count tasks got a bit faster, with small p-values.

@slow-J
Contributor Author

slow-J commented Nov 10, 2023

I think that it's a little hard to tell with 1 datapoint due to noise, it seems to be trending upwards in the BooleanQuery graphs, but I agree that it's not obvious that there is a noticeable speedup...

@jpountz
Contributor

jpountz commented Nov 13, 2023

Thanks both, I pushed an annotation; it should show up tomorrow. I had high expectations based on preliminary results from #12696 (comment) where AndHighMed had a reproducible 3-4% speedup, so I was expecting nightlies to show it too. @slow-J I'm curious if you had a chance to run benchmarks on this PR, did it also show a speedup?

@slow-J
Contributor Author

slow-J commented Nov 13, 2023

I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).

Other benchmark variables for transparency:

  • Java 19
  • EC2 instance: m5.12xlarge
  • Disabled JFR
                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
           BrowseMonthTaxoFacets        3.98      (6.7%)        3.92      (2.3%)   -1.7% (  -9% -    7%) 0.286
       BrowseDayOfYearSSDVFacets        3.29      (1.6%)        3.27      (1.1%)   -0.7% (  -3% -    2%) 0.126
            BrowseDateSSDVFacets        1.01      (4.8%)        1.00      (5.6%)   -0.6% ( -10% -   10%) 0.695
                        PKLookup      151.00      (2.5%)      150.26      (2.6%)   -0.5% (  -5% -    4%) 0.543
           BrowseMonthSSDVFacets        3.44      (1.6%)        3.43      (1.5%)   -0.5% (  -3% -    2%) 0.321
                         LowTerm      345.74      (3.0%)      344.39      (3.1%)   -0.4% (  -6% -    5%) 0.689
                        Wildcard      142.48      (1.9%)      142.00      (1.8%)   -0.3% (  -3% -    3%) 0.569
                         Prefix3     1056.96      (4.0%)     1054.07      (3.2%)   -0.3% (  -7% -    7%) 0.812
     BrowseRandomLabelSSDVFacets        2.55      (7.2%)        2.54      (7.2%)   -0.1% ( -13% -   15%) 0.975
       BrowseDayOfYearTaxoFacets        3.96      (0.7%)        3.96      (0.7%)   -0.0% (  -1% -    1%) 0.856
            BrowseDateTaxoFacets        3.94      (0.7%)        3.94      (0.7%)   -0.0% (  -1% -    1%) 0.912
     BrowseRandomLabelTaxoFacets        3.41      (0.8%)        3.41      (0.9%)   -0.0% (  -1% -    1%) 0.977
                          Fuzzy1       69.71      (0.9%)       69.74      (0.8%)    0.0% (  -1% -    1%) 0.902
                   OrHighNotHigh      205.50      (4.6%)      205.67      (5.3%)    0.1% (  -9% -   10%) 0.958
                          Fuzzy2       58.39      (0.8%)       58.44      (0.6%)    0.1% (  -1% -    1%) 0.688
                    OrHighNotLow      265.57      (5.4%)      266.05      (5.9%)    0.2% ( -10% -   12%) 0.921
            HighIntervalsOrdered        5.32      (4.5%)        5.33      (4.1%)    0.2% (  -8% -    9%) 0.879
            HighTermTitleBDVSort        7.81      (3.1%)        7.83      (2.7%)    0.3% (  -5% -    6%) 0.756
                         Respell       34.00      (1.3%)       34.10      (1.1%)    0.3% (  -2% -    2%) 0.451
            MedTermDayTaxoFacets       16.00      (3.8%)       16.04      (3.8%)    0.3% (  -7% -    8%) 0.801
                    OrHighNotMed      255.88      (5.0%)      257.07      (5.2%)    0.5% (  -9% -   11%) 0.774
        AndHighHighDayTaxoFacets        2.83      (3.5%)        2.84      (3.4%)    0.5% (  -6% -    7%) 0.658
                         MedTerm      374.43      (4.9%)      376.51      (5.6%)    0.6% (  -9% -   11%) 0.738
                        HighTerm      472.29      (4.8%)      474.92      (5.7%)    0.6% (  -9% -   11%) 0.738
                     MedSpanNear        5.01      (4.3%)        5.04      (4.6%)    0.6% (  -7% -    9%) 0.690
               HighTermMonthSort     2670.49      (4.1%)     2689.05      (3.1%)    0.7% (  -6% -    8%) 0.549
             LowIntervalsOrdered        6.65      (4.2%)        6.70      (4.3%)    0.7% (  -7% -    9%) 0.584
                      OrHighHigh       22.54      (1.2%)       22.71      (2.4%)    0.7% (  -2% -    4%) 0.204
          OrHighMedDayTaxoFacets        1.58      (4.6%)        1.59      (3.5%)    0.8% (  -6% -    9%) 0.523
                          IntNRQ       27.99      (2.4%)       28.25      (3.9%)    0.9% (  -5% -    7%) 0.371
                   OrNotHighHigh      244.47      (3.7%)      246.97      (4.5%)    1.0% (  -6% -    9%) 0.432
             MedIntervalsOrdered        3.14      (3.7%)        3.18      (3.9%)    1.0% (  -6% -    9%) 0.391
                       OrHighMed       43.19      (1.3%)       43.65      (1.7%)    1.1% (  -1% -    4%) 0.025
                    HighSpanNear        6.73      (2.7%)        6.80      (3.3%)    1.2% (  -4% -    7%) 0.223
                       LowPhrase      255.75      (2.1%)      259.11      (2.0%)    1.3% (  -2% -    5%) 0.043
           HighTermDayOfYearSort      253.38      (4.1%)      257.57      (2.2%)    1.7% (  -4% -    8%) 0.112
                     AndHighHigh       15.70      (1.1%)       15.98      (2.5%)    1.8% (  -1% -    5%) 0.004
                       MedPhrase       14.60      (2.4%)       14.89      (2.2%)    1.9% (  -2% -    6%) 0.009
                       OrHighLow      340.15      (2.3%)      346.84      (2.6%)    2.0% (  -2% -    6%) 0.010
                      HighPhrase      105.17      (2.3%)      107.28      (2.0%)    2.0% (  -2% -    6%) 0.004
                HighSloppyPhrase        1.97      (5.2%)        2.01      (3.6%)    2.0% (  -6% -   11%) 0.152
               HighTermTitleSort      118.65      (5.4%)      121.16      (2.4%)    2.1% (  -5% -   10%) 0.112
                 MedSloppyPhrase        2.73      (5.0%)        2.79      (3.1%)    2.2% (  -5% -   10%) 0.092
                 LowSloppyPhrase        9.48      (3.7%)        9.69      (1.7%)    2.3% (  -3% -    8%) 0.013
                      AndHighMed       47.32      (1.3%)       48.52      (2.8%)    2.5% (  -1% -    6%) 0.000
                     LowSpanNear       17.05      (2.0%)       17.49      (2.1%)    2.6% (  -1% -    6%) 0.000
                    OrNotHighMed      292.37      (3.0%)      300.28      (3.3%)    2.7% (  -3% -    9%) 0.007
         AndHighMedDayTaxoFacets       21.13      (1.8%)       21.74      (1.5%)    2.9% (   0% -    6%) 0.000
                      TermDTSort      124.22      (4.8%)      127.91      (1.3%)    3.0% (  -3% -    9%) 0.008
                      AndHighLow      477.93      (3.1%)      494.29      (2.5%)    3.4% (  -2% -    9%) 0.000
                    OrNotHighLow      424.91      (2.3%)      444.79      (2.8%)    4.7% (   0% -   10%) 0.000

I am still seeing a speedup, although the AndHighMed gain was 2.5%.
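
As a quick sanity check on how to read the "Pct diff" column, the sketch below recomputes it from the two QPS columns as a simple relative change. This is an assumption about the arithmetic, not luceneutil's actual code, and the helper name is hypothetical.

```java
import java.util.Locale;

class PctDiffSketch {

  /** Relative change of candidate over baseline, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // AndHighMed row: 47.32 -> 48.52
    System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(47.32, 48.52))); // 2.5%
    // OrNotHighLow row: 424.91 -> 444.79
    System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(424.91, 444.79))); // 4.7%
  }
}
```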

@mikemccand
Member

> I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).

Is this wikimediumall or wikibigall?

@slow-J
Contributor Author

slow-J commented Nov 13, 2023

> > I ran a new luceneutil benchmark on Saturday with my commit 8ae598b (using Lucene99PostingsFormat) as candidate and the commit's parent as baseline (using Lucene90PostingsFormat).
>
> Is this wikimediumall or wikibigall?

Should have specified, it's wikimediumall

@s1monw
Member

s1monw commented Dec 19, 2023

I wanted to give my $0.02 on this. I am not convinced that a 2% change on a benchmark warrants a 6.2k SLoC addition to such an important codebase. I think the differences in terms of performance between FOR and PFOR vary a lot across benchmarks and are heavily dependent on what your index looks like, how big it is. I would even argue that the space savings PFOR was bringing in (about 5%) might make a bigger difference in terms of performance depending on the size of the index and your hardware.
I don't wanna go that far and ask for a revert of this change but I think we need to look closer in the future if the rather questionable improvements warrant a change like this or if such a change should rather be an optional postings format rather than the default.

Successfully merging this pull request may close these issues.

Adding option to codec to disable patching in Lucene's PFOR encoding
5 participants