Reciprocal Rank Fusion (RRF) normalization technique in hybrid query #874

Johnsonisaacn · 2024-08-28T20:02:34Z

Description

Adds the ability to process and combine scores from multiple subqueries in neural search using the reciprocal rank fusion (RRF) technique. It is built with a new processor and processor factory class, separate from NormalizationProcessor. The API changes are included in the RFC. Weights are not currently supported when combining processed subquery scores, because of a lack of examples in existing literature.

Example usage of the RRF processor:

Create the index:

PUT /index-test
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      },
      "field1": {
        "type": "integer"
      }
    }
  }
}

Create a pipeline with the RRF processor and all defaults:

PUT /_search/pipeline/nlp-search-pipeline
{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                    }
                }
            }
        }
    ]
}

Ingest 4 documents:

POST /index-test/_doc/?refresh=true
{
    "field1": 2,
    "vector": [0.4, 0.5, 0.2],
    "title": "basic"
}

POST /index-test/_doc/?refresh=true
{
    "field1": 10,
    "vector": [0.2, 0.2, 0.3],
    "title": "java"
}

POST /index-test/_doc/?refresh=true
{
    "field1": 50,
    "vector": [4.2, 5.5, 8.9]
}

POST /index-test/_doc/?refresh=true
{
    "vector": [0.3, 0.12, 3.3],
    "title": "python"
}

Run the search request:

GET /index-test/_search?search_pipeline=nlp-search-pipeline
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                4.2,
                                5.0,
                                8.5
                            ],
                            "k": 10
                        }
                    }
                },
                {
                    "range": {
                        "field1": {
                            "gte": 10,
                            "lte": 50
                        }
                    }
                }
            ]
        }
    }
}

You'll get the following response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.032522473,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.032522473,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.03201844,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.016129032,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.015873017,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
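The scores in this response are consistent with the usual RRF formula, assuming a default rank constant of k = 60 and 1-based ranks within each subquery. The default value and the per-subquery ranks are inferred from the numbers above, not stated in this description. Documents are labeled by their `title`, or by `field1` where there is no title:

```latex
\begin{aligned}
\mathrm{score}(d) &= \sum_{q} \frac{1}{k + \mathrm{rank}_q(d)}, \qquad k = 60 \\
\text{field1} = 50 &: \tfrac{1}{60 + 1} + \tfrac{1}{60 + 2} \approx 0.032522 \\
\text{java} &: \tfrac{1}{60 + 4} + \tfrac{1}{60 + 1} \approx 0.032018 \\
\text{python} &: \tfrac{1}{60 + 2} \approx 0.016129 \\
\text{basic} &: \tfrac{1}{60 + 3} \approx 0.015873
\end{aligned}
```

Here the first term of each sum comes from the knn subquery and the second from the range subquery. Documents that match only the knn subquery receive a single term.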

If you change the rank constant to something smaller, such as 1, all scores are scaled up.
Update the rank constant:

PUT /_search/pipeline/nlp-search-pipeline
{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                        "rank_constant": 1
                    }
                }
            }
        }
    ]
}

The search response is then:

{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.8333334,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.8333334,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.7,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.33333334,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.25,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
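The effect of the rank constant can be reproduced outside the plugin. The following is a minimal sketch, not the code from this PR: `RrfSketch` and `fuse` are hypothetical names, and the per-subquery orderings are assumptions inferred from the scores in the responses.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {

    // Hypothetical helper (not the plugin's API): each document receives
    // 1 / (rankConstant + rank) for every subquery it appears in, with rank starting at 1.
    static Map<String, Double> fuse(List<List<String>> rankedSubqueryResults, int rankConstant) {
        Map<String, Double> fused = new LinkedHashMap<>();
        for (List<String> ranked : rankedSubqueryResults) {
            for (int i = 0; i < ranked.size(); i++) {
                double contribution = 1.0 / (rankConstant + i + 1);
                fused.merge(ranked.get(i), contribution, Double::sum);
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        // Assumed orderings per subquery, inferred from the example responses.
        List<String> knn = List.of("field1=50", "python", "basic", "java");
        List<String> range = List.of("java", "field1=50");
        for (int rankConstant : new int[] {60, 1}) {
            System.out.println("rank_constant=" + rankConstant + " -> " + fuse(List.of(knn, range), rankConstant));
        }
    }
}
```

A smaller rank constant makes each reciprocal larger and widens the gap between adjacent ranks, while the final ordering in this example stays the same.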

For comparison, this is the response for the same query when the normalization processor is used with its default techniques.

An important difference is that the delta between document scores is much smaller with RRF. RRF is based on document ranks, which are typically close in value, whereas the delta between raw scores can be huge.

{
    "took": 16,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 1.0,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.5005,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.0039931787,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 1.7192177E-4,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
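The score-based numbers above can be approximated with a short sketch. It rests on several assumptions that this PR does not state: that the default techniques are min-max normalization followed by an arithmetic mean, that a normalized score of 0 is lifted to a floor of 0.001, that the `l2` space type scores a hit as 1 / (1 + squared distance), and that the range query scores every match as 1.0. All class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MinMaxMeanSketch {

    // Assumption: a normalized score of exactly 0 is replaced by this small floor.
    static final double MIN_SCORE = 0.001;

    // Assumption: the l2 space type scores a hit as 1 / (1 + squared Euclidean distance).
    static double l2Score(double[] query, double[] doc) {
        double squared = 0.0;
        for (int i = 0; i < query.length; i++) {
            double diff = query[i] - doc[i];
            squared += diff * diff;
        }
        return 1.0 / (1.0 + squared);
    }

    // Min-max normalization of one subquery's scores; identical scores all map to 1.0.
    static Map<String, Double> minMax(Map<String, Double> raw) {
        double min = Collections.min(raw.values());
        double max = Collections.max(raw.values());
        Map<String, Double> normalized = new LinkedHashMap<>();
        raw.forEach((id, score) -> {
            double value = (max == min) ? 1.0 : (score - min) / (max - min);
            normalized.put(id, value == 0.0 ? MIN_SCORE : value);
        });
        return normalized;
    }

    // Arithmetic mean over all subqueries; a document missing from a subquery contributes 0.
    static Map<String, Double> mean(List<Map<String, Double>> perSubquery) {
        Map<String, Double> combined = new LinkedHashMap<>();
        for (Map<String, Double> scores : perSubquery) {
            scores.forEach((id, score) -> combined.merge(id, score, Double::sum));
        }
        combined.replaceAll((id, sum) -> sum / perSubquery.size());
        return combined;
    }

    static Map<String, Double> example() {
        double[] query = {4.2, 5.0, 8.5};
        Map<String, Double> knn = new LinkedHashMap<>();
        knn.put("field1=50", l2Score(query, new double[] {4.2, 5.5, 8.9}));
        knn.put("python", l2Score(query, new double[] {0.3, 0.12, 3.3}));
        knn.put("basic", l2Score(query, new double[] {0.4, 0.5, 0.2}));
        knn.put("java", l2Score(query, new double[] {0.2, 0.2, 0.3}));
        // Assumption: the range query gives every matching document the same score of 1.0.
        Map<String, Double> range = new LinkedHashMap<>();
        range.put("java", 1.0);
        range.put("field1=50", 1.0);
        return mean(List.of(minMax(knn), minMax(range)));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Under these assumptions the gap between the top knn hit and the rest is carried straight into the final scores, which is why the last two documents end up orders of magnitude below the first two.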

Related Issues

Resolves #[Issue number to be closed when this PR is merged]
#865
#659

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Signed-off-by: Isaac Johnson <114550967+Johnsonisaacn@users.noreply.github.com>

martin-gaievski · 2024-09-04T18:13:56Z

We should be merging to the feature branch https://github.com/opensearch-project/neural-search/tree/feature/rrf-score-normalization, not main.

...main/java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechnique.java

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

yuye-aws · 2024-10-09T07:39:15Z

...ain/java/org/opensearch/neuralsearch/processor/combination/RRFScoreCombinationTechnique.java

+    // Not currently using weights for RRF, no need to modify or verify these params
+    public RRFScoreCombinationTechnique(final Map<String, Object> params, final ScoreCombinationUtil combinationUtil) {
+        ;
+    }


Is this class not completed?

We planned a very simple implementation for this one. I'll be finishing this PR and will address all misses if

I finished the class, please take a look @yuye-aws

...java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechniqueTests.java

vibrantvarun · 2024-10-16T17:32:49Z

src/main/java/org/opensearch/neuralsearch/processor/normalization/ScoreNormalizationUtil.java

+ * Collection of utility methods for score combination technique classes
+ */
+@Log4j2
+class ScoreNormalizationUtil {


Can't we shift the code in this class to HybridQueryUtil?

I don't think it belongs there. My view is that anything related to the query itself should go to that class, like parsing a score collection into multiple subquery results.

...main/java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechnique.java

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

vibrantvarun · 2024-10-17T17:16:18Z

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

+
+/**
+ * DTO object to hold data required for score normalization passed to execute() function
+ * in NormalizationProcessorWorkflow. Field rankConstant


Suggested change

* in NormalizationProcessorWorkflow. Field rankConstant

* in NormalizationProcessorWorkflow.

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

src/main/java/org/opensearch/neuralsearch/processor/combination/ScoreCombinationUtil.java

src/main/java/org/opensearch/neuralsearch/processor/factory/RRFProcessorFactory.java

...main/java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechnique.java

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

vibrantvarun

Just one minor comment.

LGTM.

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

Co-authored-by: Varun Jain <varunudr@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>

...main/java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechnique.java

yuye-aws · 2024-10-18T07:03:34Z

src/main/java/org/opensearch/neuralsearch/processor/RRFProcessor.java

+ */
+@Log4j2
+@AllArgsConstructor
+public class RRFProcessor implements SearchPhaseResultsProcessor {


Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class does normalization and then combination.

...ava/org/opensearch/neuralsearch/processor/combination/RRFScoreCombinationTechniqueTests.java

yuye-aws · 2024-10-18T07:12:23Z

...ava/org/opensearch/neuralsearch/processor/combination/RRFScoreCombinationTechniqueTests.java

+    private float RRF(List<Float> scores, List<Double> weights) {
+        float sumScores = 0.0f;
+        for (float score : scores) {
+            sumScores += score;
+        }
+        return sumScores;
+    }


Why are you adding this method in this test? I think you can simply test with a few examples, like 1 plus 1 is 2.

It's added to be compatible with https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/processor/combination/BaseScoreCombinationTechniqueTests.java and to be able to use all the test cases it provides. We should ensure the best possible test coverage when it's low-hanging fruit.

I did not get your point. The private randomScore outputs non-deterministic results.

yuye-aws · 2024-10-18T07:16:33Z

src/test/java/org/opensearch/neuralsearch/query/NeuralQueryBuilderTests.java

+        assertEquals(
+            RescoreContext.getDefault().getOversampleFactor(),
+            neuralQueryBuilder.rescoreContext().getOversampleFactor(),
+            DELTA_FOR_FLOATS_ASSERTION
+        );


Since you are already using BigDecimal, please remove the delta here.

Not sure what you mean. The assert requires a third parameter when comparing floats, and both arguments are floats.

I mean you are using BigDecimal in the test rrfNorm method. You can be more strict and set the delta to 0.

...java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechniqueTests.java

martin-gaievski · 2024-10-18T15:59:08Z

Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class does normalization and then combination.

We do not retrieve the normalization technique from user input. It's hardcoded and passed to the processor class by the factory; check out this code snippet:
https://github.com/Johnsonisaacn/neural-search/blob/RRF/src/main/java/org/opensearch/neuralsearch/processor/factory/RRFProcessorFactory.java#L51-L69

I want to keep NormalizationProcessorWorkflow generic, and maybe later refactor it into a more abstract class that is not specific to normalization.
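To illustrate the point about the factory, here is a rough sketch of the idea, not the actual RRFProcessorFactory code. All types and names are hypothetical stand-ins, and the default of 60 is an assumption. Only `rank_constant` is read from the pipeline's combination parameters; both techniques are fixed inside the factory, so a mismatched pair cannot be configured.

```java
import java.util.Map;

public class RrfFactorySketch {

    // Hypothetical stand-ins for the plugin's technique types.
    interface NormalizationTechnique { String name(); }
    interface CombinationTechnique { String name(); }

    // Hypothetical processor that simply holds what the factory hands to it.
    static final class Processor {
        final NormalizationTechnique normalization;
        final CombinationTechnique combination;
        final int rankConstant;

        Processor(NormalizationTechnique normalization, CombinationTechnique combination, int rankConstant) {
            this.normalization = normalization;
            this.combination = combination;
            this.rankConstant = rankConstant;
        }
    }

    // Assumed default; the PR description does not state the value.
    static final int DEFAULT_RANK_CONSTANT = 60;

    // Reads only rank_constant from the user-supplied parameters (which may be null or empty).
    static Processor create(Map<String, Object> combinationParams) {
        Object raw = combinationParams == null ? null : combinationParams.get("rank_constant");
        int rankConstant = raw == null ? DEFAULT_RANK_CONSTANT : ((Number) raw).intValue();
        // Both techniques are hardcoded here rather than parsed from user input.
        return new Processor(() -> "rrf", () -> "rrf", rankConstant);
    }

    public static void main(String[] args) {
        Processor defaults = create(null);
        Processor custom = create(Map.of("rank_constant", 1));
        System.out.println(defaults.normalization.name() + " / " + defaults.combination.name()
            + ", rank constants: " + defaults.rankConstant + " and " + custom.rankConstant);
    }
}
```

With this shape, the workflow class can stay generic because it only ever receives a technique pair that the factory already guarantees to be consistent.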

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski · 2024-10-18T16:43:30Z

I've addressed all comments, and most of them were minor in recent reviews. I'll be merging this one to feature branch and we'll start one more related to RRF soon, with focus on testing

yuye-aws · 2024-10-19T03:13:14Z

Nice work @martin-gaievski

…pensearch-project#874) * initial commit of RRF Signed-off-by: Isaac Johnson <isaacnj@amazon.com> Co-authored-by: Varun Jain <varunudr@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com> Signed-off-by: Ryan Bogan <rbogan@amazon.com>

…874) * initial commit of RRF Signed-off-by: Isaac Johnson <isaacnj@amazon.com> Co-authored-by: Varun Jain <varunudr@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>

Isaac Johnson added 4 commits August 16, 2024 12:33

first commit to test

b3fe6c1

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

initial commit of RRF

55917e3

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

commit includes implementation and initial tests

7590532

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Rebasing from main

93e4778

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Johnsonisaacn force-pushed the RRF branch from 0f9170b to 93e4778 Compare August 28, 2024 20:15

Johnsonisaacn changed the title ~~Rrf~~ Implementing Reciprocal Rank Fusion (RRF) in Neural Search Aug 28, 2024

Johnsonisaacn marked this pull request as ready for review August 28, 2024 20:47

Johnsonisaacn requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners August 28, 2024 20:47

vibrantvarun changed the title ~~Implementing Reciprocal Rank Fusion (RRF) in Neural Search~~ Implementing Reciprocal Rank Fusion (RRF) Aug 28, 2024

vibrantvarun changed the title ~~Implementing Reciprocal Rank Fusion (RRF)~~ Reciprocal Rank Fusion (RRF) normalization technique in hybrid query Aug 28, 2024

Update CHANGELOG.md

632f2e0

Signed-off-by: Isaac Johnson <114550967+Johnsonisaacn@users.noreply.github.com>

martin-gaievski reviewed Sep 4, 2024

...main/java/org/opensearch/neuralsearch/processor/normalization/RRFNormalizationTechnique.java

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

Johnsonisaacn changed the base branch from main to feature/rrf-score-normalization September 4, 2024 22:24

Johnsonisaacn changed the base branch from feature/rrf-score-normalization to feature/rrf-score-normalization-v2 September 4, 2024 23:40

yuye-aws reviewed Oct 9, 2024

vibrantvarun reviewed Oct 16, 2024

Addresed comments, fixed unit tests

92a10c7

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski requested a review from vibrantvarun October 16, 2024 18:21

martin-gaievski added 2 commits October 16, 2024 18:17

Fixed npe when reading params from factory

f07b713

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

Added more unit tests for rrf factory

44cdc6f

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

vibrantvarun reviewed Oct 17, 2024

Fixed code comments and toString lombok annotations

a6237aa

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski requested a review from vibrantvarun October 17, 2024 18:44

vibrantvarun approved these changes Oct 17, 2024

src/main/java/org/opensearch/neuralsearch/processor/NormalizationExecuteDTO.java

Corrected class level comment

f6d5148

Co-authored-by: Varun Jain <varunudr@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>

yuye-aws reviewed Oct 18, 2024

Address more commnents - minor refactor in tests and classes

08c969a

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

martin-gaievski merged commit 245cd14 into opensearch-project:feature/rrf-score-normalization-v2 Oct 18, 2024
35 of 36 checks passed

ryanbogan mentioned this pull request Dec 13, 2024

[DOC] Adding RRF to hybrid search documentation opensearch-project/documentation-website#8956

Open

4 tasks

martin-gaievski mentioned this pull request Dec 16, 2024

[Feature to main] Explainability in hybrid query #1014

Merged

5 tasks
