Reciprocal Rank Fusion (RRF) normalization technique in hybrid query #874

@Johnsonisaacn Johnsonisaacn commented Aug 28, 2024

Description

Adds the ability to process and combine scores from multiple subqueries in neural search using the reciprocal rank fusion (RRF) technique. It is built with a new processor and processor factory class, separate from NormalizationProcessor. The API changes are described in the RFC. Weights are not currently supported when combining processed subquery scores, because the existing literature lacks examples.

Example of usage for RRF processor:

create index

PUT /index-test
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      },
      "field1": {
        "type": "integer"
      }
    }
  }
}

create pipeline with rrf processor and all defaults

PUT /_search/pipeline/nlp-search-pipeline
{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                    }
                }
            }
        }
    ]
}

ingest 4 documents

POST /index-test/_doc/?refresh=true
{
    "field1": 2,
    "vector": [0.4, 0.5, 0.2],
    "title": "basic"
}

POST /index-test/_doc/?refresh=true
{
    "field1": 10,
    "vector": [0.2, 0.2, 0.3],
    "title": "java"
}

POST /index-test/_doc/?refresh=true
{
    "field1": 50,
    "vector": [4.2, 5.5, 8.9]
}

POST /index-test/_doc/?refresh=true
{
    "vector": [0.3, 0.12, 3.3],
    "title": "python"
}

run search request

GET /index-test/_search?search_pipeline=nlp-search-pipeline
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "knn": {
                        "vector": {
                            "vector": [
                                4.2,
                                5.0,
                                8.5
                            ],
                            "k": 10
                        }
                    }
                },
                {
                    "range": {
                        "field1": {
                            "gte": 10,
                            "lte": 50
                        }
                    }
                }
            ]
        }
    }
}

You'll get the following response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.032522473,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.032522473,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.03201844,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.016129032,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.015873017,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
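
The scores in this response are consistent with each document receiving 1 / (rank_constant + rank) from every subquery that returned it, summed across subqueries, with a rank constant of 60 (the value the numbers above imply for "all defaults"). The following is a minimal Java sketch under those assumptions, not the plugin's implementation: `RrfSketch`, `fuse`, and the document labels are hypothetical names, and the two rankings are worked out from the example data (l2 distance to the query vector for the knn subquery, and an order for the range subquery that matches the returned scores).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; this is not the RRFProcessor from this PR. */
final class RrfSketch {

    /** Sums 1 / (rankConstant + rank) over every subquery ranking that contains the document. */
    static Map<String, Double> fuse(List<List<String>> rankings, int rankConstant) {
        Map<String, Double> fused = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                fused.merge(ranking.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        // Assumed rankings for the four example documents (labels are hypothetical).
        List<String> knn = List.of("field1=50", "python", "basic", "java");
        List<String> range = List.of("java", "field1=50");
        fuse(List.of(knn, range), 60)
            .forEach((doc, score) -> System.out.println(doc + " -> " + score));
    }
}
```

Under these assumptions the document with field1 = 50 gets 1/61 + 1/62 ≈ 0.03252, "java" gets 1/64 + 1/61 ≈ 0.03202, "python" gets 1/62 ≈ 0.01613, and "basic" gets 1/63 ≈ 0.01587, which agrees with the `_score` values above.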

If you change the rank constant to something smaller, such as 1, all scores are scaled up.
Update the rank constant:

PUT /_search/pipeline/nlp-search-pipeline
{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "score-ranker-processor": {
                "combination": {
                    "technique": "rrf",
                    "parameters": {
                        "rank_constant": 1
                    }
                }
            }
        }
    ]
}

and the search response is:

{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.8333334,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 0.8333334,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.7,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.33333334,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 0.25,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}
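
These scores can be checked by hand, assuming each subquery contributes 1 / (k + rank) for a document it returned, where k is the rank constant and ranks are 1-based. The ranks used below are inferred from the example data rather than stated in this PR: the knn subquery orders the documents as field1 = 50, "python", "basic", "java", and the range subquery orders its two matches as "java", field1 = 50.

```latex
\begin{align*}
\mathrm{score}(d) &= \sum_{q \,:\, d \in \mathrm{results}(q)} \frac{1}{k + \mathrm{rank}_q(d)}, \qquad k = 1 \\[4pt]
\text{field1} = 50: \quad & \frac{1}{1+1} + \frac{1}{1+2} = 0.8333\ldots \\
\text{java}: \quad & \frac{1}{1+4} + \frac{1}{1+1} = 0.7 \\
\text{python}: \quad & \frac{1}{1+2} = 0.3333\ldots \\
\text{basic}: \quad & \frac{1}{1+3} = 0.25
\end{align*}
```

With k = 1 the reciprocal terms are larger than with a larger rank constant, which is why every score is scaled up while the document order stays the same.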

For comparison, this is the response to the same query when the normalization processor is used with its default techniques.

The important difference is that the delta between document scores is much smaller with RRF. RRF scores are based on document ranks, which are typically close in value, whereas with raw scores the delta can be huge.

{
    "took": 16,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "fSgBmJIB3rlMI6kPNIQL",
                "_score": 1.0,
                "_source": {
                    "field1": 50,
                    "vector": [
                        4.2,
                        5.5,
                        8.9
                    ]
                }
            },
            {
                "_index": "index-test",
                "_id": "fCgBmJIB3rlMI6kPK4QS",
                "_score": 0.5005,
                "_source": {
                    "field1": 10,
                    "vector": [
                        0.2,
                        0.2,
                        0.3
                    ],
                    "title": "java"
                }
            },
            {
                "_index": "index-test",
                "_id": "figBmJIB3rlMI6kPPITH",
                "_score": 0.0039931787,
                "_source": {
                    "vector": [
                        0.3,
                        0.12,
                        3.3
                    ],
                    "title": "python"
                }
            },
            {
                "_index": "index-test",
                "_id": "eygBmJIB3rlMI6kPIYQm",
                "_score": 1.7192177E-4,
                "_source": {
                    "field1": 2,
                    "vector": [
                        0.4,
                        0.5,
                        0.2
                    ],
                    "title": "basic"
                }
            }
        ]
    }
}

Related Issues

Resolves #[Issue number to be closed when this PR is merged]
#865
#659

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Isaac Johnson added 4 commits August 16, 2024 12:33
Signed-off-by: Isaac Johnson <isaacnj@amazon.com>
Signed-off-by: Isaac Johnson <isaacnj@amazon.com>
Signed-off-by: Isaac Johnson <isaacnj@amazon.com>
Signed-off-by: Isaac Johnson <isaacnj@amazon.com>
@Johnsonisaacn Johnsonisaacn changed the title Rrf Implementing Reciprocal Rank Fusion (RRF) in Neural Search Aug 28, 2024
@Johnsonisaacn Johnsonisaacn marked this pull request as ready for review August 28, 2024 20:47
@vibrantvarun vibrantvarun changed the title Implementing Reciprocal Rank Fusion (RRF) in Neural Search Implementing Reciprocal Rank Fusion (RRF) Aug 28, 2024
@vibrantvarun vibrantvarun changed the title Implementing Reciprocal Rank Fusion (RRF) Reciprocal Rank Fusion (RRF) normalization technique in hybrid query Aug 28, 2024
Signed-off-by: Isaac Johnson <114550967+Johnsonisaacn@users.noreply.github.com>
@martin-gaievski

We should be merging into the feature branch https://github.com/opensearch-project/neural-search/tree/feature/rrf-score-normalization, not main.

@Johnsonisaacn Johnsonisaacn changed the base branch from main to feature/rrf-score-normalization September 4, 2024 22:24
@Johnsonisaacn Johnsonisaacn changed the base branch from feature/rrf-score-normalization to feature/rrf-score-normalization-v2 September 4, 2024 23:40
Comment on lines 22 to 25
// Not currently using weights for RRF, no need to modify or verify these params
public RRFScoreCombinationTechnique(final Map<String, Object> params, final ScoreCombinationUtil combinationUtil) {
;
}

Is this class not completed?


We planned to have a very simple implementation for this one. I'll be finishing this PR and addressing all misses if


I finished the class, please take a look @yuye-aws

* Collection of utility methods for score combination technique classes
*/
@Log4j2
class ScoreNormalizationUtil {

Can't we shift the code in this class to HybridQueryUtil?


I don't think it belongs there. In my view, anything related to the query itself should go in that class, such as parsing the score collection into multiple subquery results.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

/**
* DTO object to hold data required for score normalization passed to execute() function
* in NormalizationProcessorWorkflow. Field rankConstant

Suggested change
* in NormalizationProcessorWorkflow. Field rankConstant
* in NormalizationProcessorWorkflow.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

@vibrantvarun vibrantvarun left a comment


Just one minor comment.

LGTM.

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
*/
@Log4j2
@AllArgsConstructor
public class RRFProcessor implements SearchPhaseResultsProcessor {

Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class performs normalization and then combination.

Comment on lines +28 to +34
private float RRF(List<Float> scores, List<Double> weights) {
float sumScores = 0.0f;
for (float score : scores) {
sumScores += score;
}
return sumScores;
}

Why are you adding this method in the test? I think you can simply use a few examples, like 1 plus 1 is 2.


It's added to be compatible with https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/processor/combination/BaseScoreCombinationTechniqueTests.java and to be able to use all the test cases it provides. We should ensure the best possible test coverage when it's low-hanging fruit.


I did not get your point. The private randomScore method produces non-deterministic results.
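
One way to read the request for simpler examples is a fixed-input check of the summation, which needs no random scores. The sketch below is hypothetical: `RrfCombinationCheck` and `combine` are illustrative names, not classes from this PR, and it mirrors only the summation quoted in this thread. It does not use BaseScoreCombinationTechniqueTests.

```java
import java.util.List;

/** Illustrative sketch only; mirrors the summation helper quoted in the review thread. */
final class RrfCombinationCheck {

    /** Adds up the per-subquery scores; weights are left out because this RRF variant ignores them. */
    static float combine(List<Float> scores) {
        float sum = 0.0f;
        for (float score : scores) {
            sum += score;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Fixed inputs that are exactly representable as floats, so no delta is needed.
        check(combine(List.of(1.0f, 1.0f)), 2.0f);
        check(combine(List.of(0.5f, 0.25f)), 0.75f);
        check(combine(List.<Float>of()), 0.0f);
        System.out.println("all fixed-input checks passed");
    }

    private static void check(float actual, float expected) {
        if (Float.compare(actual, expected) != 0) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }
}
```

The inputs are chosen to be exact in binary floating point, so the comparison is deterministic and strict.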

Comment on lines +187 to +191
assertEquals(
RescoreContext.getDefault().getOversampleFactor(),
neuralQueryBuilder.rescoreContext().getOversampleFactor(),
DELTA_FOR_FLOATS_ASSERTION
);

Since you are already using BigDecimal, please remove the delta here.


Not sure what you mean. The assert requires a third parameter when we're comparing floats, and both arguments are floats.


I mean that you are using BigDecimal in the test's rrfNorm method. You can be stricter and set the delta to 0.

@martin-gaievski

Is there somewhere to validate that RRFNormalizationTechnique is used together with RRFScoreCombinationTechnique? The execute method in the NormalizationProcessorWorkflow class performs normalization and then combination.

We do not retrieve the normalization technique from user input; it's hardcoded and passed to the processor class by the factory. See this code snippet:
https://github.com/Johnsonisaacn/neural-search/blob/RRF/src/main/java/org/opensearch/neuralsearch/processor/factory/RRFProcessorFactory.java#L51-L69

I want to keep NormalizationProcessorWorkflow generic, and maybe later refactor it into a more abstract class that is not specific to normalization.

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski

I've addressed all comments; most of those in the recent reviews were minor. I'll be merging this one into the feature branch, and we'll start one more PR related to RRF soon, with a focus on testing.

@martin-gaievski martin-gaievski merged commit 245cd14 into opensearch-project:feature/rrf-score-normalization-v2 Oct 18, 2024
35 of 36 checks passed
@yuye-aws

Nice work @martin-gaievski

ryanbogan pushed a commit to ryanbogan/neural-search that referenced this pull request Nov 14, 2024
…pensearch-project#874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Ryan Bogan <rbogan@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Nov 18, 2024
…874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Nov 19, 2024
…874)


* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Nov 19, 2024
…874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
ryanbogan pushed a commit to ryanbogan/neural-search that referenced this pull request Nov 20, 2024
…pensearch-project#874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit to martin-gaievski/neural-search that referenced this pull request Nov 25, 2024
…pensearch-project#874)


* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit to martin-gaievski/neural-search that referenced this pull request Nov 25, 2024
…pensearch-project#874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Nov 26, 2024
…874)


* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Nov 26, 2024
…874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Dec 17, 2024
…874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski pushed a commit that referenced this pull request Dec 17, 2024
…874)

* initial commit of RRF

Signed-off-by: Isaac Johnson <isaacnj@amazon.com>

Co-authored-by: Varun Jain <varunudr@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
9 participants