[FEATURE] Aggregations to be supported with Hybrid Search #509

ankitas3 · 2023-11-03T09:13:31Z

Is your feature request related to a problem?
Currently, the aggregations returned from a hybrid query show a doc_count of 0 for every bucket.

{
  "Index": {
    "meta": {},
    "doc_count": 0,
    "model": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "index",
          "doc_count": 0
        }
      ]
    }
  }
}

What solution would you like?
Aggregation bucket counts should be evaluated and returned correctly based on the results of the hybrid query.

Similar Issue:

#422


vamshin · 2023-11-21T00:19:03Z

Thanks @ankitas3. Could you please explain the use case for this?

ankitas3 · 2023-11-24T05:22:15Z

@vamshin The use case is to filter the data returned by a hybrid query on any field available in the documents. Currently, every bucket returned has a count of 0.
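To make the request shape concrete, here is a minimal sketch of a search body that pairs a hybrid query with one terms aggregation per field to facet on. The helper name `hybrid_request_with_facets` and the field names (`title`, `content_embedding`, `category.keyword`) are assumptions for illustration only; they are not taken from the reporter's index. The sketch only builds the body and does not contact a cluster.

```python
import json
from typing import Dict, List


def hybrid_request_with_facets(
    sub_queries: List[dict],
    facet_fields: List[str],
    bucket_limit: int = 10,
) -> Dict[str, object]:
    """Build a search body: one hybrid query plus one terms aggregation per field."""
    aggs = {
        # Each aggregation is named after the field it buckets on.
        field: {"terms": {"field": field, "size": bucket_limit}}
        for field in facet_fields
    }
    return {"query": {"hybrid": {"queries": sub_queries}}, "aggs": aggs}


if __name__ == "__main__":
    body = hybrid_request_with_facets(
        sub_queries=[
            {"match": {"title": "mug"}},  # keyword sub-query (hypothetical field)
            {"neural": {"content_embedding": {  # hypothetical vector field
                "query_text": "mug", "model_id": "<model-id>", "k": 100}}},
        ],
        facet_fields=["category.keyword"],  # hypothetical keyword field
    )
    print(json.dumps(body, indent=2))
```

A body like this would be sent to `_search` together with a search pipeline, as in the `search_pipeline=nlp-search-pipeline` requests later in this thread.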

navneet1v · 2023-12-05T16:51:46Z

Moving this issue to neural search as the Hybrid Query clause belongs to Neural Search

binarymax · 2023-12-14T20:08:02Z

Hi! I'm confused by this issue being open and #422 being closed "as completed". Do aggs work properly with Hybrid search now?

navneet1v · 2023-12-14T21:36:25Z

@binarymax I closed #422 with a comment saying it is a duplicate of this issue. I could have kept the older issue open instead, but it seems I forgot to check the dates and closed the older one.

You can see my comment here: #422 (comment)

I updated the description of this issue to tag the older one.

Do aggs work properly with Hybrid search now?

The answer is no.

savanbthakkar · 2024-01-19T15:32:34Z

@vamshin @navneet1v
This seems to be a common use case. We index documents in OpenSearch. Each document has a title, a description, and the actual file content, which is large. AWS therefore recommends splitting the content into chunks, saving one record per chunk in OpenSearch, and duplicating the metadata on each chunk.
Now, to search with a keyword, we have to search the document's title, description, and content. The content search is a neural search and the title/description search is a keyword search, so we need a hybrid query with aggregations.
Please let us know if there is an ETA for this issue.
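The chunking layout described above can be sketched as follows. This is an illustration under stated assumptions, not the recommended AWS procedure: the function name `chunk_document`, the record field names, and the fixed-size character split are all hypothetical, and real pipelines often split on sentences or tokens instead.

```python
from typing import Dict, List


def chunk_document(doc: Dict[str, str], chunk_size: int = 200) -> List[Dict[str, object]]:
    """Split doc["content"] into fixed-size character chunks.

    Every record repeats the title and description so that keyword search
    and aggregations can run on any single chunk record.
    """
    content = doc["content"]
    records = []
    for index, start in enumerate(range(0, len(content), chunk_size)):
        records.append({
            "title": doc["title"],              # duplicated metadata
            "description": doc["description"],  # duplicated metadata
            "chunk_index": index,
            "content_chunk": content[start:start + chunk_size],
        })
    return records


if __name__ == "__main__":
    sample = {"title": "Report", "description": "Annual report", "content": "x" * 450}
    for record in chunk_document(sample):
        print(record["chunk_index"], len(record["content_chunk"]))
```

Because the metadata is repeated on every chunk, an aggregation over a metadata field counts chunk records, not original documents.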

hdhalter · 2024-03-06T00:19:39Z

Hi @vamshin, are there any documentation implications for this feature in 2.13? Thanks!

qmauret · 2024-03-29T13:44:48Z

Hello, can you confirm that it will be available in 2.13?

hdhalter · 2024-03-29T15:10:26Z

We have the documentation going out in 2.13: opensearch-project/documentation-website#6661.

vamshin · 2024-03-29T20:13:24Z

@qmauret Yes, it's going in 2.13.

navneet1v · 2024-04-01T21:48:43Z

@martin-gaievski can we close this issue now? The feature will be released in 2.13.

qmauret · 2024-05-29T13:20:29Z

Hi, it seems that there are still some things to look at regarding this issue.
I tried aggregations with hybrid search on 2.13 and got results that I could not explain.

I have an index of products which contains 16K products.

Let's focus on a specific category called "Mug" (id: 5)

GET product_1_bedrock/_search
{
  "size": 0,
  "query": {
    "match": {
      "category.id_category": 5
    }
  }
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 566,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

1. Basic match query with aggregates

GET product_1_bedrock/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "multi_match": {
      "query": "Mug",
      "type": "most_fields",
      "fields": [
        "category_name^2",
        "name^4"
      ]
    }
  }
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 591,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 566
        },
        {
          "key": "Mug take away",
          "doc_count": 18
        },
        {
          "key": "Coffret",
          "doc_count": 7
        }
      ]
    }
  }
}

2. Neural query with aggregates

GET product_1_bedrock/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "neural": {
      "search_description_v": {
        "query_text": "Mug",
        "model_id": "xxxxxx",
        "k": 200
      }
    }
  }
}

Result

{
  "took": 616,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2686,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 555
        },
        {
          "key": "Sweatshirt homme",
          "doc_count": 147
        },
        {
          "key": "Hoodie homme",
          "doc_count": 144
        },
        ... // + 89 other categories
      ]
    }
  }
}

3. Hybrid query with aggregates

GET product_1_bedrock/_search?search_pipeline=nlp-search-pipeline
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "search_description_v": {
              "query_text": "Mug",
              "model_id": "xxxx",
              "k": 200
            }
          }
        },
        {
          "multi_match": {
            "query": "Mug",
            "type": "most_fields",
            "fields": [
              "category_name^2",
              "name^4"
            ]
          }
        }
      ]
    }
  }
}

Result

{
  "took": 616,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2686,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 555
        },
        {
          "key": "Sweatshirt homme",
          "doc_count": 147
        },
        {
          "key": "Hoodie homme",
          "doc_count": 144
        },
        // ... + 89 other categories
      ]
    }
  }
}

4. Hybrid query without aggregates

GET product_1_bedrock/_search?search_pipeline=nlp-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "search_description_v": {
              "query_text": "Mug",
              "model_id": "xxxx",
              "k": 200,
              "_name": "neural_search"
            }
          }
        },
        {
          "multi_match": {
            "query": "Mug",
            "type": "most_fields",
            "fields": [
              "category_name^2",
              "name^4"
            ]
          }
        }
      ]
    }
  }
}

Result

{
  "took": 577,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 98,
      "relation": "eq"
    },
    "max_score": 0.2407519,
    "hits": []
  }
}

The hybrid query without aggregates returns a total of 98 results, while the same query with aggregates returns far more (2686).
My understanding is that the neural subquery is returning too many results, but why? Any idea, @navneet1v?
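One way to sanity-check responses like the ones above is to compare the sum of the bucket counts with `hits.total.value`. The sketch below does this for the keyword-only response from step 1, whose bucket list is complete in the report. The helper names are hypothetical, and the comparison only holds under the assumption that every matching document has exactly one value for the aggregated field and that the total is exact (`"relation": "eq"`).

```python
def bucket_doc_total(response: dict, agg_name: str) -> int:
    """Sum doc_count over the listed buckets plus the documents in unlisted buckets."""
    agg = response["aggregations"][agg_name]
    listed = sum(bucket["doc_count"] for bucket in agg["buckets"])
    return listed + agg.get("sum_other_doc_count", 0)


def hits_total(response: dict) -> int:
    """Return the total hit count reported by the search response."""
    return response["hits"]["total"]["value"]


# Trimmed copy of the step 1 (multi_match) response from the report.
keyword_response = {
    "hits": {"total": {"value": 591, "relation": "eq"}},
    "aggregations": {"categories": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {"key": "Mug", "doc_count": 566},
            {"key": "Mug take away", "doc_count": 18},
            {"key": "Coffret", "doc_count": 7},
        ],
    }},
}

# Both numbers are 591: the buckets cover exactly the matched documents.
print(bucket_doc_total(keyword_response, "categories"), hits_total(keyword_response))
```

Running the same check on the full hybrid responses would show whether the buckets were computed over the 2686 documents reported with aggregates or the 98 reported without; the bucket list is elided in the report, so that cannot be determined from the thread alone.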

martin-gaievski · 2024-05-29T21:22:24Z

@qmauret thank you for reporting the problem. I tried to replicate your steps locally, but with a small dataset of 10 docs everything works correctly for me.
I suspect it may be dataset dependent. Is there a chance you could try your scenario on a smaller subset of your 16K docs, so that the subset can be shared via GitHub?
Please also share your cluster and index configuration.

One more question about the hybrid query: why do you use size == 0? Typically it should return an empty list of hits even when there are matches.
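The two hybrid requests in the report differ in more than one way: the request with aggregates also sets `"size": 0`, while the request without aggregates uses the default size. A possible way to isolate which difference changes the hit total is to generate all four combinations from the same hybrid query. The sketch below only builds the request bodies; the function name `request_variants` and the labels are assumptions for illustration.

```python
import copy
from itertools import product
from typing import Dict


def request_variants(hybrid_query: dict, aggs: dict) -> Dict[str, dict]:
    """Return four search bodies covering aggs on/off and size 0/default."""
    variants = {}
    for with_aggs, size_zero in product((False, True), repeat=2):
        # Deep copies keep the variants independent of each other.
        body = {"query": copy.deepcopy(hybrid_query)}
        if with_aggs:
            body["aggs"] = copy.deepcopy(aggs)
        if size_zero:
            body["size"] = 0
        label = "aggs={},size={}".format(
            "yes" if with_aggs else "no",
            "0" if size_zero else "default",
        )
        variants[label] = body
    return variants


if __name__ == "__main__":
    query = {"hybrid": {"queries": [{"match": {"name": "Mug"}}]}}
    aggs = {"categories": {"terms": {"field": "category.name.keyword", "size": 100}}}
    for label, body in request_variants(query, aggs).items():
        print(label, sorted(body))
```

Sending each body with the same search pipeline and comparing `hits.total.value` across the four responses would show whether the jump from 98 to 2686 follows the aggregation or the size setting.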

ankitas3 added enhancement untriaged labels Nov 3, 2023

vamshin self-assigned this Nov 15, 2023

vamshin removed the untriaged label Nov 15, 2023

navneet1v transferred this issue from opensearch-project/k-NN Dec 5, 2023

github-actions bot added the untriaged label Dec 5, 2023

navneet1v added the Features label and removed the untriaged label Dec 5, 2023

navneet1v mentioned this issue Dec 5, 2023

[Feature] Support Aggregations with Hybrid Query #422

Closed

navneet1v assigned martin-gaievski and unassigned vamshin Dec 12, 2023

martin-gaievski mentioned this issue Jan 5, 2024

Refactor QueryCollectorContext to improve extensibility opensearch-project/OpenSearch#11778

Closed

8 tasks

vamshin moved this from Backlog (Hot) to 2.13.0 in Vector Search RoadMap Feb 7, 2024

martin-gaievski mentioned this issue Feb 14, 2024

[RFC] Aggregations and Hybrid query #604

Closed

This was referenced Mar 12, 2024

Adding aggregations in hybrid query #630

Merged

Adding integ tests for scenario of hybrid query with aggregations #632

Merged

vamshin added the v2.13.0 label Mar 13, 2024

vibrantvarun added v2.14.0 v2.13.0 and removed v2.13.0 v2.14.0 labels Mar 19, 2024

martin-gaievski closed this as completed Apr 1, 2024

github-project-automation bot moved this from 2.13.0 to ✅ Done in Vector Search RoadMap Apr 1, 2024
