[FEATURE] Aggregations to be supported with Hybrid Search #509

Closed
ankitas3 opened this issue Nov 3, 2023 · 13 comments
Labels: enhancement, Features, v2.13.0

Comments

@ankitas3

ankitas3 commented Nov 3, 2023

Is your feature request related to a problem?
Currently, the aggregations returned from a hybrid query show a doc_count of 0 for all buckets.

{
  "Index": {
    "meta": {},
    "doc_count": 0,
    "model": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "index",
          "doc_count": 0
        }
      ]
    }
  }
}

What solution would you like?
Aggregation bucket counts should be evaluated and returned correctly for hybrid queries.

Similar Issue:

#422

@vamshin
Member

vamshin commented Nov 21, 2023

Thanks @ankitas3. Could you please explain the use case for this?

@ankitas3
Author

@vamshin The use case is to filter the data returned from a hybrid query on any field available in the docs. Currently, all returned buckets have a count of 0.

@navneet1v
Collaborator

Moving this issue to neural search as the Hybrid Query clause belongs to Neural Search

navneet1v transferred this issue from opensearch-project/k-NN on Dec 5, 2023
navneet1v added the Features label and removed the untriaged label on Dec 5, 2023
navneet1v assigned martin-gaievski and unassigned vamshin on Dec 12, 2023

@binarymax

binarymax commented Dec 14, 2023

Hi! I'm confused by this issue being open and #422 being closed "as completed". Do aggs work properly with Hybrid search now?

@navneet1v
Collaborator

@binarymax I closed #422 with a comment saying it's a duplicate of this issue. I could have kept the older issue open, but it seems I forgot to check the dates and closed the older one.

You can see my comment here: #422 (comment)

I updated the description of this issue to tag the older one.

Do aggs work properly with Hybrid search now?

The answer is no.

@savanbthakkar

savanbthakkar commented Jan 19, 2024

@vamshin @navneet1v
This seems to be a common use case. We index documents in OpenSearch. Each document has a title, a description, and the actual file content, which is large. AWS recommends splitting the content into chunks, saving each chunk as a separate record in OpenSearch, and duplicating the metadata on each chunk.
To search by keyword, we have to search the document's title, description, and content. The content search is a neural search and the title/description search is a keyword search, so we need a hybrid query with aggregations.
Is there an ETA for this issue?

vamshin moved this from Backlog (Hot) to 2.13.0 in Vector Search RoadMap on Feb 7, 2024

@hdhalter

hdhalter commented Mar 6, 2024

Hi @vamshin, are there any documentation implications for this feature in 2.13? Thanks!

@qmauret

qmauret commented Mar 29, 2024

Hello, can you confirm that it will be available in 2.13?

@hdhalter

We have the documentation going out in 2.13: opensearch-project/documentation-website#6661.

@vamshin
Member

vamshin commented Mar 29, 2024

@qmauret Yes, it's going in 2.13.

@navneet1v
Collaborator

navneet1v commented Apr 1, 2024

@martin-gaievski can we close this issue now? The feature will be released in 2.13.

github-project-automation bot moved this from 2.13.0 to ✅ Done in Vector Search RoadMap on Apr 1, 2024

@qmauret

qmauret commented May 29, 2024

Hi, it seems that there are still some things to look at regarding this issue.
I tried aggregations with hybrid search on 2.13 and got results that I could not explain.

I have an index containing 16K products.

Let's focus on a specific category called "Mug" (id: 5)

GET product_1_bedrock/_search
{
  "size": 0,
  "query": {
    "match": {
      "category.id_category": 5
    }
  }
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 566,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

1. Basic match query with aggregates

GET product_1_bedrock/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "multi_match": {
      "query": "Mug",
      "type": "most_fields",
      "fields": [
        "category_name^2",
        "name^4"
      ]
    }
  }
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 591,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 566
        },
        {
          "key": "Mug take away",
          "doc_count": 18
        },
        {
          "key": "Coffret",
          "doc_count": 7
        }
      ]
    }
  }
}
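
As a sanity check on the keyword-only result above, the bucket counts can be added up and compared with hits.total.value: 566 + 18 + 7 = 591. The following is a minimal Python sketch of that check, not part of the original report. The helper name bucket_doc_total is hypothetical, and the comparison only makes sense under the assumption that every matching document has exactly one category.name.keyword value, which the thread does not confirm.

```python
def bucket_doc_total(response: dict, agg_name: str) -> int:
    """Sum doc_count over the returned buckets plus sum_other_doc_count.

    Hypothetical helper; assumes a terms aggregation on a single-valued field.
    """
    agg = response["aggregations"][agg_name]
    returned = sum(bucket["doc_count"] for bucket in agg["buckets"])
    return returned + agg.get("sum_other_doc_count", 0)


# Relevant parts of the keyword-only response, copied from the thread.
keyword_response = {
    "hits": {"total": {"value": 591, "relation": "eq"}},
    "aggregations": {
        "categories": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Mug", "doc_count": 566},
                {"key": "Mug take away", "doc_count": 18},
                {"key": "Coffret", "doc_count": 7},
            ],
        }
    },
}

bucket_total = bucket_doc_total(keyword_response, "categories")
hits_total = keyword_response["hits"]["total"]["value"]
print(bucket_total, hits_total)  # prints: 591 591
```

The same check cannot be completed for the neural and hybrid responses from the thread alone, because most of their buckets are elided.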

2. Neural query with aggregates

GET product_1_bedrock/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "neural": {
      "search_description_v": {
        "query_text": "Mug",
        "model_id": "xxxxxx",
        "k": 200
      }
    }
  }
}

Result

{
  "took": 616,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2686,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 555
        },
        {
          "key": "Sweatshirt homme",
          "doc_count": 147
        },
        {
          "key": "Hoodie homme",
          "doc_count": 144
        },
        ... // + 89 other categories
      ]
    }
  }
}

3. Hybrid query with aggregates

GET product_1_bedrock/_search?search_pipeline=nlp-search-pipeline
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.name.keyword",
        "size": 100
      }
    }
  },
  "size": 0,
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "search_description_v": {
              "query_text": "Mug",
              "model_id": "xxxx",
              "k": 200
            }
          }
        },
        {
          "multi_match": {
            "query": "Mug",
            "type": "most_fields",
            "fields": [
              "category_name^2",
              "name^4"
            ]
          }
        }
      ]
    }
  }
}

Result

{
  "took": 616,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2686,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Mug",
          "doc_count": 555
        },
        {
          "key": "Sweatshirt homme",
          "doc_count": 147
        },
        {
          "key": "Hoodie homme",
          "doc_count": 144
        },
        // ... + 89 other categories
      ]
    }
  }
}
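
Every field visible in this hybrid response, including took, hits.total.value and the three buckets shown, is identical to the neural-only response in step 2. A small Python sketch of that comparison follows. It uses only the values printed in the thread, leaves the 89 elided buckets out rather than guessing them, and the helper name diff_visible is hypothetical.

```python
# Values copied from the visible part of response 2 (neural only)
# and response 3 (hybrid with aggregations). Elided buckets are omitted.
neural_only = {
    "took": 616,
    "total_hits": 2686,
    "buckets": {"Mug": 555, "Sweatshirt homme": 147, "Hoodie homme": 144},
}
hybrid_with_aggs = {
    "took": 616,
    "total_hits": 2686,
    "buckets": {"Mug": 555, "Sweatshirt homme": 147, "Hoodie homme": 144},
}


def diff_visible(a: dict, b: dict) -> dict:
    """Return {field: (a_value, b_value)} for every visible field that differs."""
    diffs = {}
    for key in ("took", "total_hits"):
        if a[key] != b[key]:
            diffs[key] = (a[key], b[key])
    for name in sorted(set(a["buckets"]) | set(b["buckets"])):
        left, right = a["buckets"].get(name), b["buckets"].get(name)
        if left != right:
            diffs[f"buckets.{name}"] = (left, right)
    return diffs


print(diff_visible(neural_only, hybrid_with_aggs))  # prints: {}
```

An empty diff on the visible fields is consistent with the aggregation counts following the neural sub-query's matches, but it does not prove it, since the elided buckets are not compared.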

4. Hybrid query without aggregates

GET product_1_bedrock/_search?search_pipeline=nlp-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "neural": {
            "search_description_v": {
              "query_text": "Mug",
              "model_id": "xxxx",
              "k": 200,
              "_name": "neural_search"
            }
          }
        },
        {
          "multi_match": {
            "query": "Mug",
            "type": "most_fields",
            "fields": [
              "category_name^2",
              "name^4"
            ]
          }
        }
      ]
    }
  }
}

Result

{
  "took": 577,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 98,
      "relation": "eq"
    },
    "max_score": 0.2407519,
    "hits": []
  }
}

The hybrid query without aggregations returns a total of 98 results, while the same query with aggregations returns far more (2686).
My understanding is that the neural sub-query is returning too many results, but why? Any idea, @navneet1v?
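
One way to narrow this down is to send request bodies that share exactly the same hybrid clause and differ only in aggs and size, then compare hits.total.value across them. Requests 3 and 4 above also differ in size and in the _name parameter, so they are not yet a like-for-like comparison. The Python sketch below only builds the request bodies and makes no client call. The helper name build_body and the variant names are hypothetical, and the model_id placeholder is kept as it appears in the thread.

```python
import copy
import json
from typing import Optional

# Hybrid clause shared by every variant (taken from request 3 in the thread).
HYBRID_QUERY = {
    "hybrid": {
        "queries": [
            {
                "neural": {
                    "search_description_v": {
                        "query_text": "Mug",
                        "model_id": "xxxx",  # placeholder, as in the thread
                        "k": 200,
                    }
                }
            },
            {
                "multi_match": {
                    "query": "Mug",
                    "type": "most_fields",
                    "fields": ["category_name^2", "name^4"],
                }
            },
        ]
    }
}

CATEGORY_AGGS = {
    "categories": {"terms": {"field": "category.name.keyword", "size": 100}}
}


def build_body(with_aggs: bool, size: Optional[int] = None) -> dict:
    """Build a search body that varies only in aggs and size."""
    body = {"query": copy.deepcopy(HYBRID_QUERY)}
    if with_aggs:
        body["aggs"] = copy.deepcopy(CATEGORY_AGGS)
    if size is not None:
        body["size"] = size
    return body


variants = {
    "aggs_size_0": build_body(with_aggs=True, size=0),
    "aggs_default_size": build_body(with_aggs=True),
    "no_aggs_size_0": build_body(with_aggs=False, size=0),
    "no_aggs_default_size": build_body(with_aggs=False),
}

for name, body in variants.items():
    print(f"# {name}")
    print(json.dumps(body, indent=2))
```

Each printed body can be sent to product_1_bedrock/_search?search_pipeline=nlp-search-pipeline. If the total changes between the two size settings with aggs held constant, the difference is tied to size rather than to the aggregation itself.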

@martin-gaievski
Member

@qmauret Thank you for reporting the problem. I tried to replicate your steps locally, but with a small dataset of 10 docs everything works correctly for me.
I suspect it may be dataset dependent. Could you try your scenario on a smaller subset of your 16K docs, so that the subset can be shared via GitHub?
Please also share your cluster and index configuration.

One more question about the hybrid query: why do you use size == 0? Typically it should return an empty hits list even when there are matching documents.
