[Bug]: [new_indexes] The searched id after group by is not equal with the top1 id searched using the expression of "group_by_field_name==each unique value" when creating HNSW_FLAT/SQ/PQ/PRQ on float16/ bfloat16 vector #37135

binbinlv · 2024-10-25T04:16:26Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: master-20241024-0dbf9482-amd64
- Deployment mode(standalone or cluster): all
- MQ type(rocksmq, pulsar or kafka): both
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0rc99
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

With an HNSW_FLAT/SQ/PQ/PRQ index on a float16 or bfloat16 vector field, the id returned for each group by a group-by search does not match the top-1 id returned by a search filtered with the expression "group_by_field_name==<that group's value>".

[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR dismatch_item, top1_grpby_dis: 33.54237365722656, top1_expr_dis: 14.902473449707031 (test_mix_scenes.py:2746)
[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR  top1_dismatch_num: 15, results_num: 15, dismatch_rate: 1.0 (test_mix_scenes.py:2747)

Expected Behavior

For every group, the id returned by the group-by search should always equal the top-1 id returned by a search filtered with the expression "group_by_field_name==<that group's value>" when an HNSW_FLAT/SQ/PQ/PRQ index is built on a float16 or bfloat16 vector field.
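
As a rough illustration of the invariant described above, here is a minimal pure-Python sketch that uses exact brute-force L2 search instead of Milvus or HNSW. The helper names (`l2`, `grouped_top1`, `filtered_top1`) and the synthetic data are hypothetical and are not part of pymilvus or the test suite. With exact search, the best hit per group and the top-1 hit of the equivalent filtered search should coincide.

```python
import random

def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def grouped_top1(query, rows):
    """Exact 'group by' search: best hit per group as {group: (pk, distance)}."""
    best = {}
    for pk, group, vec in rows:
        d = l2(query, vec)
        if group not in best or d < best[group][1]:
            best[group] = (pk, d)
    return best

def filtered_top1(query, rows, group_value):
    """Exact top-1 among rows matching `group == group_value`."""
    hits = [(pk, l2(query, vec)) for pk, g, vec in rows if g == group_value]
    return min(hits, key=lambda hit: hit[1])

# Synthetic data (an assumption for illustration): 200 rows, 5 groups, dim 8.
random.seed(0)
rows = [(pk, pk % 5, [random.random() for _ in range(8)]) for pk in range(200)]
query = [random.random() for _ in range(8)]

grouped = grouped_top1(query, rows)
mismatches = sum(
    1 for g, (pk, _) in grouped.items() if filtered_top1(query, rows, g)[0] != pk
)
print(f"mismatch rate: {mismatches / len(grouped)}")
```

An approximate index such as HNSW may legitimately miss a few of these matches, which is presumably why the test below tolerates a mismatch rate of up to 0.2, but a rate of 1.0 as in the log above indicates that the two search paths disagree for every group.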

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L2)
    # @pytest.mark.parametrize("support_field", [DataType.INT8.name,  DataType.INT64.name,
    #                                            DataType.BOOL.name, DataType.VARCHAR.name])
    @pytest.mark.parametrize("support_field", [DataType.INT64.name])
    def test_search_group_by_supported_scalars_new_hnsw_index(self, support_field):
        """
        verify search group by works with supported scalar fields
        """
        nq = 1
        limit = 15
        for j in range(len(self.vector_fields)):
            j = 0  # overrides the loop index, so only self.vector_fields[0] is searched
            search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
            search_params = {"params": cf.get_search_params_params(self.index_types[j])}
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=support_field,
                                               output_fields=[support_field])[0]
            for i in range(nq):
                grpby_values = []
                dismatch = 0
                results_num = 2 if support_field == DataType.BOOL.name else limit
                for l in range(results_num):
                    top1 = res1[i][l]
                    top1_grpby_pk = top1.id
                    top1_grpby_value = top1.fields.get(support_field)
                    expr = f"{support_field}=={top1_grpby_value}"
                    if support_field == DataType.VARCHAR.name:
                        expr = f"{support_field}=='{top1_grpby_value}'"
                    grpby_values.append(top1_grpby_value)
                    res_tmp = self.collection_wrap.search(data=[search_vectors[i]], anns_field=self.vector_fields[j],
                                                          param=search_params, limit=1, expr=expr,
                                                          output_fields=[support_field])[0]
                    top1_expr_pk = res_tmp[0][0].id
                    if top1_grpby_pk != top1_expr_pk:
                        dismatch += 1
                        log.info(f"{support_field} on {self.vector_fields[j]} dismatch_item, top1_grpby_dis: {top1.distance}, top1_expr_dis: {res_tmp[0][0].distance}")
                log.info(f"{support_field} on {self.vector_fields[j]}  top1_dismatch_num: {dismatch}, results_num: {results_num}, dismatch_rate: {dismatch / results_num}")
                baseline = 1 if support_field == DataType.BOOL.name else 0.2    # skip baseline check for boolean
                assert dismatch / results_num <= baseline
                # verify no dup values of the group_by_field in results
                assert len(grpby_values) == len(set(grpby_values))

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&panes=%7B%226mJ%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22test-null-new-1-kwwuq.%2A%5C%22%7D%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1

Anything else?

collection_name : "TestGroupSearchNewHNSWIndex_badYC7EF"


binbinlv · 2024-10-25T04:18:15Z

This issue does not occur with FLOAT_VECTOR.

alexanderguzhva · 2024-10-28T23:08:06Z

@binbinlv zilliztech/knowhere#913 should fix the problem

xiaofan-luan · 2024-10-31T18:17:31Z

/assign @binbinlv

binbinlv · 2024-11-01T07:20:08Z

Working on verification.

binbinlv · 2024-11-01T15:59:35Z

Verified as fixed on the master branch.
pymilvus: 2.5.0rc106
milvus: master-20241101-0ac8b166

binbinlv · 2024-11-01T17:21:03Z

But this issue still exists on the 2.5 branch:

milvus: 2.5-20241101-116bf501-amd64

results:

[2024-11-02 01:19:28 - INFO - ci_test]: VARCHAR on BFLOAT16_VECTOR dismatch_item, top1_grpby_dis: 0.6354575157165527, top1_expr_dis: 0.7928284406661987 (test_mix_scenes.py:2775)
[2024-11-02 01:19:28 - INFO - ci_test]: VARCHAR on BFLOAT16_VECTOR  top1_dismatch_num: 15, results_num: 15, dismatch_rate: 1.0 (test_mix_scenes.py:2776)

alexanderguzhva · 2024-11-01T17:26:41Z

@binbinlv different knowhere version, maybe?

binbinlv · 2024-11-04T08:10:47Z

The 2.5 branch has been deleted and the fix has been verified on the master branch, so I am closing this issue.

binbinlv added the kind/bug and triage/accepted labels Oct 25, 2024

binbinlv added this to the 2.5.0 milestone Oct 25, 2024

binbinlv assigned alexanderguzhva, yanliang567 and liliu-z Oct 25, 2024

yanliang567 removed their assignment Oct 25, 2024

yanliang567 added the 2.5-features label Oct 28, 2024

alexanderguzhva mentioned this issue Oct 28, 2024

Fix group search zilliztech/knowhere#913

Merged

sre-ci-robot assigned binbinlv Oct 31, 2024

binbinlv closed this as completed Nov 4, 2024
