[Bug]: [new_indexes] The searched id after group by is not equal with the top1 id searched using the expression of "group_by_field_name==each unique value" when creating HNSW_FLAT/SQ/PQ/PRQ on float16/ bfloat16 vector #37135

Closed
1 task done
binbinlv opened this issue Oct 25, 2024 · 8 comments · Fixed by zilliztech/knowhere#913
Labels: 2.5-features, kind/bug, triage/accepted

Comments

@binbinlv
Contributor

binbinlv commented Oct 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20241024-0dbf9482-amd64
- Deployment mode (standalone or cluster): all
- MQ type (rocksmq, pulsar or kafka): both
- SDK version (e.g. pymilvus v2.0.0rc2): 2.5.0rc99
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

With an HNSW_FLAT/SQ/PQ/PRQ index built on a float16 or bfloat16 vector field, the id returned for each group by a group-by search does not match the top-1 id returned by a filtered search that uses the expression "group_by_field_name == <that group's value>".

[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR dismatch_item, top1_grpby_dis: 33.54237365722656, top1_expr_dis: 14.902473449707031 (test_mix_scenes.py:2746)
[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR  top1_dismatch_num: 15, results_num: 15, dismatch_rate: 1.0 (test_mix_scenes.py:2747)
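For anyone comparing runs, the two log lines above can be pulled apart with a small script. This is only a sketch: the regular expressions are written against the exact log format shown in this report, and `ITEM_RE`, `SUMMARY_RE`, and `BASELINE` are illustrative names, not part of the test suite. The 0.2 threshold is the non-boolean baseline used by the test in the reproduction steps.

```python
import re

# Log lines copied from this report.
LOG_LINES = [
    "[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR dismatch_item, "
    "top1_grpby_dis: 33.54237365722656, top1_expr_dis: 14.902473449707031 (test_mix_scenes.py:2746)",
    "[2024-10-25 12:10:00 - INFO - ci_test]: INT64 on FLOAT16_VECTOR  top1_dismatch_num: 15, "
    "results_num: 15, dismatch_rate: 1.0 (test_mix_scenes.py:2747)",
]

# Hypothetical patterns matching the two message shapes above.
ITEM_RE = re.compile(r"top1_grpby_dis: ([-\d.eE]+), top1_expr_dis: ([-\d.eE]+)")
SUMMARY_RE = re.compile(r"top1_dismatch_num: (\d+), results_num: (\d+), dismatch_rate: ([\d.]+)")

BASELINE = 0.2  # mismatch rate tolerated by the test for non-boolean fields

items = []      # (group-by distance, filtered-search distance) per mismatched group
summaries = []  # (mismatch count, result count, mismatch rate) per query

for line in LOG_LINES:
    item = ITEM_RE.search(line)
    summary = SUMMARY_RE.search(line)
    if item:
        items.append((float(item.group(1)), float(item.group(2))))
    elif summary:
        summaries.append((int(summary.group(1)), int(summary.group(2)), float(summary.group(3))))

for mismatched, total, rate in summaries:
    status = "FAIL" if rate > BASELINE else "ok"
    print(f"{mismatched}/{total} mismatched, rate={rate} -> {status}")
```

On the lines above this reports 15 of 15 groups mismatched, well over the 0.2 baseline.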

Expected Behavior

With an HNSW_FLAT/SQ/PQ/PRQ index built on a float16 or bfloat16 vector field, the id returned for each group by a group-by search should always equal the top-1 id returned by a filtered search that uses the expression "group_by_field_name == <that group's value>".
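The expected behavior can be stated as an invariant and checked against an exact (brute-force) search. The sketch below is not Milvus or pymilvus code: `grouped_top1` and `filtered_top1` are hypothetical helpers, the data is random, and squared L2 is assumed as the metric purely for illustration. With exact search the two ids agree for every group. An approximate index may legitimately miss a few, which is why the test below tolerates a mismatch rate of up to 0.2 rather than requiring zero.

```python
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def grouped_top1(query, rows):
    """Return {group_value: (row_id, distance)} with the nearest row of each group."""
    best = {}
    for row_id, group, vec in rows:
        d = l2(query, vec)
        if group not in best or d < best[group][1]:
            best[group] = (row_id, d)
    return best


def filtered_top1(query, rows, group_value):
    """Return (row_id, distance) of the nearest row whose group equals group_value."""
    candidates = [(row_id, l2(query, vec)) for row_id, group, vec in rows if group == group_value]
    return min(candidates, key=lambda item: item[1])


# Synthetic data: 200 rows of (id, group value, 8-dim vector), 5 distinct group values.
rng = random.Random(0)
rows = [(i, i % 5, [rng.random() for _ in range(8)]) for i in range(200)]
query = [rng.random() for _ in range(8)]

by_group = grouped_top1(query, rows)
mismatches = [g for g, hit in by_group.items() if hit[0] != filtered_top1(query, rows, g)[0]]
print(f"groups: {len(by_group)}, mismatches: {len(mismatches)}")  # groups: 5, mismatches: 0
```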

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L2)
    # @pytest.mark.parametrize("support_field", [DataType.INT8.name,  DataType.INT64.name,
    #                                            DataType.BOOL.name, DataType.VARCHAR.name])
    @pytest.mark.parametrize("support_field", [DataType.INT64.name])
    def test_search_group_by_supported_scalars_new_hnsw_index(self, support_field):
        """
        verify search group by works with supported scalar fields
        """
        nq = 1
        limit = 15
        for j in range(len(self.vector_fields)):
            # NOTE: j is reset to 0 here, so every iteration exercises only the first vector field and index
            j = 0
            search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
            search_params = {"params": cf.get_search_params_params(self.index_types[j])}
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=support_field,
                                               output_fields=[support_field])[0]
            for i in range(nq):
                grpby_values = []
                dismatch = 0
                results_num = 2 if support_field == DataType.BOOL.name else limit
                for rank in range(results_num):
                    top1 = res1[i][rank]
                    top1_grpby_pk = top1.id
                    top1_grpby_value = top1.fields.get(support_field)
                    expr = f"{support_field}=={top1_grpby_value}"
                    if support_field == DataType.VARCHAR.name:
                        expr = f"{support_field}=='{top1_grpby_value}'"
                    grpby_values.append(top1_grpby_value)
                    res_tmp = self.collection_wrap.search(data=[search_vectors[i]], anns_field=self.vector_fields[j],
                                                          param=search_params, limit=1, expr=expr,
                                                          output_fields=[support_field])[0]
                    top1_expr_pk = res_tmp[0][0].id
                    if top1_grpby_pk != top1_expr_pk:
                        dismatch += 1
                        log.info(f"{support_field} on {self.vector_fields[j]} dismatch_item, top1_grpby_dis: {top1.distance}, top1_expr_dis: {res_tmp[0][0].distance}")
                log.info(f"{support_field} on {self.vector_fields[j]}  top1_dismatch_num: {dismatch}, results_num: {results_num}, dismatch_rate: {dismatch / results_num}")
                baseline = 1 if support_field == DataType.BOOL.name else 0.2    # skip baseline check for boolean
                assert dismatch / results_num <= baseline
                # verify no dup values of the group_by_field in results
                assert len(grpby_values) == len(set(grpby_values))

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&panes=%7B%226mJ%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22test-null-new-1-kwwuq.%2A%5C%22%7D%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1

Anything else?

collection_name : "TestGroupSearchNewHNSWIndex_badYC7EF"

binbinlv added the kind/bug and triage/accepted labels Oct 25, 2024
binbinlv added this to the 2.5.0 milestone Oct 25, 2024
@binbinlv
Contributor Author

This issue does not occur with FLOAT_VECTOR fields.

@alexanderguzhva
Contributor

@binbinlv zilliztech/knowhere#913 should fix the problem

@xiaofan-luan
Collaborator

/assign @binbinlv

@binbinlv
Contributor Author

binbinlv commented Nov 1, 2024

Working on verification.

@binbinlv
Contributor Author

binbinlv commented Nov 1, 2024

Verified as fixed on the master branch.
pymilvus: 2.5.0rc106
milvus: master-20241101-0ac8b166

@binbinlv
Contributor Author

binbinlv commented Nov 1, 2024

However, the issue still reproduces on the 2.5 branch:

milvus: 2.5-20241101-116bf501-amd64

results:

[2024-11-02 01:19:28 - INFO - ci_test]: VARCHAR on BFLOAT16_VECTOR dismatch_item, top1_grpby_dis: 0.6354575157165527, top1_expr_dis: 0.7928284406661987 (test_mix_scenes.py:2775)
[2024-11-02 01:19:28 - INFO - ci_test]: VARCHAR on BFLOAT16_VECTOR  top1_dismatch_num: 15, results_num: 15, dismatch_rate: 1.0 (test_mix_scenes.py:2776)

@alexanderguzhva
Contributor

@binbinlv different knowhere version, maybe?

@binbinlv
Contributor Author

binbinlv commented Nov 4, 2024

The 2.5 branch has been deleted and the fix has been verified on the master branch, so I am closing this issue.

binbinlv closed this as completed Nov 4, 2024
5 participants