Question about quantizing term weights #982

kwang2049 · 2022-02-02T23:43:08Z

Dear Pyserini/Anserini contributors,

I have some questions about how Py/Anserini deals with term weights, especially considering quantization.

I found that without quantizing the term weights of the collection (the input to indexing), performance drops significantly (I tried uniCOIL and SPLADEv2). So:

Why does this happen (why does performance drop if one does not quantize)?
Does Py/Anserini only accept integer weights, and does it quantize the weights itself if the user has not done so?

Thanks in advance for your reply:)

Answered by MXueguang

MXueguang · 2022-02-03T00:00:19Z

Collaborator

Hi @kwang2049,
Take uniCOIL as an example:
the term weights generated by the model are floats, usually in the range 0-5.
Py/Anserini only accepts integer weights for the corpus, so we quantize the floats in the range 0-5 to integers in the range 0-255.
If you index the term weights without quantization, the floats are rounded directly to integers.
Integers in the range 0-5 lose a lot of information when describing a document, compared with integers in the range 0-255.
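
To make the loss of resolution concrete, here is a minimal sketch comparing direct rounding with scaling to 0-255 first. The linear mapping, the `quantize` helper, and the example weights are assumptions for illustration; the exact quantization used for uniCOIL or SPLADEv2 may differ.

```python
def quantize(weight: float, max_weight: float = 5.0, levels: int = 255) -> int:
    """Linearly map a float in [0, max_weight] to an integer in [0, levels].

    Assumed scheme for illustration only, not necessarily what uniCOIL uses.
    """
    clipped = min(max(weight, 0.0), max_weight)
    return round(clipped / max_weight * levels)


# Hypothetical term weights for one document (made-up values).
doc = {"lucene": 2.31, "index": 2.44, "term": 0.38, "weight": 0.49}

# Direct rounding: distinct weights collapse, and small ones vanish.
rounded = {term: round(w) for term, w in doc.items()}

# Scaling to 0-255 first keeps the terms distinguishable.
quantized = {term: quantize(w) for term, w in doc.items()}

print(rounded)    # {'lucene': 2, 'index': 2, 'term': 0, 'weight': 0}
print(quantized)  # {'lucene': 118, 'index': 124, 'term': 19, 'weight': 25}
```

With direct rounding, "lucene" and "index" become indistinguishable and "term" and "weight" drop to zero, which is the information loss described above.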

2 replies

kwang2049 Feb 3, 2022
Author

Thanks so much for your quick reply!

Now I understand the doc term weights in Py/Anserini.

And then what about the query term weights? Does Py/Anserini use float query term weights and integer doc term weights and compute a float score in the end?

It seems that one does not need to do quantization over query terms, according to the search method in the ImpactSearcher class.

MXueguang Feb 3, 2022
Collaborator

Yes. Since Lucene/Anserini supports float query weights, you can keep them as floats when using ImpactSearcher.
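
As a conceptual sketch only (not Pyserini's or Lucene's actual code), combining float query weights with integer document weights can be pictured as a dot product over shared terms, which yields a float score. The `impact_score` function and all weights below are hypothetical.

```python
from typing import Dict


def impact_score(query_weights: Dict[str, float], doc_weights: Dict[str, int]) -> float:
    """Sum query_weight * doc_weight over terms present in both (illustrative)."""
    return sum(
        q_weight * doc_weights[term]
        for term, q_weight in query_weights.items()
        if term in doc_weights
    )


# Hypothetical float query weights (left unquantized).
query = {"lucene": 1.5, "weight": 0.25, "missing": 2.0}

# Hypothetical integer document weights after quantization to 0-255.
doc = {"lucene": 118, "index": 124, "weight": 25}

score = impact_score(query, doc)
print(score)  # 183.25
```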
