Dear Pyserini/Anserini contributors, I have some questions about how Py/Anserini deals with term weights, especially with regard to quantization. I found that without quantizing the term weights of the collection (which is the input for indexing), performance drops significantly (I tried uniCOIL and SPLADEv2). So:
Thanks in advance for your reply :)
Replies: 1 comment 2 replies
Hi @kwang2049,
Take uniCOIL as an example.
The term weights generated by the model are floats, usually in the range 0-5.
Py/Anserini only accepts integer weights for the corpus, so we quantize the floats in the range 0-5 to integers in the range 0-255.
If the term weights are indexed without quantization, the floating-point values are rounded to integers directly.
Integers in 0-5 lose a lot of information when describing a document, compared with integers in 0-255.
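
To make the difference concrete, here is a minimal sketch in Python. It is not the actual Pyserini/Anserini code: the `quantize` helper, the linear scaling, and the example weights are all assumptions made up for illustration, and Python's built-in `round` only stands in for whatever rounding happens at indexing time.

```python
def quantize(weight, max_weight=5.0, levels=255):
    """Linearly map a float in [0, max_weight] to an integer in [0, levels].

    Hypothetical helper: the real uniCOIL/SPLADEv2 quantization may use a
    different scale factor or rounding mode.
    """
    clipped = min(max(weight, 0.0), max_weight)
    return round(clipped / max_weight * levels)


# Made-up term weights for a single document (not real model output).
weights = {"neural": 2.31, "retrieval": 2.48, "sparse": 0.42, "index": 0.38}

# No quantization: floats are rounded straight to integers.
rounded = {term: round(w) for term, w in weights.items()}

# Quantization: floats are first rescaled to 0-255, then rounded.
quantized = {term: quantize(w) for term, w in weights.items()}

print(rounded)    # {'neural': 2, 'retrieval': 2, 'sparse': 0, 'index': 0}
print(quantized)  # {'neural': 118, 'retrieval': 126, 'sparse': 21, 'index': 19}
```

With direct rounding, "neural" and "retrieval" become indistinguishable and the two low-weight terms drop to 0, so they no longer contribute anything to the score. After rescaling to 0-255, all four terms keep distinct, non-zero weights.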