
Why and how the same model for doc_embeddings and word_embeddings? #219

Open · Atharvalite opened this issue Apr 5, 2024 · 1 comment
BERT-based, or any transformer-based, models output contextualized embeddings, which works well for generating document embeddings. But to get word_embeddings the same model is used, and the array passed in is just a list of raw candidate words with no context. How will the word_embeddings hold any semantic meaning in that case?

The BaseEmbedder class provides the option to add a word_embedding model; however, in the "embed" method there is no way to differentiate between a list of documents and a list of words.

MaartenGr (Owner) commented Apr 5, 2024

It depends on several things, including the tokenization scheme and the training data, but in general these models are quite capable of creating word embeddings despite having no contextual information at inference time. As you might notice, especially when combined with MMR (which takes the relationships between words into account to a certain extent), this already produces quite good results.
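For illustration, here is a minimal numpy sketch of MMR-style candidate selection. The embeddings are hand-made toy vectors, not the output of a real model, and the helper names (`cosine_sim`, `mmr`) are ours for this sketch, not KeyBERT's API:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of two 2-D arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def mmr(doc_emb, cand_embs, top_n=2, diversity=0.7):
    """Maximal Marginal Relevance: trade off relevance to the document
    against redundancy with already-selected candidates."""
    doc_sims = cosine_sim(cand_embs, doc_emb[None, :]).ravel()
    cand_sims = cosine_sim(cand_embs, cand_embs)
    selected = [int(np.argmax(doc_sims))]  # most relevant candidate first
    while len(selected) < top_n:
        rest = [i for i in range(len(cand_embs)) if i not in selected]
        scores = [(1 - diversity) * doc_sims[i]
                  - diversity * cand_sims[i, selected].max()
                  for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected

# Toy vectors: candidates 0 and 1 are near-duplicates, candidate 2 differs.
doc = np.array([1.0, 0.0, 0.0])
cands = np.array([[0.90, 0.4, 0.0],
                  [0.95, 0.3, 0.0],
                  [0.70, 0.0, 0.7]])
print(mmr(doc, cands, top_n=2, diversity=0.7))  # [1, 2]: skips near-duplicate 0
```

With `diversity=0.7` the second pick is the dissimilar candidate 2 rather than the slightly more relevant near-duplicate 0, which is the redundancy penalty the reply refers to.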

The BaseEmbedder indeed started out with the additional option to pass a word embedding model, but since both models need to be in the same dimensional space to be comparable, this turned out to be something that could not easily be implemented. You can't really (or easily) compare the output embeddings of two different embedding models using distance functions. What has been on the list for a while is to extract the token embeddings from sentence-transformers before aggregation, but that again depends on the underlying model.
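To make the "token embeddings before aggregation" idea concrete, here is a toy numpy sketch of the mean pooling that many sentence-transformers models apply over token embeddings (the arrays are hand-made and `mean_pool` is an illustrative helper, not the library's API). Because the sentence embedding is just a masked average of the token embeddings, the individual token vectors live in the same space as the pooled vector, which is why extracting them pre-aggregation would yield word embeddings comparable to the document embedding:

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings while ignoring padding positions —
    the typical aggregation step in sentence-transformers models."""
    mask = attention_mask[:, None].astype(float)
    return (token_embs * mask).sum(axis=0) / mask.sum()

# Toy (seq_len=3, dim=2) token embeddings; the last position is padding.
tokens = np.array([[1.0, 0.0],
                   [3.0, 2.0],
                   [9.0, 9.0]])
mask = np.array([1, 1, 0])
sentence_emb = mean_pool(tokens, mask)
print(sentence_emb)  # [2. 1.] — same space as the per-token vectors
```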

Any suggestions for implementations are appreciated!
