
Why and how the same model for doc_embeddings and word_embeddings? #219

Open · Atharvalite opened this issue Apr 5, 2024 · 1 comment
BERT-based, or any transformer-based, models output contextualized embeddings, which works well for generating document embeddings. But to get word_embeddings the same model is used, and the array passed in is just a list of raw candidate words with no context. How will the word_embeddings hold any semantic meaning in that case?

The BaseEmbedder class provides the option to add a word_embedding model; however, in the "embed" method there is no way to differentiate between a list of documents and a list of words.

MaartenGr (Owner) commented Apr 5, 2024

It depends on several things, including the tokenization scheme and the training data, but in general these models are quite capable of creating word embeddings despite having no contextual information at inference time. As you might notice, especially when combined with MMR (which takes the relationships between words into account to a certain extent), this already produces quite good results.
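For illustration, here is a minimal numpy sketch of MMR-style candidate selection. The embeddings are hand-made toy vectors, not the output of a real model, and the helper names (`cosine_sim`, `mmr`) are ours for this sketch, not KeyBERT's API:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of two 2-D arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def mmr(doc_emb, cand_embs, top_n=2, diversity=0.7):
    """Maximal Marginal Relevance: trade off relevance to the document
    against redundancy with already-selected candidates."""
    doc_sims = cosine_sim(cand_embs, doc_emb[None, :]).ravel()
    cand_sims = cosine_sim(cand_embs, cand_embs)
    selected = [int(np.argmax(doc_sims))]  # most relevant candidate first
    while len(selected) < top_n:
        rest = [i for i in range(len(cand_embs)) if i not in selected]
        scores = [(1 - diversity) * doc_sims[i]
                  - diversity * cand_sims[i, selected].max()
                  for i in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected

# Toy vectors: candidates 0 and 1 are near-duplicates, candidate 2 differs.
doc = np.array([1.0, 0.0, 0.0])
cands = np.array([[0.90, 0.4, 0.0],
                  [0.95, 0.3, 0.0],
                  [0.70, 0.0, 0.7]])
print(mmr(doc, cands, top_n=2, diversity=0.7))  # [1, 2]: skips near-duplicate 0
```

With `diversity=0.7` the second pick is the dissimilar candidate 2 rather than the slightly more relevant near-duplicate 0, which is the redundancy penalty the reply refers to.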

The BaseEmbedder indeed started out with the additional option to pass a word embedding model, but since both models need to be in the same dimensional space to be comparable, this turned out to be something that could not easily be implemented. You can't really (or easily) compare the output embeddings of two different embedding models using distance functions. What has been on the list for a while is to extract the token embeddings from sentence-transformers before aggregation, but that again depends on the underlying model.
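To make the "token embeddings before aggregation" idea concrete, here is a toy numpy sketch of the mean pooling that many sentence-transformers models apply over token embeddings (the arrays are hand-made and `mean_pool` is an illustrative helper, not the library's API). Because the sentence embedding is just a masked average of the token embeddings, the individual token vectors live in the same space as the pooled vector, which is why extracting them pre-aggregation would yield word embeddings comparable to the document embedding:

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings while ignoring padding positions —
    the typical aggregation step in sentence-transformers models."""
    mask = attention_mask[:, None].astype(float)
    return (token_embs * mask).sum(axis=0) / mask.sum()

# Toy (seq_len=3, dim=2) token embeddings; the last position is padding.
tokens = np.array([[1.0, 0.0],
                   [3.0, 2.0],
                   [9.0, 9.0]])
mask = np.array([1, 1, 0])
sentence_emb = mean_pool(tokens, mask)
print(sentence_emb)  # [2. 1.] — same space as the per-token vectors
```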

Any suggestions for implementations are appreciated!
