-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add model jxm/cde-small-v1 #1521
base: main
Are you sure you want to change the base?
Conversation
Once this is approved I will clone the results repo within MTEB and add the generated results folder for this model and submit a PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding something, but it doesn't seem like you added a correct implementation or metadata on the model. These should be done before we merge the PR.
mteb/leaderboard/table.py
Outdated
@@ -101,7 +101,7 @@ def get_means_per_types(df: pd.DataFrame) -> pd.DataFrame: | |||
def failsafe_get_model_meta(model_name): | |||
try: | |||
return get_model_meta(model_name) | |||
except Exception as e: | |||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since your PR is not concerned with the leaderboard, you probably shouldn't put changes in it related to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe that was a result of running make lint, however I can leave that out.
mteb/models/cde-small-v1_model.py
Outdated
|
||
import mteb | ||
|
||
model = mteb.get_model( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand this correctly, but it seems like you did not add a model implementation or model metadata for CDEs. I'm also unsure whether this would work or not. I believe their official guide on how to use CDE is a bit more complicated than this, since they have a first and a sceond stage in all of their guides where they first produce a corpus embedding and then pass it along to the model when embedding new documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They have evaluation script, but it a bit complecated https://github.com/jxmorris12/cde/blob/0de4e6c116c8e8223075a2b56277d69e04a2ab7c/evaluate_mteb.py#L26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, but I guess it's a better choice still not to implement the model incorrectly here, and maybe just add metadata on it, then ask the CDE team to upload their results to the results repository.
I don't see too much value in adding a script here, that does not use CDEs as they are supposed to be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I added it evaluation script just for information and show author's implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't explicitly define the model metadata because when I ran the mteb.get_model_meta command, the output seemed correct. However, I may have misunderstood and overlooked the need to explicitly define the model metadata.
I also have the results repository from when I ran the script. Should I disregard that?
I'm a bit unsure about the next steps I should take. I would appreciate your guidance—thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your results don't seem to match the authors'. I think you should refer to the authors' implementation to see what's missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your feedback! I’d like to clarify and address the observed discrepancies to ensure alignment with expectations.
1. Comparison to Authors' Results
To better understand the differences in results, could you provide more details on how the comparisons were made? Specifically:
- What metrics were compared?
Were these metrics task-specific scores or overall aggregate scores from the MTEB evaluation? - Where are the original authors' results documented?
Is there a specific publication, repository, or script output that contains these results for reference? I tried looking around but didn't come across anything valuable, but maybe im just not looking for the right thing.
2. Analysis of Discrepancies
Upon reviewing the author's implementation in the evaluation script, here are the key differences I identified that likely contribute to the observed discrepancies:
-
Model Loading:
- Original Implementation: Loads a fine-tuned model checkpoint using
analyze_utils
, which may include task-specific adaptations and configurations. - My Implementation: Directly loads the model (
jxm/cde-small-v1
) viaSentenceTransformer
, missing the fine-tuned settings.
- Original Implementation: Loads a fine-tuned model checkpoint using
-
Dataset and Preprocessing:
- Original Implementation: Dynamically constructs datasets with task-specific sampling, applies prefixes (e.g.,
query_prefix
anddocument_prefix
), and uses transductive embeddings. - My Implementation: Uses a simplified random corpus (
random_strings_cde.txt
), which may not be representative or directly comparable.
- Original Implementation: Dynamically constructs datasets with task-specific sampling, applies prefixes (e.g.,
-
Prefixes and Embedding Normalization:
- Original Implementation: Includes task-specific query/document prefixes and normalizes embeddings for clustering tasks.
- My Implementation: Does not fully replicate these configurations, which can influence results.
-
Batch Size and Device Configurations:
- Original Implementation: Uses a large batch size (512) and optimized GPU settings (e.g.,
torch.autocast
). - My Implementation: Uses smaller batch sizes (8), which may introduce noise in embeddings. My computer doesn't have a GPU, so I don't think using such a large batch size is smart, I can possibly update the batch to 32 and see if thats feasible.
- Original Implementation: Uses a large batch size (512) and optimized GPU settings (e.g.,
-
Evaluation Settings:
- Original Implementation: Customizes evaluation splits (
dev
ortest
), chunk sizes, and additional configurations for specific tasks. - My Implementation: Generalizes these settings, potentially contributing to metric differences.
- Original Implementation: Customizes evaluation splits (
I wanted to get your thoughts on which of these differences are worth addressing and feasible to resolve. One idea I had was to simply copy and paste the evaluation script the author used and modify it to work without a GPU. However, at that point, does it make sense for me to implement the script myself, or would it be more practical to ask the author to upload their results directly?
Would love your perspective on the best way to proceed.
Thanks for all of your help and guidance, I really appreciate it!
Yash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- You can find the authors' results in their model card.
- Yes, there are some differences in implementation, but I believe the results can match.
@jxmorris12, could you help integrate your model into MTEB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! I can help integrate the model into MTEB in the new year. Your implementation doesn't use the task-specific embeddings, for one thing, which is known to decrease performance (see Table 2 of the paper). That said, your results look significantly less performant than even that number, so I suspect something else is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jxmorris12 Awesome, I look forward to working with you in the new year!
updated model.prompts for consistency with mteb
has results of evaluating CDE on tasks
results of running mteb tasks on cde
Checklist
make test
.make lint
.Adding datasets checklist
Reason for dataset addition: ...
mteb -m {model_name} -t {task_name}
command.sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test
.make lint
.Adding a model checklist
mteb.get_model(model_name, revision)
andmteb.get_model_meta(model_name, revision)