CiteME is a benchmark for testing the ability of language models to find the papers cited in scientific texts.
The hand-curated version of the dataset is available at citeme.ai.
It contains the following columns:

- `id`: A unique ID used in all our experiments to reference a specific paper.
- `excerpt`: The text excerpt describing the target paper.
- `target_paper_title`: The title of the paper described by the excerpt.
- `target_paper_url`: The URL of the paper described by the excerpt.
- `source_paper_title`: The title of the paper the excerpt was taken from.
- `source_paper_url`: The URL of the paper the excerpt was taken from.
- `year`: The year the source paper was published.
- `split`: Indicates whether the sample is from the `train` or `test` split.
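As an illustration of the schema, rows with these columns can be read with the standard `csv` module. The row below is a placeholder made up for this example, not a real dataset entry:

```python
import csv
import io

# A hypothetical one-row CSV in the schema described above
# (all values are placeholders, not real dataset entries).
sample = io.StringIO(
    "id,excerpt,target_paper_title,target_paper_url,"
    "source_paper_title,source_paper_url,year,split\n"
    '1,"An excerpt citing [CITATION].",Some Target Paper,'
    "https://example.org/target,Some Source Paper,"
    "https://example.org/source,2023,train\n"
)

rows = list(csv.DictReader(sample))
print(rows[0]["target_paper_title"], rows[0]["split"])
```

The same pattern applies to the downloaded `DATASET.csv`; each `csv.DictReader` row is a dict keyed by the column names listed above.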
CiteAgent requires the following environment variables to function properly:

- `S2_API_KEY`: Your Semantic Scholar API key
- `OPENAI_API_KEY`: Your OpenAI API key (for GPT-4 models)
- `ANTHROPIC_API_KEY`: Your Anthropic API key (for Claude models)
- `TOGETHER_API_KEY`: Your Together API key (for Llama models)
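A quick way to confirm the keys are set before a run is a small check like the one below. The variable names come from the list above; the check itself is not part of CiteAgent:

```python
import os

# Environment variables listed in the README above.
REQUIRED_VARS = ["S2_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY"]

# Collect any variables that are unset or empty.
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All API keys are set.")
```

If you only use one model family, only the corresponding key is needed for that run.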
- Install the required Python packages listed in `requirements.txt`:

  ```shell
  pip install -r requirements.txt
  ```

- Download the dataset from citeme.ai and place it in the project folder as `DATASET.csv`.

- Run the `main.py` file:

  ```shell
  python src/main.py
  ```
To modify the run parameters, open `src/main.py` and update the `metadata` dict.
To run different models, adjust the `model` entry (e.g. `gpt-4o`, `claude-3-opus-20240229`, or `meta-llama/Llama-3-70b-chat-hf`).
To run the agent without actions, change the executor from `LLMSelfAskAgentPydantic` to `LLMNoSearch` and adjust the `prompt_name` to a `*_no_search` prompt.
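Putting those settings together, the `metadata` dict might look roughly like this. Only the keys `model`, `executor`, and `prompt_name` are named in the text above; the exact shape and the `prompt_name` value here are assumptions:

```python
# Hypothetical sketch of the `metadata` dict in src/main.py;
# only the three keys are taken from the README, the values are examples.
metadata = {
    "model": "gpt-4o",                      # or "claude-3-opus-20240229",
                                            # "meta-llama/Llama-3-70b-chat-hf", ...
    "executor": "LLMSelfAskAgentPydantic",  # "LLMNoSearch" to run without actions
    "prompt_name": "default",               # pick a *_no_search prompt with LLMNoSearch
}
```

Check the actual dict in `src/main.py` for the full set of supported keys and values.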
```bibtex
@inproceedings{press2024citeme,
  title={Cite{ME}: Can Language Models Accurately Cite Scientific Claims?},
  author={Press, Ori and Hochlehnert, Andreas and Prabhu, Ameya and Udandarao, Vishaal and Press, Ofir and Bethge, Matthias},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}
```
Code: MIT. See `LICENSE`.

Dataset: CC-BY-4.0. See `LICENSE_DATASET`.