🪐MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
This is the official code and data repository for the paper: 🪐MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset.
The 🪐MARS benchmark and our best model checkpoints for the three tasks in 🪐MARS can be downloaded at this link.
Code for instructing ChatGPT to curate the 🪐MARS benchmark can be found in the benchmark_curation folder.
Code for evaluating language models on the 🪐MARS benchmark can be found in the evaluation folder.
Please use the BibTeX entry below to cite our paper:
@inproceedings{Wang2024MARSBT,
  title={MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset},
  author={Weiqi Wang and Yangqiu Song},
  year={2024},
  url={https://doi.org/10.48550/arXiv.2406.02106},
  doi={10.48550/arXiv.2406.02106}
}
The authors of this paper were supported by the NSFC Fund (U20B2053) from the National Natural Science Foundation of China, and by the RIF (R6020-19 and R6021-20) and the GRF (16211520 and 16205322) from the RGC of Hong Kong. We also gratefully acknowledge support from the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21, RMGS23CR05, RMGS23EG08).