Gabriele Sarti • Tommaso Caselli • Malvina Nissim • Arianna Bisazza
Abstract: Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely driven by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.
This repository contains scripts and notebooks associated with the paper "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses". If you use any of these materials in your work, we kindly ask you to cite our paper:
```bibtex
@article{sarti-etal-2024-rebus,
    title = "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses",
    author = "Sarti, Gabriele and Caselli, Tommaso and Nissim, Malvina and Bisazza, Arianna",
    journal = "ArXiv",
    month = jul,
    year = "2024",
    volume = "abs/2408.00584",
    url = "https://arxiv.org/abs/2408.00584",
}
```
All models and data used in this work are available in our 🤗 Hub Collection.
We provide a simple online demo to test the rebus-solving capabilities of our model. You can access it here.
To install the required dependencies, run the following commands:

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
To run the training and inference notebooks, you will need a machine with access to a GPU (required by Unsloth). The environment setup is performed in the first cell of the training notebooks.
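Before launching a notebook, it can be worth confirming that a CUDA GPU is actually visible to PyTorch (which Unsloth builds on). A minimal check:

```python
# Sanity check: Unsloth requires a CUDA-capable GPU.
import torch

if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected: the training/inference notebooks will not run.")
```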
| Model | Setup | Crossword Definitions | First Pass EM | Solution EM |
|---|---|---|---|---|
| LLaMA-3 70B | 5-shot prompt | 0.22 | 0.04 | 0.00 |
| Qwen-2 72B | 5-shot prompt | 0.28 | 0.04 | 0.00 |
| GPT-4o | 5-shot prompt | 0.55 | 0.15 | 0.11 |
| Claude-3.5 Sonnet | 5-shot prompt | 0.66 | 0.28 | 0.24 |
| Gemma-2 2B (ours) | fine-tuned | 0.78 | 0.43 | 0.36 |
| Phi-3 3.8B (ours) | fine-tuned | 0.84 | 0.56 | 0.51 |
| LLaMA-3.1 8B (ours) | fine-tuned | **0.85** | **0.59** | **0.56** |

Fine-grained verbalized rebus-solving performance of various LLMs. Bold denotes the best overall performance. See the paper for more details.
⚠️ Refer to the EurekaRebus dataset card for more information on the dataset and the data licensing.
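As a minimal sketch of programmatic access, the data can be loaded with the 🤗 `datasets` library. The Hub id below is an assumption based on the dataset name; check the collection and dataset card for the exact id and file layout:

```python
# Illustrative only: the Hub id is an assumption based on the dataset name;
# see the EurekaRebus dataset card for the actual repository and splits.
from datasets import load_dataset

dataset = load_dataset("gsarti/eureka-rebus")  # hypothetical Hub id
print(dataset)
```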
Run the following command to produce all data contained in the `eureka-rebus` folder from the `rebus.csv` file:
```shell
python scripts/process_data.py \
    --print_stats \
    --infer_punctuation \
    --generate_filtered_rebuses \
    --create_train_test_sets \
    --save_word_frequencies_train \
    --save_sharegpt_files
```
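To spot-check the generated training data, you can load one of the ShareGPT-format files produced by the script. The snippet below is a sketch: the filename and the `conversations`/`from`/`value` fields are assumptions based on the standard ShareGPT layout, so adjust them to the files `process_data.py` actually writes:

```python
# Sketch for inspecting a generated ShareGPT-style file. The path and field
# names are assumptions based on the common ShareGPT layout.
import json

with open("eureka-rebus/sharegpt_train.json") as f:  # hypothetical path
    examples = json.load(f)

for turn in examples[0]["conversations"]:
    print(f"{turn['from']}: {turn['value'][:100]}")
```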
Follow the notebooks `train_phi3_mini` and `train_llama3.1_8b` to fine-tune the models on the EurekaRebus dataset. Both models were trained on a single RTX 3090 GPU with 24GB of VRAM.
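For orientation, the notebooks follow the standard Unsloth + TRL supervised fine-tuning recipe. The sketch below is a condensed, illustrative version: the base checkpoint, LoRA settings, data path, and hyperparameters are assumptions rather than the exact values used in the notebooks:

```python
# Condensed Unsloth + TRL fine-tuning sketch. All values here are illustrative;
# see train_phi3_mini / train_llama3.1_8b for the actual configuration.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # assumed base checkpoint
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization keeps memory within a 24GB GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # assumed LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

# Hypothetical data path; assumes a pre-formatted "text" column.
dataset = load_dataset("json", data_files="eureka-rebus/sharegpt_train.json")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs/checkpoints",
    ),
)
trainer.train()
```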
To generate solutions for the rebuses in the test set, follow the instructions in the inference notebook. The `outputs` folder already contains parsed gold solutions and model predictions across saved training checkpoints for Phi-3 Mini and LLaMA-3.1 8B.
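Generation with a fine-tuned checkpoint follows the usual Unsloth pattern; the sketch below is illustrative, with a placeholder checkpoint path and prompt:

```python
# Illustrative generation sketch; the checkpoint path and prompt are placeholders.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoints/checkpoint-500",  # hypothetical local checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference mode

prompt = "..."  # a verbalized rebus, formatted as in training
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```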
Use the `prompt_models` script to generate solutions for the rebuses in the test set using prompted LLMs. The script requires the `guidance` library to be installed. The following command generates test set solutions with the Claude 3.5 Sonnet model:
```shell
python scripts/prompt_models.py \
    --model claude \
    --api_key YOUR_ANTHROPIC_API_KEY
```
Resulting files have the same format as the ones produced by the inference notebook and are saved in the `outputs/prompted_models` folder.
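A quick way to inspect these files is a plain pandas read; the filename and column names below are assumptions, so check the actual headers first:

```python
# Sketch for inspecting prediction files; the filename and columns are assumptions.
import pandas as pd

preds = pd.read_csv("outputs/prompted_models/claude_results.csv")  # hypothetical filename
print(preds.columns.tolist())
print(preds.head())
```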
Note that for the `guidance` library we used `guidance==0.1.14` for TogetherAI models (LLaMA, Qwen) and `guidance==0.1.15` for proprietary models (Claude, GPT-4o). We don't guarantee that other versions will work as expected.
For a new file of parsed model predictions produced with the inference notebook, you can compute all metrics from the paper with the following command:
```shell
python scripts/evaluate.py \
    --predicted_outputs outputs/phi3_mini/phi3_mini_results_step_500.csv \
    --gold_outputs outputs/test_gold_id_ood.csv \
    --word_frequencies outputs/word_frequencies_paisa.json \
    --word_frequencies_fp_train eureka-rebus/word_frequencies_fp_train.json \
    --word_frequencies_solution_train eureka-rebus/word_frequencies_solution_train.json \
    --do_corrs
```
If you use the `do_corrs` option in the command above, you should first unzip the `word_frequencies_paisa.json.zip` file in the `outputs` folder.
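For intuition, First Pass EM rewards an exact match on the word sequence produced from the crossword definitions, while Solution EM requires the final rebus solution to match exactly. A minimal exact-match sketch, assuming hypothetical column names (the real `scripts/evaluate.py` additionally handles parsing, normalization, and the `do_corrs` correlation analyses):

```python
# Minimal exact-match sketch. Column names are assumptions; the real
# scripts/evaluate.py handles parsing, normalization, and correlations.
import pandas as pd

def exact_match(pred: pd.Series, gold: pd.Series) -> float:
    """Fraction of rows where the normalized prediction equals the gold string."""
    normalize = lambda s: s.fillna("").str.strip().str.lower()
    return float((normalize(pred) == normalize(gold)).mean())

preds = pd.read_csv("outputs/phi3_mini/phi3_mini_results_step_500.csv")
gold = pd.read_csv("outputs/test_gold_id_ood.csv")

print("First Pass EM:", exact_match(preds["first_pass"], gold["first_pass"]))  # hypothetical columns
print("Solution EM:", exact_match(preds["solution"], gold["solution"]))
```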
We would like to thank the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for their valuable work in keeping the Eureka5 collection up to date and openly accessible.