LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have limitations of their own. This interactive tool engages users in a quiz game built on popular LLM benchmarks, offering an insightful way to explore those benchmarks and understand their constraints through gameplay.
The primary goal of this tool is to provide a hands-on experience that allows users to not only test their knowledge but also gain a deeper understanding of the challenges and limitations associated with LLM benchmarks. By participating in the quiz game, users can appreciate the nuances involved in evaluating LLMs and how well these models perform on diverse tasks.
The chosen benchmarks are the ones prominently used for evaluating LLMs on the Open LLM Leaderboard. Here's a brief overview of the benchmarks included (a sketch of how quiz questions can be drawn from one of them follows the list):
- ARC: A set of grade-school science questions.
- HellaSwag: A test of commonsense inference, challenging for state-of-the-art models despite being easy for humans (~95% accuracy).
- MMLU: A multitask accuracy test covering 57 diverse tasks, including mathematics, US history, computer science, law, and more.
- TruthfulQA: A test to measure a model's tendency to reproduce falsehoods commonly found online.
- WinoGrande: An adversarial Winograd benchmark at scale, focusing on commonsense reasoning.
- GSM8k: Diverse grade school math word problems to assess a model's ability to solve multi-step mathematical reasoning problems.
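All of these benchmarks are publicly available. As a rough illustration only (not the app's actual loading code, which may differ), the snippet below shows one way a single quiz question could be pulled from the public ARC dataset using the Hugging Face `datasets` library; the dataset id `allenai/ai2_arc`, the `ARC-Challenge` configuration, and the field names are those of the public release.

```python
# Illustrative sketch: draw one ARC item to use as a quiz question.
# Assumes the `datasets` library is installed and the public
# "allenai/ai2_arc" dataset is reachable; this is not the app's own code.
import random

from datasets import load_dataset


def sample_arc_question():
    """Pick a random ARC-Challenge item and return (question, labels, texts, answer_key)."""
    arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
    item = arc[random.randrange(len(arc))]
    return (
        item["question"],
        item["choices"]["label"],
        item["choices"]["text"],
        item["answerKey"],
    )


if __name__ == "__main__":
    question, labels, texts, answer = sample_arc_question()
    print(question)
    for label, text in zip(labels, texts):
        print(f"  {label}. {text}")
    print("Correct answer:", answer)
```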
There are two ways to play:
- Hosted: go to https://play-with-llm-benchmarks.streamlit.app/ for the full experience.
- Local: clone the repo and run `streamlit run Main.py`.
Either way, enjoy the quiz game based on the selected benchmarks: answer questions drawn from them and measure your own performance.
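For a sense of what a question looks like in the app, here is a minimal, self-contained Streamlit sketch of a single quiz question. It is not the project's `Main.py`: the file name `quiz_sketch.py`, the hard-coded question, and the layout are illustrative assumptions, whereas the real app draws its items from the benchmark datasets listed above.

```python
# quiz_sketch.py -- minimal illustration of rendering one quiz question in Streamlit.
# NOT the project's Main.py; the question below is hard-coded for brevity.
import streamlit as st

QUESTION = "Which gas do plants primarily absorb for photosynthesis?"
CHOICES = {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Hydrogen"}
ANSWER = "B"

st.title("Play with LLM Benchmarks")
st.write(QUESTION)

# Radio buttons show "A. Oxygen", "B. Carbon dioxide", etc.
choice = st.radio(
    "Your answer:",
    options=list(CHOICES),
    format_func=lambda key: f"{key}. {CHOICES[key]}",
)

if st.button("Submit"):
    if choice == ANSWER:
        st.success("Correct!")
    else:
        st.error(f"Incorrect. The right answer was {ANSWER}. {CHOICES[ANSWER]}")
```

Running `streamlit run quiz_sketch.py` serves this single-question page locally, the same way `streamlit run Main.py` launches the full game.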
Feel free to contribute, report issues, or suggest improvements to enhance the overall experience. Happy quizzing!