LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have limitations of their own. This interactive tool engages users in a quiz game built on popular LLM benchmarks, offering an insightful way to explore those benchmarks and understand their constraints through gameplay.
The primary goal of this tool is to provide a hands-on experience that allows users to not only test their knowledge but also gain a deeper understanding of the challenges and limitations associated with LLM benchmarks. By participating in the quiz game, users can appreciate the nuances involved in evaluating LLMs and how well these models perform on diverse tasks.
The chosen benchmarks are the ones prominently used for evaluating LLMs on the Open LLM Leaderboard. Here's a brief overview of the benchmarks included (a sketch of how quiz questions can be drawn from one of them follows the list):
- ARC: A set of grade-school science questions.
- HellaSwag: A test of commonsense inference, challenging for state-of-the-art models despite being easy for humans (~95% accuracy).
- MMLU: A multitask accuracy test covering 57 diverse tasks, including mathematics, US history, computer science, law, and more.
- TruthfulQA: A test to measure a model's tendency to reproduce falsehoods commonly found online.
- WinoGrande: An adversarial Winograd benchmark at scale, focusing on commonsense reasoning.
- GSM8k: Diverse grade school math word problems to assess a model's ability to solve multi-step mathematical reasoning problems.
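All of these benchmarks are publicly available. As a rough illustration only (not the app's actual loading code, which may differ), the snippet below shows one way a single quiz question could be pulled from the public ARC dataset using the Hugging Face `datasets` library; the dataset id `allenai/ai2_arc`, the `ARC-Challenge` configuration, and the field names are those of the public release.

```python
# Illustrative sketch: draw one ARC item to use as a quiz question.
# Assumes the `datasets` library is installed and the public
# "allenai/ai2_arc" dataset is reachable; this is not the app's own code.
import random

from datasets import load_dataset


def sample_arc_question():
    """Pick a random ARC-Challenge item and return (question, labels, texts, answer_key)."""
    arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
    item = arc[random.randrange(len(arc))]
    return (
        item["question"],
        item["choices"]["label"],
        item["choices"]["text"],
        item["answerKey"],
    )


if __name__ == "__main__":
    question, labels, texts, answer = sample_arc_question()
    print(question)
    for label, text in zip(labels, texts):
        print(f"  {label}. {text}")
    print("Correct answer:", answer)
```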
There are two ways to play:
- Hosted: go to https://play-with-llm-benchmarks.streamlit.app/ for the full experience.
- Local: clone the repo and run `streamlit run Main.py`.
Either way, enjoy the quiz game based on the selected benchmarks: answer questions drawn from them and measure your own performance.
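For a sense of what a question looks like in the app, here is a minimal, self-contained Streamlit sketch of a single quiz question. It is not the project's `Main.py`: the file name `quiz_sketch.py`, the hard-coded question, and the layout are illustrative assumptions, whereas the real app draws its items from the benchmark datasets listed above.

```python
# quiz_sketch.py -- minimal illustration of rendering one quiz question in Streamlit.
# NOT the project's Main.py; the question below is hard-coded for brevity.
import streamlit as st

QUESTION = "Which gas do plants primarily absorb for photosynthesis?"
CHOICES = {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Hydrogen"}
ANSWER = "B"

st.title("Play with LLM Benchmarks")
st.write(QUESTION)

# Radio buttons show "A. Oxygen", "B. Carbon dioxide", etc.
choice = st.radio(
    "Your answer:",
    options=list(CHOICES),
    format_func=lambda key: f"{key}. {CHOICES[key]}",
)

if st.button("Submit"):
    if choice == ANSWER:
        st.success("Correct!")
    else:
        st.error(f"Incorrect. The right answer was {ANSWER}. {CHOICES[ANSWER]}")
```

Running `streamlit run quiz_sketch.py` serves this single-question page locally, the same way `streamlit run Main.py` launches the full game.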
Feel free to contribute, report issues, or suggest improvements to enhance the overall experience. Happy quizzing!