Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Dec 14, 2024 - TypeScript
Hallucinations (Confabulations) Document-Based Benchmark for RAG
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
RJafroc quick start for those already familiar with Windows JAFROC
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
Repository for the LWDA'24 presentation on 'Psychometric Profiling of GPT Models for Bias Exploration', featuring conference materials including the poster, paper, slides, and references.