Model Written Evals for generating Inverse Scaling effect datasets

This repo uses LMs to generate a dataset of question-answer pairs for questions that have nice-sounding but wrong answers (a possible failure mode incentivized by RLHF). The dataset shows an inverse scaling effect when evaluated with larger models. This project is inspired by 'Discovering Language Model Behaviors with Model-Written Evaluations' by Perez et al. The model-written evaluation set consists of input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \text{Questions}$ and each label $y_i$ is drawn from the finite set of possible answer labels [‘Yes’, ‘No’].
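As a minimal sketch of the pair structure described above, the snippet below models one $(x_i, y_i)$ pair and counts labels. The names `EvalPair`, `LABELS`, and `label_counts` are hypothetical and do not come from this repo, and the questions are placeholders rather than dataset entries.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# The finite label set from the definition above.
LABELS = ("Yes", "No")


@dataclass(frozen=True)
class EvalPair:
    """One (x_i, y_i) pair: a question and its 'Yes'/'No' label (hypothetical type)."""

    question: str
    label: str

    def __post_init__(self) -> None:
        # Reject any label outside the allowed set.
        if self.label not in LABELS:
            raise ValueError(f"label must be one of {LABELS}, got {self.label!r}")


def label_counts(pairs: Iterable[EvalPair]) -> Counter:
    """Count how many pairs carry each label, e.g. to check class balance."""
    return Counter(pair.label for pair in pairs)


# Placeholder questions, not taken from the generated dataset.
example_pairs = [
    EvalPair("placeholder question 1", "Yes"),
    EvalPair("placeholder question 2", "No"),
]
example_counts = label_counts(example_pairs)
```

A count like this is one simple way to confirm that a generated set is roughly class balanced before evaluating models on it.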

This model written eval set generation involved the following steps:

Eval generation Prompt Engineering: First, I used GPT-4 and Sydney to generate prompts for generating question-answer pairs conditioned on the given criteria and requirements. Then I merged those prompts and loomed over them to curate the final prompt (Prompts File).
AI Steering to filter and select QA pairs: Eval generation uses a prompt-tuned LM to loom over the generation: at each timestep it samples multiple completions, ranks them, and selects the highest-ranked completion, steering the model-written eval dataset toward the required criteria. $$\prod_{t=1}^N \max_{t,i} \mathrm{AIRank}(x_t \mid y_i, x_{1:t-1})$$ AI steering provides automated, scalable oversight of dataset generation by filtering for and steering toward the most likely question-answer pairs for the given criteria. Dataset generation and steering were performed using gpt-3.5-turbo, with temperature 1.0 for QA pair generation and temperature 0.0 for AI steering.
Dataset visualisation: With just prompt-tuning and AI steering of the generational dynamics, the LM generated a class-balanced dataset with 100 questions labelled ‘Yes’ and 101 questions labelled ‘No’ (Fig 2). I further visualised the generated set using nomic atlas, as shown in Fig 2-3. (The data visualisation can be accessed and explored here using nomic’s atlas.)

Scaling laws: The scaling law plots show an inverse scaling effect for bigger models on the LM-generated evals dataset, for both the base OpenAI models and the instruction-tuned (FeedMe) models.
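The AI steering step in the list above can be sketched as a greedy best-of-n loop: sample several candidate completions at each step, rank them, and append the top-ranked one. This is an illustrative sketch, not the code in `eval_steering.py`. The function `steer` and the `generate` and `rank` callables are hypothetical stand-ins; in the repo these roles are played by gpt-3.5-turbo calls (temperature 1.0 for generation, temperature 0.0 for ranking), which are replaced here by deterministic stubs so that the example runs offline.

```python
from typing import Callable, List, Sequence


def steer(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    rank: Callable[[str, Sequence[str]], List[float]],
    n_samples: int = 3,
    n_steps: int = 5,
) -> str:
    """Greedy best-of-n steering (hypothetical helper, not the repo's API).

    At each step, sample up to n_samples candidates, score them, and
    append the highest-scoring candidate to the running context.
    """
    context = prompt
    for _ in range(n_steps):
        candidates = generate(context, n_samples)
        if not candidates:
            break  # nothing left to choose from
        scores = rank(context, candidates)
        # Index of the highest score; ties resolve to the earliest candidate.
        best_index = max(range(len(candidates)), key=scores.__getitem__)
        context += candidates[best_index]
    return context


# Deterministic stubs standing in for the LM calls (assumptions for the demo).
def demo_generate(context: str, n: int) -> List[str]:
    return [" a", " bb", " ccc"][:n]


def demo_rank(context: str, candidates: Sequence[str]) -> List[float]:
    # Toy ranking: prefer longer candidates.
    return [float(len(candidate)) for candidate in candidates]


steered = steer("start:", demo_generate, demo_rank, n_samples=3, n_steps=2)
```

Passing the generator and ranker in as callables keeps the selection loop independent of any particular model or API client, so the same loop can be exercised with stubs or with real LM calls.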

Code organisation:

/dataset
	/atlas-data-visualisation.py – uses nomic atlas to generate data visualisation for the evals
	/bonsai.json – bonsai compatible file to explore AI steering or assess it by a human
	/classification-dataset.csv – classification score eval dataset
	/clean.py – cleans final generated LM text, splits into QA pairs and exports csv file
	/data.txt – final generated LM text via steering
	/logodds-dataset.csv – logodds metric dataset
/results - evaluation results for base and feedme models
/plots
	/plot-base-feedme.py - script to generate a line chart of base vs. feedme models accuracy vs. model size
	/Inverse_Scaling_GPT_3_Colab.ipynb - colab notebook for generating classification loss and logodd charts (credit: [Inverse Scaling](https://github.com/inverse-scaling/) Prize Notebook)
/eval_generation.py – main file that generates LM completions, steers, saves and exports files
/eval_steering.py – prompt-tuned steering to select one of the n samples
/model.py – defines models including base model (code-davinci-002) & chat (gpt-3.5-turbo/4)
/prompts.py – defines the different prompts used: main_prompt, system_prompt, ai_steering_prompt
/test_config.py – save this as config.py and add your api_keys here
/utils.py – basic util operations
/bonsai_export.py – generates a bonsai (web version of LOOM) compatible json file

Further exploration: Using the same approach, I created a model-written evals set of propositional logic (disjunctive syllogism task) question-answer pairs with the possible labels [‘Yes’, ‘No’]. Dataset link. Scaling law plots:

hunarbatra/model-written-evals