v0.1.5: Tabular Data Node, Evaluation Output
We've added Tabular Data to ChainForge to help conduct ground truth evaluations. Full release notes below.
Tabular Data Nodes 🗂️
You can now input and import tabular data (spreadsheets) into ChainForge. Accepted formats are `jsonl`, `xlsx`, and `csv`. Excel and CSV files must have a header row with column names.
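For instance, a minimal `jsonl` file, where each line is one JSON object whose keys become column names, might look like this (the rows here are hypothetical):

```json
{"first": "Thomas", "last": "Edison", "invention": "the phonograph"}
{"first": "Alexander", "last": "Bell", "invention": "the telephone"}
```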
Tabular data provides an easy way to enter associated prompt parameters or import existing datasets and benchmarks. A typical use case is ground truth evaluation, where we have some inputs to a prompt, and an "ideal" or expected answer:
Here, we see variables `{first}`, `{last}`, and `{invention}` "carry together" when filling the prompt template: ChainForge knows they are all associated with one another, connected via the row. Thus, it constructs 4 prompts from the input parameters.
Accessing tabular data, even if it's not input into the prompt directly
Alongside tabular data comes a new property of `response` objects in Evaluation nodes: the `meta` dict. This gives you access to column data that is associated with inputs to a prompt template, but was not itself directly input into the template. For instance, in the new example flow for ground truth evaluation of math problems:
Notice that the evaluator uses `meta` to get "Expected", which is associated with the prompt input variable `question` by virtue of being on the same row of the table.
```python
def evaluate(response):
    # Check whether the first few characters of the LLM's reply
    # match the 'Expected' column from the same table row:
    return response.text[:4] == response.meta['Expected']
```
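Note that this particular check assumes the expected answer is at most four characters long and that the model's reply begins with it; for longer or free-form answers, you would adapt the comparison accordingly.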
Example flows
Tabular data allows us to run many more types of LLM evaluations. For instance, here is the ground truth evaluation `multistep-word-problems` from OpenAI evals, loaded into ChainForge:
We've added an Example Flow for ground truth evaluation that provides a good starting point.
Evaluation Node output 📟
Curious what a `response` object looks like? You can now call `print` inside `evaluate` functions to print output directly to the browser:
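For example, here is a quick sketch of an evaluator that dumps what it receives (the exact printed format depends on ChainForge's `response` object):

```python
def evaluate(response):
    # Printed output appears in the node's output console in the browser:
    print(response.text)  # the LLM's reply, as a string
    print(response.meta)  # dict of associated table columns, if any
    return len(response.text)
```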
In addition, exceptions raised inside your evaluation function will also print to the node's output:
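So, for example, an evaluator like the sketch below (which deliberately indexes a hypothetical 'Answer' column that may not exist) would surface the resulting KeyError in the node rather than failing silently:

```python
def evaluate(response):
    # Raises KeyError if the table has no 'Answer' column; the exception
    # message is printed to the node's output:
    return response.text == response.meta['Answer']
```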
Slight styling improvements in Response Inspectors
We removed the use of blue Badges to display unselected prompt variables, replacing them with text that blends into the background:
The fullscreen inspector also displays a slightly larger font size for readability:
Final thoughts / comments
- Tabular Data was a major feature, as it enables many types of LLM evaluation. Our goal now is to illustrate what people can currently do in ChainForge through better documentation and by connecting to existing datasets (e.g. OpenAI evals). We will also focus on quality-of-life improvements to the UI and on adding more models and extensibility.
- We know there is a minor layout issue where the table does not autosize to best fit the width of cell content. This happens because some browsers do not appear to autofit column widths properly when a `<textarea>` is an element of a table cell. We are working on a fix so columns are automatically sized based on their content.