Add PR evaluation prompt and link to fine-tuning benchmark documentation #940

Merged (1 commit) on Jun 3, 2024
1 change: 1 addition & 0 deletions docs/docs/finetuning_benchmark/index.md
@@ -74,6 +74,7 @@ Here are the prompts, and example outputs, used as input-output pairs to fine-tu
<br>

We experimented with three models as judges: `gpt-4-turbo-2024-04-09`, `gpt-4o`, and `claude-3-opus-20240229`. All three produced similar results, with the same ranking order, which strengthens the validity of our testing protocol.
The evaluation prompt can be found [here](https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/pr_evaluate_prompt_response.toml).

Here is an example of judge-model feedback:

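As a usage illustration (not part of this PR), here is a minimal sketch of how the new evaluation prompt might be loaded, rendered, and sent to one of the judge models. The file path and template variables match the TOML added below; the TOML loading, Jinja2 rendering, and `litellm` client call are assumptions rather than the repository's exact code.

```python
# Minimal sketch (assumed plumbing; not PR-Agent's exact implementation).
import tomllib                      # Python 3.11+; use the "tomli" package on older versions
from jinja2 import Environment, StrictUndefined
from litellm import completion      # any chat-completion client would work here

# Load the prompt template added by this PR.
with open("pr_agent/settings/pr_evaluate_prompt_response.toml", "rb") as f:
    template = tomllib.load(f)["pr_evaluate_prompt"]["prompt"]

# Placeholder inputs: the original task (instructions + PR code diff) and the two
# candidate responses being compared (e.g. baseline model vs. fine-tuned model).
task_text = "..."
response_a = "..."
response_b = "..."

# Fill the Jinja2 variables used in the template: pr_task, pr_response1, pr_response2.
prompt = Environment(undefined=StrictUndefined).from_string(template).render(
    pr_task=task_text, pr_response1=response_a, pr_response2=response_b
)

# Ask one of the judge models mentioned in the docs to rank the two responses.
reply = completion(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
judge_feedback = reply["choices"][0]["message"]["content"]
print(judge_feedback)
```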
68 changes: 68 additions & 0 deletions pr_agent/settings/pr_evaluate_prompt_response.toml
@@ -0,0 +1,68 @@
[pr_evaluate_prompt]
prompt="""\
You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses to a lengthy task regarding a Pull Request (PR) code diff.


The task to be evaluated is:

***** Start of Task *****
{{pr_task|trim}}

***** End of Task *****



Response 1 to the task is:

***** Start of Response 1 *****

{{pr_response1|trim}}

***** End of Response 1 *****



Response 2 to the task is:

***** Start of Response 2 *****

{{pr_response2|trim}}

***** End of Response 2 *****



Guidelines to evaluate the responses:
- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task is related.
- Thoroughly read the 'Response 1' and 'Response 2' parts. They are two independent responses to the task, generated by two different models.

After that, rank each response. Criteria for ranking each response:
- How well does the response follow the specific task instructions and requirements?
- How well does the response analyze and understand the PR code diff?
- How well will a person perceive it as a good response that correctly addresses the task?
- How well does the response prioritize key feedback, related to the task instructions, that a human reader seeing that feedback would also consider important?
- Don't necessarily rank a longer response higher. A shorter response might be better if it is more concise and still addresses the task well.


The output must be a YAML object equivalent to type $PRRankResponse, according to the following Pydantic definitions:
=====
class PRRankResponse(BaseModel):
    which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")
    why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")
    score_response1: int = Field(description="A score between 1 and 10, indicating the quality of response 1, based on the criteria mentioned in the prompt.")
    score_response2: int = Field(description="A score between 1 and 10, indicating the quality of response 2, based on the criteria mentioned in the prompt.")
=====
=====


Example output:
```yaml
which_response_was_better: "X"
why: "Response X is better because it is more practical, and addresses the task requirements better since ..."
score_response1: ...
score_response2: ...
```


Response (should be a valid YAML, and nothing else):
```yaml
"""