This repository contains code for evaluating the reasoning capabilities of different large language models (LLMs) using various prompts.
configure the settings in the provided configuration file - dojo.yaml
- run dojo.py to get result fro each llms
moonshot:
apikey: --masked--
groq:
apikey: --masked--
datascope:
apikey: --masked--
deepseek:
apikey: --masked--
baidu:
pokkoa:
apikey: --masked--
secretkey: --masked--
all:
acesskey: --masked--
secretkey: --masked--
test prompts goes here
Write me a python script example. 给我写诗
output to ./out/{dojo name}/
dojo_setup_simple = {
"name": "simple", //dojo name
"prompt": get_prompt_by_name("simple"), //./prompts/using prompt.simple.txt
"temperature": [0.1, 0.3, 0.5, 1],
"top_p": [0.8],
"presence_penalty": [2.0, 0, -2.0],
"frequency_penalty": [2.0, 0, -2.0],
"models": [moonshot,pokkoa_qwen]
}
class GPTToneConfig:
def __init__(self, name, temperature, top_p, frequency_penalty=0, presence_penalty=0):
self.name = name
self.temperature = temperature
self.top_p = top_p
self.frequency_penalty = frequency_penalty
self.presence_penalty = presence_penalty
def __str__(self):
return f"{self.name}: Temperature: {self.temperature}, Top-p: {self.top_p}, Frequency Penalty: {self.frequency_penalty}, Presence Penalty: {self.presence_penalty}"
class Character:
def __init__(self, name, role, personality, tone_config, prompt):
self.name = name
self.role = role
self.personality = personality
self.tone_config = tone_config
self.prompt = prompt
def __str__(self):
return f"角色: {self.name} ({self.role})\n性格: {self.personality}\n语气设置: {self.tone_config}\n提示词: {self.prompt}"
# 定义语气设置
encouraging = GPTToneConfig("鼓励", temperature=0.8, top_p=0.9, frequency_penalty=0.4, presence_penalty=0.3)
default = Character(
"预设",
"你是一个只解释易经卦像的bot",
"你是一个只解释易经卦像的bot",
encouraging,
"你是一个只解释易经卦像的bot"
)
characters = [default]
- make sure setup the prompt_name in hasher.py and scorer.py
- run hasher.py to create csv files for scoring
- ./human_dojo_text contains scorable csv files
- put scored csv files in ./human_result
- run scorer.py to get complete scoring