In this project, the file mm-vet-v2_evaluator.py evaluates common multimodal models on the MM-Vet-v2 dataset. Using a predefined prompt and a GPT model, it compares each model prediction against the ground-truth answer, produces a correctness score, and supports multiple runs so that score stability can be measured. Finally, it saves the scores as a JSON file and exports CSV reports for both capabilities and capability integrations, which can be used to analyze a model's performance and capability distribution.
The implementation of mm-vet-v2_evaluator.py proceeds as follows.
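The excerpts below omit the import section at the top of the file. Based on the names used later (argparse, json, pathlib, Counter, numpy, pandas, tqdm, and the OpenAI v1 client with its error types), the imports are along these lines (a reconstruction, not shown in the original listing):
import argparse
import json
import os
import pathlib
import time
from collections import Counter

import numpy as np
import pandas as pd
from openai import OpenAI, BadRequestError, RateLimitError
from tqdm import tqdm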
(1) The code below implements the comparison between a large model's predictions and the human-annotated ground truth, assigning each prediction a correctness score according to fixed rules. This pattern is broadly applicable when building automated evaluation tools, especially for comparing and tuning generative AI models.
prompt = """Compare the ground truth and prediction from AI models, to give a correctness score for the prediction. in the question indicates where an image is. in the ground truth means it is totally right only when all elements in the ground truth are present in the prediction, and means it is totally right when any one element in the ground truth is present in the prediction. The correctness score is 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Just complete the last space of the correctness score.
| Question | Ground truth | Prediction | Correctness |
| --- | --- | --- | --- |
| What is x in the equation? | -1 <AND> -5 | x = 3 | 0.0 |
| What is x in the equation? | -1 <AND> -5 | x = -1 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 or 5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -1 or x = -5 | 1.0 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme talks about Iceland and Greenland. It's pointing out that despite their names, Iceland is not very icy and Greenland isn't very green. | 0.4 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme is using humor to point out the misleading nature of Iceland's and Greenland's names. Iceland, despite its name, has lush green landscapes while Greenland is mostly covered in ice and snow. The text 'This is why I have trust issues' is a playful way to suggest that these contradictions can lead to distrust or confusion. The humor in this meme is derived from the unexpected contrast between the names of the countries and their actual physical characteristics. | 1.0 |
"""
(2) The code below defines the function arg_parser, which uses the argparse module to build a command-line parser for the runtime configuration: paths, model choice, grading prompt, API key, and other evaluation details for running tasks against multimodal or language models. Every parameter can be customized, and defaults are provided for quick experiments.
def arg_parser(prompt=prompt):
parser = argparse.ArgumentParser()
parser.add_argument(
"--mmvetv2_path",
type=str,
default="/path/to/mm-vet-v2",
help="下载 mm-vet.zip 并解压 `unzip mm-vet.zip`,然后在这里更改路径",
)
parser.add_argument(
"--result_file",
type=str,
default="results/llava_llama2_13b_chat.json",
help="模型结果文件路径,必须以 .json 结尾",
)
parser.add_argument(
"--result_path",
type=str,
default="results",
help="保存评分结果的路径",
)
parser.add_argument(
"--openai_api_key",
type=str,
default=None,
help="如果未指定,则使用环境变量 OPENAI_API_KEY。",
)
parser.add_argument(
"--gpt_model",
type=str,
default="gpt-4-0613",
help="GPT 模型名称",
)
parser.add_argument(
"--prompt",
type=str,
default=prompt,
help="模型使用的提示词",
)
parser.add_argument(
"--subset",
type=str,
default=None,
help="包含评估 ID 的 JSON 文件路径",
)
parser.add_argument(
"--decimal_places",
type=int,
default=1,
help="保留小数位数",
)
parser.add_argument(
"--num_run",
type=int,
default=1,
help="我们在论文中设置为 5",
)
args = parser.parse_args()
return args
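As a usage sketch (all paths and the model name are placeholders), the script would typically be launched from the command line, after which args carries the full configuration:
# Hypothetical invocation:
#   python mm-vet-v2_evaluator.py \
#       --mmvetv2_path /data/mm-vet-v2 \
#       --result_file results/my_model.json \
#       --num_run 5
args = arg_parser()
print(args.gpt_model)  # "gpt-4-0613" unless overridden on the command line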
(3) The function get_file_names dynamically builds and returns three file paths: one for the per-sample grading results, one for the capability scores, and one for the capability-integration scores. The names are assembled from the model name, subset name, GPT model version, and number of runs, then joined with the result directory for later reading and writing.
def get_file_names(args, model, subset_name):
    # File that stores the per-sample grading results
grade_file = f"{model}_{args.gpt_model}-grade-{args.num_run}runs_dev8.json"
grade_file = os.path.join(args.result_path, grade_file)
    # Files that store the capability / capability-integration scores
cap_score_file = (
f"{model}_{subset_name}{args.gpt_model}-cap-score-{args.num_run}runs_dev8.csv"
)
cap_score_file = os.path.join(args.result_path, cap_score_file)
cap_int_score_file = f"{model}_{subset_name}{args.gpt_model}-cap-int-score-{args.num_run}runs_dev8.csv"
cap_int_score_file = os.path.join(args.result_path, cap_int_score_file)
    # Return the generated file paths
return grade_file, cap_score_file, cap_int_score_file
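For illustration, with the default arguments and no subset, the three generated paths look like this (the model name matches the default result file):
# Sketch of the generated paths under the default arguments (no subset)
args = arg_parser()
grade_file, cap_score_file, cap_int_score_file = get_file_names(
    args, model="llava_llama2_13b_chat", subset_name=""
)
print(grade_file)          # results/llava_llama2_13b_chat_gpt-4-0613-grade-1runs_dev8.json
print(cap_score_file)      # results/llava_llama2_13b_chat_gpt-4-0613-cap-score-1runs_dev8.csv
print(cap_int_score_file)  # results/llava_llama2_13b_chat_gpt-4-0613-cap-int-score-1runs_dev8.csv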
(4) The function load_metadata loads the metadata file (mm-vet-v2.json) and an optional subset file (args.subset), then parses and tallies the capability-related information: per-capability counts, capability combinations, and their distribution. It returns several key data structures (counters, data frames, and so on) for later use, in particular handling the sorting of capability combinations, name generation, and data-frame initialization.
def load_metadata(args):
    # Load the user-specified subset file (if provided)
if args.subset:
with open(args.subset, "r") as f:
subset = json.load(f)
        # Extract the subset name (file name without its extension)
subset_name = pathlib.Path(args.subset).stem
subset_name = subset_name + "_"
else:
subset = None
subset_name = ""
    # Load the main metadata file mm-vet-v2.json
mmvet_metadata = os.path.join(args.mmvetv2_path, "mm-vet-v2.json")
with open(mmvet_metadata, "r") as f:
data = json.load(f)
    # Initialize the bookkeeping structures
    counter = Counter()  # counts how often each capability occurs
    cap_set_list = []  # list of capability combinations
    cap_set_counter = []  # count for each capability combination
    len_data = 0  # number of samples considered (subset-aware)
    # Iterate over the entries, tallying capabilities and their combinations
for id, value in data.items():
        # If a subset file was given, skip entries outside the subset
if subset is not None and id not in subset:
continue
        # Get the capability set of the current entry
cap = value["capability"]
cap = set(cap)
        # Update the capability counter
counter.update(cap)
        # If this capability combination is new, record it and start its count
if cap not in cap_set_list:
cap_set_list.append(cap)
cap_set_counter.append(1)
else:
            # Otherwise increment the count of the existing combination
cap_set_counter[cap_set_list.index(cap)] += 1
        # Count this entry toward the total
len_data += 1
    # Sort the capability counter by frequency, descending
sorted_list = counter.most_common()
columns = [k for k, v in sorted_list]
    # Append extra columns for the data frame
columns.append("total")
columns.append("std")
columns.append("runs")
df = pd.DataFrame(columns=columns)
    # Sort the capability combinations by frequency, descending
cap_set_sorted_indices = np.argsort(-np.array(cap_set_counter))
new_cap_set_list = []
new_cap_set_counter = []
for index in cap_set_sorted_indices:
new_cap_set_list.append(cap_set_list[index])
new_cap_set_counter.append(cap_set_counter[index])
cap_set_list = new_cap_set_list
cap_set_counter = new_cap_set_counter
    # Convert each capability combination to a string name
cap_set_names = ["_".join(list(cap_set)) for cap_set in cap_set_list]
    # Initialize the data frame for the capability combinations
    columns2 = cap_set_names.copy()  # copy so the appends below do not mutate cap_set_names
columns2.append("total")
columns2.append("std")
columns2.append("runs")
df2 = pd.DataFrame(columns=columns2)
    # Return the parsed data and statistics
return (
subset,
subset_name,
data,
counter,
cap_set_list,
cap_set_counter,
len_data,
df,
df2,
cap_set_names,
)
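The tallying logic is easiest to see on toy data. Each entry of mm-vet-v2.json carries a capability list; the self-contained sketch below (capability labels invented) reproduces the counting loop above:
from collections import Counter

toy = {
    "v2_0": {"capability": ["ocr", "math"]},
    "v2_1": {"capability": ["ocr", "spat"]},
    "v2_2": {"capability": ["math", "ocr"]},
}
counter = Counter()
cap_set_list, cap_set_counter = [], []
for _, value in toy.items():
    cap = set(value["capability"])
    counter.update(cap)
    if cap not in cap_set_list:
        cap_set_list.append(cap)
        cap_set_counter.append(1)
    else:
        cap_set_counter[cap_set_list.index(cap)] += 1
print(counter.most_common())  # [('ocr', 3), ('math', 2), ('spat', 1)]
print(cap_set_counter)        # [2, 1] -> {ocr, math} twice, {ocr, spat} once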
(5) The function runs() grades the AI model's predictions with the specified GPT model over one or more runs, and writes the scores to a file. Its main implementation steps are as follows:
def runs(
args,
grade_file,
data,
len_data,
subset=None,
):
    # Load the model prediction file
with open(args.result_file) as f:
results = json.load(f)
    # If a grading file already exists, load it; otherwise start empty
if os.path.exists(grade_file):
with open(grade_file, "r") as f:
grade_results = json.load(f)
else:
grade_results = {}
    # Check whether more runs are needed to reach the requested count
def need_more_runs(args, grade_results, len_data):
need_more_runs = False
if len(grade_results) > 0:
for k, v in grade_results.items():
if len(v["score"]) < args.num_run:
need_more_runs = True
break
return need_more_runs or len(grade_results) < len_data
    # Keep evaluating while more runs are needed
while need_more_runs(args, grade_results, len_data):
for j in range(args.num_run):
print(f"eval run {j}")
for id, line in tqdm(data.items()):
                # If a subset was given, skip entries outside it
if subset is not None and id not in subset:
continue
                # Skip entries already graded for the current run
if id in grade_results and len(grade_results[id]["score"]) >= (j + 1):
continue
                # Get the model prediction
model_pred = results[id]
                # Build the grading question: replace the <IMG>-delimited image
                # paths with <image> and newlines with <br>
                queries = line['question'].split('<IMG>')
                query = ""
                for q in queries:
                    if q.endswith((".jpg", "jpeg", ".png")):
                        query += "<image>"
                    else:
                        query += q
                question = prompt + '| ' + ' | '.join([
                    query.replace('\n', '<br>'),
                    line['answer'].replace("<AND>", " <AND> ").replace("<OR>", " <OR> ").replace('\n', '<br>'),
                    model_pred.replace('\n', '<br>'),
                    ""
                ])
                # Build the chat message
messages = [
{"role": "user", "content": question},
]
                # Initialize this entry's grade record
if id not in grade_results:
sample_grade = {"model": [], "content": [], "score": []}
else:
sample_grade = grade_results[id]
                # Begin grading attempts, starting at temperature 0.0
grade_sample_run_complete = False
temperature = 0.0
while not grade_sample_run_complete:
try:
                        # Call the GPT model for a response
response = client.chat.completions.create(
model=args.gpt_model,
max_tokens=3,
temperature=temperature,
messages=messages,
)
content = response.choices[0].message.content
                        # Extract the score value
flag = True
try_time = 1
while flag:
try:
content = content.split(" ")[0].strip()
score = float(content)
if score > 1.0 or score < 0.0:
assert False
flag = False
except:
                                # If parsing fails, retry with an extended prompt
question_try = question + "\n\nPredict the correctness of the answer (digit): "
messages = [
{"role": "user", "content": question_try},
]
response = client.chat.completions.create(
model=args.gpt_model,
max_tokens=3,
temperature=temperature,
messages=messages,
)
content = response.choices[0].message.content
try_time += 1
temperature += 0.5
print(f"{id} try {try_time} times")
print(content)
                                if try_time > 5:  # max retries exceeded; fall back to a score of 0.0
score = 0.0
flag = False
grade_sample_run_complete = True
response_model = response.model
except RateLimitError as e:
                        # If rate limited, pause for a while
print("sleep 30s")
time.sleep(30)
except BadRequestError as e:
                        # On an invalid request, record defaults and stop grading this entry
content = "BadRequestError"
score = 0.0
flag = False
print(id, "BadRequestError")
response_model = args.gpt_model
break
                # Update the grade record for this entry
if len(sample_grade["model"]) >= j + 1:
sample_grade["model"][j] = response_model
sample_grade["content"][j] = content
sample_grade["score"][j] = score
else:
sample_grade["model"].append(response_model)
sample_grade["content"].append(content)
sample_grade["score"].append(score)
grade_results[id] = sample_grade
                # Save the grading results to file
with open(grade_file, "w") as f:
json.dump(grade_results, f, indent=4)
return grade_results
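Two details are worth noting. First, each entry of the returned grade_results has the shape {"model": [...], "content": [...], "score": [...]}, with one element per run. Second, the score extraction assumes the grader replies with a bare number; a standalone version of that parsing step looks like this (a sketch, with the retry loop left to the caller):
def parse_score(content):
    # Return the score if the first whitespace-separated token is a
    # number in [0, 1]; otherwise None (runs() then retries the request)
    try:
        score = float(content.split(" ")[0].strip())
    except ValueError:
        return None
    return score if 0.0 <= score <= 1.0 else None

print(parse_score("0.5"))         # 0.5
print(parse_score("Score: 0.5"))  # None -> would trigger a retry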
(6) The function export_result builds two tables (df and df2) from the grading results, aggregating score performance per capability and per capability combination, and exports them as CSV files. The process is as follows:
def export_result(args, model, df, df2, grade_results, data, cap_set_counter, cap_set_names):
    # Initialize per-capability score dicts recording the score of each run.
    # Note: counter, len_data, subset, cap_set_list, cap_score_file and
    # cap_int_score_file are read from module scope (set in __main__ below).
columns = df.columns
columns2 = df2.columns
    # One score list per capability, with one slot per run
cap_scores = {k: [0.0] * args.num_run for k in columns[:-2]}
counter["total"] = len_data # 记录总样本数
    # Initialize the score dict for capability combinations
cap_scores2 = {k: [0.0] * args.num_run for k in columns2[:-2]}
counter2 = {columns2[i]: cap_set_counter[i] for i in range(len(cap_set_counter))}
counter2["total"] = len_data
    # Iterate over the grading results, accumulating scores
for k, v in grade_results.items():
        # If a subset was given, only process entries inside it
if subset is not None and k not in subset:
continue
        # Iterate over the per-run scores
for i in range(args.num_run):
score = v["score"][i]
            caps = set(data[k]["capability"])  # capability set of this sample
            # Accumulate the score for each capability
for c in caps:
cap_scores[c][i] += score
cap_scores["total"][i] += score
            # Find this capability combination's index and accumulate its score
index = cap_set_list.index(caps)
cap_scores2[cap_set_names[index]][i] += score
cap_scores2["total"][i] += score
    # Convert each capability's score sums to percentages
for k, v in cap_scores.items():
cap_scores[k] = np.array(v) / counter[k] * 100
    # Standard deviation of the total score across runs, plus the per-run values
std = round(cap_scores["total"].std(), args.decimal_places)
total_copy = cap_scores["total"].copy()
runs = str(list(np.round(total_copy, args.decimal_places)))
    # Average each capability's score over the runs
for k, v in cap_scores.items():
cap_scores[k] = round(v.mean(), args.decimal_places)
    # Attach the total score's standard deviation and the per-run record
cap_scores["std"] = std
cap_scores["runs"] = runs
df.loc[model] = cap_scores
    # Average the capability-combination scores
for k, v in cap_scores2.items():
cap_scores2[k] = round(
np.mean(np.array(v) / counter2[k] * 100), args.decimal_places
)
cap_scores2["std"] = std
cap_scores2["runs"] = runs
df2.loc[model] = cap_scores2
    # Export the results to CSV files
df.to_csv(cap_score_file)
df2.to_csv(cap_int_score_file)
return df, df2
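Both CSVs share the same layout: one row per model, one column per capability (or capability combination), plus total, std, and runs. They can be reloaded with pandas for inspection, e.g.:
# Sketch: reload an exported report (the path is a placeholder)
scores = pd.read_csv(
    "results/llava_llama2_13b_chat_gpt-4-0613-cap-score-1runs_dev8.csv",
    index_col=0,
)
print(scores[["total", "std"]])  # overall score (in %) and its std across runs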
(7) The code below is the main entry point. It parses the command-line arguments, initializes the OpenAI client, validates the result file, loads the metadata, grades the model outputs, and exports the capability reports:
if __name__ == "__main__":
    # Parse the command-line arguments
args = arg_parser()
    # Set the OpenAI API key
if args.openai_api_key:
OPENAI_API_KEY = args.openai_api_key
else:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    # Initialize the OpenAI client
client = OpenAI(
api_key=OPENAI_API_KEY
)
    # Validate that the result file exists and is in JSON format
    if not os.path.exists(args.result_file):
        raise ValueError("Result file does not exist")
    if not args.result_file.endswith(('.json', '.JSON')):
        raise ValueError("Result file should be in JSON format")
    # Derive the model name from the result file name
model = pathlib.Path(args.result_file).stem
    # Load the metadata
metadata = load_metadata(args)
(
subset,
subset_name,
data,
counter,
cap_set_list,
cap_set_counter,
len_data,
df,
df2,
cap_set_names,
) = metadata
    # Build the paths of the grading and capability-score files
file_names = get_file_names(args, model, subset_name)
(
grade_file,
cap_score_file,
cap_int_score_file,
) = file_names
    # Grade the model outputs
grade_results = runs(
args,
grade_file,
data,
len_data,
subset,
)
    # Export the capability score results
df, df2 = export_result(
args,
model,
df,
df2,
grade_results,
data,
cap_set_counter,
cap_set_names,
)
    # Print the results and the save locations
print(df)
print("\n")
print(df2)
print("\n")
print(f"评分结果已保存于:\n{grade_file}\n{cap_score_file}\n{cap_int_score_file}")