
9.4  Evaluating Multimodal Large Models

In this project, the file mm-vet-v2_evaluator.py evaluates how common multimodal models perform on the MM-Vet-v2 dataset. Using a predefined prompt and a GPT model, it compares each model prediction against the ground-truth answer and produces a correctness score, and it supports multiple runs so that score stability can be measured. Finally, it saves the per-sample scores as a JSON file and exports CSV reports for capability and capability-integration evaluation, which can be used to analyze the model's performance and capability distribution.

The implementation of the file mm-vet-v2_evaluator.py proceeds as follows.
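The script depends on several standard-library and third-party modules that are not shown in the excerpts below. The following import block is reconstructed from the calls actually used throughout the file (argparse, json, os, pathlib, time, collections.Counter, numpy, pandas, tqdm, and the OpenAI v1 Python client):

import argparse
import json
import os
import pathlib
import time
from collections import Counter

import numpy as np
import pandas as pd
from tqdm import tqdm
from openai import OpenAI, RateLimitError, BadRequestError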

(1) The following code compares a model's predictions against human-annotated ground-truth answers and assigns each prediction a correctness score according to a fixed set of rules. This pattern is broadly applicable when building automated evaluation tools, especially for comparing and tuning generative AI models.

prompt = """Compare the ground truth and prediction from AI models, to give a correctness score for the prediction.  in the question indicates where an image is.  in the ground truth means it is totally right only when all elements in the ground truth are present in the prediction, and  means it is totally right when any one element in the ground truth is present in the prediction. The correctness score is 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). Just complete the last space of the correctness score.

| Question | Ground truth | Prediction | Correctness |
| --- | --- | --- | --- |
| What is x in the equation? | -1 <AND> -5 | x = 3 | 0.0 |
| What is x in the equation? | -1 <AND> -5 | x = -1 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -5 or 5 | 0.5 |
| What is x in the equation? | -1 <AND> -5 | x = -1 or x = -5 | 1.0 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme talks about Iceland and Greenland. It's pointing out that despite their names, Iceland is not very icy and Greenland isn't very green. | 0.4 |
| Can you explain this meme? | This meme is poking fun at the fact that the names of the countries Iceland and Greenland are misleading. Despite its name, Iceland is known for its beautiful green landscapes, while Greenland is mostly covered in ice and snow. The meme is saying that the person has trust issues because the names of these countries do not accurately represent their landscapes. | The meme is using humor to point out the misleading nature of Iceland's and Greenland's names. Iceland, despite its name, has lush green landscapes while Greenland is mostly covered in ice and snow. The text 'This is why I have trust issues' is a playful way to suggest that these contradictions can lead to distrust or confusion. The humor in this meme is derived from the unexpected contrast between the names of the countries and their actual physical characteristics. | 1.0 |
"""

(2) The following code defines the function arg_parser, which uses the argparse module to build a command-line parser for the script's runtime parameters. It configures paths, the GPT model, the prompt, the API key, and other evaluation details for running the multimodal evaluation task, and it supplies sensible defaults so users can start experimenting quickly while still overriding any value.

def arg_parser(prompt=prompt):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mmvetv2_path",
        type=str,
        default="/path/to/mm-vet-v2",
        help="download mm-vet-v2.zip, run `unzip mm-vet-v2.zip`, then change this path",
    )
    parser.add_argument(
        "--result_file",
        type=str,
        default="results/llava_llama2_13b_chat.json",
        help="path to the model result file; must end with .json",
    )
    parser.add_argument(
        "--result_path",
        type=str,
        default="results",
        help="path where the grading results are saved",
    )
    parser.add_argument(
        "--openai_api_key", 
        type=str, 
        default=None,
        help="if not specified, the environment variable OPENAI_API_KEY is used",
    )
    parser.add_argument(
        "--gpt_model", 
        type=str, 
        default="gpt-4-0613", 
        help="GPT model name",
    )
    parser.add_argument(
        "--prompt", 
        type=str, 
        default=prompt, 
        help="prompt used by the grading model",
    )
    parser.add_argument(
        "--subset",
        type=str,
        default=None,
        help="path to a JSON file containing the IDs to evaluate",
    )
    parser.add_argument(
        "--decimal_places",
        type=int,
        default=1,
        help="number of decimal places to keep",
    )
    parser.add_argument(
        "--num_run",
        type=int,
        default=1,
        help="we set this to 5 in the paper",
    )
    args = parser.parse_args()
    return args
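A typical invocation of the script might look like the following (all paths are placeholders):

python mm-vet-v2_evaluator.py \
    --mmvetv2_path /path/to/mm-vet-v2 \
    --result_file results/llava_llama2_13b_chat.json \
    --gpt_model gpt-4-0613 \
    --num_run 5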

(3) The function get_file_names dynamically builds and returns three file paths: one for the per-sample grading results, one for the capability scores, and one for the capability-integration scores. The names are assembled from the model name, subset name, GPT model version, and number of runs, then joined with the result directory so later steps can read and write them.

def get_file_names(args, model, subset_name):
    # File holding the per-sample grading results
    grade_file = f"{model}_{args.gpt_model}-grade-{args.num_run}runs_dev8.json"
    grade_file = os.path.join(args.result_path, grade_file)

    # Files holding the capability / capability-integration scores
    cap_score_file = (
        f"{model}_{subset_name}{args.gpt_model}-cap-score-{args.num_run}runs_dev8.csv"
    )
    cap_score_file = os.path.join(args.result_path, cap_score_file)
    cap_int_score_file = f"{model}_{subset_name}{args.gpt_model}-cap-int-score-{args.num_run}runs_dev8.csv"
    cap_int_score_file = os.path.join(args.result_path, cap_int_score_file)

    # Return the generated file paths
    return grade_file, cap_score_file, cap_int_score_file
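For example, with model "llava_llama2_13b_chat", no subset (so subset_name is empty), gpt_model "gpt-4-0613", num_run 1, and the default result_path, the three returned paths are:

results/llava_llama2_13b_chat_gpt-4-0613-grade-1runs_dev8.json
results/llava_llama2_13b_chat_gpt-4-0613-cap-score-1runs_dev8.csv
results/llava_llama2_13b_chat_gpt-4-0613-cap-int-score-1runs_dev8.csv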

(4) The main job of the function load_metadata is to load the metadata file (mm-vet-v2.json) and an optional subset file (args.subset), then parse and tally the capability-related information: per-capability counts, capability combinations, and their distribution. It returns several key data structures (counters, data frames, and so on) for later use, in particular handling the sorting of capability combinations, the generation of their names, and the initialization of the data frames.

def load_metadata(args):
    # Load the user-specified subset file, if given
    if args.subset:
        with open(args.subset, "r") as f:
            subset = json.load(f)
        
        # Extract the subset name (file name without extension)
        subset_name = pathlib.Path(args.subset).stem
        subset_name = subset_name + "_"
    else:
        subset = None
        subset_name = ""

    # Load the main metadata file mm-vet-v2.json
    mmvet_metadata = os.path.join(args.mmvetv2_path, "mm-vet-v2.json")
    with open(mmvet_metadata, "r") as f:
        data = json.load(f)

    # Initialize the statistics data structures
    counter = Counter()  # occurrence count per capability
    cap_set_list = []  # list of capability combinations
    cap_set_counter = []  # count per capability combination
    len_data = 0  # number of samples in the (sub)set

    # Walk over the entries, tallying capabilities and their combinations
    for id, value in data.items():
        # If a subset file is given, skip entries outside the subset
        if subset is not None and id not in subset:
            continue

        # Capability set of the current entry
        cap = value["capability"]
        cap = set(cap)

        # Update the capability counter
        counter.update(cap)

        # New capability combination: append it and start its count at 1
        if cap not in cap_set_list:
            cap_set_list.append(cap)
            cap_set_counter.append(1)
        else:
            # Otherwise increment the count of the existing combination
            cap_set_counter[cap_set_list.index(cap)] += 1

        # Increment the sample count
        len_data += 1

    # Sort the capability counter by frequency, descending
    sorted_list = counter.most_common()
    columns = [k for k, v in sorted_list]
    # Extra columns for the data frame
    columns.append("total")
    columns.append("std")
    columns.append("runs")
    df = pd.DataFrame(columns=columns)

    # Sort the capability combinations by frequency, descending
    cap_set_sorted_indices = np.argsort(-np.array(cap_set_counter))
    new_cap_set_list = []
    new_cap_set_counter = []
    for index in cap_set_sorted_indices:
        new_cap_set_list.append(cap_set_list[index])
        new_cap_set_counter.append(cap_set_counter[index])

    cap_set_list = new_cap_set_list
    cap_set_counter = new_cap_set_counter
    # Turn the capability combinations into string names
    cap_set_names = ["_".join(list(cap_set)) for cap_set in cap_set_list]

    # Initialize the data frame for the capability combinations
    columns2 = cap_set_names
    columns2.append("total")
    columns2.append("std")
    columns2.append("runs")
    df2 = pd.DataFrame(columns=columns2)

    # Return the parsed data and statistics
    return (
        subset,
        subset_name,
        data,
        counter,
        cap_set_list,
        cap_set_counter,
        len_data,
        df,
        df2,
        cap_set_names,
    )
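load_metadata and the later runs() assume each entry of mm-vet-v2.json carries at least question, answer, and capability fields, roughly of the following shape. The ID and values here are purely illustrative, not taken from the real dataset; only the field names (and the <IMG> markers consumed by runs()) are inferred from the code:

{
    "v2_0": {
        "question": "What is the total price?<IMG>images/receipt.png<IMG>",
        "answer": "$45",
        "capability": ["ocr", "math"]
    }
}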

(5) The function runs() uses the specified GPT model to grade the AI model's predictions over multiple runs, producing scores and saving them to file. Its main steps are:

  1. Load the prediction result file and any existing grading results.
  2. Check whether more runs are needed to reach the configured number of runs.
  3. Iterate over the data entries, build the grading prompt, and send the question plus prediction to the GPT model.
  4. Parse the GPT response to extract a score, recording the responding model and its output.
  5. Handle response errors with retry logic to keep score extraction reliable.
  6. Save the grading results to the target file after every sample.
def runs(
    args,
    grade_file,
    data,
    len_data,
    subset=None,
):
    # Load the model prediction file
    with open(args.result_file) as f:
        results = json.load(f)

    # If a grade file already exists, resume from it; otherwise start empty
    if os.path.exists(grade_file):
        with open(grade_file, "r") as f:
            grade_results = json.load(f)
    else:
        grade_results = {}

    # Check whether more runs are needed to reach the configured run count
    def need_more_runs(args, grade_results, len_data):
        need_more_runs = False
        if len(grade_results) > 0:
            for k, v in grade_results.items():
                if len(v["score"]) < args.num_run:
                    need_more_runs = True
                    break
        return need_more_runs or len(grade_results) < len_data

    # Keep evaluating while more runs are needed
    while need_more_runs(args, grade_results, len_data):
        for j in range(args.num_run):
            print(f"eval run {j}")
            for id, line in tqdm(data.items()):
                # If a subset is given, skip entries outside the subset
                if subset is not None and id not in subset:
                    continue
                # Skip entries already graded for the current run
                if id in grade_results and len(grade_results[id]["score"]) >= (j + 1):
                    continue

                # Get the model prediction for this entry
                model_pred = results[id]

                # Build the question text, replacing image file references with <image>
                queries = line['question'].split('<IMG>')
                query = ""
                for q in queries:
                    if q.endswith((".jpg", "jpeg", ".png")):
                        query += "<image>"
                    else:
                        query += q
                question = prompt + '| ' + ' | '.join([
                    query.replace('\n', '<br>'),
                    line['answer'].replace("<AND>", " <AND> ").replace("<OR>", " <OR> ").replace('\n', '<br>'),
                    model_pred.replace('\n', '<br>'),
                    "",
                ])

                # Build the chat message
                messages = [
                    {"role": "user", "content": question},
                ]

                # Initialize the grade record for this entry
                if id not in grade_results:
                    sample_grade = {"model": [], "content": [], "score": []}
                else:
                    sample_grade = grade_results[id]

                # Start grading attempts with an initial temperature of 0.0
                grade_sample_run_complete = False
                temperature = 0.0

                while not grade_sample_run_complete:
                    try:
                        # Call the GPT model for a response
                        response = client.chat.completions.create(
                            model=args.gpt_model,
                            max_tokens=3,
                            temperature=temperature,
                            messages=messages,
                        )
                        content = response.choices[0].message.content

                        # Extract the score value
                        flag = True
                        try_time = 1
                        while flag:
                            try:
                                content = content.split(" ")[0].strip()
                                score = float(content)
                                if score > 1.0 or score < 0.0:
                                    assert False
                                flag = False
                            except:
                                # If parsing fails, retry with an amended prompt
                                question_try = question + "\n\nPredict the correctness of the answer (digit): "
                                messages = [
                                    {"role": "user", "content": question_try},
                                ]
                                response = client.chat.completions.create(
                                    model=args.gpt_model,
                                    max_tokens=3,
                                    temperature=temperature,
                                    messages=messages,
                                )
                                content = response.choices[0].message.content
                                try_time += 1
                                temperature += 0.5
                                print(f"{id} try {try_time} times")
                                print(content)
                                if try_time > 5:
                                    # Too many retries: fall back to a score of 0.0
                                    score = 0.0
                                    flag = False
                        grade_sample_run_complete = True
                        response_model = response.model
                    except RateLimitError as e:
                        # On rate limiting, pause for a while before retrying
                        print("sleep 30s")
                        time.sleep(30)
                    except BadRequestError as e:
                        # On an invalid request, record defaults and leave the grading loop
                        content = "BadRequestError"
                        score = 0.0
                        flag = False
                        print(id, "BadRequestError")
                        response_model = args.gpt_model
                        break

                # Update the grade record for this run
                if len(sample_grade["model"]) >= j + 1:
                    sample_grade["model"][j] = response_model
                    sample_grade["content"][j] = content
                    sample_grade["score"][j] = score
                else:
                    sample_grade["model"].append(response_model)
                    sample_grade["content"].append(content)
                    sample_grade["score"].append(score)
                grade_results[id] = sample_grade

                # Save the grade results to file after every sample
                with open(grade_file, "w") as f:
                    json.dump(grade_results, f, indent=4)

    return grade_results

(6) The function export_result builds two tables (df and df2) from the grading results, summarizing the scores per capability and per capability combination, and exports them as CSV files. The procedure is:

  1. Initialize the capability score dictionary cap_scores and the capability-combination score dictionary cap_scores2.
  2. Walk over the grading results, accumulating scores per capability and per combination.
  3. Compute the mean score and standard deviation per capability and update the data frames.
  4. Save the tables as CSV files for later analysis.
def export_result(args, model, df, df2, grade_results, data, cap_set_counter, cap_set_names):
    # Note: counter, len_data, subset, cap_set_list, cap_score_file, and
    # cap_int_score_file are module-level globals assigned in __main__.
    columns = df.columns
    columns2 = df2.columns

    # For each capability, hold one accumulated score per run
    cap_scores = {k: [0.0] * args.num_run for k in columns[:-2]}
    counter["total"] = len_data  # record the total number of samples

    # Score dictionary for the capability combinations
    cap_scores2 = {k: [0.0] * args.num_run for k in columns2[:-2]}
    counter2 = {columns2[i]: cap_set_counter[i] for i in range(len(cap_set_counter))}
    counter2["total"] = len_data

    # Walk over the grading results and accumulate scores
    for k, v in grade_results.items():
        # If a subset is given, only process entries inside the subset
        if subset is not None and k not in subset:
            continue

        # Accumulate the score of each run
        for i in range(args.num_run):
            score = v["score"][i]
            caps = set(data[k]["capability"])  # capabilities of this sample

            # Accumulate per-capability scores
            for c in caps:
                cap_scores[c][i] += score
            cap_scores["total"][i] += score

            # Find the index of this capability combination and accumulate
            index = cap_set_list.index(caps)
            cap_scores2[cap_set_names[index]][i] += score
            cap_scores2["total"][i] += score

    # Normalize per-capability scores to percentages
    for k, v in cap_scores.items():
        cap_scores[k] = np.array(v) / counter[k] * 100

    # Standard deviation of the total score and the per-run values
    std = round(cap_scores["total"].std(), args.decimal_places)
    total_copy = cap_scores["total"].copy()
    runs = str(list(np.round(total_copy, args.decimal_places)))

    # Average each capability's score over the runs
    for k, v in cap_scores.items():
        cap_scores[k] = round(v.mean(), args.decimal_places)

    # Attach the standard deviation and per-run record to the row
    cap_scores["std"] = std
    cap_scores["runs"] = runs
    df.loc[model] = cap_scores

    # Average the capability-combination scores
    for k, v in cap_scores2.items():
        cap_scores2[k] = round(
            np.mean(np.array(v) / counter2[k] * 100), args.decimal_places
        )
    cap_scores2["std"] = std
    cap_scores2["runs"] = runs
    df2.loc[model] = cap_scores2

    # Export the results to CSV files
    df.to_csv(cap_score_file)
    df2.to_csv(cap_int_score_file)

    return df, df2
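The normalization above divides each run's accumulated score by the number of samples carrying the capability and scales to a percentage; mean and standard deviation are then taken across runs. A minimal standalone sketch of the same arithmetic for the "total" column (the sums and sample count here are made up):

import numpy as np

total_sums = np.array([130.8, 132.5, 131.2])  # hypothetical summed scores of 3 runs
len_data = 218                                # hypothetical number of samples

per_run = total_sums / len_data * 100         # per-run overall score in %
print(round(per_run.mean(), 1))               # the "total" cell of the row
print(round(per_run.std(), 1))                # the "std" cell of the row
print(list(np.round(per_run, 1)))             # the "runs" cell of the row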

(7) The following code is the program entry point. It performs these steps:

  1. Parse command-line arguments via arg_parser().
  2. Validate the inputs: the result file must exist and must be a JSON file.
  3. Initialize the OpenAI client from the given API key or the OPENAI_API_KEY environment variable.
  4. Load the metadata via load_metadata(args), obtaining the capability and evaluation data structures.
  5. Build the output paths via get_file_names for the grading file and the capability score files.
  6. Run the evaluation via runs(), grading the model's outputs over multiple runs and saving the results.
  7. Export the per-capability and per-combination score CSV files via export_result.
  8. Print the score tables and the paths where results were saved.
if __name__ == "__main__":
    # Parse command-line arguments
    args = arg_parser()

    # Resolve the OpenAI API key
    if args.openai_api_key:
        OPENAI_API_KEY = args.openai_api_key
    else:
        OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

    # Initialize the OpenAI client
    client = OpenAI(
        api_key=OPENAI_API_KEY
    )

    # Validate that the result file exists and is in JSON format
    if os.path.exists(args.result_file) is False:
        raise ValueError("The result file does not exist")
    if not args.result_file.endswith(('.json', '.JSON')):
        raise ValueError("The result file should be in JSON format")

    # Extract the model name from the result file name
    model = pathlib.Path(args.result_file).stem

    # Load the metadata
    metadata = load_metadata(args)
    (
        subset,
        subset_name,
        data,
        counter,
        cap_set_list,
        cap_set_counter,
        len_data,
        df,
        df2,
        cap_set_names,
    ) = metadata

    # Build the paths of the grading file and the capability score files
    file_names = get_file_names(args, model, subset_name)
    (
        grade_file,
        cap_score_file,
        cap_int_score_file,
    ) = file_names

    # Grade the model outputs
    grade_results = runs(
        args,
        grade_file,
        data,
        len_data,
        subset,
    )

    # Export the capability score results
    df, df2 = export_result(
        args,
        model,
        df,
        df2,
        grade_results,
        data,
        cap_set_counter,
        cap_set_names,
    )

    # Print the results and the save paths
    print(df)
    print("\n")
    print(df2)
    print("\n")
    print(f"评分结果已保存于:\n{grade_file}\n{cap_score_file}\n{cap_int_score_file}")
