LangChain 73 Scoring Predictions Against a Reference: Scoring Evaluator

LangChain series articles

  1. LangChain 60 Deep dive into LangChain Expression Language (LCEL) 23: passing parameters through multiple chains
  2. LangChain 61 Deep dive into LangChain Expression Language (LCEL) 24: passing parameters through multiple chains
  3. LangChain 62 Deep dive into LangChain Expression Language (LCEL) 25: agents
  4. LangChain 63 Deep dive into LangChain Expression Language (LCEL) 26: generating and executing code
  5. LangChain 64 Deep dive into LangChain Expression Language (LCEL) 27: adding Moderation
  6. LangChain 65 Deep dive into LangChain Expression Language (LCEL) 28: cosine-similarity Router
  7. LangChain 66 Deep dive into LangChain Expression Language (LCEL) 29: managing the prompt window size
  8. LangChain 67 Deep dive into LangChain Expression Language (LCEL) 30: calling search-engine tools
  9. LangChain 68 LLM Deployment: deployment options for large language models
  10. LangChain 69 Getting started with the Pinecone vector database
  11. LangChain 70 Evaluation: measuring performance and integrity on diverse data
  12. LangChain 71 String Evaluators: measuring performance and integrity on diverse data
  13. LangChain 72 How the reference changes the result: String Evaluators


1. Scoring Evaluator

The scoring evaluator instructs a language model to assess your model's predictions on a specified scale (default 1-10) according to your custom criteria or rubric. This gives a nuanced evaluation rather than a simplistic binary score, which helps you evaluate a model against a tailored rubric and compare model performance on specific tasks.

Before we start, note that any specific grade from a large language model should be taken with a grain of salt. A prediction scored "8" may not be meaningfully better than one scored "7".

1.1 Usage with ground truth

For a thorough overview, refer to the LabeledScoreStringEvalChain documentation.

Below is an example of using LabeledScoreStringEvalChain with the default prompt:

from dotenv import load_dotenv  # helper that loads environment variables from a .env file
load_dotenv()  # load OPENAI_API_KEY and friends into the environment

# from langchain.globals import set_debug  # enable LangChain's debug mode if needed
# set_debug(True)

from langchain.evaluation import load_evaluator
from langchain.chat_models import ChatOpenAI

evaluator = load_evaluator("labeled_score_string", llm=ChatOpenAI(model="gpt-3.5-turbo"))
# Correct
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser's third drawer.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Correct: ", eval_result)

Output

(.venv)  ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py                                                         ⏎
Correct:  {'reasoning': "Explanation:\nThe assistant's response is helpful and relevant to the user's question. It provides a concise and accurate answer, directing the user to find their socks in the third drawer of the dresser. The response is correct and factual, as it accurately refers to the location of the socks. While the response does not demonstrate depth of thought, it effectively addresses the user's query.\n\nRating: [[8]]", 'score': 8}
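
The evaluator returns a plain dict with a 'reasoning' string and a numeric 'score', as seen above. A minimal sketch of reading those fields in code (the 7-point acceptance threshold is only an illustrative assumption, not part of LangChain):

score = eval_result["score"]          # e.g. 8
reasoning = eval_result["reasoning"]  # the model's written justification

# Illustrative threshold: treat ratings of 7 or higher as acceptable.
if score >= 7:
    print(f"Prediction accepted (score={score})")
else:
    print(f"Prediction flagged for review (score={score})")
print(reasoning)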

1.2 Using a full rubric

The evaluator works more effectively for your application's specific context if you provide a full rubric. Below is an example using accuracy as the criterion.

accuracy_criteria = {
    "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}

evaluator = load_evaluator(
    "labeled_score_string",
    criteria=accuracy_criteria,
    llm=ChatOpenAI(model="gpt-4"),
)

Run the evaluation

# Correct
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser's third drawer.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Correct: ", eval_result)

With the full rubric, the score rises to 10:

(.venv)  ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct:  {'reasoning': 'Explanation: The assistant accurately states that the socks can be found in the third drawer of the dresser, which aligns perfectly with the reference. There are no errors or omissions in the response.\n\nRating: [[10]]', 'score': 10}

1.3 Correct but lacking detail

# Correct but lacking information
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Correct but lacking information >>>", eval_result)

Output

(.venv)  ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct but lacking information >>> {'reasoning': "Explanation: The assistant's response is partially accurate as it correctly mentions that the socks can be found in the dresser. However, it does not provide specific information about the location of the socks in the dresser. \n\nRating: [[7]]", 'score': 7}

1.4 Incorrect prediction

# Incorrect
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dog's bed.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Incorrect >>>", eval_result)

Output

(.venv)  ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Incorrect >>> {'reasoning': "The AI assistant's response is completely unrelated to the reference. The reference states that the socks are in the third drawer in the dresser, while the assistant suggests they can be found in the dog's bed. \n\nRating: [[1]]", 'score': 1}
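
To see how the rubric separates the three predictions, here is a minimal sketch that loops over them with the same evaluator built from accuracy_criteria above (the labels are only for display):

# Reuses the `evaluator` created with accuracy_criteria above.
test_cases = [
    ("exact", "You can find them in the dresser's third drawer."),
    ("vague", "You can find them in the dresser."),
    ("wrong", "You can find them in the dog's bed."),
]

for label, prediction in test_cases:
    result = evaluator.evaluate_strings(
        prediction=prediction,
        reference="The socks are in the third drawer in the dresser",
        input="Where are my socks?",
    )
    print(f"{label}: score={result['score']}")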

1.5 Normalizing the score by a maximum value

You can also have the evaluator normalize the score for you, if you want to use these values on a scale comparable to other evaluators.

evaluator = load_evaluator(
    "labeled_score_string",
    criteria=accuracy_criteria,
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    normalize_by=10,
)
# Correct but lacking information
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Correct but lacking information >>>>", eval_result)

Output

(.venv)~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct but lacking information >>>>  {'reasoning': "Explanation: \nThe AI assistant's response is partially relevant to the user's question. It mentions the location of the socks as being in the dresser, which aligns with the ground truth. However, it lacks the specific information that the socks are in the third drawer. Overall, the response provides some relevant information but has a minor error in omitting the specific drawer location. \n\nRating: [[7]]", 'score': 0.7}
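
The normalized score is simply the raw 1-10 rating divided by normalize_by, so the [[7]] above becomes 0.7. A quick sanity check under that assumption:

raw_rating = 7  # the rating the model produced above
assert abs(eval_result["score"] - raw_rating / 10) < 1e-9  # normalize_by=10 -> 0.7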

Code

https://github.com/zgpeace/pets-name-langchain/tree/develop

References

https://python.langchain.com/docs/guides/evaluation/string/scoring_eval_chain
