LangChain 77: LangSmith From Beginner to Mastery, Part 2

LangChain series articles

  1. LangChain 60: Deep Dive into LangChain Expression Language (LCEL) 23, passing parameters through multiple chains
  2. LangChain 61: Deep Dive into LangChain Expression Language (LCEL) 24, passing parameters through multiple chains
  3. LangChain 62: Deep Dive into LangChain Expression Language (LCEL) 25, agents
  4. LangChain 63: Deep Dive into LangChain Expression Language (LCEL) 26, generating and executing code
  5. LangChain 64: Deep Dive into LangChain Expression Language (LCEL) 27, adding Moderation
  6. LangChain 65: Deep Dive into LangChain Expression Language (LCEL) 28, cosine-similarity Router
  7. LangChain 66: Deep Dive into LangChain Expression Language (LCEL) 29, managing the prompt window size
  8. LangChain 67: Deep Dive into LangChain Expression Language (LCEL) 30, calling search-engine tools
  9. LangChain 68: LLM Deployment, deployment options for large language models
  10. LangChain 69: Getting started with the Pinecone vector database
  11. LangChain 70: Evaluation, measuring performance and integrity on diverse data
  12. LangChain 71: String Evaluators, measuring performance and integrity on diverse data
  13. LangChain 72: How a reference changes the result, String Evaluators
  14. LangChain 73: Scoring results against a reference, Scoring Evaluator
  15. LangChain 74: Helpful or harmful, Scoring Evaluator
  16. LangChain 75: Build your own OpenAI + LangChain web app
  17. LangChain 76: LangSmith From Beginner to Mastery, Part 1


Evaluating the Agent

Beyond logging runs, LangSmith also lets you test and evaluate your LLM applications.

In this section, you will use LangSmith to create a benchmark dataset and run AI-assisted evaluators on an agent. You will do this in a few steps:

  1. Create a dataset
  2. Initialize a new agent to benchmark
  3. Configure evaluators to grade the agent's outputs
  4. Run the agent over the dataset and evaluate the results

1. Create a LangSmith Dataset

Below, we use the LangSmith client to create a dataset from the list of input questions and labels above. You will use these later to measure the performance of a new agent. A dataset is a collection of examples, which are nothing more than input-output pairs you can use as test cases for your application.

For more information on datasets, including how to create them from CSVs or other files, or how to create them in the platform, please refer to the LangSmith documentation.
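The snippet below refers to inputs, unique_id, and client, which were set up in Part 1 of this series. For readers starting here, a minimal sketch of that setup might look like the following; the input questions are illustrative placeholders chosen to pair with the reference answers below, so substitute the ones you used in Part 1.

import uuid

from langsmith import Client

# A short unique suffix so repeated runs create fresh datasets and projects.
unique_id = uuid.uuid4().hex[0:8]

# Reads the LANGCHAIN_API_KEY environment variable.
client = Client()

# Illustrative questions only; use the questions from Part 1 of this series.
inputs = [
    "What is LangChain?",
    "What's LangSmith?",
    "When was Llama-v2 released?",
    "What is the langsmith cookbook?",
    "When was the langsmith cookbook first released?",
]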

outputs = [
    "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    "LangSmith is a unified platform for debugging, testing, and monitoring language model applications and agents powered by LangChain",
    "July 18, 2023",
    "The langsmith cookbook is a github repository containing detailed examples of how to use LangSmith to debug, evaluate, and monitor large language model-powered applications.",
    "September 5, 2023",
]
dataset_name = f"agent-qa-{unique_id}"

dataset = client.create_dataset(
    dataset_name,
    description="An example dataset of questions over the LangSmith documentation.",
)

client.create_examples(
    inputs=[{"input": query} for query in inputs],
    outputs=[{"output": answer} for answer in outputs],
    dataset_id=dataset.id,
)
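As an optional sanity check (not part of the original walkthrough), you can list the examples back out of the newly created dataset with the LangSmith client:

# Optional: confirm the examples landed in the new dataset.
examples = list(client.list_examples(dataset_id=dataset.id))
print(f"Created {len(examples)} examples in dataset '{dataset_name}'")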

2. Initialize a New Agent to Benchmark

LangSmith lets you evaluate any LLM, chain, agent, or even a custom function. Conversational agents are stateful (they have memory); to make sure that state is not shared between dataset runs, we will pass in a chain_factory (also known as a constructor) function that initializes a fresh agent for each call.

In this case, we will test an agent that uses OpenAI's function-calling endpoints.

from langchain import hub
from langchain.agents import AgentExecutor, AgentType, initialize_agent, load_tools
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain_openai import ChatOpenAI


# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def create_agent(prompt, llm_with_tools):
    runnable_agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    return AgentExecutor(agent=runnable_agent, tools=tools, handle_parsing_errors=True)
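Note that create_agent closes over a global tools list and expects an llm_with_tools argument; both were prepared in Part 1. A minimal sketch of that setup, assuming a single DuckDuckGo search tool and gpt-3.5-turbo (your actual tools and model may differ):

from langchain_community.tools import DuckDuckGoSearchResults
from langchain_core.utils.function_calling import convert_to_openai_function
from langchain_openai import ChatOpenAI

# Illustrative tool set; requires the duckduckgo-search package.
tools = [DuckDuckGoSearchResults(name="duck_duck_go")]

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
# Bind the tools as OpenAI function definitions so the model can invoke them.
llm_with_tools = llm.bind(
    functions=[convert_to_openai_function(t) for t in tools]
)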

3. Configure Evaluation

Manually comparing chain results in the UI is effective, but it can be time-consuming. It helps to use automated metrics and AI-assisted feedback to evaluate your component's performance.

Next, we will create a custom run evaluator that logs a heuristic evaluation.

Heuristic evaluators

from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run


@run_evaluator
def check_not_idk(run: Run, example: Example):
    """Illustration of a custom evaluator."""
    agent_response = run.outputs["output"]
    if "don't know" in agent_response or "not sure" in agent_response:
        score = 0
    else:
        score = 1
    # You can access the dataset labels in example.outputs[key]
    # You can also access the model inputs in run.inputs[key]
    return EvaluationResult(
        key="not_uncertain",
        score=score,
    )

Below, we will configure the evaluation with the custom evaluator above, plus some pre-implemented run evaluators that do the following:
- Compare the result against the ground-truth label.
- Measure semantic (dis)similarity using embedding distance.
- Evaluate "aspects" of the agent's response in a reference-free manner using custom criteria.

For a longer discussion of how to choose the appropriate evaluators for your use case, and how to create your own custom evaluators, please refer to the LangSmith documentation.

from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: EvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria.
        # You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # The LabeledScoreString evaluator outputs a score on a scale from 1-10.
        # You can use default criteria or write your own rubric
        RunEvalConfig.LabeledScoreString(
            {
                "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
            },
            normalize_by=10,
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[check_not_idk],
)
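The LabeledCriteria and LabeledScoreString evaluators above grade against the reference answers. For the reference-free "aspect" evaluation mentioned earlier, you can also configure a plain criteria evaluator with your own rubric; the criterion below is illustrative and not part of the original walkthrough:

# Reference-free criteria: graded on the input and prediction only, no label needed.
reference_free_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.Criteria(
            {
                "grounded": "Does the response answer from the LangSmith "
                "documentation rather than speculate?"
            }
        ),
    ],
)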

4. Run the Agent and Evaluators

Use the run_on_dataset (or asynchronous arun_on_dataset) function to evaluate your model. This will:

  1. Fetch example rows from the specified dataset.
  2. Run your agent (or any custom function) on each example.
  3. Apply the evaluators to the resulting run traces and the corresponding reference examples to generate automated feedback.

The results will be visible in the LangSmith app.

from langchain import hub

# We will test this version of the prompt
prompt = hub.pull("wfh/langsmith-agent-prompt:798e7324")
import functools

from langchain.smith import arun_on_dataset, run_on_dataset

chain_results = run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=functools.partial(
        create_agent, prompt=prompt, llm_with_tools=llm_with_tools
    ),
    evaluation=evaluation_config,
    verbose=True,
    client=client,
    project_name=f"runnable-agent-test-5d466cbc-{unique_id}",
    # Project metadata communicates the experiment parameters,
    # Useful for reviewing the test results
    project_metadata={
        "env": "testing-notebook",
        "model": "gpt-3.5-turbo",
        "prompt": "5d466cbc",
    },
)

# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.
View the evaluation results for project 'runnable-agent-test-5d466cbc-97e1' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/14d8a382-3c0f-48e7-b212-33489ee8a13e/compare?selectedSessions=62f0a0c0-73bf-420c-a907-2c6b2f4625c4

View all tests for Dataset agent-qa-e2d24144 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/14d8a382-3c0f-48e7-b212-33489ee8a13e
[------------------------------------------------->] 5/5
Server error caused failure to patch https://api.smith.langchain.com/runs/e9d26fe6-bf4a-4f88-81c5-f5d0f70977f0 in LangSmith API. HTTPError('500 Server Error: Internal Server Error for url: https://api.smith.langchain.com/runs/e9d26fe6-bf4a-4f88-81c5-f5d0f70977f0', '{"detail":"Internal server error"}')

Experiment results:
(Screenshot: evaluation results for the test run in the LangSmith UI)
Review the test results
You can review the test result traces in the UI by clicking the URL in the output above, or by navigating to the "Testing & Datasets" page in LangSmith and opening the "agent-qa-{unique_id}" dataset.
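Besides the web UI, you can also inspect the returned results in code. In recent langchain versions, run_on_dataset returns a TestResult dict; the helpers below are assumed to be available in your version and require pandas:

# Aggregate feedback scores per evaluator, and a row-per-example dataframe.
print(chain_results.get_aggregate_feedback())
df = chain_results.to_dataframe()
print(df.head())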

Code

https://github.com/zgpeace/pets-name-langchain/tree/develop

References

https://python.langchain.com/docs/langsmith/walkthrough
