如何通过轨迹评估来评估你的智能体

许多智能体的行为只有在使用真正的 LLM 时才会显现，例如智能体决定调用哪个工具、如何格式化响应，或者提示修改是否影响整个执行轨迹。LangChain 的 agentevals 包提供了专门用于测试智能体轨迹和实时模型的评估器。

本指南涵盖了开源的 LangChain agentevals 包，该包与 LangSmith 集成以进行轨迹评估。

AgentEvals 允许您通过执行轨迹匹配或使用LLM 评判者来评估智能体的轨迹（确切的消息序列，包括工具调用）：

轨迹匹配

为给定输入硬编码参考轨迹，并通过逐步比较来验证运行。适用于测试明确定义的工作流，您可以知道预期的行为。当您对应该调用哪些工具以及调用顺序有具体期望时使用。此方法具有确定性、快速且成本效益高，因为它不需要额外的 LLM 调用。

LLM 作为评判者

使用 LLM 定性验证智能体的执行轨迹。“评判者”LLM 根据提示评分标准（可以包括参考轨迹）审查智能体的决策。更灵活，可以评估效率和适当性等细微方面，但需要 LLM 调用且确定性较低。当您希望评估智能体轨迹的整体质量和合理性，而没有严格的工具调用或排序要求时使用。

安装 AgentEvals

pip install agentevals

或者，直接克隆 AgentEvals 仓库。

轨迹匹配评估器

AgentEvals 在 Python 中提供 create_trajectory_match_evaluator 函数，在 TypeScript 中提供 createTrajectoryMatchEvaluator 函数，用于将智能体的轨迹与参考轨迹进行匹配。您可以使用以下模式：

模式	描述	用例
`严格`	消息和工具调用按相同顺序精确匹配	测试特定序列（例如，授权前的策略查找）
`无序`	允许相同工具调用以任意顺序出现	在顺序无关紧要时验证信息检索
`子集`	智能体仅调用参考中的工具（没有额外的）	确保智能体不超过预期范围
`超集`	智能体至少调用参考工具（允许额外的）	验证是否采取了最低要求的操作

严格匹配

strict 模式确保轨迹包含相同顺序的相同消息和相同的工具调用，尽管它允许消息内容存在差异。这在您需要强制执行特定操作序列时非常有用，例如在授权操作之前要求进行策略查找。

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4o", tools=[get_weather])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="strict",  
)  

def test_weather_tool_called_strict():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })

    reference_trajectory = [
        HumanMessage(content="What's the weather in San Francisco?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
        ]),
        ToolMessage(content="It's 75 degrees and sunny in San Francisco.", tool_call_id="call_1"),
        AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )
    # {
    #     'key': 'trajectory_strict_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

无序匹配

unordered 模式允许以任意顺序调用相同的工具，这在您希望验证是否调用了正确的工具集但不在乎顺序时很有用。例如，智能体可能需要检查城市的天气和事件，但顺序无关紧要。

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_events(city: str):
    """Get events happening in a city."""
    return f"Concert at the park in {city} tonight."

agent = create_agent("gpt-4o", tools=[get_weather, get_events])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="unordered",  
)  

def test_multiple_tools_any_order():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's happening in SF today?")]
    })

    # Reference shows tools called in different order than actual execution
    reference_trajectory = [
        HumanMessage(content="What's happening in SF today?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_events", "args": {"city": "SF"}},
            {"id": "call_2", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="Concert at the park in SF tonight.", tool_call_id="call_1"),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_2"),
        AIMessage(content="Today in SF: 75 degrees and sunny with a concert at the park tonight."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_unordered_match',
    #     'score': True,
    # }
    assert evaluation["score"] is True

子集和超集匹配

superset 和 subset 模式侧重于调用哪些工具，而不是工具调用的顺序，允许您控制智能体的工具调用必须与参考对齐的严格程度。

当您希望验证执行中调用了一些关键工具，但允许智能体调用其他工具时，请使用 superset 模式。智能体的轨迹必须至少包含参考轨迹中的所有工具调用，并且可能包含超出参考的其他工具调用。
使用 subset 模式通过验证智能体没有调用参考之外的任何不相关或不必要的工具来确保智能体效率。智能体的轨迹必须只包含参考轨迹中出现的工具调用。

以下示例演示了 superset 模式，其中参考轨迹只要求 get_weather 工具，但智能体可以调用其他工具

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_detailed_forecast(city: str):
    """Get detailed weather forecast for a city."""
    return f"Detailed forecast for {city}: sunny all week."

agent = create_agent("gpt-4o", tools=[get_weather, get_detailed_forecast])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="superset",  
)  

def test_agent_calls_required_tools_plus_extra():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Boston?")]
    })

    # Reference only requires get_weather, but agent may call additional tools
    reference_trajectory = [
        HumanMessage(content="What's the weather in Boston?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "Boston"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in Boston.", tool_call_id="call_1"),
        AIMessage(content="The weather in Boston is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_superset_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True

您还可以通过设置 tool_args_match_mode (Python) 或 toolArgsMatchMode (TypeScript) 属性，以及 tool_args_match_overrides (Python) 或 toolArgsMatchOverrides (TypeScript) 属性，来自定义评估器如何考虑实际轨迹与参考轨迹中工具调用之间的相等性。默认情况下，只有具有相同参数的相同工具调用才被视为相等。访问存储库获取更多详细信息。

LLM 作为评判者评估器

本节涵盖了 agentevals 包中特定于轨迹的 LLM 作为评判者评估器。有关 LangSmith 中通用 LLM 作为评判者评估器，请参阅 LLM 作为评判者评估器。

您还可以使用 LLM 来评估智能体的执行路径。与轨迹匹配评估器不同，它不需要参考轨迹，但如果可用可以提供。

无参考轨迹

from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4o", tools=[get_weather])

evaluator = create_trajectory_llm_as_judge(  
    model="openai:o3-mini",  
    prompt=TRAJECTORY_ACCURACY_PROMPT,  
)  

def test_trajectory_quality():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Seattle?")]
    })

    evaluation = evaluator(
        outputs=result["messages"],
    )
    # {
    #     'key': 'trajectory_accuracy',
    #     'score': True,
    #     'comment': 'The provided agent trajectory is reasonable...'
    # }
    assert evaluation["score"] is True

有参考轨迹

如果您有参考轨迹，您可以向提示中添加一个额外的变量并传入参考轨迹。下面，我们使用预构建的 TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE 提示并配置 reference_outputs 变量

evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
)
evaluation = judge_with_reference(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)

有关 LLM 如何评估轨迹的更多可配置性，请访问存储库。

异步支持 (Python)

所有 agentevals 评估器都支持 Python asyncio。对于使用工厂函数的评估器，通过在函数名称中的 create_ 之后添加 async 即可获得异步版本。以下是使用异步评判者和评估器的示例：

from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

async_judge = create_async_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

async_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

async def test_async_evaluation():
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="What's the weather?")]
    })

    evaluation = await async_judge(outputs=result["messages"])
    assert evaluation["score"] is True

在 GitHub 上编辑此页面源文件。

以编程方式连接这些文档到 Claude、VSCode 等，通过 MCP 获取实时答案。

数据集

设置评估

分析实验结果

标注与人工反馈

常见数据类型

如何通过轨迹评估来评估你的智能体

轨迹匹配

LLM 作为评判者

安装 AgentEvals

轨迹匹配评估器

严格匹配

无序匹配

子集和超集匹配

LLM 作为评判者评估器

无参考轨迹

有参考轨迹

异步支持 (Python)

数据集

设置评估

分析实验结果

标注与人工反馈

常见数据类型

轨迹匹配

LLM 作为评判者

​安装 AgentEvals

​轨迹匹配评估器

​严格匹配

​无序匹配

​子集和超集匹配

​LLM 作为评判者评估器

​无参考轨迹

​有参考轨迹

​异步支持 (Python)

安装 AgentEvals

轨迹匹配评估器

严格匹配

无序匹配

子集和超集匹配

LLM 作为评判者评估器

无参考轨迹

有参考轨迹

异步支持 (Python)