Agentic applications let an LLM decide its own next steps to solve a problem. That flexibility is powerful, but the black-box nature of models makes it hard to predict how adjusting one part of an agent will affect the rest. To build production-ready agents, thorough testing is essential. There are a few ways to test your agent:
  • Unit tests exercise small, deterministic pieces of your agent in isolation against in-memory mocks, so you can assert exact behaviors quickly and deterministically.
  • Integration tests run the agent with real network calls to confirm that components work together, that credentials and schemas line up, and that latency is acceptable.
Agentic applications lean more heavily on integration tests, because they chain multiple components together and must contend with flakiness caused by the non-deterministic nature of LLMs.
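For contrast, a unit test of a single tool never needs to touch an LLM. The sketch below is illustrative only: it uses Vitest and a hypothetical get_weather tool like the ones in the examples further down, invokes the tool directly, and asserts on its exact output.
import { describe, expect, it } from "vitest";
import { tool } from "langchain";
import * as z from "zod";

// A small, deterministic unit: a tool with no network or LLM dependency.
const getWeather = tool(
  async ({ city }: { city: string }) => `It's 75 degrees and sunny in ${city}.`,
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

describe("get_weather tool", () => {
  it("formats the report for the requested city", async () => {
    // Invoke the tool directly: no agent, no model call.
    const output = await getWeather.invoke({ city: "San Francisco" });
    expect(output).toBe("It's 75 degrees and sunny in San Francisco.");
  });
});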

Integration tests

Many agent behaviors only emerge when using a real LLM, such as which tool the agent decides to call, how it formats a response, or whether a prompt modification affects the entire execution trajectory. LangChain's agentevals package provides evaluators designed specifically for testing agent trajectories with live models. AgentEvals makes it easy to evaluate your agent's trajectory (the exact sequence of messages, including tool calls) either by trajectory matching or with an LLM judge:

Trajectory match

Hard-code a reference trajectory for a given input and validate the run with a step-by-step comparison. Good for testing well-defined workflows where you know the expected behavior. Use it when you have specific expectations about which tools should be called and in what order. This approach is deterministic, fast, and cost-effective because it requires no additional LLM calls.

LLM-as-judge

Use an LLM to qualitatively validate your agent's execution trajectory. The "judge" LLM reviews the agent's decisions against a prompt rubric (which can include a reference trajectory). More flexible and able to assess nuanced aspects such as efficiency and appropriateness, but it requires LLM calls and is less deterministic. Use it when you want to evaluate the overall quality and reasonableness of an agent's trajectory without strict tool-call or ordering requirements.

Install AgentEvals

npm install agentevals @langchain/core
Alternatively, clone the AgentEvals repository directly.
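For example, assuming the repository lives at github.com/langchain-ai/agentevals:
git clone https://github.com/langchain-ai/agentevals.git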

Trajectory match evaluator

AgentEvals provides the createTrajectoryMatchEvaluator function for matching your agent's trajectory against a reference trajectory. There are four modes to choose from:
| Mode | Description | Use case |
| --- | --- | --- |
| strict | Messages and tool calls match exactly, in the same order | Testing specific sequences (e.g., a policy lookup before authorization) |
| unordered | The same tool calls are allowed in any order | Verifying information retrieval when order doesn't matter |
| subset | The agent calls only tools that appear in the reference (no extras) | Ensuring the agent doesn't exceed the expected scope |
| superset | The agent calls at least the reference tools (extras allowed) | Verifying that the minimum required actions were taken |
strict mode ensures the trajectory contains identical messages with identical tool calls in the same order, though it does allow differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before an authorization action.
import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain"
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({
      city: z.string(),
    }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "strict",  
});  

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  // {
  //     'key': 'trajectory_strict_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
unordered mode allows the same tool calls to be made in any order, which is useful when you want to verify that specific information was retrieved but don't care about ordering. For example, an agent might need to check both the weather and the events for a city, but the order doesn't matter.
import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain"
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather, getEvents]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "unordered",  
});  

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")]
  });

  // Reference shows tools called in different order than actual execution
  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_unordered_match',
  //     'score': true,
  // }
  expect(evaluation.score).toBe(true);
}
The superset and subset modes match partial trajectories. superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. subset mode ensures the agent called no tools beyond those in the reference trajectory.
import { createAgent } from "langchain"
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather, getDetailedForecast]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "superset",  
});  

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  // Reference only requires getWeather, but agent may call additional tools
  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_superset_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
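A subset-mode evaluator is configured the same way. As a minimal sketch (reusing the agent and the reference trajectory from the superset example above), it only passes if the agent called no tools beyond those in the reference:
const subsetEvaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "subset",
});

async function testAgentStaysWithinReferenceTools() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  const evaluation = await subsetEvaluator({
    outputs: result.messages,
    // Same reference messages as in the superset example above.
    referenceOutputs: referenceTrajectory,
  });
  // Passes only if the agent called no tool absent from the reference;
  // an extra get_detailed_forecast call, for instance, would make the score false.
  expect(evaluation.score).toBe(true);
}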
You can also set the toolArgsMatchMode property and/or toolArgsMatchOverrides to customize how the evaluator decides whether tool calls in the actual trajectory and the reference trajectory are equal. By default, only calls to the same tool with identical arguments are considered equal. Visit the repository for more details.
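As an illustrative sketch only (the accepted option values and the override signature are assumptions here; check them against the AgentEvals README for the version you install), the evaluator below ignores tool-call arguments everywhere except get_weather, where a custom comparator treats city names case-insensitively:
const lenientEvaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
  // Assumed option value: compare tool calls by name only, ignoring arguments.
  toolArgsMatchMode: "ignore",
  toolArgsMatchOverrides: {
    // Assumed override shape: a per-tool comparator that receives the actual and
    // reference argument objects and returns whether they should count as equal.
    get_weather: (actual: Record<string, any>, reference: Record<string, any>) =>
      typeof actual.city === "string" &&
      typeof reference.city === "string" &&
      actual.city.toLowerCase() === reference.city.toLowerCase(),
  },
});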

LLM-as-judge

You can also use an LLM to evaluate your agent's execution path with the createTrajectoryLLMAsJudge function. Unlike the trajectory match evaluators, it doesn't require a reference trajectory, though you can supply one if available.
import { createAgent } from "langchain"
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather]
});

const evaluator = createTrajectoryLLMAsJudge({  
  model: "openai:o3-mini",  
  prompt: TRAJECTORY_ACCURACY_PROMPT,  
});  

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  // {
  //     'key': 'trajectory_accuracy',
  //     'score': true,
  //     'comment': 'The provided agent trajectory is reasonable...'
  // }
  expect(evaluation.score).toBe(true);
}
If you have a reference trajectory, you can add an extra variable to the prompt and pass the reference trajectory in. Below, we use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt and configure the reference_outputs variable:
import { TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});
For more configurability over how the LLM evaluates trajectories, visit the repository.

LangSmith integration

To track experiments over time, you can log evaluator results to LangSmith, a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tooling. First, set up LangSmith by setting the required environment variables:
export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"
LangSmith offers two main approaches to running evaluations: the Vitest/Jest integration and the evaluate function.
import * as ls from "langsmith/vitest";
// import * as ls from "langsmith/jest";

import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
import { HumanMessage, AIMessage, ToolMessage } from "langchain";
// `agent` below refers to the createAgent instance from the earlier examples.

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

ls.describe("trajectory accuracy", () => {
  ls.test("accurate trajectory", {
    inputs: {
      messages: [
        {
          role: "user",
          content: "What is the weather in SF?"
        }
      ]
    },
    referenceOutputs: {
      messages: [
        new HumanMessage("What is the weather in SF?"),
        new AIMessage({
          content: "",
          tool_calls: [
            { id: "call_1", name: "get_weather", args: { city: "SF" } }
          ]
        }),
        new ToolMessage({
          content: "It's 75 degrees and sunny in SF.",
          tool_call_id: "call_1"
        }),
        new AIMessage("The weather in SF is 75 degrees and sunny."),
      ],
    },
  }, async ({ inputs, referenceOutputs }) => {
    const result = await agent.invoke({
      messages: [new HumanMessage("What is the weather in SF?")]
    });

    ls.logOutputs({ messages: result.messages });

    await trajectoryEvaluator({
      inputs,
      outputs: result.messages,
      referenceOutputs,
    });
  });
});
Run the evaluation with your test runner:
vitest run test_trajectory.eval.ts
# or
jest test_trajectory.eval.ts
Alternatively, you can create a dataset in LangSmith and use the evaluate function:
import { evaluate } from "langsmith/evaluation";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

// `agent` is the createAgent instance from the earlier examples.
async function runAgent(inputs: any) {
  const result = await agent.invoke(inputs);
  return result.messages;
}

await evaluate(
  runAgent,
  {
    data: "your_dataset_name",
    evaluators: [trajectoryEvaluator],
  }
);
Results are automatically logged to LangSmith.
To learn more about evaluating your agents, see the LangSmith documentation.
