This guide shows you how to run an evaluation of an LLM application using the evaluate() method in the LangSmith SDK.
For larger evaluation jobs in Python, we recommend using aevaluate(), the asynchronous version of evaluate(). It is still worth reading this guide first, since the two share the same interface, before reading the guide on running evaluations asynchronously. In JS/TS, evaluate() is already asynchronous, so no separate method is needed. When running large jobs, it is also important to configure the max_concurrency/maxConcurrency argument, which parallelizes the evaluation by effectively splitting the dataset across threads.
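As a rough, hedged sketch of what an async run looks like in Python (it reuses the toxicity_classifier target and correct evaluator defined later in this guide, and assumes a dataset named "Toxic Queries"):

import asyncio

from langsmith import aevaluate

async def run_eval():
    # Wrap the synchronous toxicity_classifier (defined below) in an async target.
    async def target(inputs: dict) -> dict:
        return toxicity_classifier(inputs)

    return await aevaluate(
        target,
        data="Toxic Queries",   # dataset name, same as with evaluate()
        evaluators=[correct],   # same evaluators as with evaluate()
        max_concurrency=4,      # evaluate up to 4 examples concurrently
    )

results = asyncio.run(run_eval())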

Define an application

First, we need an application to evaluate. Let's create a simple toxicity classifier for this example.
from langsmith import traceable, wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def toxicity_classifier(inputs: dict) -> dict:
    instructions = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return {"class": result.choices[0].message.content}
We've optionally enabled tracing to capture the inputs and outputs of each step in the pipeline. To learn how to annotate your code for tracing, see this guide.

Create or select a dataset

We need a dataset to evaluate our application on. Our dataset will contain labeled examples of toxic and non-toxic text.
Requires langsmith>=0.3.13
from langsmith import Client
ls_client = Client()

examples = [
  {
    "inputs": {"text": "Shut up, idiot"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]

dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
ls_client.create_examples(
  dataset_id=dataset.id,
  examples=examples,
)
For more details on datasets, see the Manage datasets page.
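Note that create_dataset raises an error if a dataset with that name already exists. A minimal, hedged sketch of reusing an existing dataset on re-runs (assuming Client.has_dataset and Client.read_dataset accept a dataset_name, as shown):

# Reuse the dataset if it already exists instead of recreating it (sketch).
if ls_client.has_dataset(dataset_name="Toxic Queries"):
    dataset = ls_client.read_dataset(dataset_name="Toxic Queries")
else:
    dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
    ls_client.create_examples(dataset_id=dataset.id, examples=examples)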

Define an evaluator

You can also check out LangChain's open-source evaluation package openevals for common pre-built evaluators.
Evaluators are functions for scoring your application's outputs. They receive the example inputs, the actual outputs, and, when present, the reference outputs. Since we have labels for this task, our evaluator can directly check whether the actual outputs match the reference outputs.
  • Python: requires langsmith>=0.3.13
  • TypeScript: requires langsmith>=0.2.9
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]
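A boolean return value is recorded as a 0/1 score under the evaluator function's name. If you want to name the metric yourself, an evaluator can also return a dict with a key and a score; a hedged sketch (the exact accepted return shapes are an assumption here, check the evaluator docs):

def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Name the feedback metric explicitly instead of defaulting to the function name.
    return {
        "key": "exact_match",
        "score": int(outputs["class"] == reference_outputs["label"]),
    }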

Run the evaluation

We'll use the evaluate() / aevaluate() methods to run the evaluation. The key arguments are:
  • a target function that takes an input dictionary and returns an output dictionary. The example.inputs field of each Example is what gets passed to the target function. In this case, our toxicity_classifier is already set up to take example inputs, so we can use it directly.
  • data - the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples (see the subset sketch after the code block below)
  • evaluators - a list of evaluators to score the outputs of the function
Python: requires langsmith>=0.3.13
# Can equivalently use the 'evaluate' function directly:
# from langsmith import evaluate; evaluate(...)
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4, # optional, add concurrency
)
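The data argument can also be an iterator of examples rather than a dataset name, which is handy for evaluating only a subset. A hedged sketch using Client.list_examples (the limit argument and the experiment prefix are illustrative assumptions):

# Evaluate on a small subset by passing an iterator of examples instead of the name (sketch).
subset = ls_client.list_examples(dataset_name=dataset.name, limit=3)
subset_results = ls_client.evaluate(
    toxicity_classifier,
    data=subset,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, subset",  # hypothetical prefix for the subset run
)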

Explore the results

Each invocation of evaluate() creates an experiment, which can be viewed in the LangSmith UI or queried via the SDK. Evaluation scores are stored as feedback against each actual output. If you've annotated your code for tracing, you can open the trace for each row in a side panel view.
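For programmatic inspection, the results object returned by evaluate() can be converted to a pandas DataFrame; a minimal sketch (to_pandas() requires pandas to be installed):

# Inspect the experiment results in code (sketch; assumes pandas is installed).
df = results.to_pandas()
print(df.columns.tolist())  # inputs, outputs, reference outputs, feedback scores, etc.
print(df.head())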

Reference code

from langsmith import Client, traceable, wrappers
from openai import OpenAI

# Step 1. Define an application
oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def toxicity_classifier(inputs: dict) -> dict:
    system = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return {"class": result.choices[0].message.content}

# Step 2. Create a dataset
ls_client = Client()
dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
examples = [
  {
    "inputs": {"text": "Shut up, idiot"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]
ls_client.create_examples(
  dataset_id=dataset.id,
  examples=examples,
)

# Step 3. Define an evaluator
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

# Step 4. Run the evaluation
# Client.evaluate() and evaluate() behave the same.
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, simple",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4,  # optional, add concurrency
)
