如何定义一个目标函数进行评估

运行评估需要三个主要部分

包含测试输入和预期输出的数据集。
您正在评估的目标函数。
对目标函数的输出进行评分的评估器。

本指南将向您展示如何根据您正在评估的应用程序部分来定义目标函数。有关如何创建数据集和如何定义评估器的信息，请参见此处；有关运行评估的端到端示例，请参见此处。

目标函数签名

为了在代码中评估应用程序，我们需要一种运行应用程序的方法。使用 evaluate() (Python/TypeScript) 时，我们将通过传入一个目标函数参数来完成此操作。这是一个函数，它接受数据集示例的输入并以字典形式返回应用程序输出。在此函数中，我们可以随意调用我们的应用程序。我们也可以随意格式化输出。关键是，我们定义的任何评估器函数都应该与我们在目标函数中返回的输出格式一起工作。

from langsmith import Client

# 'inputs' will come from your dataset.
def dummy_target(inputs: dict) -> dict:
    return {"foo": 1, "bar": "two"}

# 'inputs' will come from your dataset.
# 'outputs' will come from your target function.
def evaluator_one(inputs: dict, outputs: dict) -> bool:
    return outputs["foo"] == 2

def evaluator_two(inputs: dict, outputs: dict) -> bool:
    return len(outputs["bar"]) < 3

client = Client()
results = client.evaluate(
    dummy_target,  # <-- target function
    data="your-dataset-name",
    evaluators=[evaluator_one, evaluator_two],
    ...
)

evaluate() 将自动跟踪您的目标函数。这意味着如果您在目标函数中运行任何可跟踪的代码，这也将作为目标跟踪的子运行进行跟踪。

示例：单次 LLM 调用

from langsmith import wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to automatically
# trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with a 'messages' key.
  # You can update to match your dataset schema.
  messages = inputs["messages"]
  response = oai_client.chat.completions.create(
      messages=messages,
      model="gpt-4o-mini",
  )
  return {"answer": response.choices[0].message.content}

示例：非 LLM 组件

from langsmith import traceable

# Optionally decorate with '@traceable' to trace all invocations of this function.
@traceable
def calculator_tool(operation: str, number1: float, number2: float) -> str:
  if operation == "add":
      return str(number1 + number2)
  elif operation == "subtract":
      return str(number1 - number2)
  elif operation == "multiply":
      return str(number1 * number2)
  elif operation == "divide":
      return str(number1 / number2)
  else:
      raise ValueError(f"Unrecognized operation: {operation}.")

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys.
  operation = inputs["operation"]
  number1 = inputs["num1"]
  number2 = inputs["num2"]
  result = calculator_tool(operation, number1, number2)
  return {"result": result}

示例：应用程序或代理

from my_agent import agent

      # This is the function you will evaluate.
def target(inputs: dict) -> dict:
  # This assumes your dataset has inputs with a `messages` key
  messages = inputs["messages"]
  # Replace `invoke` with whatever you use to call your agent
  response = agent.invoke({"messages": messages})
  # This assumes your agent output is in the right format
  return response

如果您有一个 LangGraph/LangChain 代理，它接受数据集中定义的输入并返回您希望在评估器中使用的输出格式，您可以直接将该对象作为目标传入

from my_agent import agent
from langsmith import Client
client = Client()
client.evaluate(agent, ...)

在 GitHub 上编辑此页面源文件。

以编程方式连接这些文档到 Claude、VSCode 等，通过 MCP 获取实时答案。

数据集

设置评估

分析实验结果

标注与人工反馈

常见数据类型

如何定义一个目标函数进行评估

目标函数签名

示例：单次 LLM 调用

示例：非 LLM 组件

示例：应用程序或代理

数据集

设置评估

分析实验结果

标注与人工反馈

常见数据类型

​目标函数签名

​示例：单次 LLM 调用

​示例：非 LLM 组件

​示例：应用程序或代理

目标函数签名

示例：单次 LLM 调用

示例：非 LLM 组件

示例：应用程序或代理