How to define a code evaluator

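A code evaluator is simply a function that takes a dataset example and the application's final output, and returns one or more metrics. These functions can be passed directly to evaluate() / aevaluate().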

Basic example

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)

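Evaluator args

Code evaluator functions must use specific argument names. They can take any subset of the following arguments:

run: Run: The full Run object generated by the application on the given example.
example: Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
reference_outputs/referenceOutputs: dict: A dictionary of the reference outputs associated with the example, if available.

For most use cases you only need inputs, outputs, and reference_outputs. run and example are useful only when you need extra trace or example metadata beyond the application's actual inputs and outputs. When using JS/TS, these should all be passed in as part of a single object argument.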
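As a minimal sketch of how argument names determine what an evaluator receives, the three evaluators below each declare a different subset of the arguments listed above. The function names and the `question` / `answer` keys are illustrative assumptions, not part of LangSmith. The functions are called directly with hand-written dicts only to show their behavior; in an experiment they would be passed through `evaluators=[...]`.

```python
def answer_is_nonempty(outputs: dict) -> bool:
    """Needs only the application's outputs."""
    return bool(outputs.get("answer", "").strip())


def matches_reference(outputs: dict, reference_outputs: dict) -> bool:
    """Case-insensitive comparison against the reference answer."""
    predicted = outputs["answer"].strip().lower()
    expected = reference_outputs["answer"].strip().lower()
    return predicted == expected


def echoes_question(inputs: dict, outputs: dict) -> bool:
    """Flags answers that merely repeat the question."""
    question = inputs["question"].strip().lower()
    return outputs["answer"].strip().lower() == question


# Local sanity check with hypothetical data (no LangSmith calls involved).
sample_inputs = {"question": "What is the capital of France?"}
sample_outputs = {"answer": " Paris "}
sample_reference = {"answer": "paris"}

nonempty = answer_is_nonempty(sample_outputs)
matched = matches_reference(sample_outputs, sample_reference)
echoed = echoes_question(sample_inputs, sample_outputs)
```

Declaring only the arguments an evaluator actually uses keeps each function easy to unit-test in isolation, as done here.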

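Evaluator output

Code evaluators are expected to return one of the following types:

Python and JS/TS

dict: A dict of the form {"score" | "value": ..., "key": ...} lets you customize the metric type ("score" for numerical, "value" for categorical) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.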
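To illustrate the dict form, the sketch below logs one integer as a numerical metric under "score" and another as a categorical metric under "value". The `rating` output field and the metric names `answer_length` and `rating_bucket` are hypothetical, chosen only for this example.

```python
def answer_length(outputs: dict) -> dict:
    """Numerical metric: returned under "score"."""
    return {"key": "answer_length", "score": len(outputs["answer"])}


def rating_bucket(outputs: dict) -> dict:
    """Categorical metric: the integer is returned under "value"."""
    rating = int(outputs["rating"])  # hypothetical output field
    return {"key": "rating_bucket", "value": rating}


# Called directly with a hand-written output dict for illustration.
sample_outputs = {"answer": "hello", "rating": 4}
length_metric = answer_length(sample_outputs)
rating_metric = rating_bucket(sample_outputs)
```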

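Python only

int | float | bool: Interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
str: Interpreted as a categorical metric. The function name is used as the name of the metric.
list[dict]: Return multiple metrics using a single function.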
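The Python-only return types can be sketched as follows. The function names, the length thresholds, and the metric keys are assumptions made for illustration, and the list example assumes each entry follows the {"key": ..., "score" | "value": ...} dict form described earlier.

```python
def is_short(outputs: dict) -> bool:
    """bool return: treated as a continuous metric named "is_short"."""
    return len(outputs["answer"]) <= 20


def length_category(outputs: dict) -> str:
    """str return: treated as a categorical metric named "length_category"."""
    n = len(outputs["answer"])
    if n <= 20:
        return "short"
    if n <= 200:
        return "medium"
    return "long"


def length_metrics(outputs: dict) -> list[dict]:
    """list[dict] return: several metrics from a single function."""
    answer = outputs["answer"]
    return [
        {"key": "num_chars", "score": len(answer)},
        {"key": "num_words", "score": len(answer.split())},
    ]


# Called directly with a hand-written output dict for illustration.
sample_outputs = {"answer": "hmm i'm not sure"}
short_flag = is_short(sample_outputs)
category = length_category(sample_outputs)
metrics = length_metrics(sample_outputs)
```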

Additional examples

Requires langsmith>=0.2.0

from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """
Given the following question, answer, and reasoning, determine if the reasoning for the
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning]
)

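Related

Evaluate aggregate experiment results: define summary evaluators, which compute metrics for an entire experiment.
Run an evaluation comparing two experiments: define pairwise evaluators, which compute metrics by comparing two (or more) experiments.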
