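Configure LangSmith for scale

A self-hosted LangSmith instance can handle a large number of traces and users. The default configuration for a self-hosted deployment handles substantial load, and you can configure your deployment to reach higher scale. This page describes scaling considerations and provides examples to help you configure your self-hosted instance. For configuration examples, see Example LangSmith configurations for scale.

Summary

The table below summarizes LangSmith configurations for different load patterns (read / write):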

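Read load / write load	Low / low	Low / high	High / low	Medium / medium	High / high
Concurrent frontend users	5	5	50	20	50
Traces submitted per second	10	1000	10	100	1000
Frontend replicas	1 (default)	4	2	2	4
Platform backend replicas	3 (default)	20	3 (default)	3 (default)	20
Queue replicas	3 (default)	160	6	10	160
Backend replicas	2 (default)	5	40	16	50
Redis resources	8 Gi (default)	200 Gi external	8 Gi (default)	13 Gi external	200 Gi external
ClickHouse resources	4 CPU, 16 Gi (default)	10 CPU, 32 Gi memory	8 CPU, 16 Gi per replica	16 CPU, 24 Gi memory	14 CPU, 24 Gi per replica
ClickHouse setup	Single instance	Single instance	3-node replicated	Single instance	3-node replicated
Postgres resources	2 CPU, 8 GB memory, 10 GB storage (external)	2 CPU, 8 GB memory, 10 GB storage (external)	2 CPU, 8 GB memory, 10 GB storage (external)	2 CPU, 8 GB memory, 10 GB storage (external)	2 CPU, 8 GB memory, 10 GB storage (external)
Blob storage	Disabled	Enabled	Enabled	Enabled	Enabled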

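The sections below describe the read and write paths in more detail and provide a `values.yaml` snippet that you can use in your self-hosted LangSmith instance.

Trace ingestion (write path)

Common usage that puts load on the write path:

Ingesting traces via the Python or JavaScript LangSmith SDK
Ingesting traces via the `@traceable` wrapper
Submitting traces via the `/runs/multipart` endpoint

Services that play a large role in trace ingestion:

Platform backend service: receives the initial request to ingest traces and places the traces on a Redis queue
Redis cache: queues traces that need to be persisted
Queue service: persists traces for querying
ClickHouse: persistent storage for traces

When scaling the write path (trace ingestion), it helps to monitor the four services/resources listed above. The following changes typically improve trace ingestion performance:

Give ClickHouse more resources (CPU and memory) if it is approaching its resource limits.
Increase the number of platform backend pods if ingestion requests take too long to respond.
Increase the number of queue service pod replicas if traces are not being processed from Redis fast enough.
Use a larger Redis cache if your current Redis instance is reaching its resource limits. An undersized Redis cache can also cause ingestion requests to take too long.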
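The tuning rules above can be read as a simple symptom-to-action mapping. The sketch below is only an illustration of that mapping: the `WritePathSignals` class and its field names are hypothetical, and how you detect each symptom (dashboards, alerts, manual inspection) is left to your own monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class WritePathSignals:
    """Observed symptoms on the write path (all field names are hypothetical)."""

    clickhouse_near_limits: bool = False
    ingest_requests_slow: bool = False
    redis_backlog_growing: bool = False
    redis_near_limits: bool = False


def write_path_actions(signals: WritePathSignals) -> list[str]:
    """Return the tuning steps from the list above that match the symptoms."""
    actions = []
    if signals.clickhouse_near_limits:
        actions.append("give ClickHouse more CPU and memory")
    if signals.ingest_requests_slow:
        actions.append("increase platform backend replicas")
    if signals.redis_backlog_growing:
        actions.append("increase queue replicas")
    if signals.redis_near_limits:
        actions.append("use a larger Redis cache")
    return actions


# Slow ingestion combined with a saturated Redis instance yields two actions.
print(write_path_actions(WritePathSignals(ingest_requests_slow=True, redis_near_limits=True)))
```

Several symptoms can be present at once, so the helper returns every matching action rather than only the first.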

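Trace querying (read path)

Common usage that puts load on the read path:

Users in the frontend viewing tracing projects or individual traces
Scripts that query for trace information
Requests to the `/runs/query` or `/runs/` API endpoints

Services that play a large role in querying traces:

Backend service: receives the request, submits a query to ClickHouse, and then responds to the request
ClickHouse: persistent storage for traces. This is the main database queried when trace information is requested.

When scaling the read path (trace querying), it helps to monitor the two services/resources listed above. The following changes typically improve trace query performance:

Increase the number of backend service pods. This has the most impact when backend service pods are reaching 1 core of CPU usage.
Give ClickHouse more resources (CPU or memory). ClickHouse can be very resource intensive, but the added resources should lead to better performance.
Move to a replicated ClickHouse cluster. Adding ClickHouse replicas helps read performance, but we recommend staying below 5 replicas (start with 3).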
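As a rough illustration of the first rule above, the sketch below flags backend pods whose CPU usage has reached the 1-core mark. The function name, the pod names, and the input format (a mapping from pod name to CPU usage in cores, collected however your metrics stack exposes it) are assumptions made for this example.

```python
def backend_pods_at_cpu_limit(
    cpu_cores_by_pod: dict[str, float], threshold_cores: float = 1.0
) -> list[str]:
    """Return the names of backend pods whose CPU usage has reached the threshold."""
    return sorted(
        name for name, cores in cpu_cores_by_pod.items() if cores >= threshold_cores
    )


# Hypothetical usage figures in cores; one pod is at the 1-core mark.
usage = {"backend-a": 0.40, "backend-b": 1.02, "backend-c": 0.75}
print(backend_pods_at_cpu_limit(usage))  # ['backend-b']
```

If most backend pods appear in the result, adding backend replicas is likely to help more than adding ClickHouse resources.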

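For more precise guidance on how this translates to Helm chart values, see the examples in the following section. If you are unsure why your LangSmith instance cannot handle a certain load pattern, contact the LangChain team.

Example LangSmith configurations for scale

Below are some example LangSmith configurations based on expected read and write loads. For read load (trace querying):

Low means roughly 5 users viewing traces at the same time (about 10 requests per second)
Medium means roughly 20 users viewing traces at the same time (about 40 requests per second)
High means roughly 50 users viewing traces at the same time (about 100 requests per second)

For write load (trace ingestion):

Low means up to 10 traces submitted per second
Medium means up to 100 traces submitted per second
High means up to 1000 traces submitted per second

The exact optimal configuration depends on your usage and trace payloads. Use the examples below together with the information above and your specific usage to update your LangSmith configuration as needed. If you have any questions, contact the LangChain team.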
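To make the tier definitions concrete, here is a minimal sketch that maps an expected load to a read/write pattern name. It assumes the user counts and trace rates above can be treated as upper bounds for each tier, which is a simplification since the read figures are only approximate; the function and constant names are hypothetical.

```python
# Tier upper bounds taken from the definitions above.
READ_TIERS = [("low", 5), ("medium", 20), ("high", 50)]       # concurrent users
WRITE_TIERS = [("low", 10), ("medium", 100), ("high", 1000)]  # traces per second


def classify(value: float, tiers: list[tuple[str, int]]) -> str:
    """Return the first tier whose upper bound covers the value."""
    for name, upper in tiers:
        if value <= upper:
            return name
    return "beyond high"


def load_pattern(concurrent_users: float, traces_per_second: float) -> str:
    """Describe the load as '<read tier> reads, <write tier> writes'."""
    read = classify(concurrent_users, READ_TIERS)
    write = classify(traces_per_second, WRITE_TIERS)
    return f"{read} reads, {write} writes"


print(load_pattern(5, 800))   # low reads, high writes
print(load_pattern(20, 100))  # medium reads, medium writes
```

Not every combination has a matching example on this page. For a pattern without one, such as medium reads with low writes, starting from the next larger example is a reasonable first guess.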

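Low reads, low writes

The default LangSmith configuration handles this load. No custom resource configuration is needed here.

Low reads, high writes

You have a very high scale of trace ingestion, but only a single-digit number of users querying traces on the frontend. For this, we recommend a configuration like this: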

config:
  blobStorage:
    # Please also set the other keys to connect to your blob storage. See configuration section.
    enabled: true
  settings:
    redisRunsExpirySeconds: "3600"
# ttl:
#   enabled: true
#   ttl_period_seconds:
#     longlived: "7776000"  # 90 days (default is 400 days)
#     shortlived: "604800"  # 7 days (default is 14 days)

frontend:
  deployment:
    replicas: 4 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 4
#   minReplicas: 2

platformBackend:
  deployment:
    replicas: 20 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 20
#   minReplicas: 8

## Note that we are actively working on improving performance of this service to reduce the number of replicas.
queue:
  deployment:
    replicas: 160 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 160
#   minReplicas: 40

backend:
  deployment:
    replicas: 5 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 5
#   minReplicas: 3

## Ensure your Redis cache is at least 200 GB
redis:
  external:
    enabled: true
    existingSecretName: langsmith-redis-secret # Set the connection url for your external Redis instance (200+ GB)

clickhouse:
  statefulSet:
    persistence:
      # This may depend on your configured TTL (see config section).
      # We recommend 600Gi for every shortlived TTL day if operating at this scale constantly.
      size: 4200Gi # This assumes 7 days TTL and operating at this scale constantly.
    resources:
      requests:
        cpu: "10"
        memory: "32Gi"
      limits:
        cpu: "16"
        memory: "48Gi"

commonEnv:
  - name: "CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT"
    value: "0"

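High reads, low writes

You have a relatively low scale of trace ingestion, but many frontend users querying traces and/or scripts that hit the `/runs/query` or `/runs/` endpoints frequently. For this, we strongly recommend setting up a replicated ClickHouse cluster to enable high read scale at low latency. See our external ClickHouse documentation for more guidance on how to set up a replicated ClickHouse cluster. For this load pattern, we recommend a 3-node replicated setup, where each replica in the cluster has resource requests of 8+ CPU cores and 16+ GB of memory, and resource limits of 12 CPU cores and 32 GB of memory. For this, we recommend a configuration like this: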

config:
  blobStorage:
    # Please also set the other keys to connect to your blob storage. See configuration section.
    enabled: true

frontend:
  deployment:
    replicas: 2

queue:
  deployment:
    replicas: 6 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 6
#   minReplicas: 4

backend:
  deployment:
    replicas: 40 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 40
#   minReplicas: 16

# We strongly recommend setting up a replicated clickhouse cluster for this load.
# Update these values as needed to connect to your replicated clickhouse cluster.
clickhouse:
  external:
    # If using a 3 node replicated setup, each replica in the cluster should have resource requests of 8+ cores and 16+ GB memory, and resource limit of 12 cores and 32 GB memory.
    enabled: true
    host: langsmith-ch-clickhouse-replicated.default.svc.cluster.local
    port: "8123"
    nativePort: "9000"
    user: "default"
    password: "password"
    database: "default"
    cluster: "replicated"

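Medium reads, medium writes

This is a good all-around configuration that should handle most LangSmith usage patterns. In internal testing, this configuration allowed us to scale to 100 traces ingested per second and 40 read requests per second. For this, we recommend a configuration like this: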

config:
  blobStorage:
    # Please also set the other keys to connect to your blob storage. See configuration section.
    enabled: true
  settings:
    redisRunsExpirySeconds: "3600"

frontend:
  deployment:
    replicas: 2

queue:
  deployment:
    replicas: 10 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 10
#   minReplicas: 5

backend:
  deployment:
    replicas: 16 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 16
#   minReplicas: 8

redis:
  statefulSet:
    resources:
      requests:
        memory: 13Gi
      limits:
        memory: 13Gi

  # -- For external redis instead use something like below --
  # external:
  #   enabled: true
  #   connectionUrl: "<URL>" OR existingSecretName: "<SECRET-NAME>"

clickhouse:
  statefulSet:
    persistence:
      # This may depend on your configured TTL.
      # We recommend 60Gi for every shortlived TTL day if operating at this scale constantly.
      size: 420Gi # This assumes 7 days TTL and operating at this scale constantly.
    resources:
      requests:
        cpu: "16"
        memory: "24Gi"
      limits:
        cpu: "28"
        memory: "40Gi"

commonEnv:
  - name: "CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT"
    value: "0"

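If reads are still slow with the configuration above, we recommend moving to a replicated ClickHouse cluster setup.

High reads, high writes

You have a very high rate of trace ingestion (approaching 1000 traces submitted per second) and also many users querying traces on the frontend (over 50 users) and/or scripts that continuously make requests to the `/runs/query` or `/runs/` endpoints. For this, we strongly recommend setting up a replicated ClickHouse cluster to prevent read performance from degrading at high write scale. See our external ClickHouse documentation for more guidance on how to set up a replicated ClickHouse cluster. For this load pattern, we recommend a 3-node replicated setup, where each replica in the cluster has resource requests of 14+ CPU cores and 24+ GB of memory, and resource limits of 20 CPU cores and 48 GB of memory. We also recommend that each ClickHouse node/instance has 600 Gi of volume storage for each day of TTL that you enable (as per the configuration below). Overall, we recommend a configuration like this: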

config:
  blobStorage:
    # Please also set the other keys to connect to your blob storage. See configuration section.
    enabled: true
  settings:
    redisRunsExpirySeconds: "3600"
# ttl:
#   enabled: true
#   ttl_period_seconds:
#     longlived: "7776000"  # 90 days (default is 400 days)
#     shortlived: "604800"  # 7 days (default is 14 days)

frontend:
  deployment:
    replicas: 4 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 4
#   minReplicas: 2

platformBackend:
  deployment:
    replicas: 20 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 20
#   minReplicas: 8

## Note that we are actively working on improving performance of this service to reduce the number of replicas.
queue:
  deployment:
    replicas: 160 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 160
#   minReplicas: 40

backend:
  deployment:
    replicas: 50 # OR enable autoscaling to this level (example below)
# autoscaling:
#   enabled: true
#   maxReplicas: 50
#   minReplicas: 20

## Ensure your Redis cache is at least 200 GB
redis:
  external:
    enabled: true
    existingSecretName: langsmith-redis-secret # Set the connection url for your external Redis instance (200+ GB)

# We strongly recommend setting up a replicated clickhouse cluster for this load.
# Update these values as needed to connect to your replicated clickhouse cluster.
clickhouse:
  external:
    # If using a 3 node replicated setup, each replica in the cluster should have resource requests of 14+ cores and 24+ GB memory, and resource limit of 20 cores and 48 GB memory.
    enabled: true
    host: langsmith-ch-clickhouse-replicated.default.svc.cluster.local
    port: "8123"
    nativePort: "9000"
    user: "default"
    password: "password"
    database: "default"
    cluster: "replicated"

commonEnv:
  - name: "CLICKHOUSE_ASYNC_INSERT_WAIT_PCT_FLOAT"
    value: "0"

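Make sure the Kubernetes cluster is provisioned with enough resources to scale to the recommended size. After deployment, all pods in the Kubernetes cluster should be in the `Running` state. Pods stuck in `Pending` may indicate that you are reaching node pool limits or need larger nodes. Also make sure that any ingress controller deployed on the cluster can handle the required load so that it does not become a bottleneck.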
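One way to check for stuck pods after a deployment is to inspect the JSON output of `kubectl get pods -o json`. The sketch below is a minimal example of that check. The `pending_pods` helper and the pod names in the sample are hypothetical, and the sample is a trimmed-down stand-in for real `kubectl` output, which contains many more fields.

```python
import json


def pending_pods(pod_list_json: str) -> list[str]:
    """Return names of pods whose phase is Pending in `kubectl get pods -o json` output."""
    pod_list = json.loads(pod_list_json)
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Pending"
    ]


# Hypothetical, trimmed-down sample of the JSON that kubectl returns.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "langsmith-queue-0"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "langsmith-queue-1"}, "status": {"phase": "Pending"}},
    ]
})
print(pending_pods(sample))  # ['langsmith-queue-1']
```

An empty result means no pod is waiting to be scheduled. A non-empty result after raising replica counts suggests that the node pool needs more or larger nodes.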
