跳到主要内容
阿里云 MaxCompute(原名 ODPS)是一个通用的、完全托管的、多租户的数据处理平台,用于大规模数据仓库。MaxCompute 支持各种数据导入解决方案和分布式计算模型,使用户能够有效地查询海量数据集、降低生产成本并确保数据安全。
MaxComputeLoader 允许您执行 MaxCompute SQL 查询,并将结果加载为每行一个文档。
pip install -qU  pyodps
Collecting pyodps
  Downloading pyodps-0.11.4.post0-cp39-cp39-macosx_10_9_universal2.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 1.7 MB/s eta 0:00:0000:0100:010m
Requirement already satisfied: charset-normalizer>=2 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (3.1.0)
Requirement already satisfied: urllib3<2.0,>=1.26.0 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (1.26.15)
Requirement already satisfied: idna>=2.5 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/newboy/anaconda3/envs/langchain/lib/python3.9/site-packages (from pyodps) (2023.5.7)
Installing collected packages: pyodps
Successfully installed pyodps-0.11.4.post0

基本用法

要实例化加载器,您需要一个要执行的 SQL 查询、您的 MaxCompute 端点和项目名称,以及您的访问 ID 和秘密访问密钥。访问 ID 和秘密访问密钥可以直接通过 access_idsecret_access_key 参数传入,也可以设置为环境变量 MAX_COMPUTE_ACCESS_IDMAX_COMPUTE_SECRET_ACCESS_KEY
from langchain_community.document_loaders import MaxComputeLoader
base_query = """
SELECT *
FROM (
    SELECT 1 AS id, 'content1' AS content, 'meta_info1' AS meta_info
    UNION ALL
    SELECT 2 AS id, 'content2' AS content, 'meta_info2' AS meta_info
    UNION ALL
    SELECT 3 AS id, 'content3' AS content, 'meta_info3' AS meta_info
) mydata;
"""
endpoint = "<ENDPOINT>"
project = "<PROJECT>"
ACCESS_ID = "<ACCESS ID>"
SECRET_ACCESS_KEY = "<SECRET ACCESS KEY>"
loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data)
[Document(page_content='id: 1\ncontent: content1\nmeta_info: meta_info1', metadata={}), Document(page_content='id: 2\ncontent: content2\nmeta_info: meta_info2', metadata={}), Document(page_content='id: 3\ncontent: content3\nmeta_info: meta_info3', metadata={})]
print(data[0].page_content)
id: 1
content: content1
meta_info: meta_info1
print(data[0].metadata)
{}

指定哪些列是内容,哪些是元数据

您可以使用 page_content_columnsmetadata_columns 参数来配置哪些列的子集应加载为文档的内容,哪些应加载为元数据。
loader = MaxComputeLoader.from_params(
    base_query,
    endpoint,
    project,
    page_content_columns=["content"],  # Specify Document page content
    metadata_columns=["id", "meta_info"],  # Specify Document metadata
    access_id=ACCESS_ID,
    secret_access_key=SECRET_ACCESS_KEY,
)
data = loader.load()
print(data[0].page_content)
content: content1
print(data[0].metadata)
{'id': 1, 'meta_info': 'meta_info1'}

以编程方式连接这些文档到 Claude、VSCode 等,通过 MCP 获取实时答案。
© . This site is unofficial and not affiliated with LangChain, Inc.