Pinecone（稀疏）

Pinecone 是一个功能强大的向量数据库。

本笔记展示了如何使用与 Pinecone 向量数据库相关的功能。

设置

要使用 PineconeSparseVectorStore，您首先需要安装合作伙伴包以及本笔记中使用的其他包。

pip install -qU "langchain-pinecone==0.2.5"

WARNING: pinecone 6.0.2 does not provide the extra 'async'

凭据

创建一个新的 Pinecone 帐户，或登录现有帐户，然后创建一个 API 密钥以在此笔记中使用。

import os
from getpass import getpass

from pinecone import Pinecone

# get API key at app.pinecone.io
os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") or getpass(
    "Enter your Pinecone API key: "
)

# initialize client
pc = Pinecone()

Enter your Pinecone API key: ··········

初始化

在初始化向量存储之前，让我们连接到 Pinecone 索引。如果不存在名为 index_name 的索引，则会创建一个。

from pinecone import AwsRegion, CloudProvider, Metric, ServerlessSpec

index_name = "langchain-sparse-vector-search"  # change if desired
model_name = "pinecone-sparse-english-v0"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud=CloudProvider.AWS,
        region=AwsRegion.US_EAST_1,
        embed={
            "model": model_name,
            "field_map": {"text": "chunk_text"},
            "metric": Metric.DOTPRODUCT,
        },
    )

index = pc.Index(index_name)
print(f"Index `{index_name}` host: {index.config.host}")

Index `langchain-sparse-vector-search` host: https://langchain-sparse-vector-search-yrrgefy.svc.aped-4627-b74a.pinecone.io

对于我们的稀疏嵌入模型，我们使用 pinecone-sparse-english-v0，我们这样初始化它

from langchain_pinecone.embeddings import PineconeSparseEmbeddings

sparse_embeddings = PineconeSparseEmbeddings(model=model_name)

现在我们的 Pinecone 索引和嵌入模型都已准备就绪，我们可以在 LangChain 中初始化我们的稀疏向量存储

from langchain_pinecone import PineconeSparseVectorStore

vector_store = PineconeSparseVectorStore(index=index, embedding=sparse_embeddings)

管理向量存储

创建向量存储后，我们可以通过添加和删除不同的项目来与其交互。

向向量存储添加项目

我们可以使用 add_documents 函数将项目添加到向量存储中。

from uuid import uuid4

from langchain_core.documents import Document

documents = [
    Document(
        page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
        metadata={"source": "social"},
    ),
    Document(
        page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="Building an exciting new project with LangChain - come check it out!",
        metadata={"source": "social"},
    ),
    Document(
        page_content="Robbers broke into the city bank and stole $1 million in cash.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="Wow! That was an amazing movie. I can't wait to see it again.",
        metadata={"source": "social"},
    ),
    Document(
        page_content="Is the new iPhone worth the price? Read this review to find out.",
        metadata={"source": "website"},
    ),
    Document(
        page_content="The top 10 soccer players in the world right now.",
        metadata={"source": "website"},
    ),
    Document(
        page_content="LangGraph is the best framework for building stateful, agentic applications!",
        metadata={"source": "social"},
    ),
    Document(
        page_content="The stock market is down 500 points today due to fears of a recession.",
        metadata={"source": "news"},
    ),
    Document(
        page_content="I have a bad feeling I am going to get deleted :(",
        metadata={"source": "social"},
    ),
]

uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['95b598af-c3dc-4a8a-bdb7-5d21283e5a86',
 '838614a5-5635-4efd-9ac3-5237a37a542b',
 '093fd11f-c85b-4c83-83f0-117df64ff442',
 'fb3ba32f-f802-410a-ad79-56f7bce938fe',
 '75cde9bf-7e91-4f06-8bae-c824dab16a08',
 '9de8f769-d604-4e56-b677-ee333cbc8e34',
 'f5f4ae97-88e6-4669-bcf7-87072bb08550',
 'f9f82811-187c-4b25-85b5-7a42b4da3bff',
 'ce45957c-e8fc-41ef-819b-1bd52b6fc815',
 '66cacc6f-b8e2-441b-9f7f-468788aad88f']

从向量存储中删除项目

我们可以使用 delete 方法从向量存储中删除记录，向其提供要删除的文档 ID 列表。

vector_store.delete(ids=[uuids[-1]])

查询向量存储

一旦我们将文档加载到向量存储中，我们就可以开始查询了。LangChain 中有各种方法可以做到这一点。首先，我们将通过 similarity_search 方法直接查询我们的 vector_store 来执行简单的向量搜索：

results = vector_store.similarity_search("I'm building a new LangChain project!", k=3)

for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]

我们还可以为查询添加元数据过滤，以根据各种条件限制我们的搜索。让我们尝试一个简单的过滤器，将搜索限制为仅包含 source=="social" 的记录

results = vector_store.similarity_search(
    "I'm building a new LangChain project!",
    k=3,
    filter={"source": "social"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]

比较这些结果时，我们可以看到我们的第一个查询返回了一个来自 "website" 源的不同记录。在我们的后者，经过过滤的查询中，情况不再如此。

相似性搜索和分数

我们还可以进行搜索，同时以 (文档，分数) 元组列表的形式返回相似度分数。其中 文档 是一个 LangChain Document 对象，包含我们的文本内容和元数据。

results = vector_store.similarity_search_with_score(
    "I'm building a new LangChain project!", k=3, filter={"source": "social"}
)
for doc, score in results:
    print(f"[SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

[SIM=12.959961] Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
[SIM=12.959961] Building an exciting new project with LangChain - come check it out! [{'source': 'social'}]
[SIM=1.942383] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'social'}]

作为检索器

在我们的链和代理中，我们经常将向量存储用作 VectorStoreRetriever。要创建它，我们使用 as_retriever 方法

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.5},
)
retriever

VectorStoreRetriever(tags=['PineconeSparseVectorStore', 'PineconeSparseEmbeddings'], vectorstore=<langchain_pinecone.vectorstores_sparse.PineconeSparseVectorStore object at 0x7c8087b24290>, search_type='similarity_score_threshold', search_kwargs={'k': 3, 'score_threshold': 0.5})

我们现在可以使用 invoke 方法查询我们的检索器

retriever.invoke(
    input="I'm building a new LangChain project!", filter={"source": "social"}
)

/usr/local/lib/python3.11/dist-packages/langchain_core/vectorstores/base.py:1082: UserWarning: Relevance scores must be between 0 and 1, got [(Document(id='093fd11f-c85b-4c83-83f0-117df64ff442', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'), 6.97998045), (Document(id='54f8f645-9f77-4aab-b9fa-709fd91ae3b3', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'), 6.97998045), (Document(id='f9f82811-187c-4b25-85b5-7a42b4da3bff', metadata={'source': 'social'}, page_content='LangGraph is the best framework for building stateful, agentic applications!'), 1.471191405)]
  self.vectorstore.similarity_search_with_relevance_scores(

[Document(id='093fd11f-c85b-4c83-83f0-117df64ff442', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='54f8f645-9f77-4aab-b9fa-709fd91ae3b3', metadata={'source': 'social'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='f9f82811-187c-4b25-85b5-7a42b4da3bff', metadata={'source': 'social'}, page_content='LangGraph is the best framework for building stateful, agentic applications!')]

用于检索增强生成的使用

有关如何将此向量存储用于检索增强生成 (RAG) 的指南，请参阅以下部分

API 参考

有关所有功能和配置的详细文档，请参阅 API 参考：python.langchain.com/api_reference/pinecone/vectorstores_sparse/langchain_pinecone.vectorstores_sparse.PineconeSparseVectorStore.html#langchain_pinecone.vectorstores_sparse.PineconeSparseVectorStore 稀疏嵌入：python.langchain.com/api_reference/pinecone/embeddings/langchain_pinecone.embeddings.PineconeSparseEmbeddings.html

在 GitHub 上编辑此页面源文件。

以编程方式连接这些文档到 Claude、VSCode 等，通过 MCP 获取实时答案。

热门提供商

按组件划分的集成

设置

凭据

初始化

管理向量存储

向向量存储添加项目

从向量存储中删除项目

查询向量存储

相似性搜索和分数

作为检索器

用于检索增强生成的使用

API 参考

热门提供商

按组件划分的集成

​设置

​凭据

​初始化

​管理向量存储

​向向量存储添加项目

​从向量存储中删除项目

​查询向量存储

​相似性搜索和分数

​作为检索器

​用于检索增强生成的使用

​API 参考

设置

凭据

初始化

管理向量存储

向向量存储添加项目

从向量存储中删除项目

查询向量存储

相似性搜索和分数

作为检索器

用于检索增强生成的使用

API 参考