Intel's Visual Data Management System (VDMS)

This notebook covers how to get started with VDMS as a vector store.

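Intel's Visual Data Management System (VDMS) is a storage solution for efficient access to big "visual" data. It aims to achieve cloud scale by searching for relevant visual data through visual metadata stored as a graph, and by enabling machine-friendly enhancements to visual data for faster access. VDMS is released under the MIT license. For more information on VDMS, visit this page, and find the LangChain API reference here.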

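VDMS supports:

K-nearest-neighbor search
Euclidean distance (L2) and inner product (IP)
Libraries for indexing and computing distances: FaissFlat (default), FaissHNSWFlat, FaissIVFFlat, Flinng, TileDBDense, TileDBSparse
Embeddings for text, images, and video
Vector and metadata searches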
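
To make the two distance metrics in the list above concrete, here is a minimal brute-force sketch of K-nearest-neighbor search in plain Python. It is an illustration only, not how VDMS or Faiss implement search; the helper names (`l2_distance`, `inner_product`, `knn`) and the toy vectors are assumptions made up for this example.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: a smaller value means a closer match."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP): a larger value means a better match."""
    return sum(x * y for x, y in zip(a, b))


def knn(query, vectors, k, metric="L2"):
    """Return the indices of the k best matches under the chosen metric."""
    if metric == "L2":
        scored = [(l2_distance(query, v), i) for i, v in enumerate(vectors)]
        scored.sort()  # ascending: smallest distance first
    elif metric == "IP":
        scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
        scored.sort(reverse=True)  # descending: largest product first
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [3.0, 0.0]]
query = [1.0, 0.1]

print(knn(query, vectors, k=2, metric="L2"))  # [0, 2]
print(knn(query, vectors, k=2, metric="IP"))  # [3, 0]
```

The two metrics disagree here because the inner product rewards vector length: the long vector at index 3 wins under IP even though it is far from the query under L2.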

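Setup

To access VDMS vector stores, you need to install the langchain-vdms integration package and deploy a VDMS server from the publicly available Docker image. For simplicity, this notebook deploys a VDMS server on localhost using port 55555.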

pip install -qU "langchain-vdms>=0.1.3"
!docker run --no-healthcheck --rm -d -p 55555:55555 --name vdms_vs_test_nb intellabs/vdms:latest
!sleep 5

Note: you may need to restart the kernel to use updated packages.
c464076e292613df27241765184a673b00c775cecb7792ef058591c2cbf0bde8
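
The fixed `sleep 5` above is a guess at how long the container needs to start. As an alternative, the sketch below polls the port until it accepts TCP connections. `wait_for_port` is a hypothetical helper written for this example, not part of langchain-vdms, and a successful TCP connection only shows that something is listening on the port, not that the VDMS server has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In this notebook it could be called as `wait_for_port("localhost", 55555)` right after starting the container.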

Credentials

You can use VDMS without any credentials. To enable automated tracing of your model calls, set your LangSmith API key:

import getpass
import os

os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

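Initialization

Use the VDMS client to connect to a VDMS vector store, with the FAISS IndexFlat index (default) and Euclidean distance (default) as the distance metric for similarity search.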

! pip install -qU langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

from langchain_vdms.vectorstores import VDMS, VDMS_Client

collection_name = "test_collection_faiss_L2"

vdms_client = VDMS_Client(host="localhost", port=55555)

vector_store = VDMS(
    client=vdms_client,
    embedding=embeddings,
    collection_name=collection_name,
    engine="FaissFlat",
    distance_strategy="L2",
)

Manage vector store

Add items to vector store

import logging

logging.basicConfig()
logging.getLogger("langchain_vdms.vectorstores").setLevel(logging.INFO)

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
    id=3,
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
    id=4,
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
    id=5,
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
    id=6,
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
    id=7,
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
    id=8,
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
    id=9,
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
    id=10,
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

doc_ids = [str(i) for i in range(1, 11)]
vector_store.add_documents(documents=documents, ids=doc_ids)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

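If an ID is provided more than once, add_documents does not check whether the IDs are unique. For this reason, use upsert, which deletes any existing entries with the same IDs before adding the new ones.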

vector_store.upsert(documents, ids=doc_ids)

{'succeeded': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 'failed': []}
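
The difference between the two calls can be sketched with a toy in-memory store. `ToyStore` is a stand-in invented for this illustration; it only mimics the behavior described above (blind append versus delete-then-add) and says nothing about how VDMS stores entries internally.

```python
class ToyStore:
    """Toy in-memory store for illustration; not the VDMS implementation."""

    def __init__(self):
        self.entries = []  # list of (id, text) pairs

    def add(self, texts, ids):
        # Appends blindly: no check that an ID is already present
        self.entries.extend(zip(ids, texts))
        return list(ids)

    def delete(self, ids):
        drop = set(ids)
        self.entries = [(i, t) for i, t in self.entries if i not in drop]

    def upsert(self, texts, ids):
        # Remove any existing entries with these IDs, then add the new ones
        self.delete(ids)
        return self.add(texts, ids)


store = ToyStore()
store.add(["pancakes", "weather"], ids=["1", "2"])
store.add(["pancakes", "weather"], ids=["1", "2"])
print(len(store.entries))  # 4: every ID now appears twice

store.upsert(["pancakes", "weather"], ids=["1", "2"])
print(len(store.entries))  # 2: duplicates were removed before adding
```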

Update items in vector store

updated_document_1 = Document(
    page_content="I had chocolate chip pancakes and fried eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

updated_document_2 = Document(
    page_content="The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees.",
    metadata={"source": "news"},
    id=2,
)

vector_store.update_documents(
    ids=doc_ids[:2],
    documents=[updated_document_1, updated_document_2],
    batch_size=2,
)

Delete items from vector store

vector_store.delete(ids=[doc_ids[-1]])  # ids expects a list of ID strings

True

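Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

A simple similarity search can be performed as follows: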

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": ["==", "tweet"]},
)
for doc in results:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0063 seconds

* ID=3: Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* ID=8: LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]

If you want to run a similarity search and receive the corresponding scores, you can run:

results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": ["==", "news"]}
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0460 seconds

* [SIM=0.753577] The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees. [{'source': 'news'}]

If you want to run a similarity search using an embedding, you can run:

results = vector_store.similarity_search_by_vector(
    embedding=embeddings.embed_query("I love green eggs and ham!"), k=1
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0044 seconds

* The weather forecast for tomorrow is sunny and warm, with a high of 82 degrees. [{'source': 'news'}]

Query by turning into retriever

You can also transform the vector store into a retriever for easier use in your chains.

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)
results = retriever.invoke("Stealing from the bank is a crime")
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds

* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
* The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 1,
        "score_threshold": 0.0,  # >= score_threshold
    },
)
results = retriever.invoke("Stealing from the bank is a crime")
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds

* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1, "fetch_k": 10},
)
results = retriever.invoke(
    "Stealing from the bank is a crime", filter={"source": ["==", "news"]}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

INFO:langchain_vdms.vectorstores:VDMS mmr search took 0.0042 secs

* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
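
The "mmr" search type stands for maximal marginal relevance, which balances relevance to the query against redundancy among the results already chosen. The sketch below is a generic greedy version in plain Python, not the langchain-vdms implementation. The function names, the `lambda_mult` trade-off parameter, and the toy vectors (roughly unit length) are assumptions for illustration; the candidate list plays the role of the `fetch_k` documents fetched before re-ranking.

```python
def dot(a, b):
    """Inner product, used here as the similarity between near-unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns indices of selected candidates."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in remaining:
            relevance = dot(query, candidates[i])
            # Similarity to the closest already-selected item (0.0 if none yet)
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [[0.98, 0.2], [0.97, 0.24], [0.6, -0.8]]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]: near-duplicate skipped
```

With `lambda_mult=1.0` the ranking reduces to plain similarity; lowering it penalizes the second candidate for being almost identical to the first.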

Delete collection

Previously, we removed documents based on their id. Here, all documents are removed because no ID is provided.

print("Documents before deletion: ", vector_store.count())

vector_store.delete(collection_name=collection_name)

print("Documents after deletion: ", vector_store.count())

Documents before deletion:  10
Documents after deletion:  0

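Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections

Similarity search using other engines

VDMS supports various libraries for indexing and computing distances: FaissFlat (default), FaissHNSWFlat, FaissIVFFlat, Flinng, TileDBDense, and TileDBSparse. By default, the vector store uses FaissFlat. Below we show a few examples using the other engines.

Similarity search using Faiss HNSWFlat and Euclidean distance

Here, we add the documents to VDMS using the Faiss IndexHNSWFlat index and L2 as the distance metric for similarity search. We search for three documents (k=3) related to a query and return the scores along with the documents.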

db_FaissHNSWFlat = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_FaissHNSWFlat_L2",
    embedding=embeddings,
    engine="FaissHNSWFlat",
    distance_strategy="L2",
)
# Query
k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_FaissHNSWFlat.similarity_search_with_score(query, k=k, filter=None)

for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

INFO:langchain_vdms.vectorstores:Descriptor set my_collection_FaissHNSWFlat_L2 created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.1272 seconds

* [SIM=0.716791] Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* [SIM=0.936718] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* [SIM=1.834110] Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]

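Similarity search using Faiss IVFFlat and inner product (IP) distance

We add the documents to VDMS using the Faiss IndexIVFFlat index and IP as the distance metric for similarity search. We search for three documents (k=3) related to a query and return the scores along with the documents.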

db_FaissIVFFlat = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_FaissIVFFlat_IP",
    embedding=embeddings,
    engine="FaissIVFFlat",
    distance_strategy="IP",
)

k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_FaissIVFFlat.similarity_search_with_score(query, k=k, filter=None)
for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

INFO:langchain_vdms.vectorstores:Descriptor set my_collection_FaissIVFFlat_IP created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0052 seconds

* [SIM=0.641605] Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* [SIM=0.531641] LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* [SIM=0.082945] Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]
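
The L2 scores from the HNSWFlat run and the IP scores from the IVFFlat run rank the same three documents in the same order, but in opposite directions: L2 scores grow as documents get less similar, IP scores shrink. For unit-length vectors a and b, the squared Euclidean distance expands to ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b = 2 − 2 a·b. The short check below applies that identity to the scores printed above. It assumes the embeddings are unit-normalized and that the reported L2 score is the squared distance; both are assumptions inferred from the numbers, not statements taken from the VDMS documentation.

```python
# Scores copied from the two outputs above, in the same document order
ip_scores = [0.641605, 0.531641, 0.082945]  # FaissIVFFlat, IP
l2_scores = [0.716791, 0.936718, 1.834110]  # FaissHNSWFlat, L2


def l2_squared_from_ip(ip):
    """Squared L2 distance between two unit vectors with inner product ip."""
    return 2.0 - 2.0 * ip


max_gap = 0.0
for ip, l2 in zip(ip_scores, l2_scores):
    predicted = l2_squared_from_ip(ip)
    max_gap = max(max_gap, abs(predicted - l2))
    print(f"IP={ip:.6f}  predicted L2^2={predicted:.6f}  reported L2={l2:.6f}")

print(f"largest gap: {max_gap:.1e}")
```

The predicted and reported values agree to within rounding of the printed digits, which is consistent with both assumptions.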

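Similarity search using FLINNG and IP distance

In this section, we add the documents to VDMS using the FLINNG (Filters to Identify Near-Neighbor Groups) index and IP as the distance metric for similarity search. We search for three documents (k=3) related to a query and return the scores along with the documents.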

db_Flinng = VDMS.from_documents(
    documents,
    client=vdms_client,
    ids=doc_ids,
    collection_name="my_collection_Flinng_IP",
    embedding=embeddings,
    engine="Flinng",
    distance_strategy="IP",
)
# Query
k = 3
query = "LangChain provides abstractions to make working with LLMs easy"
docs_with_score = db_Flinng.similarity_search_with_score(query, k=k, filter=None)
for res, score in docs_with_score:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

INFO:langchain_vdms.vectorstores:Descriptor set my_collection_Flinng_IP created
INFO:langchain_vdms.vectorstores:VDMS similarity search took 0.0042 seconds

* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]
* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]
* [SIM=0.000000] I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]

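Filtering on metadata

It can be helpful to narrow down a collection before working with it. For example, collections can be filtered on metadata using the get_by_constraints method. A dictionary is used to filter on metadata. Here we retrieve the document where langchain_id = "2" and remove it from the vector store. Note: id is generated as additional metadata as an integer, while langchain_id (the internal ID) is a unique string for each entry.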

response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    limit=1,
    include=["metadata", "embeddings"],
    constraints={"langchain_id": ["==", "2"]},
)

# Delete id=2
db_FaissIVFFlat.delete(collection_name=db_FaissIVFFlat.collection_name, ids=["2"])

print("Deleted entry:")
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")

Deleted entry:
* ID=2: The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]

response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    include=["metadata"],
)
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")

* ID=10: I have a bad feeling I am going to get deleted :( [{'source': 'tweet'}]
* ID=9: The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* ID=8: LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]
* ID=7: The top 10 soccer players in the world right now. [{'source': 'website'}]
* ID=6: Is the new iPhone worth the price? Read this review to find out. [{'source': 'website'}]
* ID=5: Wow! That was an amazing movie. I can't wait to see it again. [{'source': 'tweet'}]
* ID=4: Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
* ID=3: Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* ID=1: I had chocolate chip pancakes and scrambled eggs for breakfast this morning. [{'source': 'tweet'}]

Here we filter on the source metadata field to retrieve only the entries whose source is "news".

response, response_array = db_FaissIVFFlat.get_by_constraints(
    db_FaissIVFFlat.collection_name,
    include=["metadata", "embeddings"],
    constraints={"source": ["==", "news"]},
)
for doc in response:
    print(f"* ID={doc.id}: {doc.page_content} [{doc.metadata}]")

* ID=9: The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}]
* ID=4: Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}]
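
The constraint dictionaries used in this section map a metadata field to an `[operator, value]` pair. The sketch below shows how such a dictionary could be evaluated against plain Python metadata. `matches` and `OPS` are hypothetical names for this example, the evaluation happens client-side rather than in the VDMS server, and only the two-element form seen in this notebook is handled.

```python
import operator

# Comparison operators written as strings, as in the constraints above
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(metadata, constraints):
    """True if metadata satisfies every {field: [op, value]} constraint."""
    for field, (op, value) in constraints.items():
        if field not in metadata or not OPS[op](metadata[field], value):
            return False
    return True


docs = [
    {"id": 1, "source": "tweet"},
    {"id": 4, "source": "news"},
    {"id": 9, "source": "news"},
]

news = [d["id"] for d in docs if matches(d, {"source": ["==", "news"]})]
print(news)  # [4, 9]

# Because id is an integer, ordering operators work on it as well
recent_news = [
    d["id"] for d in docs if matches(d, {"id": [">", 4], "source": ["==", "news"]})
]
print(recent_news)  # [9]
```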

Stop VDMS server

!docker kill vdms_vs_test_nb

vdms_vs_test_nb

API reference

TODO: add API reference
