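PGVector is a vector similarity search package for PostgreSQL. It lets us store and search high-dimensional vectors inside PostgreSQL, which makes it a good fit for similarity search over text, images, and similar data. In this post, we show how to create a vector store with PGVector and how to retrieve from it with SelfQueryRetriever.
PGVector extends PostgreSQL so that it can store and query high-dimensional vector data. Once we have created a vector store and inserted data into it, we can query it by vector similarity with retrieval tools such as SelfQueryRetriever.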
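To make "querying by vector similarity" concrete, here is a minimal pure-Python sketch that ranks a few toy vectors by cosine similarity to a query vector. The vectors, names, and helper functions are made up for illustration; a real setup would use embeddings from a model and let the database do the comparison rather than Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, vectors):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical values); real embeddings
# typically have hundreds or thousands of dimensions.
toy_vectors = {
    "dinosaur movie": [0.9, 0.1, 0.0],
    "dream movie": [0.1, 0.9, 0.2],
    "toy movie": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query_vector, toy_vectors)
print(ranking[0][0])  # name of the closest vector
```

Cosine similarity is only one common choice of measure; the idea of ranking stored vectors by closeness to a query vector is the same for other distance functions.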
Let's move on to the actual code.
First, we need to install the required Python packages.
%pip install --upgrade --quiet lark pgvector psycopg2-binary langchain langchain-community langchain-openai
We will use OpenAI embeddings to create the vectors, so we need an OpenAI API key.
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
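If the key may already be set in the environment, for example in a deployment or CI job, we may not want to prompt every time. The sketch below prompts only when the variable is missing. The helper name `ensure_env_secret` and its `prompt_fn` parameter are hypothetical names introduced here for illustration, not part of any library.

```python
import getpass
import os


def ensure_env_secret(name, prompt_fn=getpass.getpass):
    """Return the value of environment variable `name`, prompting only if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # Ask interactively and store the answer for the rest of the process.
        value = prompt_fn(f"{name}: ")
        os.environ[name] = value
    return value


# Typical use (commented out so the sketch does not prompt when merely imported):
# ensure_env_secret("OPENAI_API_KEY")
```

Making the prompt function a parameter keeps the helper easy to exercise without an interactive terminal.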
We will create a PGVector vector store and insert some sample documents into it. Each document is a brief description of a movie.
from langchain_community.vectorstores import PGVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
collection = "Name of your collection"
embeddings = OpenAIEmbeddings()
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
            "rating": 9.9,
        },
    ),
]
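# PGVector needs a PostgreSQL connection string: either pass connection_string=...
# here or set the PGVECTOR_CONNECTION_STRING environment variable beforehand.
vectorstore = PGVector.from_documents(
    docs,
    embeddings,
    collection_name=collection,
)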
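The vector store has to know which PostgreSQL instance to connect to. As a minimal sketch, the helper below assembles a SQLAlchemy-style URL of the form `postgresql+psycopg2://user:password@host:port/database`. The function name `build_connection_string` and the host, user, and password values are placeholders assumed for illustration; substitute the details of your own database.

```python
from urllib.parse import quote


def build_connection_string(host, port, database, user, password, driver="psycopg2"):
    """Assemble a SQLAlchemy-style PostgreSQL URL, percent-encoding the credentials."""
    # safe="" makes quote() encode characters such as "@", "/" and ":" that
    # would otherwise be confused with the URL's own separators.
    user_part = quote(user, safe="")
    password_part = quote(password, safe="")
    return f"postgresql+{driver}://{user_part}:{password_part}@{host}:{port}/{database}"


# Placeholder values (assumptions), including a password with awkward characters.
CONNECTION_STRING = build_connection_string(
    host="localhost",
    port=5432,
    database="postgres",
    user="postgres",
    password="p@ss word",
)
print(CONNECTION_STRING)
```

To the best of our knowledge, the resulting string can then be passed as `connection_string=CONNECTION_STRING` to `PGVector.from_documents`, or exported as the `PGVECTOR_CONNECTION_STRING` environment variable; check this against the version of `langchain_community` you have installed.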
Next, we create the SelfQueryRetriever and give it some information about the documents' metadata fields.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
Now we can try running some actual queries with our retriever.
# This example only specifies a relevant query
print(retriever.invoke("What are some movies about dinosaurs"))
# This example only specifies a filter
print(retriever.invoke("I want to watch a movie rated higher than 8.5"))
# This example specifies a query and a filter
print(retriever.invoke("Has Greta Gerwig directed any movies about women"))
# This example specifies a composite filter
print(retriever.invoke("What's a highly rated (above 8.5) science fiction film?"))
# This example specifies a query and composite filter
print(retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
))
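Conceptually, a self-query retriever turns a natural-language question into a semantic query plus a structured filter over the metadata. The sketch below illustrates only the filter half, in plain Python, using the metadata of the sample documents. `SimpleComparison`, `COMPARATORS`, and `matches` are hypothetical stand-ins written for this post, not LangChain classes, and a real retriever would translate the filter for the vector store instead of filtering in Python.

```python
import operator
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for a structured filter; not LangChain classes.
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class SimpleComparison:
    attribute: str
    comparator: str
    value: object


def matches(metadata, comparisons):
    """Return True if the metadata satisfies every comparison (logical AND)."""
    for comparison in comparisons:
        if comparison.attribute not in metadata:
            return False  # a document lacking the field cannot satisfy the condition
        compare = COMPARATORS[comparison.comparator]
        if not compare(metadata[comparison.attribute], comparison.value):
            return False
    return True


# Metadata copied from the sample documents.
movie_metadata = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "science fiction", "rating": 9.9},
]

# "What's a highly rated (above 8.5) science fiction film?" as a composite filter.
composite_filter = [
    SimpleComparison("rating", "gt", 8.5),
    SimpleComparison("genre", "eq", "science fiction"),
]
selected = [m for m in movie_metadata if matches(m, composite_filter)]
print([m["year"] for m in selected])
```

With the sample metadata, only the 1979 film passes both conditions: the 2006 film is rated above 8.5 but has no genre field, and the 1993 science fiction film is rated too low.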
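We can also use the self-query retriever to specify k, the number of documents to fetch. To do this, we pass enable_limit=True to the constructor.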
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)
# This example only specifies a relevant query
print(retriever.invoke("what are two movies about dinosaurs"))
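The effect of a limit can be pictured as keeping only the k best-scoring candidates. The following sketch shows that step in isolation. The `top_k` helper and the similarity scores are invented for illustration and do not come from a real run.

```python
import heapq


def top_k(scored_docs, k=None):
    """Return the k highest-scoring (doc, score) pairs, best first; all of them if k is None."""
    if k is None:
        return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])


# Made-up similarity scores for a query about dinosaurs (illustrative only).
scored = [
    ("Toys come alive", 0.40),
    ("Scientists bring back dinosaurs", 0.91),
    ("Three men walk into the Zone", 0.22),
    ("Dreams within dreams", 0.35),
]

print(top_k(scored, k=2))
```

In the retriever itself the limit is taken from the wording of the question ("two movies"), so no extra argument is needed at query time.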
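PGVector is well suited to scenarios that need to store high-dimensional vectors in a database and retrieve them quickly, such as text similarity search, recommendation systems, and image search. In these scenarios, vector representations can improve both the accuracy and the efficiency of retrieval.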
When using PGVector in practice, we recommend:
If you run into any problems, feel free to discuss them in the comments.