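This article systematically compares Pinecone and Milvus, two mainstream vector databases, in terms of technical characteristics, performance, and use cases. It provides integration code and best practices to help developers choose and tune a vector store for large-model applications.
With the rise of large AI models, vector data (such as embeddings of text and images) has become the core way to handle unstructured data. Traditional databases, however, cannot efficiently run similarity search over high-dimensional vectors, which is why vector databases emerged.
A vector database is built specifically to store and retrieve high-dimensional vectors. It supports fast approximate nearest neighbor (ANN) search, that is, finding the set of vectors closest to a query vector.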
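A vector embedding is the result of converting unstructured data (such as text or images) into a fixed-dimension numeric vector. For example, OpenAI's text-embedding-ada-002 model converts text into a 1536-dimensional vector.
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    input="This is a sample text",
    model="text-embedding-ada-002"
)
embedding = response.data[0].embedding
print(len(embedding))  # Output: 1536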
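Common similarity measures include cosine similarity, Euclidean distance, and the dot (inner) product.
Cosine similarity is the usual choice for judging whether two vectors are "semantically" similar, because it compares direction and ignores vector length.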
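A minimal pure-Python sketch of cosine similarity, with Euclidean distance and the dot product alongside for comparison. The function names are illustrative and not taken from any library; production code would normally use NumPy or let the database compute the metric.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: 1.0 means same direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / (norm_a * norm_b)


a, b, c = [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]
print(cosine_similarity(a, b))   # 1.0: same direction, different length
print(cosine_similarity(a, c))   # 0.0: orthogonal vectors
print(euclidean_distance(a, b))  # 1.0
```

Note that `a` and `b` are one unit apart by Euclidean distance yet identical by cosine similarity, which is why cosine is popular for text embeddings whose length carries little meaning.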
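Index type | Characteristics | Typical use case |
---|---|---|
HNSW (Hierarchical Navigable Small World) | Efficient and supports dynamic updates | Real-time recommendation systems |
IVF (Inverted File Index) | Fast, but not suited to frequent updates | Batch processing |
FLAT | Exact but slow (brute-force scan) | Small datasets |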
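The trade-off between FLAT and IVF can be illustrated with a toy sketch. This is not how Milvus or Pinecone implement their indexes; the centroids are fixed by hand (a real IVF index learns them with k-means), and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def flat_search(query: Vector, vectors: Dict[str, Vector], top_k: int) -> List[str]:
    """FLAT: compare the query with every stored vector (exact, O(n))."""
    return sorted(vectors, key=lambda vid: math.dist(query, vectors[vid]))[:top_k]


def build_ivf(vectors: Dict[str, Vector], centroids: List[Vector]) -> Dict[int, List[str]]:
    """IVF: put each vector id into the bucket of its nearest centroid."""
    buckets: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        buckets[nearest].append(vid)
    return buckets


def ivf_search(query: Vector, vectors: Dict[str, Vector], centroids: List[Vector],
               buckets: Dict[int, List[str]], nprobe: int, top_k: int) -> List[str]:
    """Scan only the `nprobe` buckets whose centroids are closest to the query."""
    probed = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probed for vid in buckets[i]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:top_k]


vectors = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0], "d": [10.0, 11.0]}
centroids = [[0.0, 0.5], [10.0, 10.5]]  # assumed; normally learned from the data
buckets = build_ivf(vectors, centroids)
query = [5.0, 5.4]

print(flat_search(query, vectors, top_k=2))                               # ['b', 'c']
print(ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=2))  # ['b', 'a']
print(ivf_search(query, vectors, centroids, buckets, nprobe=2, top_k=2))  # ['b', 'c']
```

With `nprobe=1` the search skips the bucket that holds `c` and returns a slightly worse answer. That is the "approximate" in ANN: a larger `nprobe` improves recall at the cost of scanning more vectors.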
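Aspect | Traditional database | Vector database |
---|---|---|
Data type | Structured data (integers, strings, and so on) | High-dimensional vectors |
Query style | Exact match (for example, a SQL WHERE clause) | Approximate nearest neighbor (ANN) search |
Performance | Fast inserts and exact-match queries | Efficient retrieval over large-scale vector data |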
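Pinecone is a cloud-hosted vector database service, a good fit for developers who do not want to run their own servers. The examples in this article use the legacy pinecone.init interface, which was removed in version 3.0 of the client, so pin the client to 2.x:
pip install "pinecone-client<3" openai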
import pinecone
from openai import OpenAI
# Initialize Pinecone (legacy 2.x client interface)
pinecone.init(api_key="your-api-key", environment="your-environment")
# Create the index if it does not exist yet
if "knowledge-base" not in pinecone.list_indexes():
    pinecone.create_index(
        name="knowledge-base",
        dimension=1536,  # output dimension of OpenAI ada-002
        metric="cosine"
    )
# Connect to the index
index = pinecone.Index("knowledge-base")
client = OpenAI()
# Generate the embedding
embeddings = client.embeddings.create(
    input="Text content to be vectorized",
    model="text-embedding-ada-002"
).data[0].embedding
# Insert the vector together with its metadata
index.upsert(
    vectors=[
        {
            "id": "doc1",
            "values": embeddings,
            "metadata": {"source": "article", "author": "Zhang San"}
        }
    ]
)
# Query the three most similar vectors
results = index.query(
    vector=embeddings,
    top_k=3,
    include_metadata=True
)
for result in results['matches']:
    print(f"ID: {result['id']}, Score: {result['score']}, Metadata: {result['metadata']}")
Example output (illustrative; it assumes doc2 and doc3 were upserted as well):
ID: doc1, Score: 0.987, Metadata: {'source': 'article', 'author': 'Zhang San'}
ID: doc2, Score: 0.876, Metadata: {'source': 'blog', 'author': 'Li Si'}
ID: doc3, Score: 0.765, Metadata: {'source': 'wiki', 'author': 'Wang Wu'}
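The upsert call above sends a single vector. For larger datasets, requests are usually split into batches to stay under request-size limits. The sketch below is a hypothetical helper, not part of the Pinecone client, and the batch size of 100 is an assumption to be checked against the current Pinecone limits.

```python
from typing import Dict, Iterator, List


def chunked(items: List[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy 4-dimensional vectors stand in for real 1536-dimensional embeddings.
vectors = [
    {"id": f"doc{i}", "values": [0.0] * 4, "metadata": {"source": "article"}}
    for i in range(250)
]
batches = list(chunked(vectors, batch_size=100))
print([len(batch) for batch in batches])  # [100, 100, 50]

# With a live index, each batch would then be sent separately:
# for batch in batches:
#     index.upsert(vectors=batch)
```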
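Milvus is an open-source vector database that supports local deployment and a distributed architecture, which suits organizations with strict data-privacy requirements.
Start it quickly with Docker (if the container exits right after starting, use the standalone startup script from the official Milvus documentation instead, which supplies the configuration the image expects):
docker run -d --name milvusdb -p 19530:19530 milvusdb/milvus:v2.4.3
pip install pymilvus numpy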
from pymilvus import connections, Collection, CollectionSchema, DataType, FieldSchema, utility
import random
# Connect to Milvus
connections.connect(host='localhost', port='19530')
# Create the collection, or reuse it if it already exists
collection_name = "document_store"
if utility.has_collection(collection_name):
    collection = Collection(collection_name)
else:
    # Fields must be FieldSchema objects, not plain dicts
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="author", dtype=DataType.VARCHAR, max_length=256)
    ]
    schema = CollectionSchema(fields, description="Document Store")
    collection = Collection(collection_name, schema=schema)
    # A vector index is required before the collection can be loaded and searched
    collection.create_index(
        field_name="embedding",
        index_params={"index_type": "IVF_FLAT", "metric_type": "COSINE", "params": {"nlist": 128}}
    )
# Insert data (random vectors stand in for real embeddings)
num_entities = 1000
ids = list(range(num_entities))
embeddings = [[random.random() for _ in range(1536)] for _ in range(num_entities)]
sources = ["article"] * num_entities
authors = ["Zhang San"] * num_entities
collection.insert([ids, embeddings, sources, authors])
collection.flush()
# Load the collection into memory, then search with a metadata filter
collection.load()
search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
results = collection.search(
    data=[embeddings[0]],
    anns_field="embedding",
    param=search_params,
    limit=3,
    expr="source == 'article'",
    output_fields=["source", "author"]
)
for hit in results[0]:
    print(f"ID: {hit.id}, Distance: {hit.distance}, Source: {hit.entity.get('source')}")
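Example output (values are illustrative; with the COSINE metric Milvus reports a similarity score in the distance field, so the query vector itself scores about 1.0 and higher means closer):
ID: 0, Distance: 1.0, Source: article
ID: 123, Distance: 0.77, Source: article
ID: 456, Distance: 0.76, Source: article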
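The `expr` argument in the search above is a plain string, so filters with several conditions are easy to get wrong when assembled by hand. The following is a minimal sketch of a hypothetical helper (not part of pymilvus) that builds an equality filter joined with `and`. It rejects values containing quotes or backslashes rather than guessing at escaping rules.

```python
from typing import Dict, Union


def build_filter_expr(conditions: Dict[str, Union[str, int]]) -> str:
    """Build a Milvus-style boolean expression from field/value equality pairs."""
    parts = []
    for field, value in conditions.items():
        if isinstance(value, bool):
            raise TypeError("bool values are not handled by this sketch")
        if isinstance(value, str):
            if "'" in value or "\\" in value:
                raise ValueError(f"unsupported character in value for {field!r}")
            parts.append(f"{field} == '{value}'")
        elif isinstance(value, int):
            parts.append(f"{field} == {value}")
        else:
            raise TypeError(f"unsupported value type for {field!r}")
    return " and ".join(parts)


print(build_filter_expr({"source": "article", "author": "Zhang San"}))
# source == 'article' and author == 'Zhang San'
```

The resulting string can be passed as `expr` to `collection.search`, assuming the named fields exist in the collection schema.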
LangChain is a framework that makes it straightforward to plug vector databases into an LLM application.
pip install langchain pinecone-client pymilvus openai
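# Note: in recent LangChain releases these classes live in langchain_community.vectorstores
# and langchain_community.embeddings; the import paths below are the older ones.
from langchain.vectorstores import Pinecone as LangchainPinecone
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Pinecone: wrap an existing index; text_key names the metadata field that holds the text
vectorstore = LangchainPinecone.from_existing_index(
    index_name="knowledge-base",
    embedding=embeddings,
    text_key="content"
)
# Milvus: connect to a collection on a local instance
from langchain.vectorstores import Milvus as LangchainMilvus
milvus_store = LangchainMilvus(
    embedding_function=embeddings,
    collection_name="document_store",
    connection_args={"host": "localhost", "port": "19530"}
)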
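Aspect | Pinecone | Milvus |
---|---|---|
Deployment | Cloud-hosted | Self-hosted or cloud-hosted |
Ease of use | High (works out of the box) | Medium (requires deployment) |
Cost | Usage-based billing | Free and open source (you pay for your own infrastructure) |
Scalability | Scales automatically | Supports distributed deployment |
Community and support | Commercial support | Active open-source community |
Feature richness | Rich | More flexible and customizable |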
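The first hands-on case is a hybrid search system that combines Elasticsearch keyword search with Milvus vector search behind LangChain:
[User query]
     ↓
[LangChain hybrid search]
 ├─→ [Elasticsearch keyword search]
 └─→ [Milvus vector similarity search]
     ↓
[Result fusion and ranking]
     ↓
[Final results]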
pip install pymilvus elasticsearch langchain openai sentence-transformers fastapi uvicorn
docker pull milvusdb/milvus:v2.4.3
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.3
from pymilvus import connections, Collection, utility, FieldSchema, CollectionSchema, DataType
# Connect to Milvus
connections.connect(host='localhost', port='19530')
# Create the collection
collection_name = "document_vectors"
if not utility.has_collection(collection_name):
    fields = [
        FieldSchema(name="doc_id", dtype=DataType.INT64, is_primary=True),
        # all-MiniLM-L6-v2 (used below) produces 384-dimensional vectors
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
        FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535)
    ]
    schema = CollectionSchema(fields, description="Document vector store")
    collection = Collection(collection_name, schema=schema)
    # Create the index
    index_params = {
        "index_type": "IVF_FLAT",
        "params": {"nlist": 1024},
        "metric_type": "COSINE"
    }
    collection.create_index(field_name="embedding", index_params=index_params)
else:
    collection = Collection(collection_name)
# Load the collection into memory so it can be searched
collection.load()
from elasticsearch import Elasticsearch
# Connect to Elasticsearch
es_client = Elasticsearch(hosts=["http://localhost:9200"])
# Create the index with a custom analyzer
index_name = "document_search"
if not es_client.indices.exists(index=index_name):
    es_client.indices.create(
        index=index_name,
        body={
            "settings": {
                "analysis": {
                    "analyzer": {
                        "custom_analyzer": {
                            "type": "custom",
                            "tokenizer": "whitespace",
                            "filter": ["lowercase", "stop", "stemmer"]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "title": {"type": "text", "analyzer": "custom_analyzer"},
                    "content": {"type": "text", "analyzer": "custom_analyzer"}
                }
            }
        }
    )
from typing import List
from sentence_transformers import SentenceTransformer
class VectorService:
    def __init__(self):
        # all-MiniLM-L6-v2 outputs 384-dimensional embeddings
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
    def text_to_vector(self, text: str) -> List[float]:
        return self.model.encode(text).tolist()
vector_service = VectorService()
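from langchain.vectorstores import Milvus
from langchain.schema import Document
from langchain.embeddings import HuggingFaceEmbeddings
# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Connect to the Milvus vector store. The field-name arguments map LangChain onto the
# schema created above; they are keyword arguments in recent LangChain releases.
milvus_store = Milvus(
    embedding_function=embeddings,
    collection_name=collection_name,
    connection_args={"host": "localhost", "port": "19530"},
    primary_field="doc_id",
    text_field="content",
    vector_field="embedding"
)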
def hybrid_search(query: str, top_k: int = 5):
    # Vector search; each hit is a (Document, score) pair
    vector_results = milvus_store.similarity_search_with_score(query, k=top_k)
    # Keyword search
    es_results = es_client.search(
        index=index_name,
        body={
            "size": top_k,
            "query": {"multi_match": {"query": query, "fields": ["title^2", "content"]}}
        }
    )
    # Merge the two result lists
    combined_results = []
    for hit in es_results['hits']['hits']:
        combined_results.append({
            "source": "elasticsearch",
            "score": hit['_score'],
            "title": hit['_source']['title'],
            "content": hit['_source']['content']
        })
    for doc, score in vector_results:
        combined_results.append({
            "source": "milvus",
            "score": score,
            "title": doc.metadata.get("title", ""),
            "content": doc.page_content
        })
    # Sort by score. BM25 scores and vector scores are on different scales,
    # so this ordering is only a rough heuristic.
    combined_results.sort(key=lambda x: x["score"], reverse=True)
    return combined_results[:top_k]
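Sorting Elasticsearch and Milvus hits by raw score mixes two incomparable scales. One common alternative is Reciprocal Rank Fusion (RRF), which uses only each document's rank in every result list. The sketch below is a swapped-in technique, not the article's original fusion step; the document ids are made up, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_k: int = 5) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


keyword_hits = ["d1", "d2", "d3"]  # hypothetical Elasticsearch ranking
vector_hits = ["d3", "d1", "d4"]   # hypothetical Milvus ranking
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents found by both retrievers (`d1`, `d3`) rise above those found by only one, without any score normalization.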
# Start Milvus
docker run -d --name milvusdb -p 19530:19530 milvusdb/milvus:v2.4.3
# Start Elasticsearch (security is disabled so the plain-HTTP client above can connect; local development only)
docker run -d --name es-node -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.3
# Install dependencies
pip install -r requirements.txt
# Start the API service
uvicorn search_api:app --host 0.0.0.0 --port 8000
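The second hands-on case is a recommendation system that stores user-behavior vectors in Pinecone:
[User behavior data]
     ↓
[Feature engineering]
     ↓
[Pinecone vector storage]
     ↓
[Similar-user lookup]
     ↓
[Content recommendation]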
pip install "pinecone-client<3" openai pandas numpy scikit-learn fastapi uvicorn
import os
import pinecone
from openai import OpenAI
# Initialize Pinecone (legacy 2.x client interface); the key comes from the environment
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="northamerica-northeast1-gcp")
# Create the index
index_name = "user-behavior"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,  # OpenAI embedding dimension
        metric="cosine"
    )
# Connect to the index
index = pinecone.Index(index_name)
client = OpenAI()
from typing import List
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
class UserBehaviorVectorizer:
    def __init__(self):
        self.scaler = StandardScaler()
    def _extract_features(self, user_data: dict) -> pd.DataFrame:
        # Example feature extraction
        features = {
            "avg_session_time": user_data.get("session_duration", 0),
            "click_rate": len(user_data.get("clicked_items", [])) / max(1, user_data.get("impressions", 1)),
            "purchase_frequency": len(user_data.get("purchases", [])) / max(1, user_data.get("days_active", 1)),
            "category_preference": user_data.get("preferred_category", "other"),
            "device_usage": user_data.get("device_type", "desktop"),
            "location": user_data.get("location", "unknown")
        }
        return pd.DataFrame([features])
    def _normalize_features(self, df: pd.DataFrame) -> np.ndarray:
        # Standardize numeric features. Fitting on a single row yields all zeros, so in
        # practice fit the scaler once on a batch of users and call transform() here.
        numerical = self.scaler.fit_transform(df[["avg_session_time", "click_rate", "purchase_frequency"]])
        # One-hot encode categorical features
        categorical = pd.get_dummies(df[["category_preference", "device_usage", "location"]])
        # Concatenate both feature groups
        return np.hstack([numerical, categorical.values])
    def vectorize(self, user_data: dict) -> List[float]:
        df = self._extract_features(user_data)
        # The embeddings endpoint takes text, not float arrays, so describe the
        # features as a short string and embed that string
        feature_text = ", ".join(f"{name}={value}" for name, value in df.iloc[0].items())
        response = client.embeddings.create(
            input=feature_text,
            model="text-embedding-ada-002"
        )
        return response.data[0].embedding
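One caveat about the standardization step in the vectorizer: a z-score is only meaningful when its mean and standard deviation come from many users. Fitting on a single row maps every numeric feature to zero. The sketch below shows this with the standard library only. `fit_scaler` and `transform` are hypothetical helpers that mirror what `StandardScaler` computes (population standard deviation), not its actual API.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def fit_scaler(rows: List[Dict[str, float]], keys: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """Learn (mean, population std) per feature from a batch of users."""
    stats = {}
    for key in keys:
        values = [row[key] for row in rows]
        std = pstdev(values)
        stats[key] = (mean(values), std if std > 0 else 1.0)  # avoid division by zero
    return stats


def transform(row: Dict[str, float], stats: Dict[str, Tuple[float, float]]) -> List[float]:
    """Apply previously learned statistics to one user."""
    return [(row[key] - mu) / sigma for key, (mu, sigma) in stats.items()]


users = [{"avg_session_time": 10.0}, {"avg_session_time": 20.0}, {"avg_session_time": 30.0}]

population_stats = fit_scaler(users, ["avg_session_time"])
print(transform(users[2], population_stats))  # roughly [1.22]

single_stats = fit_scaler(users[2:], ["avg_session_time"])
print(transform(users[2], single_stats))      # [0.0]: the signal is gone
```

In a service like this one, the statistics would be fitted offline on historical users and only `transform` would run per request.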
class RecommendationEngine:
    def __init__(self):
        self.user_vectorizer = UserBehaviorVectorizer()
    def add_user(self, user_id: str, user_data: dict):
        # Vectorize the user's behavior
        vector = self.user_vectorizer.vectorize(user_data)
        # Store it in Pinecone; the keys read here match the UserData request model
        index.upsert(vectors=[{
            "id": user_id,
            "values": vector,
            "metadata": {
                "preferences": user_data.get("preferred_category", ""),
                "recent_purchases": user_data.get("purchases", []),
                "location": user_data.get("location", "")
            }
        }])
    def get_recommendations(self, user_id: str, top_k: int = 5):
        # Fetch the user's vector
        result = index.fetch(ids=[user_id])
        if not result.vectors:
            return []
        user_vector = result.vectors[user_id].values
        # Find similar users
        similar_users = index.query(
            vector=user_vector,
            top_k=top_k + 1,  # one extra, because the user matches themselves
            include_metadata=True
        )
        # Collect what similar users bought recently
        recommendations = []
        for match in similar_users.matches:
            if match.id == user_id:
                continue
            metadata = match.metadata
            recommendations.extend(metadata.get("recent_purchases", []))
        # De-duplicate while keeping the order (most similar users first)
        return list(dict.fromkeys(recommendations))[:top_k]
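The engine above treats every similar user equally. A possible refinement, sketched here as an assumption rather than as part of the original design, is to weight each candidate item by the similarity scores of the users who bought it and to skip items the target user already owns. The match dictionaries imitate the shape of query results; the data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def rank_items(matches: List[Dict], already_owned: Set[str], top_k: int = 5) -> List[str]:
    """Rank items by the summed similarity of the users who purchased them."""
    weights: Dict[str, float] = defaultdict(float)
    for match in matches:
        for item in match["purchases"]:
            if item not in already_owned:
                weights[item] += match["score"]
    # Highest weight first; ties are broken alphabetically for determinism.
    return sorted(weights, key=lambda item: (-weights[item], item))[:top_k]


matches = [
    {"score": 0.9, "purchases": ["book", "lamp"]},
    {"score": 0.8, "purchases": ["lamp", "desk"]},
    {"score": 0.5, "purchases": ["book"]},
]
print(rank_items(matches, already_owned={"desk"}))  # ['lamp', 'book']
```

Here `lamp` outranks `book` because the two users who bought it are more similar to the target user (0.9 + 0.8) than those who bought `book` (0.9 + 0.5).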
from typing import List
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
app = FastAPI()
class UserData(BaseModel):
    session_duration: float
    clicked_items: List[str]
    impressions: int
    purchases: List[str]
    days_active: int
    preferred_category: str
    device_type: str
    location: str
recommendation_engine = RecommendationEngine()
@app.post("/users/{user_id}/update")
async def update_user_profile(user_id: str, user_data: UserData):
    try:
        recommendation_engine.add_user(user_id, user_data.dict())
        return {"status": "success"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/users/{user_id}/recommendations")
async def get_recommendations(user_id: str, top_k: int = 5):
    try:
        results = recommendation_engine.get_recommendations(user_id, top_k)
        return {"recommendations": results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
# Set environment variables
export PINECONE_API_KEY="your-pinecone-api-key"
export OPENAI_API_KEY="your-openai-api-key"
# Install dependencies
pip install -r requirements.txt
# Start the service
uvicorn recommendation_api:app --host 0.0.0.0 --port 8000
The two case studies above include complete code and deployment steps, and can be extended and tuned to fit specific needs.
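Problem | Solution |
---|---|
Vector dimension mismatch | Check that the embedding model's output dimension matches the index or collection dimension |
Query timeout | Increase the timeout or optimize the index |
Out of memory | Reduce the number of index shards or upgrade the hardware |
Insert failure | Check the primary-key uniqueness constraint |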
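Dimension mismatches are easiest to catch before the vector reaches the database. A minimal sketch of a pre-insert check follows; `validate_vector` is a hypothetical helper, and the dimensions are examples only.

```python
import math
from typing import Sequence


def validate_vector(vector: Sequence[float], expected_dim: int) -> None:
    """Raise ValueError if the vector cannot be stored in an index of `expected_dim`."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if any(math.isnan(x) or math.isinf(x) for x in vector):
        raise ValueError("vector contains NaN or infinite values")


validate_vector([0.1] * 384, expected_dim=384)  # passes silently

try:
    validate_vector([0.1] * 768, expected_dim=384)
except ValueError as err:
    print(err)  # expected 384 dimensions, got 768
```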
This article is part of the column 《AI大模型应知应会100篇》 (100 Essentials of Large AI Models).