Welcome to Day 21 of the "RAG in Practice" series! Today we take a deep dive into pre-retrieval processing and query rewriting in RAG systems: the core principles and how to implement them. When building a high-quality RAG system, raw user queries are often imprecise or incomplete, and feeding them directly to the retriever degrades results. Query preprocessing and rewriting markedly improve retrieval quality and are a key stage of any production-grade RAG system.
By the end of this article you will understand why raw queries retrieve poorly, how to rewrite them with rules and with LLMs, how to expand them with domain knowledge, and how to assemble these steps into a production preprocessing pipeline.
In a RAG pipeline, pre-retrieval processing aims to turn raw user input into queries that retrieve precisely and completely. Unprocessed raw queries typically exhibit the following problems:
Problem | Example | Consequence |
---|---|---|
Vagueness | "latest" in "the latest policy" is undefined | Stale or inaccurate results |
Incompleteness | "how to install it" omits the software name | Overly generic results |
Ambiguity | "Python" the language vs. the snake | Documents on the wrong topic |
Colloquialism | "这玩意儿怎么用" ("How do I use this thing?") | Hard to match professional documentation |
The main query rewriting techniques are rule-based rewriting, LLM-based rewriting, and query expansion (synonyms, domain terms, and acronym expansion); each is implemented below.
A standard query preprocessing pipeline runs the following steps:
```python
import re

def preprocess_query(raw_query, context=None):
    """Full query preprocessing pipeline.

    Each helper is implemented in the pipeline class later in this article.
    """
    cleaned = clean_text(raw_query)        # 1. basic cleaning
    intent = detect_intent(cleaned)        # 2. intent detection
    entities = extract_entities(cleaned)   # 3. entity extraction
    rewritten = rewrite_query(cleaned, intent, entities, context)  # 4. rewriting
    expanded = expand_query(rewritten, intent, entities)           # 5. expansion
    return expanded

def clean_text(text):
    """Basic text cleaning: strip punctuation, collapse whitespace, lowercase."""
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()
```
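The cleaning helper can be exercised on its own (the function is repeated here so the snippet runs standalone). One caveat worth noticing: stripping all punctuation also removes dots in version numbers, which a stricter pipeline may want to whitelist.

```python
import re

def clean_text(text):
    """Same cleaning helper as above: strip punctuation, collapse spaces, lowercase."""
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

print(clean_text("Python,,  怎么  安装??"))  # punctuation and extra spaces removed
print(clean_text("Python 3.8!"))             # note: the dot in "3.8" is stripped too
```

Python's `\w` is Unicode-aware, so Chinese characters survive the punctuation filter.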
LLM-based rewriting leverages a large language model's language understanding for intelligent query rewriting:
```python
from openai import OpenAI

client = OpenAI()

def llm_rewrite_query(query, context=None):
    """Rewrite a query with an LLM."""
    # Chinese prompt, roughly: "You are a query optimization assistant.
    # Rewrite the user query: keep the meaning, use professional wording,
    # remove ambiguity, add relevant context where necessary."
    prompt = f"""
你是一个专业的查询优化助手。请根据以下规则重写用户查询:
1. 保持原意不变
2. 使用更专业的表达
3. 消除可能的歧义
4. 必要时添加相关上下文

原始查询: {query}
{f"上下文: {context}" if context else ""}

请输出优化后的查询:
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150,
    )
    return response.choices[0].message.content.strip()

# Example usage
original_query = "这玩意儿怎么安装"  # "How do I install this thing?"
rewritten = llm_rewrite_query(original_query)
print(f"Original query: {original_query}")
print(f"Rewritten: {rewritten}")
```
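One operational concern the snippet above glosses over: the API call can fail, or return something unusable. A small guard that falls back to the raw query keeps the pipeline alive; `safe_rewrite` and its thresholds are illustrative additions, not part of the original code.

```python
def safe_rewrite(query, rewrite_fn, max_ratio=4):
    """Call a rewrite function, but fall back to the original query when the
    call fails or the result looks implausible (empty, or far longer than
    the input would reasonably become)."""
    try:
        rewritten = rewrite_fn(query)
    except Exception:
        return query  # API/network failure: degrade gracefully
    if not rewritten or len(rewritten) > max_ratio * max(len(query), 10):
        return query
    return rewritten

def failing_rewrite(q):
    raise RuntimeError("simulated API outage")

print(safe_rewrite("这玩意儿怎么安装", failing_rewrite))  # falls back to the original query
```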
Rule-driven query expansion combines domain knowledge with simple matching rules:
```python
class QueryExpander:
    def __init__(self, knowledge_base):
        self.kb = knowledge_base  # domain knowledge base

    def expand(self, query):
        """Rule-based query expansion."""
        synonyms = self._get_synonyms(query)      # 1. synonym expansion
        terms = self._get_related_terms(query)    # 2. domain-term expansion
        acronyms = self._expand_acronyms(query)   # 3. acronym expansion
        expanded = f"{query} {' '.join(synonyms)} {' '.join(terms)} {' '.join(acronyms)}"
        return " ".join(expanded.split())  # collapse any double spaces

    def _get_synonyms(self, text):
        """Collect synonyms by substring matching: Chinese text has no
        whitespace word boundaries, so splitting on spaces would miss them."""
        synonyms = []
        for word, syns in self.kb.synonyms.items():
            if word in text:
                synonyms.extend(syns)
        return sorted(set(synonyms))

    def _get_related_terms(self, text):
        """Collect related domain terms for every known term in the text."""
        related = []
        for term, relations in self.kb.term_relations.items():
            if term in text:
                related.extend(relations)
        return sorted(set(related))

    def _expand_acronyms(self, text):
        """Expand known acronyms found in the text."""
        return [exp for acr, exp in self.kb.acronyms.items() if acr in text]

# Example knowledge base
class KnowledgeBase:
    def __init__(self):
        self.synonyms = {
            "安装": ["部署", "配置", "设置"],  # install: deploy / configure / set up
            "错误": ["异常", "问题", "故障"],  # error: exception / issue / fault
        }
        self.terms = ["Python", "Docker"]
        self.term_relations = {
            "Python": ["编程语言", "3.8版本", "虚拟环境"],
            "Docker": ["容器", "镜像", "编排"],
        }
        self.acronyms = {
            "API": "应用程序接口",
            "CPU": "中央处理器",
        }

# Example usage
kb = KnowledgeBase()
expander = QueryExpander(kb)
query = "Python安装出现API错误"  # "Python install throws an API error"
expanded = expander.expand(query)
print(f"Expanded query: {expanded}")
```
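Flat concatenation gives expansion terms the same weight as the user's own words, which can dilute precision in bag-of-words retrievers such as BM25. A crude counterweight, sketched below as an illustrative addition (not part of the original expander), is to repeat the original query so its terms dominate term-frequency scoring:

```python
def weighted_expand(query, expansions, query_weight=2):
    """Repeat the original query so its terms outweigh expansion terms in
    term-frequency scoring; dedupe and sort expansions for determinism."""
    parts = [query] * query_weight + sorted(set(expansions))
    return " ".join(parts)

print(weighted_expand("Python安装", ["部署", "配置", "部署"]))
```

Dense (embedding-based) retrievers are less sensitive to term repetition, so this trick mainly matters for sparse retrieval.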
Next, an end-to-end query preprocessing pipeline that integrates these techniques:
```python
import re
from typing import Dict, List, Optional

class QueryPreprocessor:
    def __init__(self, llm_client=None, knowledge_base=None):
        self.llm = llm_client
        self.kb = knowledge_base
        self.cache = {}  # cache processed queries

    def process(self, raw_query: str, context: Optional[Dict] = None) -> str:
        """End-to-end query processing pipeline."""
        cache_key = f"{raw_query}-{context}"
        if cache_key in self.cache:
            return self.cache[cache_key]
        cleaned = self._clean_text(raw_query)       # 1. basic cleaning
        intent = self._detect_intent(cleaned)       # 2. intent detection
        entities = self._extract_entities(cleaned)  # 3. entity extraction
        rewritten = self._rewrite_query(cleaned, intent, entities, context)  # 4. rewriting
        expanded = self._expand_query(rewritten, intent, entities)           # 5. expansion
        self.cache[cache_key] = expanded
        return expanded

    def _clean_text(self, text: str) -> str:
        """Basic cleaning: strip punctuation, collapse whitespace, lowercase."""
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text.lower()

    def _detect_intent(self, text: str) -> str:
        """Simple keyword-based intent detection."""
        if any(word in text for word in ["如何", "怎么", "怎样"]):  # "how to"
            return "howto"
        if any(word in text for word in ["错误", "问题", "异常"]):  # "error/issue"
            return "troubleshooting"
        if any(word in text for word in ["最新", "当前", "现在"]):  # "latest/current"
            return "latest_info"
        return "general"

    def _extract_entities(self, text: str) -> List[str]:
        """Toy entity recognition over a fixed vocabulary."""
        return [name for name in ("python", "docker") if name in text]

    def _rewrite_query(self, query: str, intent: str,
                       entities: List[str], context: Optional[Dict]) -> str:
        """Query rewriting: prefer the LLM when available, else fall back to rules."""
        if self.llm:
            return self._llm_rewrite(query, context)
        rewritten = query
        if intent == "howto":
            if "安装" in query:  # "install"
                rewritten = f"如何正确安装和配置 {entities[0] if entities else ''}".strip()
            elif "使用" in query:  # "use"
                rewritten = f"{entities[0] if entities else '工具'} 的正确使用方法和最佳实践"
        elif intent == "troubleshooting":
            rewritten = f"{entities[0] if entities else '系统'} 常见问题和解决方案"
        return rewritten or query

    def _llm_rewrite(self, query: str, context: Optional[Dict]) -> str:
        """Rewrite the query with the LLM client."""
        # Chinese prompt, roughly: "Rewrite the query to be more professional
        # and precise while preserving its meaning."
        prompt = f"请将以下查询改写为更专业、明确的版本,保持原意不变:\n{query}"
        if context:
            prompt += f"\n上下文信息:{context}"
        response = self.llm.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=150,
        )
        return response.choices[0].message.content.strip()

    def _expand_query(self, query: str, intent: str, entities: List[str]) -> str:
        """Query expansion via the knowledge base, plus intent-specific terms."""
        if not self.kb:
            return query
        expanded = QueryExpander(self.kb).expand(query)
        if intent == "howto":
            expanded += " 步骤 指南 教程"    # steps / guide / tutorial
        elif intent == "troubleshooting":
            expanded += " 解决方案 修复方法"  # solutions / fixes
        return expanded

# Example usage
preprocessor = QueryPreprocessor()
query = "python安装出错怎么办"  # "python install fails, what do I do?"
processed = preprocessor.process(query)
print(f"Original query: {query}")
print(f"Processed: {processed}")
```
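The dict cache in `QueryPreprocessor` grows without bound, which matters in a long-running service. A bounded LRU variant can be sketched with `collections.OrderedDict`; `LRUCache` is a hypothetical helper, not part of the original class.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache: evicts the least-recently-used entry when full."""

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # drop the oldest entry
```

Wiring it in would mean replacing the `in`/`[]` cache accesses in `process` with `get`/`put` calls.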
Here is how to integrate the preprocessor into a LangChain RAG system:
```python
from typing import Any, List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class PreprocessedRetriever(BaseRetriever):
    """Retriever that preprocesses the query before delegating."""

    # BaseRetriever is a Pydantic model, so components are declared as
    # fields rather than assigned in __init__
    retriever: BaseRetriever
    preprocessor: Any

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        processed_query = self.preprocessor.process(query)  # 1. preprocess the query
        return self.retriever.invoke(processed_query)       # 2. retrieve with it

# Example usage
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize the vector store
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Create the preprocessing retriever
preprocessor = QueryPreprocessor()
preprocessed_retriever = PreprocessedRetriever(
    retriever=retriever, preprocessor=preprocessor
)

# Use it
query = "这玩意儿怎么装"  # "How do I install this thing?"
docs = preprocessed_retriever.invoke(query)
print(f"Retrieved {len(docs)} relevant documents")
```
Case study: an e-commerce platform's customer-service bot fields a high volume of inquiries about order status and logistics, returns and refunds, promotions, and payment options. As the table earlier suggests, raw user queries in this setting are typically colloquial, incomplete, and ambiguous. The team therefore builds a dedicated domain preprocessor:
```python
class EcommerceQueryPreprocessor(QueryPreprocessor):
    """Query preprocessor specialized for the e-commerce domain."""

    def _detect_intent(self, text: str) -> str:
        """E-commerce intent detection (keyword-based)."""
        if any(word in text for word in ["哪里", "到哪", "何时"]):  # where / when
            return "order_status"
        if any(word in text for word in ["退货", "换货", "退款"]):  # returns / refunds
            return "return_policy"
        if any(word in text for word in ["优惠", "折扣", "活动"]):  # deals / discounts
            return "promotion"
        if any(word in text for word in ["支付", "付款", "分期"]):  # payment / installments
            return "payment"
        return super()._detect_intent(text)

    def _extract_entities(self, text: str) -> List[str]:
        """E-commerce entity recognition."""
        entities = super()._extract_entities(text)
        if "订单" in text or "物流" in text:       # order / logistics
            entities.append("order")
        if "支付宝" in text or "微信支付" in text:  # Alipay / WeChat Pay
            entities.append("payment_method")
        return entities

    def _rewrite_query(self, query: str, intent: str,
                       entities: List[str], context: Optional[Dict]) -> str:
        """E-commerce rewriting: map each intent to a canonical query."""
        if intent == "order_status":
            return "如何查询订单物流状态"       # how to check order logistics status
        if intent == "return_policy":
            return "当前退换货政策和操作流程"   # current return policy and process
        if intent == "promotion":
            return "最新优惠活动和适用条件"     # current promotions and conditions
        if intent == "payment":
            return "支付方式和分期政策说明"     # payment methods and installment policy
        return super()._rewrite_query(query, intent, entities, context)

# E-commerce knowledge base
class EcommerceKB(KnowledgeBase):
    def __init__(self):
        super().__init__()
        self.synonyms.update({
            "退货": ["退换货", "退款", "售后"],  # return: exchange / refund / after-sales
            "订单": ["物流", "运输", "配送"],    # order: logistics / shipping / delivery
        })
        self.term_relations.update({
            "支付": ["支付宝", "微信支付", "银行卡"],
            "优惠": ["折扣", "满减", "促销"],
        })

# Example usage
kb = EcommerceKB()
preprocessor = EcommerceQueryPreprocessor(knowledge_base=kb)
queries = [
    "我买的东西到哪了",  # "Where is my order?"
    "怎么退货",          # "How do I return something?"
    "最新优惠",          # "What are the latest deals?"
    "免息分期",          # "Interest-free installments?"
]
for query in queries:
    processed = preprocessor.process(query)
    print(f"Original: {query} -> Processed: {processed}")
```
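The intent routing above can be spot-checked without the full class hierarchy; the keyword tables are repeated here so the snippet runs standalone:

```python
# Keyword groups mirror EcommerceQueryPreprocessor._detect_intent;
# the first matching group wins, matching the class's if/elif order.
INTENT_KEYWORDS = [
    ("order_status", ["哪里", "到哪", "何时"]),
    ("return_policy", ["退货", "换货", "退款"]),
    ("promotion", ["优惠", "折扣", "活动"]),
    ("payment", ["支付", "付款", "分期"]),
]

def detect_ecommerce_intent(text):
    """Return the first intent whose keyword list matches the text."""
    for intent, keywords in INTENT_KEYWORDS:
        if any(word in text for word in keywords):
            return intent
    return "general"

print(detect_ecommerce_intent("我买的东西到哪了"))  # order_status
print(detect_ecommerce_intent("免息分期"))          # payment
```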
After rolling out query preprocessing, the platform saw a clear improvement in retrieval relevance across these query categories.
Technique | Strengths | Best suited for |
---|---|---|
Rule-based rewriting | Controllable, fast | Well-defined domains, fixed query patterns |
LLM rewriting | Flexible, handles complex queries | Open domains, diverse queries |
Query expansion | Improves recall | Specialized, terminology-rich domains |
Intent detection | Pinpoints the user's need | Mixed-intent scenarios |
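The trade-offs above suggest a hybrid dispatch: try cheap rules first and reserve the LLM for queries the rules do not cover. A minimal sketch (function and parameter names are illustrative, not from the original):

```python
def choose_rewrite_strategy(query, has_llm, known_patterns):
    """Route a query per the trade-offs above: rules for recognized domain
    patterns (fast, deterministic), the LLM for everything else when one
    is available."""
    if any(pattern in query for pattern in known_patterns):
        return "rules"
    return "llm" if has_llm else "rules"

print(choose_rewrite_strategy("怎么退货", True, ["退货", "订单"]))    # rules
print(choose_rewrite_strategy("帮我看看这个", True, ["退货", "订单"]))  # llm
```

In production the dispatch condition is usually richer (pattern confidence, query length, traffic class), but the shape stays the same.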
Rule-based methods: deterministic and cheap, but brittle; they need continual maintenance as query patterns drift and fail silently on inputs the rules do not cover.
LLM-based methods: handle open-ended queries well, but add latency and per-query cost, and rewrites should be validated so they do not drift from the user's intent.
Shared challenges: over-expansion can hurt precision, and preprocessing quality is hard to measure in isolation from end-to-end retrieval metrics.
In today's article we took a deep dive into pre-retrieval processing and query rewriting for RAG systems: why raw queries retrieve poorly, rule-based and LLM-based rewriting, knowledge-driven query expansion, an end-to-end preprocessing pipeline, LangChain integration, and an e-commerce customer-service case study.
The key takeaway: cleaning, rewriting, and expanding the query before it reaches the retriever is one of the cheapest and most effective levers for improving RAG quality.
Tomorrow we move on to Day 22: Implementing Hybrid Retrieval Strategies, where we will combine multiple retrieval methods to build an even stronger RAG system.
Tags: RAG, retrieval-augmented generation, query optimization, information retrieval, NLP, LLM applications
Summary: This is the 21st installment of the "RAG in Practice" series, covering the principles and implementation of pre-retrieval processing and query rewriting in RAG systems. The article walks through the query preprocessing pipeline and both rule-based and LLM-based rewriting, with complete Python implementations. A real e-commerce customer-service case study shows how these techniques noticeably improve retrieval quality in a production business scenario. Developers will find practical methods and best practices for building an effective query preprocessing pipeline.