Contents
1. Code
(1) Data loading and processing
(2) Creating embeddings
(3) Building the ChromaDB index
(4) Configuring the LLM pipeline
(5) Building the QA chain and querying
(6) LCEL (LangChain Expression Language)
2. Result
3. Understanding RAG and LangChain
3.1 RAG (Retrieval Augmented Generation)
3.2 LangChain
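Source notebook: Large-Language-Model-Notebooks-Course/3-LangChain/3_1_RAG_langchain.ipynb at main · peremartra/Large-Language-Model-Notebooks-Course · GitHub
Task: implement a question-answering system over local documents, using a Hugging Face large language model (LLM) and the vector database ChromaDB.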
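1. Code
(1) Data loading and processing
import pandas as pd
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import CharacterTextSplitter

MAX_NEWS = 1000
DOCUMENT_COLUMN = "title"
CSV_PATH = "./data/labelled_newscatcher_dataset.csv"
# Read the semicolon-separated CSV and keep the first MAX_NEWS rows
news_df = pd.read_csv(CSV_PATH, sep=';').head(MAX_NEWS)
# Convert the DataFrame into Document objects (the title column becomes the content)
loader = DataFrameLoader(news_df, page_content_column=DOCUMENT_COLUMN)
documents = loader.load()
# Split the documents into small chunks (chunk_size=250 characters, 10-character overlap)
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(documents)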
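(1) DataFrameLoader: converts a DataFrame into a list of LangChain Document objects.
Structure of a Document object:
page_content: the column named by page_content_column (here "title") becomes the Document's text content.
metadata: every other column of the DataFrame automatically becomes Document metadata.
(2) CharacterTextSplitter: splits documents by character count. It first splits the text on a separator ("\n\n" by default) and then merges the pieces into chunks of at most chunk_size characters. Consecutive chunks share up to chunk_overlap characters, which preserves context across chunk boundaries and avoids cutting the meaning apart. A title shorter than 250 characters passes through as a single chunk.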
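The two steps above can be pictured with a small standard-library sketch. `Doc`, `rows_to_docs`, and `sliding_chunks` are hypothetical names used only for illustration; they are not LangChain APIs. The fixed-window splitter is a simplification that shows the overlap idea only, not the library's actual splitting logic.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows, page_content_column):
    """One column becomes the text; all other columns become metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(Doc(page_content=str(row[page_content_column]), metadata=metadata))
    return docs


def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


rows = [{"title": "The Legendary Toshiba is Officially Done With Making Laptops",
         "topic": "TECHNOLOGY"}]
docs = rows_to_docs(rows, page_content_column="title")
print(docs[0].page_content, docs[0].metadata)
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With chunk_size=250, a short title such as the one above comes back as a single chunk, so the overlap only matters for longer texts.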
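(2) Creating embeddings
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cuda'},  # requires a GPU; use 'cpu' otherwise
    encode_kwargs={'normalize_embeddings': True}  # L2-normalize the vectors so that dot product equals cosine similarity
)
HuggingFaceEmbeddings converts text into dense vectors (embeddings); all-MiniLM-L6-v2 produces 384-dimensional vectors.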
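To see why normalizing embeddings is convenient for retrieval, the following sketch checks that the dot product of two L2-normalized vectors equals the cosine similarity of the originals. The helper names and the two-dimensional vectors are made up for illustration; real embeddings from this model have 384 dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

print(cosine_similarity(a, b))   # cosine similarity of the raw vectors
print(dot(unit_a, unit_b))       # same value, computed as a dot product
```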
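(3) Building the ChromaDB index
from langchain_community.vectorstores import Chroma

CHROMA_DIR = "./chromadb"  # local storage path
chroma_db = Chroma.from_documents(
    texts,                         # the split text chunks
    embeddings,                    # the embedding model
    persist_directory=CHROMA_DIR   # local persistence directory
)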
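ChromaDB is a purpose-built vector database. Its roles here:
Vector storage: stores the embedding vector of each text chunk.
Similarity search: uses an approximate nearest neighbor (ANN) index (HNSW) to quickly find the stored vectors closest to the query vector.
Persistence: saves the index to a local directory (persist_directory) so the embeddings do not have to be recomputed.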
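The storage and similarity-search roles can be sketched with an in-memory store that does an exact, brute-force scan. This is a deliberate simplification: a real vector database uses an approximate index instead of comparing against every vector. `TinyVectorStore` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and scans them all."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    def search(self, query_vector, k=4):
        """Return the k texts whose vectors are most similar to the query."""
        scored = [(cosine_similarity(query_vector, vec), text)
                  for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = TinyVectorStore()
store.add("laptops", [1.0, 0.0])
store.add("phones", [0.7, 0.7])
store.add("sports", [0.0, 1.0])
print(store.search([0.9, 0.1], k=2))
```

The brute-force scan costs one comparison per stored vector, which is the cost an ANN index is designed to avoid on large collections.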
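(4) Configuring the LLM pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import HuggingFacePipeline

MODEL_NAME = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    use_safetensors=True,   # load weights stored in the safetensors format
    device_map="auto",      # place the model on GPU/CPU automatically
    torch_dtype="auto"      # take the precision from the checkpoint (e.g. FP16/FP32)
)
# Build the text-generation pipeline
text_pipeline = pipeline(
    "text-generation",        # task type: text generation
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,       # cap the number of generated tokens
    repetition_penalty=1.1    # discourage repeated content
)
# Wrap the pipeline as a LangChain-compatible object
hf_llm = HuggingFacePipeline(pipeline=text_pipeline)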
text_pipeline wraps the whole text-generation flow (tokenization, generation, decoding), and HuggingFacePipeline adapts it to the LangChain framework.
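The repetition_penalty setting in the pipeline above can be illustrated with one common formulation of such a penalty: scores of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative, so they become less likely either way. This is a sketch on plain Python lists with made-up scores; the exact behavior inside the library may differ.

```python
def apply_repetition_penalty(logits, generated_token_ids, penalty):
    """Lower the scores of tokens that already appear in the generated sequence.

    logits: list of raw scores, indexed by token id.
    generated_token_ids: token ids produced so far (duplicates allowed).
    penalty: value > 1.0 discourages repetition; 1.0 leaves scores unchanged.
    """
    adjusted = list(logits)
    for token_id in set(generated_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores move further below zero.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.2, -1.0, 0.5]     # hypothetical scores for a 3-token vocabulary
already_generated = [0, 1, 0]  # tokens 0 and 1 were produced before
print(apply_repetition_penalty(logits, already_generated, penalty=1.1))
```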
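(5) Building the QA chain and querying
from langchain.chains import RetrievalQA

# Create the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_llm,                           # the language model
    retriever=chroma_db.as_retriever(),   # ChromaDB retriever
    chain_type="stuff"                    # concatenate the retrieved documents and pass them to the model
)
response = qa_chain.invoke("Can I buy a Toshiba laptop?")
print(response['result'])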
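(1) RetrievalQA is the core class that implements the RAG flow here:
Retrieval: finds the documents in ChromaDB that are relevant to the question.
Augmented Generation: passes the retrieved documents to the LLM as context to generate the answer.
(2) from_chain_type wires the LangChain components together automatically:
retriever: the retriever, responsible for finding the documents in the knowledge base that are relevant to the question.
llm: the LLM that generates the answer.
chain_type: controls how the retrieved documents are handled.
-- stuff: concatenates all retrieved documents into one long text and passes it directly to the LLM as context. (Suitable when few documents are retrieved, e.g. 3-5, and the total text is small.)
-- map_reduce: generates an answer for each document separately, then combines all the answers into the final one. (Suitable when there are many documents and information from several sources must be merged.)
-- refine: generates an initial answer from the first document, then revises it with each following document in turn. (Suitable when a high-precision answer is needed and longer processing time is acceptable.)
-- map_rerank: generates and scores an answer for each document, then returns the highest-scoring answer. (Suitable when the best of several candidate answers should be chosen.)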
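The four chain_type strategies differ mainly in how many LLM calls they make and what each call sees. The sketch below mimics that control flow with a stand-in `fake_llm` that only records its prompts; the prompt wording, function names, and scoring rule are all hypothetical and are not LangChain's own.

```python
def stuff(llm, docs, question):
    """Concatenate all documents into one prompt: a single LLM call."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(llm, docs, question):
    """Answer per document (map), then combine the partial answers (reduce)."""
    partials = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    joined = "\n".join(partials)
    return llm(f"Partial answers:\n{joined}\n\nCombine them to answer: {question}")


def refine(llm, docs, question):
    """Draft from the first document, then revise with each later document."""
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context:\n{doc}\n\n"
                     f"Refine the answer to: {question}")
    return answer


def map_rerank(scoring_llm, docs, question):
    """Answer and score per document, then keep the best-scored answer."""
    candidates = [scoring_llm(f"Context:\n{doc}\n\nQuestion: {question}")
                  for doc in docs]
    best_answer, _best_score = max(candidates, key=lambda pair: pair[1])
    return best_answer


calls = []  # every prompt sent to the stand-in model


def fake_llm(prompt):
    """Stand-in for a real model: records the prompt, returns a numbered answer."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


def fake_scoring_llm(prompt):
    """Stand-in that returns (answer, score); the score is just the prompt length."""
    return (f"answer from {len(prompt)} chars", len(prompt))


docs = ["doc one", "doc two is longer", "doc three"]
for strategy in (stuff, map_reduce, refine):
    calls.clear()
    strategy(fake_llm, docs, "Can I buy a Toshiba laptop?")
    print(strategy.__name__, "->", len(calls), "LLM call(s)")
print(map_rerank(fake_scoring_llm, docs, "Can I buy a Toshiba laptop?"))
```

With three documents, stuff makes one call, map_reduce makes four (three map calls plus one reduce call), and refine makes three, which is why stuff is the cheapest option when the context fits in one prompt.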
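(6) LCEL (LangChain Expression Language)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """Answer the question based on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
chain = (
    {"context": chroma_db.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | hf_llm
    | StrOutputParser()
)
response2 = chain.invoke("Can I buy a Toshiba laptop?")
print(response2)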
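LCEL is a declarative syntax: components share a common interface (Runnable) and are connected with the pipe operator (|) to build complex chains.
Input definition
chroma_db.as_retriever(): retrieves the text chunks relevant to the user's question from the Chroma vector database.
RunnablePassthrough(): passes the user's question through unchanged to the next step.
→ Build the prompt (| prompt): inserts the retrieved {context} and the {question} into the template to produce the complete prompt
→ Call the model (| hf_llm): takes the prompt and generates an answer
→ Parse the output (| StrOutputParser()): extracts the plain text of the answer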
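How a pipe operator can connect components behind a shared interface is easy to sketch in plain Python by overloading `|`. The `Step` class and `coerce` helper below are hypothetical stand-ins written for this illustration; they only imitate the idea and say nothing about how LangChain implements Runnable internally.

```python
class Step:
    """Hypothetical stand-in for a Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # step_a | step_b: feed the output of step_a into step_b
        nxt = coerce(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Lets a plain dict stand on the left-hand side of `|`
        return coerce(other) | self


def coerce(obj):
    """Turn a Step, a dict of Steps, or a plain callable into a Step."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(val) for key, val in obj.items()}
        # Every branch receives the same input; the results are gathered by key.
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"cannot use {type(obj).__name__} as a step")


# Stand-ins for the retriever, prompt, model, and parser
retriever = Step(lambda q: "The Legendary Toshiba is Officially Done With Making Laptops")
passthrough = Step(lambda q: q)
prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}\n")
llm = Step(lambda text: f"  [model output for: {text!r}]  ")
parser = Step(lambda text: text.strip())

chain = {"context": retriever, "question": passthrough} | prompt | llm | parser
print(chain.invoke("Can I buy a Toshiba laptop?"))
```

The dict on the left fans the same question out to both branches, which mirrors the role the input dict plays in the chain above.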
2. Result
(1) Chains output:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
The Legendary Toshiba is Officially Done With Making Laptops
The Legendary Toshiba is Officially Done With Making Laptops
The Legendary Toshiba is Officially Done With Making Laptops
The Legendary Toshiba is Officially Done With Making Laptops
Question: Can I buy a Toshiba laptop?
Helpful Answer: No, Toshiba laptops as well as all other laptops made by Asus, Dell, HP, Lenovo and Samsung were all banned from being sold in India for violating strict import rules. This also means that you won’t be able to get warranty support or customer service from any of these companies anymore.
(2) LCEL output:
Human: Answer the question based on the following context:
[Document(metadata={'lang': 'en', 'published_date': '2020-08-10 12:00:13', 'domain': 'techweez.com', 'link': 'https://techweez.com/2020/08/10/toshiba-done-with-making-laptops/', 'topic': 'TECHNOLOGY'}, page_content='The Legendary Toshiba is Officially Done With Making Laptops'), Document(metadata={'link': 'https://techweez.com/2020/08/10/toshiba-done-with-making-laptops/', 'lang': 'en', 'domain': 'techweez.com', 'topic': 'TECHNOLOGY', 'published_date': '2020-08-10 12:00:13'}, page_content='The Legendary Toshiba is Officially Done With Making Laptops'), Document(metadata={'lang': 'en', 'link': 'https://techweez.com/2020/08/10/toshiba-done-with-making-laptops/', 'published_date': '2020-08-10 12:00:13', 'topic': 'TECHNOLOGY', 'domain': 'techweez.com'}, page_content='The Legendary Toshiba is Officially Done With Making Laptops'), Document(metadata={'link': 'https://techweez.com/2020/08/10/toshiba-done-with-making-laptops/', 'lang': 'en', 'published_date': '2020-08-10 12:00:13', 'topic': 'TECHNOLOGY', 'domain': 'techweez.com'}, page_content='The Legendary Toshiba is Officially Done With Making Laptops')]
Question: Can I buy a Toshiba laptop?
Resolution: It depends on what you mean by "buy". If you refer to buying a specific device that has been sold by a particular manufacturer, then no, you cannot do so because devices from a single manufacturer are not permitted to be sold or transferred under the EU's newGDPR. However, if you are asking whether Toshiba can still make laptops (i.e. it doesn't manufacture them now), then yes, they can continue to produce them.
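In the LCEL output above, the context slot of the prompt holds the raw repr of the retrieved Document list, metadata included, and the same title appears four times. One common remedy is to join only the page_content fields before the prompt is filled. The sketch below is an assumption about how that could look: `format_docs` is a helper name introduced here, the de-duplication step is an extra choice of this sketch, and SimpleNamespace stands in for real Document objects so the snippet runs without LangChain.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of the retrieved documents, dropping metadata and exact duplicates."""
    seen = set()
    unique_texts = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_texts.append(doc.page_content)
    return "\n\n".join(unique_texts)


# Four identical hits, as in the output above
retrieved = [
    SimpleNamespace(
        page_content="The Legendary Toshiba is Officially Done With Making Laptops",
        metadata={"topic": "TECHNOLOGY", "domain": "techweez.com"},
    )
    for _ in range(4)
]
print(format_docs(retrieved))

# In the LCEL chain, the helper would sit between the retriever and the prompt:
# {"context": chroma_db.as_retriever() | format_docs, "question": RunnablePassthrough()}
```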
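3. Understanding RAG and LangChain
3.1 RAG (Retrieval Augmented Generation)
(1) Hybrid architecture: retrieve, augment, generate. It consists of a retrieval module (Retriever) and a generation module (Generator).
Retrieval module: selects the information relevant to the user's question from an external knowledge base (documents, databases), typically by embedding the text as vectors and matching semantics with a similarity measure such as cosine similarity.
Generation module: uses a large language model (such as GPT or GLM) to generate a natural-language answer from the retrieved context.
(2) Workflow:
Knowledge preparation: build the knowledge base by converting structured or unstructured data (PDFs, databases) to text, splitting it into chunks, and storing the chunk embeddings.
Retrieval: the user's question is embedded, and the vector database returns the most relevant text chunks.
Augmented generation: the retrieved chunks are combined with the original question into an augmented prompt, which is passed to the generation model to produce the final answer.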
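The workflow above can be traced end to end with a toy example that needs only the standard library. Everything here is a simplification under stated assumptions: the "embedding" is a bag-of-words count rather than a neural model, the knowledge base is three sample sentences (the second and third are invented for this sketch), and the final generation step is left out, so the sketch stops at the augmented prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def build_index(chunks):
    """Knowledge preparation: store each chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Retrieval: embed the question and return the k most similar chunks."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Augmentation: combine the retrieved chunks with the question."""
    context = "\n".join(context_chunks)
    return (f"Answer the question based on the following context:\n"
            f"{context}\nQuestion: {question}\n")


knowledge_base = [
    "The Legendary Toshiba is Officially Done With Making Laptops",
    "Local team wins the championship final",    # invented sample sentence
    "New smartphone camera sensors announced",   # invented sample sentence
]
index = build_index(knowledge_base)
question = "Can I buy a Toshiba laptop?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this prompt would be passed to the generation model
```

Note that word counts match "toshiba" but not "laptop" against "laptops"; dense embeddings are used in practice precisely because they capture such near-matches.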
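3.2 LangChain
An open-source framework that provides modular components and chain-style workflow design, supporting rapid construction of complex applications such as RAG, agents, and dialogue systems.
Component toolkit: taking the code in this post as an example, LangChain covers
Document loading and preprocessing
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import CharacterTextSplitter
Vectorization and storage
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
Prompt templates
from langchain_core.prompts import ChatPromptTemplate
Chain: links multiple steps into a workflow.
Basic chain (LLMChain): single-step tasks (such as text generation).
Multi-step sequential chain (SequentialChain): runs several subtasks in sequence (for example, generate a summary → analyze sentiment).
Retrieval-augmented chain (RetrievalQA): combines a knowledge base to answer questions.
Agent: lets the LLM call external tools (such as a calculator or an API) to complete complex tasks.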
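The agent idea, a model that decides which external tool to call and then uses the result, can be sketched without any framework. In this illustration the decision step is a regular expression standing in for the LLM, and `calculator`, `TOOLS`, `fake_llm_decide`, and `run_agent` are hypothetical names; a real agent would let the model choose the tool and its input.

```python
import operator
import re


def calculator(expression):
    """Evaluate 'a op b' for + - * / without using eval()."""
    left, op, right = expression.split()
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    return str(ops[op](float(left), float(right)))


TOOLS = {"calculator": calculator}  # tool name -> callable


def fake_llm_decide(question):
    """Stand-in for the LLM's decision: return (tool_name, tool_input) or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)", question)
    if match:
        return "calculator", " ".join(match.groups())
    return None


def run_agent(question):
    """One decide-act step: pick a tool, call it, report the observation."""
    decision = fake_llm_decide(question)
    if decision is None:
        return "No tool needed."
    tool_name, tool_input = decision
    observation = TOOLS[tool_name](tool_input)
    return f"{tool_name}({tool_input}) = {observation}"


print(run_agent("What is 6 * 7?"))
print(run_agent("Can I buy a Toshiba laptop?"))
```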