## 1. Why Do We Need Text Chunking?
**When you run into an "elephant-sized" document...**

Imagine facing a document several hundred pages long, like being told to eat an entire elephant! This is where **text chunking** becomes your "carving knife".
❗ **Two core problems**:

1. **Risk of information loss**: processing the whole document in one pass is like scooping goldfish with a fishing net, details always slip through!
2. **Chunk-size limits**: GPT-4's 32K context window is like an elevator's weight limit, no overloading allowed!
**Benefits of chunking**:

- Higher information-extraction accuracy
- Avoids losing context
- Fits within model limits ⚖️
---
## 2. Eight Text Chunking Methods Compared
### 2.1 The Basic Option: General Text Chunking
```python
# Quick and dirty: split by fixed character length
text = "I am a very, very long text..."
chunk_size = 100
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```
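A quick toy run makes the drawback obvious (the sample string below is invented for illustration): fixed-length slices ignore word and sentence boundaries entirely.

```python
# Toy demo: fixed-length slicing cuts straight through words and sentences
text = "Text chunking keeps each piece within the model's context window."
chunk_size = 20
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
print(chunks)
# ['Text chunking keeps ', 'each piece within th', "e model's context wi", 'ndow.']
```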
### 2.2 Sentence-Based Chunking

Splitting at sentence boundaries keeps each chunk semantically intact:

```python
import re

def split_sentences(text):
    # Split on Chinese sentence-ending punctuation (。?!;) and newlines
    sentence_delimiters = re.compile(u'[。?!;]|\n')
    return [s.strip() for s in sentence_delimiters.split(text) if s.strip()]

# Example output:
# ['文本分块是自然语言处理...', '这种分割通常是基于...']
```
Or let spaCy handle the sentence segmentation:

```python
import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("文本分块是...")
sentences = [s for s in doc.sents]  # smart sentence segmentation
```
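Note that `doc.sents` yields spaCy `Span` objects rather than plain strings. A small follow-up sketch, reusing the `nlp` pipeline above (the sample sentence is invented):

```python
# Span -> str: .text gives each raw sentence string for downstream chunking
doc = nlp("文本分块是常见的预处理步骤。它把长文档切成小片段。")
chunks = [sent.text for sent in doc.sents]
print(chunks)  # two sentences, split at the 。 boundaries
```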
### 2.3 Character Splitting: CharacterTextSplitter

```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator=''
)
```
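A usage sketch (the input string is invented): `split_text` is LangChain's standard splitter entry point, and with an empty `separator` the text is simply cut every 35 characters.

```python
chunks = text_splitter.split_text(
    "This sentence will be cut into 35-character pieces regardless of meaning."
)
for chunk in chunks:
    print(len(chunk), repr(chunk))  # each piece is at most 35 characters long
```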
### 2.4 Recursive Splitting: RecursiveCharacterTextSplitter

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20  # overlapping like puzzle pieces is smarter
)
```
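A sketch of the overlap in action (the text is a placeholder): consecutive chunks share up to `chunk_overlap` characters, so content cut at a boundary still appears whole in one of the two neighbors.

```python
long_text = ("Chunking long documents needs care. Overlap keeps context. " * 5).strip()
chunks = text_splitter.split_text(long_text)
# Inspect the seam between the first two chunks: text from the tail of
# chunk 0 reappears at the head of chunk 1 (up to 20 characters).
print(repr(chunks[0][-25:]))
print(repr(chunks[1][:25]))
```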
### 2.5 HTML Chunking: HTMLHeaderTextSplitter

```python
from langchain.text_splitter import HTMLHeaderTextSplitter

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)
```
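A usage sketch on an inline HTML string (the markup is invented). In current LangChain versions `split_text` returns `Document` objects whose metadata records which headers each chunk sits under:

```python
html = "<h1>Intro</h1><p>Some intro text.</p><h2>Details</h2><p>More detail text.</p>"
docs = html_splitter.split_text(html)
for doc in docs:
    print(doc.metadata, "->", doc.page_content)
# e.g. {'Header 1': 'Intro'} -> Some intro text.
```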
### 2.6 Markdown Chunking: MarkdownHeaderTextSplitter

Given Markdown input like:

```markdown
# Heading 1
## Heading 2
Body text...
```

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "H1"), ("##", "H2")]
)
```
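Feeding the splitter the snippet above (a minimal sketch): it returns `Document`s whose metadata carries the header path, so no chunk loses track of its section.

```python
md = "# Heading 1\n## Heading 2\nBody text..."
docs = markdown_splitter.split_text(md)
for doc in docs:
    print(doc.metadata, "->", doc.page_content)
# e.g. {'H1': 'Heading 1', 'H2': 'Heading 2'} -> Body text...
```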
### 2.7 Python Code Chunking: PythonCodeTextSplitter

```python
from langchain.text_splitter import PythonCodeTextSplitter

# chunk_overlap must not exceed chunk_size, so set it explicitly
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
# Automatically recognizes code structure such as class/def boundaries
```
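A small sketch on an invented code string: the splits prefer `class`/`def` boundaries over arbitrary character positions.

```python
code = '''
class Greeter:
    def greet(self):
        print("hello")


def main():
    Greeter().greet()
'''
for chunk in python_splitter.split_text(code):
    print("---")
    print(chunk)
```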
### 2.8 LaTeX Chunking: LatexTextSplitter

For LaTeX sources such as:

```latex
\section{Introduction}
Content...
```

```python
from langchain.text_splitter import LatexTextSplitter

# chunk_overlap must not exceed chunk_size, so set it explicitly
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
```
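Usage mirrors the other splitters (the LaTeX source below is a placeholder); the splitter's separator list favors `\section`-style boundaries:

```python
latex_text = r"\section{Introduction} Text chunking matters. \section{Methods} We compare splitters."
chunks = latex_splitter.split_text(latex_text)
print(len(chunks))
print(chunks[0])
```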
---

## 3. Storing the Chunks: LanceDB

To store the chunks (and their embeddings), **LanceDB** is recommended:

- Open source and free
- Serverless architecture ☁️
- Fully compatible with the Python ecosystem
```python
import lancedb

# Connect to a local LanceDB directory and store (embedding, chunk) pairs
db = lancedb.connect("./data")
table = db.create_table("docs", data=[{"vector": embedding, "text": chunk}])
```
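Retrieval is then a nearest-neighbor search. A minimal sketch, assuming `query_embedding` has the same dimensionality as the stored vectors (the variable names follow the snippet above):

```python
# Find the 3 stored chunks whose embeddings are closest to the query
results = table.search(query_embedding).limit(3).to_pandas()
print(results["text"])
```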