unstructured is an open-source Python library for processing and preprocessing unstructured data (PDFs, Word documents, HTML, images, and more) into structured formats for downstream machine-learning (ML) or large-language-model (LLM) tasks. It provides modular components (called "bricks") for document partitioning, cleaning, and staging, and is widely used in data pipelines, RAG (Retrieval-Augmented Generation) systems, and document analysis.
The following is a detailed introduction to the unstructured library, covering its features, usage, and practical applications, with recent information (as of 2025).
Recent developments: the LangChain integration now provides UnstructuredLoader to simplify data loading.
Python dependencies: beautifulsoup4 (HTML parsing), lxml (XML processing), nltk (text processing).
System dependencies: tesseract (OCR), poppler (PDF processing), pandoc (EPUB/RTF).
Installation:
pip install unstructured
pip install "unstructured[local-inference]"
pip install "unstructured[docx]"
pip install unstructured-client
pdf2image documentation: https://pdf2image.readthedocs.io/
libmagic is required for file-type detection:
# Mac
brew install libmagic
# Ubuntu
sudo apt-get install libmagic1
import unstructured
print(unstructured.__version__)  # example output: 0.16.17
Docker support:
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash
The core of unstructured is processing documents with "bricks", which fall into three categories: partitioning, cleaning, and staging. The main capabilities and examples follow.
Partitioning splits a document into structured elements (titles, paragraphs, tables, etc.), with automatic file-type detection.
from unstructured.partition.auto import partition
# Parse a PDF file
elements = partition(filename="example.pdf")
for element in elements[:5]:
    print(f"{element.category}: {element.text}")
Example output:
Title: Introduction
NarrativeText: This is the first paragraph...
ListItem: - Item 1
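Each element also carries metadata (page number, source filename, and so on). A minimal sketch of inspecting it, assuming an example.pdf next to the script; exactly which metadata fields are populated depends on the file type and parsing strategy:
from collections import Counter
from unstructured.partition.auto import partition

elements = partition(filename="example.pdf")
# Count how many elements of each category were detected
print(Counter(el.category for el in elements))
# Metadata is exposed per element and can be serialized to a dict
first = elements[0]
print(first.metadata.page_number, first.metadata.filename)
print(first.metadata.to_dict())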
Notes:
- partition auto-detects the file type and dispatches to the matching partitioner (partition_pdf, partition_docx, etc.).
- It returns a list of Element objects, each with a category (Title, NarrativeText, ListItem, Table, etc.) and metadata.
Partitioning a specific file type:
from unstructured.partition.pdf import partition_pdf
# High-resolution parsing (including tables)
elements = partition_pdf(filename="example.pdf", strategy="hi_res")
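When tables are the main target, hi_res can also be asked to reconstruct table structure as HTML. A hedged sketch (infer_table_structure and metadata.text_as_html are part of the PDF partitioning API, but whether table inference is enabled by default varies across versions):
from unstructured.partition.pdf import partition_pdf

# Request table structure inference alongside hi_res layout detection
elements = partition_pdf(
    filename="example.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)
for el in elements:
    if el.category == "Table":
        # text_as_html may be None if the table could not be reconstructed
        print(el.metadata.text_as_html)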
Notes:
- strategy="hi_res" uses computer vision and OCR to extract tables and is suited to complex PDFs.
- It requires tesseract and poppler.
Cleaning removes irrelevant content such as boilerplate text, punctuation, and sentence fragments.
from unstructured.cleaners.core import clean, remove_punctuation
text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)   # lowercase the text (other clean options default to off)
cleaned_text = remove_punctuation(cleaned_text)  # strip punctuation
print(cleaned_text)  # Output: hello world this is a test
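The same cleaners can also be run over partitioned elements rather than raw strings, using each element's apply method. A minimal sketch, assuming elements produced by an earlier partition call; the extra_whitespace, dashes, and bullets options of clean are shown here in addition to lowercase:
from unstructured.cleaners.core import clean
from unstructured.partition.auto import partition

elements = partition(filename="example.pdf")
for element in elements:
    # apply() runs the cleaner over element.text in place
    element.apply(lambda text: clean(text, extra_whitespace=True, dashes=True, bullets=True))
print(elements[0].text)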
Notes:
- The cleaners work on raw strings as well as on the text of each Element.
Staging formats the data as input for downstream tasks, such as JSON or LLM training data.
from unstructured.staging.base import convert_to_dict
# Convert to JSON
elements = partition(filename="example.docx")
json_data = convert_to_dict(elements)
print(json_data[:2])  # print the first two elements
Example output:
[
{"type": "Title", "text": "Introduction", "metadata": {...}},
{"type": "NarrativeText", "text": "This is the first paragraph...", "metadata": {...}}
]
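Besides convert_to_dict, the staging module has helpers for persisting elements to a JSON file and loading them back, which avoids re-parsing a document on every run. A short sketch, assuming elements_to_json and elements_from_json from unstructured.staging.base:
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_from_json, elements_to_json

elements = partition(filename="example.docx")
# Write the elements (text + metadata) to disk ...
elements_to_json(elements, filename="elements.json")
# ... and rebuild Element objects later without touching the original file
restored = elements_from_json(filename="elements.json")
print(restored[0].category, restored[0].text[:50])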
Notes:
- convert_to_dict turns the element list into JSON-ready dicts, suitable for LLM pipelines or data analysis.
- Other staging helpers exist, such as stage_for_transformers for Hugging Face integration.
LangChain integration: load documents through UnstructuredLoader.
from langchain_unstructured import UnstructuredLoader
# Local loading
loader = UnstructuredLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:100])  # print the extracted text
# Use the hosted Serverless API
loader = UnstructuredLoader(
    file_path="example.pdf",
    api_key="your_api_key",
    partition_via_api=True,  # route partitioning through the API
    strategy="hi_res",
)
docs = loader.load()
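UnstructuredLoader also accepts a list of paths and, like other LangChain loaders, supports lazy iteration. A minimal sketch, assuming example.pdf and notes.docx exist locally (the file names are placeholders):
from langchain_unstructured import UnstructuredLoader

# Several documents can be loaded in one pass
loader = UnstructuredLoader(file_path=["example.pdf", "notes.docx"])
for doc in loader.lazy_load():
    # Each Document keeps the element metadata, including the source filename
    print(doc.metadata.get("filename"), doc.page_content[:60])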
Notes:
- Local loading requires unstructured and langchain_unstructured.
- API loading additionally requires unstructured-client and an API key (available from https://unstructured.io/).
- LangChain also ships format-specific loaders such as UnstructuredCSVLoader and UnstructuredHTMLLoader.
Serverless API: processing through the hosted API is efficient and reduces local dependencies.
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your_api_key")
with open("example.pdf", "rb") as f:
    files = shared.Files(content=f.read(), file_name="example.pdf")
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(files=files, strategy=shared.Strategy.HI_RES))
response = client.general.partition(request=req)
print(response.elements[:2])
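The API returns plain dicts. If downstream code expects the library's Element objects, they can be rebuilt locally; a sketch that continues the snippet above and assumes dict_to_elements from unstructured.staging.base:
from unstructured.staging.base import dict_to_elements

# response.elements is a list of plain dicts; turn them back into Element objects
elements = dict_to_elements(response.elements)
print(elements[0].category, elements[0].text[:50])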
Notes:
- Requires the unstructured-client package.
Docker: run unstructured inside a Docker container.
# Run a Python script inside the container
from unstructured.partition.auto import partition
elements = partition(filename="/data/example.pdf")
print([str(el) for el in elements[:5]])
Notes:
- Advantages: the partition function parses most formats in a single call, which keeps the learning curve low.
- Disadvantages: complex formats need system dependencies (tesseract, poppler).
- Telemetry: set DO_NOT_TRACK=true to disable tracking, and SCARF_NO_ANALYTICS=true to disable Scarf download analytics.
Comparison with alternatives: some wrappers build on unstructured itself but offer a simpler interface; other parsers advertise roughly 25x faster processing on similar formats, though their ecosystems are newer.
Example (RAG pipeline):
from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json
# Parse the document
elements = partition(filename="report.pdf", strategy="hi_res")
json_data = convert_to_dict(elements)
# Save as JSON
with open("output.json", "w") as f:
    json.dump(json_data, f, indent=2)
# Load into LangChain
loader = UnstructuredLoader(file_path="report.pdf")
docs = loader.load()
# Summarize with an LLM (the legacy langchain.llms import is deprecated; use langchain-openai)
from langchain_openai import OpenAI
llm = OpenAI(api_key="your_openai_key")
response = llm.invoke(f"Summarize: {docs[0].page_content[:500]}")
print(response)
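To turn this into actual retrieval, the loaded documents can be chunked, embedded, and indexed. A hedged sketch using standard LangChain components rather than anything from unstructured itself, assuming the langchain-openai, langchain-community, langchain-text-splitters, and faiss-cpu packages are installed:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_unstructured import UnstructuredLoader

docs = UnstructuredLoader(file_path="report.pdf").load()
# Chunk into overlapping pieces sized for a typical embedding model
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
# Embed the chunks and build an in-memory FAISS index
index = FAISS.from_documents(chunks, OpenAIEmbeddings(api_key="your_openai_key"))
retriever = index.as_retriever(search_kwargs={"k": 4})
print(retriever.invoke("What are the key findings?")[0].page_content[:200])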
Notes:
- Use pyenv to manage virtual environments; Python 3.8.15 is recommended.
- Files can also be processed through the hosted API via unstructured-client.
- The scripts/collect_env.py script collects environment information.
- Install tesseract and poppler for PDF/image parsing, otherwise errors occur.
- pip install "unstructured[all-docs]" installs dependencies for every document type; installing targeted extras (e.g. unstructured[docx]) reduces overhead.
- strategy="hi_res" extracts tables but at a higher computational cost.
- Set export DO_NOT_TRACK=true or SCARF_NO_ANALYTICS=true to disable telemetry.
The following comprehensive example combines partitioning, cleaning, staging, and LangChain integration:
from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json
# Configure logging (using loguru)
from loguru import logger
logger.add("app.log", rotation="1 MB", level="INFO")
# Parse the PDF
logger.info("Starting PDF processing")
try:
    elements = partition(filename="sample.pdf", strategy="hi_res")
except Exception:
    logger.exception("Failed to process PDF")
    raise
# Clean each element's text in place (apply runs the cleaners over element.text)
for element in elements:
    element.apply(lambda text: remove_punctuation(clean(text, lowercase=True)))
logger.info("Text cleaning completed")
# Stage the cleaned elements as JSON-ready dicts
json_data = convert_to_dict(elements)
with open("output.json", "w") as f:
    json.dump(json_data, f, indent=2)
logger.info("JSON output saved")
# LangChain integration
loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
docs = loader.load()
logger.info(f"Loaded {len(docs)} documents")
# Print the first 100 characters
print(docs[0].page_content[:100])
Example output (app.log):
2025-05-09T01:33:56.123 | INFO | Starting PDF processing
2025-05-09T01:33:57.124 | INFO | Text cleaning completed
2025-05-09T01:33:57.125 | INFO | JSON output saved
2025-05-09T01:33:57.126 | INFO | Loaded 1 documents
Notes:
- partition parses the PDF, and the hi_res strategy extracts tables.