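pdfminer.six is a widely used Python library for parsing PDF documents (Python 3 only). It can extract text, images, layout information, and metadata; tables are not recognized directly but can be rebuilt from layout data. The sections below go from basic to advanced usage.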
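Document structure model
A PDF document is made up of several layers of objects:
PDFDocument: the whole document
PDFPage: a single page
LTTextBox / LTTextLine: a text block / a text line
LTFigure: container for images or tables
LTImage: raw image data
LTRect: a rectangle (for example, a table border)
Key components
Converter devices (such as TextConverter)
Installation
pip install pdfminer.six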
from pdfminer.high_level import extract_text
text = extract_text("document.pdf")
print(text)
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

# Named differently from pdfminer.high_level.extract_text to avoid shadowing it
def extract_text_low_level(pdf_path):
    output_string = StringIO()
    with open(pdf_path, 'rb') as f:
        # Set up the parser, the document, and the rendering pipeline
        parser = PDFParser(f)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process the document page by page
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    return output_string.getvalue()

text = extract_text_low_level("document.pdf")
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTImage

for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        # Text block
        if isinstance(element, LTTextContainer):
            print("Text:", element.get_text().strip())
        # Container for images or tables
        elif isinstance(element, LTFigure):
            for item in element:
                if isinstance(item, LTImage):
                    # Handle the image (see the next section)
                    pass
        # Other elements (such as lines and rectangles)
        elif hasattr(element, 'width'):
            print(f"Graphic object: {element}")
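Layout objects nest: a figure can hold images, and a text box holds text lines. As a minimal sketch, a recursive walker can visit every element together with its depth. The helper names `walk_layout` and `print_layout_tree` are hypothetical, and the walker assumes that any element supporting iteration is a container, which is a simplification.

```python
def walk_layout(element, depth=0):
    """Yield (depth, element) for element and everything nested inside it."""
    yield depth, element
    # Assumption: containers (LTPage, LTFigure, LTTextBox, ...) are iterable,
    # while leaf objects (LTChar, LTImage, LTRect, ...) are not.
    if hasattr(element, "__iter__"):
        for child in element:
            yield from walk_layout(child, depth + 1)


def print_layout_tree(pdf_path):
    """Print an indented outline of every page's layout tree."""
    # Imported lazily so walk_layout itself has no pdfminer dependency.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(pdf_path):
        for depth, element in walk_layout(page_layout):
            print("  " * depth + type(element).__name__)
```

Calling `print_layout_tree("document.pdf")` gives a quick view of where images and text lines sit in the hierarchy before writing type-specific handling.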
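from pdfminer.high_level import extract_pages
from pdfminer.image import ImageWriter
from pdfminer.layout import LTFigure, LTImage

def extract_images(pdf_path, output_dir):
    iw = ImageWriter(output_dir)
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            # Images are normally nested inside an LTFigure container
            if isinstance(element, LTFigure):
                for item in element:
                    if isinstance(item, LTImage):
                        # Save the image; export_image returns the file name
                        iw.export_image(item)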
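The library does not recognize tables directly, but they can be rebuilt from layout information:
# Example: a starting point for table detection
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTRect, LTTextContainer

for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTRect):
            # Infer the table extent from the rectangle positions
            pass
        if isinstance(element, LTTextContainer):
            # Align the text to cells by its coordinates
            text = element.get_text()
            bbox = (element.x0, element.y0, element.x1, element.y1)
Tip: for complex tables, consider combining it with the camelot or tabula-py library.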
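To make the coordinate-alignment idea concrete, here is a rough sketch that groups text lines into rows by their top edge. The function names and the 3-point `tolerance` are assumptions chosen for illustration. It handles only simple grids and is no substitute for a dedicated table library.

```python
def group_into_rows(boxes, tolerance=3.0):
    """Group (x0, y0, x1, y1, text) tuples into rows, top to bottom.

    A box joins the current row when its top edge (y1) lies within
    `tolerance` points of the top edge of the row's first box.
    """
    rows = []
    # PDF coordinates grow upward, so sort by descending y1 for reading order.
    for box in sorted(boxes, key=lambda b: -b[3]):
        if rows and abs(rows[-1][0][3] - box[3]) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, order cells left to right and keep only the text.
    return [[b[4] for b in sorted(row, key=lambda b: b[0])] for row in rows]


def table_rows_from_pdf(pdf_path, tolerance=3.0):
    """Yield, for each page, its text lines grouped into rows."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        boxes = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        boxes.append((*line.bbox, line.get_text().strip()))
        yield group_into_rows(boxes, tolerance)
```

Sorting by the top edge rather than the bottom edge is a design choice: cells in one row usually share a baseline region, but a taller font in one cell shifts its bottom edge more than its top.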
from pdfminer.high_level import extract_text
from pdfminer.pdfdocument import PDFPasswordIncorrect

try:
    text = extract_text("encrypted.pdf", password="mypass")
except PDFPasswordIncorrect:
    print("Wrong password!")
for page_layout in extract_pages("doc.pdf"):
    for element in page_layout:
        print(element)  # Prints the type and coordinates of each top-level object
import logging
logging.basicConfig(level=logging.DEBUG)
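from pdfminer.converter import HTMLConverter
with open("output.html", "wb") as out_f:
    device = HTMLConverter(rsrcmgr, out_f, scale=1.5)  # rsrcmgr, page: from the low-level pipeline above
    PDFPageInterpreter(rsrcmgr, device).process_page(page)
    device.close()  # Writes the closing HTML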
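from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def get_fonts(pdf_path):
    fonts = set()
    rsrcmgr = PDFResourceManager()
    interpreter = PDFPageInterpreter(rsrcmgr, PDFDevice(rsrcmgr))
    with open(pdf_path, 'rb') as f:
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
            # fontmap holds the fonts declared in the current page's resources
            fonts.update(font.fontname for font in interpreter.fontmap.values())
    return fonts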
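An alternative way to list fonts is to read the `fontname` attribute of each `LTChar` produced by layout analysis. The sketch below also strips the subset tag (six uppercase letters and a plus sign) that embedded subset fonts carry. The names `base_font_name` and `fonts_used` are hypothetical helpers, not part of pdfminer.six.

```python
import re

# Subset fonts are tagged with six uppercase letters and a plus sign,
# for example "ABCDEF+Helvetica".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(fontname):
    """Strip the subset tag, if any, from a PDF font name."""
    return _SUBSET_TAG.sub("", fontname)


def fonts_used(pdf_path):
    """Return the set of base font names seen on rendered characters."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    names = set()
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:  # characters and LTAnno spacing objects
                    if isinstance(obj, LTChar):
                        names.add(base_font_name(obj.fontname))
    return names
```

This reports only fonts that actually render text inside text boxes, which can differ from the fonts merely declared in a page's resources.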
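Garbled Chinese text
Make sure the CMap resources are installed, and enable debug logging for the character-mapping module:
import logging
logging.basicConfig()
logging.getLogger("pdfminer.cmapdb").setLevel(logging.DEBUG)  # Debug character mapping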
Performance tuning
extract_pages(..., caching=True): enable caching (this is the default)
extract_pages(..., page_numbers=[0, 1, 2]): process only the listed pages (zero-based)
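Because `page_numbers` is zero-based while people usually quote 1-based pages, a small converter helps avoid off-by-one mistakes. This is a sketch only; `parse_page_ranges` and `extract_selected_pages` are hypothetical helper names, and the "1-3,5" spec format is an assumption.

```python
def parse_page_ranges(spec):
    """Convert a 1-based spec such as "1-3,5" to sorted zero-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based inclusive to zero-based indices.
        pages.update(range(start - 1, end))
    return sorted(pages)


def extract_selected_pages(pdf_path, spec):
    """Extract text from only the pages named in the 1-based spec."""
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=parse_page_ranges(spec))
```

For example, `extract_selected_pages("document.pdf", "1-3,5")` reads the first three pages and the fifth, skipping the rest of the file.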
Missing content
Check the layout parameters in LAParams:
LAParams(
    line_overlap=0.5,
    char_margin=2.0,
    word_margin=0.1,
    boxes_flow=0.5  # Adjusts the text-flow threshold
)
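To build intuition for `line_overlap` and `char_margin`, the sketch below paraphrases the horizontal grouping test in simplified form. It is an illustration written for this article, not pdfminer.six's actual code, and `joins_same_line` is a hypothetical name.

```python
def joins_same_line(a, b, line_overlap=0.5, char_margin=2.0):
    """Decide whether two character boxes (x0, y0, x1, y1) join one line."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # The vertical overlap must exceed line_overlap times the smaller height.
    v_overlap = min(ay1, by1) - max(ay0, by0)
    min_height = min(ay1 - ay0, by1 - by0)
    if v_overlap <= min_height * line_overlap:
        return False
    # The horizontal gap must stay below char_margin times the larger width.
    h_gap = max(0.0, max(ax0, bx0) - min(ax1, bx1))
    max_width = max(ax1 - ax0, bx1 - bx0)
    return h_gap < max_width * char_margin
```

Under this reading, raising `char_margin` merges characters that sit farther apart into one line, which can recover text that appears split or missing, at the cost of gluing separate columns together.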
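PyPDF2: simple text extraction (lower accuracy)
pdfplumber: an enhanced library built on pdfminer (recommended for table extraction)
pymupdf (fitz): high-performance rendering and text extraction
Once you understand the core pipeline and layout model of pdfminer.six, you can handle most PDF parsing tasks flexibly. For complex scenarios, consider combining it with other libraries.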