On a whim, I decided to convert a vocabulary-book PDF (scan only) into an electronic version and import it into a vocabulary app for review.
Two ideas:
1. Split the word lists alphabetically (A-Z and so on)
2. Convert the PDF to text
1. Splitting the PDF: first the PDF needs to be split apart, which PyPDF2 can handle.
Reference article on PDF handling: 掌握PDF文件处理的神器:Python PyPDF2库详解 - CSDN博客
I wrote a function that lets you pass in multiple page ranges at once:
import os
from PyPDF2 import PdfReader, PdfWriter

def split_pdf_by_page_ranges(input_file, page_ranges, output_dir=None):
    """Split a PDF by multiple page ranges (pages are 0-indexed, ranges inclusive)."""
    reader = PdfReader(input_file)
    total_pages = len(reader.pages)
    # Make sure the output directory exists
    if output_dir and not os.path.exists(output_dir):
        os.makedirs(output_dir)
    # Process each page range
    for start_page, end_page, output_file in page_ranges:
        # Validate the page range
        if start_page < 0 or start_page >= total_pages:
            print(f"Warning: start page {start_page} is out of range (total pages: {total_pages}), skipping this range")
            continue
        if end_page < start_page or end_page >= total_pages:
            print(f"Warning: end page {end_page} is out of range (total pages: {total_pages}), clamping to the last page")
            end_page = total_pages - 1
        # Build the full output path
        if output_dir:
            output_file = os.path.join(output_dir, output_file)
        # Create and save the new PDF
        writer = PdfWriter()
        for page_num in range(start_page, end_page + 1):
            writer.add_page(reader.pages[page_num])
        with open(output_file, 'wb') as output:
            writer.write(output)
        print(f"Saved: {output_file}")
Example:

page_ranges = [
    (10, 14, "黄皮书A.pdf"),  # pages 11-15
    (15, 19, "黄皮书B.pdf"),  # pages 16-20
    (20, 24, "黄皮书C.pdf"),  # pages 21-25
]
# Split on multiple ranges in one call (file is the input PDF path, root the output directory)
split_pdf_by_page_ranges(file, page_ranges, root)
One problem remains: I still have to look at the table of contents and count page numbers by hand. I haven't found a better way yet (parsing the table of contents didn't work properly).
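As a side note, if the PDF had embedded bookmarks, a recent PyPDF2 could map outline entries to page numbers directly; a pure scan usually has none, which is probably why the table of contents can't be read automatically. A minimal sketch (the filename is just a placeholder):

from PyPDF2 import PdfReader

def list_bookmarks(input_file):
    """Print each top-level bookmark and the 0-indexed page it points to."""
    reader = PdfReader(input_file)
    for item in reader.outline:
        # Nested bookmarks come back as lists; only handle flat entries here
        if isinstance(item, list):
            continue
        print(item.title, reader.get_destination_page_number(item))

# A scanned book typically prints nothing here, so the page ranges still have to be counted by hand
list_bookmarks("黄皮书.pdf")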
2. PDF to text (OCR)
For extraction, pdf_reader = PyPDF2.PdfFileReader(pdf) (PdfReader in newer versions) would normally be enough, but, perhaps because my copy is a scan, nothing could be extracted. So I had to take a roundabout route.
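For reference, plain extraction with a recent PyPDF2 looks roughly like this; on a scanned book it just yields empty strings (the filename reuses one of the split files above):

from PyPDF2 import PdfReader

reader = PdfReader("黄皮书A.pdf")
for page in reader.pages:
    # A scan has no text layer, so extract_text() comes back empty
    text = page.extract_text()
    print(repr(text))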
First, convert the PDF to images.
import os
import fitz  # PyMuPDF

# Convert a PDF to one image per page
def pdf2img(pdf_path, img_path):
    # Open the PDF file
    pdf = fitz.open(pdf_path)
    # Make sure img_path exists
    if not os.path.exists(img_path):
        os.makedirs(img_path)
    # Iterate over each page
    for page_num in range(len(pdf)):
        page = pdf.load_page(page_num)
        # Render the page to an image
        pix = page.get_pixmap()
        # Save the image
        pix.save(f"{img_path}/page_{page_num + 1}.png")
    # Close the PDF file
    pdf.close()
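One tweak worth considering: get_pixmap() renders at 72 dpi by default, which can be a bit low for OCR on a scan. Passing a zoom matrix produces a sharper image; a small variation on the function above (the filename is just an example):

import fitz

pdf = fitz.open("黄皮书A.pdf")
page = pdf.load_page(0)
# Render at 2x zoom (roughly 144 dpi) instead of the 72 dpi default
pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
pix.save("page_1_hires.png")
pdf.close()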
Then convert the images to text (OCR).
I originally wanted to install easyocr or a similar library, but it kept throwing errors for some reason. I didn't feel like fighting with paths, so I called the Baidu API instead; reference docs: 通用文字识别(高精度版) - 文字识别OCR.
import base64
import urllib.parse

import requests

API_KEY = "YOUR_API_KEY"
SECRET_KEY = "YOUR_SECRET_KEY"

def baiduOCR(img_path):
    url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token=" + get_access_token()
    # image can be obtained via get_file_content_as_base64("C:\fakepath\page_1.png", True)
    payload = 'image=' + get_file_content_as_base64(img_path, True)
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload.encode("utf-8"))
    print(response.text)
    # Return the parsed JSON so callers can post-process the result
    return response.json()

def get_file_content_as_base64(path, urlencoded=False):
    """
    Get a file's content as base64
    :param path: file path
    :param urlencoded: whether to url-encode the result
    :return: base64-encoded content
    """
    with open(path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf8")
        if urlencoded:
            content = urllib.parse.quote_plus(content)
    return content

def get_access_token():
    """
    Generate an auth signature (access token) from the AK/SK
    :return: access_token, or None on error
    """
    url = "https://aip.baidubce.com/oauth/2.0/token"
    params = {"grant_type": "client_credentials", "client_id": API_KEY, "client_secret": SECRET_KEY}
    return str(requests.post(url, params=params).json().get("access_token"))
The output for one page looks like this:
{"words_result":[{"words":"a canary in the coal mine"},{"words":"position"},{"words":"煤矿中的金丝雀(表示危险"},{"words":"澄用市场支配地位"},{"words":"的先兆)"},{"words":"◆Academy Awards"},{"words":"◆a cat may look at a king"},{"words":"奥斯卡金像奖(美同电影艺"},{"words":"小人物也有权利:人人平等"},{"words":"术与科学学院奖)"},{"words":"a fig leaf"},{"words":"◆access denial"},{"words":"遮羞布"},{"words":"拒绝访问"},{"words":"◆a global hit"},{"words":"accessible elevator"},{"words":"风摩全球"},{"words":"无障碍电梯"},{"words":"◆a going concern"},{"words":"accommodative monetary"},{"words":"盈利企业"},{"words":"policy"},{"words":"◆a leap in the dark"},{"words":"融通性货币政策"},{"words":"冒险举动:轻举妄动"},{"words":"◆account balance"},{"words":"◆a shot in the arm"},{"words":"账户结余"},{"words":"刺激因素:强心剂"},{"words":"accrued depreciation"},{"words":"◆a slim chance"},{"words":"应计折旧"},{"words":"希望渺茫"},{"words":"acculturation"},{"words":"◆a storm in a teacup"},{"words":"涵化:文化适应;文化移人:"},{"words":"小题大做,大惊小怪"},{"words":"文化互渗"},{"words":"◆a tip off"},{"words":"accumulated fund"},{"words":"密报;举报"},{"words":"累计基金"},{"words":"◆a vampire shopper"},{"words":"◆Achilles'heel"},{"words":"吸血鬼购物者(指半夜网购"},{"words":"阿基里斯之骤:致命弱点:"},{"words":"的人)"},{"words":"要害"},{"words":"◆abducted children"},{"words":"acrobatic gymnastics"},{"words":"被拐儿童"},{"words":"技巧运动"},{"words":"abuse of dominant market"},
# {"words":"active-duty soldier"},{"words":"2"}],"words_result_num":56,"log_id":1922546852819361980}
However, since this book is laid out in two columns (left and right), the Chinese and English here come out jumbled and don't pair up. Still thinking about how best to handle that.
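One option might be to split each page image down the middle with Pillow before OCR, so each column is recognized top-to-bottom on its own. A rough, untested sketch, assuming the two columns are about equal in width:

from PIL import Image

def split_columns(img_file):
    """Split a two-column page image into left/right halves."""
    img = Image.open(img_file)
    w, h = img.size
    left = img.crop((0, 0, w // 2, h))
    right = img.crop((w // 2, 0, w, h))
    left_path = img_file.replace(".png", "_left.png")
    right_path = img_file.replace(".png", "_right.png")
    left.save(left_path)
    right.save(right_path)
    return left_path, right_path

# OCR each half separately so the English/Chinese pairs stay in reading order
# left_path, right_path = split_columns("imgs/page_1.png")
# baiduOCR(left_path); baiduOCR(right_path)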