Python | Converting a scanned vocabulary book to text (PyPDF, OCR)

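On a whim, I decided to turn a vocabulary-book PDF (only a scanned copy is available) into an electronic version, then import it into a vocabulary app for review.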

That led to two ideas:

1. Split the word list into sections such as A-Z

2. Convert the PDF to text

1. First the PDF has to be split up, which PyPDF2 can do

PDF reference article: "Mastering PDF file handling: a detailed guide to the Python PyPDF2 library" (CSDN blog)

I wrote a function that accepts several page ranges in one call:

import os

from PyPDF2 import PdfReader, PdfWriter


def split_pdf_by_page_ranges(input_file, page_ranges, output_dir=None):
    """Split a PDF into several files, one per (start, end, filename) range.

    Page numbers are zero-based and inclusive.
    """
    reader = PdfReader(input_file)
    total_pages = len(reader.pages)

    # Make sure the output directory exists
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Process each page range
    for start_page, end_page, output_file in page_ranges:
        # Validate the page range
        if start_page < 0 or start_page >= total_pages:
            print(f"Warning: start page {start_page} is out of range (total pages: {total_pages}); skipping this range")
            continue

        if end_page < start_page or end_page >= total_pages:
            print(f"Warning: end page {end_page} is out of range (total pages: {total_pages}); using the last page instead")
            end_page = total_pages - 1

        # Build the full output file path
        if output_dir:
            output_file = os.path.join(output_dir, output_file)

        # Create and save the new PDF
        writer = PdfWriter()
        for page_num in range(start_page, end_page + 1):
            writer.add_page(reader.pages[page_num])

        with open(output_file, 'wb') as output:
            writer.write(output)
        print(f"Saved file: {output_file}")

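Example:

page_ranges = [
    (10, 14, "黄皮书A.pdf"),  # pages 11 to 15
    (15, 19, "黄皮书B.pdf"),  # pages 16 to 20
    (20, 24, "黄皮书C.pdf"),  # pages 21 to 25
]

# Split all ranges in one call; `file` is the source PDF path and `root` is the output directory
split_pdf_by_page_ranges(file, page_ranges, root)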

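This still has a drawback: I have to read the table of contents and count the pages myself. I haven't found a better way yet (recognizing the table of contents automatically did not work well).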
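One way to cut down on the counting, as a rough sketch: if only the page on which each letter starts is noted down, the zero-based ranges that split_pdf_by_page_ranges expects can be derived from those starts. The helper name build_page_ranges, its parameters, and the file-name pattern are hypothetical, and the page numbers are assumed to be 1-based positions in the PDF file rather than the numbers printed on the pages.

```python
def build_page_ranges(section_starts, last_page, name_pattern="黄皮书{}.pdf"):
    """Turn 1-based section start pages into zero-based (start, end, filename) tuples.

    section_starts: list of (label, first_page) pairs, pages counted from 1.
    last_page: 1-based number of the last page of the final section.
    """
    ordered = sorted(section_starts, key=lambda item: item[1])
    ranges = []
    for index, (label, first_page) in enumerate(ordered):
        if index + 1 < len(ordered):
            # A section ends on the page before the next section starts.
            end_page = ordered[index + 1][1] - 1
        else:
            end_page = last_page
        # Convert from 1-based page numbers to zero-based indices.
        ranges.append((first_page - 1, end_page - 1, name_pattern.format(label)))
    return ranges


page_ranges = build_page_ranges([("A", 11), ("B", 16), ("C", 21)], last_page=25)
print(page_ranges)
```

With these inputs the result matches the hand-written list in the example: (10, 14), (15, 19) and (20, 24). The start pages still have to be looked up once, but the end pages and the off-by-one conversion no longer need to be worked out by hand.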

2. Convert the PDF to text (OCR)

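For text extraction, pdf_reader = PyPDF2.PdfReader(pdf) is normally enough (the class was called PdfFileReader before PyPDF2 3.0), but in my case nothing could be extracted, probably because the file is a scanned copy with no text layer. So I had to take a detour.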

First, convert the PDF to images:

import os

import fitz  # PyMuPDF


# Convert a PDF to images
def pdf2img(pdf_path, img_path):
    # Make sure the output directory exists
    os.makedirs(img_path, exist_ok=True)
    # Open the PDF file
    pdf = fitz.open(pdf_path)
    # Iterate over every page
    for page_num in range(len(pdf)):
        page = pdf.load_page(page_num)
        # Render the page to an image
        pix = page.get_pixmap()
        # Save the image
        pix.save(os.path.join(img_path, f"page_{page_num + 1}.png"))
    # Close the PDF file
    pdf.close()
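Called without arguments, get_pixmap() renders at PyMuPDF's default scale of 72 dpi, which can be coarse for small print. The variant below is a sketch, not something the original workflow used: it renders through a scaling matrix so the images come out at a higher resolution, which may help OCR accuracy. The names zoom_for_dpi and pdf2img_hires and the default of 200 dpi are assumptions chosen for illustration.

```python
import os


def zoom_for_dpi(dpi, base_dpi=72):
    """Scale factor relative to the 72 dpi that PyMuPDF renders at by default."""
    return dpi / base_dpi


def pdf2img_hires(pdf_path, img_path, dpi=200):
    """Variant of pdf2img that renders each page at a higher resolution."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    os.makedirs(img_path, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale both axes by the same factor
    pdf = fitz.open(pdf_path)
    try:
        for page_num in range(len(pdf)):
            page = pdf.load_page(page_num)
            pix = page.get_pixmap(matrix=matrix)
            pix.save(os.path.join(img_path, f"page_{page_num + 1}.png"))
    finally:
        pdf.close()
```

Larger images also mean larger uploads, so it is worth checking the image size limits of whichever OCR service is used before raising the resolution much further.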

Then convert the images to text (OCR):

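I originally wanted to install a library such as easyocr, but for some reason it kept raising errors. I didn't feel like tracking down the path problems, so I called the Baidu API instead. Reference documentation: "General Text Recognition (High-Precision Version) - Text Recognition OCR"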

import base64
import urllib.parse

import requests

API_KEY = "your API_KEY"
SECRET_KEY = "your SECRET_KEY"


def baiduOCR(img_path):
    url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token=" + get_access_token()

    # The image is sent as URL-encoded base64, e.g.
    # get_file_content_as_base64(r"C:\fakepath\page_1.png", True)
    payload = 'image=' + get_file_content_as_base64(img_path, True)
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload.encode("utf-8"))

    print(response.text)


def get_file_content_as_base64(path, urlencoded=False):
    """
    Return the base64 encoding of a file
    :param path: file path
    :param urlencoded: whether to URL-encode the result
    :return: base64-encoded content
    """
    with open(path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf8")
        if urlencoded:
            content = urllib.parse.quote_plus(content)
    return content


def get_access_token():
    """
    Generate the access token from the API key (AK) and secret key (SK)
    :return: access_token, or the string "None" if the request failed
    """
    url = "https://aip.baidubce.com/oauth/2.0/token"
    params = {"grant_type": "client_credentials", "client_id": API_KEY, "client_secret": SECRET_KEY}
    return str(requests.post(url, params=params).json().get("access_token"))

The output looks like this:

{"words_result":[{"words":"a canary in the coal mine"},{"words":"position"},{"words":"煤矿中的金丝雀(表示危险"},{"words":"澄用市场支配地位"},{"words":"的先兆)"},{"words":"◆Academy Awards"},{"words":"◆a cat may look at a king"},{"words":"奥斯卡金像奖(美同电影艺"},{"words":"小人物也有权利:人人平等"},{"words":"术与科学学院奖)"},{"words":"a fig leaf"},{"words":"◆access denial"},{"words":"遮羞布"},{"words":"拒绝访问"},{"words":"◆a global hit"},{"words":"accessible elevator"},{"words":"风摩全球"},{"words":"无障碍电梯"},{"words":"◆a going concern"},{"words":"accommodative monetary"},{"words":"盈利企业"},{"words":"policy"},{"words":"◆a leap in the dark"},{"words":"融通性货币政策"},{"words":"冒险举动:轻举妄动"},{"words":"◆account balance"},{"words":"◆a shot in the arm"},{"words":"账户结余"},{"words":"刺激因素:强心剂"},{"words":"accrued depreciation"},{"words":"◆a slim chance"},{"words":"应计折旧"},{"words":"希望渺茫"},{"words":"acculturation"},{"words":"◆a storm in a teacup"},{"words":"涵化:文化适应;文化移人:"},{"words":"小题大做,大惊小怪"},{"words":"文化互渗"},{"words":"◆a tip off"},{"words":"accumulated fund"},{"words":"密报;举报"},{"words":"累计基金"},{"words":"◆a vampire shopper"},{"words":"◆Achilles'heel"},{"words":"吸血鬼购物者(指半夜网购"},{"words":"阿基里斯之骤:致命弱点:"},{"words":"的人)"},{"words":"要害"},{"words":"◆abducted children"},{"words":"acrobatic gymnastics"},{"words":"被拐儿童"},{"words":"技巧运动"},{"words":"abuse of dominant market"},
# {"words":"active-duty soldier"},{"words":"2"}],"words_result_num":56,"log_id":1922546852819361980}
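Since baiduOCR only prints the raw response, a small parsing step is needed before the text can be used. The sketch below pulls the recognized lines out of a response string shaped like the one above. The function name extract_words is hypothetical, and the error check assumes that failed requests come back as JSON containing error_code and error_msg fields.

```python
import json


def extract_words(response_text):
    """Return the recognized lines from an OCR response, in the order given."""
    data = json.loads(response_text)
    # Assumption: a failed request carries "error_code" and "error_msg".
    if "error_code" in data:
        raise RuntimeError(f"OCR request failed: {data.get('error_msg')}")
    return [item["words"] for item in data.get("words_result", [])]


sample = '{"words_result":[{"words":"a fig leaf"},{"words":"遮羞布"}],"words_result_num":2}'
print(extract_words(sample))
```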

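However, because the book is laid out in two columns side by side, the Chinese and English lines come out interleaved and do not line up with each other. I am still working out the best way to handle this.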
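One possible direction, sketched under several assumptions: if the OCR result carries a bounding box for each line, the lines can be sorted into a left and a right column and read top to bottom, after which each English headword can be grouped with the Chinese lines that follow it. The code assumes each entry has a location object with left, top, width and height, which the general_basic endpoint used above does not return, so a position-aware OCR endpoint would be needed. The function names, the page_width parameter, and the coordinates in the sample are all made up for illustration; only the words are taken from the output above.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")


def split_columns(words_result, page_width):
    """Return the recognized lines in reading order: left column, then right."""
    middle = page_width / 2
    left, right = [], []
    for item in words_result:
        box = item["location"]
        center_x = box["left"] + box["width"] / 2
        (left if center_x < middle else right).append(item)
    ordered = []
    for column in (left, right):
        column.sort(key=lambda item: item["location"]["top"])
        ordered.extend(item["words"] for item in column)
    return ordered


def pair_entries(lines):
    """Group consecutive lines into (english, chinese) pairs."""
    entries = []
    english, chinese = [], []
    for line in lines:
        text = line.lstrip("◆").strip()
        if CJK.search(text):
            chinese.append(text)
        else:
            if chinese:  # an English line after Chinese starts a new entry
                entries.append((" ".join(english), "".join(chinese)))
                english, chinese = [], []
            english.append(text)
    if english or chinese:
        entries.append((" ".join(english), "".join(chinese)))
    return entries


def box(left, top):
    # Hypothetical bounding box; width and height are arbitrary.
    return {"left": left, "top": top, "width": 200, "height": 30}


sample = [
    {"words": "a canary in the coal mine", "location": box(40, 100)},
    {"words": "position", "location": box(340, 102)},
    {"words": "煤矿中的金丝雀(表示危险", "location": box(40, 140)},
    {"words": "澄用市场支配地位", "location": box(340, 142)},
    {"words": "的先兆)", "location": box(40, 180)},
    {"words": "◆Academy Awards", "location": box(340, 182)},
    {"words": "abuse of dominant market", "location": box(40, 220)},
    {"words": "奥斯卡金像奖(美同电影艺", "location": box(340, 222)},
    {"words": "术与科学学院奖)", "location": box(340, 262)},
]

entries = pair_entries(split_columns(sample, page_width=600))
for english, chinese in entries:
    print(english, "|", chinese)
```

Reading the left column before the right one also rejoins an entry that wraps across columns, such as "abuse of dominant market" and "position" here. The grouping is only a heuristic: it treats any line containing Chinese characters as part of a translation, so stray lines such as page numbers and OCR misreadings still need to be cleaned up by hand.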
