Author card
About the author: top-rated content creator in the Java field
Homepage: 码农阿豪
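In the era of big data, complete and accurate data is the foundation of business growth. Using missing phone numbers as a worked example, this article shows how to build an efficient data validation and completion script in Python. We start with requirements analysis, work through the design and optimization of the solution, and end with a script suitable for production use.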
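Suppose we have a database table holding tens of millions of phone numbers, with each number split across several fields. The queries in this article rely on a prefix column (the first three digits), a suffix column (the last four digits), and the full phone_number; each row also carries a province value and a city value.
We start with a basic version that does three things: it generates every prefix/suffix combination, queries the middle four digits that already exist for each combination, and inserts the ones that are missing: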
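from typing import List, Set, Tuple

import pymysql

# DB_CONFIG (connection parameters) is defined in the complete listing later in the article.

def generate_phone_prefix_suffix_pairs() -> List[Tuple[str, str]]:
    """Return every (prefix, suffix) pair: 4 prefixes x 10,000 four-digit suffixes."""
    prefixes = ['157', '185', '178', '172']
    return [(prefix, f"{suffix:04d}")
            for prefix in prefixes
            for suffix in range(10000)]

def get_existing_middles(cursor, prefix: str, suffix: str) -> Set[str]:
    """Fetch the middle four digits already stored for one prefix/suffix pair."""
    # SUBSTRING is 1-based in MySQL: position 4, length 4 covers digits 4 through 7.
    cursor.execute("""
        SELECT SUBSTRING(phone_number, 4, 4)
        FROM phone_numbers
        WHERE prefix=%s AND suffix=%s
    """, (prefix, suffix))
    return {row[0] for row in cursor.fetchall()}

def fill_missing_numbers_basic():
    """Insert every missing number one row at a time (slow; shown as a baseline)."""
    conn = pymysql.connect(**DB_CONFIG)
    try:
        with conn.cursor() as cursor:
            for prefix, suffix in generate_phone_prefix_suffix_pairs():
                existing = get_existing_middles(cursor, prefix, suffix)
                missing = {f"{i:04d}" for i in range(10000)} - existing
                for middle in missing:
                    phone = f"{prefix}{middle}{suffix}"
                    # "省" / "市" are the original placeholder province / city values.
                    cursor.execute("""
                        INSERT INTO phone_numbers
                        VALUES (%s, %s, %s, %s, %s)
                    """, (prefix, suffix, phone, "省", "市"))
                conn.commit()  # one commit per prefix/suffix pair
    finally:
        conn.close()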
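The core of the basic version is a set difference: the full range of 10,000 four-digit middles minus those already stored. The sketch below isolates that step so it runs without a database. The helper name `missing_middles` and the sample `existing` values are made up for illustration and stand in for a real query result.

```python
from typing import Set

def missing_middles(existing: Set[str]) -> Set[str]:
    """Return every 4-digit middle segment ("0000".."9999") not in `existing`."""
    all_middles = {f"{i:04d}" for i in range(10000)}
    return all_middles - existing

# Hypothetical query result: three middles already present for one prefix/suffix pair.
existing = {"0000", "1234", "9999"}
missing = missing_middles(existing)

print(len(missing))        # 9997
print("1234" in missing)   # False
print("0001" in missing)   # True
```

Because the candidates are zero-padded strings, they compare directly against the values returned by `SUBSTRING`, with no integer conversion needed.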
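The basic version has several obvious problems: it inserts rows one at a time, it gives no visibility into progress, and a single error aborts the whole run.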
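A quick back-of-the-envelope calculation shows why per-row inserts hurt. It uses only the counts from the code above (4 prefixes, 10,000 suffixes, 10,000 possible middles). The worst case assumes a completely empty table, which is an illustrative assumption rather than something the article states.

```python
PREFIX_COUNT = 4          # '157', '185', '178', '172'
SUFFIX_COUNT = 10_000     # "0000".."9999"
MIDDLE_COUNT = 10_000     # "0000".."9999"

# Number of prefix/suffix pairs, i.e. the number of SELECT queries issued.
pairs = PREFIX_COUNT * SUFFIX_COUNT

# Upper bound on rows to insert if no number existed yet (assumed worst case).
worst_case_rows = pairs * MIDDLE_COUNT

print(pairs)             # 40000
print(worst_case_rows)   # 400000000
```

In the basic version every one of those rows costs its own `cursor.execute` call, so the number of INSERT statements grows with the number of missing rows, while the number of SELECTs stays fixed at 40,000.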
Using executemany for batch inserts improves performance considerably:
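# Same five-column INSERT as the basic version, reused for batch execution.
INSERT_SQL = """
    INSERT INTO phone_numbers
    VALUES (%s, %s, %s, %s, %s)
"""

def fill_missing_numbers_batch():
    """Insert missing numbers in batches via executemany."""
    batch_size = 1000  # rows per batch
    conn = pymysql.connect(**DB_CONFIG)
    try:
        with conn.cursor() as cursor:
            for prefix, suffix in generate_phone_prefix_suffix_pairs():
                existing = get_existing_middles(cursor, prefix, suffix)
                # Convert to a list so it can be sliced into batches.
                missing = list({f"{i:04d}" for i in range(10000)} - existing)
                for i in range(0, len(missing), batch_size):
                    batch = missing[i:i + batch_size]
                    values = [(prefix, suffix, f"{prefix}{m}{suffix}", "省", "市")
                              for m in batch]
                    cursor.executemany(INSERT_SQL, values)
                conn.commit()  # one commit per prefix/suffix pair
    finally:
        conn.close()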
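The slicing loop in the batch version can be pulled out into a small generator, which makes the batch boundaries easy to check in isolation. This is a sketch only: `chunked` is a hypothetical helper name, not something the original script defines.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# 2,500 missing middles split into batches of 1,000.
missing = [f"{i:04d}" for i in range(2500)]
batches = list(chunked(missing, 1000))

print([len(b) for b in batches])   # [1000, 1000, 500]
print(batches[2][0])               # 2000
```

The last batch is simply shorter than the others, so no padding or special-casing is needed before passing it to `executemany`.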
Add detailed logging and progress monitoring:
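import logging
import time

def setup_logging():
    """Log to both a file and the console."""
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler('fill_missing.log'),
            logging.StreamHandler()
        ]
    )

def log_progress(processed, total, start_time):
    """Every 100 items, log the completion percentage and the estimated time remaining."""
    # Callers increment `processed` before calling, so it is never 0 here.
    if processed % 100 == 0:
        elapsed = time.time() - start_time
        remaining = (elapsed / processed) * (total - processed)
        logging.info(
            f"Progress: {processed}/{total} "
            f"({processed/total:.1%}) | "
            f"Estimated remaining: {remaining/60:.1f} min"
        )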
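The remaining-time figure in the progress log is a linear extrapolation: average time per processed item multiplied by the number of items left. A minimal sketch of that arithmetic follows; the function name `estimate_remaining_seconds` and the sample figures are hypothetical.

```python
def estimate_remaining_seconds(processed: int, total: int, elapsed: float) -> float:
    """Linear extrapolation: average time per item so far times the items left."""
    if processed <= 0:
        raise ValueError("processed must be positive")
    return (elapsed / processed) * (total - processed)

# Hypothetical run: 10,000 of 40,000 pairs done after 300 seconds.
remaining = estimate_remaining_seconds(10_000, 40_000, 300.0)

print(f"{10_000 / 40_000:.1%}")      # 25.0%
print(f"{remaining / 60:.1f} min")   # 15.0 min
```

The estimate assumes every pair takes about the same time. If some pairs have far more missing numbers than others, the figure will drift as the run progresses.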
Strengthen exception handling and transaction management:
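def fill_missing_numbers_safe():
    """Process each prefix/suffix pair in its own transaction; log failures and continue."""
    conn = None  # defined up front so the finally block is safe if connect() fails
    try:
        conn = pymysql.connect(**DB_CONFIG)
        with conn.cursor() as cursor:
            for prefix, suffix in generate_phone_prefix_suffix_pairs():
                try:
                    # Processing logic (query existing middles, insert the missing ones)
                    conn.commit()
                except Exception as e:
                    # Undo the partial work for this pair, then move on to the next one.
                    conn.rollback()
                    logging.error(f"Processing failed: {prefix}{suffix}, error: {e}")
                    continue
    except Exception:
        logging.error("Program terminated abnormally", exc_info=True)
    finally:
        if conn:
            conn.close()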
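The commit-or-rollback pattern can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module purely so the example is self-contained; the table layout, the `insert_groups` helper, and the sample numbers are assumptions for illustration, not the article's schema.

```python
import sqlite3

def insert_groups(conn: sqlite3.Connection, groups: dict) -> list:
    """Insert each group in its own transaction; return the keys that failed."""
    failed = []
    for key, rows in groups.items():
        try:
            conn.executemany(
                "INSERT INTO phone_numbers (phone_number) VALUES (?)", rows)
            conn.commit()
        except sqlite3.Error:
            conn.rollback()      # undo the partial group, then keep going
            failed.append(key)
    return failed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phone_numbers (phone_number TEXT PRIMARY KEY)")
conn.commit()

groups = {
    "ok-1": [("15700000000",), ("15700010000",)],
    "bad":  [("18500000000",), ("18500000000",)],   # duplicate key raises IntegrityError
    "ok-2": [("17800000000",)],
}
failed = insert_groups(conn, groups)
count = conn.execute("SELECT COUNT(*) FROM phone_numbers").fetchone()[0]
conn.close()

print(failed)   # ['bad']
print(count)    # 3
```

The failing group leaves no rows behind, and the groups before and after it are unaffected, which is the behavior the per-pair try/except in the safe version is after.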
The complete implementation, with the optimizations above combined:
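import pymysql
import logging
import time
from typing import List, Tuple, Set

# Configuration
DB_CONFIG = {...}  # connection parameters (elided in the original)
BATCH_SIZE = 1000
LOG_INTERVAL = 100  # note: log_progress as shown earlier hardcodes the same value
# Not defined in the original listing; taken from the basic version's INSERT.
INSERT_SQL = """
    INSERT INTO phone_numbers
    VALUES (%s, %s, %s, %s, %s)
"""

# setup_logging, log_progress, generate_phone_prefix_suffix_pairs and
# get_existing_middles are defined in the earlier sections.
# calculate_missing, handle_error and log_completion are called here
# but are not shown in the original article.

def main():
    setup_logging()
    logging.info("Starting phone number completion task")
    start_time = time.time()
    total = 4 * 10000  # 4 prefixes x 10,000 suffixes
    processed = 0
    conn = None  # defined up front so the finally block is safe if connect() fails
    try:
        conn = pymysql.connect(**DB_CONFIG)
        with conn.cursor() as cursor:
            for prefix, suffix in generate_phone_prefix_suffix_pairs():
                processed += 1
                log_progress(processed, total, start_time)
                try:
                    existing = get_existing_middles(cursor, prefix, suffix)
                    # Must return a sliceable sequence; batch_insert slices it.
                    missing = calculate_missing(existing)
                    if missing:
                        batch_insert(cursor, conn, prefix, suffix, missing)
                except Exception as e:
                    handle_error(conn, prefix, suffix, e)
                    continue
        log_completion(start_time, total)
    except Exception:
        logging.error("Main program error", exc_info=True)
    finally:
        if conn:
            conn.close()

def batch_insert(cursor, conn, prefix, suffix, missing):
    """Insert the missing middles for one pair in batches, committing after each batch."""
    for i in range(0, len(missing), BATCH_SIZE):
        batch = missing[i:i + BATCH_SIZE]
        values = [(prefix, suffix, f"{prefix}{m}{suffix}", "省", "市")
                  for m in batch]
        try:
            cursor.executemany(INSERT_SQL, values)
            conn.commit()
            logging.debug(f"Inserted: {prefix}{suffix} batch {i//BATCH_SIZE+1}")
        except Exception:
            conn.rollback()
            raise  # let main() log the failure and move on to the next pair

if __name__ == "__main__":
    main()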
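The listing calls three helpers, `calculate_missing`, `handle_error` and `log_completion`, whose bodies the article does not show. One plausible way to write them, inferred from how they are called and from the earlier versions, is sketched below. The bodies and log messages are assumptions rather than the author's code, and `_StubConnection` exists only to exercise `handle_error` without a database.

```python
import logging
import time
from typing import List, Set

def calculate_missing(existing: Set[str]) -> List[str]:
    """Sorted list of 4-digit middles absent from `existing` (a list, so it can be sliced)."""
    return sorted({f"{i:04d}" for i in range(10000)} - existing)

def handle_error(conn, prefix: str, suffix: str, error: Exception) -> None:
    """Roll back the open transaction and log which prefix/suffix pair failed."""
    conn.rollback()
    logging.error(f"Processing failed: {prefix}{suffix}, error: {error}")

def log_completion(start_time: float, total: int) -> None:
    """Log the total number of pairs handled and the elapsed time."""
    elapsed = time.time() - start_time
    logging.info(f"Finished {total} prefix/suffix pairs in {elapsed / 60:.1f} minutes")

class _StubConnection:
    """Stand-in for a pymysql connection, used only to exercise handle_error."""
    def __init__(self):
        self.rolled_back = False

    def rollback(self):
        self.rolled_back = True

stub = _StubConnection()
handle_error(stub, "157", "0001", RuntimeError("demo"))

print(stub.rolled_back)                  # True
print(calculate_missing({"0000"})[:3])   # ['0001', '0002', '0003']
```

Returning a sorted list from `calculate_missing` keeps the insert order deterministic between runs, which makes the debug log easier to compare.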
We ran performance tests on the different implementations:
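Version | Throughput | Memory usage | Traceability | Fault tolerance |
---|---|---|---|---|
Basic | 100 rows/min | High | None | Poor |
Batch | 5,000 rows/min | Medium | Basic | Medium |
Complete | 8,000 rows/min | Low | Thorough | Strong |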
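Taking the table's throughput figures at face value, the gap is easier to appreciate as wall-clock time. The one-million-row workload below is an arbitrary illustration, not a number from the tests.

```python
# Throughput figures copied from the table above (rows per minute).
RATES_ROWS_PER_MIN = {"basic": 100, "batch": 5000, "complete": 8000}
ROWS = 1_000_000  # hypothetical workload

def hours_needed(rows: int, rows_per_min: int) -> float:
    """Convert a row count and a rows-per-minute rate into hours."""
    return rows / rows_per_min / 60

for name, rate in RATES_ROWS_PER_MIN.items():
    print(f"{name}: {hours_needed(ROWS, rate):.1f} h")
# basic: 166.7 h
# batch: 3.3 h
# complete: 2.1 h
```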
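This article showed how to handle missing phone number data efficiently in Python. Starting from a basic implementation, we introduced batch operations, progress monitoring and exception handling step by step, and ended up with a robust solution. The key points are batch inserts with executemany, progress logging with a time-remaining estimate, and per-pair commits with rollback on failure.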
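The approach is not limited to phone numbers; it carries over to other scenarios that need data validation and completion. Readers can tune parameters such as the batch size and the log level to suit their own workload.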