Learning Objectives

✅ Understand Python's concurrency model (vs. Java's multithreading and thread pools)
✅ Master coroutine programming with asyncio (vs. Java's virtual threads)
✅ Speed up CPU-bound tasks with multiple processes
✅ Build a high-concurrency web crawler as a hands-on project
| Feature | Java | Python | Key Difference |
|---|---|---|---|
| Thread implementation | OS threads (`java.lang.Thread`) | OS threads (constrained by the GIL) | Python threads are unsuitable for CPU-bound tasks |
| Coroutine support | Virtual threads (Project Loom) | `asyncio` coroutines (single-threaded async) | Python coroutines are more lightweight |
| Process parallelism | New processes via `ProcessBuilder` | `multiprocessing` module | Inter-process communication is simpler in Python |
| Thread pools | `ExecutorService` | `concurrent.futures.ThreadPoolExecutor` | Similar interface design |
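To illustrate the last row, here is a minimal sketch of `concurrent.futures.ThreadPoolExecutor`, whose submit/Future API closely mirrors Java's `ExecutorService` (the `download` task and its URLs are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical IO-bound task
def download(url):
    return f"fetched {url}"

# submit() returns a Future, much like Java's ExecutorService.submit()
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(download, f"https://api/{i}") for i in range(3)]
    for f in futures:
        print(f.result())
```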
```python
import threading

# Define a task (similar to Java's Runnable)
def task(n):
    print(f"Thread running: {n}")

# Start threads (compare Java's Thread.start())
threads = []
for i in range(3):
    t = threading.Thread(target=task, args=(i,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # wait for each thread to finish
```
GIL limitation example:

```python
import threading
import time

# CPU-bound task: multithreading brings no speedup under the GIL
def count_down(n):
    while n > 0:
        n -= 1

# Single-threaded run
start = time.perf_counter()
count_down(10**7)
print(f"1 thread: {time.perf_counter() - start:.2f}s")   # ~0.3s

# Two threads splitting the same amount of work
t1 = threading.Thread(target=count_down, args=(5 * 10**6,))
t2 = threading.Thread(target=count_down, args=(5 * 10**6,))
start = time.perf_counter()
t1.start(); t2.start(); t1.join(); t2.join()
print(f"2 threads: {time.perf_counter() - start:.2f}s")  # ~0.6s (slower!)
```
```python
import asyncio

async def fetch_data(url):
    print(f"Requesting: {url}")
    await asyncio.sleep(1)  # simulate IO wait
    return f"response from {url}"

async def main():
    # Run concurrently (similar to Java's CompletableFuture)
    task1 = asyncio.create_task(fetch_data("https://api/1"))
    task2 = asyncio.create_task(fetch_data("https://api/2"))
    results = await asyncio.gather(task1, task2)
    print(results)

asyncio.run(main())  # total time ~1s (not 2s)
```
```python
import asyncio
import aiohttp

async def fetch_page(url):
    # One session per fetch keeps the example simple; reuse a session in real code
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def crawl():
    urls = ["https://example.com", "https://example.org"]
    tasks = [fetch_page(url) for url in urls]
    pages = await asyncio.gather(*tasks)
    print(f"Fetched {len(pages)} pages")

asyncio.run(crawl())
```
```python
from multiprocessing import Pool

def cpu_intensive(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_intensive, [10**7] * 4)
    print(sum(results))  # roughly 4x faster than a single process (bypasses the GIL)
```
```python
from multiprocessing import Process, Queue

def worker(q):
    q.put("data from child process")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # receive data from the child
    p.join()
```
Concurrently fetch the homepages of multiple news sites
Extract titles and key content
Count high-frequency keywords
Support both synchronous and asynchronous modes (a synchronous sketch follows below; the asynchronous core comes after it)
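For the synchronous mode, a minimal sketch using the `requests` library (the `fetch_news_sync` helper is an assumption for illustration; error handling is omitted):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical synchronous counterpart to the async crawler below
def fetch_news_sync(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    return {"url": url, "title": soup.title.text.strip()}

def crawl_sync(urls):
    # Pages are fetched one after another, so total time grows linearly
    return [fetch_news_sync(url) for url in urls]
```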
Async crawler core:

```python
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_news(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
    soup = BeautifulSoup(html, "lxml")
    return {
        "url": url,
        "title": soup.title.text.strip(),
        # assumes the page has a <div class="content">
        "content": soup.find("div", class_="content").text[:100],
    }

async def main(urls):
    tasks = [fetch_news(url) for url in urls]
    return await asyncio.gather(*tasks)
```
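To connect the two stages, the crawl results can feed the keyword statistics below; a usage sketch with hypothetical URLs:

```python
import asyncio

urls = ["https://news.example.com", "https://news.example.org"]  # hypothetical
news_data = asyncio.run(main(urls))
```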
Multi-process keyword statistics:

```python
import re
from collections import Counter
from multiprocessing import Pool

def count_keywords(text):
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

if __name__ == "__main__":
    news_data = [...]  # crawl results
    with Pool() as pool:
        counters = pool.map(count_keywords, [n["content"] for n in news_data])
    total = sum(counters, Counter())
    print(total.most_common(10))
```
Scope of the GIL: it only affects native threads in the CPython interpreter, and it can be bypassed with C extensions (e.g., NumPy) or `multiprocessing`.

Async programming paradigm: Python's `async`/`await` is syntactic sugar, while Java's virtual threads are more transparent; Python's event loop must be managed explicitly (`asyncio.run()`).

Process serialization constraint: objects passed between Python processes must be pickle-serializable.
```python
# Most custom classes pickle out of the box; __reduce__ lets you
# customize how an instance is reconstructed when needed
class Data:
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        return (self.__class__, (self.value,))
```
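A quick round-trip check of the class above (a usage sketch, not part of the original):

```python
import pickle

d = Data(42)
restored = pickle.loads(pickle.dumps(d))  # __reduce__ drives the rebuild
print(restored.value)  # 42
```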
Limit concurrency with a coroutine pool (here via the standard-library `asyncio.Semaphore`)

```python
import asyncio
import aiohttp

async def limited_crawl(urls, concurrency=5):
    # A semaphore caps the number of in-flight requests
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with semaphore:
                async with session.get(url) as resp:
                    return await resp.text()
        return await asyncio.gather(*(fetch(url) for url in urls))
```
Combine threads and coroutines

```python
import asyncio

# Run a blocking IO call inside a coroutine without stalling the event loop
async def run_blocking(func, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)  # default thread pool
```
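A usage sketch for the helper above (the blocking `slow_io` function is hypothetical):

```python
import time

def slow_io(seconds):
    time.sleep(seconds)  # blocking call that would otherwise freeze the loop
    return f"done after {seconds}s"

async def demo():
    print(await run_blocking(slow_io, 1))

asyncio.run(demo())
```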
Distributed task queues

```python
# Implemented with Celery (roughly comparable to Java's Quartz)
from celery import Celery

app = Celery("tasks", broker="redis://localhost")

@app.task
def process_data(data):
    return data.upper()

process_data.delay("hello")  # executes asynchronously on a worker
```
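Note that `.delay()` only enqueues the task; it runs once a worker is started, e.g. with `celery -A tasks worker` (assuming the code above lives in `tasks.py` and a local Redis broker is running).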
Through Day 11's lesson, you will master:
1️⃣ Python's core concurrency models and their limitations
2️⃣ The performance advantage of coroutines in IO-heavy scenarios
3️⃣ Best practices for multi-process parallel computation
4️⃣ Techniques for debugging and optimizing complex concurrent systems