Fifteen-Day Python Systematic Learning Tutorial: Day 11

Day 11 Detailed Study Plan: Python Concurrency and Parallel Programming

Learning Objectives
✅ Understand Python's concurrency model (vs. Java's threads and thread pools)
✅ Master asyncio coroutine programming (vs. Java's virtual threads)
✅ Speed up CPU-bound tasks with multiple processes
✅ Build a hands-on high-concurrency web crawler


1. Core Concurrency Model Comparison (Java vs. Python)

| Feature | Java | Python | Key Difference |
| --- | --- | --- | --- |
| Thread implementation | OS threads (java.lang.Thread) | OS threads (constrained by the GIL) | Python threads are unsuitable for CPU-bound tasks |
| Coroutine support | Virtual threads (Project Loom) | asyncio coroutines (single-threaded async) | Both are lightweight; Python coroutines are cooperatively scheduled on one thread |
| Process parallelism | ProcessBuilder spawns new processes | multiprocessing module | Python's inter-process communication is more concise |
| Thread pool | ExecutorService | concurrent.futures.ThreadPoolExecutor | Similar interface design |
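
Since ThreadPoolExecutor mirrors ExecutorService, the submit/result pattern will look familiar to Java developers. A minimal sketch (the download task and its URLs are made up for illustration):

from concurrent.futures import ThreadPoolExecutor

def download(url):
    # stand-in for an I/O-bound task such as an HTTP request
    return f"content of {url}"

# the with-block shuts the pool down, like ExecutorService shutdown in Java
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(download, f"https://example.com/{i}") for i in range(4)]
    for f in futures:
        print(f.result())  # blocks until that future completes, like Future.get()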

2. Multithreading and the GIL (1 hour)

2.1 Using Python Threads (vs. Java)

import threading

# Define the task (analogous to Java's Runnable)
def task(n):
    print(f"Thread running: {n}")

# Start the threads (cf. Java's Thread.start())
threads = []
for i in range(3):
    t = threading.Thread(target=task, args=(i,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # wait for the thread to finish
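
As in Java (synchronized / ReentrantLock), shared mutable state still needs explicit locking. A minimal sketch with threading.Lock; the counter and iteration count are illustrative:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:  # equivalent to a synchronized block
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000; without the lock, updates could be lost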

GIL Limitation Example

# CPU-bound work gets no speedup from multiple threads
def count_down(n):
    while n > 0:
        n -= 1

# Single-threaded run (%time is an IPython magic)
%time count_down(10**7)  # roughly 0.3 s

# Two threads, each doing half the work
t1 = threading.Thread(target=count_down, args=(5_000_000,))
t2 = threading.Thread(target=count_down, args=(5_000_000,))
%time t1.start(); t2.start(); t1.join(); t2.join()  # roughly 0.6 s (slower, due to GIL contention)
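
The GIL is released whenever a thread blocks on I/O, so threads do pay off for I/O-bound work. A minimal sketch using time.sleep as a stand-in for network latency:

import threading
import time

def io_task():
    time.sleep(1)  # the GIL is released during the sleep, as during real socket I/O

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{time.perf_counter() - start:.1f}s")  # about 1 s total, not 5 s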

3. Coroutines and Asynchronous Programming (1 hour)

3.1 asyncio Basics (vs. Java Virtual Threads)

import asyncio  

async def fetch_data(url):
    print(f"Requesting: {url}")
    await asyncio.sleep(1)  # simulate an I/O wait
    return f"response from {url}"

async def main():
    # Run concurrently (similar to Java's CompletableFuture)
    task1 = asyncio.create_task(fetch_data("https://api/1"))
    task2 = asyncio.create_task(fetch_data("https://api/2"))
    results = await asyncio.gather(task1, task2)
    print(results)

asyncio.run(main())  # total time about 1 s, not 2 s
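
On Python 3.11+, asyncio.TaskGroup offers structured concurrency closer in spirit to Java's StructuredTaskScope. A minimal sketch reusing fetch_data from above:

import asyncio

async def main():
    # TaskGroup cancels the remaining tasks if one of them fails
    async with asyncio.TaskGroup() as tg:
        t1 = tg.create_task(fetch_data("https://api/1"))
        t2 = tg.create_task(fetch_data("https://api/2"))
    print([t1.result(), t2.result()])

asyncio.run(main())
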
3.2 Async HTTP Client (vs. Java's AsyncHttpClient)

import asyncio
import aiohttp

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def crawl():
    urls = ["https://example.com", "https://example.org"]
    tasks = [fetch_page(url) for url in urls]
    pages = await asyncio.gather(*tasks)
    print(f"Fetched {len(pages)} pages")

asyncio.run(crawl())

4. Multiprocess Parallel Computing (1 hour)

4.1 Process Pools (vs. Java's ForkJoinPool)

from multiprocessing import Pool

def cpu_intensive(n):
    return sum(i*i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_intensive, [10**7]*4)
    print(sum(results))  # roughly 4x faster than one process (bypasses the GIL)
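
concurrent.futures.ProcessPoolExecutor wraps the same machinery in the executor interface from the comparison table, so it reads just like the thread-pool sketch earlier. A minimal equivalent:

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        # map mirrors Pool.map but returns a lazy iterator
        results = list(executor.map(cpu_intensive, [10**7] * 4))
    print(sum(results))
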
4.2 Inter-Process Communication (IPC)

from multiprocessing import Process, Queue

def worker(q):
    q.put("data from the child process")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # receive the data
    p.join()
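
Besides Queue, multiprocessing also provides Pipe for two-way channels and Value/Array for shared memory. A minimal Value sketch:

from multiprocessing import Process, Value

def add_one(counter):
    with counter.get_lock():  # shared memory still needs its own lock
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # 'i' = a C int in shared memory
    procs = [Process(target=add_one, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # 4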

5. Hands-On Project: High-Concurrency News Crawler (1.5 hours)

5.1 Requirements
  • Fetch the front pages of several news sites concurrently

  • Extract titles and key content

  • Count high-frequency keywords

  • Support both synchronous and asynchronous modes (a synchronous sketch follows the async core below)

5.2 Core Implementation

Async crawler core

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_news(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
            soup = BeautifulSoup(html, "lxml")
            content = soup.find("div", class_="content")  # may be None on some sites
            return {
                "url": url,
                "title": soup.title.text.strip() if soup.title else "",
                "content": content.text[:100] if content else "",
            }

async def main(urls):
    tasks = [fetch_news(url) for url in urls]
    return await asyncio.gather(*tasks)
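
For the "synchronous mode" requirement, a minimal counterpart built on requests, assuming the same page structure as fetch_news; pages are fetched one after another, so its wall time is worth comparing with the async version:

import requests
from bs4 import BeautifulSoup

def fetch_news_sync(url):
    resp = requests.get(url, timeout=5)
    soup = BeautifulSoup(resp.text, "lxml")
    content = soup.find("div", class_="content")
    return {
        "url": url,
        "title": soup.title.text.strip() if soup.title else "",
        "content": content.text[:100] if content else "",
    }

def main_sync(urls):
    return [fetch_news_sync(u) for u in urls]  # sequential fetches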

Multiprocess keyword counting

import re
from multiprocessing import Pool
from collections import Counter

def count_keywords(text):
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

if __name__ == "__main__":
    news_data = [...]  # crawl results
    with Pool() as pool:
        counters = pool.map(count_keywords, [n["content"] for n in news_data])
    total = sum(counters, Counter())
    print(total.most_common(10))
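
Gluing the stages together: coroutines for the I/O-bound crawl, then a process pool for the CPU-bound counting. A minimal sketch (the URL list is illustrative) that reuses main and count_keywords from above:

import asyncio
from multiprocessing import Pool
from collections import Counter

if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org"]
    news_data = asyncio.run(main(urls))  # I/O stage: async crawl
    with Pool() as pool:                 # CPU stage: parallel counting
        counters = pool.map(count_keywords, [n["content"] for n in news_data])
    print(sum(counters, Counter()).most_common(10))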

6. Notes for Java Developers

  1. Scope of the GIL

    • Affects only native threads in the CPython interpreter

    • Can be sidestepped with C extensions (e.g., NumPy) or multiprocessing

  2. Async programming paradigm

    • Python's async/await requires explicitly marked coroutine functions; Java's virtual threads let ordinary blocking code run unchanged

    • Python's event loop must be managed explicitly (asyncio.run())

  3. Pickling constraints between processes

    • Objects passed between Python processes must be picklable

    # Most classes pickle automatically via __dict__; __reduce__ is only needed
    # for objects with unpicklable state (open files, locks, sockets, ...)
    class Data:
        def __init__(self, value):
            self.value = value
        def __reduce__(self):
            return (self.__class__, (self.value,))
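
A quick picklability check is a round trip through the pickle module; a minimal sketch using the Data class above:

    import pickle

    d = pickle.loads(pickle.dumps(Data(42)))  # round trip through bytes
    print(d.value)  # 42

    # Something genuinely unpicklable, such as a lambda, fails:
    # pickle.dumps(lambda x: x)  # raises PicklingError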


7. Extension Exercises

  1. Limit crawl concurrency with a coroutine pool (here via asyncio.Semaphore)

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def limited_crawl(urls, concurrency=5):
        semaphore = asyncio.Semaphore(concurrency)  # caps in-flight requests
        async with aiohttp.ClientSession() as session:
            async def bounded(url):
                async with semaphore:
                    return await fetch(session, url)
            return await asyncio.gather(*(bounded(u) for u in urls))

  2. Combine threads and coroutines

    import asyncio

    # Run a blocking call in a worker thread so it does not stall the event loop
    async def run_blocking(func, *args):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, func, *args)  # None = default thread pool

    # On Python 3.9+, asyncio.to_thread(func, *args) does the same thing
    # Usage: await run_blocking(time.sleep, 1)

  3. Distributed task queue

    # Celery is a distributed task queue (roughly what Quartz plus a message
    # broker would give you in Java)
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost")

    @app.task
    def process_data(data):
        return data.upper()

    process_data.delay("hello")  # queued; executed asynchronously by a worker
    # Start a worker with: celery -A tasks worker


After Day 11 you will have mastered:
1️⃣ Python's core concurrency models and their limits
2️⃣ The performance advantage of coroutines in I/O-heavy workloads
3️⃣ Best practices for multiprocess parallel computing
4️⃣ Techniques for debugging and tuning complex concurrent systems
