Python Concurrency (3): Thread Pools (ThreadPoolExecutor)

Table of Contents

  • How a thread pool works
  • Benefits of using a thread pool
  • ThreadPoolExecutor syntax
    • Usage 1
    • Usage 2
  • Refactoring the crawler with a thread pool

How a thread pool works

Threads must be scheduled by the CPU.

  • The thread life cycle
    (Figure 1: the thread life cycle)
    Creating a thread requires the system to allocate resources, and terminating one requires the system to reclaim them, both of which take time. A thread pool avoids this cost by reusing threads instead of repeatedly creating and destroying them.
    (Figure 2: a thread pool waiting for tasks)
    The threads in the pool are created in advance, and the pool waits for tasks to arrive.
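The reuse described above can be observed directly. In this illustrative sketch (the two-worker pool and `task` function are not from the original), ten tasks are served by at most two distinct worker threads:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # Return the name of the worker thread that ran this task
    return threading.current_thread().name

with ThreadPoolExecutor(max_workers=2) as pool:
    names = set(pool.map(task, range(10)))

# 10 tasks, yet at most 2 distinct worker threads served them
print(len(names))
```

Because the pool reuses its workers, the set of thread names never exceeds `max_workers`.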

Benefits of using a thread pool

  1. Eliminates the time overhead of creating and terminating large numbers of threads
  2. The pool holds a limited number of threads, which prevents the system from creating so many threads that the machine becomes overloaded and slows down
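Benefit 2 can be verified: however many tasks are submitted, the number of live pool threads never exceeds `max_workers`. A small sketch (the counting trick relies on the default `ThreadPoolExecutor-…` thread-name prefix):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def worker(_):
    # Count live threads that belong to any ThreadPoolExecutor
    return sum(t.name.startswith("ThreadPoolExecutor")
               for t in threading.enumerate())

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(worker, range(100)))

# 100 tasks, but the pool never grows past 4 worker threads
print(max(counts))
```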

ThreadPoolExecutor syntax

  • Two usages

Usage 1

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    # map() submits craw for every url and returns results in input order
    results = pool.map(craw, urls)
    for result in results:
        print(result)
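Usage 1 can be tried end-to-end with a stand-in for `craw` (the `fake_craw` function and example URLs below are illustrative, so no network access is needed):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_craw(url):
    # Stand-in for craw(): report the "page size" as the url's length
    return len(url)

urls = [f"https://example.com/p{page}" for page in range(1, 6)]

with ThreadPoolExecutor() as pool:
    # Results come back in the same order as the input urls
    results = list(pool.map(fake_craw, urls))

print(results)  # → [22, 22, 22, 22, 22]
```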
Usage 2

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        # submit() schedules one task at a time and returns a Future
        future = pool.submit(parse, html)
        futures[future] = url

    # Alternative: iterate in submission order
    # for future, url in futures.items():
    #     print(url, future.result())

    # as_completed() yields each future as soon as it finishes
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())
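The submit/as_completed pattern above can also be tried self-contained, with a stand-in for `parse` (the `fake_parse` function and in-memory pages are illustrative):

```python
import concurrent.futures

def fake_parse(html):
    # Stand-in for parse(): return the length of the "html"
    return len(html)

pages = {"a.html": "<p>hi</p>", "b.html": "<p>hello</p>"}

futures = {}
with concurrent.futures.ThreadPoolExecutor() as pool:
    for name, html in pages.items():
        future = pool.submit(fake_parse, html)
        futures[future] = name       # remember which input each Future belongs to
    results = {}
    for future in concurrent.futures.as_completed(futures):
        results[futures[future]] = future.result()

print(results)  # → {'a.html': 9, 'b.html': 12} (insertion order may vary)
```

Mapping each Future back to its input via a dict is what lets you pair results with their URLs even though they finish out of order.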

The difference between map and submit: map takes a whole sequence of prepared tasks, dispatches them to the pool in one call, and returns results in input order; submit hands tasks to the pool one at a time, each returning a Future, whose results can then be collected in completion order with as_completed.
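The ordering difference can be demonstrated with a deliberately skewed workload (the `slow` function and sleep times below are illustrative): earlier inputs sleep longer, so map still returns them in input order while as_completed yields them in completion order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow(n):
    # Earlier inputs sleep longer, so they finish later
    time.sleep(0.3 - 0.1 * n)
    return n

with ThreadPoolExecutor(max_workers=3) as pool:
    # map: results come back in input order regardless of finish time
    mapped = list(pool.map(slow, [0, 1, 2]))
    # submit + as_completed: results come back as each task finishes
    futs = [pool.submit(slow, n) for n in [0, 1, 2]]
    completed = [f.result() for f in as_completed(futs)]

print(mapped)     # → [0, 1, 2]
print(completed)  # → [2, 1, 0]
```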

Refactoring the crawler with a thread pool

# -*- coding: utf-8 -*-
"""
Created on Tue May  4 15:24:11 2021

@author: hellohaojun
"""

import concurrent.futures

import requests
from bs4 import BeautifulSoup

# NOTE: everything after "#" is a URL fragment, which the browser handles
# locally and the server never sees -- so every request below actually
# fetches the same front page
urls = [
    f"https://www.cnblogs.com/#p{page}"
    for page in range(1, 50 + 1)
]


def craw(url):
    r = requests.get(url)
    return r.text


def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    # NOTE: "post_item-title" matched no elements in the run below (every
    # result is []); check the site's current markup -- "post-item-title"
    # may be the intended class name
    links = soup.find_all("a", class_="post_item-title")
    return [(link["href"], link.get_text()) for link in links]


# craw: fetch all pages with map(), results in input order
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(craw, urls)
    htmls = list(zip(urls, htmls))
    for url, html in htmls:
        print(url, len(html))

print("craw over")

# parse: submit one task per page, then collect results as they complete
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(parse, html)
        futures[future] = url

    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())

Sample output (truncated). Every page reports the same length because the fragment URLs all fetch the same page, and the empty lists show that the CSS class in parse() matched no elements:

https://www.cnblogs.com/#p44 69419
https://www.cnblogs.com/#p45 69419
https://www.cnblogs.com/#p46 69419
https://www.cnblogs.com/#p47 69419
https://www.cnblogs.com/#p48 69419
https://www.cnblogs.com/#p49 69419
https://www.cnblogs.com/#p50 69419
craw over
https://www.cnblogs.com/#p14 []
https://www.cnblogs.com/#p2 []
https://www.cnblogs.com/#p16 []
https://www.cnblogs.com/#p8 []
https://www.cnblogs.com/#p6 []
https://www.cnblogs.com/#p9 []
https://www.cnblogs.com/#p1 []
https://www.cnblogs.com/#p12 []
https://www.cnblogs.com/#p13 []
