Today we'll look at how to speed up a crawler with multithreading and multiprocessing. In most cases, both can shorten processing time, but they also raise issues such as inter-task communication, data sharing, and locking. For simplicity, we can use Python's standard-library multiprocessing module, which makes it easy to farm work out to multiple processes or threads.
So here is the question: should we use multiple processes or multiple threads, and what is the difference between them? As a rule of thumb, CPU-bound tasks suit multiprocessing, while IO-bound tasks suit multithreading. IO-bound tasks are things like network requests, file reads and writes, and web crawling: they spend most of their time waiting on IO rather than using the CPU, so multithreading can greatly improve a crawler's throughput.
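Conveniently, the two flavors share the same Pool API, so switching between them is a one-line import change. Below is a minimal sketch of that (the fetch function and the example.com URLs are placeholders for illustration, not part of the crawler):
import time
from multiprocessing import Pool as ProcessPool       # worker processes: separate memory, bypass the GIL, suit CPU-bound work
from multiprocessing.dummy import Pool as ThreadPool  # worker threads: cheap to start, share memory, suit IO-bound work

def fetch(url):
    time.sleep(0.5)  # stand-in for an IO-bound wait such as an HTTP request
    return url

if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(5)]
    with ThreadPool(4) as pool:
        print(pool.map(fetch, urls))  # the same map() call drives either pool type
    with ProcessPool(4) as pool:
        print(pool.map(fetch, urls))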
First, let's get a feel for the speed of pool-based parallel execution through a simple example.
# -*- coding: utf-8 -*-
# @Author: lemon
# @Date: 2019-09-12 9:12
# @Last Modified by: lemon
# @Last Modified time: 2019-09-12 9:30
# @function: process pool example
import time
from multiprocessing import Pool

def run(fn):
    '''
    :param fn: one element of the input list
    :return: None
    '''
    time.sleep(1)
    print(fn * fn)

if __name__ == "__main__":
    testFL = [1, 2, 3, 4, 5, 6]
    print('########### serial execution ###########')  # serial execution (single process)
    s = time.time()
    for fn in testFL:
        run(fn)
    t1 = time.time()
    print(f'serial execution time : {int(t1 - s)}')
    print('########### parallel execution ###########')  # create multiple processes and run in parallel
    pool = Pool(10)        # create a pool with 10 worker processes
    pool.map(run, testFL)  # testFL: the data list to process; run: the function applied to each element
    pool.close()           # close the pool so it accepts no new tasks
    pool.join()            # block the main process until all workers exit
    t2 = time.time()
    print(f'parallel execution time : {int(t2 - t1)}')
Comparing the serial and parallel timings shows how much a pool speeds things up: the serial version of the program above takes about 6 seconds, while the parallel version finishes in about 2 seconds. Note that this example uses a process pool; the thread-pool variant, which suits IO-bound work better, is sketched below.
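For comparison, here is a sketch of the same benchmark driven by a thread pool instead. multiprocessing.dummy exposes the identical Pool interface on top of threads, so only the import changes:
import time
from multiprocessing.dummy import Pool as ThreadPool

def run(fn):
    time.sleep(1)  # simulate an IO-bound wait
    print(fn * fn)

if __name__ == '__main__':
    testFL = [1, 2, 3, 4, 5, 6]
    s = time.time()
    pool = ThreadPool(10)  # 10 threads, enough for one per task
    pool.map(run, testFL)
    pool.close()
    pool.join()
    print(f'threaded execution time : {int(time.time() - s)}')
Since the tasks merely sleep, the threaded run should also finish in roughly a second, and threads avoid most of the start-up overhead of spawning processes.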
Next, let's improve the earlier Lianjia second-hand housing crawler to fetch pages with multiple threads. If you are not yet familiar with the crawler itself, read my earlier posts before working through this part.
# -*- coding: utf-8 -*-
# @Author: lemon
# @Date: 2019-09-11 23:48
# @Last Modified by: lemon
# @Last Modified time: 2019-09-11 23:58
# @function: Lianjia second-hand housing multithreaded crawler example
import os
import time

import requests
import pandas as pd
from lxml import etree
from multiprocessing.dummy import Pool as pl

# request headers for the crawler
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'Cookie': 'TY_SESSION_ID=68ea2c4e-5277-4dab-b8f6-2bdde5e61c57; lianjia_uuid=840489de-2c37-40de-9c15-522d4f0a4b7d; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1568118418; select_city=430100; digv_extends=%7B%22utmTrackId%22%3A%2221583074%22%7D; all-lj=3d8def84426f51ac8062bdea518a8717; lianjia_ssid=1c49f18b-d180-45d1-8317-d59aa8e69c1d; _qzjc=1; _jzqa=1.3637271738872210000.1568118544.1568118544.1568118544.1; _jzqc=1; _jzqy=1.1568118544.1568118544.1.jzqsr=baidu|jzqct=%E9%93%BE%E5%AE%B6.-; _jzqckmp=1; UM_distinctid=16d1b26177746b-06b9b5b00fc26-e343166-144000-16d1b26177891e; CNZZDATA1255849590=1544917587-1568118208-https%253A%252F%252Fsp0.baidu.com%252F%7C1568118208; CNZZDATA1254525948=1761745203-1568117644-https%253A%252F%252Fsp0.baidu.com%252F%7C1568117644; CNZZDATA1255633284=1446432833-1568115241-https%253A%252F%252Fsp0.baidu.com%252F%7C1568115241; _smt_uid=5d779710.58312570; CNZZDATA1255604082=118953400-1568116187-https%253A%252F%252Fsp0.baidu.com%252F%7C1568116187; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216d1b261a3b5a4-0f07778157987f-e343166-1327104-16d1b261a3c575%22%2C%22%24device_id%22%3A%2216d1b261a3b5a4-0f07778157987f-e343166-1327104-16d1b261a3c575%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E4%BB%98%E8%B4%B9%E5%B9%BF%E5%91%8A%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fsp0.baidu.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc.php%3Ft%3D06KL00c00fZg9KY0FN9B0nVfAs02-LII00000Fo-c-C00000LEiQaL.THLKVQ1i0A3qnjcsnjn1rHKxn-qCmyqxTAT0T1d-mhFhnHf3uW0snj0srH990ZRqf1cLwbnkwRR1PRckfRczPRNAP1T%22%2C%22%24latest_referrer_host%22%3A%22sp0.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E9%93%BE%E5%AE%B6%22%2C%22%24latest_utm_source%22%3A%22baidu%22%2C%22%24latest_utm_medium%22%3A%22pinzhuan%22%2C%22%24latest_utm_campaign%22%3A%22sousuo%22%2C%22%24latest_utm_content%22%3A%22biaotimiaoshu%22%2C%22%24latest_utm_term%22%3A%22biaoti%22%7D%7D; _ga=GA1.2.144608934.1568118547; _gid=GA1.2.694153477.1568118547; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1568118582; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiOGQ5NGM2ZGMwNWIwMWZjYTRmOWJiNTI5MTdmNjE4ODRmNWY2ZDZlOTE3MzhhNWRkYjZiMTcxNDJmZDc4N2Q1ZmNiMTA2NGFhOWMzMDk1MjZkODFhZTQyYWJhM2NiMzBkZDhhMmQ4MmNkODAwYTJhMDU1ODM2ODNlMmRlYTliYmMzYTVkYmExNjZjNjhmMGJmZjVmNTY3OTE3MDE2NzAwMWZjZmQzZTdmYmZlOGIwZTc1NDE4NDgyZmVmZWUwNzQwMDY2OTczYTdhYzQzZTExNWNkZmYxZjExYTc5MmQzMzM1MTkxOTQzMTk4N2RlY2ZhNTI0ZWI3MjhiMDg5YzU3NmYwMmNkYjZmMDFmNmViOTdmNTExYzZmOTliNzA0MDVkNDVkMTYzY2E5Y2Y2MzUwNDVkYjQ3MzliZTFmNzBjMGNkYjUzYzU3YjM1ZGYxMzQ4ODljZWMzNzg3ZjAxMzczZFwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCI2M2JkMTI2Y1wifSIsInIiOiJodHRwczovL2NzLmxpYW5qaWEuY29tL2Vyc2hvdWZhbmcvIiwib3MiOiJ3ZWIiLCJ2IjoiMC4xIn0=; _qzja=1.892590918.1568118543804.1568118543804.1568118543804.1568118574079.1568118581879.0.0.0.4.1; _qzjb=1.1568118543804.4.0.0.0; _qzjto=4.1.0; _jzqb=1.4.10.1568118544.1'
}
pre_url = 'https://cs.lianjia.com/ershoufang/pg'

# download a page and return an lxml selector for it
def download(url):
    html = requests.get(url, headers=headers)
    time.sleep(2)  # be polite: pause between requests
    return etree.HTML(html.text)

# download a listing image
def download_img(image_url, image_name):
    img = requests.get(image_url, headers=headers)
    os.makedirs('./lianjia_house_image', exist_ok=True)  # make sure the target directory exists
    with open('./lianjia_house_image/{}.jpg'.format(image_name), 'wb') as f:
        f.write(img.content)

# write the scraped data to a DataFrame and save it as an Excel file
def writer_to_excel(data):
    frame = pd.DataFrame(data)
    frame.to_excel('链家二手房数据.xlsx')

# scrape one Lianjia listing page and return the extracted records
def spider(list_url):
    # per-page storage for the DataFrame columns
    data = {
        'title': [],
        'layout': [],
        'location': [],
        'value': [],
        'house_years': [],
        'mortgage_info': []
    }
    selector = download(list_url)
    house_list = selector.xpath('//*[@id="content"]/div[1]/ul/li')
    for house in house_list:
        title = house.xpath('div[1]/div[1]/a/text()')[0]
        layout = house.xpath('div[1]/div[2]/div/text()')[0]
        location = house.xpath('div[1]/div[3]/div/text()')[0]
        value = house.xpath('div[1]/div[6]/div[1]/span/text()')[0]
        image_url = house.xpath('a/img[2]/@data-original')
        if len(image_url) > 0:
            download_img(image_url[0], title)
        else:
            print(f'{title} has no image...')
        print(title)
        # build the detail-page URL
        house_detail_url = house.xpath('div[1]/div[1]/a/@href')[0]
        sel = download(house_detail_url)  # download the detail page
        time.sleep(1)
        house_years = sel.xpath('//*[@id="introduction"]/div/div/div[2]/div[2]/ul/li[5]/span[2]/text()')[0]
        mortgage_info = sel.xpath('//*[@id="introduction"]/div/div/div[2]/div[2]/ul/li[7]/span[2]/text()')[0].strip()  # strip leading/trailing whitespace
        data['title'].append(title)
        data['layout'].append(layout)
        data['location'].append(location)
        data['value'].append(value)
        data['house_years'].append(house_years)
        data['mortgage_info'].append(mortgage_info)
    return data

if __name__ == '__main__':
    pool = pl(4)  # initialize a pool of 4 threads
    house_url = [pre_url + str(x) + '/' for x in range(1, 3)]  # list comprehension building the page URLs to crawl
    results = pool.map(spider, house_url)  # one data dict per page
    pool.close()  # close the pool so it accepts no new tasks
    pool.join()   # block the main thread until all worker threads exit; call after close()
    # merge the per-page results and write the Excel file once; writing inside
    # spider() would have each thread overwrite the same file
    merged = {key: sum((page[key] for page in results), []) for key in results[0]}
    writer_to_excel(merged)
If you are interested, compare this against the earlier single-threaded program and see which one finishes faster when crawling the same amount of content.
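To measure the difference yourself, here is a minimal timing sketch (it assumes it lives in the same file as spider, pre_url, and pl above; the timed helper is purely illustrative):
import time

def timed(label, job):
    start = time.time()
    job()
    print(f'{label}: {time.time() - start:.1f}s')

if __name__ == '__main__':
    urls = [pre_url + str(x) + '/' for x in range(1, 3)]
    timed('serial', lambda: [spider(url) for url in urls])
    pool = pl(4)
    timed('threaded', lambda: pool.map(spider, urls))
    pool.close()
    pool.join()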