In the data-driven era, web scraping has become a core means of acquiring information. For beginners, the basic requests library is enough for simple data collection, but once a task outgrows simple scripts, more professional tools and methods are needed. This article takes a deep dive into engineering practice with the Scrapy framework and shows how asynchronous programming delivers a performance breakthrough. With these skills, your crawler will make the leap from "toy grade" to "industrial grade"!
A standard Scrapy project contains the following core components:
```text
myproject/
├── scrapy.cfg            # deployment configuration
└── myproject/
    ├── __init__.py
    ├── items.py          # item (data container) definitions
    ├── middlewares.py    # middleware components
    ├── pipelines.py      # item-processing pipelines
    ├── settings.py       # project-wide settings
    └── spiders/          # the spiders themselves
        ├── __init__.py
        └── example.py    # spider implementation
```
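As a sketch of what items.py might hold for the news example below (the field names are illustrative, not part of the original project):

```python
# items.py -- a minimal sketch; field names are illustrative
import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()  # article headline
    link = scrapy.Field()   # URL of the article
```

The spider below yields plain dicts with the same field names; switching it to ArticleItem is optional.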
```python
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"

    def start_requests(self):
        urls = [
            'https://news.site/page/1',
            'https://news.site/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # extract article titles and links
        articles = response.css('div.article')
        for article in articles:
            yield {
                'title': article.css('h2::text').get(),
                'link': article.css('a::attr(href)').get(),
            }
        # follow pagination automatically
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Key features:

- start_requests() generates the initial requests and hands each response to parse() via the callback.
- CSS selectors (response.css) extract structured fields without manual HTML parsing.
- response.follow() resolves relative links and handles pagination automatically.
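To try the spider, run it from the project root and export the scraped items; the output filename here is just an example:

```bash
# -O overwrites the output file; use -o to append (Scrapy 2.x)
scrapy crawl news -O articles.json
```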
A typical item-processing flow:
```python
# pipelines.py
import pymongo
from scrapy.exceptions import DropItem

class MongoDBPipeline:
    def __init__(self):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['news_database']

    def process_item(self, item, spider):
        self.db['articles'].insert_one(dict(item))
        return item

class DuplicatesPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['link'] in self.urls_seen:
            raise DropItem("Duplicate item: %s" % item)
        self.urls_seen.add(item['link'])
        return item
```
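In a real project you would usually read the MongoDB URI from settings.py instead of hard-coding it. A hedged sketch using Scrapy's standard from_crawler / open_spider / close_spider hooks (the MONGO_URI setting name is an assumption you would define yourself):

```python
import pymongo

class MongoDBPipeline:
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI is an illustrative setting name defined in settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client['news_database']

    def close_spider(self, spider):
        self.client.close()  # release the connection when the spider finishes

    def process_item(self, item, spider):
        self.db['articles'].insert_one(dict(item))
        return item
```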
Wiring the pipelines together:
```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
    'myproject.pipelines.MongoDBPipeline': 800,
}
```

Lower numbers run earlier, so duplicates are dropped before anything is written to MongoDB.
Request-level counter-measures such as User-Agent rotation and proxying live in downloader middlewares:

```python
# middlewares.py
from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = UserAgent().random

class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.example.com:8080"
```
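If you have a pool of proxies rather than a single endpoint, a rotating variant is a small extension. A sketch only: the proxy addresses are placeholders, and the class would need to be registered in DOWNLOADER_MIDDLEWARES like the middlewares below:

```python
import random

class RotatingProxyMiddleware:
    # placeholder proxy pool; in practice load it from settings or a file
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)
```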
Enabling the middlewares:
```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
    'myproject.middlewares.ProxyMiddleware': 755,
}
```
The Scrapy-Redis architecture uses Redis as a shared task queue and deduplication store, so multiple nodes can crawl cooperatively. Its core components include a Redis-backed scheduler (the shared request queue), the RFPDupeFilter for cluster-wide deduplication, and RedisSpider base classes that read their start URLs from Redis.
Install the dependencies:

```bash
pip install scrapy-redis redis
```
Update settings.py:
```python
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'
```
Write the distributed spider:
```python
from scrapy_redis.spiders import RedisSpider

class ClusterSpider(RedisSpider):
    name = 'distributed_spider'
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # parsing logic
        yield {
            'url': response.url,
            'content': response.css('body::text').get()
        }
```
Startup commands:

```bash
# seed the start URL first, then run the same spider on as many nodes as you like
redis-cli lpush myspider:start_urls https://example.com
scrapy runspider cluster_spider.py
```
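If you prefer to seed the queue from Python rather than redis-cli, a short sketch with redis-py (the URL range is illustrative):

```python
# push many start URLs into the shared Redis queue
import redis

r = redis.Redis.from_url('redis://localhost:6379')
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]  # illustrative list
r.lpush('myspider:start_urls', *urls)
print(r.llen('myspider:start_urls'), "start URLs queued")
```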
A synchronous client has to wait for one request to finish before sending the next, whereas asynchronous code uses an event loop for non-blocking I/O, keeping many requests in flight at once. The comparison below makes the difference concrete.
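As a baseline for that comparison, a synchronous version might look like this (a minimal sketch using requests; the URL list mirrors the async example that follows):

```python
# synchronous baseline: requests are issued strictly one after another
import requests

def crawl_sync(urls):
    results = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            results.append(resp.text)
        except requests.RequestException as exc:
            print(f"request failed: {url}, error: {exc}")
    return results

if __name__ == "__main__":
    urls = [f'https://example.com/page/{i}' for i in range(1, 1001)]
    pages = crawl_sync(urls)
    print(f"fetched {len(pages)} pages")
```

The asynchronous version with aiohttp, using a semaphore for concurrency control: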
```python
import aiohttp
import asyncio
from datetime import datetime

CONCURRENCY = 100  # concurrency limit

async def fetch(session, url, semaphore):
    async with semaphore:
        try:
            async with session.get(url, timeout=10) as response:
                print(f"{datetime.now()} fetching {url}")
                return await response.text()
        except Exception as e:
            print(f"request failed: {url}, error: {str(e)}")
            return None

async def main(urls):
    connector = aiohttp.TCPConnector(limit=0)  # no connection cap; the semaphore does the throttling
    async with aiohttp.ClientSession(connector=connector) as session:
        semaphore = asyncio.Semaphore(CONCURRENCY)
        tasks = [fetch(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r is not None]

if __name__ == "__main__":
    urls = [f'https://example.com/page/{i}' for i in range(1, 1001)]
    results = asyncio.run(main(urls))
    print(f"fetched {len(results)} pages successfully")
```
Benchmark results:
| Approach | Time for 1000 requests | CPU usage | Memory usage |
|---|---|---|---|
| Synchronous | 82.3s | 12% | 150MB |
| Asynchronous (100 concurrent) | 6.7s | 68% | 210MB |
To avoid hammering the target site and triggering its anti-bot defenses, add a random delay before each request:

```python
import random

async def fetch(session, url):
    await asyncio.sleep(random.uniform(0.5, 1.5))  # random 0.5-1.5s delay
    # request logic goes here
```
Use the tenacity library for reliable retries:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(session, url):
    async with session.get(url) as response:
        return await response.text()
```
Two common ways to cap the request rate:

- asyncio.Semaphore limits how many tasks run at the same time (as in the aiohttp example above).
- The asyncio-throttle library limits the number of requests per second:

```python
from asyncio_throttle import Throttler

throttler = Throttler(rate_limit=50)  # at most 50 requests per second

async def throttled_fetch(session, url):
    async with throttler:
        return await fetch(session, url)  # fetch() from the aiohttp example
```
| Scenario | Recommended approach | Strengths |
|---|---|---|
| Small and medium targeted crawls | Scrapy | High development efficiency, complete built-in components |
| Large-scale distributed crawls | Scrapy-Redis | Horizontal scaling, automatic failover |
| High-frequency API harvesting | aiohttp + asyncio | Best single-machine performance, suited to tens-of-thousands-QPS workloads |
| JavaScript-rendered pages | Playwright / Puppeteer | Full JavaScript rendering, simulates real browser behaviour (see the sketch below) |
| Anti-bot challenges (CAPTCHAs) | OCR / captcha-solving services | Breaks image challenges; may require machine learning |
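To illustrate the dynamic-rendering row, a minimal sketch with Playwright's async API (assumes pip install playwright and playwright install chromium; the URL is a placeholder):

```python
import asyncio
from playwright.async_api import async_playwright

async def render(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # wait for JS-driven requests to settle
        html = await page.content()  # fully rendered HTML
        await browser.close()
        return html

if __name__ == "__main__":
    print(len(asyncio.run(render("https://example.com"))))
```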
By combining theory with practice in this way, you can build a stable, efficient, and scalable crawling system and make the leap from junior developer to seasoned engineer in the data-collection field! As a final optimization, replacing asyncio's default event loop with uvloop can squeeze out additional throughput:
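A minimal sketch, assuming it is appended to the same script as the aiohttp example (so main() is in scope); uvloop runs on Linux and macOS only:

```python
import asyncio
import uvloop  # pip install uvloop

uvloop.install()  # make asyncio.run() use uvloop's faster event loop

urls = [f'https://example.com/page/{i}' for i in range(1, 1001)]
results = asyncio.run(main(urls))  # main() from the aiohttp example above
print(f"fetched {len(results)} pages")
```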