| Tool type | Representative tool | Core strengths | Typical use cases | Learning curve |
| --- | --- | --- | --- | --- |
| HTTP requests | Requests | Simple, intuitive synchronous API | Scraping static pages | ★☆☆☆☆ |
| | aiohttp | High-performance async I/O | High-concurrency, large-scale crawling | ★★★☆☆ |
| | PyCurl | libcurl (C) under the hood, maximum throughput | High-frequency data capture (e.g. trading data) | ★★★★☆ |
| Page parsing | BeautifulSoup | Flexible API, supports multiple parsers | Parsing messy or complex HTML | ★☆☆☆☆ |
| | lxml | Built on libxml2, best raw performance | Parsing very large volumes of markup | ★★☆☆☆ |
| | PyQuery | jQuery-style syntax | Quick start for front-end developers | ★☆☆☆☆ |
| Dynamic rendering | Selenium | Full browser environment | Pages with complex JavaScript interaction | ★★★☆☆ |
| | Playwright | Multi-browser support, modern API | Cross-browser automation and scraping | ★★★☆☆ |
| | Scrapy-Splash | Rendering service integrated with Scrapy | Dynamic pages inside a Scrapy project | ★★★★☆ |
| Crawler frameworks | Scrapy | Full-featured crawling framework | Large-scale structured data collection | ★★★★☆ |
| | Crawlera | Cloud proxy service from the Scrapy ecosystem (now Zyte) | Enterprise-grade anti-bot scenarios | ★★★☆☆ |
| Browser automation | selenium-wire | Request/response interception | Capturing API traffic behind a page | ★★★☆☆ |
| | MechanicalSoup | Simplified form handling | Scraping pages behind a login form | ★★☆☆☆ |
| Special scenarios | Newspaper3k | Automatic news-article extraction | Media and news sites | ★☆☆☆☆ |
| | ScrapingBee | Rendering as an API service | Getting rendered results quickly | ★☆☆☆☆ |
Requests (the basics)

```python
import requests

# Send a GET request with custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...'
}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)
print(response.json())  # Parse a JSON response
```
aiohttp (async, high performance)

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['url1', 'url2', 'url3']  # replace with real URLs
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        htmls = await asyncio.gather(*tasks)
        print([len(html) for html in htmls])

asyncio.run(main())
```
BeautifulSoup (flexible parsing)

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# CSS selectors
title = soup.select_one('h1.title').text

# Iterate over elements
for item in soup.select('div.item'):
    print(item.get('id'))
```
lxml (high-performance parsing)

```python
from lxml import etree

tree = etree.HTML(html_content)

# XPath selection
title = tree.xpath('//h1[@class="title"]/text()')[0]

# Bulk extraction
links = tree.xpath('//a/@href')
```
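The comparison table also lists PyQuery for its jQuery-style selectors; here is a minimal sketch, assuming `html_content` already holds the fetched HTML as in the snippets above:

```python
from pyquery import PyQuery as pq

doc = pq(html_content)

# jQuery-style CSS selection
title = doc('h1.title').text()

# Iterate over matched elements
for link in doc('a').items():
    print(link.attr('href'))
```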
Playwright (modern automation)

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for the JavaScript-rendered content
    page.goto('https://example.com')
    page.wait_for_selector('div.loaded-content')

    # Take a screenshot or extract data
    page.screenshot(path='screenshot.png')
    content = page.inner_text('div.content')

    browser.close()
```
Selenium (the traditional approach)

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Explicit wait for the element to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.content'))
)
print(element.text)

driver.quit()
```
Scrapy (full-featured framework)

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination automatically
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it with:

```bash
scrapy crawl products -o products.json
```
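For dynamic pages inside a Scrapy project, the comparison table mentions Scrapy-Splash; a minimal sketch, assuming a Splash instance is running and the scrapy-splash downloader middleware plus SPLASH_URL are already configured in settings.py:

```python
import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'js_pages'

    def start_requests(self):
        # Render the page through Splash before it reaches parse()
        yield SplashRequest('https://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
```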
Newspaper3k (news-article extraction)

```python
from newspaper import Article

url = 'https://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.text)
print(article.authors)
print(article.publish_date)

# Automatically extract keywords and a summary
article.nlp()
print(article.keywords)
print(article.summary)
```
selenium-wire (request interception)

```python
from seleniumwire import webdriver

# Route the browser's traffic through an upstream proxy
options = {
    'proxy': {
        'http': 'http://user:[email protected]:8080',
        'https': 'http://user:[email protected]:8080',
    }
}
driver = webdriver.Chrome(seleniumwire_options=options)
driver.get('https://example.com')

# Inspect every request the page made
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
```
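The comparison table also lists MechanicalSoup for form-driven login pages; a minimal sketch (the login URL, form selector, and field names are illustrative assumptions):

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open('https://example.com/login')

# Fill in and submit the login form (selector and field names are assumed)
browser.select_form('form[action="/login"]')
browser['username'] = 'your_username'
browser['password'] = 'your_password'
browser.submit_selected()

# The session now carries the login cookies; browser.page is a BeautifulSoup object
print(browser.page.title)
```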
A fuller, browser-like set of request headers:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Referer': 'https://google.com',
    'Connection': 'keep-alive',
}
```
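To make the disguise less static, a common refinement is rotating the User-Agent on every request; a minimal sketch, where the pool of UA strings is purely illustrative:

```python
import random
import requests

# Illustrative pool; in practice keep a larger, up-to-date list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def get_with_random_ua(url):
    # Pick a different User-Agent for each request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```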
Fetching free proxies:

```python
import requests
from lxml import etree

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    tree = etree.HTML(response.text)

    proxies = []
    for row in tree.xpath('//table[@id="proxylisttable"]/tbody/tr'):
        ip = row.xpath('./td[1]/text()')[0]
        port = row.xpath('./td[2]/text()')[0]
        proxies.append(f'{ip}:{port}')

    return proxies
```
Validating proxies:

```python
import requests

def check_proxy(proxy):
    try:
        response = requests.get(
            'https://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5
        )
        if response.status_code == 200:
            return True
    except requests.RequestException:
        pass
    return False
```
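Putting the two helpers together, a small sketch that filters the free list down to proxies that actually respond and uses the first one (it continues from the snippets above, so `requests`, `get_proxies` and `check_proxy` are already defined):

```python
def get_working_proxies():
    # Keep only the proxies that pass the httpbin check
    return [p for p in get_proxies() if check_proxy(p)]

working = get_working_proxies()
if working:
    response = requests.get(
        'https://example.com',
        proxies={'http': working[0], 'https': working[0]},
        timeout=10
    )
    print(response.status_code)
```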
Example of calling a third-party CAPTCHA-solving API:

```python
import requests
import base64

def recognize_captcha(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()

    # Encode the image as Base64
    image_base64 = base64.b64encode(image_data).decode()

    # Call the CAPTCHA-solving service's API
    response = requests.post(
        'https://api.dama.com/recognize',
        json={
            'image': image_base64,
            'type': 'common'
        },
        headers={'API-Key': 'your_api_key'}
    )

    return response.json()['code']
```
Synchronous code (slow):

```python
import requests

for url in urls:
    response = requests.get(url)
    process(response)
```
Asynchronous code (fast):

```python
import asyncio
import aiohttp

async def fetch_and_process(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.text()
            process(data)  # the processing function must be safe to call from async code

async def main():
    await asyncio.gather(*[fetch_and_process(url) for url in urls])

asyncio.run(main())
```
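To avoid hammering the target site (and to lower the chance of getting blocked), it is common to cap the number of in-flight requests; a minimal sketch using asyncio.Semaphore, where the limit of 10 is an arbitrary example value:

```python
import asyncio
import aiohttp

CONCURRENCY_LIMIT = 10  # arbitrary example value

async def fetch_limited(session, semaphore, url):
    # At most CONCURRENCY_LIMIT requests run at the same time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def crawl(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *[fetch_limited(session, semaphore, url) for url in urls]
        )
```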
Scrapy-Redis implementation:

```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```

Multiple crawler nodes then share the same Redis queue:

```bash
scrapy crawl myspider -s REDIS_URL=redis://localhost:6379/0
```
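On the spider side, scrapy-redis also provides a RedisSpider base class that pulls start URLs from a Redis list instead of `start_urls`; a minimal sketch, where the spider name and `redis_key` are illustrative:

```python
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # Each node pops start URLs pushed to this Redis list, e.g. with:
    #   redis-cli lpush myspider:start_urls https://example.com
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```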
```
Start
└── Static page?
    ├── Yes → Requests + BeautifulSoup
    └── No → Needs dynamic rendering?
        ├── Yes → Complex interaction?
        │   ├── Yes → Playwright / Selenium
        │   └── No → Scrapy-Splash / ScrapingBee
        └── No → Structured data at scale?
            ├── Yes → Scrapy framework
            └── No → Special scenario?
                ├── Yes → Pick by scenario (Newspaper3k, etc.)
                └── No → Combine the basic tools
```
Before starting a project, read the target site's Terms of Service and Privacy Policy, and consult legal counsel if necessary.

With these tools and strategies, you can handle well over 90% of scraping scenarios. Start with simple projects, build up experience gradually, and dig deeper into specific anti-bot techniques only when you actually run into them.