When you see this angry red error in your Python scraping project (especially if you're an urllib user), put down the keyboard you were about to smash! An HTTP 403 status code is the site's security guard shouting: "I know exactly what you're up to, and you're still not coming in!"
A real-world case: last week I was scraping product data from an e-commerce site with the requests library. The first 100 or so requests went through fine, then the server suddenly started spraying 403s at me. It turned out the server had flagged my IP as a crawler (sob)...
Many sites inspect the User-Agent field in the request headers. If you send Python's default UA, you might as well have "I AM A BOT" taped to your forehead:
# Bad example (do NOT copy this!)
import urllib.request
response = urllib.request.urlopen('https://example.com')  # ships Python's default User-Agent
Rapid back-to-back requests can make the server think you're launching a DDoS attack. I once set a 0.1-second interval between requests and got my IP banned within five minutes (a lesson written in blood and tears).
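When you do get throttled, some servers include a Retry-After header that says how long to back off. A minimal sketch of honoring it (the URL is a placeholder, and this assumes Retry-After arrives as a number of seconds):

import time
import requests

response = requests.get('https://example.com/api')
if response.status_code in (403, 429):
    # Retry-After may be absent or an HTTP date; fall back to a conservative 60s
    wait = response.headers.get('Retry-After', '60')
    time.sleep(int(wait) if wait.isdigit() else 60)
    response = requests.get('https://example.com/api')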
And just like a VIP lounge demands a membership card, some pages can only be accessed with a valid cookie or token.
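For token-protected endpoints, that usually means an Authorization header. A minimal sketch (the token value and URL are placeholders, and a Bearer scheme is assumed):

import requests

token = 'xxxxxx'  # placeholder: obtained from the site's login/auth flow
headers = {'Authorization': f'Bearer {token}'}  # assumes the site uses Bearer tokens
response = requests.get('https://example.com/protected', headers=headers)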
The fixes include, but are not limited to, the following. First, disguise your User-Agent as a real browser:
from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# Attach the headers when building the Request object
req = request.Request(url='https://example.com', headers=headers)
response = request.urlopen(req)
(Super important) A recommended list of commonly used UAs:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
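A minimal sketch of picking a UA at random from such a list, so consecutive requests don't all share one fingerprint (the two strings are the ones above):

import random
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
]

# Pick a fresh UA for every request
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://example.com', headers=headers)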
Some sites also check the Referer header to verify you arrived through normal navigation; setting one that points at a search engine often helps:

headers = {
    'Referer': 'https://www.google.com/',
    # other header fields...
}
For pages gated behind a login, a requests.Session keeps your headers and cookies across requests:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': '...',
    'Accept-Language': 'zh-CN,zh;q=0.9'
})
# Log in first to obtain the cookies
login_response = session.post(login_url, data=credentials)
# Subsequent requests carry the cookies automatically
data = session.get(target_url).json()
To stay under the rate limit, throttle yourself with a randomized delay:

import time
import random

for page in range(1, 101):
    # Random delay of 1-3 seconds
    time.sleep(1 + 2 * random.random())
    # send the request...
If your IP is already flagged, route the traffic through a proxy:

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(url, proxies=proxies)
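A minimal sketch of rotating through a proxy pool per request (the addresses are placeholders; in practice you would load proxies you actually control or rent):

import random
import requests

PROXY_POOL = [
    'http://10.10.1.10:3128',   # placeholder
    'http://10.10.1.11:3128',   # placeholder
]

proxy = random.choice(PROXY_POOL)
response = requests.get(
    'https://example.com',
    proxies={'http': proxy, 'https': proxy},
    timeout=10,  # a dead proxy should fail fast instead of hanging
)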
(Pitfall guide) The three classic traps of free proxies: they are painfully slow, they die without warning, and some of them sniff or tamper with your traffic.

For sites that issue cookies after login, urllib can manage them through http.cookiejar:
from urllib import request
from urllib.parse import urlencode
from http.cookiejar import CookieJar

# Create a cookie handler
cookie_jar = CookieJar()
handler = request.HTTPCookieProcessor(cookie_jar)
opener = request.build_opener(handler)
# Simulate the login
login_data = {'username': 'xxx', 'password': 'xxx'}
req = request.Request(login_url, data=urlencode(login_data).encode())
opener.open(req)
# Subsequent requests carry the cookie automatically
response = opener.open(target_url)
When a site fingerprints plain HTTP clients or renders content with JavaScript, driving a real browser through Selenium often walks right past the guard:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # headless mode, no visible window
options.add_argument("user-agent=Mozilla/5.0...")
driver = webdriver.Chrome(options=options)
driver.get(url)
page_source = driver.page_source
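A common follow-up trick (a sketch, not part of the flow above): let Selenium handle the login, then hand its cookies to requests so the rest of the crawl runs at HTTP speed:

import requests

session = requests.Session()
# Copy the cookies the browser collected into the requests session
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()  # the browser is no longer needed

response = session.get(url)  # now carries the browser's cookies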
The full combo: random UA, plausible Referer, paid proxy, cookies, and a timeout, all in one request:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {
    'User-Agent': ua.random,
    'Referer': 'https://www.google.com/',
    'Accept-Encoding': 'gzip, deflate, br'
}
proxies = {'https': 'http://premium_proxy:port'}
cookies = {'session_id': 'xxxxxx'}
response = requests.get(
    url,
    headers=headers,
    proxies=proxies,
    cookies=cookies,
    timeout=10
)
When you are not sure which check you are tripping, replay the request with curl and add headers one at a time to pinpoint the field that triggers the block:

curl -v -H "User-Agent: Mozilla/5.0..." -H "Referer: https://google.com" https://target-site.com
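The same idea in Python: start with no headers and add one candidate per attempt until the 403 goes away (a sketch; the candidate values are the ones used throughout this post):

import requests

CANDIDATE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Referer': 'https://www.google.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

headers = {}
for name, value in CANDIDATE_HEADERS.items():
    headers[name] = value
    status = requests.get('https://example.com', headers=headers, timeout=10).status_code
    print(f'with {sorted(headers)}: HTTP {status}')
    if status != 403:
        print(f'adding {name} unblocked the request')
        break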
Finally, wrap your requests in automatic retries with exponential backoff so a single transient 403 does not kill the whole crawl:

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def safe_request(url):
    response = requests.get(url)
    if response.status_code == 403:
        raise Exception("Anti-bot check triggered")
    return response
One last reminder: these techniques are handy, but mind the legal boundaries! Always check a site's terms and compliance requirements before scraping it. Protecting data privacy is everyone's responsibility~