Sharpen your tools before the work begins! Let's start by creating a dedicated conda environment:
conda create -n spider_env python=3.8
conda activate spider_env
pip install requests beautifulsoup4 pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
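If you want to confirm the packages landed in the new environment rather than base, a quick import check helps (my own habit, not part of the original steps):
python -c "import requests, bs4, pandas; print('environment OK')"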
(Tip) A common beginner pitfall: running pip install before conda activate, which drops the packages into your base environment instead of spider_env. With the environment ready, here is a first pass at the Douban Top 250 page:
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

movies = []
for item in soup.select('.item'):  # each .item div holds one movie entry
    title = item.select_one('.title').text
    rating = item.select_one('.rating_num').text
    movies.append({'电影名称': title, '评分': rating})
print(movies[:5])  # print the first five movies
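One caveat worth adding here: if Douban blocks the request, response.text is an error page and the selectors above quietly match nothing. A minimal guard (my addition, not in the original):
if response.status_code != 200:
    raise RuntimeError(f'Request was blocked, got HTTP {response.status_code}')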
(Pitfall guide) The most common beginner error: sending bare requests and getting rejected. Douban blocks obviously automated traffic, so dress the request up with realistic headers, a Referer, and a logged-in Cookie:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Referer': 'https://www.douban.com/',
    'Cookie': 'your actual cookie'  # grab this from your browser after logging in
}
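If you are about to send many requests, a requests.Session is worth two extra lines: it applies these headers to every call and reuses the underlying connection. A small sketch:

import requests

session = requests.Session()
session.headers.update(headers)  # the headers dict defined above
response = session.get('https://movie.douban.com/top250')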
Pacing matters as much as headers do; randomized delays between pages look far less robotic:

import random
import time

for page in range(0, 250, 25):  # the Top 250 is paginated 25 entries per page
    url = f'https://movie.douban.com/top250?start={page}'
    time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds
    # request-sending code...
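Putting the pieces so far together, a full-crawl sketch could look like the following. The selectors, headers, and delay are all reused from above; raise_for_status and the timeout are my own defensive additions:

import random
import time
import requests
from bs4 import BeautifulSoup

def crawl_top250(headers):
    movies = []
    for page in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={page}'
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # fail fast instead of parsing an error page
        soup = BeautifulSoup(response.text, 'html.parser')
        for item in soup.select('.item'):
            movies.append({
                '电影名称': item.select_one('.title').text,
                '评分': item.select_one('.rating_num').text,
            })
        time.sleep(random.uniform(1, 3))  # stay polite between pages
    return movies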
If the site starts blocking your IP despite all that, route the traffic through a proxy:

proxies = {
    'http': 'http://10.10.1.10:3128',   # placeholder addresses: substitute your own proxies
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(url, headers=headers, proxies=proxies)
(Important) The free-proxy trap: most free proxy lists are dead, painfully slow, or outright shady, so verify a proxy before routing real traffic through it.
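A minimal liveness check, assuming you just want to know whether the proxy answers at all (the test URL and timeout are my own choices):

import requests

def proxy_alive(proxy_url, test_url='https://httpbin.org/ip', timeout=5):
    # True if the proxy completes a simple request within the timeout
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        requests.get(test_url, proxies=proxies, timeout=timeout)
        return True
    except requests.RequestException:
        return False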
With the data collected, the simplest way to store it is a CSV via pandas:

import pandas as pd

df = pd.DataFrame(movies)
df.to_csv('douban_top250.csv', index=False, encoding='utf_8_sig')  # BOM-prefixed UTF-8 so Excel renders the Chinese columns correctly
For anything you want to query later, MySQL is the sturdier choice:

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='123456', db='spider')
cursor = conn.cursor()
sql = '''CREATE TABLE IF NOT EXISTS movies(
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(100),
    rating FLOAT)'''
cursor.execute(sql)
for movie in movies:
    cursor.execute("INSERT INTO movies(title, rating) VALUES (%s, %s)",
                   (movie['电影名称'], movie['评分']))
conn.commit()  # without the commit, nothing is actually written
conn.close()
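With only 250 rows the loop is fine, but executemany sends the whole batch in one call. A sketch of a drop-in replacement for the for-loop above (my addition):

cursor.executemany(
    "INSERT INTO movies(title, rating) VALUES (%s, %s)",
    [(m['电影名称'], m['评分']) for m in movies],
)
conn.commit()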
When fetching one page at a time feels slow, aiohttp plus asyncio lets requests overlap:

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        # parsing code...

asyncio.run(main())
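To fetch all ten pages concurrently without hammering the site, gather the tasks behind a semaphore. A sketch; the concurrency cap of 3 is my own choice:

import aiohttp
import asyncio

async def fetch_all(urls, headers, limit=3):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight
    async with aiohttp.ClientSession(headers=headers) as session:
        async def one(url):
            async with sem:
                async with session.get(url) as resp:
                    return await resp.text()
        return await asyncio.gather(*(one(u) for u in urls))

urls = [f'https://movie.douban.com/top250?start={p}' for p in range(0, 250, 25)]
# pages = asyncio.run(fetch_all(urls, headers))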
For pages rendered by JavaScript, requests never sees the final HTML; Selenium drives a real browser instead:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # headless mode, no browser window pops up
driver = webdriver.Chrome(options=options)
driver.get(url)
page_source = driver.page_source  # the fully rendered HTML
driver.quit()  # always release the browser when done
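Rendered content arrives asynchronously, so wait for the element you need rather than sleeping blindly. A sketch reusing the .item selector from earlier:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)  # poll for up to 10 seconds
first_item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.item')))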
(Lessons paid in blood and tears) Real-world cautionary cases:
(Recommended learning resources)
One last thought for the road: scrapers are great, but don't overindulge! Technology is a double-edged sword; used well it can move mountains, used badly... (you know how that ends).