In this era where data is king (pay attention, class!!!), scraping skills are like owning Aladdin's lamp! For example: market research without looking things up by hand, paper-writing without hunting everywhere for data, even automatic update alerts for the shows you follow... But take note! We're going to be "civilized miners": collect only public data, and never touch anything sensitive!
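Speaking of staying civilized: before crawling any site, it's worth checking its robots.txt. Here's a minimal sketch using Python's standard-library urllib.robotparser (the URLs are just for illustration):

```python
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt rules
rp = RobotFileParser()
rp.set_url('https://book.douban.com/robots.txt')
rp.read()

url = 'https://book.douban.com/top250'
# '*' means "rules that apply to generic crawlers"
print('Allowed?', rp.can_fetch('*', url))
```

If can_fetch returns False, pick a different data source rather than fighting the site.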
# Install the core libraries (the Tsinghua mirror makes it fly)

```bash
pip install requests beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
# Recommended development tools

- VS Code (just add the Python and Pylance extensions)
- Jupyter Notebook (great for beginners: write a bit, see the result right away)
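With everything installed, a quick sanity check confirms both libraries import cleanly (a minimal sketch; your version numbers will differ):

```python
import requests
import bs4

# If these print without errors, the environment is ready
print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)
```

Now for the main event: scraping the Douban Books Top 250.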
```python
import requests
from bs4 import BeautifulSoup
import csv
import time

def douban_spider():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    # Prepare the CSV file (utf-8-sig keeps Chinese readable in Excel)
    with open('douban_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['Rank', 'Title', 'Rating', 'Quote', 'Link'])
        # Pagination logic (10 pages in total, 25 books per page)
        for page in range(0, 250, 25):
            url = f'https://book.douban.com/top250?start={page}'
            response = requests.get(url, headers=headers, timeout=10)
            # Parse the HTML (the key step!)
            # NOTE: the selectors below match the page markup at the time
            # of writing; verify them in your browser's dev tools.
            soup = BeautifulSoup(response.text, 'html.parser')
            items = soup.find_all('tr', class_='item')
            # The page shows no explicit rank number, so derive it from position
            for rank, item in enumerate(items, start=page + 1):
                title_link = item.find('div', class_='pl2').a
                title = title_link.get_text(strip=True)
                rating = item.find('span', class_='rating_nums').text
                inq = item.find('span', class_='inq')
                quote = inq.text if inq else ''
                link = title_link['href']
                writer.writerow([rank, title, rating, quote, link])
            # A gentlemanly pause between pages (important!)
            time.sleep(3)

if __name__ == '__main__':
    douban_spider()
```
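After a run, you can sanity-check the output without opening Excel. A quick sketch, assuming you also have pandas installed:

```python
import pandas as pd

# Load the CSV the spider just wrote and peek at the first rows
df = pd.read_csv('douban_top250.csv')
print(df.head())
print(f'{len(df)} books collected')
```

You should see 250 rows if all ten pages came back cleanly.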
```python
# Random User-Agent generator (requires: pip install fake-useragent)
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a fresh browser signature on every call
```
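To actually rotate identities, pull a new User-Agent for each request. A minimal sketch (the loop and URL mirror the spider above):

```python
from fake_useragent import UserAgent
import requests
import time

ua = UserAgent()
for page in range(0, 250, 25):
    url = f'https://book.douban.com/top250?start={page}'
    # A different User-Agent per request makes the traffic look less uniform
    resp = requests.get(url, headers={'User-Agent': ua.random}, timeout=10)
    print(page, resp.status_code)
    time.sleep(3)
```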
```python
# Using proxy IPs (example format only; substitute real, working proxies)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(url, proxies=proxies)
```
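Proxies fail all the time, so wrap the request in a retry loop. A rough sketch; the proxy pool here is a hypothetical placeholder:

```python
import requests

# Hypothetical proxy pool -- fill in proxies you actually control
proxy_pool = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
]

def fetch_with_retry(url, headers, retries=3):
    """Try the request up to `retries` times, cycling through proxies."""
    for attempt in range(retries):
        proxy = proxy_pool[attempt % len(proxy_pool)]
        try:
            resp = requests.get(url, headers=headers, proxies=proxy, timeout=5)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            print(f'Attempt {attempt + 1} failed: {e}')
    return None  # every attempt failed; the caller decides what to do
```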
```python
# Simulating real-user behavior (good for sites that load content dynamically)
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # smart waiting: up to 10 s for elements to appear
driver.get('https://www.douban.com')
```
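Implicit waits are a blunt instrument; when you care about one specific element, an explicit wait is more precise. A minimal sketch (the CSS selector is just an example, not a verified Douban element):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.douban.com')
# Block for at most 10 s until the element appears, then continue
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#content'))
)
print(element.text[:100])
driver.quit()
```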
One last heart-to-heart: a scraper is just a tool; what matters is how you use it. Just last year I used one to help my advisor collect research data, and my paper-writing efficiency shot up a good 70%! But never use it for anything shady. Technology is a double-edged sword, so wield it with care.