The previous two posts walked through building the crawler system and the proxy IP pool it relies on. In this post I will add a monitoring program for both the crawler and the proxy IP pool: its job is to detect anomalies and notify the administrator by email right away so faults can be investigated. With that in place the whole pipeline is complete, and the same approach can be adapted to almost any project.
On a Linux system, the crawler's health can be monitored and email alerts sent on failure with the following approach:
# Check whether the spider process exists
pgrep -f "spider.py" > /dev/null
if [ $? -eq 0 ]; then
    echo "Spider process is running"
else
    echo "Spider process has stopped - alert needed"
fi
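If you prefer to run the same check from Python (for example inside a small watchdog script), here is a minimal sketch using the third-party psutil package. Note this is an assumption on my part: the shell check above needs no extra dependency, while this version requires pip install psutil.
# watchdog.py (illustrative)
import psutil

def spider_running(marker: str = "spider.py") -> bool:
    """Return True if any process command line contains the marker."""
    for proc in psutil.process_iter(attrs=["cmdline"]):
        try:
            cmdline = proc.info["cmdline"] or []
            if any(marker in part for part in cmdline):
                return True
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return False

if __name__ == "__main__":
    print("spider running" if spider_running() else "spider stopped - alert needed")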
Add heartbeat recording to the crawler code:
# spider.py
import time
import json

def record_heartbeat():
    heartbeat = {
        "timestamp": int(time.time()),  # integer so the shell monitor can do arithmetic on it
        "status": "running",
        "items_scraped": 1000           # replace with the real scrape count
    }
    with open("/var/log/spider_heartbeat.json", "w") as f:
        json.dump(heartbeat, f)

# Call this periodically from the spider's main loop
while True:
    # ...spider work...
    record_heartbeat()
    time.sleep(60)  # record once per minute
# Check whether recent log lines contain a success marker
tail -n 100 /var/log/spider.log | grep -E "Scraped [0-9]+ items" > /dev/null
if [ $? -ne 0 ]; then
    echo "No recent scrape records in the log - alert needed"
fi
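This check assumes the spider writes lines like "Scraped N items" to /var/log/spider.log. A minimal sketch of spider-side logging that would produce such lines (the file path and message format here are assumptions; adjust them to your own spider):
# spider.py (logging setup, illustrative)
import logging

logging.basicConfig(
    filename="/var/log/spider.log",   # the file the monitor tails
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_batch(items):
    # Emits e.g. "Scraped 25 items", which the grep pattern above matches
    logging.info("Scraped %d items", len(items))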
# Check the timestamp of the most recent row in the database
last_data_time=$(mysql -uuser -ppass -NB -e "SELECT MAX(created_at) FROM items" spider_db)
current_time=$(date +%s)
if [ $((current_time - $(date -d "$last_data_time" +%s))) -gt 3600 ]; then
    echo "No new data for over an hour - alert needed"
fi
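The freshness check assumes an items table with a created_at column that the spider fills in as it stores results. A minimal write-side sketch using PyMySQL; the library choice, the title/url columns, and the credentials are assumptions for illustration:
# save_items.py (illustrative) - requires: pip install pymysql
import pymysql

def save_items(items):
    conn = pymysql.connect(host="localhost", user="user",
                           password="pass", database="spider_db")
    try:
        with conn.cursor() as cur:
            for item in items:
                # created_at = NOW() is the value the freshness check queries
                cur.execute(
                    "INSERT INTO items (title, url, created_at) VALUES (%s, %s, NOW())",
                    (item["title"], item["url"]),
                )
        conn.commit()
    finally:
        conn.close()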
# Install mailutils so the mail command is available
sudo apt install mailutils -y
send_alert.sh:
#!/bin/bash
# send_alert.sh - send an alert email; $1 = alert type, $2 = details
ALERT_SUBJECT="Spider system alert"
ALERT_CONTENT=$(cat <<EOF
Server: $(hostname)
Time: $(date)
Alert type: $1
Details:
$2
EOF
)
echo "$ALERT_CONTENT" | mail -s "$ALERT_SUBJECT" [email protected]
spider_monitor.sh:
#!/bin/bash
# spider_monitor.sh - run all health checks and alert on the first failure (requires jq)
# 1. Process check
if ! pgrep -f "spider.py" > /dev/null; then
    ./send_alert.sh "Process stopped" "Spider process is not running!"
    exit 1
fi
# 2. Heartbeat check
HEARTBEAT_FILE="/var/log/spider_heartbeat.json"
if [ ! -f "$HEARTBEAT_FILE" ]; then
    ./send_alert.sh "Heartbeat missing" "Heartbeat file does not exist!"
    exit 1
fi
timestamp=$(jq -r '.timestamp' "$HEARTBEAT_FILE" 2>/dev/null)
timestamp=${timestamp%.*}   # drop any fractional part so shell arithmetic works
current_time=$(date +%s)
if [ -z "$timestamp" ] || [ "$timestamp" = "null" ] || [ $((current_time - timestamp)) -gt 1200 ]; then
    ./send_alert.sh "Heartbeat timeout" "Heartbeat not updated for over 20 minutes! Last update: $(date -d @$timestamp)"
    exit 1
fi
# 3. Log check
if ! tail -n 100 /var/log/spider.log | grep -E "Scraped [0-9]+ items" > /dev/null; then
    ./send_alert.sh "Log anomaly" "No scrape records in the last 100 log lines!"
    exit 1
fi
# 4. Data-output check
last_data_time=$(mysql -uuser -ppass -e "SELECT MAX(created_at) FROM items" -NB spider_db 2>/dev/null)
if [ -z "$last_data_time" ]; then
    ./send_alert.sh "Database error" "Could not read the latest data timestamp!"
    exit 1
fi
db_time=$(date -d "$last_data_time" +%s)
if [ $((current_time - db_time)) -gt 7200 ]; then
    ./send_alert.sh "Stale data" "No new data for over 2 hours! Last data time: $last_data_time"
    exit 1
fi
echo "$(date) - spider is healthy"
# Make the scripts executable, then run the monitor from cron every 5 minutes
chmod +x spider_monitor.sh send_alert.sh
crontab -e
*/5 * * * * /path/to/spider_monitor.sh >> /var/log/spider_monitor.log 2>&1
# metrics.py
from prometheus_client import start_http_server, Counter, Gauge

# Metric definitions
PAGES_SCRAPED = Counter('spider_pages_scraped', 'Total pages scraped')
ITEMS_SCRAPED = Counter('spider_items_scraped', 'Total items scraped')
LAST_SUCCESS = Gauge('spider_last_success', 'Timestamp of the last successful scrape')

# Update the metrics from the spider
def record_success(items):
    ITEMS_SCRAPED.inc(len(items))
    LAST_SUCCESS.set_to_current_time()

# Start the metrics HTTP server on port 8888
start_http_server(8888)
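Putting it together, the spider exposes the metrics server at startup and updates the counters inside its main loop. A sketch based on the metrics.py definitions above; fetch_page and parse_items are stand-ins for your own spider code:
# spider.py (metrics integration, illustrative)
import time

# Importing metrics.py also starts the /metrics HTTP server on port 8888
from metrics import PAGES_SCRAPED, record_success

def fetch_page():
    # placeholder for your real download logic
    return "<html>...</html>"

def parse_items(page):
    # placeholder for your real parsing logic
    return [{"title": "example"}]

while True:
    page = fetch_page()
    items = parse_items(page)
    PAGES_SCRAPED.inc()      # spider_pages_scraped
    record_success(items)    # spider_items_scraped + spider_last_success
    time.sleep(5)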
# prometheus.yml
scrape_configs:
  - job_name: 'spider'
    static_configs:
      - targets: ['spider-server:8888']
# alertmanager.yml
route:
  receiver: 'email-alerts'
receivers:
  - name: 'email-alerts'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: 'user'
        auth_password: 'pass'
        send_resolved: true
# spider_alerts.yml
groups:
  - name: spider
    rules:
      - alert: SpiderDown
        expr: up{job="spider"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Spider service is down"
          description: "Spider instance {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: NoRecentData
        expr: time() - spider_last_success > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Spider stopped producing data"
          description: "Spider instance {{ $labels.instance }} has produced no new data for over an hour"
# Add restart logic to the monitoring script
if ! pgrep -f "spider.py" > /dev/null; then
    ./send_alert.sh "Process stopped" "Attempting to restart the spider..."
    cd /path/to/spider
    source .venv/bin/activate
    nohup python spider.py >> /var/log/spider.log 2>&1 &
    sleep 10
    if pgrep -f "spider.py" > /dev/null; then
        ./send_alert.sh "Restart succeeded" "Spider process has recovered"
    else
        ./send_alert.sh "Restart failed" "Spider restart failed - manual intervention required!"
    fi
fi
1. Multi-dimensional monitoring: combine the process, heartbeat, log, and data-freshness checks above instead of relying on any single signal.
2. Tiered alerting: assign severities (critical vs. warning, as in the alert rules) so urgent failures stand out from routine warnings.
3. Alert deduplication: suppress repeated notifications for the same ongoing incident, for example:
# Avoid duplicate alerts
if [ -f "/tmp/last_alert" ] && [ $(( $(date +%s) - $(stat -c %Y /tmp/last_alert) )) -lt 3600 ]; then
    echo "An alert was already sent within the last hour, skipping"
    exit 0
fi
touch /tmp/last_alert
4. Visualization dashboard: put Grafana on top of the Prometheus metrics to see scraping trends at a glance.
With the scheme above you get end-to-end monitoring of the spider system and timely notification of the administrator whenever something goes wrong, keeping the crawler service running reliably. For production, the Prometheus + Grafana + Alertmanager stack is recommended: it provides far more powerful monitoring and visualization than the shell scripts alone.