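Building a monitoring and alerting stack for the XScholar literature download system requires Prometheus, AlertManager, and Grafana. The project already runs Prometheus and Grafana, so the task here is to add an AlertManager service on top of them.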
# Existing docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    # ... other settings
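The initial configuration has no AlertManager service, so one has to be added without affecting the services that are already running.
Add the AlertManager service to the existing docker-compose.yml: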
  # New AlertManager service (goes under services:)
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    networks:
      - monitoring

# Add to the top-level volumes section
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:  # new
The original Prometheus configuration has no rule_files or alerting section.
Edit prometheus.yml to add the alerting configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# New: alert rule files
rule_files:
  - "prometheus-alerts.yml"

# New: AlertManager endpoint
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # ... keep the existing configuration unchanged
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'dingtalk-webhook'
  routes:
    # Send critical alerts with minimal delay
    - match:
        severity: critical
      receiver: 'dingtalk-webhook'
      group_wait: 5s
      repeat_interval: 30m

receivers:
  - name: 'dingtalk-webhook'
    webhook_configs:
      - url: 'http://host.docker.internal:8060/dingtalk/webhook'
        send_resolved: true
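The receiver above posts AlertManager's JSON payload to a service on port 8060. The source of that service (dingtalk-webhook.py in the project layout) is not shown, so the following is only a minimal sketch of the payload-to-message step such a bridge might perform. The function name to_dingtalk_text and the message layout are assumptions made for illustration. The payload fields (status, alerts, labels, annotations) follow AlertManager's webhook format, and the msgtype/text shape follows DingTalk's custom robot text message.

```python
import json


def to_dingtalk_text(payload: dict) -> dict:
    """Convert an AlertManager webhook payload into a DingTalk text message.

    Hypothetical helper: the name and the message layout are illustrative only.
    """
    alerts = payload.get("alerts", [])
    # Header line: overall group status and number of alerts in the group.
    lines = [f"[{payload.get('status', 'unknown').upper()}] {len(alerts)} alert(s)"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # One line per alert: name, severity, per-alert status, summary.
        lines.append(
            f"- {labels.get('alertname', '?')} "
            f"({labels.get('severity', 'none')}, {alert.get('status', '?')}): "
            f"{annotations.get('summary', '')}"
        )
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "XScholarServiceDown", "severity": "critical"},
            "annotations": {"summary": "Service Down Alert"},
        }],
    }
    print(json.dumps(to_dingtalk_text(sample), indent=2))
```

Because send_resolved is true, the same function would also receive payloads whose status is "resolved", which is why the status is carried into the message instead of being assumed to be "firing".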
$ docker-compose version
docker: 'compose' is not a docker command.
$ docker-compose
-bash: /usr/local/bin/docker-compose: Permission denied
Use docker compose rather than docker-compose.
# Option 1: fix the file permissions
sudo chmod +x /usr/local/bin/docker-compose
# Option 2: use the newer command (recommended)
docker compose up -d
docker compose down
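For scripts that have to run on hosts with either variant installed, the choice between the two commands can be probed instead of hard-coded. The sketch below is one way to do that, not something taken from the project: pick_compose_command and _runs_ok are hypothetical helper names, and the only external commands used are the two version checks already shown above.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def _runs_ok(cmd: List[str]) -> bool:
    """Return True if the executable is found, is executable, and exits with 0."""
    # shutil.which also checks the executable bit, so a docker-compose file
    # with missing +x permission is treated as unavailable.
    if shutil.which(cmd[0]) is None:
        return False
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_compose_command(
    runs_ok: Callable[[List[str]], bool] = _runs_ok,
) -> Optional[List[str]]:
    """Prefer the `docker compose` plugin, fall back to standalone docker-compose."""
    for base in (["docker", "compose"], ["docker-compose"]):
        if runs_ok(base + ["version"]):
            return base
    return None


if __name__ == "__main__":
    cmd = pick_compose_command()
    print("compose command:", " ".join(cmd) if cmd else "none found")
```

The probe function is injectable so the selection logic can be exercised without Docker installed.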
Use the docker compose command from here on.
level=ERROR msg="loading groups failed" err="yaml: invalid Unicode character"
The alert rules file contained special Unicode characters (emoji), which Prometheus could not parse.
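# Problematic configuration: contains an emoji
- alert: XScholarServiceDown
  annotations:
    summary: " 服务宕机告警"  # "service down alert"; the original string included an emoji and caused the parse error
# Correct configuration: plain ASCII only
- alert: XScholarServiceDown
  annotations:
    summary: "Service Down Alert"  # special characters removed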
Use yamllint or Python to validate the YAML syntax:
# Validate YAML syntax
python3 -c "import yaml; yaml.safe_load(open('prometheus-alerts.yml'))"
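A syntax check reports that the file is invalid but not necessarily where the offending character sits. Since the fix described here was to keep the rules file pure ASCII, a small scanner that lists every non-ASCII character with its position can help. This is a sketch using only the standard library, and find_non_ascii is a hypothetical helper name.

```python
from typing import List, Tuple


def find_non_ascii(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for each non-ASCII character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, ch))
    return hits


if __name__ == "__main__":
    # Inline sample instead of a real file so the demo runs anywhere.
    sample = '- alert: XScholarServiceDown\n  annotations:\n    summary: "\U0001F6A8 down"\n'
    for line_no, col_no, ch in find_non_ascii(sample):
        print(f"line {line_no}, col {col_no}: U+{ord(ch):04X}")
```

To scan the real rules file, pass it the contents of prometheus-alerts.yml read with encoding="utf-8". Note that this flags every non-ASCII character, including Chinese text, not only emoji.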
prometheus/
├── docker-compose.yml        # container orchestration
├── prometheus.yml            # main Prometheus configuration
├── prometheus-alerts.yml     # alert rule definitions
├── alertmanager.yml          # AlertManager configuration
└── dingtalk-webhook.py       # DingTalk integration service
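# Start all services
docker compose up -d
# Check service status
docker compose ps
# Verify that the services are reachable
curl http://localhost:9090/api/v1/status/config  # Prometheus
curl http://localhost:9093/api/v2/status         # AlertManager (the v1 API was removed in recent releases)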
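The same reachability check can be scripted, for example to run after every deployment. The sketch below assumes the port mappings from the compose file and uses the /-/healthy endpoint that both Prometheus and AlertManager expose. ENDPOINTS, is_healthy, and check_all are hypothetical names chosen for this example.

```python
import http.client
import urllib.error
import urllib.request

# Health endpoints, assuming the 9090 and 9093 port mappings shown earlier.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
}

# The targets are local, so bypass any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def check_all(endpoints: dict, timeout: float = 3.0) -> dict:
    """Map each service name to its health status."""
    return {name: is_healthy(url, timeout) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_all(ENDPOINTS).items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

A health endpoint is used here instead of the status APIs because it only answers whether the process is serving requests, which is all a post-deploy check needs.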
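With the base monitoring services deployed, the focus of this stage was establishing stable infrastructure, which lays the groundwork for the application monitoring and business alerting planned for the next phase.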