随着微服务架构的普及,应用监控已成为生产环境的必备能力。本文深入探讨如何通过Spring Boot Actuator提供深度应用监控,配合Prometheus和Grafana构建完整的企业级监控解决方案。
组件 | 角色 | 关键能力 |
---|---|---|
Actuator | 指标暴露器 | 提供健康检查、指标端点 |
Prometheus | 指标收集与存储 | 多维数据模型、PromQL查询语言 |
Grafana | 数据可视化 | 灵活仪表盘、警报管理 |
Alertmanager | 告警处理 | 分组、静默、路由通知 |
# application.yml
management:
endpoints:
web:
exposure:
include: "*" # 暴露所有端点
endpoint:
health:
show-details: always
probes:
enabled: true # 启用K8s就绪和存活探针
metrics:
enabled: true
server:
port: 9001 # 监控专用端口
@Component
public class CustomHealthIndicator implements HealthIndicator {
private final ThirdPartyService service;
@Override
public Health health() {
if (service.isAvailable()) {
return Health.up()
.withDetail("version", "3.2.1")
.build();
}
return Health.down()
.withDetail("error", service.getLastError())
.build();
}
}
// 订单业务指标监控
@Service
public class OrderService {
private final Counter orderCounter;
private final DistributionSummary orderAmountSummary;
public OrderService(MeterRegistry registry) {
orderCounter = Counter.builder("orders.count")
.description("Total processed orders")
.tag("type", "online") // 按类型打标签
.register(registry);
orderAmountSummary = DistributionSummary.builder("orders.amount")
.baseUnit("CNY")
.description("Order amount distribution")
.publishPercentiles(0.5, 0.95, 0.99) // 百分位数
.register(registry);
}
@Transactional
public void processOrder(Order order) {
// 业务逻辑...
orderCounter.increment();
orderAmountSummary.record(order.getAmount());
}
}
# prometheus.yml
scrape_configs:
- job_name: 'spring-boot-apps'
metrics_path: '/actuator/prometheus'
scrape_interval: 15s
static_configs:
- targets: ['app1:9001', 'app2:9001']
relabel_configs:
- source_labels: [__address__]
target_label: __metrics_path__
-- 系统CPU使用率
sum(rate(system_cpu_usage{application="order-service"}[1m])) by (instance)
-- JVM内存使用
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}
-- 应用线程状态
sum(threads_states_threads{state="blocked"}) by (instance)
-- 订单处理成功率
rate(orders_count_total{status="success"}[5m]) /
rate(orders_count_total[5m])
# Prometheus启动参数优化
prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.retention.time=30d \
--storage.tsdb.path=/data/prometheus \
--web.console.templates=/etc/prometheus/consoles \
--query.max-concurrency=16 \
--query.timeout=2m
// 查询表达式
sum(jvm_memory_used_bytes{area="heap"}) by (instance) / 1024^2
sum(jvm_memory_max_bytes{area="heap"}) by (instance) / 1024^2
// 请求速率
sum(rate(http_server_requests_seconds_count[1m])) by (uri, method, status)
// 99分位延迟
histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket[1m])) by (le, uri))
# 使用Terraform管理Grafana
resource "grafana_dashboard" "spring_boot_monitoring" {
config_json = file("${path.module}/dashboards/spring-boot.json")
}
# 配置仪表板定时快照
resource "grafana_snapshot" "daily_report" {
dashboard_uid = grafana_dashboard.spring_boot_monitoring.uid
name = "Daily Report"
expires = "24h"
}
groups:
- name: spring-boot-alerts
rules:
- alert: HighHeapUsage
expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "高堆内存使用: {{ $value }}%"
description: "实例 {{ $labels.instance }} 堆内存使用超过90%"
- alert: HighErrorRate
expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) /
rate(http_server_requests_seconds_count[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "高错误率: {{ $value }}"
description: "{{ $labels.uri }} 5xx错误率超过5%"
route:
group_by: [alertname, cluster]
group_wait: 30s
group_interval: 5m
repeat_interval: 3h
receiver: 'dingtalk'
receivers:
- name: 'dingtalk'
dingtalk_configs:
- url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
message: '{{ template "dingtalk.default" . }}'
at_mobiles: ['13800000000']
templates:
- '/etc/alertmanager/dingtalk.tmpl'
# 配置基于时间的告警降噪
routes:
- match:
severity: warning
mute_time_intervals:
- off_hours
mute_time_intervals:
- name: off_hours
time_intervals:
- weekdays: ['saturday', 'sunday']
- times:
- start_time: '18:00'
end_time: '09:00'
访问控制三重屏障:
# Prometheus认证配置
basic_auth_users:
prometheus: "$bcrypt$"
# Actuator端点保护
spring:
security:
user:
name: actuator
password: "{bcrypt}$2a$10$..."
网络隔离策略:
[Kubernetes网络策略]
| 组件 | 入站允许 | 出站允许 |
|---------------|---------------------|-----------------|
| Prometheus | 仅Grafana | 所有监控目标 |
| Grafana | 仅运维网络 | Prometheus/LDAP |
| Spring Boot | 应用流量+Prometheus | 无出站 |
# Prometheus HA部署
global:
external_labels:
replica: '1' # 副本标识
remote_write:
- url: "https://thanos-receive.example.com/api/v1/receive"
queue_config:
max_samples_per_send: 5000
capacity: 10000
# Prometheus Operator ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: spring-boot-monitor
spec:
selector:
matchLabels:
app: order-service
endpoints:
- port: actuator
interval: 30s
path: /actuator/prometheus
io.opentelemetry
opentelemetry-api
io.opentelemetry.instrumentation
opentelemetry-spring-boot-autoconfigure
// 自定义Trace配置
@Bean
OpenTelemetry openTelemetry() {
return OpenTelemetrySdk.builder()
.setTracerProvider(tracerProvider())
.setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
.buildAndRegisterGlobal();
}
/spring-boot-monitoring-demo
├── order-service # Spring Boot应用
│ ├── src/main/java
│ │ └── com/example/order
│ │ ├── actuator # 自定义监控端点
│ │ └── config # Actuator配置
├── monitoring-setup
│ ├── prometheus # 配置和告警规则
│ ├── grafana # 仪表板JSON
│ ├── alertmanager # 通知配置
│ └── docker-compose.yml # 一键启动
docker-compose up -d
访问入口:
价值维度 | 关键指标 | 实施前 | 实施后 | 提升幅度 |
---|---|---|---|---|
故障恢复 | MTTR(平均恢复时间) | 120分钟 | 18分钟 | 85%↓ |
性能优化 | 系统吞吐量 | 1200 TPS | 3500 TPS | 191%↑ |
资源利用 | CPU使用率峰值 | 95% | 65% | 32%↓ |
业务洞察 | 决策响应速度 | 3天 | 实时 | 100%↑ |
成本控制 | 基础设施成本 | ¥120,000/月 | ¥78,000/月 | 35%↓ |
最终建议:
据2024年DevOps报告统计,采用完善监控体系的企业平均故障恢复时间(MTTR)缩短了78%,系统可用性达到99.99%。遵循本文指南,帮助筒子们在3周内建立专业的Spring Boot应用监控系统。