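In a microservices architecture, components and services are deployed separately and depend on one another more heavily, so manual inspection alone cannot catch failures in time. Health checks are the cornerstone of monitoring and self-healing.
Section highlights: Why do microservices need health checks? What are the main uses of health checks in a distributed system?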
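In .NET, health checks are built on the Microsoft.Extensions.Diagnostics.HealthChecks package, and ASP.NET Core supplies a full set of health check interfaces and middleware on top of it. The following sections show, in order, how to add the dependencies, register the individual checks, integrate HealthChecks UI, and customize the response format.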
To use the SQL Server, Redis, and external HTTP checks, first install the corresponding NuGet packages. For example, run the following in the project root:
# Core health check package
dotnet add package Microsoft.Extensions.Diagnostics.HealthChecks
# SQL Server health check extension
dotnet add package AspNetCore.HealthChecks.SqlServer
# Redis health check extension
dotnet add package AspNetCore.HealthChecks.Redis
# HTTP/URL health check extension (provides AddUrlGroup)
dotnet add package AspNetCore.HealthChecks.Uris
# HealthChecks UI (dashboard)
dotnet add package AspNetCore.HealthChecks.UI
dotnet add package AspNetCore.HealthChecks.UI.InMemory.Storage # demo or development environments
# For persistent storage, swap in the SqlServer/PostgreSQL storage package
Tip: in production, switch HealthChecks UI to persistent storage (such as SQL Server or PostgreSQL); otherwise the history is lost on every restart.
In Program.cs (or Startup.cs), first grab a local reference to IConfiguration, then register each check through AddHealthChecks(). Sample code:
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
var configuration = builder.Configuration;
var services = builder.Services;

// 1. Register the basic health checks
services.AddHealthChecks()
    // Self check: reports Healthy as long as the process is up and serving requests
    .AddCheck("Self", () => HealthCheckResult.Healthy("I'm alive"))
    // SQL Server check: 3 s timeout, Unhealthy on failure, tagged for filtering
    .AddSqlServer(
        configuration["ConnectionStrings:Default"],
        name: "SqlServer",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "db", "sql" },
        timeout: TimeSpan.FromSeconds(3)
    )
    // Redis check: 2 s timeout, Degraded on failure
    .AddRedis(
        configuration["ConnectionStrings:Redis"],
        name: "Redis",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "cache", "redis" },
        timeout: TimeSpan.FromSeconds(2)
    )
    // External HTTP/URL check: 1 s timeout, Unhealthy on failure
    .AddUrlGroup(
        new Uri(configuration["ExternalServices:PingUrl"]),
        name: "ExternalAPI",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "http", "external" },
        timeout: TimeSpan.FromSeconds(1)
    );
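Key points:
- AddCheck("Self", …): the self check, which reports Healthy whenever the application is running;
- tags: label each check so that the UI or a gateway can filter by tag later;
- timeout: a check that runs longer than this is treated as failed.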
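When no ready-made extension covers a dependency, a custom check can be written against the IHealthCheck interface. The class below is a minimal sketch, not code from the original project: ProbeHealthCheck and its probe delegate are hypothetical names. It shows how a check can honour the failureStatus chosen at registration, and it lets cancellation propagate so that the caller decides how a timed-out check is reported.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check: wraps any async probe that returns true on success.
public sealed class ProbeHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task<bool>> _probe;

    public ProbeHealthCheck(Func<CancellationToken, Task<bool>> probe) =>
        _probe = probe ?? throw new ArgumentNullException(nameof(probe));

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            var ok = await _probe(cancellationToken);
            return ok
                ? HealthCheckResult.Healthy("Probe succeeded")
                // Use the status configured at registration (failureStatus)
                : new HealthCheckResult(context.Registration.FailureStatus, "Probe reported failure");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Cancellation is deliberately not caught here
            return new HealthCheckResult(context.Registration.FailureStatus, "Probe threw", ex);
        }
    }
}
```

Such a class could then be registered with AddCheck("Probe", new ProbeHealthCheck(…), failureStatus: HealthStatus.Degraded, tags: new[] { "custom" }, timeout: TimeSpan.FromSeconds(2)), where the check name and tag are placeholders.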
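HealthChecks UI provides a dashboard that lets the operations team review the status history of each endpoint. Note that the UI expects each monitored endpoint to emit the JSON produced by UIResponseWriter.WriteHealthCheckUIResponse (from the AspNetCore.HealthChecks.UI.Client package), so an endpoint with a fully custom payload may need a separate UI-facing route. Example registration: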
// 2.3.1 Register the UI services
services.AddHealthChecksUI(setup =>
{
    setup.SetEvaluationTimeInSeconds(60);        // re-evaluate every 60 s
    setup.MaximumHistoryEntriesPerEndpoint(50);  // keep 50 history entries per endpoint
    // Monitor only the /health-status endpoint
    setup.AddHealthCheckEndpoint("MicroservicesHealth", "/health-status");
})
// Use in-memory storage for development or demo environments
.AddInMemoryStorage();
// To use SQL Server storage in production, replace the line above with:
// .AddSqlServerStorage(configuration["ConnectionStrings:HealthChecksUI"]);
Security tip:
- Put an authorization policy (for example "AdminOnly") on the UI; otherwise anyone can view or tamper with the data.
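By default, MapHealthChecks returns only a plain-text status ("Healthy", "Degraded", or "Unhealthy"), with HTTP 200 for Healthy and Degraded and HTTP 503 for Unhealthy. Usually we want richer JSON output, an HTTP status code that reflects the overall health, and log alerts for failures. For example: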
// Requires at the top of Program.cs:
//   using System.Text.Json;
//   using Microsoft.AspNetCore.Diagnostics.HealthChecks;
var app = builder.Build();
var logger = app.Services.GetRequiredService<ILogger<Program>>();

// Created once and reused; building JsonSerializerOptions per request is wasteful
var jsonOptions = new JsonSerializerOptions
{
    WriteIndented = true,
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};

// Map the health checks to /health-status
app.MapHealthChecks("/health-status", new HealthCheckOptions
{
    // Return 503 when the overall status is Degraded or Unhealthy
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    },
    ResponseWriter = async (context, report) =>
    {
        // Bail out early if the request was cancelled
        context.RequestAborted.ThrowIfCancellationRequested();

        // Walk every check result and log a warning for anything that is not Healthy
        foreach (var entry in report.Entries)
        {
            if (entry.Value.Status != HealthStatus.Healthy)
            {
                logger.LogWarning(
                    "Health Check '{Name}' status: {Status}. Error: {Error}",
                    entry.Key,
                    entry.Value.Status,
                    entry.Value.Exception?.Message
                );
            }
        }

        // Custom JSON payload
        var response = new
        {
            status = report.Status.ToString(),
            totalDuration = report.TotalDuration.TotalMilliseconds + " ms",
            results = report.Entries.Select(e => new
            {
                name = e.Key,
                status = e.Value.Status.ToString(),
                duration = e.Value.Duration.TotalMilliseconds + " ms",
                // Do not expose the exception text to callers
                error = e.Value.Exception != null
                    ? "Error occurred, see logs for details"
                    : null
            })
        };

        context.Response.ContentType = "application/json; charset=utf-8";
        await context.Response.WriteAsync(JsonSerializer.Serialize(response, jsonOptions));
    }
})
.RequireAuthorization("HealthCheckPolicy"); // only specific roles may call this endpoint
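Key points:
- Degraded and Unhealthy are both mapped to HTTP 503;
- ResponseWriter iterates over every check and logs a warning for any that is not Healthy;
- Exception.Message is never returned to the client, which avoids leaking internal details;
- JsonSerializerOptions is used to pretty-print the JSON.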
The following example uses an API gateway to show how to aggregate the /health-status endpoints of several microservices and expose a local "aggregate health" endpoint.
// Continuing in Program.cs (requires: using System.Text.Json;)
// All service registrations must happen before builder.Build()
builder.Services.AddAuthorization(options =>
{
    options.AddPolicy("GatewayHealthPolicy", policy =>
        policy.RequireRole("GatewayAdmin"));
});

// Register a named HttpClient for calling the downstream services
builder.Services.AddHttpClient("HealthClient")
    .ConfigureHttpClient(client =>
    {
        client.Timeout = TimeSpan.FromSeconds(2); // 2 s timeout per request
    });

var app = builder.Build();
app.UseAuthentication();
app.UseAuthorization();

app.MapGet("/aggregate-health", async (IHttpClientFactory httpFactory) =>
{
    var urls = new[]
    {
        "http://serviceA/health-status",
        "http://serviceB/health-status",
        "http://serviceC/health-status"
    };
    var client = httpFactory.CreateClient("HealthClient");

    var tasks = urls.Select(async url =>
    {
        try
        {
            using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
            // GetStringAsync throws on non-2xx responses, so a 503 lands in the catch block
            var body = await client.GetStringAsync(url, cts.Token);
            // Parse the "status" field instead of matching raw text,
            // because indented JSON contains whitespace after the colon
            using var doc = JsonDocument.Parse(body);
            var status = doc.RootElement.GetProperty("status").GetString();
            var isUnhealthy = string.Equals(status, "Unhealthy", StringComparison.OrdinalIgnoreCase);
            return new { Url = url, IsUnhealthy = isUnhealthy };
        }
        catch
        {
            // A failed call (timeout, 503, malformed body) counts as Unhealthy
            return new { Url = url, IsUnhealthy = true };
        }
    });
    var results = await Task.WhenAll(tasks);

    // If any service is unavailable, the aggregate status is Unhealthy
    var aggregateStatus = results.Any(r => r.IsUnhealthy) ? "Unhealthy" : "Healthy";
    var response = new
    {
        aggregateStatus,
        details = results.Select(r => new
        {
            service = r.Url,
            status = r.IsUnhealthy ? "Unhealthy" : "Healthy"
        })
    };
    return Results.Json(response);
})
.RequireAuthorization("GatewayHealthPolicy");
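Notes:
- IHttpClientFactory creates an HttpClient with a timeout configured;
- the downstream /health-status endpoints are called concurrently, and a service is treated as unavailable if its call fails or its response reports "Unhealthy";
- finally the results are aggregated into one overall status plus the health of each service.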
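In a Kubernetes environment we usually configure two probes, livenessProbe and readinessProbe.
Because /health-status includes the database, Redis, and external API checks, it should not be used directly as the livenessProbe; otherwise a brief dependency outage would trigger repeated container restarts. A recommended setup: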
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-microservice
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-microservice
  template:
    metadata:
      labels:
        app: my-microservice
    spec:
      containers:
        - name: web
          image: myregistry/my-microservice:latest
          ports:
            - containerPort: 80
          # livenessProbe: hits only the application's self-check ping endpoint
          livenessProbe:
            httpGet:
              path: /health/ping
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
          # readinessProbe: hits the full /health-status endpoint
          readinessProbe:
            httpGet:
              path: /health-status
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 2
          # Connection strings can be supplied flexibly via environment variables or a mounted ConfigMap
          env:
            - name: ConnectionStrings__Default
              valueFrom:
                secretKeyRef:
                  name: my-secrets
                  key: SqlConnectionString
            - name: ConnectionStrings__Redis
              valueFrom:
                secretKeyRef:
                  name: my-secrets
                  key: RedisConnectionString
            - name: ExternalServices__PingUrl
              value: "https://api.external.com/ping"
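Key points:
- The /health/ping endpoint runs only the self check (see .AddCheck("Self", …) in 2.2), so it passes as long as the application is running;
- The /health-status endpoint checks the database, Redis, and the external API together, and returns Healthy only when every dependency is available;
- Choose initialDelaySeconds, periodSeconds, timeoutSeconds, and failureThreshold sensibly to avoid frequent false alarms;
- If a probe endpoint is exposed on the public internet, add an IP allowlist or authentication at the Ingress or Service layer. Keep in mind that kubelet probes send no credentials unless httpHeaders are configured on the probe, so a path protected by a role-based policy will fail the probe as written.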
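The /health/ping path used by the livenessProbe has to be mapped separately, restricted to the self check. One way to do that, sketched here under the assumption that the check is registered under the name "Self" as in the earlier example, is the Predicate property of HealthCheckOptions. LivenessEndpoint, IsSelfCheck, and MapLivenessPing are hypothetical names introduced for illustration.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; only the "Self" name and the /health/ping path come from the text.
public static class LivenessEndpoint
{
    // True only for the lightweight self check, so dependency checks are skipped.
    public static bool IsSelfCheck(HealthCheckRegistration registration) =>
        string.Equals(registration.Name, "Self", StringComparison.Ordinal);

    // Maps /health/ping so that it runs the self check and nothing else.
    public static IEndpointConventionBuilder MapLivenessPing(this IEndpointRouteBuilder endpoints) =>
        endpoints.MapHealthChecks("/health/ping", new HealthCheckOptions
        {
            Predicate = IsSelfCheck
        });
}
```

In Program.cs this would be called as app.MapLivenessPing() next to the existing /health-status mapping. Filtering by tag (registration.Tags.Contains(…)) is an equally valid choice if the self check is tagged at registration.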
Do not leak exception details
Write Exception.Message to the logs only and never return it to the client, so that sensitive information is not exposed.
Authentication and authorization
Both the /health-status and /hc-ui endpoints should carry an authorization policy. For example, in Program.cs:
builder.Services.AddAuthorization(options =>
{
    options.AddPolicy("HealthCheckPolicy", policy =>
        policy.RequireRole("HealthAdmin"));
    options.AddPolicy("UIAccessPolicy", policy =>
        policy.RequireRole("OpsUser"));
});
app.UseAuthentication();
app.UseAuthorization();
app.MapHealthChecks("/health-status", new HealthCheckOptions { … })
    .RequireAuthorization("HealthCheckPolicy");
app.MapHealthChecksUI(options =>
{
    options.UIPath = "/hc-ui";
}).RequireAuthorization("UIAccessPolicy");
Note: only users who hold the corresponding role can reach the health check endpoint and the UI.
Log alerts and tracing
foreach (var entry in report.Entries)
{
    if (entry.Value.Status != HealthStatus.Healthy)
    {
        // Pass the exception as the first argument so the logger records the full stack trace
        logger.LogError(
            entry.Value.Exception,
            "[HealthCheck alert] {Name} status: {Status}",
            entry.Key,
            entry.Value.Status
        );
    }
}
Through ILogger, the logs can be shipped to a log analysis platform such as Elasticsearch, Seq, or Kibana.
Graceful shutdown
In Program.cs, subscribe to the ApplicationStopping event:
var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();
lifetime.ApplicationStopping.Register(() =>
{
    // Flip the self check to Unhealthy so Kubernetes stops sending traffic
    // One way to do this is to make the "Self" check return Unhealthy dynamically
});
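One possible shape for that dynamic "Self" check is a small flag object shared between the check and the shutdown callback. This is a sketch under assumptions: ShutdownGate is a hypothetical class, not part of ASP.NET Core, and the description strings are placeholders.

```csharp
using System.Threading;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: a thread-safe flag that the "Self" check consults on every probe.
public sealed class ShutdownGate
{
    private int _stopping; // 0 = running, 1 = stopping

    // Called once from the ApplicationStopping callback.
    public void MarkStopping() => Interlocked.Exchange(ref _stopping, 1);

    // Healthy while running, Unhealthy once shutdown has begun.
    public HealthCheckResult Check() =>
        Volatile.Read(ref _stopping) == 1
            ? HealthCheckResult.Unhealthy("Shutting down")
            : HealthCheckResult.Healthy("I'm alive");
}

// Wiring in Program.cs (shown as comments because it belongs with the existing setup):
//   var gate = new ShutdownGate();
//   services.AddHealthChecks().AddCheck("Self", gate.Check);
//   lifetime.ApplicationStopping.Register(gate.MarkStopping);
```

For the flag to help, the process has to stay alive long enough after ApplicationStopping fires for a readiness probe to observe the Unhealthy result; how long that takes depends on the probe's periodSeconds and failureThreshold.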
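Monitor health check duration
// Add inside the custom ResponseWriter
// (Metrics is assumed to be Prometheus.Metrics from the prometheus-net package)
var gauge = Metrics.CreateGauge("health_check_duration_seconds", "Health check duration in seconds", "dependency");
foreach (var entry in report.Entries)
{
    gauge.WithLabels(entry.Key).Set(entry.Value.Duration.TotalSeconds);
}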
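Jitter filtering and circuit breaking
A transient dependency blip can briefly flip /health-status to Unhealthy, causing Pods to be restarted or pulled out of rotation again and again. A "circuit breaker + sliding window" strategy can be used, keeping check results in memory for a period of time:
using Polly;
using Polly.CircuitBreaker;

// Define the circuit breaker policy
var breakerPolicy = Policy.Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(30)
    );

services.AddHealthChecks()
    // AddAsyncCheck is required for an async delegate; AddCheck only accepts a synchronous one
    .AddAsyncCheck("SqlServerWithBreaker", async () =>
    {
        // While the circuit is open, ExecuteAsync throws BrokenCircuitException
        // without touching the dependency, so the check fails fast
        return await breakerPolicy.ExecuteAsync(async () =>
        {
            // Perform the actual SQL Server check here
            // …
            return HealthCheckResult.Healthy();
        });
    });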
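The sliding-window half of that strategy is not shown above, so here is a minimal sketch of one possible interpretation: keep the last N raw probe results and report failure only once at least K of them failed. SlidingWindowFilter, windowSize, and failureThreshold are hypothetical names, and the rule itself is an assumption about what "sliding window" means here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical smoothing filter: a single failed probe does not flip the verdict.
public sealed class SlidingWindowFilter
{
    private readonly Queue<bool> _window = new();
    private readonly object _gate = new();
    private readonly int _windowSize;
    private readonly int _failureThreshold;
    private int _failures;

    public SlidingWindowFilter(int windowSize, int failureThreshold)
    {
        if (windowSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(windowSize));
        if (failureThreshold <= 0 || failureThreshold > windowSize)
            throw new ArgumentOutOfRangeException(nameof(failureThreshold));

        _windowSize = windowSize;
        _failureThreshold = failureThreshold;
    }

    // Records one raw probe result and returns the smoothed verdict (true = healthy).
    public bool Record(bool succeeded)
    {
        lock (_gate)
        {
            _window.Enqueue(succeeded);
            if (!succeeded) _failures++;

            // Evict the oldest result once the window is full;
            // if the evicted result was a failure, stop counting it.
            if (_window.Count > _windowSize && !_window.Dequeue()) _failures--;

            return _failures < _failureThreshold;
        }
    }
}
```

Inside a health check delegate, the raw outcome of the dependency call would be passed to Record, and the check would return Healthy or the failure status based on the returned verdict. The lock is there because health checks can run concurrently with each other and with overlapping requests.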
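Horizontal scaling and caching
When many instances probe the same dependencies, cache the check results so that a call to /health-status reads from the cache first.
MicroserviceDemo
├─ Program.cs
├─ appsettings.json
├─ Controllers
│  └─ WeatherController.cs
├─ HealthChecks
│  └─ CustomHealthChecks.cs   # home for custom circuit breaker / sliding window check logic
├─ Properties
│  └─ launchSettings.json
└─ Dockerfile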
{
  "ConnectionStrings": {
    "Default": "Server=.;Database=MyDb;User Id=sa;Password=Your_password;",
    "Redis": "localhost:6379"
  },
  "ExternalServices": {
    "PingUrl": "https://api.external.com/ping"
  },
  "HealthChecksUI": {
    "HealthChecks": [
      {
        "Name": "MicroservicesHealth",
        "Uri": "/health-status"
      }
    ],
    "EvaluationTimeInSeconds": 60,
    "MinimumSecondsBetweenFailureNotifications": 50
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}
Note: these examples are for reference only; adjust the configuration keys and naming conventions to fit your own project.