Keywords: PostgreSQL fault and performance monitoring
PostgreSQL, one of the most advanced open-source relational database management systems, carries critical data storage and processing duties in enterprise applications. As business scale and data volumes keep growing, database stability and performance directly determine how well the entire business system runs.
In modern enterprise environments, database failures can have severe business consequences, while effective database monitoring lets you detect and mitigate problems before they escalate.
A complete PostgreSQL monitoring system should include components at multiple layers and dimensions; a typical architecture covers:
- Hardware layer monitoring
- Operating system layer monitoring
- Database layer monitoring
- Application layer monitoring
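The four layers can be sketched as a small check registry that runs each layer's probes and rolls them up into a per-layer status (a minimal illustration; the registered checks below are stand-ins, not real probes):

```python
from typing import Callable, Dict, List

# Registry of health checks per monitoring layer; the lambdas
# registered below are placeholders standing in for real probes.
LAYERS: Dict[str, List[Callable[[], bool]]] = {
    "hardware": [],
    "os": [],
    "database": [],
    "application": [],
}

def register(layer: str, check: Callable[[], bool]) -> None:
    """Attach a boolean health check to a monitoring layer."""
    LAYERS[layer].append(check)

def run_all() -> Dict[str, str]:
    """Run every registered check; a layer is 'ok' only if all its checks pass."""
    status = {}
    for layer, checks in LAYERS.items():
        if not checks:
            status[layer] = "no checks"
        else:
            status[layer] = "ok" if all(check() for check in checks) else "failed"
    return status

# Stub checks standing in for, e.g., pg_isready and a disk-usage probe
register("database", lambda: True)
register("os", lambda: 72 < 85)
```

Real deployments would register probes such as a `pg_isready` wrapper at the database layer and a disk-usage check at the OS layer.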
PostgreSQL fault monitoring should track the following core metrics:
-- Current connection count
SELECT count(*) AS current_connections
FROM pg_stat_activity;
-- Configured connection limit
SELECT setting AS max_connections
FROM pg_settings
WHERE name = 'max_connections';
-- Connection usage percentage (GROUP BY is required because
-- setting is mixed with an aggregate)
SELECT
    count(*) AS current_connections,
    setting::int AS max_connections,
    round(count(*)::numeric / setting::numeric * 100, 2) AS connection_usage_percent
FROM pg_stat_activity, pg_settings
WHERE pg_settings.name = 'max_connections'
GROUP BY setting;
-- Show current lock waits (blocked vs. blocking sessions)
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process,
blocked_activity.application_name AS blocked_application,
blocking_activity.application_name AS blocking_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
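The query returns (blocked, blocking) session pairs; a small helper can reduce them to the "root blockers" worth investigating first (an illustrative sketch, not from the original article):

```python
from typing import Iterable, Set, Tuple

def root_blockers(pairs: Iterable[Tuple[int, int]]) -> Set[int]:
    """Given (blocked_pid, blocking_pid) pairs from the lock-wait query,
    return the sessions that block others but are not blocked themselves;
    terminating these usually releases the whole wait chain."""
    blocked: Set[int] = set()
    blocking: Set[int] = set()
    for blocked_pid, blocking_pid in pairs:
        blocked.add(blocked_pid)
        blocking.add(blocking_pid)
    return blocking - blocked

# Example: 101 waits on 202, which waits on 303 -> 303 is the root blocker
```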
#!/bin/bash
# PostgreSQL health check script
DB_HOST="localhost"
DB_PORT="5432"
DB_NAME="postgres"
DB_USER="monitoring_user"

# Check database connectivity
check_connection() {
    if ! pg_isready -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER"; then
        echo "ERROR: Cannot connect to PostgreSQL"
        send_alert "PostgreSQL connection failed"
        return 1
    fi
    return 0
}

# Check replication lag (only meaningful on a standby)
check_replication_lag() {
    LAG=$(psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -At -c "
        SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int;
    ")
    # pg_last_xact_replay_timestamp() is NULL on a primary, so guard against an empty result
    if [ -n "$LAG" ] && [ "$LAG" -gt 300 ]; then  # 5-minute lag threshold
        echo "WARNING: Replication lag is ${LAG} seconds"
        send_alert "PostgreSQL replication lag exceeds 5 minutes: ${LAG}s"
    fi
}

# Check disk usage
check_disk_usage() {
    USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$USAGE" -gt 85 ]; then
        echo "WARNING: Disk usage is ${USAGE}%"
        send_alert "PostgreSQL disk usage too high: ${USAGE}%"
    fi
}

# Send an alert (integrate DingTalk, WeCom, email, etc. here)
send_alert() {
    MESSAGE=$1
    echo "$(date): $MESSAGE" >> /var/log/postgresql_alerts.log
}

# Main check flow
main() {
    echo "Starting PostgreSQL health check - $(date)"
    check_connection || exit 1
    check_replication_lag
    check_disk_usage
    echo "Health check complete - $(date)"
}
main
PostgreSQL provides the pg_stat_statements extension for tracking per-statement execution statistics:
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Note: in PostgreSQL 13+, total_time / mean_time were renamed to
-- total_exec_time / mean_exec_time; adjust these queries accordingly.
-- Top 10 slowest statements by mean time
SELECT
    query,
    calls,
    total_time,
    mean_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Most frequently executed statements
SELECT
    query,
    calls,
    total_time,
    mean_time,
    rows
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
-- Overall cache hit ratio (nullif avoids division by zero on a fresh cluster)
SELECT
    round(
        sum(blks_hit) * 100.0 / nullif(sum(blks_hit + blks_read), 0), 2
    ) AS cache_hit_ratio
FROM pg_stat_database;
-- Cache hit ratio per database
SELECT
    datname,
    round(
        blks_hit * 100.0 / (blks_hit + blks_read), 2
    ) AS cache_hit_ratio
FROM pg_stat_database
WHERE blks_read > 0;
-- Table-level cache hit ratio
-- (pg_statio_user_tables exposes the table name as relname, not tablename)
SELECT
    schemaname,
    relname,
    round(
        heap_blks_hit * 100.0 / (heap_blks_hit + heap_blks_read), 2
    ) AS table_cache_hit_ratio
FROM pg_statio_user_tables
WHERE heap_blks_read > 0
ORDER BY table_cache_hit_ratio;
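The hit ratios above can also be derived client-side from the raw counters; the helper below guards against division by zero when no blocks have been read yet (an illustrative sketch; the function name is my own):

```python
from typing import Optional

def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Cache hit ratio in percent from pg_stat_database counters.
    Returns None when no blocks have been touched yet, which avoids
    a division by zero on a freshly started cluster."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return round(blks_hit * 100.0 / total, 2)
```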
-- Memory-related configuration parameters
SELECT
name,
setting,
unit,
category
FROM pg_settings
WHERE name IN (
'shared_buffers',
'work_mem',
'maintenance_work_mem',
'effective_cache_size'
);
-- Database and catalog sizes (on-disk footprint, not live memory usage)
SELECT
    pg_size_pretty(pg_database_size(current_database())) AS database_size,
    pg_size_pretty(pg_relation_size('pg_class')) AS pg_class_size;
-- Per-table I/O statistics
SELECT
    schemaname,
    relname,
    heap_blks_read,
    heap_blks_hit,
    idx_blks_read,
    idx_blks_hit,
    toast_blks_read,
    toast_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read + idx_blks_read + toast_blks_read DESC;
-- Index usage statistics
-- (pg_stat_user_indexes exposes relname and indexrelname, not tablename/indexname)
SELECT
    schemaname,
    relname,
    indexrelname,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_tup_read DESC;
The following shows a typical performance-monitoring dashboard architecture:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "postgresql_rules.yml"
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
    scrape_interval: 5s
    metrics_path: /metrics
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
#!/bin/bash
# postgres_exporter startup script
export DATA_SOURCE_NAME="postgresql://monitoring_user:password@localhost:5432/postgres?sslmode=disable"
./postgres_exporter \
    --web.listen-address=:9187 \
    --log.level=info \
    --extend.query-path=/etc/postgres_exporter/queries.yaml
Custom query configuration (queries.yaml):
pg_replication_lag:
  query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float as lag"
  master: true
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"
pg_database_size:
  query: "SELECT pg_database.datname, pg_database_size(pg_database.datname) as size FROM pg_database"
  master: true
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of the database"
    - size:
        usage: "GAUGE"
        description: "Disk space used by the database"
pg_slow_queries:
  query: "SELECT query, calls, total_time, mean_time FROM pg_stat_statements WHERE mean_time > 1000 ORDER BY mean_time DESC LIMIT 10"
  master: true
  metrics:
    - query:
        usage: "LABEL"
        description: "Query text"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_time:
        usage: "COUNTER"
        description: "Total time spent in the statement"
    - mean_time:
        usage: "GAUGE"
        description: "Mean time spent in the statement"
# Key postgresql.conf settings
log_destination = 'csvlog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_min_duration_statement = 1000   # log statements running longer than 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 1024               # log temporary files larger than 1MB (unit: kB)
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/lib/postgresql/*/pg_log/*.log
    fields:
      service: postgresql
      environment: production
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "postgresql-logs-%{+yyyy.MM.dd}"
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
| Tool | Strengths | Weaknesses | Best fit |
|---|---|---|---|
| Prometheus + Grafana | Cloud native, rich ecosystem, highly extensible | Steep learning curve, complex configuration | Large-scale and cloud environments |
| Zabbix | Comprehensive features, supports many protocols | Dated UI, middling performance | Traditional IT environments |
| Nagios | Stable and reliable, rich plugin ecosystem | Complex configuration, spartan UI | Small environments |
| DataDog | Easy to use, SaaS offering | High cost, data-security concerns | Rapid deployments |
| pgMonitor | Purpose-built for PostgreSQL | Relatively narrow feature set | Dedicated PostgreSQL monitoring |
version: '3.8'
services:
  postgresql:
    image: postgres:14
    environment:
      POSTGRES_DB: testdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql/postgresql.conf
    ports:
      - "5432:5432"
    command: postgres -c config_file=/etc/postgresql/postgresql.conf
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://postgres:password@postgresql:5432/testdb?sslmode=disable"
    ports:
      - "9187:9187"
    depends_on:
      - postgresql
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./postgresql_rules.yml:/etc/prometheus/postgresql_rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    depends_on:
      - postgres-exporter
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
volumes:
  postgres_data:
  grafana_data:
# postgresql_rules.yml
groups:
  - name: postgresql-alerts
    rules:
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL instance {{ $labels.instance }} is down"
          description: "PostgreSQL instance {{ $labels.instance }} is unreachable (pg_up == 0)"
      - alert: PostgreSQLHighConnections
        expr: (pg_stat_database_numbackends / pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connection count too high"
          description: "PostgreSQL instance {{ $labels.instance }} connection usage exceeds 80%, current value: {{ $value }}%"
      - alert: PostgreSQLReplicationLag
        expr: pg_replication_lag > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag too high"
          description: "PostgreSQL instance {{ $labels.instance }} replication lag exceeds 5 minutes, current lag: {{ $value }}s"
      - alert: PostgreSQLSlowQueries
        expr: rate(pg_stat_statements_mean_time_ms[5m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL slow queries detected"
          description: "PostgreSQL instance {{ $labels.instance }} mean query time exceeds 1 second"
      - alert: PostgreSQLCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100 < 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL cache hit ratio too low"
          description: "PostgreSQL instance {{ $labels.instance }} cache hit ratio is below 95%, current value: {{ $value }}%"
      - alert: PostgreSQLDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} - node_filesystem_free_bytes{mountpoint="/var/lib/postgresql"}) / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL disk usage too high"
          description: "PostgreSQL data directory disk usage exceeds 85%, current value: {{ $value }}%"
      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL deadlocks detected"
          description: "PostgreSQL instance {{ $labels.instance }} is reporting deadlocks at {{ $value }}/s"
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        subject: '[CRITICAL] PostgreSQL alert'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Time: {{ .StartsAt }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:5001/critical-webhook'
        send_resolved: true
  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
        subject: '[WARNING] PostgreSQL alert'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Time: {{ .StartsAt }}
          {{ end }}
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
#!/bin/bash
# PostgreSQL automated incident-response script
LOG_FILE="/var/log/postgresql_auto_response.log"
DB_HOST="localhost"
DB_PORT="5432"
DB_USER="postgres"
MASTER_HOST="master.example.com"   # set to your primary; used by the replication check

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Handle connection-count spikes
handle_high_connections() {
    log_message "INFO: High connection count detected, starting cleanup"
    # Find connections idle for more than 30 minutes
    IDLE_CONNECTIONS=$(psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -At -c "
        SELECT pid FROM pg_stat_activity
        WHERE state = 'idle' AND query_start < now() - interval '30 minutes';
    ")
    # Terminate long-idle connections
    for pid in $IDLE_CONNECTIONS; do
        psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -c "SELECT pg_terminate_backend($pid);"
        log_message "INFO: Terminated idle connection PID: $pid"
    done
    # Reload the connection pooler, if pgbouncer is in use
    if systemctl is-active --quiet pgbouncer; then
        systemctl reload pgbouncer
        log_message "INFO: Reloaded pgbouncer configuration"
    fi
}

# Handle low disk space
handle_disk_full() {
    log_message "WARNING: Low disk space, starting cleanup"
    # Remove old WAL backup files
    find /var/lib/postgresql/*/pg_wal -name "*.backup" -mtime +7 -delete
    # Remove old log files
    find /var/lib/postgresql/*/pg_log -name "*.log" -mtime +30 -delete
    # Run VACUUM to reclaim dead tuples
    psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -c "VACUUM;"
    log_message "INFO: Disk cleanup complete"
}

# Handle replication lag
handle_replication_lag() {
    log_message "WARNING: Replication lag detected, starting diagnosis"
    # Check network connectivity to the primary
    if ! nc -z "$MASTER_HOST" "$DB_PORT"; then
        log_message "ERROR: Cannot reach the primary"
        return 1
    fi
    # Check replication state on the standby
    REPLICATION_STATUS=$(psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -At -c "
        SELECT state FROM pg_stat_wal_receiver;
    ")
    if [ "$REPLICATION_STATUS" != "streaming" ]; then
        log_message "ERROR: Abnormal replication state: $REPLICATION_STATUS"
        # Restart the standby to re-establish replication (heavy-handed; adapt as needed)
        systemctl restart postgresql
    fi
}

# Handle deadlocks
handle_deadlocks() {
    log_message "WARNING: Deadlocks detected, terminating long-running transactions"
    # Find transactions active for more than 10 minutes
    LONG_TRANSACTIONS=$(psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -At -c "
        SELECT pid FROM pg_stat_activity
        WHERE state = 'active'
        AND query_start < now() - interval '10 minutes'
        AND query NOT LIKE '%pg_stat_activity%';
    ")
    for pid in $LONG_TRANSACTIONS; do
        psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -c "SELECT pg_terminate_backend($pid);"
        log_message "INFO: Terminated long transaction PID: $pid"
    done
}

# Dispatch on alert type
main() {
    ALERT_TYPE=$1
    case $ALERT_TYPE in
        "high_connections") handle_high_connections ;;
        "disk_full") handle_disk_full ;;
        "replication_lag") handle_replication_lag ;;
        "deadlocks") handle_deadlocks ;;
        *)
            log_message "ERROR: Unknown alert type: $ALERT_TYPE"
            exit 1
            ;;
    esac
}
main "$@"
Establishing a performance baseline is an important foundation for monitoring; collect the following key metrics:
-- Create a performance baseline table
CREATE TABLE performance_baseline (
    metric_name VARCHAR(100),
    metric_value NUMERIC,
    metric_unit VARCHAR(20),
    measurement_time TIMESTAMP DEFAULT NOW(),
    baseline_type VARCHAR(50) -- daily, weekly, monthly
);
-- Function that collects baseline data
CREATE OR REPLACE FUNCTION collect_performance_baseline()
RETURNS VOID AS $$
BEGIN
    -- Connection-count baseline
    INSERT INTO performance_baseline (metric_name, metric_value, metric_unit, baseline_type)
    SELECT 'active_connections', count(*), 'count', 'daily'
    FROM pg_stat_activity WHERE state = 'active';
    -- Transaction-rate baseline (nullif guards against a zero interval between stats resets)
    INSERT INTO performance_baseline (metric_name, metric_value, metric_unit, baseline_type)
    SELECT 'transactions_per_second',
        sum(xact_commit + xact_rollback) /
        nullif(EXTRACT(EPOCH FROM (max(stats_reset) - min(stats_reset))), 0),
        'tps', 'daily'
    FROM pg_stat_database;
    -- Cache-hit-ratio baseline
    INSERT INTO performance_baseline (metric_name, metric_value, metric_unit, baseline_type)
    SELECT 'cache_hit_ratio',
        round(sum(blks_hit) * 100.0 / sum(blks_hit + blks_read), 2),
        'percent', 'daily'
    FROM pg_stat_database WHERE blks_read > 0;
    -- Mean-query-time baseline
    INSERT INTO performance_baseline (metric_name, metric_value, metric_unit, baseline_type)
    SELECT 'avg_query_time',
        avg(mean_time),
        'milliseconds', 'daily'
    FROM pg_stat_statements;
END;
$$ LANGUAGE plpgsql;
-- Schedule daily baseline collection (requires the pg_cron extension)
SELECT cron.schedule('collect-baseline', '0 1 * * *', 'SELECT collect_performance_baseline();');
-- Performance trend comparison view
CREATE VIEW performance_trend_analysis AS
WITH baseline_stats AS (
SELECT
metric_name,
AVG(metric_value) as baseline_avg,
STDDEV(metric_value) as baseline_stddev
FROM performance_baseline
WHERE measurement_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY metric_name
),
current_stats AS (
SELECT
'active_connections' as metric_name,
count(*)::numeric as current_value
FROM pg_stat_activity WHERE state = 'active'
UNION ALL
SELECT
'cache_hit_ratio' as metric_name,
round(sum(blks_hit) * 100.0 / sum(blks_hit + blks_read), 2)
FROM pg_stat_database WHERE blks_read > 0
UNION ALL
SELECT
'avg_query_time' as metric_name,
avg(mean_time)
FROM pg_stat_statements
)
SELECT
b.metric_name,
b.baseline_avg,
c.current_value,
round(((c.current_value - b.baseline_avg) / b.baseline_avg * 100), 2) as deviation_percent,
CASE
WHEN abs(c.current_value - b.baseline_avg) > 2 * b.baseline_stddev
THEN 'ANOMALY'
WHEN abs(c.current_value - b.baseline_avg) > b.baseline_stddev
THEN 'WARNING'
ELSE 'NORMAL'
END as status
FROM baseline_stats b
JOIN current_stats c ON b.metric_name = c.metric_name;
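The view's three-way status is a simple standard-deviation band; the same logic in Python, mirroring the CASE expression (an illustrative sketch):

```python
def classify(current: float, baseline_avg: float, baseline_stddev: float) -> str:
    """Classify a reading against its 30-day baseline, mirroring the
    CASE expression in the view: a deviation beyond two standard
    deviations is an anomaly, beyond one is a warning."""
    deviation = abs(current - baseline_avg)
    if deviation > 2 * baseline_stddev:
        return "ANOMALY"
    if deviation > baseline_stddev:
        return "WARNING"
    return "NORMAL"
```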
#!/usr/bin/env python3
import json
from datetime import datetime

import pandas as pd
import psycopg2

class PostgreSQLPerformanceAnalyzer:
    def __init__(self, host, port, database, username, password):
        self.conn = psycopg2.connect(
            host=host,
            port=port,
            database=database,
            user=username,
            password=password
        )

    def analyze_slow_queries(self):
        """Analyze slow queries and generate tuning suggestions."""
        query = """
        SELECT
            query,
            calls,
            total_time,
            mean_time,
            stddev_time,
            rows,
            100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
        FROM pg_stat_statements
        WHERE mean_time > 100
        ORDER BY mean_time DESC
        LIMIT 20;
        """
        df = pd.read_sql_query(query, self.conn)
        recommendations = []
        for _, row in df.iterrows():
            recommendation = {
                'query': row['query'][:100] + '...',
                'mean_time': row['mean_time'],
                'suggestions': []
            }
            # Generate suggestions from the statistics
            if row['hit_percent'] < 95:
                recommendation['suggestions'].append(
                    "Consider adding an index to improve the cache hit ratio"
                )
            if row['rows'] > 1000 and 'SELECT' in row['query'].upper():
                recommendation['suggestions'].append(
                    "Query returns many rows; consider adding LIMIT or tightening the WHERE clause"
                )
            if row['stddev_time'] > row['mean_time']:
                recommendation['suggestions'].append(
                    "Execution time is unstable; check whether table statistics are stale"
                )
            recommendations.append(recommendation)
        return recommendations

    def analyze_index_usage(self):
        """Analyze index usage."""
        query = """
        SELECT
            schemaname,
            relname,
            indexrelname,
            idx_tup_read,
            idx_tup_fetch,
            pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
        FROM pg_stat_user_indexes
        ORDER BY idx_tup_read DESC;
        """
        df = pd.read_sql_query(query, self.conn)
        # Indexes that have never been read
        unused_indexes = df[df['idx_tup_read'] == 0]
        # Indexes with a poor fetch/read ratio
        df['efficiency'] = df['idx_tup_fetch'] / df['idx_tup_read'].replace(0, 1)
        low_efficiency_indexes = df[df['efficiency'] < 0.1]
        return {
            'unused_indexes': unused_indexes.to_dict('records'),
            'low_efficiency_indexes': low_efficiency_indexes.to_dict('records')
        }

    def analyze_table_bloat(self):
        """Analyze table bloat."""
        query = """
        SELECT
            schemaname,
            relname,
            n_tup_ins,
            n_tup_upd,
            n_tup_del,
            n_dead_tup,
            last_vacuum,
            last_autovacuum,
            pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS table_size
        FROM pg_stat_user_tables
        WHERE n_dead_tup > 1000
        ORDER BY n_dead_tup DESC;
        """
        df = pd.read_sql_query(query, self.conn)
        bloated_tables = []
        for _, row in df.iterrows():
            bloat_ratio = row['n_dead_tup'] / (row['n_tup_ins'] + row['n_tup_upd'] + 1)
            if bloat_ratio > 0.1:  # more than 10% dead tuples
                bloated_tables.append({
                    'table': f"{row['schemaname']}.{row['relname']}",
                    'dead_tuples': row['n_dead_tup'],
                    'bloat_ratio': round(bloat_ratio * 100, 2),
                    'table_size': row['table_size'],
                    'last_vacuum': row['last_vacuum'],
                    'recommendation': 'VACUUM' if bloat_ratio < 0.2 else 'VACUUM FULL'
                })
        return bloated_tables

    def generate_performance_report(self):
        """Generate a performance analysis report."""
        report = {
            'timestamp': datetime.now().isoformat(),
            'slow_queries': self.analyze_slow_queries(),
            'index_analysis': self.analyze_index_usage(),
            'table_bloat': self.analyze_table_bloat()
        }
        return json.dumps(report, indent=2, default=str)

# Usage example
if __name__ == "__main__":
    analyzer = PostgreSQLPerformanceAnalyzer(
        host='localhost',
        port=5432,
        database='postgres',
        username='postgres',
        password='password'
    )
    report = analyzer.generate_performance_report()
    print(report)
Alerts are typically tiered into four priority levels: P0 (core business metrics), P1 (performance metrics), P2 (resource metrics), and P3 (optimization metrics).
# Example alert tiering policy
alert_levels:
  critical:
    description: "Impacts normal business operation; requires immediate action"
    response_time: "within 5 minutes"
    escalation: "automated phone call + SMS + email"
    examples:
      - database down
      - replication broken for more than 5 minutes
      - disk usage above 95%
      - connections above 90% of the maximum
  warning:
    description: "May impact performance; needs attention"
    response_time: "within 30 minutes"
    escalation: "email + instant message"
    examples:
      - increase in slow queries
      - falling cache hit ratio
      - disk usage above 85%
      - replication lag above 1 minute
  info:
    description: "Informational; logged for reference"
    response_time: "handled during business hours"
    escalation: "log entry"
    examples:
      - scheduled backup completed
      - configuration parameter changed
      - connection-count fluctuation
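The tiering above can be codified so an alert router applies it consistently. The disk thresholds below come straight from the example policy (above 95% critical, above 85% warning); treating everything lower as informational is my simplification:

```python
def disk_usage_severity(usage_percent: float) -> str:
    """Map data-directory disk usage to an alert tier using the example
    thresholds above (over 95% critical, over 85% warning); everything
    below is treated as informational in this sketch."""
    if usage_percent > 95:
        return "critical"
    if usage_percent > 85:
        return "warning"
    return "info"
```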
Scenario:
During a promotional event, an e-commerce platform's database connection count spiked, preventing new users from logging in.
Problem analysis:
-- Break down connections by state
SELECT
    state,
    count(*) AS connection_count,
    round(count(*) * 100.0 / sum(count(*)) OVER (), 2) AS percentage
FROM pg_stat_activity
GROUP BY state;
-- Find long-idle connections
SELECT
    pid,
    usename,
    application_name,
    state,
    query_start,
    state_change,
    now() - state_change AS idle_duration
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - state_change > interval '10 minutes'
ORDER BY idle_duration DESC;
Monitoring configuration:
# Connection-pool alerting rules
- alert: ConnectionPoolExhaustion
  expr: |
    (
      sum(pg_stat_activity_count) by (instance) /
      sum(pg_settings_max_connections) by (instance)
    ) * 100 > 85
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Connection pool usage too high: {{ $value }}%"
    description: "Instance {{ $labels.instance }} connection pool usage exceeds 85%"
- alert: IdleConnectionsHigh
  expr: pg_stat_activity_count{state="idle"} > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Too many idle connections: {{ $value }}"
    description: "Instance {{ $labels.instance }} has more than 50 idle connections"
Solutions:
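One common mitigation for this scenario (an illustrative sketch with assumed names, not necessarily the platform's exact fix) is to select long-idle sessions client-side and hand each pid to `pg_terminate_backend`:

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

def idle_pids_to_terminate(
    sessions: Iterable[Tuple[int, str, datetime]],
    now: datetime,
    max_idle: timedelta = timedelta(minutes=30),
) -> List[int]:
    """Given (pid, state, state_change) rows from pg_stat_activity,
    return the pids of sessions idle longer than max_idle; each would
    then be passed to SELECT pg_terminate_backend(pid)."""
    return [
        pid
        for pid, state, changed in sessions
        if state == "idle" and now - changed > max_idle
    ]
```

Pairing this with a connection pooler such as pgbouncer, so that application connections are multiplexed in the first place, addresses the root cause rather than the symptom.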
Scenario:
Users of a SaaS platform reported slow response times; monitoring revealed a large number of slow queries.
Analysis:
-- Find the most time-consuming queries
SELECT
    substring(query, 1, 100) AS short_query,
    calls,
    total_time,
    mean_time,
    stddev_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE mean_time > 1000
ORDER BY total_time DESC
LIMIT 10;
-- Inspect the execution plan of a suspect query
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= '2024-01-01';
Monitoring dashboard design:
{
  "dashboard": {
    "title": "PostgreSQL slow-query monitoring",
    "panels": [
      {
        "title": "Mean query execution time",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(pg_stat_statements_total_time_ms[5m]) / rate(pg_stat_statements_calls[5m])"
          }
        ]
      },
      {
        "title": "Top 10 slow queries",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, pg_stat_statements_mean_time_ms > 1000)"
          }
        ]
      },
      {
        "title": "Query latency distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(pg_stat_statements_total_time_ms_bucket[5m]))"
          }
        ]
      }
    ]
  }
}
import json

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

class PostgreSQLCapacityPlanner:
    def __init__(self, monitoring_data):
        self.data = pd.DataFrame(monitoring_data)
        self.data['timestamp'] = pd.to_datetime(self.data['timestamp'])

    def predict_growth(self, metric, days_ahead=90):
        """Predict the growth trend of a metric with a linear fit."""
        # Prepare the training data
        X = np.array(range(len(self.data))).reshape(-1, 1)
        y = self.data[metric].values
        # Fit a linear regression model
        model = LinearRegression()
        model.fit(X, y)
        # Extrapolate into the future
        future_X = np.array(range(len(self.data), len(self.data) + days_ahead)).reshape(-1, 1)
        predictions = model.predict(future_X)
        return {
            'current_value': y[-1],
            'predicted_value': predictions[-1],
            'growth_rate': (predictions[-1] - y[-1]) / len(predictions),
            'confidence_score': model.score(X, y)
        }

    def generate_capacity_report(self):
        """Generate a capacity-planning report."""
        metrics = ['database_size', 'connection_count', 'transaction_rate']
        report = {}
        for metric in metrics:
            if metric in self.data.columns:
                report[metric] = self.predict_growth(metric)
        return report

# Usage example
monitoring_data = [
    {'timestamp': '2024-01-01', 'database_size': 100, 'connection_count': 50, 'transaction_rate': 1000},
    {'timestamp': '2024-01-02', 'database_size': 102, 'connection_count': 52, 'transaction_rate': 1050},
    # ... more historical data
]
planner = PostgreSQLCapacityPlanner(monitoring_data)
capacity_report = planner.generate_capacity_report()
print(json.dumps(capacity_report, indent=2, default=str))
The analysis in this article distills several key points for PostgreSQL fault and performance monitoring, spanning monitoring system construction, fault prevention and response, performance optimization strategy, and tool selection principles.
With advances in artificial intelligence, database monitoring is evolving toward intelligent operations, and several key technical directions are emerging.
With the spread of cloud computing, monitoring architectures are likewise moving toward cloud-native designs.
Modern monitoring is evolving toward observability, built on three pillars: metrics, logs, and traces.
For enterprises implementing a PostgreSQL monitoring system, a phased rollout is recommended:
Phase 1: basic monitoring
Phase 2: a complete monitoring system
Phase 3: intelligent operations
Phase 4: platform-based operations
Monitoring a PostgreSQL database is a continuously evolving practice that must be refined as the business grows and technology advances. By building a sound monitoring system, choosing appropriate tools, defining effective alerting policies, and applying automation and intelligent techniques, you can markedly improve database stability and performance and give the business a solid data foundation.
During implementation, combine theory with practice and tailor the monitoring plan to your environment and business needs. Keep watching new technologies as well, adopting modern monitoring ideas and tools in good time so the monitoring system stays current.
This article targets PostgreSQL 12 and later; some features differ across versions, so adjust the configuration to match the version you actually run.