Advanced Python Web Development in Practice: Load Testing and Tuning with Locust + Prometheus + Grafana for a High-Concurrency, Observable System

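Chapter 1: Why Performance Engineering?

1.1 Real-World Performance Challenges

Scenario	Consequence
Traffic spike (e.g. a promotion)	Cascading service failure, 502 errors
Slow SQL	Database CPU at 100%, dragging down the whole system
Memory leak	Workers crash and need frequent restarts
No monitoring	Failures are noticed only after the fact; MTTR > 1 hour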

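1.2 The Four Elements of Performance Engineering

[Load test] → [Monitor] → [Analyze] → [Optimize]
     ↑___________________________________↓

Load testing: expose problems proactively ("destructive testing")
Monitoring: discover problems passively ("observability")
Analysis: locate the root cause (CPU? I/O? locks?)
Optimization: adjust code, configuration, or architecture

Principle: don't guess at the bottleneck; let the data speak.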

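Chapter 2: Choosing a Load-Testing Tool: Why Locust?

Tool	Language	Concurrency model	Distributed	Ease of use
JMeter	Java	Threads	✔	❌ (complex XML configuration)
Gatling	Scala	Actor	✔	⚠️ (steep learning curve)
Locust	Python	Coroutines (gevent)	✔	✅ (code as configuration)

Advantages:
User behavior is written in Python, which gives a lot of flexibility
Real-time web UI showing RPS, response time, and error rate
Supports distributed load testing (master-worker)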

Chapter 3: Writing Locust Test Scripts

3.1 Installing Locust

pip install locust

3.2 Project Structure

/perf-test
├── locustfile.py      ← main load-test script
├── tasks/
│   ├── auth.py        ← login task
│   └── api.py         ← API call tasks
└── utils/
    └── jwt.py         ← token management

3.3 Core Test Logic (locustfile.py)

# perf-test/locustfile.py
from locust import HttpUser, task, between

from tasks.auth import login
from tasks.api import get_profile, create_post


class WebsiteUser(HttpUser):
    wait_time = between(1, 3)  # each user pauses 1 to 3 seconds between tasks

    def on_start(self):
        """Log in once when each simulated user starts."""
        self.access_token = login(self.client)

    @task(3)
    def view_profile(self):
        get_profile(self.client, self.access_token)

    @task(1)
    def create_new_post(self):
        create_post(self.client, self.access_token, "Hello from Locust!")

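3.4 Login Task (tasks/auth.py)

# perf-test/tasks/auth.py
def login(client):
    """Log in with the test account and return its access token."""
    response = client.post("/auth/login", json={
        "username": "testuser",
        "password": "secure_password",
    })
    # Fail fast: without a token every later request would fail anyway.
    assert response.status_code == 200, f"login failed: {response.status_code}"
    return response.json()["access_token"]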

3.5 API Tasks (tasks/api.py)

# perf-test/tasks/api.py
def get_profile(client, token):
    client.get("/api/profile", headers={"Authorization": f"Bearer {token}"})


def create_post(client, token, content):
    client.post(
        "/api/posts",
        json={"content": content},
        headers={"Authorization": f"Bearer {token}"},
    )

Key points:
Each virtual user logs in independently and holds its own token
@task(weight) controls how often each behavior runs (profile:post = 3:1)
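To make the 3:1 weighting concrete, here is a minimal sketch that treats task selection as a weighted random draw, which is how Locust's documentation describes the default behavior. The helper names `expected_share` and `simulate` are illustrative and not part of Locust.

```python
import random
from collections import Counter

# Task weights copied from the locustfile: view_profile=3, create_new_post=1.
TASK_WEIGHTS = {"view_profile": 3, "create_new_post": 1}


def expected_share(weights):
    """Return the long-run fraction of executions for each task."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def simulate(weights, picks, seed=42):
    """Draw `picks` tasks with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    drawn = rng.choices(names, weights=[weights[n] for n in names], k=picks)
    return Counter(drawn)


if __name__ == "__main__":
    print(expected_share(TASK_WEIGHTS))
    print(simulate(TASK_WEIGHTS, picks=10_000))
```

With these weights roughly three out of four task executions hit the profile endpoint, so read-heavy traffic dominates the test.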

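Chapter 4: Running the Test and Analyzing the Results

4.1 Single-Machine Test

cd perf-test
locust -f locustfile.py --host=http://localhost:5000

Open http://localhost:8089 and:

Start 1000 users at a spawn rate of 10 users/s (older Locust releases call this the "hatch rate")
Watch the live charts: RPS, response time, failure rate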
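Before reading the charts it helps to know what numbers to expect. The sketch below is a back-of-the-envelope estimate, assuming a closed-loop model in which every user alternates between waiting and issuing one request. The 0.5 s mean response time is an assumed input, not a measurement, and both function names are illustrative.

```python
def ramp_up_seconds(users, spawn_rate):
    """Time needed to start all users at a constant spawn rate (users/s)."""
    return users / spawn_rate


def max_rps(users, mean_wait_s, mean_response_s):
    """Throughput ceiling when each user issues one request per
    (wait time + response time) cycle."""
    return users / (mean_wait_s + mean_response_s)


if __name__ == "__main__":
    # 1000 users at 10 users/s, as entered in the web UI
    print(ramp_up_seconds(1000, 10))
    # between(1, 3) averages 2 s of wait; 0.5 s response time is assumed
    print(max_rps(1000, mean_wait_s=2.0, mean_response_s=0.5))
```

If the target peak is above this ceiling, the test needs more users or a shorter wait time; a slow backend alone cannot explain the shortfall.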

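4.2 Distributed Test (Simulating 10,000+ Concurrent Users)

Start the master:

locust -f locustfile.py --master --host=http://your-prod-domain.com

Start several workers (on different machines):

locust -f locustfile.py --worker --master-host=MASTER_IP

When to use it: a single machine's network or CPU cannot generate enough load.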

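4.3 Interpreting the Metrics

Metric	Healthy threshold	Danger signal
RPS (requests per second)	≥ expected peak	Far below expectation
P95 response time	< 500 ms	> 2 s
Failure rate	0%	> 0.1%
CPU utilization	< 70%	Sustained 100%

Examples:
RPS rises but response time climbs sharply → likely a database bottleneck
Failure rate jumps suddenly → connection pool exhausted or out of memory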
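The thresholds in the table can be applied mechanically to raw results. This is a minimal sketch, assuming response times are available as a list of milliseconds and using the nearest-rank definition of a percentile. The intermediate "warning" label for values between the two thresholds is an assumption added here.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify(p95_ms, failure_rate):
    """Apply the P95 and failure-rate thresholds from the table."""
    if p95_ms > 2000 or failure_rate > 0.001:
        return "danger"
    if p95_ms < 500 and failure_rate == 0:
        return "healthy"
    return "warning"  # between the two thresholds (label assumed here)


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic data: 1 ms to 100 ms
    p95 = percentile(latencies_ms, 95)
    print(p95, classify(p95, failure_rate=0.0))
```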

Chapter 5: Building the Monitoring Stack: Prometheus + Grafana

5.1 Monitoring Architecture

[Flask App]   → (metrics) → [Prometheus] → [Grafana]
[Celery]      ↗
[PostgreSQL]  ↗
[Redis]       ↗

5.2 Exposing Metrics from Flask

Install the dependency:

pip install prometheus-client

Add the following to the Flask application:

# app/metrics.py
import time

from flask import Flask, Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)  # placeholder: use your existing application object instead

REQUEST_COUNT = Counter(
    'http_requests_total', 'Total HTTP Requests',
    ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds', 'HTTP Request Latency',
    ['method', 'endpoint'])


@app.route('/metrics')
def metrics():
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)


# Hooks that record every request
@app.before_request
def before_request():
    g.start_time = time.time()


@app.after_request
def after_request(response):
    latency = time.time() - g.start_time
    REQUEST_LATENCY.labels(request.method, request.endpoint).observe(latency)
    REQUEST_COUNT.labels(request.method, request.endpoint, response.status_code).inc()
    return response
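A histogram such as http_request_duration_seconds stores cumulative bucket counts, not individual latencies, so a P95 has to be estimated from the buckets. The sketch below mirrors the linear interpolation that PromQL's histogram_quantile() is documented to perform. It is a simplified illustration with made-up bucket counts, not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Rank falls in the +Inf bucket or in an empty range:
            # the best available estimate is the previous finite bound.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Illustrative counts: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1.0 s
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated, its accuracy depends on the bucket boundaries; buckets placed near the latency target give the most useful P95.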

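5.3 Monitoring Celery

Install celery-prometheus-exporter:

pip install celery-prometheus-exporter

Start the exporter (as a separate process):

celery-prometheus-exporter --broker-url redis://redis:6379/0

The exporter serves its metrics on port 9808. Flag names and the default port differ between Celery exporter projects and versions, so confirm both with the exporter's --help output.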

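5.4 Monitoring PostgreSQL

Enable pg_stat_statements (requires superuser; the module must also be listed in shared_preload_libraries, see 8.1):

CREATE EXTENSION pg_stat_statements;

Use postgres_exporter:

# docker-compose.yml
services:
  postgres-exporter:
    # upstream has since moved to the prometheus-community organization
    image: wrouesnel/postgres_exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:pass@postgres:5432/db?sslmode=disable"
    ports:
      - "9187:9187"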

5.5 Monitoring Redis

Redis exposes its statistics through the built-in INFO command; redis_exporter converts them into Prometheus metrics:

# docker-compose.yml
services:
  redis-exporter:
    image: oliver006/redis_exporter
    command: --redis.addr redis://redis:6379
    ports:
      - "9121:9121"

5.6 Configuring Prometheus

Create prometheus.yml:

scrape_configs:
  - job_name: 'flask-app'
    static_configs:
      - targets: ['web:8000']  # Flask container name
  - job_name: 'celery'
    static_configs:
      - targets: ['celery-exporter:9808']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

5.7 Starting the Monitoring Stack (Docker Compose)

# docker-compose.monitoring.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change this outside local testing
    volumes:
      - grafana-storage:/var/lib/grafana
volumes:
  grafana-storage:

Start it:

docker-compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

Chapter 6: Visual Analysis with Grafana

6.1 Importing Prebuilt Dashboards

Flask: ID 11895 (Python HTTP Metrics)
PostgreSQL: ID 9628
Redis: ID 763
Celery: custom (based on celery_queue_length)

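6.2 Key Dashboard Metrics

Component	Core metrics
Flask	QPS, P95 latency, error rate
PostgreSQL	Active connections, slow queries (> 100 ms), cache hit ratio
Redis	Memory usage, hit ratio, blocked clients
Celery	Queue length, task throughput, worker count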
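The two hit ratios in the table are computed the same way: hits divided by hits plus misses. As a minimal sketch, the counters below use the field names from Redis's INFO output (keyspace_hits, keyspace_misses) and PostgreSQL's pg_stat_database view (blks_hit, blks_read). The numbers are illustrative values, not measurements.

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Illustrative counter values only
    redis_info = {"keyspace_hits": 9_500, "keyspace_misses": 500}
    pg_stat_database = {"blks_hit": 99_000, "blks_read": 1_000}

    print(hit_ratio(redis_info["keyspace_hits"], redis_info["keyspace_misses"]))
    print(hit_ratio(pg_stat_database["blks_hit"], pg_stat_database["blks_read"]))
```

These counters are cumulative since the last restart or statistics reset, so a dashboard normally applies the formula to their rate over a time window.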

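6.3 Locating Typical Bottlenecks

Case 1: Database CPU at 100%

Symptom: PostgreSQL CPU stays at 100% and QPS drops
Grafana: pg_stat_statements shows one SQL statement averaging 2 s per call
Fix: add an index on the columns used in its WHERE clause

Case 2: Celery Queue Backlog

Symptom: celery_queue_length keeps growing
Cause: too few workers, or tasks that are stuck
Fix: add workers or optimize the task logic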
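For the queue backlog case, a rough capacity estimate shows how many workers are enough. The sketch below assumes each worker processes one task at a time and that arrival rate and mean task duration are roughly constant. The function names and the 70% target utilization are illustrative choices.

```python
import math


def workers_needed(arrival_per_s, mean_task_s, target_utilization=0.7):
    """Workers required so that average utilization stays at the target."""
    return math.ceil(arrival_per_s * mean_task_s / target_utilization)


def drain_seconds(backlog, workers, arrival_per_s, mean_task_s):
    """Time to clear an existing backlog; infinite if the workers cannot
    keep up with new arrivals."""
    service_per_s = workers / mean_task_s
    surplus = service_per_s - arrival_per_s
    if surplus <= 0:
        return math.inf
    return backlog / surplus


if __name__ == "__main__":
    # Assumed load: 20 tasks/s arriving, 0.5 s per task
    print(workers_needed(20, 0.5))
    print(drain_seconds(1000, workers=15, arrival_per_s=20, mean_task_s=0.5))
```

If the estimate says the current worker count should be sufficient but the queue still grows, stuck or unusually slow tasks are the more likely cause.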

Chapter 7: Autoscaling Strategies

7.1 CPU-Based Scaling (Docker Compose)

Note: Docker Compose has no built-in equivalent of the Kubernetes Horizontal Pod Autoscaler (HPA), so an external script is needed.

Write a monitoring script, autoscale.sh:

#!/bin/bash
CPU_THRESHOLD=70
MIN_WORKERS=2
MAX_WORKERS=10

while true; do
  # CPU usage of the container named "web", with the trailing % removed
  CPU=$(docker stats --no-stream --format "{{.CPUPerc}}" web | sed 's/%//')
  # Number of Celery containers currently running
  CURRENT=$(docker-compose ps -q celery | wc -l)

  if (( $(echo "$CPU > $CPU_THRESHOLD" | bc -l) )) && [ "$CURRENT" -lt "$MAX_WORKERS" ]; then
    echo "Scaling up Celery to $((CURRENT + 1))"
    docker-compose up -d --scale celery=$((CURRENT + 1))
  elif (( $(echo "$CPU < 50" | bc -l) )) && [ "$CURRENT" -gt "$MIN_WORKERS" ]; then
    echo "Scaling down Celery to $((CURRENT - 1))"
    docker-compose up -d --scale celery=$((CURRENT - 1))
  fi

  sleep 30
done
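The decision rule inside the script is easier to reason about, and to unit test, when it is separated from the Docker calls. This is a sketch of the same thresholds as a pure function; the name `next_worker_count` and its parameter names are illustrative.

```python
def next_worker_count(cpu_percent, current,
                      scale_up_at=70.0, scale_down_at=50.0,
                      min_workers=2, max_workers=10):
    """Return the worker count after one evaluation of the scaling rule."""
    if cpu_percent > scale_up_at and current < max_workers:
        return current + 1   # busy and below the ceiling: add one worker
    if cpu_percent < scale_down_at and current > min_workers:
        return current - 1   # idle and above the floor: remove one worker
    return current           # inside the 50-70% band or at a limit


if __name__ == "__main__":
    for cpu, workers in [(85, 3), (85, 10), (30, 3), (30, 2), (60, 5)]:
        print(cpu, workers, "->", next_worker_count(cpu, workers))
```

The gap between the 50% and 70% thresholds acts as a hysteresis band, which keeps the worker count from oscillating when CPU usage hovers around a single value.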

7.2 Kubernetes HPA (Recommended for Production)

After migrating to K8s, scaling can be driven by custom metrics:

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: celery_queue_length
        target:
          type: AverageValue
          averageValue: "10"  # scale out when the per-pod average exceeds 10

prometheus-adapter must be deployed to expose Prometheus metrics through the Kubernetes custom metrics API.
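The Kubernetes documentation describes the HPA's core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that formula with the min/max bounds from the manifest. It deliberately ignores the controller's tolerance band and stabilization window, so treat it as an approximation.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=20):
    """Replica count suggested by the basic HPA formula, clamped to bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 workers, average queue length 25 per pod, target 10 per pod
    print(desired_replicas(4, current_value=25, target_value=10))
```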

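Chapter 8: In-Depth Database Optimization

8.1 Enabling the Slow Query Log

PostgreSQL configuration (postgresql.conf):

log_min_duration_statement = 100  # log queries that take longer than 100 ms
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

8.2 Analyzing Slow Queries

-- Column names for PostgreSQL 13+; older versions use total_time and mean_time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;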

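8.3 Common Optimization Techniques

Problem	Solution
Full table scan	Add an index on the columns used in WHERE
N+1 queries	Use SQLAlchemy `joinedload()`
Deep pagination	Switch to keyset (cursor) pagination (`WHERE id > last_id`)
Write bottleneck	Batch inserts (e.g. SQLAlchemy `bulk_insert_mappings()`)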
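Keyset pagination and batch inserts can both be illustrated without a running PostgreSQL instance. This sketch uses Python's built-in sqlite3 module and an assumed `posts` table purely for demonstration; the `WHERE id > ?` pattern carries over to PostgreSQL and SQLAlchemy unchanged.

```python
import sqlite3


def build_demo_db(row_count=25):
    """Create an in-memory table and fill it with one batched insert."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT)")
    conn.executemany(
        "INSERT INTO posts (content) VALUES (?)",
        [(f"post {i}",) for i in range(1, row_count + 1)],
    )
    return conn


def fetch_page(conn, last_id, page_size):
    """Return the next page after `last_id`. The database seeks directly to
    the starting row instead of scanning and discarding OFFSET rows."""
    return conn.execute(
        "SELECT id, content FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    last_id = 0
    while True:
        page = fetch_page(conn, last_id, page_size=10)
        if not page:
            break
        print([row[0] for row in page])
        last_id = page[-1][0]  # the cursor is the last id seen
```

The client passes back the last id it received, not a page number, so the cost of each request stays roughly constant however deep the user scrolls.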

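Chapter 9: The Load Test → Monitor → Optimize Loop

9.1 End-to-End Workflow

Load test: Locust simulates 5000 users
Monitor: Grafana shows PostgreSQL CPU at 100%
Analyze: pg_stat_statements pinpoints the slow SQL
Optimize: add a composite index on (user_id, created_at)
Verify: rerun the load test; QPS triples and CPU drops to 40%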
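The effect of a composite index on (user_id, created_at) can be observed in a query plan before any load test is rerun. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in because it ships with Python. The table layout and the index name idx_posts_user_created are assumptions, and PostgreSQL's planner and EXPLAIN output differ in detail.

```python
import sqlite3

QUERY = (
    "SELECT id FROM posts WHERE user_id = ? "
    "ORDER BY created_at DESC LIMIT 10"
)


def build_demo_db():
    """Create an empty in-memory posts table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, "
        "created_at TEXT, content TEXT)"
    )
    return conn


def query_plan(conn, sql, params):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print("before:", query_plan(conn, QUERY, (1,)))
    conn.execute(
        "CREATE INDEX idx_posts_user_created ON posts (user_id, created_at)"
    )
    print("after: ", query_plan(conn, QUERY, (1,)))
```

Putting user_id first lets the index satisfy the equality filter, and created_at second lets it return rows already sorted, which removes both the full scan and the separate sort step.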

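9.2 Managing Performance Baselines

Run a baseline load test before every release
Record key metrics (RPS, P95) in a database
Compare against historical data to catch performance regressions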
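The comparison step can be automated as a release gate. This is a minimal sketch, assuming each run is summarized as a dict with `rps` and `p95_ms` keys and that a 10% deviation is acceptable. The key names and the tolerance are illustrative choices.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return the names of metrics that got worse by more than `tolerance`.

    Higher is better for "rps"; lower is better for "p95_ms".
    """
    regressions = []
    if current["rps"] < baseline["rps"] * (1 - tolerance):
        regressions.append("rps")
    if current["p95_ms"] > baseline["p95_ms"] * (1 + tolerance):
        regressions.append("p95_ms")
    return regressions


if __name__ == "__main__":
    baseline = {"rps": 1000, "p95_ms": 200}   # stored from the last release
    candidate = {"rps": 950, "p95_ms": 260}   # measured on the new build
    print(find_regressions(baseline, candidate))
```

A CI job could fail the build whenever the returned list is non-empty, which turns the baseline into an enforced budget.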