GTE-large in Practice: Monitoring GPU Memory, Request Latency, and Error Rate with Prometheus + Grafana

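1. Monitoring Requirements and Solution Overview

When you deploy an AI application, you need to monitor its state in real time. For a multi-task web application built on the GTE-large text embedding model, three metrics matter most:

- GPU memory usage: make sure model inference has enough GPU memory
- Request latency: track API response time to protect the user experience
- Error rate: detect and handle system failures promptly

This tutorial builds a complete monitoring stack with Prometheus and Grafana, so that you can:

- View GPU memory usage in real time
- Track the response time of every API request
- Measure how often errors occur
- Set alert thresholds and catch problems early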
2. Environment Setup and Component Installation

2.1 Install Prometheus

First install Prometheus, which collects and stores the monitoring data:

    # Download Prometheus
    wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
    tar xvfz prometheus-2.37.0.linux-amd64.tar.gz
    # Use the full directory name: a glob such as prometheus-* would also match the tarball
    cd prometheus-2.37.0.linux-amd64

    # Create the configuration file
    cat > prometheus.yml << EOF
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'gte-app'
        static_configs:
          - targets: ['localhost:5000']
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['localhost:9100']
      - job_name: 'nvidia-gpu'
        static_configs:
          - targets: ['localhost:9835']
    EOF

    # Start Prometheus, then return to the working directory
    ./prometheus --config.file=prometheus.yml &
    cd ..

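2.2 Install Node Exporter

Node Exporter collects system-level metrics:

    wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
    tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
    # Use the full directory name: a glob such as node_exporter-* would also match the tarball
    cd node_exporter-1.3.1.linux-amd64
    ./node_exporter &
    cd ..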

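2.3 Install NVIDIA GPU Exporter

This exporter is dedicated to GPU metrics. It is a standalone binary that reads its data from nvidia-smi, not a Python module, so it is downloaded rather than run with python -m:

    mkdir -p nvidia_gpu_exporter
    cd nvidia_gpu_exporter
    # Download the Linux archive for your architecture from
    # https://github.com/utkuozdemir/nvidia_gpu_exporter/releases
    # and unpack it here, then start the exporter (it listens on port 9835 by default)
    ./nvidia_gpu_exporter &
    cd ..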

2.4 Install Grafana

    wget https://dl.grafana.com/oss/release/grafana-9.0.0.linux-amd64.tar.gz
    tar -zxvf grafana-9.0.0.linux-amd64.tar.gz
    cd grafana-9.0.0
    ./bin/grafana-server web &
    cd ..

3. Configure Application Metrics

3.1 Add a Metrics Endpoint to the Flask Application

Add Prometheus support to the existing app.py (the code assumes the Flask object is named app):

    import time

    from flask import g, jsonify, request
    from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Gauge,
                                   Histogram, generate_latest)
    from werkzeug.exceptions import HTTPException

    try:
        import pynvml
        pynvml.nvmlInit()  # initialize NVML once, not on every request
        GPU_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)
    except Exception:
        GPU_HANDLE = None  # no GPU or NVML unavailable: skip the GPU metric

    # Metric definitions
    REQUEST_COUNT = Counter('gte_request_total', 'Total request count',
                            ['method', 'endpoint', 'status'])
    REQUEST_LATENCY = Histogram('gte_request_latency_seconds', 'Request latency',
                                ['endpoint'])
    GPU_MEMORY_USAGE = Gauge('gte_gpu_memory_usage', 'GPU memory usage in MB')
    ERROR_COUNT = Counter('gte_error_total', 'Total error count', ['type'])

    @app.route('/metrics')
    def metrics():
        return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

    @app.before_request
    def before_request():
        g.start_time = time.time()

    @app.after_request
    def after_request(response):
        # Record request latency
        latency = time.time() - g.start_time
        REQUEST_LATENCY.labels(request.path).observe(latency)
        # Record the request count
        REQUEST_COUNT.labels(request.method, request.path, response.status_code).inc()
        # Record GPU memory in use
        if GPU_HANDLE is not None:
            info = pynvml.nvmlDeviceGetMemoryInfo(GPU_HANDLE)
            GPU_MEMORY_USAGE.set(info.used / 1024 / 1024)  # bytes to MB
        return response

    @app.errorhandler(Exception)
    def handle_exception(e):
        if isinstance(e, HTTPException):
            return e  # keep 404, 405, etc. instead of turning them into 500s
        ERROR_COUNT.labels(type(e).__name__).inc()
        return jsonify({'error': str(e)}), 500
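To make the `Histogram` metric less opaque, here is a minimal pure-Python sketch of what it stores. It is a toy model, not the `prometheus_client` implementation: `MiniHistogram` is a hypothetical class and the bucket bounds are chosen only for illustration. It shows why a histogram is exposed as cumulative `_bucket` series plus a `_sum` and a `_count`.

```python
import math


class MiniHistogram:
    """Toy model of a Prometheus histogram (hypothetical, for illustration only)."""

    def __init__(self, bounds):
        # Upper bounds ("le") in ascending order, with +Inf appended as the catch-all.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0  # corresponds to <name>_sum
        self.count = 0    # corresponds to <name>_count

    def observe(self, value):
        self.total += value
        self.count += 1
        # Buckets are cumulative: a value counts toward every bucket whose bound >= value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def snapshot(self):
        """Return {upper_bound: cumulative_count}."""
        return dict(zip(self.bounds, self.bucket_counts))


hist = MiniHistogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.7, 2.0):
    hist.observe(latency)

print(hist.snapshot())
print(hist.count, hist.total)
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total count, which is what lets a query estimate percentiles later.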

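3.2 Install the Required Python Dependencies

    pip install prometheus-client nvidia-ml-py

The nvidia-ml-py package provides the pynvml module that the monitoring code imports.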

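4. Configure Prometheus Data Collection

Update the Prometheus configuration file. Replace the scrape_configs section created in section 2.1 with the following instead of appending to it, so that each target is scraped by only one job:

    # prometheus.yml
    scrape_configs:
      - job_name: 'gte-application'
        metrics_path: '/metrics'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:5000']
      - job_name: 'gte-gpu'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9835']
      - job_name: 'gte-system'
        scrape_interval: 15s
        static_configs:
          - targets: ['localhost:9100']

Restart Prometheus for the new configuration to take effect:

    pkill prometheus
    cd prometheus-2.37.0.linux-amd64
    ./prometheus --config.file=prometheus.yml &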
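Once Prometheus is back up, one way to confirm that every job is being scraped is to ask its HTTP API for the built-in `up` metric. The sketch below is illustrative rather than part of the original setup: the helper names (`build_query_url`, `parse_up_result`, `fetch_up_status`) are hypothetical, and the sample response is hand-written in the shape the `/api/v1/query` endpoint returns.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_up_result(payload):
    """Map each (job, instance) pair to True (up) or False (down)."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    status = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        status[(labels.get("job"), labels.get("instance"))] = value == "1"
    return status


def fetch_up_status(base_url="http://localhost:9090"):
    """Query a live Prometheus; needs the server from this tutorial to be running."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=5) as resp:
        return parse_up_result(json.load(resp))


# Hand-written sample, so the parser can be tried without a running server.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "gte-application",
                        "instance": "localhost:5000"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "gte-gpu",
                        "instance": "localhost:9835"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

print(parse_up_result(SAMPLE))
```

On a machine where the stack is running, `fetch_up_status()` should report `True` for each target that Prometheus can reach.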

5. Configure the Grafana Dashboard

5.1 Add a Data Source

Open http://localhost:3000 (Grafana's default port)
Log in with username/password admin/admin
Add a Prometheus data source:
- Name: Prometheus
- URL: http://localhost:9090
- Click Save & Test

5.2 Create the Dashboard

Create a dashboard named "GTE-large Application Monitoring" and add the following panels:

GPU memory usage panel

Title: GPU Memory Usage
Query: gte_gpu_memory_usage
Visualization: Stat
Unit: megabytes
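To sanity-check the value shown in this panel, it can help to compare it with what `nvidia-smi` reports directly. The following is a small sketch under stated assumptions: `parse_memory_csv` and `query_gpu_memory` are hypothetical helper names, and the sample line is made up so that the parser can run on a machine without a GPU.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "used_ratio": used / total})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi; only works where an NVIDIA driver is installed."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


# Made-up sample output for a single card, so the parser runs without a GPU.
sample = "6144, 24576\n"
print(parse_memory_csv(sample))
```

If the two numbers disagree by a wide margin, check that the application and `nvidia-smi` are looking at the same GPU index.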

Request latency panel

Title: API Request Latency
Query: rate(gte_request_latency_seconds_sum[5m]) / rate(gte_request_latency_seconds_count[5m])
Visualization: Time series (called Graph in older Grafana versions)
Unit: seconds
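The latency query divides the growth of the histogram's `_sum` series by the growth of its `_count` series, which gives the mean latency over the window. The arithmetic can be sketched in a few lines; the numbers are illustrative, and the real `rate()` function additionally handles counter resets and extrapolates to the window edges.

```python
def average_latency(sum_start, sum_end, count_start, count_end, window_seconds):
    """Mean request latency over a window, from two readings of the histogram counters."""
    sum_rate = (sum_end - sum_start) / window_seconds        # seconds of latency per second
    count_rate = (count_end - count_start) / window_seconds  # requests per second
    if count_rate == 0:
        return float("nan")  # no requests in the window, so the mean is undefined
    return sum_rate / count_rate


# Illustrative numbers: over 5 minutes, 600 more requests and 150 more seconds of latency.
avg = average_latency(sum_start=1000.0, sum_end=1150.0,
                      count_start=4000, count_end=4600, window_seconds=300)
print(f"average latency: {avg:.3f}s")
```

A mean hides outliers, so a slow tail can go unnoticed in this panel even when some requests take much longer than the average.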

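Error rate panel

Title: Error Rate
Query: sum(rate(gte_error_total[5m])) / sum(rate(gte_request_total[5m]))
Visualization: Time series (called Graph in older Grafana versions)
Unit: percent (0.0-1.0)

On its own, rate(gte_error_total[5m]) is the number of errors per second; dividing by the request rate turns it into the fraction of requests that fail.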

Request volume panel

Title: Request Volume
Query: rate(gte_request_total[5m])
Visualization: Time series (called Graph in older Grafana versions)
Unit: none
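The `rate()` function used in these panels works on counters, which only go up but restart from zero whenever the application restarts. The sketch below shows the core idea in simplified form: `counter_increase` and `per_second_rate` are hypothetical helpers, the readings are made up, and the real PromQL function also extrapolates to the edges of the window.

```python
def counter_increase(samples):
    """Total increase across counter samples, treating any drop as a restart from zero."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a process restart the counter begins again at 0, so the new
        # value itself is the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    return increase


def per_second_rate(samples, window_seconds):
    """Average per-second growth of the counter over the window."""
    return counter_increase(samples) / window_seconds


# Made-up gte_request_total readings 15 s apart; the app restarted before the 4th one.
readings = [100, 130, 160, 20, 50]
print(per_second_rate(readings, window_seconds=60))
```

This is why a restart of the Flask application does not show up as a negative request rate in the panel.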

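6. Set Up Alert Rules

6.1 Configure Prometheus Alert Rules

Create the alert rules file:

    # alerts.yml
    groups:
      - name: gte-alerts
        rules:
          - alert: HighGPUUsage
            expr: gte_gpu_memory_usage > 8000  # 8 GB threshold
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "GPU memory usage is high"
              description: "GPU memory usage has exceeded 8 GB, current value: {{ $value }} MB"

          - alert: HighRequestLatency
            expr: rate(gte_request_latency_seconds_sum[5m]) / rate(gte_request_latency_seconds_count[5m]) > 2
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "Request latency is high"
              description: "Average API request latency has exceeded 2 seconds, current value: {{ $value }} s"

          - alert: HighErrorRate
            # Errors as a fraction of requests; rate(gte_error_total[5m]) alone is errors per second
            expr: sum(rate(gte_error_total[5m])) / sum(rate(gte_request_total[5m])) > 0.1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Error rate is high"
              description: "More than 10% of requests are failing, current ratio: {{ $value }}"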
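The `for:` clause in these rules means a threshold breach does not fire immediately: the alert first becomes pending and only fires once the condition has held continuously for the given duration. The following sketch models that behaviour for the `HighGPUUsage` rule. It is a simplified illustration with made-up readings, and `alert_states` is a hypothetical helper, not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for a threshold rule with a `for` duration.

    samples: (timestamp_seconds, value) pairs in time order, one per rule evaluation.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # the condition just became true
            state = "firing" if ts - pending_since >= for_seconds else "pending"
        else:
            pending_since = None  # condition cleared; the timer starts over next time
            state = "inactive"
        yield ts, state


# GPU memory in MB, evaluated once a minute against the 8000 MB / 5m rule.
readings = [(0, 7500), (60, 8200), (120, 8300), (180, 7900), (240, 8100),
            (300, 8400), (360, 8500), (420, 8600), (480, 8700), (540, 8800)]
for ts, state in alert_states(readings, threshold=8000, for_seconds=300):
    print(ts, state)
```

The dip at 180 s resets the timer, so the alert fires only after the second breach has lasted a full five minutes. This is what keeps short spikes from paging anyone.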

Update the Prometheus configuration to reference the alert rules:

    # prometheus.yml
    rule_files:
      - alerts.yml

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - localhost:9093

6.2 Install Alertmanager

    wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
    tar xvfz alertmanager-0.24.0.linux-amd64.tar.gz
    # Use the full directory name: a glob such as alertmanager-* would also match the tarball
    cd alertmanager-0.24.0.linux-amd64

    # Create the configuration file
    cat > alertmanager.yml << EOF
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'username'
      smtp_auth_password: 'password'

    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'email-notifications'

    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'admin@example.com'
    EOF

    ./alertmanager &
    cd ..
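Before relying on real alerts, it can be useful to push a hand-made one through Alertmanager and see whether the email arrives. The sketch below only builds the JSON body; `build_test_alert` is a hypothetical helper and the label values are examples. The payload follows the list-of-alerts shape that Alertmanager's v2 API accepts.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, severity, summary, duration_minutes=5):
    """Build a one-alert payload in the shape Alertmanager's v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]


payload = build_test_alert("HighGPUUsage", "warning", "Test alert from the tutorial")
print(json.dumps(payload, indent=2))
```

To send it, POST the JSON to `http://localhost:9093/api/v2/alerts` with a `Content-Type: application/json` header, for example with `urllib.request` or `curl`.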

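7. Complete Startup Script

Create a single script that starts the whole monitoring stack:

    #!/bin/bash
    # start_monitoring.sh
    # Place this script in the directory that contains the unpacked components.
    # Each component starts in its own subshell, so every cd is relative to BASE_DIR.
    BASE_DIR="$(cd "$(dirname "$0")" && pwd)"

    # Start Node Exporter
    (cd "$BASE_DIR/node_exporter-1.3.1.linux-amd64" && ./node_exporter &)

    # Start NVIDIA GPU Exporter (standalone binary, assumed unpacked into nvidia_gpu_exporter/)
    (cd "$BASE_DIR/nvidia_gpu_exporter" && ./nvidia_gpu_exporter &)

    # Start Prometheus
    (cd "$BASE_DIR/prometheus-2.37.0.linux-amd64" && ./prometheus --config.file=prometheus.yml &)

    # Start Alertmanager
    (cd "$BASE_DIR/alertmanager-0.24.0.linux-amd64" && ./alertmanager &)

    # Start Grafana
    (cd "$BASE_DIR/grafana-9.0.0" && ./bin/grafana-server web &)

    echo "Monitoring stack started"
    echo "Prometheus: http://localhost:9090"
    echo "Grafana: http://localhost:3000"
    echo "Alertmanager: http://localhost:9093"

Make the script executable and run it:

    chmod +x start_monitoring.sh
    ./start_monitoring.sh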
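The startup script prints its completion message whether or not each process actually came up, so a quick port check afterwards is worthwhile. Here is a minimal sketch using only the standard library. The helper names are hypothetical, and the ports are the ones used throughout this tutorial.

```python
import socket

# Ports as used throughout this tutorial.
ENDPOINTS = {
    "Prometheus": ("localhost", 9090),
    "Grafana": ("localhost", 3000),
    "Alertmanager": ("localhost", 9093),
    "Node Exporter": ("localhost", 9100),
    "GPU Exporter": ("localhost", 9835),
}


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints, probe=tcp_probe):
    """Return {name: reachable} for every endpoint, using the given probe function."""
    return {name: probe(host, port) for name, (host, port) in endpoints.items()}


def report(results):
    """Format the results as one line per component."""
    return "\n".join(f"{name:<14} {'UP' if ok else 'DOWN'}" for name, ok in results.items())


if __name__ == "__main__":
    print(report(check_endpoints(ENDPOINTS)))
```

An open port only shows that something is listening. The Targets page in the Prometheus UI remains the better check that metrics are actually being scraped.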

8. Verifying the Monitoring in Practice

8.1 Generate Test Traffic

Use the following script to simulate real requests and verify that monitoring works:

    # test_monitoring.py
    import random
    import time

    import requests

    PREDICT_URL = "http://localhost:5000/predict"

    def test_ner():
        payload = {"task_type": "ner", "input_text": "2022年北京冬奥会在北京举行"}
        return requests.post(PREDICT_URL, json=payload, timeout=30)

    def test_relation():
        payload = {"task_type": "relation", "input_text": "梅西在巴塞罗那踢球"}
        return requests.post(PREDICT_URL, json=payload, timeout=30)

    def test_sentiment():
        payload = {"task_type": "sentiment", "input_text": "这个产品质量非常好，服务也很棒"}
        return requests.post(PREDICT_URL, json=payload, timeout=30)

    # Simulated load test
    for i in range(100):
        try:
            # Pick a task type at random
            test_func = random.choice([test_ner, test_relation, test_sentiment])
            response = test_func()
            print(f"Request {i+1}: Status {response.status_code}, "
                  f"Time {response.elapsed.total_seconds():.3f}s")
        except Exception as e:
            print(f"Request {i+1}: Error {e}")
        time.sleep(random.uniform(0.1, 0.5))

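8.2 View the Monitoring Data

After running the test script, check the following in Grafana:

- GPU memory changes: observe peak memory usage during model loading and inference
- Request latency distribution: compare response times across API endpoints
- Error rate: confirm that the system is stable
- Request traffic: understand the load on the system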
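For the latency distribution in particular, the histogram's cumulative buckets allow percentiles to be estimated, which is the idea behind PromQL's `histogram_quantile`. The sketch below is a simplified version of that linear interpolation. `estimate_quantile` is a hypothetical helper, the bucket counts are made up, and the lowest bucket is assumed to start at 0.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies past the last finite bucket; report that bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")  # only reached if the buckets are malformed


# Made-up cumulative counts for 100 requests.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (2.5, 100), (math.inf, 100)]
print(f"p50 ~ {estimate_quantile(0.5, buckets):.2f}s")
print(f"p95 ~ {estimate_quantile(0.95, buckets):.2f}s")
```

The estimate is only as fine-grained as the bucket boundaries, so buckets should be placed around the latencies you care about.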

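9. Production Deployment Recommendations

9.1 Security Configuration

Configure authentication for the monitoring components.

Grafana: set the admin credentials in the Grafana configuration file:

    [security]
    admin_user = admin
    admin_password = your_secure_password

Prometheus: basic authentication is configured through a web configuration file that holds bcrypt password hashes. A plain-text .htpasswd file is not read by Prometheus itself.

    # Generate a bcrypt hash for the admin user (prompts for the password)
    htpasswd -nB admin

    # web.yml, passed to Prometheus with --web.config.file=web.yml
    basic_auth_users:
      admin: <bcrypt hash printed by htpasswd>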

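9.2 Performance Tuning

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

In the Prometheus 2.37 release used here, data retention is set with a command-line flag at startup rather than in prometheus.yml:

    ./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d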
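To judge whether a 15-day retention fits on disk, a rough estimate is retention time multiplied by ingested samples per second and by bytes per sample. The Prometheus documentation gives roughly 1 to 2 bytes per sample after compression. The sketch below applies that formula; the series count is an assumption for illustration, not a measurement from this setup.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_seconds,
                        bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_seconds
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Assumed workload: 3000 active series scraped every 15 s, kept for 15 days.
size = estimate_disk_bytes(retention_days=15, active_series=3000,
                           scrape_interval_seconds=15)
print(f"{size / 1024**3:.2f} GiB")
```

Shortening the scrape interval to 5 s triples the sample rate, so the 5-second jobs in this tutorial dominate the estimate if they expose many series.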

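9.3 High-Availability Deployment

For production, consider the following:

- Prometheus clustering: use Thanos or Cortex for high availability
- Multiple Grafana instances: run several Grafana instances behind a load balancer
- Monitoring data backups: back up Prometheus data regularly
- Multi-channel alerting: send alerts through email, SMS, DingTalk, and other channels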

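10. Summary

In this tutorial you built a complete monitoring stack for a GTE-large application.

Key results:

✅ Real-time monitoring of GPU memory usage
✅ Tracking of API request latency
✅ Error-rate measurement with alerting
✅ Visualization of the monitoring data in Grafana dashboards
✅ Automated alerting

Key advantages:

- Timeliness: metrics are collected every 5 seconds, so problems surface quickly
- Coverage: hardware resources, application performance, and system stability are all monitored
- Visualization: the dashboards show the state of the system at a glance
- Early warning: alerts flag potential problems before they escalate

Directions for further improvement:

- Add business metrics (such as tasks processed and user visits)
- Drive automatic scale-up and scale-down from the monitoring metrics
- Add analysis and trend forecasting on the monitoring data
- Integrate log monitoring to complete the observability picture

With this in place, your GTE-large application has the monitoring foundation it needs before you deploy it to serve real users.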
