vLLM Inference Engine Tutorial 7: CUDA Graph

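1. Concepts

vLLM uses CUDA Graph in the decode phase to improve performance.

What CUDA Graph is: an NVIDIA GPU optimization technique that reduces overheads such as kernel launches. It pays off when a workload consists of many kernels whose actual run time is very short, comparable to or even shorter than the time it takes to launch them, because launch overhead then dominates the total time.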

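CUDA Graph operations:

Capture: issue the computation once and record every GPU operation into a graph (in PyTorch, work queued during capture is recorded rather than executed)
Replay: submit the whole graph to the GPU in one call, skipping per-kernel CPU scheduling

Benefits: fewer CPU-GPU interactions, less driver overhead, and higher GPU utilization. Instead of the CPU directing the GPU through a long series of operations on every step, the work is recorded once and replayed many times, which cuts overhead and improves performance. This suits inference workloads whose input structure is fixed and which run repeatedly.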

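Prerequisites when using it from PyTorch:

Input and output tensor shapes are fixed
The computation has no dynamic control flow (such as if/else that depends on GPU data)
No CPU-GPU synchronization (such as .item() or .cpu())
Tensor addresses are fixed; avoid allocating new memory on every run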
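The control-flow prerequisite can be illustrated with a small sketch. The function names branchy and branch_free are made up for this example, and it runs on the CPU so it can be tried without a GPU; the same rewrite applies to CUDA tensors. It shows one possible way to replace a data-dependent Python branch with torch.where so that the sequence of operations no longer depends on tensor values.

```python
import torch


def branchy(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent Python branch: evaluating the condition converts a tensor
    # to a Python bool, which forces a CPU-GPU sync when x lives on the GPU.
    if x.sum() > 0:
        return x * 2
    return x - 1


def branch_free(x: torch.Tensor) -> torch.Tensor:
    # Same result, but both branches are computed and the selection is done
    # by torch.where, so the issued operations are identical for every input.
    return torch.where(x.sum() > 0, x * 2, x - 1)


pos = torch.tensor([1.0, 2.0, 3.0])
neg = torch.tensor([-1.0, -2.0, -3.0])
same_pos = torch.equal(branchy(pos), branch_free(pos))
same_neg = torch.equal(branchy(neg), branch_free(neg))
print(same_pos, same_neg)
```

The trade-off is that both branches are always evaluated, which is usually acceptable when each branch is cheap.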

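Notes:

Only suitable for computations with a fixed structure (such as the decode phase)
The prefill phase usually does not use it, because prompt lengths vary widely
Serving multiple batch sizes requires pre-capturing one graph per size (for example batch_size=1,2,4,...,512)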

2. Hands-on Demo

(1) Python code

import torch
import torch.nn as nn

D_in = 32
D_out = 32
torch.manual_seed(1)


class CUDAGraphRunner():
    def __init__(self, model):
        self.model = model
        self.cuda_graph = None
        self.graph_input = {}
        self.graph_output = {}

    def capture(self, x, y, z):
        assert self.cuda_graph is None
        self.cuda_graph = torch.cuda.CUDAGraph()
        self.cuda_graph.enable_debug_mode()
        with torch.cuda.graph(self.cuda_graph):
            out = self.model(x, y, z)
        torch.cuda.synchronize()
        self.cuda_graph.debug_dump("graph.dot")
        # Graph input placeholders
        self.graph_input['x'] = x
        self.graph_input['y'] = y
        self.graph_input['z'] = z
        # Graph output placeholder
        self.graph_output['output'] = out

    def forward(self, x, y, z):
        self.graph_input['x'].copy_(x)
        self.graph_input['y'].copy_(y)
        self.graph_input['z'].copy_(z)
        self.cuda_graph.replay()
        return self.graph_output['output']

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


# Build the model and the input data
class simple_model(nn.Module):
    def __init__(self):
        super().__init__()
        num_layer = 10000
        self.blocks = torch.nn.ModuleList([nn.Linear(D_in, D_out) for _ in range(num_layer)])

    def forward(self, x, y, z):
        a = torch.matmul(x, y)
        b = torch.matmul(x, z)
        c = torch.add(a, b)
        for block in self.blocks:
            c = block(c)
        return c


def timed(fn, *args, **kwargs):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    repeat = 10
    start.record()
    for _ in range(repeat):
        result = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / repeat


model = simple_model().cuda()
inp = torch.randn(32, D_in).cuda()
model.eval()
model(x=inp, y=inp, z=inp)  # warm up: triggers initialization of some GPU resources

graph_runner = CUDAGraphRunner(model)
inputs = {"x": inp, "y": inp, "z": inp}
graph_runner.capture(**inputs)
graph_runner(**inputs)  # warm up the CUDA Graph runner

output, cuda_graph_elasped_time = timed(graph_runner, **inputs)
output_ref, ori_infernce_elasped_time = timed(model.forward, **inputs)
torch.cuda.synchronize()
torch.testing.assert_close(output_ref, output, rtol=1e-03, atol=1e-03)
print(f"cuda_graph_elasped_time: {cuda_graph_elasped_time} ms, ori_infernce_elasped_time: {ori_infernce_elasped_time} ms")

Command used to run the code:

nsys profile --trace=cuda,nvtx,osrt --output=cuda_graph_trace --force-overwrite true python cuda_graph.py

Output:

(vllm_python312) [work@iZuf6hp1dkg31metmko4pbZ code]$ nsys profile --trace=cuda,nvtx,osrt --output=cuda_graph_trace --force-overwrite true python cuda_graph.py
Collecting data...
/data/xiehao/conda_workspace/envs/vllm_python312/lib/python3.12/site-packages/torch/cuda/graphs.py:167: UserWarning: DEBUG: calling debug_dump() (Triggered internally at /pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:232.)
  return super().debug_dump(debug_path)
cuda_graph_elasped_time: 47.78128662109375 ms, ori_infernce_elasped_time: 236.9718994140625 ms
Generating '/tmp/nsys-report-93a0.qdstrm'
[1/1] [========================100%] cuda_graph_trace.nsys-rep
Generated: /data/xiehao/workspace/code/cuda_graph_trace.nsys-rep

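The original model takes about 237 ms per forward pass, while CUDA Graph replay takes about 47.8 ms, roughly a 5x speedup.

(2) CUDAGraphRunner initialization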

class CUDAGraphRunner():
    def __init__(self, model):
        self.model = model
        self.cuda_graph = None
        self.graph_input = {}
        self.graph_output = {}

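self.model: the original PyTorch model to accelerate

self.cuda_graph: holds the captured CUDA Graph object, initialized to None

self.graph_input / self.graph_output: dictionaries that hold the static tensors, that is, the input/output buffers whose GPU memory addresses are fixed.

A static tensor is not one whose values never change; it is one whose memory address never changes. Its contents are updated later through .copy_(), but the address stays the same throughout, and CUDA Graph depends on this to work correctly.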
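The difference between updating contents and changing the address can be checked with data_ptr(). The following is a minimal sketch with made-up variable names; it uses CPU tensors so it runs anywhere, and CUDA tensors behave the same way in this respect.

```python
import torch

static_buf = torch.zeros(4)
addr_before = static_buf.data_ptr()

# In-place copy: new values, same storage address.
static_buf.copy_(torch.arange(4, dtype=torch.float32))
addr_after_copy = static_buf.data_ptr()

# Creating another tensor allocates separate storage. A captured graph would
# keep reading the old address and never see this tensor.
rebound = torch.arange(4, dtype=torch.float32)
addr_rebound = rebound.data_ptr()

print(addr_before == addr_after_copy)
print(addr_before == addr_rebound)
```

This is why the runner writes new inputs with .copy_() instead of assigning a new tensor to the dictionary entry.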

(3) The capture method

def capture(self, x, y, z):
    assert self.cuda_graph is None  # make sure we capture only once

The assertion prevents a second capture on the same runner.

    self.cuda_graph = torch.cuda.CUDAGraph()
    self.cuda_graph.enable_debug_mode()  # enable debug mode (optional)

enable_debug_mode(): once enabled, a .dot graph can be generated for visualization (handy for debugging)

    with torch.cuda.graph(self.cuda_graph):
        out = self.model(x, y, z)

The model's forward pass is issued inside the with block, and PyTorch records all GPU operations (kernel launches, memory copies, and so on) into cuda_graph.

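    torch.cuda.synchronize()
    self.cuda_graph.debug_dump("graph.dot")  # save the graph as graph.dot (optional)

synchronize(): waits for all outstanding GPU work to finish before the dump. The capture itself ends when the with block exits;

debug_dump(): exports the graph to graph.dot, which can be visualized with Graphviz.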

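    # Keep the static input/output tensors (this is the key step)
    self.graph_input['x'] = x
    self.graph_input['y'] = y
    self.graph_input['z'] = z
    self.graph_output['output'] = out

This is the important part: what gets stored are references to x, y, z, and out, and therefore their GPU memory addresses;

On replay, the CUDA Graph reads from and writes to these addresses directly.

⚠️ Note: these tensors must stay alive after capture; they must not be freed or reallocated.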

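(4) The forward method

def forward(self, x, y, z):
    self.graph_input['x'].copy_(x)
    self.graph_input['y'].copy_(y)
    self.graph_input['z'].copy_(z)

Copies the new input data into the memory of the static tensors;

.copy_() is used so that only the contents change and the addresses stay the same.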

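    self.cuda_graph.replay()
    return self.graph_output['output']

replay(): replays the whole recorded GPU workload with a single call, skipping the per-kernel CPU scheduling overhead;

The return value is the output tensor created during capture, whose contents have been overwritten by this replay.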
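One consequence of returning the static output tensor is that the next replay overwrites it. The sketch below is an illustration with made-up names, not part of the original demo: it keeps both an alias and a clone of the first result, then replays again. It needs a CUDA device and returns None otherwise.

```python
import torch


def replay_twice():
    """Return (aliased first result, cloned first result, second result) as lists,
    or None when no CUDA device is available."""
    if not torch.cuda.is_available():
        return None
    static_in = torch.zeros(4, device="cuda")

    # Warm up on a side stream before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        _ = static_in * 2
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = static_in * 2

    static_in.copy_(torch.ones(4, device="cuda"))
    graph.replay()
    first_alias = static_out          # same buffer the graph writes to
    first_copy = static_out.clone()   # independent snapshot of the first result

    static_in.copy_(torch.full((4,), 5.0, device="cuda"))
    graph.replay()                    # overwrites static_out in place
    return first_alias.tolist(), first_copy.tolist(), static_out.tolist()


result = replay_twice()
print(result)
```

If a result has to outlive the next call, cloning it as above is one option, at the cost of an extra copy.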

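(5) Defining the test model simple_model

class simple_model(nn.Module):
    def __init__(self):
        super().__init__()
        num_layer = 10000
        self.blocks = torch.nn.ModuleList([nn.Linear(D_in, D_out) for _ in range(num_layer)])

This builds a very deep model (10000 Linear layers) in order to:

increase the amount of GPU work by issuing a long sequence of small kernels;
make the CPU scheduling overhead proportionally larger, which highlights the speedup from CUDA Graph.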

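    def forward(self, x, y, z):
        a = torch.matmul(x, y)
        b = torch.matmul(x, z)
        c = torch.add(a, b)
        for block in self.blocks:
            c = block(c)
        return c

Takes three matrices x, y, z as input;

First performs two matrix multiplications and one addition;

Then applies the 10000 linear layers;

The whole computation is static (no if/else, no CPU synchronization), so it meets the CUDA Graph requirements.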

(6) Defining the timing function timed

def timed(fn, *args, **kwargs):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    repeat = 10
    start.record()
    for _ in range(repeat):
        result = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / repeat

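torch.cuda.Event is used to measure GPU execution time accurately.

What an Event is: a timestamp marker on the GPU. When event.record() is called, the event is appended to the queue of the current CUDA stream, and the timestamp is taken only when the GPU reaches that point. In other words, record() does not capture the current time immediately; it reserves a point in the GPU's execution order.

start.record(): inserts the start event into the stream queue

end.record(): inserts the end event at the end of the stream queue

torch.cuda.synchronize(): waits until the GPU has finished all queued work, including the end event

start.elapsed_time(end): returns the time between the two events in milliseconds; dividing by repeat gives the average per call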
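As a variation on timed, the sketch below waits on the end event alone through end.synchronize() instead of synchronizing the whole device, and falls back to time.perf_counter when no GPU is present so that it can run anywhere. The helper name time_ms and the matrix size are assumptions made for illustration.

```python
import time

import torch


def time_ms(fn, repeat=10):
    """Average time per call in milliseconds: CUDA events on GPU, perf_counter on CPU."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(repeat):
            fn()
        end.record()
        end.synchronize()  # block only until the end event has been reached
        return start.elapsed_time(end) / repeat
    t0 = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / repeat


device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
avg_ms = time_ms(lambda: a @ a)
print(f"{avg_ms:.4f} ms per call on {device}")
```

Measuring GPU work with a CPU clock and no synchronization would mostly capture the time to enqueue the kernels, which is why the GPU path relies on events.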

(7) Main program

model = simple_model().cuda()
inp = torch.randn(32, D_in).cuda()
model.eval()

Moves the model and the input to the GPU;

model.eval(): switches layers such as dropout and batchnorm to their inference behavior.

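model(x=inp, y=inp, z=inp)  # warm up

Runs one ordinary inference pass first to trigger CUDA context initialization, cuDNN benchmarking, and similar one-time setup, so that a slow first run does not distort the timing.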

graph_runner = CUDAGraphRunner(model)
inputs = {"x": inp, "y": inp, "z": inp}
graph_runner.capture(**inputs)

inp serves as the representative input for capture;

Requirement: every later input must have the same shape as inp (32×32).

graph_runner(**inputs)  # warm up the CUDA Graph runner

Runs one graph replay before timing so that the graph is already loaded on the GPU and the slower first replay is not measured.

torch.testing.assert_close(output_ref, output, rtol=1e-03, atol=1e-03)

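Checks that the two paths produce the same output, allowing for small floating-point differences;

If they differ, the CUDA Graph is being used incorrectly.

output_ref: the result of ordinary inference, used as the reference. torch.testing.assert_close takes its arguments in the order (actual, expected), so in this call output_ref occupies the actual slot;

output: the result of CUDA Graph inference, which occupies the expected slot here. Passing output first and output_ref second would match the usual convention;

rtol=1e-3: relative tolerance;

atol=1e-3: absolute tolerance.

The check applied to each element, where a is the actual value and e is the expected value: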

|a - e| ≤ atol + rtol * |e|

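# ❌ Risky: do not compare floating-point results with exact equality
assert torch.equal(output_ref, output)

# ❌ Not flexible enough: one absolute threshold for every element
assert (output_ref - output).abs().max() < 1e-3

# ✅ Recommended: assert_close (clear semantics, sensible tolerances)
torch.testing.assert_close(output_ref, output, rtol=1e-3, atol=1e-3)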
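To make the tolerance rule concrete, here is a small worked example with made-up values. It evaluates |a - e| ≤ atol + rtol * |e| element by element and shows a case that assert_close accepts while a fixed absolute threshold of 1e-3 would reject it.

```python
import torch

expected = torch.tensor([1.0, 100.0, 0.0])
actual = torch.tensor([1.0015, 100.05, 0.0005])
rtol, atol = 1e-3, 1e-3

# Per-element budget: atol + rtol * |e| gives 0.002, 0.101 and 0.001.
allowed = atol + rtol * expected.abs()
within = (actual - expected).abs() <= allowed

# Passes: every element stays inside its own budget.
torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)

# The largest absolute error is about 0.05, far above a flat 1e-3 threshold.
max_abs_err = (actual - expected).abs().max().item()
print(within.tolist(), max_abs_err)
```

The relative term lets large values tolerate proportionally larger errors, while the absolute term covers values near zero.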

3. A simplified version of the vLLM implementation

import torch
import torch.nn as nn

D_in = 1024
D_out = 2048


class ModelRunner():
    def __init__(self, model):
        self.model = model
        self.graph_runners = {}  # batch size (int) -> CUDAGraphRunner

    @torch.inference_mode()
    def capture_model(self):
        for batch in [1, 2, 3, 4]:  # batch sizes to capture ahead of time
            static_input = torch.randn(batch, D_in).cuda()
            graph_runner = CUDAGraphRunner(self.model)
            graph_runner.capture(static_input)
            self.graph_runners[batch] = graph_runner

    @torch.inference_mode()
    def execute_model(self, x):
        batch = x.size(0)
        if batch in self.graph_runners:
            model_executable = self.graph_runners[batch]  # runner captured for this batch size
        else:
            print("warning, no cudagraph_runner, back to origin model")
            model_executable = self.model  # fall back to the original model
        return model_executable(x)


class CUDAGraphRunner():
    def __init__(self, model):
        self.model = model
        self.cuda_graph = None
        self.graph_input = None
        self.graph_output = None

    def capture(self, x):
        assert self.cuda_graph is None
        self.cuda_graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.cuda_graph):
            out = self.model(x)
        torch.cuda.synchronize()
        self.graph_input = x  # graph input placeholder
        self.graph_output = out  # graph output placeholder

    def forward(self, x):
        self.graph_input.copy_(x)
        self.cuda_graph.replay()
        return self.graph_output

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


# Build the model and the input data
model = nn.Linear(D_in, D_out).cuda()
model.eval()
inp = torch.randn(4, D_in).cuda()
output_ref = model(inp)

model_runner = ModelRunner(model)
model_runner.capture_model()  # build the CUDA Graphs
output = model_runner.execute_model(inp)  # run through the matching graph
torch.testing.assert_close(output_ref, output, rtol=1e-03, atol=1e-03)

vLLM captures graphs for a preconfigured list of batch sizes. The more sizes are captured, the more GPU memory is consumed while building the CUDA Graphs.
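The ModelRunner above falls back to the original model whenever the batch size has no exact match. One alternative, sketched here as an assumption and not taken from the code above, is to round the batch up to the nearest captured size and pad the input, so that a short list of captured sizes covers every batch up to the largest one. The size list and the helper name pick_captured_size are hypothetical.

```python
import bisect

# Hypothetical list of captured batch sizes, sorted in ascending order.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16]


def pick_captured_size(batch, sizes=CAPTURED_BATCH_SIZES):
    """Return the smallest captured size >= batch, or None if batch exceeds them all."""
    i = bisect.bisect_left(sizes, batch)
    return sizes[i] if i < len(sizes) else None


picked = [pick_captured_size(b) for b in (1, 3, 8, 9, 17)]
print(picked)  # [1, 4, 8, 16, None]
```

With this scheme a batch of 3 would run on the graph captured for 4 with one padded row, and a batch of 17 would still need the fallback path. Fewer captured sizes save memory at the cost of more wasted computation on padding.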
