In the AI era, GPUs have become one of an enterprise's most valuable compute resources. How to efficiently manage, schedule, and share these expensive heterogeneous compute resources in Kubernetes is a core problem every cloud-native AI platform must solve.
Introduction: New Challenges in AI Computing
Pain points of the traditional GPU usage model:
- Resource silos: GPU servers are managed in isolation and never form a shared resource pool
- Low utilization: a single job rarely saturates a whole card, and average GPU utilization is below 30%
- Difficult scheduling: GPUs are assigned by hand, with no unified scheduling or queuing mechanism
- High cost: a single high-end GPU such as an A100 or H100 costs tens of thousands to hundreds of thousands of RMB
The value of Kubernetes GPU management:
- Resource pooling: scattered GPU resources are managed centrally as a shared pool
- Elastic scaling: GPU resources are allocated and released dynamically according to AI workload demand
- Cost optimization: vGPU partitioning and workload co-location raise resource utilization
- Standardized operations: unified monitoring, operations, and fault-handling mechanisms
1. Kubernetes GPU Infrastructure
1.1 The Device Plugin Mechanism
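The device plugin framework is Kubernetes' standard way to expose vendor hardware (GPUs, FPGAs, RDMA NICs, and so on) to workloads without modifying the kubelet or the scheduler. A device plugin typically runs as a DaemonSet on every equipped node: it registers itself with the kubelet over a Unix socket under /var/lib/kubelet/device-plugins/, streams its list of healthy devices through the gRPC ListAndWatch call, and the kubelet then publishes them as an extended resource (for NVIDIA GPUs, nvidia.com/gpu) in the node's Capacity and Allocatable. When a container that requests the resource starts, the kubelet calls the plugin's Allocate method, and the plugin returns the device files, environment variables, and mounts to inject into the container. Extended resources can only be requested in whole units and cannot be overcommitted, which is why whole-card allocation is the default and finer-grained sharing requires MIG or vGPU solutions.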
1.2 Deploying the NVIDIA Device Plugin
Basic deployment configuration
```yaml
# nvidia-device-plugin-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
  labels:
    k8s-app: nvidia-device-plugin
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: nvidia-device-plugin
  template:
    metadata:
      labels:
        k8s-app: nvidia-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        # Run only on nodes that actually have GPUs
        nvidia.com/gpu.present: "true"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: nvidia-driver
          mountPath: /usr/local/nvidia
          readOnly: true
        env:
        - name: PASS_DEVICE_SPECS
          value: "true"
        - name: FAIL_ON_INIT_ERROR
          value: "true"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 100m
            memory: 300Mi
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: nvidia-driver
        hostPath:
          path: /usr/lib/modules/nvidia
```
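To confirm the deployment worked, a couple of quick checks (a sketch; substitute your own node name) verify that the plugin pods are running and that the node now advertises the nvidia.com/gpu extended resource:

```bash
# Device plugin pods should be Running on every GPU node
kubectl -n kube-system get pods -l k8s-app=nvidia-device-plugin -o wide

# The node should now report nvidia.com/gpu in Allocatable
# (the backslash escapes the dot inside the resource name)
kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```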
Node labels and taints

```bash
# Label GPU nodes
kubectl label nodes <node-name> nvidia.com/gpu.present=true
kubectl label nodes <node-name> accelerator=nvidia-tesla-a100
kubectl label nodes <node-name> gpu-type=a100
kubectl label nodes <node-name> gpu-memory=40Gi

# Add a taint (optional)
kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule

# Inspect the node's GPU capacity
kubectl describe node <node-name> | grep -A10 "Capacity"
```

1.3 GPU Resource Requests and Limits
```yaml
# gpu-pod-example.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    app: ai-training
spec:
  # Node selection
  nodeSelector:
    accelerator: nvidia-tesla-a100
  # Tolerate the GPU taint
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["/bin/bash"]
    args: ["-c", "nvidia-smi && sleep infinity"]
    # GPU resource requests
    resources:
      limits:
        # Request one whole GPU card
        nvidia.com/gpu: 1
        # A specific GPU model can also be requested
        # nvidia.com/gpu.a100: 1
        # nvidia.com/gpu.v100: 2
        # GPU memory limit (requires MIG or vGPU)
        # nvidia.com/gpumem: 10Gi
        # Other resources
        cpu: "4"
        memory: "16Gi"
      requests:
        nvidia.com/gpu: 1
        cpu: "2"
        memory: "8Gi"
    # Security context: privileged mode is NOT required for GPU access when
    # the device plugin is used, so it is left disabled here
    # securityContext:
    #   privileged: true
    # Environment variables
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility,graphics,video"
    # Mount the NVIDIA driver
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /usr/lib/modules/nvidia
```
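Once the Pod is running, a quick sanity check (a sketch using the Pod and node names from above) is to run nvidia-smi inside the container and to see how many of the node's GPUs are already allocated:

```bash
# The container should see exactly the single GPU it was allocated
kubectl exec gpu-pod -- nvidia-smi

# The scheduler's view: allocated vs. allocatable GPUs on the node
kubectl describe node <node-name> | grep -A8 "Allocated resources"
```

Note that nvidia.com/gpu is an extended resource: it can only be requested in whole units, requests and limits must be equal, and it cannot be overcommitted, which is why fractional sharing requires MIG or a vGPU solution.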