Topology-Aware Scheduling for Advanced-Indexing Communication Primitives in CANN hixl

CANN organization: https://atomgit.com/cann
hixl repository: https://atomgit.com/cann/hixl

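Preface

Large-model inference and training increasingly need unstructured, sparse, or on-demand data transfers, and traditional collective communication such as AllReduce is not flexible enough for these cases. The hixl (Huawei Xfer Library) repository in the open-source CANN (Compute Architecture for Neural Networks) project (https://atomgit.com/cann/hixl) provides point-to-point, one-sided communication primitives with advanced-indexing semantics. Its core innovation is a topology-aware link scheduling mechanism: it dynamically selects the best transfer path based on the physical interconnect topology between devices, link bandwidth, and link load, and it efficiently handles memory copies that carry index offsets.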

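1. Requirements and Design of Advanced-Indexing Communication Primitives

1.1 Limitations of Traditional Communication

Standard memcpy and send/recv require the data to be a contiguous block of memory. In scenarios such as KV Cache exchange and MoE expert routing, however, the data often exists as a set of non-contiguous indices: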

    # Example: transfer only the KV Cache rows for tokens [10, 25, 100]
    indices = [10, 25, 100]
    send(KV_cache[indices])  # <- requires advanced-indexing support

1.2 hixl's Indexed Communication Interface

hixl defines indexed send/receive interfaces in include/hixl/hixl_api.h:

    // hixl/include/hixl/hixl_api.h
    HIXL_EXPORT HixlResult HixlSendIndexed(
        HixlContext ctx,
        const void* src_base,        // source memory base address
        const int64_t* src_indices,  // source index array (unit: elements)
        size_t num_indices,          // number of indices
        size_t elem_size,            // bytes per element
        int dst_rank,
        uint64_t remote_key,         // handle of the registered remote memory
        int64_t remote_offset        // remote offset (unit: elements)
    );

    HIXL_EXPORT HixlResult HixlRecvIndexed(
        HixlContext ctx,
        void* dst_base,
        const int64_t* dst_indices,
        size_t num_indices,
        size_t elem_size,
        int src_rank,
        uint64_t local_key,
        int64_t local_offset);

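✅ Key features:
Supports arbitrary index sequences;
Offsets are expressed in elements, not bytes;
One-sided operation (the remote side does not need to participate).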
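To make the element-granularity semantics concrete, here is a minimal local model of what an indexed send does to memory. It is only a sketch: `ModelIndexedSend` is a hypothetical helper, not part of the hixl API, and it assumes the gathered elements land contiguously at the destination starting at the given element offset, which is how the single `remote_offset` parameter reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical local model of an indexed send (not a hixl function):
// gathers the elements named by src_indices from src_base and writes them
// back to back into dst_base, starting at element offset dst_offset.
void ModelIndexedSend(const void* src_base, const int64_t* src_indices,
                      std::size_t num_indices, std::size_t elem_size,
                      void* dst_base, int64_t dst_offset) {
    const char* src = static_cast<const char*>(src_base);
    char* dst = static_cast<char*>(dst_base);
    for (std::size_t i = 0; i < num_indices; ++i) {
        // Indices and offsets count elements, so both are scaled by
        // elem_size to obtain byte positions.
        std::size_t from = static_cast<std::size_t>(src_indices[i]) * elem_size;
        std::size_t to = (static_cast<std::size_t>(dst_offset) + i) * elem_size;
        std::memcpy(dst + to, src + from, elem_size);
    }
}
```

With indices {10, 25, 100}, an element size of `sizeof(float)` and a destination offset of 2, the three source elements end up in destination slots 2, 3 and 4, and no other slot is touched.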

2. Topology-Aware Link Abstraction and Modeling

2.1 Device Topology Description

hixl obtains inter-device connectivity information through TopologyManager:

    // hixl/core/topology/topology_manager.h
    struct LinkInfo {
        int src_dev;
        int dst_dev;
        std::string protocol;  // "HCCS", "RoCE", "PCIe"
        double bandwidth_gb;   // measured bandwidth
        int latency_us;
        bool is_available;
    };

    class TopologyManager {
    public:
        static const std::vector<LinkInfo>& GetDeviceLinks();
    };

This information can be loaded from the system runtime or from a configuration file (topology.yaml).
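As an illustration of how such a link table might be consumed, the sketch below filters it down to the usable links for one device pair and orders them by measured bandwidth. `LinksBetween` is a hypothetical helper, and the `LinkInfo` struct is repeated only so the snippet compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Mirrors the LinkInfo fields shown above so the sketch is self-contained.
struct LinkInfo {
    int src_dev;
    int dst_dev;
    std::string protocol;
    double bandwidth_gb;
    int latency_us;
    bool is_available;
};

// Hypothetical helper: available links from src to dst, fastest first.
std::vector<LinkInfo> LinksBetween(const std::vector<LinkInfo>& all,
                                   int src, int dst) {
    std::vector<LinkInfo> out;
    for (const LinkInfo& l : all) {
        if (l.src_dev == src && l.dst_dev == dst && l.is_available) {
            out.push_back(l);
        }
    }
    std::sort(out.begin(), out.end(),
              [](const LinkInfo& a, const LinkInfo& b) {
                  return a.bandwidth_gb > b.bandwidth_gb;
              });
    return out;
}
```

Links are treated as directed here, so a table that lists only 0 to 1 yields nothing for 1 to 0; whether hixl models links as directed is not stated in the text.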

2.2 Virtual Link Pool

For each (src, dst) pair, hixl maintains a virtual link pool that contains multiple physical links:

    // hixl/core/link/link_pool.h
    class LinkPool {
    public:
        struct VirtualLink {
            std::string backend;  // "hccs", "rdma"
            void* handle;         // underlying handle (e.g. a QP or fd)
            double current_load;  // current load (0 to 1)
            size_t max_msg_size;  // maximum message size
        };

        std::vector<VirtualLink> GetAvailableLinks(int src, int dst);
    };

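3. Topology-Aware Scheduling Algorithm

3.1 Scheduling Goals

The scheduler has to balance the following goals:

Minimize transfer time: prefer high-bandwidth links;
Load balancing: avoid overloading a single link;
Protocol adaptation: large messages go over HCCS, small messages over RoCE.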

3.2 Cost Model and Link Scoring

hixl introduces a link scoring function:

$$\text{Score}(l) = \frac{\text{Bandwidth}_l}{1 + \alpha \cdot \text{Load}_l} - \beta \cdot \text{Latency}_l$$

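where $\alpha$ and $\beta$ are tunable weights (kLoadWeight and kLatencyWeight in the code that follows). The code also scales the bandwidth term by a message-size factor: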

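    // hixl/core/scheduler/topology_aware_scheduler.cpp
    // Note: this reads link.bandwidth_gb and link.latency_us, which the
    // VirtualLink struct in 2.2 does not declare; they are presumably carried
    // over from LinkInfo.
    double TopologyAwareScheduler::CalculateLinkScore(
        const LinkPool::VirtualLink& link, size_t msg_size) {
        double bandwidth_factor = link.bandwidth_gb;
        double load_penalty = 1.0 + kLoadWeight * link.current_load;
        double latency_penalty = kLatencyWeight * link.latency_us / 1000.0;  // convert to ms

        // Large messages weight bandwidth more; small messages weight latency more
        double size_factor = (msg_size > kLargeMsgThreshold) ? 1.0 : 0.3;

        return (bandwidth_factor / load_penalty) * size_factor - latency_penalty;
    }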
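To see how the score behaves on concrete numbers, the following standalone sketch re-implements the same formula. The constant values are illustrative assumptions (the text does not give the real values of kLoadWeight, kLatencyWeight or kLargeMsgThreshold), and `LinkStats` is a hypothetical stand-in for a link that carries bandwidth, load and latency together.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative values only; the real constants are not given in the text.
constexpr double kLoadWeight = 2.0;
constexpr double kLatencyWeight = 1.0;
constexpr std::size_t kLargeMsgThreshold = 64 * 1024;

struct LinkStats {        // hypothetical stand-in for a scored link
    double bandwidth_gb;
    double current_load;  // 0..1
    double latency_us;
};

// Same arithmetic as the scoring function above.
double ScoreLink(const LinkStats& link, std::size_t msg_size) {
    double load_penalty = 1.0 + kLoadWeight * link.current_load;
    double latency_penalty = kLatencyWeight * link.latency_us / 1000.0;
    double size_factor = (msg_size > kLargeMsgThreshold) ? 1.0 : 0.3;
    return (link.bandwidth_gb / load_penalty) * size_factor - latency_penalty;
}
```

Under these assumed weights, a 100 GB/s link at 90% load scores about 35.7 for a 1 MiB message, while an idle 50 GB/s link scores about 50.0, so the slower idle link wins. Because latency is converted to milliseconds, the latency term stays small next to the bandwidth term in this sketch; ranking is driven mostly by bandwidth and load unless kLatencyWeight is set much higher.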

3.3 Scheduling Execution Flow

    // hixl/core/scheduler/scheduler.cpp
    HixlResult Scheduler::ScheduleTransfer(const TransferRequest& req,
                                           SelectedLink* out_link) {
        auto links = link_pool_.GetAvailableLinks(req.src, req.dst);
        if (links.empty()) return HIXL_ERROR_NO_LINK;

        // Compute a score for every link
        std::vector<std::pair<double, size_t>> scores;
        for (size_t i = 0; i < links.size(); ++i) {
            double score = CalculateLinkScore(links[i], req.total_bytes);
            scores.emplace_back(score, i);
        }

        // Pick the highest-scoring link
        std::sort(scores.rbegin(), scores.rend());
        *out_link = links[scores[0].second];

        // Update the load (exponential moving average).
        // Note: GetAvailableLinks returns by value in 2.2, so `links` is a copy
        // here and this update must be written back to the pool to take effect.
        links[scores[0].second].current_load =
            0.9 * links[scores[0].second].current_load +
            0.1 * (static_cast<double>(req.total_bytes) / kMaxThroughput);

        return HIXL_SUCCESS;
    }
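The select-then-charge loop can be sketched in isolation as follows. This is a simplified model under stated assumptions: the score is reduced to bandwidth / (1 + load), `PoolLink` and `PickAndCharge` are hypothetical names, `kMaxBytes` stands in for kMaxThroughput, and the pool is passed by reference so that the load update persists between calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PoolLink {          // hypothetical pool entry
    double bandwidth_gb;
    double current_load;   // 0..1, smoothed
};

// Assumed normalisation constant, standing in for kMaxThroughput.
constexpr double kMaxBytes = 1024.0 * 1024.0;

// Picks the link with the best bandwidth / (1 + load) ratio and updates its
// load in place with the same 0.9 / 0.1 moving average as above.
std::size_t PickAndCharge(std::vector<PoolLink>& pool, std::size_t bytes) {
    assert(!pool.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < pool.size(); ++i) {
        double s_i = pool[i].bandwidth_gb / (1.0 + pool[i].current_load);
        double s_b = pool[best].bandwidth_gb / (1.0 + pool[best].current_load);
        if (s_i > s_b) best = i;
    }
    double sample = static_cast<double>(bytes) / kMaxBytes;
    if (sample > 1.0) sample = 1.0;  // keep the load within 0..1
    pool[best].current_load = 0.9 * pool[best].current_load + 0.1 * sample;
    return best;
}
```

With two idle links of 100 and 95 GB/s and repeated 1 MiB requests, the picks alternate 0, 1, 0: each selection raises the chosen link's load enough for the other to overtake it, which is the load-balancing effect the moving average is meant to produce.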

🔧 Dynamic tuning: kLoadWeight and kLatencyWeight can be adjusted through environment variables (for example HIXL_SCHED_WEIGHT_LOAD).
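One defensive way to read such a weight is sketched below. How hixl actually parses the variable is not described in the text; `ParseWeight` and `WeightFromEnv` are hypothetical helpers that fall back to a default whenever the value is missing or malformed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Parses a non-negative, finite weight from text. Returns fallback when the
// text is missing, empty, not a number, has trailing characters, or is
// negative, infinite or NaN.
double ParseWeight(const char* text, double fallback) {
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    double value = std::strtod(text, &end);
    if (end == text || *end != '\0') return fallback;
    if (!std::isfinite(value) || value < 0.0) return fallback;
    return value;
}

// Reads a weight from the named environment variable.
double WeightFromEnv(const char* name, double fallback) {
    return ParseWeight(std::getenv(name), fallback);
}
```

A call such as `WeightFromEnv("HIXL_SCHED_WEIGHT_LOAD", 2.0)` would then return the configured value, or 2.0 if the variable is unset or invalid; the default of 2.0 is an assumption for illustration.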

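4. Efficient Packing and Transfer of Indexed Data

4.1 Converting Indices to Scatter-Gather Descriptors

hixl converts the index array into a scatter-gather list that the underlying communication library understands: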

    // hixl/core/transfer/indexed_transfer.cpp
    std::vector<hcomm_sge> BuildScatterGatherList(const void* base,
                                                  const int64_t* indices,
                                                  size_t num_indices,
                                                  size_t elem_size) {
        std::vector<hcomm_sge> sge_list;
        sge_list.reserve(num_indices);

        for (size_t i = 0; i < num_indices; ++i) {
            hcomm_sge sge;
            sge.addr = reinterpret_cast<uint64_t>(
                static_cast<const char*>(base) + indices[i] * elem_size);
            sge.length = elem_size;
            sge.lkey = memory_registry_.GetLKey(sge.addr);  // RDMA memory key
            sge_list.push_back(sge);
        }
        return sge_list;
    }
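The function above emits one descriptor per index. A possible refinement, not something the text attributes to hixl, is to merge runs of consecutive indices into a single descriptor so the list stays short. The sketch below shows the idea on plain byte ranges; `Segment` and `CoalesceIndices` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Segment {          // hypothetical: one contiguous byte range
    std::size_t offset;   // byte offset from the base address
    std::size_t length;   // length in bytes
};

// Merges runs of consecutive indices (i, i+1, i+2, ...) into one segment,
// preserving the order in which the indices were given.
std::vector<Segment> CoalesceIndices(const int64_t* indices,
                                     std::size_t num_indices,
                                     std::size_t elem_size) {
    std::vector<Segment> segs;
    for (std::size_t i = 0; i < num_indices; ++i) {
        std::size_t off = static_cast<std::size_t>(indices[i]) * elem_size;
        if (!segs.empty() &&
            segs.back().offset + segs.back().length == off) {
            segs.back().length += elem_size;   // extends the previous run
        } else {
            segs.push_back({off, elem_size});  // starts a new run
        }
    }
    return segs;
}
```

For indices {10, 11, 12, 25, 100, 101} with 4-byte elements this yields three segments instead of six. A real implementation would also have to make sure a merged range stays inside one registered memory region, since each descriptor carries a single local key.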

4.2 One-Sided Write (RDMA Write)

For backends that support one-sided operations (such as RoCE), hixl issues an RDMA Write directly:

    // hixl/plugins/rdma/rdma_backend.cpp
    HixlResult RDMABackend::PostIndexedWrite(
        const std::vector<hcomm_sge>& local_sge,
        uint64_t remote_addr,
        uint32_t rkey,
        CommRequest* req) {
        // Build the work request (WR)
        ibv_send_wr wr{};
        wr.wr_id = reinterpret_cast<uint64_t>(req);
        // The cast assumes hcomm_sge is layout-compatible with ibv_sge
        wr.sg_list = const_cast<ibv_sge*>(
            reinterpret_cast<const ibv_sge*>(local_sge.data()));
        wr.num_sge = static_cast<int>(local_sge.size());
        wr.opcode = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey = rkey;

        // Post it
        ibv_send_wr* bad_wr = nullptr;
        if (ibv_post_send(qp_, &wr, &bad_wr)) {
            return HIXL_ERROR_RDMA_POST;
        }
        return HIXL_SUCCESS;
    }
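In the verbs API, the number of scatter-gather entries in one work request is bounded by the queue pair's `max_send_sge` capability, so a long gather list generally has to be split across several work requests. The sketch below plans such a split; it is an assumption about how this could be handled, not a description of hixl's code, and `WriteBatch` and `PlanWriteBatches` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct WriteBatch {         // hypothetical: one work request's worth
    std::size_t first_sge;  // index of the first entry in the gather list
    std::size_t num_sge;    // number of entries in this batch
    uint64_t remote_addr;   // remote address this batch starts writing at
};

// Splits a gather list into batches of at most max_sge entries. Each batch
// targets the remote address right after the bytes of the previous batch,
// so the remote region is filled contiguously.
std::vector<WriteBatch> PlanWriteBatches(
        const std::vector<uint32_t>& sge_lengths,
        std::size_t max_sge, uint64_t remote_addr) {
    assert(max_sge > 0);
    std::vector<WriteBatch> batches;
    for (std::size_t i = 0; i < sge_lengths.size(); i += max_sge) {
        std::size_t n = sge_lengths.size() - i;
        if (n > max_sge) n = max_sge;
        batches.push_back({i, n, remote_addr});
        for (std::size_t j = i; j < i + n; ++j) {
            remote_addr += sge_lengths[j];  // advance past this batch's bytes
        }
    }
    return batches;
}
```

With 100 entries of 256 bytes each and an assumed limit of 30 entries per request, the plan contains four batches (30, 30, 30 and 10 entries), and the last one starts 90 * 256 bytes past the initial remote address.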

✅ Zero-copy: no intermediate buffer is needed; data is read directly from source memory.

5. Cooperation with hcomm and Fault Recovery

5.1 Unified Backend Interface

hixl performs the actual transfers through hcomm's pluggable backends:

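    // hixl/core/backend/backend_adapter.cpp
    class HcommBackendAdapter : public ITransferBackend {
    public:
        HixlResult SendIndexed(...) override {
            // Convert to an hcomm ISend call
            hcomm_request_t hreq;
            hcomm_isend_scatter(local_sge.data(), local_sge.size(),
                                remote_addr, remote_size, dst_rank, tag, &hreq);

            // Bind it to the hixl request.
            // Note: hreq is a stack local, so its address stops being valid once
            // this function returns; a durable key is needed in practice.
            request_map_[&hreq] = req;
            return HIXL_SUCCESS;
        }
    };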

5.2 Automatic Cleanup of Faulty Links

As described in PR !164, hixl supports heartbeat detection and automatic link teardown:

    // hixl/core/link/link_monitor.cpp
    void LinkMonitor::CheckHeartbeat() {
        for (auto& link : active_links_) {
            if (GetCurrentTime() - link.last_heartbeat > kTimeout) {
                LOG(WARNING) << "Link to rank " << link.dst << " timeout";
                scheduler_->MarkLinkUnhealthy(link);

                // Trigger an automatic rebuild (if AutoConnect is enabled)
                if (auto_connect_enabled_) {
                    RebuildLink(link);
                }
            }
        }
    }
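The timeout test at the heart of this loop can be written against `std::chrono::steady_clock`, which is monotonic, so wall-clock adjustments cannot produce spurious timeouts. The sketch below is a standalone illustration with hypothetical names (`MonitoredLink`, `FindTimedOut`); the clock hixl's GetCurrentTime uses is not stated in the text.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;

struct MonitoredLink {                 // hypothetical link record
    int dst;                           // destination rank
    Clock::time_point last_heartbeat;  // time of the last heartbeat seen
};

// Returns the destination ranks whose last heartbeat is older than timeout.
// The current time is a parameter so the check is easy to test.
std::vector<int> FindTimedOut(const std::vector<MonitoredLink>& links,
                              Clock::time_point now,
                              Clock::duration timeout) {
    std::vector<int> stale;
    for (const MonitoredLink& link : links) {
        if (now - link.last_heartbeat > timeout) {
            stale.push_back(link.dst);
        }
    }
    return stale;
}
```

In production the caller would pass `Clock::now()`; taking the time as an argument keeps the check deterministic under test.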

6. Measured Performance and Application Scenarios

6.1 KV Cache Transfer Performance (A3 chip, 8 cards)

Transfer mode	Bandwidth	Latency (1 MB)
Contiguous memcpy	119 GB/s	8.4 μs
hixl indexed transfer (100 random rows)	92 GB/s	12.1 μs
Traditional gather+send	45 GB/s	28.7 μs

💡 Advantage: compared with gather+send, hixl avoids one memory copy and roughly doubles the bandwidth (92 GB/s vs. 45 GB/s).
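As a quick arithmetic check on the table above, assuming 1 MB means $10^6$ bytes and 1 GB/s means $10^9$ bytes per second:

```latex
\frac{92\ \text{GB/s}}{45\ \text{GB/s}} \approx 2.04,
\qquad
\frac{28.7\ \mu\text{s}}{12.1\ \mu\text{s}} \approx 2.37,
\qquad
\frac{10^{6}\ \text{B}}{119 \times 10^{9}\ \text{B/s}} \approx 8.4\ \mu\text{s}
```

The last value matches the contiguous memcpy latency in the table, which suggests that the 1 MB latency for that row is essentially pure transfer time at the measured bandwidth.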

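6.2 Typical Application Scenarios

LLM inference: on-demand exchange of KV Cache across devices;
MoE: sparse routing of tokens to the experts they activate;
RL training: asynchronous sampling transfers from the experience replay pool.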

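Conclusion

With advanced-indexing communication primitives and topology-aware scheduling, CANN hixl addresses the performance bottleneck of unstructured data transfer. Beyond flexible index semantics, its close cooperation with hcomm's pluggable backends enables path selection across heterogeneous interconnects and automatic recovery from link failures. As the component of the CANN communication stack aimed at fine-grained, low-latency scenarios, hixl provides efficient and reliable communication infrastructure for large-model inference, reinforcement learning, and similar applications. As support for more index patterns (such as block-sparse and strided) is added, its capabilities will continue to expand.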
