news 2026/5/9 18:29:20

CANN PTO Tile Library

张小明

Front-end Development Engineer

[Free download link] pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa

PTO Tile Library

Parallel Tile Operation (PTO) is a virtual ISA for tile-oriented programming defined by Ascend CANN. This repository provides PTO Tile instruction implementations, examples, tests, and documentation to help developers migrate and optimize operators more smoothly across different Ascend generations.

📰 News

  • 🎉2025-12-27: PTO Tile Library is officially open-sourced.
  • 2026-01-30: Added reduction instructions and MX instructions.
  • 🚀2026-02-28: Added convolution instructions, quantization instructions, and inter-kernel communication instructions.
  • 🔥2026-03-30: Added support for Ascend A5, asynchronous communication instructions, and CostModel performance simulation.
  • 🛠️2026-04-02: Local engineering workflow improved with pre-commit checks, documentation build verification, and CPU-SIM validation updates.

🎯 Project Positioning

The PTO ISA is built on Ascend's underlying hardware and software abstractions and defines more than 90 standard tile instructions. It uses a higher-level tile programming model to bridge implementation differences across generations. Its goal is not to hide low-level capabilities, but to raise the abstraction level while preserving room for performance tuning.

  • Unified cross-generation tile abstraction: reduces migration cost across different Ascend generations.
  • Balances portability and performance: guarantees correct behavior under fixed tile shapes while preserving tuning dimensions such as tile size, tile shape, and instruction ordering.
  • Designed for frameworks, operators, and toolchains: serves as a common interface for upper-layer frameworks, operator implementations, and compiler toolchains.
  • Continuously extensible: defines 90+ standard operations today, with ongoing implementation and ecosystem integration.
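The portability and performance point above can be illustrated with a small sketch: a tiled matrix multiply gives the same result for any tile shape, so tile size stays a pure tuning knob. This is plain Python with nested lists standing in for tiles; the function names `matmul_reference` and `matmul_tiled` are hypothetical and are not part of the PTO API.

```python
# Illustrative only: plain Python lists stand in for tiles; this is not the PTO API.

def matmul_reference(a, b):
    """Untiled reference: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile_m, tile_n, tile_k):
    """Compute C = A @ B block by block with a (tile_m, tile_n, tile_k) tiling."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                # min(...) clips the last tile when a dimension is not a
                # multiple of the tile size.
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = 0
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


a = [[i + j for j in range(6)] for i in range(5)]      # 5 x 6 integer matrix
b = [[i * j + 1 for j in range(7)] for i in range(6)]  # 6 x 7 integer matrix
expected = matmul_reference(a, b)

# Different tile shapes change the loop structure, not the result.
for tiles in [(1, 1, 1), (2, 3, 4), (4, 4, 4), (5, 7, 6)]:
    assert matmul_tiled(a, b, *tiles) == expected
print("all tile shapes agree")
```

On real hardware the tile shape would be chosen to fit on-chip buffers and keep the compute units busy; the sketch only shows why that choice does not affect correctness.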

In addition to compute and …

# CPU Simulator (recommended first step)
python3 tests/run_cpu.py --clean --verbose

# Run GEMM demo
python3 tests/run_cpu.py --demo gemm --verbose

# Run Flash Attention demo
python3 tests/run_cpu.py --demo flash_attn --verbose

# Run a single ST testcase
python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64_64x64

# One-click build and run recommended tests
./build.sh --run_all --a3 --sim

For more complete build, test, and scripting details, see the Getting Started Guide and Test Guide.

Recommended Examples

  • Auto Mode Add example: a good first example for understanding how PTO instructions are organized
  • GEMM performance example: useful for understanding tile-level operator optimization
  • Flash Attention example: useful for understanding complex operators and performance tuning

Recommended Learning Path

  1. Start from simple examples to understand how PTO instructions organize tile-level computation and data movement.
  2. Verify functionality and correctness in CPU simulation to build intuition about instruction semantics and results.
  3. Port the code to Ascend hardware to validate correctness and collect performance data. See the msprof tool.
  4. Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound) and start optimization and tuning. See Performance Optimization.
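As a rough illustration of step 4, the sketch below labels a kernel by whichever pipeline was busy the longest. It assumes you have already extracted per-pipeline busy times from your profiling data. The dictionary layout, the function name `classify_bottleneck`, and the sample numbers are all hypothetical, and a real analysis would also look at stalls and overlap between pipelines.

```python
def classify_bottleneck(busy_us):
    """Return the label of the pipeline with the largest busy time.

    busy_us maps a pipeline name ("cube", "mte", "vector") to its busy time
    in microseconds. This layout is an assumption made for the sketch.
    """
    labels = {
        "cube": "CUBE Bound",
        "mte": "MTE Bound",
        "vector": "Vector Bound",
    }
    dominant = max(busy_us, key=busy_us.get)
    return labels[dominant]


# Made-up numbers, for illustration only.
sample = {"cube": 120.0, "mte": 310.0, "vector": 45.0}
print(classify_bottleneck(sample))  # MTE Bound
```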

This repository also demonstrates how standard tile operations can be mapped to different pipeline implementations through template parameters:

  • Tile Programming Model: understand static tile shapes, dynamic tile masks, and data organization
  • Events and Synchronization: understand set/wait flag and pipeline synchronization
  • General Conventions: understand general PTO programming rules and constraints
  • PTO Instruction List: browse the standard operations defined by the PTO ISA
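The set/wait flag idea mentioned under Events and Synchronization can be pictured with an analogy in plain Python. Here `threading.Event` plays the role of a flag between a data-movement stage and a compute stage. This is only an analogy for the ordering guarantee and does not use or imitate the PTO event API.

```python
import threading

tile_ready = threading.Event()  # plays the role of a "flag"
tile = []
result = []


def load_stage():
    tile.extend([1, 2, 3, 4])   # stand-in for moving a tile into local memory
    tile_ready.set()            # "set flag": the tile is now safe to read


def compute_stage():
    tile_ready.wait()           # "wait flag": block until the load has finished
    result.append(sum(x * x for x in tile))


consumer = threading.Thread(target=compute_stage)
producer = threading.Thread(target=load_stage)
consumer.start()                # started first on purpose; it still waits for the flag
producer.start()
producer.join()
consumer.join()
print(result[0])                # 30
```

Without the wait, the compute stage could read the tile before the load stage has filled it. The flag makes the ordering explicit while letting the two stages otherwise run independently.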

🗂️ Documentation Navigation

ISA and Programming Model

  • ISA Overview: entry point and navigation for PTO ISA documentation
  • PTO Instruction List: browse PTO standard operations by category
  • Tile Programming Model: understand tile shapes, masks, and the programming model
  • Events and Synchronization: understand event recording, waiting, and synchronization
  • General Conventions: review naming, constraints, and common rules

Development and Optimization

  • Developer Documentation Index: browse documentation for extending PTO Tile Lib
  • Performance Optimization: review performance analysis and tuning guidance
  • Documentation Build Guide: learn how to build the MkDocs site locally

📊 Examples and Performance References

GEMM

  • Reference implementation: kernels/manual/a2a3/gemm_performance/
  • Detailed analysis and tuning notes: High-Performance GEMM Operator Example

Flash Attention

  • Operator implementation and tuning notes: A2/A3 version, A5 version
  • A5 build guide, with A5 performance numbers still pending: Flash Attention Performance Kernel (A5)
  • S0: query sequence length (number of rows in Q/O)
  • S1: key/value sequence length (number of rows in K/V)

Ascend 910B2 multi-core comparison, using torch_npu as the baseline:

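| Sequence length | PTO time (us) | torch_npu time (us) | PTO TFLOPS | torch_npu TFLOPS | PTO speedup |
| --- | --- | --- | --- | --- | --- |
| 1024 | 20.960 | 58.461 | 25.61 | 9.18 | 2.79x |
| 2048 | 32.461 | 70.801 | 66.16 | 30.33 | 2.18x |
| 4096 | 88.902 | 118.302 | 96.62 | 72.61 | 1.33x |
| 8192 | 292.626 | 353.147 | 117.42 | 97.30 | 1.21x |
| 16384 | 909.058 | 1118.462 | 151.19 | 122.88 | 1.23x |
| 32768 | 3262.645 | 3646.173 | 168.50 | 150.78 | 1.12x |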
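The speedup column in the comparison above is consistent with dividing the torch_npu time by the PTO time. The short check below recomputes it from the two timing columns. The numbers are copied from the comparison, and the helper name `speedup` is only for illustration.

```python
# (sequence length, PTO time in us, torch_npu time in us, reported speedup)
rows = [
    (1024, 20.960, 58.461, "2.79x"),
    (2048, 32.461, 70.801, "2.18x"),
    (4096, 88.902, 118.302, "1.33x"),
    (8192, 292.626, 353.147, "1.21x"),
    (16384, 909.058, 1118.462, "1.23x"),
    (32768, 3262.645, 3646.173, "1.12x"),
]


def speedup(pto_us, baseline_us):
    """Speedup of PTO over the baseline: baseline time divided by PTO time."""
    return baseline_us / pto_us


for seq_len, pto_us, torch_us, reported in rows:
    computed = f"{speedup(pto_us, torch_us):.2f}x"
    print(f"{seq_len:>6}  computed {computed}  reported {reported}")
    assert computed == reported
```

The advantage is largest at short sequence lengths and narrows as the sequence grows, from 2.79x at 1024 down to 1.12x at 32768.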

Communication Instruction Bandwidth

  • Reference implementation: kernels/manual/a2a3/tget_bandwidth/
  • Detailed analysis and build/run guide: TGET / TGET_ASYNC Bandwidth Comparison Example

This example measures point-to-point remote-read bandwidth on Ascend A2/A3 and compares TGET (synchronous, via UB staging) with TGET_ASYNC (asynchronous, direct transfer through the DMA engine).
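Bandwidth in such a measurement is simply bytes transferred divided by elapsed time. The helper below is a minimal sketch of that arithmetic. The function name `bandwidth_gb_per_s` and the sample transfer are hypothetical and are not taken from the example's code or results.

```python
def bandwidth_gb_per_s(num_bytes, elapsed_us):
    """Effective bandwidth in GB/s, using 1 GB = 1e9 bytes.

    One byte per microsecond equals 1e6 bytes/s, which is 1e-3 GB/s,
    so GB/s = (bytes / microseconds) / 1e3.
    """
    return num_bytes / elapsed_us / 1e3


# Hypothetical transfer: 1,000,000 bytes moved in 10 us.
print(bandwidth_gb_per_s(1_000_000, 10.0))  # 100.0
```

When comparing a synchronous and an asynchronous transfer, the same payload size should be used for both so that only the elapsed time differs.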

GEMM AllReduce Fused Compute-Communication

  • Reference implementation: kernels/manual/a2a3/gemm_ar/
  • Detailed analysis and tuning notes: High-Performance GEMM AllReduce Fused Operator Example

This example shows how PTO communication primitives can be fused with compute kernels to overlap GEMM and AllReduce within one operator pipeline.
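To see why overlapping helps, consider a simple timing model in which the GEMM is split into chunks and each chunk's communication can start as soon as that chunk is computed. The sketch below compares a fully serial schedule with a two-stage pipelined one. The chunk timings are made up and the function names are hypothetical; the model says nothing about how the actual operator is implemented.

```python
def serial_time(gemm_us, comm_us):
    """Total time when all compute finishes before any communication starts."""
    return sum(gemm_us) + sum(comm_us)


def overlapped_time(gemm_us, comm_us):
    """Total time for a two-stage pipeline.

    Chunk i's communication starts once its GEMM is done and the previous
    chunk's communication has finished.
    """
    gemm_done = 0.0
    comm_done = 0.0
    for g, c in zip(gemm_us, comm_us):
        gemm_done += g
        comm_done = max(gemm_done, comm_done) + c
    return comm_done


# Made-up per-chunk timings in microseconds.
gemm = [10.0, 10.0, 10.0, 10.0]
comm = [6.0, 6.0, 6.0, 6.0]
print(serial_time(gemm, comm))      # 64.0
print(overlapped_time(gemm, comm))  # 46.0
```

In this model only the last chunk's communication remains exposed when compute is the slower stage, which is why the overlapped schedule approaches the pure GEMM time plus one communication step.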

🖥️ Platform Support

  • Ascend A2 (Ascend 910B)
  • Ascend A3 (Ascend 910C)
  • Ascend A5 (Ascend 950)
  • CPU (x86_64 / AArch64)

For more details, see include/README.md.

🛣️ Roadmap

Planned future features:

| Feature | Description | Scope | Progress / target completion |
| --- | --- | --- | --- |
| PTO Auto Mode | BiSheng compiler support for automatic tile buffer allocation and synchronization insertion. | Compiler / toolchain | Ongoing |
| PTO Tile Fusion | BiSheng compiler support for automatic tile operation fusion. | Compiler / toolchain | Ongoing |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain | Ongoing |
| Convolution extension | PTO ISA support for convolution kernels. | ISA extension | Ongoing |
| Collective communication extension | Add asynchronous communication instructions for Ccu and Roce, and add the TPREFECTH (AIV direct-drive) communication instruction. | Communication ISA extension | 2026 Q2 |
| System scheduling extension | PTO ISA support for SPMD/MPMD programming schedules. | ISA extension | Planned |
| Micro-instructions | Support expressing high-performance operators through micro-instructions, together with a foundational high-performance micro-instruction library. | ISA extension / operator development | 2026 Q2 |
| Base instructions | Further optimize A5 instruction performance, add Pooling-related base instructions, and enhance convolution, quantization, and Fixpipe instruction capabilities. | ISA extension | 2026 Q2 |
| CostModel | Support CostModel performance simulation for A5 instructions. | Toolchain / performance modeling | 2026 Q2 |
| CPU-SIM | Keep CPU-SIM built in sync with instruction enhancements. | CPU simulation | 2026 Q2 |

🗃️ Directory Structure

Key directories are listed below:

├── include/            # Public PTO headers and interfaces
│   └── pto/            # Common types, ISA interfaces, and CPU/NPU implementations
├── kernels/            # Kernels and operator implementations
│   ├── manual/         # Hand-optimized implementations and performance examples
│   └── custom/         # Custom operator examples
├── docs/               # ISA, programming model, getting started, and doc site sources
│   ├── isa/            # Instruction references and category indexes
│   ├── coding/         # Developer and performance optimization docs
│   ├── assembly/       # PTO-AS assembly syntax and specification
│   └── mkdocs/         # MkDocs config and source files
├── demos/              # Auto Mode, baseline, and torch_jit examples
├── tests/              # CPU / NPU tests, scripts, and test entry points
│   ├── cpu/            # CPU simulation tests
│   ├── npu/            # SoC-specific NPU tests
│   └── script/         # Test build and execution scripts
├── scripts/            # Build, install, and release scripts
├── cmake/              # Shared CMake configuration and packaging logic
├── build.sh            # One-click build and run entry script
└── CMakeLists.txt      # Top-level CMake configuration

ℹ️ Related Information

  • Contributing Guide: contribution workflow and development guidelines
  • Security and Vulnerability Disclosure: process for reporting security issues
  • Release Notes: version updates and release history
  • License: CANN Open Software License Agreement Version 2.0
  • PyPTO: an upper-layer programming framework in the PTO ecosystem
  • PTOAS: PTO assembler and compiler backend for PTO workflows
  • pto-dsl: Pythonic frontend and JIT workflow exploration for PTO

📬 Contact Us

  • Issue reporting: submit problems through repository Issues
  • Feature requests: share suggestions through Issues or discussion channels
  • Code contributions: contribute through Pull Requests
