pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa
PTO Tile Library
Parallel Tile Operation (PTO) is a virtual ISA for tile-oriented programming defined by Ascend CANN. This repository provides PTO Tile instruction implementations, examples, tests, and documentation to help developers migrate and optimize operators more smoothly across different Ascend generations.
📰 News
- 🎉2025-12-27: PTO Tile Library is officially open-sourced.
- ✨2026-01-30: Added reduction instructions and MX instructions.
- 🚀2026-02-28: Added convolution instructions, quantization instructions, and inter-kernel communication instructions.
- 🔥2026-03-30: Added support for Ascend A5, asynchronous communication instructions, and CostModel performance simulation.
- 🛠️2026-04-02: Local engineering workflow improved with pre-commit checks, documentation build verification, and CPU-SIM validation updates.
🎯 Project Positioning
The PTO ISA is built on Ascend's underlying hardware and software abstractions and defines more than 90 standard tile instructions. It uses a higher-level tile programming model to bridge implementation differences across generations. Its goal is not to hide low-level capabilities, but to raise the abstraction level while preserving room for performance tuning.
- Unified cross-generation tile abstraction: reduces migration cost across different Ascend generations.
- Balances portability and performance: guarantees correct behavior under fixed tile shapes while preserving tuning dimensions such as tile size, tile shape, and instruction ordering.
- Designed for frameworks, operators, and toolchains: serves as a common interface for upper-layer frameworks, operator implementations, and compiler toolchains.
- Continuously extensible: defines 90+ standard operations today, with ongoing implementation and ecosystem integration.
In addition to compute and [...]

    # CPU Simulator (recommended first step)
    python3 tests/run_cpu.py --clean --verbose

    # Run GEMM demo
    python3 tests/run_cpu.py --demo gemm --verbose

    # Run Flash Attention demo
    python3 tests/run_cpu.py --demo flash_attn --verbose

    # Run a single ST testcase
    python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64_64x64

    # One-click build and run recommended tests
    ./build.sh --run_all --a3 --sim

For more complete build, test, and scripting details, see the Getting Started Guide and Test Guide.
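
When running several ST testcases in a row, it can help to assemble the run_st.py command line programmatically. The sketch below only builds the argument list. The flag letters come from the command shown above, while the helper `build_st_command` and its parameter names (`run_mode`, `soc_version`, `testcase`, `gtest_filter`) are illustrative assumptions about what each flag means, not documented API.

```python
import shlex


def build_st_command(run_mode, soc_version, testcase, gtest_filter=None):
    """Return the argv list for one tests/script/run_st.py invocation.

    The parameter names are guesses at what -r, -v, -t and -g stand for.
    """
    argv = [
        "python3", "tests/script/run_st.py",
        "-r", run_mode,
        "-v", soc_version,
        "-t", testcase,
    ]
    if gtest_filter is not None:
        argv += ["-g", gtest_filter]
    return argv


if __name__ == "__main__":
    cmd = build_st_command("sim", "a3", "tadd", "TADDTest.case_float_64x64_64x64")
    # Print a shell-safe command line; pass `cmd` to subprocess.run to execute
    # it from the repository root.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems if a filter pattern ever contains shell metacharacters.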
Recommended Examples
- Auto Mode Add example: a good first example for understanding how PTO instructions are organized
- GEMM performance example: useful for understanding tile-level operator optimization
- Flash Attention example: useful for understanding complex operators and performance tuning
Recommended Learning Path
- Start from simple examples to understand how PTO instructions organize tile-level computation and data movement.
- Verify functionality and correctness in CPU simulation to build intuition about instruction semantics and results.
- Port the code to Ascend hardware to validate correctness and collect performance data. See the msprof tool.
- Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound) and start optimization and tuning. See Performance Optimization.
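
The bottleneck step above can be sketched as picking the pipeline with the highest busy ratio. The helper below is a hypothetical illustration: the input dictionary, its keys, and the example numbers are assumptions and do not reflect the actual output format of msprof.

```python
def classify_bottleneck(busy_us, total_us):
    """Return (label, ratio) for the pipeline with the highest busy ratio.

    busy_us maps a pipeline name ("CUBE", "MTE", "Vector") to its busy time
    in microseconds; total_us is the kernel's wall-clock time.
    """
    if total_us <= 0:
        raise ValueError("total_us must be positive")
    name, busy = max(busy_us.items(), key=lambda item: item[1])
    return f"{name} Bound", busy / total_us


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    label, ratio = classify_bottleneck(
        {"CUBE": 40.0, "MTE": 85.0, "Vector": 30.0}, 100.0
    )
    print(f"{label} ({ratio:.0%} busy)")
```

A real analysis would likely look at more than one number per pipeline, but the dominant busy ratio is a reasonable first signal for where to start tuning.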
This repository also demonstrates how standard tile operations can be mapped to different pipeline implementations through template parameters. The following documents cover the underlying concepts:
- Tile Programming Model: understand static tile shapes, dynamic tile masks, and data organization
- Events and Synchronization: understand set/wait flag and pipeline synchronization
- General Conventions: understand general PTO programming rules and constraints
- PTO Instruction List: browse the standard operations defined by the PTO ISA
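
To make the idea of static tile shapes and dynamic tile masks more concrete, here is a toy model in plain Python. It assumes that a tile owns a buffer of fixed shape and carries a runtime valid region. The class `ToyTile` and the function `toy_add` are hypothetical and do not mirror the PTO C++ API.

```python
from dataclasses import dataclass, field

ROWS, COLS = 4, 4  # static tile shape, fixed for the whole program


@dataclass
class ToyTile:
    valid_rows: int  # dynamic mask: rows that hold real data
    valid_cols: int  # dynamic mask: columns that hold real data
    data: list = field(
        default_factory=lambda: [[0.0] * COLS for _ in range(ROWS)]
    )

    def __post_init__(self):
        if not (0 <= self.valid_rows <= ROWS and 0 <= self.valid_cols <= COLS):
            raise ValueError("valid region must fit inside the static shape")


def toy_add(a, b):
    """Elementwise add restricted to the valid region shared by both inputs."""
    rows = min(a.valid_rows, b.valid_rows)
    cols = min(a.valid_cols, b.valid_cols)
    out = ToyTile(rows, cols)
    for r in range(rows):
        for c in range(cols):
            out.data[r][c] = a.data[r][c] + b.data[r][c]
    return out


if __name__ == "__main__":
    a = ToyTile(3, 2, [[1.0] * COLS for _ in range(ROWS)])
    b = ToyTile(3, 2, [[2.0] * COLS for _ in range(ROWS)])
    c = toy_add(a, b)
    print(c.data[0])  # only the first two columns are computed
```

The buffer size never changes, so the same code path can serve edge tiles that are only partially filled; only the valid region differs at runtime.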
🗂️ Documentation Navigation
ISA and Programming Model
- ISA Overview: entry point and navigation for PTO ISA documentation
- PTO Instruction List: browse PTO standard operations by category
- Tile Programming Model: understand tile shapes, masks, and the programming model
- Events and Synchronization: understand event recording, waiting, and synchronization
- General Conventions: review naming, constraints, and common rules
Development and Optimization
- Developer Documentation Index: browse documentation for extending PTO Tile Lib
- Performance Optimization: review performance analysis and tuning guidance
- Documentation Build Guide: learn how to build the MkDocs site locally
📊 Examples and Performance References
GEMM
- Reference implementation: kernels/manual/a2a3/gemm_performance/
- Detailed analysis and tuning notes: High-Performance GEMM Operator Example
Flash Attention
- Operator implementation and tuning notes: A2/A3 version, A5 version
- A5 build guide, with A5 performance numbers still pending: Flash Attention Performance Kernel (A5)
- S0: query sequence length (number of rows in Q/O)
- S1: key/value sequence length (number of rows in K/V)
Ascend 910B2 multi-core comparison, using torch_npu as the baseline:
| Sequence length | PTO time (us) | torch_npu time (us) | PTO TFLOPS | torch_npu TFLOPS | PTO speedup |
|---|---|---|---|---|---|
| 1024 | 20.960 | 58.461 | 25.61 | 9.18 | 2.79x |
| 2048 | 32.461 | 70.801 | 66.16 | 30.33 | 2.18x |
| 4096 | 88.902 | 118.302 | 96.62 | 72.61 | 1.33x |
| 8192 | 292.626 | 353.147 | 117.42 | 97.30 | 1.21x |
| 16384 | 909.058 | 1118.462 | 151.19 | 122.88 | 1.23x |
| 32768 | 3262.645 | 3646.173 | 168.50 | 150.78 | 1.12x |
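
The speedup column can be re-derived from the two timing columns as torch_npu time divided by PTO time. The sketch below checks that against the table. It also back-computes the work implied by each row (TFLOPS × time). The observation that this is close to 4 · S² · 128 is an inference from the numbers, assuming S0 = S1 = S, and is not stated in this document.

```python
# (sequence length, PTO time us, torch_npu time us, PTO TFLOPS, listed speedup)
TABLE = [
    (1024, 20.960, 58.461, 25.61, 2.79),
    (2048, 32.461, 70.801, 66.16, 2.18),
    (4096, 88.902, 118.302, 96.62, 1.33),
    (8192, 292.626, 353.147, 117.42, 1.21),
    (16384, 909.058, 1118.462, 151.19, 1.23),
    (32768, 3262.645, 3646.173, 168.50, 1.12),
]


def speedup(pto_us, baseline_us):
    """Speedup of PTO over the baseline, from the two measured times."""
    return baseline_us / pto_us


def implied_flops(tflops, time_us):
    """Work implied by a throughput/time pair: TFLOPS * 1e12 * time_us * 1e-6."""
    return tflops * time_us * 1e6


if __name__ == "__main__":
    for seq, pto_us, base_us, pto_tflops, listed in TABLE:
        derived = round(speedup(pto_us, base_us), 2)
        flops = implied_flops(pto_tflops, pto_us)
        guess = 4 * seq * seq * 128  # assumed formula, inferred from the data
        print(seq, derived, listed, f"{flops / guess:.3f}")
```

If the inferred formula holds, the TFLOPS columns follow directly from the sequence length and the measured time, which makes it easy to compare additional runs on the same footing.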
Communication Instruction Bandwidth
- Reference implementation: kernels/manual/a2a3/tget_bandwidth/
- Detailed analysis and build/run guide: TGET / TGET_ASYNC Bandwidth Comparison Example
This example measures point-to-point remote-read bandwidth on Ascend A2/A3 and compares TGET (synchronous, via UB staging) with TGET_ASYNC (asynchronous, direct transfer through the DMA engine).
GEMM AllReduce Fused Compute-Communication
- Reference implementation: kernels/manual/a2a3/gemm_ar/
- Detailed analysis and tuning notes: High-Performance GEMM AllReduce Fused Operator Example
This example shows how PTO communication primitives can be fused with compute kernels to overlap GEMM and AllReduce within one operator pipeline.
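
The benefit of overlapping can be estimated with a simple two-stage pipeline model: the AllReduce for chunk i can start once its GEMM has finished and the AllReduce for the previous chunk has completed. The sketch below compares that with running every stage back to back. The chunking scheme and the timings are hypothetical and are not taken from the gemm_ar kernel.

```python
def sequential_time(compute_us, comm_us):
    """Total time when every GEMM and AllReduce chunk runs back to back."""
    return sum(compute_us) + sum(comm_us)


def overlapped_time(compute_us, comm_us):
    """Makespan when communication of chunk i overlaps compute of chunk i+1."""
    if len(compute_us) != len(comm_us):
        raise ValueError("need one communication time per compute chunk")
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute_us, comm_us):
        compute_done += c  # GEMM chunks run in order
        # Communication waits for its data and for the previous transfer.
        comm_done = max(comm_done, compute_done) + m
    return comm_done


if __name__ == "__main__":
    compute = [10.0, 10.0, 10.0, 10.0]  # made-up per-chunk GEMM times (us)
    comm = [8.0, 8.0, 8.0, 8.0]         # made-up per-chunk AllReduce times (us)
    print(sequential_time(compute, comm), overlapped_time(compute, comm))
```

In this model the overlapped time approaches the cost of the slower stage plus one chunk of the faster one, which is why balancing chunk-level compute and communication times matters.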
🖥️ Platform Support
- Ascend A2 (Ascend 910B)
- Ascend A3 (Ascend 910C)
- Ascend A5 (Ascend 950)
- CPU (x86_64 / AArch64)
For more details, see include/README.md.
🛣️ Roadmap
Planned future features:
| Feature | Description | Scope | Progress / target completion |
|---|---|---|---|
| PTO Auto Mode | BiSheng compiler support for automatic tile buffer allocation and synchronization insertion. | Compiler / toolchain | Ongoing |
| PTO Tile Fusion | BiSheng compiler support for automatic tile operation fusion. | Compiler / toolchain | Ongoing |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain | Ongoing |
| Convolution extension | PTO ISA support for convolution kernels. | ISA extension | Ongoing |
| Collective communication extension | Add asynchronous communication instructions for CCU and RoCE, and add the TPREFETCH (AIV direct-drive) communication instruction. | Communication ISA extension | 2026 Q2 |
| System scheduling extension | PTO ISA support for SPMD/MPMD programming schedules. | ISA extension | Planned |
| Micro-instructions | Support expressing high-performance operators through micro-instructions, together with a foundational high-performance micro-instruction library. | ISA extension / operator development | 2026 Q2 |
| Base instructions | Further optimize A5 instruction performance, add Pooling-related base instructions, and enhance convolution, quantization, and Fixpipe instruction capabilities. | ISA extension | 2026 Q2 |
| CostModel | Support CostModel performance simulation for A5 instructions. | Toolchain / performance modeling | 2026 Q2 |
| CPU-SIM | Keep CPU-SIM in sync with instruction enhancements. | CPU simulation | 2026 Q2 |
🗃️ Directory Structure
Key directories are listed below:

    ├── include/          # Public PTO headers and interfaces
    │   └── pto/          # Common types, ISA interfaces, and CPU/NPU implementations
    ├── kernels/          # Kernels and operator implementations
    │   ├── manual/       # Hand-optimized implementations and performance examples
    │   └── custom/       # Custom operator examples
    ├── docs/             # ISA, programming model, getting started, and doc site sources
    │   ├── isa/          # Instruction references and category indexes
    │   ├── coding/       # Developer and performance optimization docs
    │   ├── assembly/     # PTO-AS assembly syntax and specification
    │   └── mkdocs/       # MkDocs config and source files
    ├── demos/            # Auto Mode, baseline, and torch_jit examples
    ├── tests/            # CPU / NPU tests, scripts, and test entry points
    │   ├── cpu/          # CPU simulation tests
    │   ├── npu/          # SoC-specific NPU tests
    │   └── script/       # Test build and execution scripts
    ├── scripts/          # Build, install, and release scripts
    ├── cmake/            # Shared CMake configuration and packaging logic
    ├── build.sh          # One-click build and run entry script
    └── CMakeLists.txt    # Top-level CMake configuration

ℹ️ Related Information
- Contributing Guide: contribution workflow and development guidelines
- Security and Vulnerability Disclosure: process for reporting security issues
- Release Notes: version updates and release history
- License: CANN Open Software License Agreement Version 2.0
- PyPTO: an upper-layer programming framework in the PTO ecosystem
- PTOAS: PTO assembler and compiler backend for PTO workflows
- pto-dsl: Pythonic frontend and JIT workflow exploration for PTO
📬 Contact Us
- Issue reporting: submit problems through repository Issues
- Feature requests: share suggestions through Issues or discussion channels
- Code contributions: contribute through Pull Requests