Kani and CUDA
Table of Contents
CUDA kernels in pure Rust?!
I've been implementing an experimental feature in the cudarc
crate. The goal is to write CUDA kernels in pure Rust to improve portability and safety. It's far from finished.
Currently cudarc
emits PTX, the CUDA equivalent of assembly, with JIT source code compilation of C
code to PTX that is then loaded at runtime:
let ptx = compile_ptx?;
// and dynamically load it into the device
dev.load_ptx?;
For Rust PTX emission, I instead use a JIT crate compiler.
// use compile_crate_to_ptx to build kernels in pure Rust
// uses experimental ABI_PTX
let kernel_path: PathBuf = "examples/rust-kernel/src/lib.rs".into;
let kernels: = compile_crate_to_ptx.unwrap;
let kernel: Ptx = kernels.first.unwrap;
dev.load_ptx?;
// we can also manage and clean up the build ptx files with a PtxCrate
let mut rust_ptx: PtxCrate = kernel_path.try_into.unwrap;
rust_ptx.build_ptx.unwrap;
let _kernel: &Ptx = rust_ptx.peek_kernels.unwrap.first.unwrap;
println!;
The development environment is rough. Notice how I used square
? I couldn't actually find sin
! I'm still learning how to get the linter on board inside the kernel
crate.
Kernel crate source code
// lib.rs
// emitting ptx (unstable)
// simd instructions (unstable)
// CUDA compatibility
use *; // access to thread id, etc
!
pub unsafe extern "ptx-kernel"
The panic_handler
is for no_std
crates. Here it is used to give CUDA devices instructions for when panic
s can't be avoided. Branches and panics drag on performance. Avoid them as much as possible.
// device.rs
// a no_std fn
We compile as a C
dynamic library and emit PTX, targeing nvptx64
.
# Cargo.toml
[]
= "rust-kernel"
= "0.1.0"
= "2021"
[]
= "kernel"
# build = "cargo +nightly rustc --lib --target nvptx64-nvidia-cuda --release -- --emit asm"
= ["cdylib"]
Emitted PTX
The PTX itself is pretty clean.
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_30
.address_size 64
// .globl square_kernel
.visible .entry square_kernel
On the other hand, ABI_PTX
is very much an experiemntal Rust feature. It would be an exciting challenge to work on it some day.
As of today, PTX is on version 8.1, compared to our version 6.0. For simple kernels, perhaps this doesn't matter. But what about sm_30
? Read here to learn more about the architecture names. I don't think it's a good sign to see Kepler
here!
Kani
What if my device function emits PTX with lots of panics? Ideally I'd like to refactor it so that the Rust compiler writes fewer panics into the code.
I'm experimenting with the model-checker kani
.
// https://github.com/model-checking/kani/blob/2df67e380a42a78748b8e84dcc699b0378b287c7/README.md?plain=1#L30
use ;
Here we assert!
that any
input
to the function_under_test
which meets the precondition
produces output
that meets_specification
. It does so by converting the problem to a boolean SAT formula (SAT solvers are fun!) and trying to write a proof. We encode input
and output
by assigning each bit a boolean variable. The SAT formula asserts that there exists an input
whose output
does not meet specifications. Many thousands of dummy variables are introduced to calculate output
from input
through function_under_test
.
When given a boolean formula in CNF, a SAT solver either produces:
a witness
: a truth assignment that satisfies the formula (equivalently, a model)- here, a counterexample to the proof: the underlying bits for
input
that fails the assertion; - (!) save this to write a failing unit test; or
- here, a counterexample to the proof: the underlying bits for
a certificate
: a step-by-step walkthrough of the binary search space that proves insatisfiability- this can be saved for reuse
Will this meet my needs? It certainly has me interested.
Next steps
- finish the PR
- investigate panic-free
no_std
refactoring withkani
- investigate
abi_ptx
development - apply to other crates
- potential features:
- build scripts to build PTX before runtime
- specify more command arguments
- improve compatibility
- compile standalone kernel code (uses
rustc
, notcargo
) - macros to build kernels from device functions