<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Albatross1382</title>
    <description>The latest articles on Forem by Albatross1382 (@albatross1382).</description>
    <link>https://forem.com/albatross1382</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3699510%2F7019e4ff-714c-4c9d-b2bf-6b40355523a7.png</url>
      <title>Forem: Albatross1382</title>
      <link>https://forem.com/albatross1382</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/albatross1382"/>
    <language>en</language>
    <item>
      <title>Getting ONNX Runtime CUDA Working on NVIDIA Blackwell (GX10/DGX Spark)</title>
      <dc:creator>Albatross1382</dc:creator>
      <pubDate>Fri, 10 Apr 2026 03:32:07 +0000</pubDate>
      <link>https://forem.com/albatross1382/getting-onnx-runtime-cuda-working-on-nvidia-blackwell-gx10dgx-spark-ngb</link>
      <guid>https://forem.com/albatross1382/getting-onnx-runtime-cuda-working-on-nvidia-blackwell-gx10dgx-spark-ngb</guid>
      <description>&lt;h1&gt;
  
  
  Getting ONNX Runtime CUDA Working on NVIDIA Blackwell (GX10/DGX Spark)
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Or: How I spent 12 hours discovering that nobody ships GPU inference binaries for NVIDIA's own hardware&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;NVIDIA's DGX Spark (GX10) ships with a Grace CPU (ARM64) and GB10 GPU (Blackwell, sm_121). As of April 2026, no prebuilt ONNX Runtime GPU binary exists for this platform — not from Microsoft, not from PyPI, not from the Rust ecosystem. Here's how I built one from source and got CUDA-accelerated embedding inference running. The prebuilt binaries and full build instructions are published at &lt;a href="https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell" rel="noopener noreferrer"&gt;https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I'm building a Rust application that uses snowflake-arctic-embed-m-v2.0 for semantic search — a 768-dimensional embedding model running via ONNX inference. On my laptop (ARM64, CPU-only via tract-onnx), each embedding took ~3,400ms. The GX10's Blackwell GPU should crush this.&lt;/p&gt;
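&lt;p&gt;For context on what those timings measure: semantic search compares these 768-dimensional vectors, typically by cosine similarity. A minimal, self-contained sketch (illustrative only, not the app's actual code):&lt;/p&gt;

```rust
// Illustrative only: how two embedding vectors are compared for
// semantic search. Works on any equal-length f32 slices.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```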

&lt;h2&gt;
  
  
  Step 1: Swap tract-onnx for the ort crate (Easy)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ort&lt;/code&gt; Rust crate wraps ONNX Runtime with CUDA support. I extracted an &lt;code&gt;EmbeddingProvider&lt;/code&gt; trait, implemented &lt;code&gt;TractProvider&lt;/code&gt; and &lt;code&gt;OrtProvider&lt;/code&gt;, and wired up a &lt;code&gt;--features ort-cuda&lt;/code&gt; flag. A clean refactor: 26 files touched, all tests passing.&lt;/p&gt;

&lt;p&gt;Immediate result: &lt;strong&gt;3,400ms → 135ms&lt;/strong&gt; — even on CPU. ORT's CPU backend is far better optimised than tract's pure-Rust implementation: a 25x improvement before even touching the GPU.&lt;/p&gt;
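&lt;p&gt;The abstraction looks roughly like this (the trait and type names match the refactor described above; the method signature and placeholder bodies are assumptions for illustration):&lt;/p&gt;

```rust
// Sketch of the provider abstraction. The trait and type names come
// from the refactor described in the post; the `embed` signature and
// the placeholder body are assumptions for illustration.
trait EmbeddingProvider {
    fn embed(&self, text: &str) -> Vec<f32>;
}

// Stand-in for the pure-Rust tract-onnx backend (real inference elided).
struct TractProvider;
impl EmbeddingProvider for TractProvider {
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 768] // placeholder 768-dim embedding
    }
}

// The ORT-backed provider is compiled in only under `--features ort-cuda`.
#[cfg(feature = "ort-cuda")]
struct OrtProvider;

fn make_provider() -> Box<dyn EmbeddingProvider> {
    // In the real app this would select OrtProvider when `ort-cuda` is on.
    Box::new(TractProvider)
}
```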

&lt;h2&gt;
  
  
  Step 2: Where Are the GPU Binaries? (The Wall)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ort&lt;/code&gt; crate auto-downloads prebuilt ONNX Runtime binaries. On aarch64 + CUDA 13:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ort-sys] [WARN] no prebuilt binaries available on this platform
for combination of features 'cu13'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checked every source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pyke.io&lt;/strong&gt; (ort crate's CDN): No aarch64 + CUDA builds at all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft GitHub releases&lt;/strong&gt;: No aarch64 GPU tarballs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt; (&lt;code&gt;onnxruntime-gpu&lt;/code&gt;): "No matching distribution found" for aarch64&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA apt repos&lt;/strong&gt;: Has cuDNN and CUDA, but no onnxruntime package&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA AI Workbench&lt;/strong&gt;: Not bundled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NVIDIA sells hardware for which its own ML ecosystem ships no prebuilt inference binaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Build from Source (The Gauntlet)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Attempt 1: ORT v1.20.1
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eigen hash mismatch&lt;/strong&gt;: GitLab changed the zip archive, breaking the SHA1 check. Fix: pre-clone Eigen via git and set &lt;code&gt;FETCHCONTENT_SOURCE_DIR_EIGEN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;thrust::unary_function&lt;/code&gt; removed&lt;/strong&gt;: CUDA 13's CCCL/Thrust removed this class. ORT v1.20.1 uses it. &lt;strong&gt;Dead end — need a newer ORT.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Attempt 2: ORT v1.24.4
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;compute_53&lt;/code&gt; unsupported&lt;/strong&gt;: CUDA 13 dropped old GPU architectures. Fix: &lt;code&gt;CMAKE_CUDA_ARCHITECTURES=121&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sm_120 vs sm_121&lt;/strong&gt;: I initially guessed sm_120 for Blackwell. Wrong — the GB10 is compute capability 12.1. Discovered via &lt;code&gt;nvidia-smi --query-gpu=compute_cap --format=csv,noheader&lt;/code&gt;. Cost: one full rebuild (~40 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success&lt;/strong&gt;: ORT v1.24.4 built clean with sm_121. Total build time: ~40 minutes on 20 cores.&lt;/li&gt;
&lt;/ul&gt;
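&lt;p&gt;The sm_120 vs sm_121 confusion comes down to a simple mapping: compute capability &lt;code&gt;major.minor&lt;/code&gt; with the dot dropped. A throwaway helper (illustrative, not part of the build) makes it explicit:&lt;/p&gt;

```rust
// Maps `nvidia-smi --query-gpu=compute_cap` output (e.g. "12.1")
// to the value CMAKE_CUDA_ARCHITECTURES expects (e.g. 121).
// Assumes a single-digit minor version, which holds for real GPUs.
fn compute_cap_to_arch(cap: &str) -> Option<u32> {
    let (major, minor) = cap.trim().split_once('.')?;
    Some(major.parse::<u32>().ok()? * 10 + minor.parse::<u32>().ok()?)
}
```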

&lt;h2&gt;
  
  
  Step 4: Dynamic Loading vs Static Linking (The Trap)
&lt;/h2&gt;

&lt;p&gt;With the built &lt;code&gt;.so&lt;/code&gt;, I tried two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static linking&lt;/strong&gt; (&lt;code&gt;ORT_LIB_LOCATION&lt;/code&gt;): The &lt;code&gt;ort-sys&lt;/code&gt; build script ignored it and downloaded the CPU-only binary anyway. The env var didn't propagate through Cargo's build process reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic loading&lt;/strong&gt; (&lt;code&gt;load-dynamic&lt;/code&gt; feature + &lt;code&gt;ORT_DYLIB_PATH&lt;/code&gt;): Loaded the correct library. CUDA provider plugin loaded. All symbols resolved. EP registered without error. &lt;strong&gt;But I couldn't tell whether CUDA was actually active.&lt;/strong&gt;&lt;/p&gt;
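&lt;p&gt;For reference, the working dependency configuration looked roughly like this. Treat the version pin and exact feature names as assumptions; they have varied across &lt;code&gt;ort&lt;/code&gt; releases, so check the crate docs:&lt;/p&gt;

```toml
# Cargo.toml: load the ONNX Runtime shared library at runtime instead
# of binding at build time. Version and feature names are assumptions.
[dependencies]
ort = { version = "2", features = ["load-dynamic", "cuda"] }
```

&lt;p&gt;At runtime, &lt;code&gt;ORT_DYLIB_PATH&lt;/code&gt; points at the self-built &lt;code&gt;libonnxruntime.so&lt;/code&gt;.&lt;/p&gt;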

&lt;h2&gt;
  
  
  Step 5: The Debugging Trap
&lt;/h2&gt;

&lt;p&gt;At this point I spent hours convinced CUDA wasn't working. The &lt;code&gt;ort&lt;/code&gt; crate's EP registration appeared to silently fail — inference timing looked the same with and without CUDA. I even wrote a 20-line unsafe FFI workaround to call the legacy &lt;code&gt;OrtSessionOptionsAppendExecutionProvider_CUDA&lt;/code&gt; function directly.&lt;/p&gt;

&lt;p&gt;The actual problem? &lt;strong&gt;I didn't have &lt;code&gt;tracing-subscriber&lt;/code&gt; initialised.&lt;/strong&gt; The &lt;code&gt;ort&lt;/code&gt; crate logs all EP registration events via the &lt;code&gt;tracing&lt;/code&gt; crate. Without a subscriber, &lt;code&gt;RUST_LOG&lt;/code&gt; does nothing — zero output, zero feedback on whether CUDA is active.&lt;/p&gt;
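&lt;p&gt;The fix is a few lines at the top of &lt;code&gt;main&lt;/code&gt;. A minimal sketch, assuming &lt;code&gt;tracing-subscriber&lt;/code&gt; is built with its &lt;code&gt;env-filter&lt;/code&gt; feature:&lt;/p&gt;

```rust
// Install a subscriber so `RUST_LOG=ort=debug` actually emits output.
// Assumes the tracing-subscriber crate with the "env-filter" feature.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();
}
```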

&lt;p&gt;Once I added &lt;code&gt;tracing-subscriber&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO ort::ep: Successfully registered `CUDAExecutionProvider`
TRACE: Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 455
INFO: Creating BFCArena for Cuda ...
INFO: cuDNN version: 92000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;CUDA was working the whole time — through the crate's native &lt;code&gt;ort::ep::CUDA&lt;/code&gt; registration. The FFI workaround was unnecessary.&lt;/p&gt;

&lt;p&gt;The definitive proof:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With GPU: &lt;strong&gt;148ms&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CUDA_VISIBLE_DEVICES=""&lt;/code&gt; (GPU hidden): &lt;strong&gt;3,279ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned:&lt;/strong&gt; If you're using the &lt;code&gt;ort&lt;/code&gt; crate and can't tell whether your EP is active, add &lt;code&gt;tracing-subscriber&lt;/code&gt; before anything else. It's not optional for debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;RUST_LOG=ort=debug&lt;/code&gt;, ORT confirms CUDA activation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO ort::logging: Creating BFCArena for Cuda ...
INFO ort::logging: Extending BFCArena for Cuda. bin_num:20 num_bytes: 768147456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;768MB allocated on GPU for model weights. Only lightweight shape ops fall back to CPU — standard ORT optimisation behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: INT8 vs FP32
&lt;/h2&gt;

&lt;p&gt;The quantized (INT8) model doesn't have CUDA kernels for sm_121. All ops fall back to CPU, making it slower than ORT's CPU path with the FP32 model. Solution: use the FP32 model for CUDA, keep the quantized model for the CPU-only tract backend.&lt;/p&gt;
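&lt;p&gt;In code, the selection is a one-liner (file names here are hypothetical placeholders, not the real model paths):&lt;/p&gt;

```rust
// Pick the model variant per backend, as described above.
// File names are hypothetical placeholders.
fn model_file(cuda_available: bool) -> &'static str {
    if cuda_available {
        "model_fp32.onnx" // CUDA: FP32 ops have sm_121 kernels
    } else {
        "model_int8.onnx" // CPU-only tract backend: keep the quantized model
    }
}
```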

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Latency per embedding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tract-onnx CPU (before)&lt;/td&gt;
&lt;td&gt;3,400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORT CPU (ort crate, no GPU)&lt;/td&gt;
&lt;td&gt;135ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORT CUDA (GB10)&lt;/td&gt;
&lt;td&gt;148ms cold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORT CPU (CUDA disabled)&lt;/td&gt;
&lt;td&gt;3,279ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 135ms → 148ms difference on cold start is misleading — model loading dominates. In a long-running server with warm sessions, CUDA inference should be significantly faster than CPU.&lt;/p&gt;
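&lt;p&gt;The cold-start effect is easy to control for: run one untimed warm-up call before measuring. A std-only harness (illustrative, not the benchmark I actually ran):&lt;/p&gt;

```rust
use std::time::{Duration, Instant};

// Times `f` after one untimed warm-up call, so one-off costs
// (model load, CUDA context init, arena allocation) are excluded.
// Returns the average duration over `runs` measured calls.
fn time_warm<F: FnMut()>(mut f: F, runs: u32) -> Duration {
    f(); // warm-up: not measured
    let start = Instant::now();
    for _ in 0..runs {
        f();
    }
    start.elapsed() / runs
}
```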

&lt;h2&gt;
  
  
  What Needs to Happen
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft&lt;/strong&gt;: Publish aarch64 Linux GPU release artifacts for ONNX Runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pyke.io&lt;/strong&gt;: Add aarch64 + CUDA prebuilts to the ort crate's download cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ort crate&lt;/strong&gt;: Document that &lt;code&gt;tracing-subscriber&lt;/code&gt; is required to see EP registration status. Without it, debugging execution providers is nearly impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA&lt;/strong&gt;: Your flagship developer workstation doesn't have prebuilt ML inference binaries. Fix that.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The prebuilt binaries and complete build instructions are available at &lt;a href="https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell" rel="noopener noreferrer"&gt;https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>cuda</category>
      <category>nvidia</category>
      <category>onnxruntime</category>
    </item>
  </channel>
</rss>
