<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Marius Momeu</title>
    <description>The latest articles on Forem by Marius Momeu (@mariusmomeu).</description>
    <link>https://forem.com/mariusmomeu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3757191%2Fd0db92e7-aed8-423b-906d-369d9ac1eebf.png</url>
      <title>Forem: Marius Momeu</title>
      <link>https://forem.com/mariusmomeu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mariusmomeu"/>
    <language>en</language>
    <item>
      <title>I Spent $174 Transpiling 12 Open-Source C Projects (28K Lines) to Rust. Here's What Happened.</title>
      <dc:creator>Marius Momeu</dc:creator>
      <pubDate>Thu, 12 Feb 2026 15:51:41 +0000</pubDate>
      <link>https://forem.com/mariusmomeu/i-spent-174-transpiling-12-open-source-c-projects-28k-lines-to-rust-heres-what-happened-3g9i</link>
      <guid>https://forem.com/mariusmomeu/i-spent-174-transpiling-12-open-source-c-projects-28k-lines-to-rust-heres-what-happened-3g9i</guid>
      <description>&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I've been curious about how far AI agents can go with real systems programming work — not toy examples, but actual C libraries with pointer arithmetic, &lt;code&gt;void*&lt;/code&gt; generics, linked lists, and cryptographic primitives. So I extracted 46 functions and modules from 12 open-source C projects and pointed our transpiler agent — designed via &lt;a href="https://github.com/ksenxx/kiss_ai" rel="noopener noreferrer"&gt;KISS AI&lt;/a&gt; and using Opus 4.5 as the LLM — at each one to see if it could produce idiomatic, memory-safe Rust without human intervention.&lt;/p&gt;

&lt;p&gt;Each project came with a gauntlet of JSON test vectors defining success criteria. But the agent couldn't simply aim for those: it first had to generate comprehensive unit tests and achieve byte-for-byte I/O equivalence with the C originals through FFI verification, and only then validate against the provided test vectors.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Libraries (12 Projects)
&lt;/h2&gt;

&lt;p&gt;These are functions and modules pulled from 12 real-world open-source projects, covering everything from low-level string manipulation to full collision detection engines, procedural noise generation, and post-quantum cryptography:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/DaveGamble/cJSON" rel="noopener noreferrer"&gt;cJSON&lt;/a&gt;&lt;/strong&gt; (~3,800 lines) — a lightweight JSON parser with recursive tree structures using &lt;code&gt;next&lt;/code&gt;/&lt;code&gt;prev&lt;/code&gt;/&lt;code&gt;child&lt;/code&gt; pointers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/curl/curl" rel="noopener noreferrer"&gt;curl&lt;/a&gt;&lt;/strong&gt; (~50 lines) — path manipulation and string duplication functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/nothings/stb" rel="noopener noreferrer"&gt;stb&lt;/a&gt;&lt;/strong&gt; (~9,700 lines) — Sean Barrett's famous single-header C libraries, including hash maps, dynamic arrays, and string containers that achieve generics through &lt;code&gt;void*&lt;/code&gt; type erasure and macro magic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/wazuh/wazuh" rel="noopener noreferrer"&gt;Wazuh&lt;/a&gt;&lt;/strong&gt; (~1,200 lines) — base64 encoding/decoding, UTF-8 handling, search-and-replace, file queue management from this security monitoring platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/RandyGaul/cute_headers" rel="noopener noreferrer"&gt;cute_c2&lt;/a&gt;&lt;/strong&gt; (~6,000 lines) — a 2D collision detection library implementing the GJK algorithm, AABB tests, capsule collisions, ray casting, and manifold generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/anael-seghezzi/Maratis-Tiny-C-library" rel="noopener noreferrer"&gt;Maratis Tiny C Library&lt;/a&gt;&lt;/strong&gt; (~3,300 lines) — image loading (PNG), pixel format conversion, inflate decompression, and agglomerative clustering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/id-Software/Quake-III-Arena" rel="noopener noreferrer"&gt;Quake III Arena&lt;/a&gt;&lt;/strong&gt; (~2,300 lines) — the legendary &lt;code&gt;q_math.c&lt;/code&gt; (with a certain famous expletive redacted from the comments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/nothings/stb" rel="noopener noreferrer"&gt;stb_perlin&lt;/a&gt;&lt;/strong&gt; (~500 lines) — Ken Perlin's improved noise function from stb, covering 3D noise, fractal Brownian motion, ridge noise, and turbulence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/facebook/zstd" rel="noopener noreferrer"&gt;Zstandard (zstd)&lt;/a&gt;&lt;/strong&gt; (~100 lines) — filename generation and line-pointer utilities from Meta's fast compression library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/cr-bgbcc" rel="noopener noreferrer"&gt;BTAC1C&lt;/a&gt;&lt;/strong&gt; (~550 lines) — audio sample prediction functions from Brendan Bohannon's audio compression codec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt;&lt;/strong&gt; (~5,000 lines) — a NIST-standardized post-quantum signature scheme (FIPS 205 / SLH-DSA) combining WOTS+, FORS, and Merkle trees with BLAKE hash backend&lt;/li&gt;
&lt;li&gt;Entries from the &lt;strong&gt;&lt;a href="http://www.underhanded-c.org/" rel="noopener noreferrer"&gt;Underhanded C Contest&lt;/a&gt;&lt;/strong&gt; (~200 lines) — deliberately deceptive code designed to hide malicious behavior in innocent-looking C&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How the Agent Works
&lt;/h2&gt;

&lt;p&gt;Each transpilation follows a 10-step pipeline, entirely automated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt; all C source files and test vectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile&lt;/strong&gt; the C source with cmake to verify it builds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create&lt;/strong&gt; a Rust project (edition 2024) and transpile to idiomatic, memory-safe Rust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; unit tests (no mocks — real inputs and outputs only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; I/O equivalence by FFI-calling both C and Rust and comparing results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt; tests from the JSON test vectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create&lt;/strong&gt; C-compatible FFI wrappers (&lt;code&gt;#[unsafe(no_mangle)] pub extern "C" fn&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove&lt;/strong&gt; all &lt;code&gt;unsafe&lt;/code&gt; from core logic, confining it to FFI boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt; Rust vs C performance (mean over 100 executions) and memory overhead (peak RSS via GNU &lt;code&gt;time -v&lt;/code&gt;, heap/stack profiling via valgrind massif and DHAT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write&lt;/strong&gt; a translation summary documenting everything&lt;/li&gt;
&lt;/ol&gt;
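&lt;p&gt;To make steps 7–8 concrete, here's a minimal sketch (the function names are illustrative, not taken from the actual translations): the core logic is safe Rust, and the only &lt;code&gt;unsafe&lt;/code&gt; lives in the C-compatible wrapper at the FFI boundary.&lt;/p&gt;

```rust
/// Safe core logic: sum the bytes of a slice.
fn checksum(data: &[u8]) -> u32 {
    data.iter().map(|&b| u32::from(b)).sum()
}

/// C-compatible wrapper. Rust 2024 spells the attribute
/// `#[unsafe(no_mangle)]`; `#[no_mangle]` is the pre-2024 form.
#[no_mangle]
pub extern "C" fn checksum_ffi(ptr: *const u8, len: usize) -> u32 {
    if ptr.is_null() {
        return 0;
    }
    // The single unsafe block: trusting the caller's pointer/length pair.
    let data = unsafe { std::slice::from_raw_parts(ptr, len) };
    checksum(data)
}
```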

&lt;p&gt;The agent works autonomously and decides how to tackle issues at every step. When it hits a compile error or test failure, it reads the error, diagnoses the issue, fixes the code, and retries. Across all 46 translation tasks, it made ~4,500 tool calls — 35% shell commands (cmake, cargo), 22% file writes, 18% file reads, and 8% edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration: No Subagents, Just Focused Sessions
&lt;/h3&gt;

&lt;p&gt;Looking at the raw agent logs for cJSON, Perlin Noise, and SPHINCS+ reveals something I didn't expect: the agent never spawns subagents. Despite having KISS's &lt;code&gt;Task&lt;/code&gt; tool available — which can delegate work to specialized child agents — every project is handled by a single KISS session working directly with its tools. No delegation, no fan-out, no divide-and-conquer.&lt;/p&gt;

&lt;p&gt;The architecture is two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outer orchestrator.&lt;/strong&gt; A Python script scans for projects, iterates through them sequentially, and launches a fresh KISS session for each one. Each session gets its own context window with no shared state between projects. If a transpilation fails, the script retries with a fresh session (up to 6 attempts) — though across all 46 translations, every one succeeded on the first attempt.&lt;/p&gt;
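&lt;p&gt;A minimal Rust sketch of that retry loop (the real orchestrator is a Python script; &lt;code&gt;run_fresh_session&lt;/code&gt; is a hypothetical stand-in for launching a KISS session with its own context window):&lt;/p&gt;

```rust
fn run_fresh_session(project: &str, attempt: u32) -> bool {
    // Stand-in: in the experiment, every project succeeded on attempt 1.
    let _ = (project, attempt);
    true
}

fn transpile_all(projects: &[&str]) -> Vec<(String, bool)> {
    projects
        .iter()
        .map(|p| {
            // Fresh session per attempt, up to 6 attempts per project.
            let ok = (1..=6).any(|attempt| run_fresh_session(p, attempt));
            (p.to_string(), ok)
        })
        .collect()
}
```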

&lt;p&gt;&lt;strong&gt;Inner agent.&lt;/strong&gt; A single KISS instance that receives the 10-step pipeline as a system prompt, then works through it autonomously using direct tool calls. Within each session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aggressive tool call batching.&lt;/strong&gt; Nearly every turn fires 3–8 parallel operations. The Perlin Noise session opens with 2 parallel glob patterns to discover source files and test vectors simultaneously, then immediately fires 3 parallel file reads, then 3 more test vector reads — all in the first 3 turns. The SPHINCS+ session reads 4 source files per turn during its initial analysis phase (20+ files across ~5 turns).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;TodoWrite&lt;/code&gt; as working memory.&lt;/strong&gt; Every session maintains a structured checklist of the 10 pipeline steps, updating statuses in real time. This gives the agent an explicit record of what's done and what's next — important because the context window is finite and the agent can't always "remember" what it did 50 turns ago.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context compaction for long sessions.&lt;/strong&gt; When a session exceeds the context window limit (~157K tokens), KISS automatically compacts the conversation into a structured summary and the agent continues from where it left off. Perlin Noise (96 turns) never compacted. cJSON (109 turns) compacted once — mid-way through FFI test creation. SPHINCS+ (265 turns) compacted twice: once while fixing &lt;code&gt;static mut&lt;/code&gt; references, and again while debugging the BLAKE byte-vs-bit count issue. After each compaction, the agent reads its own previously-generated files to re-orient rather than relying on conversation memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No planning phase.&lt;/strong&gt; The agent never enters Plan mode or pauses to reason about &lt;em&gt;how&lt;/em&gt; to decompose the problem. It reads the C code, builds it, creates the Rust project, writes the translation, and iterates on errors — all in a single continuous loop. The 10-step pipeline structure comes entirely from the system prompt; the agent follows it implicitly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Session sizes varied predictably with project complexity. The 44 simpler library translations averaged 62 turns each (range: 41–96). cJSON required 109 turns due to its recursive tree structure and the UTF-8 byte accumulation fix. SPHINCS+ was the outlier at 265 turns — the combination of 20+ C source files, Rust 2024's &lt;code&gt;static mut&lt;/code&gt; prohibition, and the BLAKE byte/bit count debugging pushed it to nearly 4x the typical session length.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notable Findings
&lt;/h3&gt;

&lt;p&gt;Beyond just translating code, the agent surfaced architectural insights and potential bugs in the original C codebases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unnecessary &lt;code&gt;Rc&amp;lt;RefCell&amp;lt;&amp;gt;&amp;gt;&lt;/code&gt; in cJSON.&lt;/strong&gt; The agent initially reached for &lt;code&gt;Rc&amp;lt;RefCell&amp;lt;CJson&amp;gt;&amp;gt;&lt;/code&gt; to model &lt;a href="https://github.com/DaveGamble/cJSON" rel="noopener noreferrer"&gt;cJSON&lt;/a&gt;'s tree of &lt;code&gt;next&lt;/code&gt;/&lt;code&gt;prev&lt;/code&gt;/&lt;code&gt;child&lt;/code&gt; pointers. When I had it investigate whether that was actually necessary, it found zero &lt;code&gt;Rc::clone()&lt;/code&gt; calls anywhere — items were never shared between multiple owners. The entire &lt;code&gt;Rc&amp;lt;RefCell&amp;lt;&amp;gt;&amp;gt;&lt;/code&gt; layer was unnecessary overhead, driven by the agent's defensive stance on C pointer aliasing patterns that didn't actually exist in this codebase. Removing it made the Rust translation both simpler and faster than the C original (see Performance below).&lt;/p&gt;

&lt;p&gt;This pattern repeated across several projects: C codebases had aliasing patterns that were technically safe but relied entirely on programmer discipline. Rust's borrow checker forced the agent to make ownership explicit, which revealed that many assumed-complex pointer relationships were actually simple tree structures or single-owner patterns.&lt;/p&gt;
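&lt;p&gt;A simplified sketch of the ownership insight (not the project's actual types): once no node is shared between owners, the &lt;code&gt;next&lt;/code&gt;/&lt;code&gt;prev&lt;/code&gt;/&lt;code&gt;child&lt;/code&gt; pointer web collapses into a plain tree of owned children.&lt;/p&gt;

```rust
enum Json {
    Number(f64),
    Text(String),
    Array(Vec<Json>), // each child has exactly one owner: its parent
}

fn count_nodes(v: &Json) -> usize {
    match v {
        Json::Array(items) => 1 + items.iter().map(count_nodes).sum::<usize>(),
        _ => 1,
    }
}
```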

&lt;p&gt;&lt;strong&gt;The BLAKE hash bug in SPHINCS+.&lt;/strong&gt; While translating the &lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt; post-quantum signature scheme (&lt;a href="https://csrc.nist.gov/pubs/fips/205/final" rel="noopener noreferrer"&gt;FIPS 205 / SLH-DSA&lt;/a&gt;), the agent noticed that the KAT test driver code passes &lt;strong&gt;byte counts&lt;/strong&gt; to &lt;code&gt;blake256_update()&lt;/code&gt; where the function signature expects &lt;strong&gt;bit counts&lt;/strong&gt; — effectively processing only 1/8th of the intended data per update call. We introduced this bug deliberately to test whether the agent would catch semantic mismatches between function signatures and call sites. The agent flagged it immediately but faithfully reproduced the behavior for I/O equivalence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// C code: blakeX_update(ctx, data, SPX_N);  // SPX_N=16, expects bits&lt;/span&gt;
&lt;span class="c1"&gt;// Rust:   state.update(data, SPX_N as u64); // Matching C's byte-vs-bit behavior&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The KAT (Known Answer Test) digest matches exactly: &lt;code&gt;97B452A98F312321D982CDE133B1BF6D7189DC0A9296338C9A823A6689670584&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eliminating &lt;code&gt;unsafe&lt;/code&gt; from SPHINCS+.&lt;/strong&gt; The agent initially used &lt;code&gt;unsafe&lt;/code&gt; blocks in the address and thash modules, then successfully replaced them with safe alternatives using explicit &lt;code&gt;to_ne_bytes&lt;/code&gt;/&lt;code&gt;from_ne_bytes&lt;/code&gt; conversions — trading some performance for full memory safety in the core library.&lt;/p&gt;
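&lt;p&gt;A sketch of that safe-conversion pattern (the buffer layout is illustrative, not SPHINCS+'s actual address encoding): where C reinterprets a byte buffer as words via pointer casts, safe Rust round-trips through &lt;code&gt;to_ne_bytes&lt;/code&gt;/&lt;code&gt;from_ne_bytes&lt;/code&gt;.&lt;/p&gt;

```rust
/// Write a u32 into a byte buffer at `offset`, safely, in native byte order.
fn set_word(buf: &mut [u8; 32], offset: usize, value: u32) {
    buf[offset..offset + 4].copy_from_slice(&value.to_ne_bytes());
}

/// Read the u32 back out; no pointer casts, no `unsafe`.
fn get_word(buf: &[u8; 32], offset: usize) -> u32 {
    let mut w = [0u8; 4];
    w.copy_from_slice(&buf[offset..offset + 4]);
    u32::from_ne_bytes(w)
}
```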

&lt;h3&gt;
  
  
  Debugging Patterns I Observed
&lt;/h3&gt;

&lt;p&gt;The agent developed consistent debugging strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never trusting its own test expectations.&lt;/strong&gt; When a Rust test failed, it compiled and ran the C version to establish ground truth before fixing the Rust code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creating minimal reproduction programs.&lt;/strong&gt; For integer overflow edge cases, it wrote one-off C programs to test &lt;code&gt;INT_MIN / -1&lt;/code&gt; behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone-operate-reassign for borrow checker issues.&lt;/strong&gt; When hitting &lt;code&gt;cannot borrow as mutable because it is also borrowed as immutable&lt;/code&gt;, it consistently applied the pattern of cloning the data, operating on the clone, then reassigning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Byte accumulation for UTF-8.&lt;/strong&gt; It learned not to cast individual bytes to &lt;code&gt;char&lt;/code&gt; when parsing strings containing multibyte characters (this came up with Japanese text in JSON).&lt;/li&gt;
&lt;/ul&gt;
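&lt;p&gt;The clone-operate-reassign pattern on a toy task (collapsing adjacent duplicates), sketched under the assumption that a direct in-place rewrite would trip the borrow checker:&lt;/p&gt;

```rust
fn dedup_adjacent(items: &mut Vec<i32>) {
    let snapshot = items.clone(); // clone ...
    let mut out = Vec::with_capacity(snapshot.len());
    for &x in &snapshot {
        // ... operate on the clone ...
        if out.last() != Some(&x) {
            out.push(x);
        }
    }
    *items = out; // ... reassign
}
```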

&lt;h3&gt;
  
  
  Testing Strategy
&lt;/h3&gt;

&lt;p&gt;Every translation includes a comprehensive test suite with three tiers of verification:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Unit Tests&lt;/strong&gt; — verify individual functions and edge cases in isolation. The agent writes these while implementing each module, covering boundary conditions, error paths, and algorithmic correctness. Test counts range from 9 to 50+ per project depending on complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. FFI Equivalence Tests&lt;/strong&gt; — the critical layer. For each public API function, the agent compiles both the C original and Rust translation, calls them via FFI with identical inputs, and asserts byte-for-byte output equivalence. This catches subtle behavioral divergence (floating-point rounding, string encoding, integer overflow) that unit tests might miss. Typical coverage: 6–19 equivalence tests per project.&lt;/p&gt;
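&lt;p&gt;The shape of such an equivalence test, made self-contained for illustration: in the real harness the reference output comes from the compiled C library through an &lt;code&gt;extern "C"&lt;/code&gt; block, so the &lt;code&gt;c_reference_hex&lt;/code&gt; stand-in below is hypothetical.&lt;/p&gt;

```rust
fn c_reference_hex(input: &[u8]) -> String {
    // Stand-in for the FFI call into the compiled C original.
    input.iter().map(|b| format!("{:02x}", b)).collect()
}

fn rust_hex(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len() * 2);
    for b in input {
        out.push_str(&format!("{:02x}", b));
    }
    out
}

fn assert_equivalent(input: &[u8]) {
    // Identical inputs on both sides, byte-for-byte output equivalence.
    assert_eq!(c_reference_hex(input).into_bytes(), rust_hex(input).into_bytes());
}
```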

&lt;p&gt;&lt;strong&gt;3. Test Vector Tests&lt;/strong&gt; — validate against the provided JSON specifications. Each project ships with test vectors defining expected behavior; the agent parses these and generates corresponding test cases. Counts range from 3 to 30 vectors per project.&lt;/p&gt;

&lt;p&gt;Aggregate statistics across the 46 translations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit Tests&lt;/td&gt;
&lt;td&gt;~1,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FFI Equivalence Tests&lt;/td&gt;
&lt;td&gt;~400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Vector Tests&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1,800 tests&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three projects received detailed benchmarking (cJSON, stb_perlin, SPHINCS+) with additional performance and memory profiling suites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Test Generation Gap.&lt;/strong&gt; Early experiments with dedicated test-generation subagents revealed that specialized agents produce 5–6× more comprehensive test suites than the single orchestrator approach used here. When a subagent's sole responsibility is "write exhaustive tests for this module," it explores corner cases (negative numbers, empty inputs, Unicode edge cases, allocation failures) that the orchestrator skips in favor of "happy path" coverage sufficient to pass the test vectors. The trade-off: test generation time increases proportionally, and most of that expanded coverage exercises unlikely edge cases. For this experiment, the orchestrator-only approach prioritized translation speed over test comprehensiveness, relying on the FFI equivalence layer to catch behavioral mismatches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future work:&lt;/strong&gt; Test coverage could definitely be improved. We're planning to integrate a coverage tracking framework (like &lt;code&gt;cargo-llvm-cov&lt;/code&gt;) and deploy a specialized test-generation subagent to systematically target uncovered branches and edge cases, aiming for &amp;gt;90% line coverage across all translations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Projects&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;12&lt;/strong&gt; (46 translation tasks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Success Rate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (46/46)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Cost&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$173.73&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total Time&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~8 hours&lt;/strong&gt; (~10.4 min/task)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C Lines In&lt;/td&gt;
&lt;td&gt;~28,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust Lines Out&lt;/td&gt;
&lt;td&gt;~37,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean Cost/Task&lt;/td&gt;
&lt;td&gt;$3.78&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cost per translation ranged from $1.29 to $18.81. The median was around $3.00. The most expensive translation was &lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt; at $18.81 — its 20+ source files and complex cryptographic pipeline required the longest sessions. Among the library translations, &lt;a href="https://github.com/DaveGamble/cJSON" rel="noopener noreferrer"&gt;cJSON&lt;/a&gt; was the costliest at $10.07 — its self-referential tree structure with doubly-linked sibling lists required the most iteration. The cheapest was a &lt;a href="https://github.com/wazuh/wazuh" rel="noopener noreferrer"&gt;Wazuh&lt;/a&gt; search-and-replace function at $1.29.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The economics are compelling.&lt;/strong&gt; At $174 for 12 projects (~28,000 lines of C), a single hour of senior developer time costs more than the median per-translation cost. Whether the output is production-ready is another question — but as a starting point for human review and a forcing function to surface architectural decisions (data structure choices, ownership patterns, API boundaries), the cost is hard to argue with.&lt;/p&gt;

&lt;p&gt;Every translation passed its full test suite. The Perlin Noise translation achieved exact f32 numerical matching across all 30 test vectors — no floating-point divergence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;We benchmarked three projects in detail — cJSON (data-structure-heavy), stb_perlin (pure computation), and SPHINCS+ (cryptographic workload) — and compared the Rust translations against their C originals. All benchmarks were run with CPU frequency scaling disabled, and both sides were compiled with &lt;code&gt;-O3&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime Speed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;cJSON.&lt;/strong&gt; 100 individually-timed iterations, 10 warmup. C compiled via cmake; Rust compiled with &lt;code&gt;--release&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;C (mean +/- stddev)&lt;/th&gt;
&lt;th&gt;Rust (mean +/- stddev)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple JSON Parse&lt;/td&gt;
&lt;td&gt;1.04 us +/- 747 ns&lt;/td&gt;
&lt;td&gt;659 ns +/- 37 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 36% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex JSON Parse&lt;/td&gt;
&lt;td&gt;10.98 us +/- 6.58 us&lt;/td&gt;
&lt;td&gt;5.90 us +/- 749 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 46% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large Array Parse (1000 elem)&lt;/td&gt;
&lt;td&gt;138.35 us +/- 8.71 us&lt;/td&gt;
&lt;td&gt;84.89 us +/- 7.46 us&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 39% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple JSON Print&lt;/td&gt;
&lt;td&gt;448 ns +/- 247 ns&lt;/td&gt;
&lt;td&gt;284 ns +/- 45 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 37% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex JSON Print&lt;/td&gt;
&lt;td&gt;2.67 us +/- 1.47 us&lt;/td&gt;
&lt;td&gt;2.78 us +/- 2.64 us&lt;/td&gt;
&lt;td&gt;Rust 4% slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Object Creation&lt;/td&gt;
&lt;td&gt;609 ns +/- 288 ns&lt;/td&gt;
&lt;td&gt;403 ns +/- 123 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 34% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array Creation (10 elem)&lt;/td&gt;
&lt;td&gt;1.22 us +/- 1.77 us&lt;/td&gt;
&lt;td&gt;927 ns +/- 224 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 24% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Array Creation (100 elem)&lt;/td&gt;
&lt;td&gt;11.31 us +/- 5.55 us&lt;/td&gt;
&lt;td&gt;5.62 us +/- 1.45 us&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust 50% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rust is faster in &lt;strong&gt;7 of 8 benchmarks&lt;/strong&gt;, often by 30–50%. Two things stand out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rust has much lower variance.&lt;/strong&gt; Look at Simple JSON Parse: C's stddev is 747 ns (72% of the mean), while Rust's is 37 ns (6% of the mean). The Rust binary's performance is dramatically more predictable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The speedup comes from data structure choice, not language overhead.&lt;/strong&gt; C's cJSON traverses linked lists (&lt;code&gt;next&lt;/code&gt;/&lt;code&gt;prev&lt;/code&gt;/&lt;code&gt;child&lt;/code&gt; pointers) while the Rust translation uses &lt;code&gt;Vec&lt;/code&gt;, which has better cache locality. Rust also eliminates per-node &lt;code&gt;strlen()&lt;/code&gt; calls since &lt;code&gt;String&lt;/code&gt; tracks its own length, and &lt;code&gt;Vec&lt;/code&gt; amortizes allocations where C does many small &lt;code&gt;malloc&lt;/code&gt; calls.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
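&lt;p&gt;A toy contrast of the two layouts (types are illustrative, not the translation's actual ones): a &lt;code&gt;Vec&lt;/code&gt;-based node stores its children contiguously, and &lt;code&gt;String&lt;/code&gt; carries its own length, so traversal needs neither pointer chasing nor per-node &lt;code&gt;strlen()&lt;/code&gt;.&lt;/p&gt;

```rust
struct Node {
    key: String,         // key.len() is O(1); C recomputes strlen() per visit
    children: Vec<Node>, // contiguous storage, cache-friendly iteration
}

fn total_key_bytes(n: &Node) -> usize {
    n.key.len() + n.children.iter().map(total_key_bytes).sum::<usize>()
}
```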

&lt;p&gt;&lt;strong&gt;Perlin Noise.&lt;/strong&gt; 100 iterations, 5 warmup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;C (mean)&lt;/th&gt;
&lt;th&gt;Rust (mean)&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;noise3&lt;/code&gt; (1M samples)&lt;/td&gt;
&lt;td&gt;36.06 ms&lt;/td&gt;
&lt;td&gt;43.13 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+19.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;fbm_noise3&lt;/code&gt; (100K samples, 6 octaves)&lt;/td&gt;
&lt;td&gt;22.26 ms&lt;/td&gt;
&lt;td&gt;26.44 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;ridge_noise3&lt;/code&gt; (100K samples, 6 octaves)&lt;/td&gt;
&lt;td&gt;22.74 ms&lt;/td&gt;
&lt;td&gt;26.98 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+18.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A consistent ~19% overhead across all three functions. The same ratio held at &lt;code&gt;-O2&lt;/code&gt;, suggesting it's intrinsic to the translation — most likely Rust's bounds-checked array accesses on the permutation table lookups in the hot path.&lt;/p&gt;
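&lt;p&gt;One common mitigation, sketched here without having verified it against this translation: masking the index into a 256-entry permutation table lets the compiler prove every access is in bounds, so the bounds check can be elided.&lt;/p&gt;

```rust
fn perm_hash3(perm: &[u8; 256], x: usize, y: usize, z: usize) -> u8 {
    // `& 255` pins each index to 0..=255, provably in bounds for [u8; 256].
    let a = perm[x & 255] as usize;
    let b = perm[(a + y) & 255] as usize;
    perm[(b + z) & 255]
}
```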

&lt;p&gt;&lt;strong&gt;SPHINCS+.&lt;/strong&gt; 100 iterations, 2 warmup. C compiled with GCC; Rust compiled with &lt;code&gt;--release&lt;/code&gt;. Both use the same deterministic RNG seed and message:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;C (per op)&lt;/th&gt;
&lt;th&gt;Rust (per op)&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Keypair Generation&lt;/td&gt;
&lt;td&gt;2.97 ms&lt;/td&gt;
&lt;td&gt;3.17 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signing&lt;/td&gt;
&lt;td&gt;68.64 ms&lt;/td&gt;
&lt;td&gt;81.89 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+19.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;4.29 ms&lt;/td&gt;
&lt;td&gt;4.58 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Signing shows the largest overhead (~19%), likely due to Rust's safe byte conversions in the inner hash loops and the &lt;code&gt;Mutex&lt;/code&gt;-protected global RNG state. Keypair generation and verification are within ~7% of C, indicating the core Merkle tree and FORS logic translate efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary Sizes
&lt;/h3&gt;

&lt;p&gt;The Rust binaries are &lt;em&gt;dramatically&lt;/em&gt; larger than their C counterparts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Artifact&lt;/th&gt;
&lt;th&gt;C&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;th&gt;Binary Ratio&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;.text&lt;/code&gt; Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/DaveGamble/cJSON" rel="noopener noreferrer"&gt;cJSON&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.so&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.0 KB&lt;/td&gt;
&lt;td&gt;459.3 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;32.7x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.so&lt;/code&gt; × 2&lt;/td&gt;
&lt;td&gt;72.2 KB&lt;/td&gt;
&lt;td&gt;377.2 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.2x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/nothings/stb" rel="noopener noreferrer"&gt;stb_perlin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;executable&lt;/td&gt;
&lt;td&gt;18.4 KB&lt;/td&gt;
&lt;td&gt;416.8 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.7x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;executable&lt;/td&gt;
&lt;td&gt;18.2 KB&lt;/td&gt;
&lt;td&gt;445.6 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.5x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;46.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All artifacts stripped and compiled with &lt;code&gt;-O3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.text&lt;/code&gt; ratio column tells the story. The small C libraries — cJSON (4.7 KB of code) and stb_perlin (4.0 KB) — show extreme ratios (74–93x) because Rust statically links the standard library, panic infrastructure, formatting machinery, and the memory allocator into every artifact. These have no equivalent in the C version, which relies on the system providing libc.&lt;/p&gt;

&lt;p&gt;SPHINCS+ shows the impact of code size on overhead ratios. Its shared library has a much more reasonable 6.5x &lt;code&gt;.text&lt;/code&gt; ratio because the C code is already substantial (42 KB across two &lt;code&gt;.so&lt;/code&gt; files), diluting the fixed Rust runtime overhead. But the SPHINCS+ executable shows 46x &lt;code&gt;.text&lt;/code&gt; overhead — closer to the small libraries — because the C driver is tiny (9.4 KB) and dynamically links its dependencies, while Rust statically links everything.&lt;/p&gt;

&lt;p&gt;The practical takeaway: Rust's binary size overhead is dominated by fixed costs (stdlib, panic handling, allocator). For small C codebases, expect 20–90x inflation. For larger projects (&amp;gt;40 KB of C code), the ratio drops to 5–10x as application code dominates. For server-side applications or CLIs, the extra few hundred KB is usually irrelevant; for embedded systems or shared libraries in size-constrained environments, it's worth planning for.&lt;/p&gt;
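&lt;p&gt;For size-constrained targets, Cargo's standard release-profile knobs can claw back much of that fixed overhead. The figures above were measured without them, so treat this as a mitigation sketch rather than a measured result:&lt;/p&gt;

```toml
# Cargo.toml: standard size-reduction settings for release builds.
[profile.release]
opt-level = "z"    # optimize for size instead of speed
lto = true         # link-time optimization across all crates
codegen-units = 1  # better cross-unit optimization, slower builds
panic = "abort"    # drop unwinding tables and panic machinery
strip = true       # strip symbols at link time
```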

&lt;h3&gt;
  
  
  Runtime Memory
&lt;/h3&gt;

&lt;p&gt;Profiled with valgrind (massif + DHAT) and GNU &lt;code&gt;time -v&lt;/code&gt;, both sides at &lt;code&gt;-O3&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Peak RSS Overhead&lt;/th&gt;
&lt;th&gt;Peak Heap Ratio&lt;/th&gt;
&lt;th&gt;Alloc Churn Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/DaveGamble/cJSON" rel="noopener noreferrer"&gt;cJSON&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.97x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.08x bytes, 1.75x blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://sphincs.org/" rel="noopener noreferrer"&gt;SPHINCS+&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+13%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.15x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;118x bytes, 133x blocks&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/nothings/stb" rel="noopener noreferrer"&gt;stb_perlin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+51%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.29x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.61x bytes, 17x blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
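&lt;p&gt;Concretely, the numbers above come from runs along these lines (&lt;code&gt;./prog_c&lt;/code&gt; and &lt;code&gt;./prog_rs&lt;/code&gt; are placeholder binary names, not the project's actual harness):&lt;/p&gt;

```shell
# Peak RSS — GNU time, not the shell builtin:
/usr/bin/time -v ./prog_c  2>&1 | grep 'Maximum resident'
/usr/bin/time -v ./prog_rs 2>&1 | grep 'Maximum resident'

# Peak live heap over time (massif), then summarize the snapshots:
valgrind --tool=massif ./prog_rs
ms_print massif.out.*

# Allocation churn — total bytes/blocks ever allocated (DHAT);
# writes dhat.out.<pid>, viewed with Valgrind's dh_view.html:
valgrind --tool=dhat ./prog_rs
```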

&lt;p&gt;The story differs sharply by workload:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cJSON&lt;/strong&gt; is the best case. Peak live heap is virtually identical (~25 KB) — the Rust &lt;code&gt;Rc&amp;lt;RefCell&amp;lt;&amp;gt;&amp;gt;&lt;/code&gt; tree and the C linked-list tree hold the same data at the same cost. Rust just fragments it into more, smaller allocations (1.75x block count). RSS overhead is a modest 13%.&lt;/p&gt;
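&lt;p&gt;The "more, smaller allocations" pattern falls out of the representation. A minimal sketch of a cJSON-style tree in Rust (names and shape are assumptions for illustration, not the agent's actual output):&lt;/p&gt;

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shared, mutable tree node: each node is its own Rc heap allocation,
// and each Array holds a separate Vec of child handles — hence more,
// smaller heap blocks than C's intrusive next/prev linked list.
#[derive(Debug)]
enum Json {
    Number(f64),
    Str(String),
    Array(Vec<Rc<RefCell<Json>>>),
}

type Node = Rc<RefCell<Json>>;

fn node(j: Json) -> Node {
    Rc::new(RefCell::new(j))
}

fn array_len(n: &Node) -> usize {
    match &*n.borrow() {
        Json::Array(items) => items.len(),
        _ => 0,
    }
}
```

&lt;p&gt;In C, every cJSON node is one &lt;code&gt;malloc&lt;/code&gt; and children are chained through pointers stored inside the node itself; here each &lt;code&gt;Rc&lt;/code&gt; and each &lt;code&gt;Vec&lt;/code&gt; buffer is its own allocation, consistent with a higher block count at near-identical total bytes.&lt;/p&gt;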

&lt;p&gt;&lt;strong&gt;stb_perlin&lt;/strong&gt; is pure computation — no application-level heap use at all. The C driver makes exactly 2 allocations (libc stdio buffers). The entire 51% RSS overhead is the Rust runtime itself: the statically linked stdlib, panic/unwind tables, and formatting machinery. This is the worst-case scenario for runtime memory: a tiny program where the fixed overhead dominates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPHINCS+&lt;/strong&gt; is the most interesting. The 118x allocation churn comes from Rust's use of &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;/code&gt; for intermediate cryptographic buffers, whereas C uses fixed-size stack arrays or global &lt;code&gt;.bss&lt;/code&gt; buffers (C's &lt;code&gt;.bss&lt;/code&gt; is 35 KB vs Rust's 4 KB). Despite this churn, peak live heap is only 15% higher — Rust promptly frees temporaries. The obvious optimization: convert hot-path &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;/code&gt; allocations to fixed-size &lt;code&gt;[u8; N]&lt;/code&gt; stack arrays, which would likely close both the allocation gap and the 19% overhead in signing performance.&lt;/p&gt;
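&lt;p&gt;For concreteness, here is the shape of that optimization on a stand-in chain step (the byte-increment "hash" and the 32-byte width are placeholders, not SPHINCS+ internals):&lt;/p&gt;

```rust
// Heap version: one Vec allocation per call, plus one per round via collect().
fn chain_heap(input: &[u8; 32], rounds: usize) -> Vec<u8> {
    let mut buf = input.to_vec();
    for _ in 0..rounds {
        buf = buf.iter().map(|b| b.wrapping_add(1)).collect();
    }
    buf
}

// Stack version: the buffer is a fixed-size array, updated in place —
// zero allocator traffic regardless of the round count.
fn chain_stack(input: &[u8; 32], rounds: usize) -> [u8; 32] {
    let mut buf = *input;
    for _ in 0..rounds {
        for b in buf.iter_mut() {
            *b = b.wrapping_add(1);
        }
    }
    buf
}
```

&lt;p&gt;The two versions are byte-for-byte equivalent; the stack variant simply removes every per-round trip through the allocator, which is exactly the gap the churn numbers point at.&lt;/p&gt;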




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;These 12 projects were a proving ground — small-to-medium C libraries and programs, most under a few thousand lines. The results are encouraging, but the real test is scaling up. We're now turning the agent loose on larger real-world targets: codebases with complex struct hierarchies spanning dozens of files, intricate memory management patterns (custom allocators, arena-based allocation, reference-counted object graphs), tens of thousands of lines of C, and the kind of deeply intertwined module dependencies that make manual translation a multi-month effort.&lt;/p&gt;

&lt;p&gt;The questions we want to answer next: How does the agent handle codebases where no single file fits in its context window? Can it maintain coherent data structure translations across 50+ modules? What happens when the C code relies heavily on platform-specific behavior or inline assembly? And critically, does the cost-per-line stay reasonable, or does it explode with complexity?&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This work is led and executed by &lt;a href="https://www.sec.in.tum.de/i20/people/momeu-marius" rel="noopener noreferrer"&gt;Marius Momeu&lt;/a&gt; (Postdoctoral Researcher, UC Berkeley and PhD Candidate, TU Munich) under the supervision of &lt;a href="https://people.eecs.berkeley.edu/~ksen/" rel="noopener noreferrer"&gt;Koushik Sen&lt;/a&gt; (Professor, UC Berkeley), and supported in part by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00112590134.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>c</category>
    </item>
  </channel>
</rss>
