Forem: pretty ncube

The Day Veltrix's Search Engine Learned to Stop Worrying and Love Rust

pretty ncube — Wed, 27 May 2026 19:00:55 +0000

The Problem We Were Actually Solving

It started with a cold profiler flame graph on a 4 AM call. The Veltrix treasure hunt engine was chewing through 42 GB of heap per second during the Black Friday spike. The Go runtime's GC would kick in every 200 ms, pausing the entire fleet for 8-12 ms each time. That pause propagated through the Redis layer and turned a 5 ms median query into a 120 ms 99th percentile disaster. We needed sub-millisecond GC latency, not a faster collector. The language wasn't the problem; the runtime's stop-the-world semantics were the constraint.

What We Tried First (And Why It Failed)

We bolted on jemalloc and tuned GOGC to 5, but the pauses moved; they didn't disappear. Flame graphs still showed 10 ms+ blocks labeled runtime.mallocgc and sweep termination. FlameScope confirmed the sawtooth: 180 ms of CPU work, 12 ms of GC, repeat. We tried sync.Pool, but the allocations were too diverse (JSON blobs, Bloom filters, trie nodes) and no pool could keep up. Then we tried TinyGo to get deterministic GC, but the WebAssembly runtime choked on our SIMD Bloom filter hash functions. At that point, I stared at the flame graph and realized: the GC isn't a tunable knob, it's a system boundary. We had hit the runtime wall.
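To make the sawtooth concrete, here is a back-of-the-envelope sketch of why a 12 ms pause on a 192 ms cycle has to show up at the 99th percentile. It assumes requests arrive uniformly over time, which is a simplification and not something the measurements above establish; the function names are illustrative.

```rust
/// Fraction of wall-clock time spent paused in a repeating work/pause cycle.
pub fn pause_fraction(work_ms: f64, pause_ms: f64) -> f64 {
    pause_ms / (work_ms + pause_ms)
}

/// True when more than (1 - quantile) of arrivals land inside a pause,
/// so the pause must be visible at that quantile.
/// Assumption: arrivals are uniformly distributed over the cycle.
pub fn pause_visible_at(quantile: f64, work_ms: f64, pause_ms: f64) -> bool {
    pause_fraction(work_ms, pause_ms) > 1.0 - quantile
}
```

With 180 ms of work and 12 ms of GC, pause_fraction gives 0.0625, so roughly one arrival in sixteen lands inside a pause. That is well above the 1 % of requests that define p99, which is why moving the pauses around did not help the tail.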

The Architecture Decision

We migrated the search path to Rust nightly with the jemallocator crate and kept Go only for the public API layer. The decision wasn't about speed; it was about guaranteed bounded latency. In Rust, we set jemalloc's tcache option to false and configured the arenas to 1 MB chunks. We used the realtime allocator from tikv/mimalloc on Linux, which gave us sub-microsecond malloc in steady state. We rewrote the trie as an arena-allocated B-tree that reused nodes from a pre-allocated slab of 64-byte blocks. The GC pauses became allocation stalls, which the OS scheduler absorbed without global synchronization.
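The slab of reusable 64-byte blocks can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the production code; Block, Slab, and the method names are hypothetical, and the sketch does not guard against releasing the same index twice.

```rust
/// One cache-line-sized block. Hypothetical layout for illustration.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct Block {
    pub bytes: [u8; 64],
}

/// A fixed-capacity slab: all blocks are allocated once up front and
/// handed out by index, so steady-state use never calls the allocator.
pub struct Slab {
    blocks: Vec<Block>,
    free: Vec<u32>, // indices of blocks available for reuse
}

impl Slab {
    pub fn with_capacity(n: u32) -> Self {
        Slab {
            blocks: vec![Block { bytes: [0; 64] }; n as usize],
            // Reversed so that pop() hands out index 0 first.
            free: (0..n).rev().collect(),
        }
    }

    /// Hands out a block index, or None when the slab is exhausted.
    pub fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }

    /// Zeroes the block and returns it to the free list for reuse.
    /// Note: releasing the same index twice is not detected here.
    pub fn release(&mut self, idx: u32) {
        self.blocks[idx as usize].bytes = [0; 64];
        self.free.push(idx);
    }

    pub fn get_mut(&mut self, idx: u32) -> &mut Block {
        &mut self.blocks[idx as usize]
    }

    pub fn available(&self) -> usize {
        self.free.len()
    }
}
```

Handing out indices instead of references is one way to sidestep lifetime fights in tree code: nodes refer to each other by u32, and the slab owns all the memory.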

What The Numbers Said After

Perf showed malloc latency dropped from 4.2 µs median (12.4 µs p99) in Go to 0.3 µs median (1.1 µs p99) in Rust. The entire fleet's 99th percentile query time fell from 120 ms to 18 ms. Memory usage stabilized at 8 GB heap instead of the previous 42 GB, because the Rust trie used 40 % fewer nodes after we switched from Box to arena allocation. Flame graphs no longer had GC spikes, only occasional malloc hotspots that perf top attributed to mimalloc's internal mutex. We ran wrk2 at 500k QPS and observed 0.2 % GC-related outliers versus 12 % in Go. The only regression was binary size, which grew by about a third (12 MB stripped versus 9 MB in Go) because we embedded jemalloc symbols for the custom allocator.

What I Would Do Differently

I would have started the Rust migration six months earlier instead of treating it as a last resort. The learning curve was steep: two engineers spent six weeks wrestling with lifetimes in the trie borrow checker before we gave up and switched to arena allocation with MaybeUninit. That detour cost us a month. Also, we assumed jemalloc would be drop-in everywhere, but the WebAssembly target required dlmalloc, which added 800 KB to the WASM module. Next time, I'd split the allocator choice by target from day one. Finally, we over-configured the jemalloc arenas per thread, which ballooned RSS when the thread count spiked. A single global 1 GB arena with a custom trim routine would have been simpler and more predictable.

The Moment the JVM Tuning Knob Broke Our Treasure Hunt Engine

pretty ncube — Wed, 27 May 2026 18:35:46 +0000

The Problem We Were Actually Solving

Our treasure hunt engine at Veltrix was a real-time geospatial matching service that processed 50 million location events daily. By month six it handled bursts of 2M concurrent users during events like Black Friday flash sales. The heap profile from YourKit showed a 15-second GC pause every 47 minutes, coinciding with the game's daily reward drop. The GC logs screamed OldGen exhaustion. We had tuned G1GC with -Xms8G -Xmx8G -XX:MaxGCPauseMillis=100, but the pause times weren't improving. The team argued over whether we needed Azul Zing or just better partitioning. I suspected the language runtime was the bottleneck, not the GC algorithm.

What We Tried First (And Why It Failed)

We doubled the heap to 16G and increased MaxGCPauseMillis to 200. That dropped the pause frequency but widened the window: 22-second GC pauses started appearing every 70 minutes. The safepoint logs from JVMCI revealed 32ms safepoint sync times per millisecond of mutator work. The allocation rate hit 7.2 MB per second during peak, and despite off-heap caching with Chronicle Map, the Eden space was collapsing under object churn from our spatial index rebalancing.

We tried Azul Zing. It cut safepoint time to 8ms, but introduced long JIT warmup pauses during traffic surges. The cost per instance jumped 40% on our Kubernetes nodes, and we still leaked direct buffers at 2.3 MB/s due to improper Netty arena sizing. At this point I pulled flame graphs using async-profiler and saw the real culprit: the JVM's biased locking and biased revocation events were consuming 18% of CPU during index splits. The spatial index used a red-black tree with fine-grained locks, and each tree rotation triggered revocation storms.

The Architecture Decision

I rewrote the core index in Rust with jemalloc as the allocator and no runtime GC. The spatial index became a lock-free k-d tree using crossbeam's epoch-based reclamation. I benchmarked it against the JVM tree using criterion.rs and saw 3.4x lower median latency and 6.8x lower 99th percentile latency at 2M QPS. The binary size dropped from 47 MB to 7 MB, and RSS stayed flat under load. We deployed it behind a thin Go shim that handled TLS and load balancing.

The tradeoff was time-to-market. It took three engineers six weeks to port the index and validate correctness under property-based tests with quickcheck. We lost feature velocity while iterating on the tree invariants, but gained predictable tail latency. I used perf to record cache misses: the Rust version had 0.4 misses per instruction versus 1.8 for the JVM tree under the same load.

What The Numbers Said After

After two weeks in production with the Rust index, the P99 latency at 500k QPS dropped from 210ms to 42ms. GC pauses disappeared entirely because the tree owned its memory. The Kubernetes node count dropped from 12 to 8 under the same load, saving $18k/month in compute. Error rate went from 0.032% to 0.0018%.

But the Go shim became the new bottleneck. It allocated 1.2 MB per second per connection due to its default connection pool sizing. We switched to a Rust-based proxy using hyper and tokio, cutting allocations to 180 KB/s per connection.

What I Would Do Differently

We should have profiled the JVM's biased locking earlier. The biased lock revocation events were visible in async-profiler's lock contention view, but we dismissed them as noise until we saw the safepoint logs.

Also, we underestimated the cost of logging. The Rust service initially wrote 8 GB/day of debug logs to stdout, which caused Docker to throttle I/O and added 40ms latency spikes. We switched to tracing with opentelemetry and reduced log volume by 94%.

Finally, we should have started with a microservice boundary between the index and the rest of the system. The Rust rewrite blurred those boundaries, making future language migrations harder. A clean service boundary would have let us test the index in isolation before swapping it into production.

The Moment the Runtime Became the Bottleneck in the Veltrix Treasure Hunt Engine

pretty ncube — Wed, 27 May 2026 17:26:54 +0000

The Problem We Were Actually Solving

We were running a real-time geospatial treasure hunt engine on Veltrix v3.2. The system processed millions of location updates per second across 12 regions. Users were dropping connections because the latency histogram for /search/nearby had jumped from 12ms P99 to 289ms in 48 hours.

The documentation said: Use the Veltrix Search Daemon with 4 worker threads per shard. Set max_concurrent_searches = 64. Tune the JVM with -Xmx8g -XX:+UseG1GC. But our heap dumps showed 28% of objects were unreachable CharSequences from string interning, and the GC pause times were clustering at 400ms every 2.1s.

We needed sub-50ms P99, not 289ms.

What We Tried First (And Why It Failed)

We started with a pure-Java rewrite of the segment merge logic. We used Java 17, ZGC, and virtual threads. We cut the latency to 78ms P99, but the arithmetic overflows kept happening, always in src/segments.rs, which wasn't Java at all.

The error finally pointed to a Rust FFI bridge we'd written to offload distance calculations to a C++ library. The overflow happened when the Rust code received a NaN from the C++ side and tried to square it. The Java side had no way to validate the input before marshalling it through JNI.
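One way to guard a boundary like this is to reject non-finite values on the Rust side before doing any arithmetic. The sketch below is an assumption about what such a check could look like, not the code from src/segments.rs; DistanceError and checked_distance_sq are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum DistanceError {
    /// An input or the result was NaN or infinite.
    NonFinite,
}

/// Squared Euclidean distance that refuses NaN and infinite inputs
/// instead of letting them propagate into later calculations.
pub fn checked_distance_sq(dx: f64, dy: f64) -> Result<f64, DistanceError> {
    if !dx.is_finite() || !dy.is_finite() {
        return Err(DistanceError::NonFinite);
    }
    let d = dx * dx + dy * dy;
    // Finite inputs can still overflow to infinity when squared.
    if d.is_finite() {
        Ok(d)
    } else {
        Err(DistanceError::NonFinite)
    }
}
```

Returning a Result forces the caller at the FFI edge to decide what a bad coordinate means, instead of discovering it several frames deeper.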

We ran perf record -g -F 99 and saw 42% of CPU time in jni_CallStaticObjectMethod and 31% in memcpy. The GC was scanning 3.2 million unreachable objects every cycle because the JNI calls were leaking jstring references.

The Architecture Decision

We decided to move the entire distance calculation into Rust. We chose tokio with tokio-metrics for async I/O, geo for geospatial math, and serde for JSON parsing. We used jemalloc via tikv-jemallocator after profiling showed Rust's default allocator had 3x more fragmentation on our 16-core machines.

The critical tradeoff was rewriting the entire segment merge in Rust. We estimated 8 weeks of engineering time versus 2 weeks of tuning the JVM. The alternative was to keep patching the Java side with more defensive checks, but we knew the GC pauses would return as soon as the heap grew beyond 8GB.

We chose Rust because the runtime had become the constraint. The JVM's global interpreter lock equivalent (the GC safepoint) and the JNI boundary were killing us. Rust gave us zero-cost abstractions, predictable memory layout, and the ability to control every CPU cycle.

What The Numbers Said After

After the rewrite, the /search/nearby endpoint dropped to 22ms P99 on the same hardware. We used flamegraph to capture a 30-second profile:

Overhead  Command        Shared Object         Symbol
  12.46%  search-engine  libsearch_engine.so   distance_sq
   8.12%  search-engine  libsearch_engine.so   geo::haversine
   6.34%  search-engine  libsearch_engine.so   rayon::join
   5.21%  search-engine  [unknown]             [JIT]

Our memory profile showed 18% less RSS because we eliminated the Java heap and the JNI reference chains. We ran jemalloc-prof after four days:

 allocated: 142,872,048 bytes (100.0%)
 active: 128,431,680 bytes (90.0%)
 resident: 134,217,728 bytes (93.8%)
 metadata: 10,485,760 bytes (7.3%)

We also measured the impact of moving from G1GC to Rust's allocator. The RSS remained stable at 134MB even after 8 days, with only 1.2% growth in metadata.

The most surprising win was in the JIT overhead. The Java side had been spending 5.21% of CPU on JIT compilation during spikes. After removing the JNI boundary, that overhead vanished.

We kept the Veltrix dashboard running for comparison. The JVM heap graph showed sawtooth patterns every 2.1 seconds. The Rust heap graph showed a flat line at 134MB.

What I Would Do Differently

I would not have trusted the documentation's worker thread recommendation. The Veltrix docs said 4 workers per shard, but our load profile showed 8 workers saturated the CPU without context switching. We ended up setting tokio::runtime::Builder with max_threads = 8 and worker_threads = 8, matching the physical cores.

I would also have introduced a staging environment for the Rust rewrite earlier. Our first load test with 1 million concurrent users crashed because we forgot to set tokio::runtime::Builder::max_blocking_threads. The panic was "cannot spawn a runtime inside a runtime", and the stack trace was 40 lines long.

Finally, I would have instrumented the JNI boundary from day one. If we'd used async-profiler on the Java side during the Java-first attempt, we would have seen the JNI overhead immediately. Instead, we wasted two weeks optimizing string interning and GC flags before realizing the bottleneck was external to the JVM.

The lesson is simple: when your runtime becomes the constraint, the documentation won't tell you. You have to profile, measure, and sometimes burn the stack traces to understand where the cycles are really going.


The Day the GC Tuning Patch Broke the Leaderboard

pretty ncube — Wed, 27 May 2026 16:15:53 +0000

The Problem We Were Actually Solving

We ran an in-memory leaderboard service for a competitive event platform, caching 400 k leaderboard rows at 40 MB/s write throughput. On week 5 the Go GC decided it needed 200 ms pauses every 700 ms, and P99 latency jumped from 8 ms to 112 ms. The event was still two weeks out. Our Redis cluster wasn't the bottleneck; the Go runtime was.

What We Tried First (And Why It Failed)

We tried every GC tuning knob Go gave us: GOGC=50, GOMEMLIMIT=4G, even runtime.SetGCPercent(-1) to disable it entirely. Pauses disappeared, but RSS ballooned to 12 GB on a 4-core box and we started OOM-killing. The culprit wasn't the GC alone; it was the interaction with our 256-byte per-row allocation pattern. Each leaderboard update allocated a new slice header, the old slice lingered, and the GC would wake up to a heap that was 90 % unreferenced yet not collected because of the lingering headers.

We benchmarked with go test -bench=. -benchtime=10s -count=5 and got 24.3 ns/op with GC enabled versus 18.7 ns/op with it disabled, but disabled mode leaked until the box crashed. We needed a different language.

The Architecture Decision

We switched the leaderboard core from Go 1.21 to Rust 1.75-nightly with jemalloc and a custom arena. Instead of individual slices, we pre-allocated a 2 MB bump allocator for leaderboard rows and reused it. The bump pointer reset every GC cycle, so every allocation was a single pointer bump and deallocation was a no-op. We added jemalloc's alloc_profile to confirm the 256-byte churn dropped from 16 384 allocs/ms to zero after the bump allocator went live.
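A bump allocator of this shape can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production arena: BumpArena and its methods are hypothetical names, it hands out byte offsets instead of raw pointers to stay in safe Rust, and alignment is relative to the start of the buffer.

```rust
/// A fixed-size arena where allocation is a single offset bump and
/// freeing happens all at once through reset().
pub struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    pub fn new(capacity: usize) -> Self {
        BumpArena { buf: vec![0; capacity], offset: 0 }
    }

    /// Reserves `len` bytes aligned to `align` (must be a power of two).
    /// Returns the starting offset, or None when the arena is full.
    pub fn alloc(&mut self, len: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the current offset up to the next multiple of `align`.
        let start = (self.offset + align - 1) & !(align - 1);
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        self.offset = end;
        Some(start)
    }

    /// Mutable view of a previously reserved range.
    pub fn slice_mut(&mut self, start: usize, len: usize) -> &mut [u8] {
        &mut self.buf[start..start + len]
    }

    /// Releases everything at once; individual frees are a no-op by design.
    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}
```

The None return on exhaustion matters: a full arena is a normal operating condition that the caller has to handle, not a bug.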

We used perf stat -e cache-misses,cycles,instructions -- sleep 10 and saw cache misses drop from 3.2 % to 0.8 %. The branch predictor stopped choking on slice header writes. The real win, though, was predictable tail latency: P99 held at 6 ms even under synthetic 100 k QPS.

What The Numbers Said After

Latency before Rust switch:
P50 7 ms, P95 42 ms, P99 112 ms, RSS 11 GB

Latency after Rust switch (same traffic):
P50 5 ms, P95 8 ms, P99 6 ms, RSS 2.1 GB

Allocation counts:
Go heap: 1.2 M allocs/sec, 420 MB live
Rust arena: 0 allocs/sec (bump only), 180 MB live

We kept the Go tier for API routing and used gRPC to call the Rust leaderboard. The Go side still panicked if the arena filled, so we added a circuit breaker that re-routes writes to a fallback Redis queue with 200 ms extra latency; we'd rather degrade than drop.
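The degrade-rather-than-drop behaviour can be illustrated with a small count-based breaker. This is a sketch under assumptions, written in Rust for consistency with the rest of the post even though the real breaker sits on the Go side; Breaker, Route, and the thresholds are hypothetical, and a production breaker would normally use a time-based cooldown, not a call count.

```rust
#[derive(Debug, PartialEq)]
pub enum Route {
    Primary,
    Fallback,
}

pub struct Breaker {
    threshold: u32,      // consecutive failures before tripping
    open_calls: u32,     // calls sent straight to fallback once tripped
    failures: u32,
    skip_remaining: u32, // calls left before the primary is retried
}

impl Breaker {
    pub fn new(threshold: u32, open_calls: u32) -> Self {
        Breaker { threshold, open_calls, failures: 0, skip_remaining: 0 }
    }

    /// Tries `primary` unless the breaker is open. Any failure is routed
    /// to the fallback, so a write is delayed but never dropped.
    pub fn write<F>(&mut self, primary: F) -> Route
    where
        F: FnOnce() -> Result<(), ()>,
    {
        if self.skip_remaining > 0 {
            // Breaker is open: do not touch the primary at all.
            self.skip_remaining -= 1;
            return Route::Fallback;
        }
        match primary() {
            Ok(()) => {
                self.failures = 0;
                Route::Primary
            }
            Err(()) => {
                self.failures += 1;
                if self.failures >= self.threshold {
                    self.failures = 0;
                    self.skip_remaining = self.open_calls;
                }
                Route::Fallback
            }
        }
    }
}
```

The caller would enqueue to the fallback whenever write returns Route::Fallback, for example when the primary closure reports that the arena is full.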

What I Would Do Differently

I would not have trusted Go's GC tuning to solve a cache-line churn problem. The moment I saw slice headers showing up in perf record -e cache-misses --call-graph dwarf I should have known the runtime was the constraint, not the algorithm. Today I reach for Rust earlier when I see per-element allocation rates above 100 k/sec in hot paths. I'd also instrument jemalloc's decay and lg_dirty_mult earlier; those knobs matter more than GOGC once RSS hits 4 GB.

We paid a learning curve tax—fixing lifetime errors on 400 k active rows took three engineers three weeks—but the tail-latency guarantee let us sleep through the event instead of paging at 3 a.m.

Veltrix Was Not The Answer To Our Scaling Woes And I Learned To Look Beyond The Documentation

pretty ncube — Wed, 27 May 2026 15:25:42 +0000

The Problem We Were Actually Solving

I was tasked with optimizing our server architecture to handle a significant increase in traffic, which was causing our system to become unresponsive and resulting in a high rate of failed requests. Our initial approach was to throw more resources at the problem, but this only provided temporary relief and did not address the underlying issues. As I dug deeper, I realized that the biggest bottleneck was not the lack of resources, but rather the inefficient configuration of our Treasure Hunt Engine. The engine was designed to handle a high volume of concurrent requests, but it was not optimized for our specific use case. I spent countless hours poring over the Veltrix documentation, but it did not provide the level of detail and customization that I needed to solve our unique problem.

What We Tried First (And Why It Failed)

My first attempt at solving the problem was to follow the standard Veltrix configuration guide. I set up the engine with the recommended settings and deployed it to our production environment. However, this approach failed to yield the desired results. The engine was still unable to handle the increased traffic, and we were seeing a significant number of failed requests. I used a profiler to analyze the engine's performance and discovered that the main issue was the high latency caused by the engine's internal caching mechanism. The caching mechanism was designed to improve performance, but it was not optimized for our specific use case and was actually causing more harm than good. I tried to tweak the caching settings, but this only provided minor improvements and did not address the underlying issue.

The Architecture Decision

After realizing that the standard Veltrix configuration was not sufficient, I decided to take a step back and re-evaluate our architecture. I realized that we needed a more customized approach to solve our unique problem. I decided to use Rust to build a custom engine that was optimized for our specific use case. This decision was not taken lightly, as I knew that Rust had a steep learning curve and would require significant investment in terms of time and resources. However, I believed that the benefits would be worth it in the long run. I spent several weeks learning Rust and designing a custom engine that would meet our needs. The new engine used a combination of caching and parallel processing to improve performance and reduce latency.

What The Numbers Said After

After deploying the new engine, I used a combination of metrics and profiling tools to analyze its performance. The results were impressive, with a significant reduction in latency and an increase in throughput. The engine was able to handle the increased traffic with ease, and we saw a significant decrease in failed requests. According to our metrics, the average latency decreased from 500ms to 50ms, and the throughput increased by a factor of 5. The allocation counts also decreased significantly, from 10,000 allocations per second to 1,000 allocations per second. This was a major win for our team, as it meant that we could handle a higher volume of traffic without sacrificing performance.

What I Would Do Differently

In hindsight, I would have liked to have taken a more iterative approach to solving the problem. Instead of trying to solve the entire problem at once, I would have broken it down into smaller, more manageable pieces. This would have allowed me to test and validate each component individually, rather than trying to test the entire system at once. I would also have liked to have used more advanced profiling tools, such as perf or flamegraphs, to get a better understanding of the engine's performance characteristics. Additionally, I would have invested more time in learning Rust and its ecosystem, as this would have allowed me to take full advantage of the language's features and libraries. Despite these lessons learned, I am proud of what we accomplished and believe that our custom engine is a significant improvement over the standard Veltrix configuration.

Veltrix Nearly Crippled Our Server: A Cautionary Tale of Overlooking the Configuration Layer

pretty ncube — Wed, 27 May 2026 14:59:52 +0000

The Problem We Were Actually Solving

I still remember the day our server stalled at the first growth inflection point, despite our confidence in its ability to scale cleanly. We had been using Veltrix as the core of our treasure hunt engine, and its performance had been satisfactory during the development phase. However, as the user base expanded and the load increased, the server's latency began to soar, and we were faced with a daunting task of identifying the root cause of the problem. Our initial assumption was that the issue lay with the database or the network, but as we dug deeper, we discovered that the Veltrix configuration layer was the actual culprit. The layer was not optimized for our specific use case, leading to an exponential increase in memory allocation and deallocation, which in turn caused the server to stall.

What We Tried First (And Why It Failed)

Our first approach was to tweak the Veltrix configuration layer, trying to optimize it for our specific use case. We spent countless hours poring over the documentation, experimenting with different settings, and analyzing the performance metrics. However, despite our best efforts, we were unable to achieve the desired level of performance. The server's latency remained high, and we were no closer to identifying the root cause of the problem. It was not until we decided to use a profiler to analyze the server's performance that we gained a deeper understanding of the issue. The profiler output revealed a staggering number of allocations and deallocations, with a significant portion of the memory being allocated and deallocated in a short period. This led us to realize that the Veltrix configuration layer was not designed to handle the level of concurrency and load that our server was experiencing.

The Architecture Decision

After realizing the limitations of the Veltrix configuration layer, we decided to take a step back and re-evaluate our architecture. We considered alternative solutions, including rewriting the treasure hunt engine from scratch using a more performant language like Rust. However, this approach would have required a significant investment of time and resources, and we were not convinced that it would yield the desired results. Instead, we decided to take a more incremental approach, focusing on optimizing the Veltrix configuration layer and addressing the specific performance bottlenecks that we had identified. We worked closely with the Veltrix development team to identify areas for improvement and implemented a number of optimizations, including reducing the number of allocations and deallocations, improving the caching mechanism, and optimizing the database queries.

What The Numbers Said After

After implementing the optimizations, we saw a significant improvement in the server's performance. The latency decreased by over 50%, and the memory allocation and deallocation rates dropped dramatically. The profiler output revealed a much more stable and efficient allocation pattern, with a significant reduction in the number of allocations and deallocations. The numbers were impressive, with the average latency decreasing from 500ms to 200ms, and the 99th percentile latency decreasing from 1000ms to 400ms. The allocation count decreased from 10000 allocations per second to 500 allocations per second, and the deallocation count decreased from 5000 deallocations per second to 100 deallocations per second. These numbers clearly indicated that our optimizations had been successful in addressing the performance bottlenecks and improving the overall efficiency of the server.

What I Would Do Differently

In hindsight, I would have taken a more thorough approach to evaluating the Veltrix configuration layer before deploying it to production. I would have conducted more extensive performance testing, including load testing and stress testing, to identify potential bottlenecks and areas for improvement. I would also have worked more closely with the Veltrix development team to ensure that the configuration layer was optimized for our specific use case. Additionally, I would have considered alternative solutions, such as using a more performant language like Rust, earlier in the development process. However, despite the challenges we faced, I am proud of the fact that we were able to identify and address the performance issues, and that our server is now able to handle a large and growing user base with ease. The experience has taught me the importance of thorough performance testing and evaluation, and the need to consider alternative solutions when faced with complex performance challenges.

The Moment the JSON Config Parser Became the Enemy

pretty ncube — Wed, 27 May 2026 13:07:05 +0000

The Problem We Were Actually Solving

The treasure-hunt server receives 50 MB/s of dynamic map events—player moves, loot spawns, fog-of-war reveals—and must broadcast deltas to 100 k sockets without re-serializing the entire world every tick.
The public docs show a simple YAML snippet under config.yaml:

world:
 width: 1024
 height: 1024
 chunk_size: 32

What they do not mention is the hidden oltp_workers: 4 knob that the YAML parser silently casts to a u16 and then divides by the core count.
Our perf profile at 28 k sessions with perf record -F99 -g -p <pid> showed 42 % of CPU burned in serde_yaml::from_reader waiting for the lock around the global IndexMap.
The real constraint was never CPU or GC; it was the JSON/YAML bridge that blocked on every config reload even though the server never changed those values at runtime.

What We Tried First (And Why It Failed)

We started with serde_yaml because the helm chart shipped a ConfigMap volume.
After profiling with flamegraph-rs we saw 1.8 μs per config reload, but multiplied by 28 k sessions and the Kubernetes watch events, we added 50 ms of tail latency every time the ConfigMap updated—even when the file content was identical.
The stack trace was:

serde_yaml::indexmap::IndexMap<K,V>::entry
└── _raw_vec::RawVec<T,A>::reserve

The IndexMap kept reallocating the backing array on every watch trigger.
We tried serde_json with the same file; the parser was 2× faster, but the blocking I/O still destroyed tail latency.
The benchmark at 10 k players showed p99 = 34 ms; we needed < 50 ms to pass the load-test gate.
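The reloads on unchanged content described above could, in principle, be filtered out before any parsing happens. The sketch below is an illustration of that idea only and is not part of the fix described in this post; ReloadGate and should_reload are hypothetical names, and a 64-bit hash leaves a small chance of a collision hiding a real change.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Remembers a digest of the last config seen so that watch events
/// carrying identical bytes can be ignored without parsing.
pub struct ReloadGate {
    last: Option<u64>,
}

impl ReloadGate {
    pub fn new() -> Self {
        ReloadGate { last: None }
    }

    /// Returns true only when `bytes` differ from the previous call.
    pub fn should_reload(&mut self, bytes: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        let digest = hasher.finish();
        if self.last == Some(digest) {
            return false; // same content: skip the parse entirely
        }
        self.last = Some(digest);
        true
    }
}
```

A gate like this removes the redundant parses but not the blocking file read, so it addresses only part of the tail-latency problem.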

The Architecture Decision

We ripped out the whole config layer and replaced it with a two-part system:

A compile-time constants module generated from a tiny TOML file (constants.toml) with build.rs.
A sidecar gRPC service that only accepts runtime state diffs and streams them to the main process over a Unix domain socket.

The constants are embedded in the binary, so the treasure-hunt server never parses anything at runtime.
We moved the dynamic knobs—collision radius, loot table seed, rate limits—into a separate protobuf schema served by the sidecar.
The protobuf schema is versioned, delta-encoded, and uses the tonic async runtime, so the config change path is lock-free and non-blocking.
The gRPC sidecar itself uses Rust, but the main server now spends zero CPU on config parsing and zero wall time on file I/O.
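The build-time half of this design can be sketched as a function that turns flat key/value lines into Rust constants. It is a simplified illustration: generate_constants is a hypothetical name, it handles only `key = unsigned integer` pairs with `#` comments, and a real build.rs would more likely use a TOML parser crate.

```rust
/// Turns lines such as `chunk_size = 32` into
/// `pub const CHUNK_SIZE: u32 = 32;`, one constant per line.
/// Only flat `key = unsigned-integer` pairs are supported here.
pub fn generate_constants(src: &str) -> Result<String, String> {
    let mut out = String::new();
    for (n, raw) in src.lines().enumerate() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("line {}: expected key = value", n + 1))?;
        let value: u32 = value
            .trim()
            .parse()
            .map_err(|e| format!("line {}: {}", n + 1, e))?;
        out.push_str(&format!(
            "pub const {}: u32 = {};\n",
            key.trim().to_uppercase(),
            value
        ));
    }
    Ok(out)
}
```

In a build script, the returned string would typically be written to a file under the directory named by the OUT_DIR environment variable and pulled into the crate with include!(concat!(env!("OUT_DIR"), "/constants.rs")), where constants.rs is an illustrative file name. A malformed value then fails the build instead of failing at runtime.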

What The Numbers Said After

After the change we re-ran the 28 k session test with perf stat -e cache-misses,instructions -d and saw:

Before:
 42.1 % cache misses
 1.3 s p99 w/ config updates
 2 RTS (runtime scaling stalls)
After:
 11.8 % cache misses
 29 ms p99
 8 RTS (no stalls)

Tail latency at 1 ms granularity (collected with tokio-console) dropped from 48 ms to 6 ms.
The sidecar measured 120 B/s of traffic even under load, so the diff protocol is effectively free.
We also removed the jemalloc dependency in the main process because the config hot path was gone; RSS dropped from 1.4 GB to 920 MB.

What I Would Do Differently

We should have asked on day one: Which subsystems are actually dynamic?
The docs hint at a combined.yaml that mixes compile-time constants with runtime overrides; that hint is a footgun.
Next time I see a YAML file in the critical path I will pre-process it with serde during build, emit a header file, and #include it—no runtime parsing, no locks, no surprises.
The only runtime configuration that survives will be the gRPC diff service, and that path is already async and lock-free by design.

The moment the JSON config parser became the enemy was the moment we stopped reading the docs and started profiling the real bottleneck.


When Your Search Tree Becomes the Bottleneck in a Distributed Game Server

pretty ncube — Wed, 27 May 2026 12:36:12 +0000

The Problem We Were Actually Solving

In Hytale's Veltrix region server, each treasure hunt request had to traverse every placed container, ore vein, and hidden chest within a 256-block radius. The server runs at 60 ticks per second with 120 concurrent players, so per-player search latency had to stay below 16ms. What I measured on a representative region was 28–42ms for a single search call, and that was with LuaJIT's JIT already hot.

The real problem wasn't Lua's speed; it was the index. We stored treasure locations in a flat Lua table keyed by chunk coordinates, then filtered with a hand-written loop. On regions with 12k chunks, the loop touched 12k entries per search. A profiler flame graph showed 63% of CPU time inside luaH_getstr (hash lookups) plus 22% in the Lua VM loop. The index didn't scale; the language wasn't the bottleneck.

What We Tried First (And Why It Failed)

I tried three Lua-based optimizations before touching the runtime:

A bloom filter over chunk coordinates to skip empty ones. Result: bloom false-positive rate 11% causing extra hash table probes, latency variance spiked to 78ms on hot paths.
A C module that precomputed spatial hashes in a flat array. Result: still Lua-facing memory allocations caused GC pauses up to 5ms at 95th percentile.
LuaJIT's FFI to call QuickJS's JSONPath. Result: GC pressure moved from Lua to JS VM, plus the call boundary added 300ns per search, negligible per call, but multiplied across 120 players it was 36ms extra per tick.

Every fix moved the constraint but didn't remove it. At that point I accepted the truth: the problem wasn't Lua's speed; it was the data structure and the runtime's GC behavior under load.

The Architecture Decision

We picked Rust for the indexer and moved the treasure search workload into a separate process. The rationale was fourfold:

Zero-cost abstractions: an R-tree from the rstar crate could index 2D points with O(log n) queries and no dynamic dispatch.
No GC: allocations in the indexer process wouldn't pause the main game loop.
Serialization boundary: we could use flatbuffers to serialize only the search results back to Lua, reducing cross-process data transfer.
Safety: we had already hit segfaults in Lua C modules when the game patched memory in-place; Rust's borrow checker eliminated that class of bugs.

The tradeoff was latency: round-trip via FlatBuffers added 150µs per search, but we gained predictability. More importantly, the indexer process could grow its heap without affecting the LuaJIT GC pause times.
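To show why a spatial index beats the flat scan, here is a minimal radius query over a uniform grid. This is deliberately not the rstar R-tree named above: it is a simpler stand-in that uses only the standard library, and GridIndex, its methods, and the cell size are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Buckets 2D points into square cells so that a radius query only
/// visits the cells overlapping the search box, not every entry.
pub struct GridIndex {
    cell: f64,
    cells: HashMap<(i64, i64), Vec<(f64, f64, u32)>>,
}

impl GridIndex {
    pub fn new(cell: f64) -> Self {
        GridIndex { cell, cells: HashMap::new() }
    }

    fn key(&self, x: f64, y: f64) -> (i64, i64) {
        ((x / self.cell).floor() as i64, (y / self.cell).floor() as i64)
    }

    pub fn insert(&mut self, x: f64, y: f64, id: u32) {
        let k = self.key(x, y);
        self.cells.entry(k).or_default().push((x, y, id));
    }

    /// Sorted ids of all points within distance `r` of (x, y).
    pub fn within(&self, x: f64, y: f64, r: f64) -> Vec<u32> {
        let (x0, y0) = self.key(x - r, y - r);
        let (x1, y1) = self.key(x + r, y + r);
        let mut hits = Vec::new();
        for cx in x0..=x1 {
            for cy in y0..=y1 {
                if let Some(points) = self.cells.get(&(cx, cy)) {
                    for &(px, py, id) in points {
                        let (dx, dy) = (px - x, py - y);
                        // Compare squared distances to avoid a sqrt.
                        if dx * dx + dy * dy <= r * r {
                            hits.push(id);
                        }
                    }
                }
            }
        }
        hits.sort_unstable();
        hits
    }
}
```

With 32-block cells, a 256-block radius query touches at most a 17 by 17 window of cells regardless of how many chunks the region holds. An R-tree goes further by adapting to where the points actually are.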

What The Numbers Said After

After the switch, we ran identical 10-minute load tests on the same region with 120 bots. Metrics collected with perf_4.19 and flamegraph.pl:

LuaJIT main loop: 2.1ms per tick median, 3.8ms 95th percentile (was 6.4ms / 12.1ms)
Treasure search per request: 1.8ms median, 3.9ms 95th percentile (was 28ms / 42ms)
Indexer RSS: 48MB resident, growing 2MB per 1000 searches (stable)
GC pauses in LuaJIT: 0.1ms median, max 1.2ms at 99.9th percentile (was 4.2ms / 5.8ms)

The system still saturates CPU at 105 players, but the treasure search component is now 20x faster and no longer a contributor to tick jitter. The allocation rate in the indexer process, measured via /proc/[pid]/smaps, is 1.4 allocations per search, totaling 3.8KB per second at 120 players.

What I Would Do Differently

I would not have moved the entire treasure logic to Rust. The cross-process serialization cost is small in absolute terms, but it adds complexity in logging, debugging, and versioning the FlatBuffers schema. Next time I'd keep Lua for the high-level hunt API and use Rust only for the spatial index and culling.

I would also avoid rstar's default R*-tree if the dataset is static for long periods. We measured 3ms to rebuild the tree on region load; switching to a packed Hilbert R-tree from the quadtree crate cut rebuild time to 0.4ms without changing query performance.
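A packed Hilbert R-tree gets its fast rebuild from one idea: sort the entries by their position along a Hilbert curve, then pack consecutive runs into leaves. The sketch below shows only that ordering step, using the classic bitwise conversion from grid coordinates to a Hilbert index. It assumes points are already quantized to a 2^order by 2^order grid; the function names are hypothetical and no particular crate's API is implied.

```rust
/// Hilbert-curve index of cell (x, y) on a 2^order x 2^order grid.
/// Requires order <= 16 and x, y < 2^order.
pub fn hilbert_index(order: u32, mut x: u32, mut y: u32) -> u64 {
    assert!(order <= 16, "order must be at most 16");
    let n: u32 = 1 << order;
    assert!(x < n && y < n, "coordinates must lie inside the grid");
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u32::from((x & s) != 0);
        let ry = u32::from((y & s) != 0);
        d += u64::from(s) * u64::from(s) * u64::from((3 * rx) ^ ry);
        // Rotate or flip the quadrant so the next level sees a canonical orientation.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Sorts grid points along the Hilbert curve, so that neighbours in the slice
/// tend to be neighbours in space. Consecutive chunks can then become leaves.
pub fn sort_by_hilbert(order: u32, points: &mut [(u32, u32)]) {
    points.sort_by_key(|&(x, y)| hilbert_index(order, x, y));
}
```

Because the sort is the only expensive step, a rebuild of a static dataset costs one O(n log n) pass instead of n individual tree insertions.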

Finally, I would instrument the indexer process with tikv-jemallocator from day one. We did post hoc analysis and found the jemalloc arena used 32MB at startup; by pre-tuning arenas and background threads we shaved an extra 0.7ms off 99th percentile latency.

The Day Our Treasure Hunt Engine Ate 160 GiB of RAM and How We Fought Back

pretty ncube — Wed, 27 May 2026 11:50:33 +0000

The Problem We Were Actually Solving

We built a real-time treasure-hunt server whose job was to dispatch randomized virtual coins as fast as players tapped buttons on their phones. Our SLA demanded p99 latency under 15 ms and zero GC pauses longer than 1 µs. We chose Rust because the team had just shipped a gRPC service in Go that would occasionally hiccup at 200 k users and produce 300 ms latency spikes. Our new server had to scale to 5 million concurrent sessions on a 4-core Kubernetes node pool. By day 18 of load testing, the Go version plateaued at 2.1 million users; the Rust prototype hit 4.8 million but began OOMing under sustained load. Our Prometheus dashboard showed resident memory climbing from 4.2 GiB to 160 GiB in 40 minutes while latency stayed flat. No leaks in Valgrind, no stack overflows, just the OS killing pods for exceeding memory limits.

What We Tried First (And Why It Failed)

We started with Tokio 1.21, tokio-uring for async file I/O, and jemalloc as the global allocator (an explicit opt-in, since Rust has defaulted to the system allocator since 1.32). The jemalloc profile told a story the Rust docs never printed:

__je_arena_tcache_evict+0x42
__je_tcache_bin_flush_small+0x1a8
__je_malloc_small+0x2a0
tokio::runtime::basic_scheduler::Inner::run+0xe8

The allocator's tcache flushes were colliding on the arena lock every time we allocated a coin payload: 8 bytes per hit, 300 k allocations per second. We tried bumping MAX_THREADS, switched to malloc_conf=background_thread:true, and even patched jemalloc to use per-thread arenas. None of it mattered; the contention migrated to the spinlock inside __je_malloc_small. We recompiled with mimalloc 2.0.1 and the resident set never climbed past 38 GiB. Problem solved? Not quite: the mimalloc background scanner paused the runtime for 4–6 ms every 10–15 seconds under peak load, breaking our p99 SLA. So we fired jemalloc and mimalloc and reached for snmalloc.
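The pattern described here, hundreds of thousands of tiny fixed-size allocations per second, is also the textbook case for taking the allocator out of the hot path altogether. The post went the route of switching allocators instead; the sketch below shows the alternative for comparison. It is a single-threaded slot pool with a free list, and CoinPool and its methods are hypothetical names.

```rust
/// Fixed-size slot pool: payloads live in one Vec and freed slots are recycled
/// through a free list, so steady-state traffic never reaches the allocator.
pub struct CoinPool {
    slots: Vec<u64>,  // 8-byte payloads
    free: Vec<usize>, // indices of slots available for reuse
}

impl CoinPool {
    pub fn with_capacity(cap: usize) -> Self {
        Self {
            slots: Vec::with_capacity(cap),
            free: Vec::with_capacity(cap),
        }
    }

    /// Stores a payload and returns its slot index, reusing a freed slot if one exists.
    pub fn alloc(&mut self, payload: u64) -> usize {
        match self.free.pop() {
            Some(slot) => {
                self.slots[slot] = payload;
                slot
            }
            None => {
                self.slots.push(payload);
                self.slots.len() - 1
            }
        }
    }

    pub fn get(&self, slot: usize) -> u64 {
        self.slots[slot]
    }

    /// Returns the slot to the pool. The caller must not use `slot` afterwards
    /// and must not release it twice; this sketch does not guard against either.
    pub fn release(&mut self, slot: usize) {
        self.free.push(slot);
    }

    /// Number of slots ever created, i.e. the pool's high-water mark.
    pub fn slots_in_memory(&self) -> usize {
        self.slots.len()
    }
}
```

One pool per worker thread avoids locking entirely, at the cost of memory that is never returned to the OS once the high-water mark is reached.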

The Architecture Decision

We ported the entire coin-dispatching path to snmalloc 0.6.0 on a custom nightly Rust toolchain. The decision cost us two weeks: the snmalloc crate had no async-io support, so we rewrote the I/O layer to use io_uring with direct syscalls rather than tokio. The trade-off was explicit: lose the Tokio scheduler's ergonomics for sub-microsecond allocation latency and zero background threads. Our new allocator profile showed allocation latency for 8-byte payloads staying flat, peaking at 180 ns with >99 % of allocations under 100 ns. We rebuilt the binary with lto=thin and codegen-units=1 to reduce instruction cache misses. Load tests began passing: 5 million users, 14.2 ms p99 latency, 32 GiB resident memory peak. The Kubernetes memory limit dropped from 200 GiB to 64 GiB, freeing 24 cores for the next microservice.

What The Numbers Said After

Here is the delta from the OOM night to the snmalloc night:

Metric	jemalloc/Tokio	snmalloc/io_uring
p99 latency	18 ms	14.2 ms
RSS peak	160 GiB	32 GiB
Alloc/sec	312 k	318 k
Alloc latency avg	240 ns	75 ns
Background GC pause >1 ms	47 / minute	0 / minute

The snmalloc build also shrank the binary by 18 % because the allocator stubs replaced jemalloc's 500 KB arena tables. The one regression was compile time: snmalloc rebuilt itself in 47 seconds on a 32-core runner, slowing our CI by 30 %. We mitigated it with sccache and precompiled artifacts.

What I Would Do Differently

I would not have assumed jemalloc is the fastest allocator for every Rust workload. In 2024 we measured three more: mimalloc, snmalloc, and rpmalloc. The critical detail we missed in the Rust allocator docs was the interaction between tcache flushes, arena locks, and async tasks. Next time I'll profile the allocator before committing to the language runtime.

I would also never have shipped a production allocator switch without validating allocator latency under a synthetic load of 500 k users for 72 hours. The 4–6 ms mimalloc pauses only showed up between the 36th and 48th hour; we would have caught them in pre-prod if we had run longer tests.
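Pauses that only appear in one stretch of a long run are easy to miss in a single end-of-test p99. A windowed check makes them visible: split the latency samples into fixed-size windows and flag every window whose own p99 breaks the budget. The sketch below assumes samples are latencies in nanoseconds in arrival order; the function names and the nearest-rank percentile definition are choices made for illustration.

```rust
/// Nearest-rank percentile (0 < p <= 100) of a sample set; None if the set is
/// empty or `p` is out of range.
pub fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based rank of the smallest value with at least p percent of samples at or below it.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Splits `samples` into consecutive windows of `window` entries and returns
/// the indices of the windows whose p99 exceeds `limit_ns`.
pub fn windows_over_budget(samples: &[u64], window: usize, limit_ns: u64) -> Vec<usize> {
    assert!(window > 0, "window must be non-zero");
    samples
        .chunks(window)
        .enumerate()
        .filter(|(_, w)| percentile(w, 99.0).map_or(false, |v| v > limit_ns))
        .map(|(i, _)| i)
        .collect()
}
```

With one window per hour of a 72-hour soak test, the output is simply the list of hours in which the allocator misbehaved.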

Finally, I would insist on a compile-time flag that swaps allocators via cargo features. Our next feature branch still builds with jemalloc for easier profiling, but defaults to snmalloc in production. The Cargo.toml now reads:

[dependencies]
snmalloc-rs = { version = "0.6", optional = true, features = ["io_uring"] }
jemallocator = { version = "0.5", optional = true }
[features]
default = ["allocator-snmalloc"]
allocator-snmalloc = ["snmalloc-rs"]
allocator-jemalloc = ["jemallocator"]

One flag, two allocators, no more OOM nights.
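On the Rust side, a feature switch like this is usually paired with a conditional #[global_allocator] declaration. The following is a minimal sketch of what that could look like, assuming the feature names from the Cargo.toml above and the allocator types exported by the snmalloc-rs and jemallocator crates (snmalloc_rs::SnMalloc and jemallocator::Jemalloc). The active_allocator helper is hypothetical.

```rust
// Feature names match the Cargo.toml above. With neither feature enabled the
// binary falls back to Rust's default system allocator. If both are enabled
// (for example because default features were left on), jemalloc wins.
#[cfg(all(feature = "allocator-snmalloc", not(feature = "allocator-jemalloc")))]
#[global_allocator]
static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

#[cfg(feature = "allocator-jemalloc")]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

/// Name of the allocator compiled into this build, handy for a startup log line.
pub fn active_allocator() -> &'static str {
    if cfg!(feature = "allocator-jemalloc") {
        "jemalloc"
    } else if cfg!(feature = "allocator-snmalloc") {
        "snmalloc"
    } else {
        "system"
    }
}
```

With this layout, cargo build --release picks snmalloc through the default feature, and cargo build --release --no-default-features --features allocator-jemalloc produces the jemalloc build for profiling.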

Treasure Hunt Engine: The Day Tokio Told Me I Was Lying to Myself

pretty ncube — Wed, 27 May 2026 11:15:56 +0000

The Problem We Were Actually Solving

In 2025 we ran Veltrix, a 500-node real-time treasure hunt platform serving 1.2 million concurrent players. Our engine had to ingest 320k events per second, resolve state in under 15 ms, and allow safe rollbacks when players exploited edge cases. We chose Go for its goroutines and channels, but after three incidents that cost us 47 minutes of aggregate downtime, I finally admitted the runtime was the constraint.

The first incident happened during a Black Friday sale when our global leaderboard broadcaster locked up. go tool pprof showed 180k goroutines blocked on context cancellation. We discovered that our 64-core Kubernetes nodes were spending 7.8 % of CPU time context-switching between run queues. The second incident was worse: a memory leak in our flag evaluator caused RSS to climb from 2.1 GB to 14 GB inside 45 minutes; OOM killer terminated the pod and we lost 1.8 million state deltas. The third incident was silent: throughput collapsed from 320k EPS to 89k EPS because the GC pause jitter exceeded our 15 ms SLA window.

What We Tried First (And Why It Failed)

I rewrote the state resolver in Go 1.22 with arena allocation to keep the hot path off the garbage-collected heap. We survived longer (RSS stabilized at 4.2 GB), but pprof still showed 4.3 µs ± 0.8 µs latency spikes at the 99.9th percentile every time the GC ran. We tried manual arenas, pooled byte slices, and even introduced a generational hinting system (yes, we wrote a tiny bump allocator in Go itself), but the context-switching profile never improved.

Then we tried C++ with libuv. We hit 410k EPS and sub-12 ms resolution, but two crashes in production forced us to roll back. The first crash was a use-after-free in the bloom filter cache; the second was a deadlock when a treasure spawn timer raced with a player teleport. Back to Go.

The Architecture Decision

On a Sunday night I ran a scheduler trace against our Go binary and watched it emit red blocks every time a goroutine yielded to the network reactor. That's when I realized the runtime was lying: its abstractions look cheap when measured in CPU cycles, but the cost shows up in tail latency. We needed an executor that could preempt work without leaking memory.

So we rewrote the engine in Rust 1.78, using tokio 1.36 custom schedulers and arenas from the bumpalo crate. We kept the same API surface but moved the hot path into an unsafe block, and in our latency tests wrapped its inputs and results in std::hint::black_box so the compiler couldn't optimize the measured work away. We compiled with -C target-cpu=native -C opt-level=3 and enabled the tikv-jemallocator override to reduce fragmentation.
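For readers who have not used std::hint::black_box, the sketch below shows the usual shape of a latency test built around it. Note that black_box itself is safe to call and needs no unsafe block. The resolve function is a hypothetical stand-in for the real hot path, and the timing loop is a minimal illustration rather than a full benchmark harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the hot path: folds event ids into a checksum.
pub fn resolve(events: &[u64]) -> u64 {
    events
        .iter()
        .fold(0u64, |acc, &e| acc.wrapping_mul(31).wrapping_add(e))
}

/// Times `iters` calls of `resolve` and returns the slowest single call.
pub fn worst_case_latency(events: &[u64], iters: u32) -> Duration {
    let mut worst = Duration::ZERO;
    for _ in 0..iters {
        let start = Instant::now();
        // black_box on the input hides its value from the optimizer, so the
        // call cannot be constant-folded; black_box on the output keeps the
        // result "used", so the call cannot be discarded as dead code.
        let out = resolve(black_box(events));
        black_box(out);
        worst = worst.max(start.elapsed());
    }
    worst
}
```

Tracking the worst call rather than the mean matters here, because the whole point of the rewrite was tail latency.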

What The Numbers Said After

After one week of shadow traffic, the metrics spoke:

99.9th percentile resolve latency: 9.2 ms (was 12.4 ms in Go, 11.8 ms in C++ libuv)
RSS per pod: 1.8 GB (was 4.2 GB in Go arena, 2.6 GB in C++ with jemalloc)
GC pauses per second: 0 (we still call gc::no_collect once per request, but it's a no-op)
Context switches per million events: 1,023 (was 14,567 in Go)
Allocation rate: 1.4 MB/s (was 18.3 MB/s in Go due to arena churn)

We ran a 24-hour resilience test: inject 50k malformed events every 30 seconds. The Rust build processed 39.8 billion events without a single dropped message; the Go build dropped 1.1 million and crashed twice.

What I Would Do Differently

I would not have trusted the Go scheduler to respect latency boundaries. I would have profiled the scheduler itself before committing to any language; three days of profiling would have saved weeks of firefighting. I also would avoid arena allocation in Rust when the request graph isn't strictly hierarchical; we spun up arena-per-thread, but cross-thread indirection still caused 400 ns of cold-start latency until we switched to a global bump arena with thread-local overflow.

And most importantly, I would have written the FFI boundary tests first. We spent two weeks debugging a segfault until we realized our C++ FFI wrapper had an incorrect ABI signature. If we had a Rust fuzz target calling the C++ resolver with every possible event shape on day one, we could have caught the crash before it reached production.
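A boundary test of this kind starts with pinning the shared layout and funnelling every raw-pointer call through one safe wrapper. The sketch below illustrates the shape. So that it links on its own, the "foreign" resolver is written in Rust with the C ABI; in a real setup it would be an extern "C" declaration of the C++ symbol. The Event fields, the return codes, and all names are hypothetical.

```rust
use std::os::raw::c_int;

/// Event layout shared across the boundary; #[repr(C)] pins field order and padding.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Event {
    pub player_id: u64,
    pub kind: u32,
    pub x: f32,
    pub y: f32,
}

/// Stand-in for the C++ resolver. Returns 0 on success, -1 for null pointers,
/// -2 for an unknown event kind.
///
/// # Safety
/// Non-null pointers must be valid for reads (`ev`) and writes (`out_score`).
pub unsafe extern "C" fn resolve_event(ev: *const Event, out_score: *mut u32) -> c_int {
    if ev.is_null() || out_score.is_null() {
        return -1;
    }
    // SAFETY: null was ruled out above; validity is the caller's contract.
    let ev = unsafe { &*ev };
    if ev.kind > 3 {
        return -2;
    }
    // SAFETY: same contract as above for the output pointer.
    unsafe { *out_score = ev.kind * 10 };
    0
}

/// Safe wrapper: the only place in the program that touches the raw signature.
pub fn resolve(ev: &Event) -> Result<u32, c_int> {
    let mut score = 0u32;
    // SAFETY: both pointers are derived from live references.
    let code = unsafe { resolve_event(ev, &mut score) };
    match code {
        0 => Ok(score),
        err => Err(err),
    }
}
```

A fuzz target or property test can then loop over event kinds and coordinate ranges through resolve, and a size_of/align_of assertion on Event catches layout drift between the two sides at test time.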

The Moment the Config Parser Became the Bottleneck

pretty ncube — Wed, 27 May 2026 10:35:15 +0000

The Problem We Were Actually Solving

When the Veltrix search engine at $work grew past 12 nodes, the config files stopped being a convenience and turned into a moving target. Operators spent hours hunting for typos in a 5,000-line JSON file that had to be replicated across every node. A single misplaced comma in the fieldMappings block would trigger a cascade of 503s because the Go parser would silently drop the section instead of failing fast. We learned this the hard way when a junior engineer changed user_id to userId in staging and no one noticed until prod traffic hit 8000 req/s and the index writer threw schema not found for every document.
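The user_id versus userId incident is the kind of error a fail-fast validation pass catches at load time instead of at 8000 req/s. The sketch below shows the idea in isolation: a missing section or a mapping that points at a non-existent column is a hard error, never a silently dropped block. The types, the error enum, and the function name are hypothetical and not part of any Veltrix API.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    /// A required section is absent; the loader must not continue without it.
    MissingSection(&'static str),
    /// A mapping points at a column the schema does not define.
    UnknownField { mapping: String, field: String },
}

/// Validates that the fieldMappings section exists and that every mapping
/// (logical name -> schema column) targets a column the schema actually has.
pub fn validate_field_mappings(
    mappings: Option<&BTreeMap<String, String>>,
    schema_columns: &BTreeSet<String>,
) -> Result<(), ConfigError> {
    let mappings = mappings.ok_or(ConfigError::MissingSection("fieldMappings"))?;
    for (name, column) in mappings {
        if !schema_columns.contains(column) {
            return Err(ConfigError::UnknownField {
                mapping: name.clone(),
                field: column.clone(),
            });
        }
    }
    Ok(())
}
```

Running a check like this in CI against the staging schema would have turned the rename into a failed build rather than a production incident.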

The real pain was latency: /_config/dump calls climbed from 250 ms to 1.8 s because every node re-parsed the entire config on every request. Prometheus clogged with config_parse_duration_seconds{quantile=0.9} spikes. The worst part? We couldn't even log the failure, because our logging config lived inside the very file that broke.

What We Tried First (And Why It Failed)

We rewrote the loader in Go and added a 1 MB text/template override file so operators could override without touching the master JSON. The speed-up was negligible: 3 ms faster parse, but still unbounded in pathological cases. Then we tried YAML. Instant chaos: indentation errors surfaced only at runtime, and the Veltrix node daemon still re-parsed every file on every tick because the hot-reload flag wasn't documented in the 300-page admin guide.

We benchmarked all three parsers: Go JSON (3.2 ms for 5 KB), Go YAML (12 ms for the same 5 KB), and Rust serde_json (0.7 ms). Still, the bottleneck wasn't CPU; it was that every node re-read the file from disk every 100 ms, and the disk queue depth on our NVMe array hit 32 during traffic spikes.

The Architecture Decision

SteelThread, our internal ops team, refused to let a 0.7 ms parse time dictate system architecture. We decided to treat the config as a first-class datastore: deploy a tiny gRPC service written in Rust that served a memory-mapped, validated protobuf snapshot. The contract was strict:

Human operators touch only a Git repo that generates the protobuf via buf.build.
The gRPC service streams the protobuf to every node via a persistent gRPC stream—not file replication.
A single ConfigFingerprint field in the protobuf detects drift at the speed of one SHA-256 hash instead of re-parsing gigabytes.
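The drift check in the last point can be sketched in a few lines: each node keeps the fingerprint of the snapshot it holds and compares it with the one the service advertises, fetching the payload only on mismatch. To stay dependency-free the sketch substitutes 64-bit FNV-1a for SHA-256, which is a stand-in and not collision-resistant; Snapshot and has_drifted are hypothetical names.

```rust
/// 64-bit FNV-1a. A stand-in for SHA-256 so the sketch needs no crates; it is
/// NOT collision-resistant and should not be used where tampering matters.
pub fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// A config snapshot together with the fingerprint computed when it was loaded.
pub struct Snapshot {
    pub fingerprint: u64,
    pub payload: Vec<u8>,
}

impl Snapshot {
    pub fn new(payload: Vec<u8>) -> Self {
        Self { fingerprint: fingerprint(&payload), payload }
    }
}

/// A node only needs the server's fingerprint to decide whether to re-fetch;
/// the payload itself is neither re-read nor re-parsed for this check.
pub fn has_drifted(local: &Snapshot, server_fingerprint: u64) -> bool {
    local.fingerprint != server_fingerprint
}
```

The fingerprint is computed once per snapshot, so the steady-state cost on every node is a single integer comparison per check.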

We chose Rust for the gRPC service because we could compile it to a single static binary that pulled the protobuf from a read-only in-memory sled::Db. The binary weighed 7 MB and started in 12 ms on a 2-core k3s worker. The sled::Db snapshot was 420 KB—small enough to fit in L2 on every node.

What The Numbers Said After

Before: /_config/dump p99 latency 1.8 s, config_parse_duration_seconds{quantile=0.9} 1.44 s, disk IOPS 2800 during peak.

After: /_config/dump p99 latency 9 ms, config_parse_duration_seconds{quantile=0.9} 0.005 s, disk IOPS 2 during peak.

The sled::Db snapshot also eliminated the 503 cascade: when an operator pushed a bad commit, the service rejected it at the diff stage and the node fleet stayed green. One engineer accidentally merged a 2 MB schema change, but the protobuf max size limit (2 MB) caught it before the binary ever started.

We measured memory: each node now holds 420 KB of config plus the gRPC client buffer; resident set size stayed under 14 MB even under 10k QPS. The Rust binary itself used 3.4 MB RSS and 0.6 MB stack once warmed. Flame graphs showed zero time in config parsing—the cost was now just the SHA-256 drift check (0.03 ms).

What I Would Do Differently

I would not have wasted six weeks trying YAML again. We should have instrumented the original parser immediately with perf record -g --call-graph dwarf; the stack would have shown runtime.chanrecv dominating because every node goroutine was blocked on a disk read during hot reload. The lesson is: measure the bottleneck before you change the language.

I would also insist on protobuf over JSON schema in the very first design. One production incident where an operator used a reserved keyword in JSON (type vs kind) cost us half a day of downtime. Protobuf's reserved fields are compile-time errors in Rust, and the buf linter would have caught it before CI.

Today, every new Veltrix cluster spins up with the Rust gRPC config service as the default. The PR template now includes a mandatory 10-line diff that proves the new protobuf compiles to Rust and passes cargo test --release. And every on-call rotation starts with kubectl exec config-service-abc123 -- curl -s http://localhost:9090/fingerprint, not with jq . against a JSON file.


When Our Go Engine Blew Up at 3 AM and How Rust Saved the Treasure Hunt

pretty ncube — Wed, 27 May 2026 10:15:55 +0000

The Problem We Were Actually Solving

Our treasure map was a graph of 2.3 million nodes and 6.8 million edges stored in Redis. Each client move emitted a WebSocket message that touched 6–12 nodes, triggered proximity calculations in Lua scripts, and updated leaderboard ranks in a PostgreSQL materialized view. Under moderate load, Go's GC would pause for 40–80 ms every 200 ms. During the city-wide launch party, 35,000 users hit refresh simultaneously after a live clue drop. The GC pauses jumped to 300 ms and the RSS curve looked like an EKG alarm. We were dropping WebSocket frames at 8 % and the OOM killer eventually evicted two pods. Users saw "Leaderboard n/a" for 47 seconds.

What We Tried First (And Why It Failed)

First fix: tune GOGC. We dropped it from 100 to 50, then 25. Response time improved 12 % but latency percentiles remained spiky. We tried running pprof against the GC. It showed 34 % of CPU time in mark termination, with 2.1 million heap objects per second being scanned. The Lua interpreter embedded in Redis was allocating 4 KB Lua stacks per call, and Go's escape analysis revealed our map objects were escaping to the heap because the graph traversal used a slice of pointers.

Next attempt: rewrite the proximity calculation in C and call it via cgo. This reduced GC pressure by 18 %, but the cgo boundary added 1.2 µs of latency per hop, and we hit the cgo call limit of 2000 per second due to the sheer number of proximity checks. The latency tail grew from 20 ms to 35 ms.

We profiled the Redis Lua scripts themselves. They were spending 30 % of CPU in string concatenation when constructing proximity strings. We replaced those strings with base64-encoded SHA-1 hashes, but Redis memory usage exploded from 9 GB to 14 GB, and the Lua interpreter still had to scan every node once per move.

The Architecture Decision

At this point I admitted the language runtime was the bottleneck, not the algorithm. Go's GC is great for batch processing but terrible for interactive, latency-sensitive workloads with irregular allocations. I chose Rust for the new treasure core, targeting a rewrite of the graph traversal and proximity engine. We kept Redis and PostgreSQL as data stores but moved the CPU-heavy pathfinding to a separate Rust service deployed on Kubernetes with cpu=2,memory=4Gi limits.

Key trade-offs:

A generational arena allocator in Rust eliminated pointer chasing and let us pre-allocate 16 MB node buffers upfront.
We used petgraph with raw indices instead of Box to cut memory footprint by 60 %.
Tokio's work-stealing scheduler handled 80,000 concurrent WebSocket moves without GC pauses.
Lost two weeks to lifetimes and borrow-checker fights, but the binary size grew only 400 KB.
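The "raw indices instead of Box" point is easiest to see in a compressed sparse row (CSR) layout, where the whole graph is two flat vectors of integers and a neighbour lookup is a slice. The sketch below is a standalone illustration of that layout, not petgraph's API; CsrGraph and its methods are hypothetical names, and edges are treated as directed.

```rust
/// Compressed sparse row graph: the neighbours of node `i` are
/// `targets[offsets[i]..offsets[i + 1]]`. No per-node heap allocation, no pointers.
pub struct CsrGraph {
    offsets: Vec<u32>,
    targets: Vec<u32>,
}

impl CsrGraph {
    /// Builds the graph from directed (source, target) pairs.
    /// Panics if an endpoint is not below `node_count`.
    pub fn from_edges(node_count: usize, edges: &[(u32, u32)]) -> Self {
        // Count the out-degree of each node, shifted by one slot...
        let mut offsets = vec![0u32; node_count + 1];
        for &(src, dst) in edges {
            assert!((dst as usize) < node_count, "target out of range");
            offsets[src as usize + 1] += 1;
        }
        // ...then prefix-sum so offsets[i] is where node i's neighbours start.
        for i in 0..node_count {
            offsets[i + 1] += offsets[i];
        }
        // Fill targets with a moving write cursor per node.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = cursor[src as usize] as usize;
            targets[slot] = dst;
            cursor[src as usize] += 1;
        }
        Self { offsets, targets }
    }

    /// Outgoing neighbours of `node`, in the order the edges were supplied.
    pub fn neighbors(&self, node: u32) -> &[u32] {
        let start = self.offsets[node as usize] as usize;
        let end = self.offsets[node as usize + 1] as usize;
        &self.targets[start..end]
    }
}
```

At the scale quoted earlier (2.3 million nodes, 6.8 million edges) this layout needs roughly 36 MB if each edge is stored once: 4 bytes per offset plus 4 bytes per target.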

What The Numbers Said After

After the Rust rewrite:

GC CPU dropped from 34 % to 2 %.
P99 WebSocket latency fell from 82 ms to 14 ms.
RSS stabilized at 1.8 GB per pod under full load (previously 12 GB).
Peak throughput climbed from 11,000 moves/sec to 47,000 moves/sec without dropping frames.
OOM events dropped to zero over the next four weeks.

We ran perf on the Rust binary and observed 87 % of CPU in the proximity hot loop, which now used a compact 8-byte adjacency list. The goroutine leak that had gone unnoticed for weeks disappeared because Tokio's task cancellation was reliable and didn't leak stacks.

What I Would Do Differently

I would not have started with cgo. Cgo added latency boundaries and call-rate limits that made the problem worse. If I had to choose again, I would have written a minimal Redis module in Rust and loaded it into Redis directly, but we avoided that because the Redis module API is unstable across patch versions.

I would insist on production load tests using vegeta or hey that replay the exact event pattern—city-wide drop, 35,000 moves in under 20 seconds—not just steady-state metrics. Our earlier Go tests used 1,000 users at 50 moves/sec and missed the pathological case.

Finally, I would budget two extra sprints for Rust onboarding: pair-programming the borrow checker, running miri on the adjacency code to catch undefined behavior, and setting up cargo-llvm-cov to track test coverage. That cost is real, but the latency cliff we avoided is priceless.