<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Illia</title>
    <description>The latest articles on Forem by Illia (@lvc1d).</description>
    <link>https://forem.com/lvc1d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1393751%2Fd889c9ef-01cf-45da-8e72-a773ae8fc85a.png</url>
      <title>Forem: Illia</title>
      <link>https://forem.com/lvc1d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lvc1d"/>
    <language>en</language>
    <item>
      <title>Cache Optimization in Rust: From HashMap Surprises to 4x Image Processing Speedup</title>
      <dc:creator>Illia</dc:creator>
      <pubDate>Wed, 26 Nov 2025 16:52:33 +0000</pubDate>
      <link>https://forem.com/lvc1d/cache-optimization-in-rust-from-hashmap-surprises-to-4x-image-processing-speedup-256l</link>
      <guid>https://forem.com/lvc1d/cache-optimization-in-rust-from-hashmap-surprises-to-4x-image-processing-speedup-256l</guid>
      <description>&lt;h2&gt;
  
  
  The Surprising Benchmark
&lt;/h2&gt;

&lt;p&gt;Imagine you and I share a task: given a file with 10,000 words in it, count the occurrences of each word and present the top 10 most common ones.&lt;/p&gt;

&lt;p&gt;We come up with a few ideas, write them out in an IDE, and benchmark them to see how blazingly fast our code is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;word_counting_hashing   &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;4.6504 ms 4.6591 ms 4.6686 ms]
Found 13 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;13.00%&lt;span class="o"&gt;)&lt;/span&gt;
  6 &lt;span class="o"&gt;(&lt;/span&gt;6.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild
  7 &lt;span class="o"&gt;(&lt;/span&gt;7.00%&lt;span class="o"&gt;)&lt;/span&gt; high severe

word_counting_binary    &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;29.797 ms 30.042 ms 30.304 ms]
Found 6 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;6.00%&lt;span class="o"&gt;)&lt;/span&gt;
  6 &lt;span class="o"&gt;(&lt;/span&gt;6.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild

Benchmarking word_counting_linear: Warming up &lt;span class="k"&gt;for &lt;/span&gt;3.0000 s
Warning: Unable to &lt;span class="nb"&gt;complete &lt;/span&gt;100 samples &lt;span class="k"&gt;in &lt;/span&gt;5.0s. You may wish to increase target &lt;span class="nb"&gt;time &lt;/span&gt;to 12.0s, or reduce sample count to 40.
word_counting_linear    &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;120.82 ms 121.11 ms 121.39 ms]

end_to_end              &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;4.8724 ms 4.8830 ms 4.8946 ms]
                        change: &lt;span class="o"&gt;[&lt;/span&gt;−0.3363% +0.9093% +1.7798%] &lt;span class="o"&gt;(&lt;/span&gt;p &lt;span class="o"&gt;=&lt;/span&gt; 0.09 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 0.05&lt;span class="o"&gt;)&lt;/span&gt;
                        No change &lt;span class="k"&gt;in &lt;/span&gt;performance detected.
Found 12 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;12.00%&lt;span class="o"&gt;)&lt;/span&gt;
  7 &lt;span class="o"&gt;(&lt;/span&gt;7.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild
  5 &lt;span class="o"&gt;(&lt;/span&gt;5.00%&lt;span class="o"&gt;)&lt;/span&gt; high severe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In essence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear: 121 ms&lt;/li&gt;
&lt;li&gt;Binary: 30 ms (4x improvement, right?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HashMap: 4.65 ms&lt;/strong&gt; - What?! But how?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fantastic question! Before I answer the how, we can both agree that a binary search is still not bad, right? After all, it performs at most log(N) comparisons instead of up to N.&lt;/p&gt;

&lt;p&gt;So what's the deal here?&lt;/p&gt;

&lt;p&gt;When you hear the word "cache", what's the first thing that comes to mind? &lt;/p&gt;

&lt;p&gt;Your browser's cache and cookies? &lt;br&gt;
A physical safe-like box where you store something important to retrieve later? &lt;br&gt;
Something else?&lt;/p&gt;

&lt;p&gt;In either case, the idea is pretty much the same: a place to put something for easy retrieval later.&lt;/p&gt;

&lt;p&gt;In the technical sense, 'cache-friendly' code means structuring your data so the CPU can access it efficiently. &lt;/p&gt;

&lt;p&gt;The difference? Milliseconds versus seconds.&lt;/p&gt;

&lt;p&gt;Now, the secret lies in how CPUs actually access memory. &lt;br&gt;
Let me show you what's happening under the hood - and why it matters for every Rust program you write.&lt;/p&gt;
&lt;h2&gt;
  
  
  How CPU Cache Actually Works
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Cache Hierarchy
&lt;/h3&gt;

&lt;p&gt;The CPU cache is not a single area of fast storage for later retrieval. It has three levels, each with its own characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L1: fastest yet smallest (on the CPU core itself, typically 32 - 64 KB)&lt;/li&gt;
&lt;li&gt;L2: not as fast as L1, yet still much faster than RAM (on modern architectures also on the CPU; roughly 512 KB - 2 MB)&lt;/li&gt;
&lt;li&gt;L3: typically shared between several CPU cores (usually a few to tens of MB)&lt;/li&gt;
&lt;li&gt;RAM: the slowest yet the largest (nowadays dozens of GB; you generally want to avoid serving all your hot operational data from here, or performance suffers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The better-designed the code is, the closer to L1 your data stays - ensuring the highest possible throughput in the shortest possible timeframe.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache Lines - The 64-Byte Unit
&lt;/h3&gt;

&lt;p&gt;When the CPU needs some data for processing (say, the contents of a file already loaded into memory), it does not fetch that data byte by byte. It loads a cache line - a 64-byte chunk of memory containing the requested data - which it can then work through quickly.&lt;/p&gt;

&lt;p&gt;Since this chunk is only 64 bytes, it fits easily into the L1 cache, so processing it is virtually instant (we are talking a few nanoseconds per access).&lt;/p&gt;

&lt;p&gt;Take a concrete example: a vector of u32 integers, each 4 bytes wide (32 bits / 8 = 4 bytes). When you iterate over it, the CPU loads a 64-byte cache line holding the next contiguous slice of the vector and runs through it very quickly - in this case, up to 16 elements per cache line.&lt;/p&gt;
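&lt;p&gt;We can check that arithmetic in code (assuming the typical 64-byte line; some CPUs, e.g. Apple Silicon, use 128 bytes):&lt;/p&gt;

```rust
use std::mem::size_of;

// Typical cache line size on x86_64; an assumption, not queried from hardware.
const CACHE_LINE_BYTES: usize = 64;

fn main() {
    // A u32 is 4 bytes, so one 64-byte cache line holds 16 of them.
    let per_line = CACHE_LINE_BYTES / size_of::<u32>();
    println!("u32 elements per cache line: {}", per_line); // 16

    // A Vec<u32> of 10_000 elements spans ceil(40_000 / 64) = 625 cache lines.
    let bytes = 10_000 * size_of::<u32>();
    let lines_needed = (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    println!("cache lines for 10k u32s: {}", lines_needed); // 625
}
```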
&lt;h3&gt;
  
  
  Spatial Locality
&lt;/h3&gt;

&lt;p&gt;Now, when it comes to cache optimization, there are two kinds of locality (patterns in how data is located and accessed in memory):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spatial locality&lt;/li&gt;
&lt;li&gt;Temporal locality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spatial locality means the data is laid out contiguously (as in a vector), so loading one cache line grabs a whole contiguous chunk of the structure to iterate through at a time.&lt;/p&gt;

&lt;p&gt;If you remember from before, our benchmark showed rather decent latency for both the linear and the binary approaches to iterating over the words and counting occurrences.&lt;/p&gt;

&lt;p&gt;The reason is spatial locality. The CPU loads 64-byte chunks of the vector, processing multiple words per cache line instead of fetching each word individually. This gives us a high rate of so-called "cache hits" and avoids constantly reloading cache lines.&lt;/p&gt;
&lt;h3&gt;
  
  
  Temporal Locality
&lt;/h3&gt;

&lt;p&gt;Temporal locality means accessing the same memory location repeatedly over time. If you keep hitting the same data (each such access is a "cache hit"), it stays hot in cache.&lt;br&gt;
The opposite is called a cache miss.&lt;/p&gt;

&lt;p&gt;Worth noting about the latter: on a cache miss, the CPU has to load a fresh 64-byte cache line containing the requested data, evicting an older line to make room for it.&lt;/p&gt;

&lt;p&gt;Looking back at our benchmarks, the HashMap approach was dramatically faster on the same word-counting task.&lt;br&gt;
Let's dig into why.&lt;/p&gt;

&lt;p&gt;In this particular case, the text in question contained a lot of repeating common words such as "the", "a", "most", etc.&lt;/p&gt;

&lt;p&gt;A HashMap can be viewed as a table of memory buckets; each bucket holds the key (for String keys, the character data itself lives on the heap) and the value associated with that key.&lt;/p&gt;

&lt;p&gt;When a lookup first misses the cache, the bucket holding that key and its associated value is loaded into a cache line.&lt;/p&gt;

&lt;p&gt;The more duplicates we encounter, the hotter that cache line stays - it keeps being reused instead of evicted - and that is the key advantage of this approach here.&lt;/p&gt;
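&lt;p&gt;To make this concrete, here is a minimal sketch of the HashMap word counter (illustrative code, not the exact benchmark implementation): repeated words keep hammering the same hot buckets through the &lt;code&gt;entry&lt;/code&gt; API.&lt;/p&gt;

```rust
use std::collections::HashMap;

/// Count word occurrences and return the `n` most common words.
/// A sketch of the benchmarked idea, not the exact benchmark code.
fn top_n_words(text: &str, n: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in text.split_whitespace() {
        // `entry` locates or inserts the bucket in a single lookup;
        // duplicate words hit the same (hot) bucket again and again.
        *counts.entry(word).or_insert(0) += 1;
    }
    let mut pairs: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(w, c)| (w.to_string(), c))
        .collect();
    // Sort by count descending, then alphabetically for stable output.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.truncate(n);
    pairs
}

fn main() {
    let text = "the cat and the dog and the bird";
    println!("{:?}", top_n_words(text, 2)); // [("the", 3), ("and", 2)]
}
```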

&lt;p&gt;Sure, you could still stick to an algorithm that favours spatial locality by scanning neighboring elements, but that would keep reloading cache lines even when we meet the same duplicate words many times.&lt;/p&gt;
&lt;h3&gt;
  
  
  Issue with the Binary Search
&lt;/h3&gt;

&lt;p&gt;Although the binary search finished in around 30 ms, it is worth asking when this algorithm is beneficial and when it becomes an overhead on the CPU and on memory access.&lt;/p&gt;

&lt;p&gt;Think about it: on a large sorted vector, we grab the index in the middle of the collection.&lt;br&gt;
Then, depending on whether the target is greater or less than the middle value, we jump to a completely different memory region, load another cache line, and repeat.&lt;/p&gt;

&lt;p&gt;Imagine how many jumps across various memory regions we need to perform and how many cache misses we encounter...&lt;/p&gt;

&lt;p&gt;Sure - in terms of time complexity, we still beat the linear approach. But is that always a win in practice?&lt;br&gt;
Sometimes the CPU would rather be fed the vector linearly, chunk by cache-line-sized chunk, moving on to the next one until the value is found.&lt;br&gt;
A cache line then only needs replacing once it has been fully scanned without a match.&lt;/p&gt;

&lt;p&gt;In essence, nearly every step in binary search triggers a cache miss because each comparison jumps to a distant memory location, loading a fresh cache line while discarding the previous one.&lt;/p&gt;
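&lt;p&gt;An illustrative sketch of why (not the benchmark code): tracing the indices a binary search probes shows how far apart consecutive memory accesses land.&lt;/p&gt;

```rust
// Record the indices a binary search probes on a sorted slice.
// On a large vector, consecutive probes land hundreds of thousands of
// elements apart, so nearly every step touches a cold cache line.
fn binary_search_trace(haystack: &[u32], needle: u32) -> Vec<usize> {
    let (mut lo, mut hi) = (0usize, haystack.len());
    let mut probes = Vec::new();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        probes.push(mid);
        if haystack[mid] < needle {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    probes
}

fn main() {
    let data: Vec<u32> = (0..1_000_000).collect();
    let probes = binary_search_trace(&data, 3);
    // The first probes are ~500k, ~250k, ~125k elements apart - each jump
    // lands megabytes away from the previous access.
    println!("{:?}", &probes[..4]);
}
```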
&lt;h2&gt;
  
  
  Cache Coherency and False Sharing
&lt;/h2&gt;

&lt;p&gt;We now understand how the cache behaves in simple single-threaded examples like word counting, but we still need to understand what happens when multiple threads are involved.&lt;br&gt;
Let's look at a more complex scenario: blurring an image using multiple threads.&lt;/p&gt;

&lt;p&gt;(&lt;em&gt;Link to the project's slightly-messy GitHub repo &lt;a href="https://github.com/LVC1D/image-blur" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache Coherency: The Foundation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you want to speed up image processing by using multiple CPU cores. You split the work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread 1 (Core 1): Process the red channel&lt;/li&gt;
&lt;li&gt;Thread 2 (Core 2): Process the green channel&lt;/li&gt;
&lt;li&gt;Thread 3 (Core 3): Process the blue channel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds perfect, right? But there's a catch.&lt;br&gt;
Each CPU core has its own L1 cache. When Core 1 loads some red channel data into its L1 cache and modifies it, the other cores need to know about this change. Otherwise, Core 2 might read stale data that's no longer correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is cache coherency&lt;/strong&gt; - the mechanism that keeps all the caches in sync across multiple cores.&lt;br&gt;
Here's what happens when Core 1 modifies data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 writes to its L1 cache&lt;/li&gt;
&lt;li&gt;The cache line is marked as "modified"&lt;/li&gt;
&lt;li&gt;Other cores that have this cache line must invalidate their copies&lt;/li&gt;
&lt;li&gt;Next time Core 2 needs that data, it must fetch the updated version&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In essence, cache coherency ensures correctness in multi-threaded programs, but it comes with a performance cost - cores must communicate and synchronize their caches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Sharing: When Cache Lines Betray You&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now here's where things get tricky. Remember that the CPU loads entire 64-byte cache lines, not individual variables?&lt;br&gt;
False sharing happens when &lt;strong&gt;two or more threads modify DIFFERENT variables that happen to live in the SAME cache line&lt;/strong&gt;.&lt;br&gt;
Here's a concrete example. Suppose we're tracking progress from multiple threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ThreadCounters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;thread1_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Bytes 0-7&lt;/span&gt;
    &lt;span class="n"&gt;thread2_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Bytes 8-15  &lt;/span&gt;
    &lt;span class="n"&gt;thread3_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Bytes 16-23&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// All three counters fit in ONE 64-byte cache line!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now watch what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Core 1 increments thread1_count → invalidates the cache line on Core 2 and Core 3&lt;/li&gt;
&lt;li&gt;Core 2 increments thread2_count → invalidates the cache line on Core 1 and Core 3&lt;/li&gt;
&lt;li&gt;Core 3 increments thread3_count → invalidates the cache line on Core 1 and Core 2&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The threads aren't sharing data logically&lt;/strong&gt; (each has its own counter), &lt;br&gt;
but they're &lt;strong&gt;sharing a cache line physically&lt;/strong&gt;. Hence the name: false sharing.&lt;/p&gt;

&lt;p&gt;The result? &lt;em&gt;Cache ping-pong&lt;/em&gt;. &lt;br&gt;
Performance tanks even though the threads are supposedly independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution?&lt;/strong&gt; Force each counter into its own cache line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[repr(align(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="nd"&gt;))]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;CachePadded&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ThreadCounters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;thread1_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CachePadded&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Gets its own 64-byte line&lt;/span&gt;
    &lt;span class="n"&gt;thread2_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CachePadded&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Gets its own 64-byte line&lt;/span&gt;
    &lt;span class="n"&gt;thread3_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CachePadded&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Gets its own 64-byte line&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now each thread can work independently without invalidating the others' cache lines.&lt;/p&gt;
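&lt;p&gt;We can confirm the padding with &lt;code&gt;size_of&lt;/code&gt; (as a side note, the &lt;em&gt;crossbeam-utils&lt;/em&gt; crate ships a production-grade &lt;code&gt;CachePadded&lt;/code&gt; built on the same idea):&lt;/p&gt;

```rust
use std::mem::{align_of, size_of};

// Same idea as above: force the wrapped value onto its own cache line.
#[repr(align(64))]
struct CachePadded<T> {
    #[allow(dead_code)]
    value: T,
}

fn main() {
    // A bare u64 occupies 8 bytes; the aligned wrapper is padded to 64,
    // so three padded counters can no longer share one cache line.
    assert_eq!(size_of::<u64>(), 8);
    assert_eq!(size_of::<CachePadded<u64>>(), 64);
    assert_eq!(align_of::<CachePadded<u64>>(), 64);
    println!("padded size: {} bytes", size_of::<CachePadded<u64>>());
}
```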

&lt;p&gt;&lt;strong&gt;How This Relates to Image Blur&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You might be wondering: does our image blur code suffer from false sharing?&lt;br&gt;
Actually, no! Here's why:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Vec is a separate heap allocation. When Thread 1 processes the red channel, it's modifying memory in a completely different region from where Thread 2 is processing green. They're not sharing cache lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; - and this is important - the choice between Array of Structs (AoS) and Struct of Arrays (SoA) &lt;strong&gt;does&lt;/strong&gt; affect cache behavior in a single-threaded context, which is what we'll explore next.&lt;/p&gt;

&lt;p&gt;Now that we've covered both cache coherency (how cores stay in sync) and false sharing (when proximity hurts), let's see how data layout choices affect our image blur performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Blur Optimization
&lt;/h2&gt;

&lt;p&gt;A note to keep in mind before we dive into each approach: &lt;br&gt;
the algorithm serving as our foundation is a 3x3 box blur.&lt;br&gt;
Each pixel's new value is the sum of that pixel and all 8 of its neighbors, divided by 9 (the pixel count).&lt;/p&gt;

&lt;p&gt;For simplicity, we will not wrap around the edges toroidally to blur them - we will simply return the edge pixels as they are.&lt;/p&gt;
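&lt;p&gt;As an illustrative sketch of that rule (single channel, flat row-major storage; not the exact project code):&lt;/p&gt;

```rust
// Average a pixel with its 8 neighbors in a single-channel image stored
// as a flat row-major slice. Edge pixels are returned unchanged, matching
// the simplification described above.
fn box_blur_pixel(channel: &[u8], width: usize, height: usize, x: usize, y: usize) -> u8 {
    if x == 0 || y == 0 || x == width - 1 || y == height - 1 {
        return channel[y * width + x]; // edge: unchanged
    }
    let mut sum: u32 = 0;
    for dy in -1i32..=1 {
        for dx in -1i32..=1 {
            let nx = (x as i32 + dx) as usize;
            let ny = (y as i32 + dy) as usize;
            sum += channel[ny * width + nx] as u32;
        }
    }
    (sum / 9) as u8
}

fn main() {
    // 3x3 image: the center becomes the average of all nine values.
    let img: [u8; 9] = [9, 9, 9, 9, 90, 9, 9, 9, 9];
    let blurred = box_blur_pixel(&img, 3, 3, 1, 1);
    println!("{}", blurred); // (8 * 9 + 90) / 9 = 18
}
```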

&lt;p&gt;As a reference, see the latest benchmark results below (some may not seem intuitive, but we will cover that in just a moment).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Gnuplot not found, using plotters backend
blur_naive              &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;2.0894 ms 2.0918 ms 2.0942 ms]
                        change: &lt;span class="o"&gt;[&lt;/span&gt;−0.2921% −0.1217% +0.0394%] &lt;span class="o"&gt;(&lt;/span&gt;p &lt;span class="o"&gt;=&lt;/span&gt; 0.15 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 0.05&lt;span class="o"&gt;)&lt;/span&gt;
                        No change &lt;span class="k"&gt;in &lt;/span&gt;performance detected.
Found 2 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;2.00%&lt;span class="o"&gt;)&lt;/span&gt;
  1 &lt;span class="o"&gt;(&lt;/span&gt;1.00%&lt;span class="o"&gt;)&lt;/span&gt; low mild
  1 &lt;span class="o"&gt;(&lt;/span&gt;1.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild

blur_cache_optimized    &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;2.2716 ms 2.2772 ms 2.2842 ms]
                        change: &lt;span class="o"&gt;[&lt;/span&gt;−18.085% −17.836% −17.549%] &lt;span class="o"&gt;(&lt;/span&gt;p &lt;span class="o"&gt;=&lt;/span&gt; 0.00 &amp;lt; 0.05&lt;span class="o"&gt;)&lt;/span&gt;
                        Performance has improved.
Found 6 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;6.00%&lt;span class="o"&gt;)&lt;/span&gt;
  4 &lt;span class="o"&gt;(&lt;/span&gt;4.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild
  2 &lt;span class="o"&gt;(&lt;/span&gt;2.00%&lt;span class="o"&gt;)&lt;/span&gt; high severe

Benchmarking blur_separable: Warming up &lt;span class="k"&gt;for &lt;/span&gt;3.0000 s
Warning: Unable to &lt;span class="nb"&gt;complete &lt;/span&gt;100 samples &lt;span class="k"&gt;in &lt;/span&gt;5.0s. You may wish to increase target &lt;span class="nb"&gt;time &lt;/span&gt;to 9.3s, &lt;span class="nb"&gt;enable &lt;/span&gt;flat sampling, or reduce sample count to 50.
blur_separable          &lt;span class="nb"&gt;time&lt;/span&gt;:   &lt;span class="o"&gt;[&lt;/span&gt;1.8354 ms 1.8374 ms 1.8395 ms]
Found 2 outliers among 100 measurements &lt;span class="o"&gt;(&lt;/span&gt;2.00%&lt;span class="o"&gt;)&lt;/span&gt;
  1 &lt;span class="o"&gt;(&lt;/span&gt;1.00%&lt;span class="o"&gt;)&lt;/span&gt; low mild
  1 &lt;span class="o"&gt;(&lt;/span&gt;1.00%&lt;span class="o"&gt;)&lt;/span&gt; high mild
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To simplify (on average):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naive: 2.09 ms&lt;/li&gt;
&lt;li&gt;Cache-optimized (standard): 2.27 ms&lt;/li&gt;
&lt;li&gt;Cache-optimized (separable filtering): 1.83 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Naive approach
&lt;/h3&gt;

&lt;p&gt;In a naive approach, we went for an AoS data structure type (Array of Structs).&lt;br&gt;
This is what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Pixel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ImageAoS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Pixel&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means if we want to blur this image, we loop over the vector once, read each pixel's red / green / blue fields, and return a new vector of blurred pixels.&lt;/p&gt;

&lt;p&gt;The function signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;blur_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Pixel&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Pixel&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* the core logic */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at the pros and cons of this approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one loop -&amp;gt; a single pass over the data, streaming one 64-byte cache line at a time&lt;/li&gt;
&lt;li&gt;simple design&lt;/li&gt;
&lt;li&gt;temporal locality: accessing r→g→b of the same pixel happens back to back&lt;/li&gt;
&lt;li&gt;one cache line holds all three channels for ~21 pixels (64 bytes / 3 bytes per pixel)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;locating the neighbors: since we are dealing with a 1D vector, identifying the vertical and horizontal neighbors requires some index arithmetic
(based on the provided &lt;code&gt;width&lt;/code&gt; and &lt;code&gt;height&lt;/code&gt; arguments)&lt;/li&gt;
&lt;li&gt;non-contiguous channel data: while the pixels as a whole are contiguous in memory, the values of any single channel are not&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cache-optimized (standard)
&lt;/h3&gt;

&lt;p&gt;Logically, if we want to process one entire channel at a time (all reds, then all greens, etc.), we want an SoA structure&lt;br&gt;
(Struct of Arrays).&lt;/p&gt;

&lt;p&gt;As a gentle reminder from our False Sharing section, here is what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Debug)]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each field is a vector of all the relevant channel's pixel values.&lt;/p&gt;

&lt;p&gt;Let's take a quick look at the function signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;blur_cache_optimized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* the core logic */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to respect the intent that each channel is processed one at a time, we cannot have a single loop juggle THREE distinct vectors.&lt;br&gt;
Not only would that mean hopping between three separate memory regions on every iteration, but the function call would also show up as longer if we profiled it in a flamegraph.&lt;/p&gt;

&lt;p&gt;Let's look at some pros and cons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one tight loop per channel -&amp;gt; each streams one 64-byte cache line at a time&lt;/li&gt;
&lt;li&gt;spatial locality: within one channel vector, all elements are contiguous&lt;/li&gt;
&lt;li&gt;each cache line holds data from a single channel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;three loops: instead of one loop, we need three to handle each channel separately - because our project was single-threaded&lt;/li&gt;
&lt;li&gt;worse performance: per the benchmark above, the average latency is roughly 200 microseconds slower than the naive version - contradicting the original intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As for that contradictory result: it seems we overlooked another piece of the puzzle - one that does not look directly related, yet greatly affects the outcome and ties together everything discussed so far.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separable filtering
&lt;/h3&gt;

&lt;p&gt;To understand what we mean by separable filtering, it helps to see how a professional-grade implementation achieves the same goal.&lt;br&gt;
Let us reference &lt;a href="https://docs.rs/imageproc/latest/src/imageproc/filter/mod.rs.html#320-560" rel="noopener noreferrer"&gt;this source code&lt;/a&gt; from a publicly available crate: &lt;em&gt;imageproc&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The author applies something akin to separation of concerns: each piece of logic performs one task AND ONLY that task, without interfering with the underlying logic of other code.&lt;br&gt;
Concretely, the main function is composed of two parts: &lt;code&gt;filter_horizontal()&lt;/code&gt; (in our case, &lt;code&gt;blur_horizontal()&lt;/code&gt;)&lt;br&gt;
and &lt;code&gt;filter_vertical()&lt;/code&gt; (in our case, &lt;code&gt;blur_vertical()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Back in the standard cache-optimized approach, we still had to compute each average pixel value by summing 9 elements.&lt;br&gt;
Here, however, each pass only averages 3 values (horizontally: left-center-right; vertically: top-center-bottom).&lt;/p&gt;

&lt;p&gt;Together, not only do we perform 6 additions instead of 9, but we also jump across scattered memory regions far less often than in the previous approach - hence the roughly 200-microsecond improvement over the naive version.&lt;/p&gt;
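
&lt;p&gt;To make this concrete, here is a minimal single-channel sketch of the two-pass idea - the function names and the flat &lt;code&gt;f32&lt;/code&gt; buffer are my own simplification, not the project's actual code:&lt;/p&gt;

```rust
// Separable 3x3 box blur on a single channel stored row-major:
// one horizontal pass (left-center-right), then one vertical pass
// (top-center-bottom), so each pixel costs 3 + 3 additions instead of 9.
// Edge pixels are handled by clamping the neighbor index.

fn blur_horizontal(src: &[f32], w: usize, h: usize) -> Vec<f32> {
    let mut out = vec![0.0; w * h];
    for y in 0..h {
        for x in 0..w {
            let l = x.saturating_sub(1);
            let r = (x + 1).min(w - 1);
            out[y * w + x] = (src[y * w + l] + src[y * w + x] + src[y * w + r]) / 3.0;
        }
    }
    out
}

fn blur_vertical(src: &[f32], w: usize, h: usize) -> Vec<f32> {
    let mut out = vec![0.0; w * h];
    for y in 0..h {
        let t = y.saturating_sub(1);
        let b = (y + 1).min(h - 1);
        for x in 0..w {
            out[y * w + x] = (src[t * w + x] + src[y * w + x] + src[b * w + x]) / 3.0;
        }
    }
    out
}

fn box_blur_separable(src: &[f32], w: usize, h: usize) -> Vec<f32> {
    blur_vertical(&blur_horizontal(src, w, h), w, h)
}

fn main() {
    // Sanity check: a flat image blurs to itself.
    let img = vec![1.0f32; 16];
    let blurred = box_blur_separable(&img, 4, 4);
    assert!(blurred.iter().all(|&p| (p - 1.0).abs() < 1e-6));
}
```

&lt;p&gt;Note that both passes stream through the buffer row by row in the inner loop, which is the sequential access pattern the cache rewards.&lt;/p&gt;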

&lt;p&gt;Now what does this all tell us regarding the caching, false sharing and deciding what approach works best when?&lt;/p&gt;

&lt;h2&gt;
  
  
  Results &amp;amp; Lessons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Give me them numbers!
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Word counting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HashMap: 4.6 ms avg&lt;/li&gt;
&lt;li&gt;Vector - Binary search: 30 ms avg &lt;/li&gt;
&lt;li&gt;Vector - Linear: 121 ms avg&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image blurring&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naive (AoS): 2.01 ms avg&lt;/li&gt;
&lt;li&gt;Cache-optimized (SoA: Standard): 2.2 ms avg &lt;/li&gt;
&lt;li&gt;Cache-optimized (SoA: Separable blurring): 1.8 ms &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you will see below, these results don't come down solely to how cache-friendly our data structure is, but also to how our logic interacts with it, and to what the Rust compiler does with the data at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do the Rust compiler and the algorithm picks have to do with cache-friendliness?
&lt;/h3&gt;

&lt;p&gt;If we are dealing with data that has a great deal of repeated values, a HashMap is the most suitable approach: repeated lookups keep hitting the same buckets, so the cache lines holding those values are more likely to stay hot.&lt;/p&gt;

&lt;p&gt;However, that is not to say linear or binary search should be ignored. Either approach is feasible when lookups are infrequent and/or the data set is small - then it becomes a question of whether those infrequent iterations are worth speeding up (O(log N) vs O(N) time complexity).&lt;/p&gt;
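
&lt;p&gt;For reference, here is a hedged sketch of the HashMap approach from the word-counting benchmark - &lt;code&gt;top_words&lt;/code&gt; and the sample text are my own illustration, not the benchmarked code:&lt;/p&gt;

```rust
use std::collections::HashMap;

// Tally each word, then sort the (word, count) pairs to extract the top k.
fn top_words(text: &str, k: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in text.split_whitespace() {
        // Repeated words hash to the same bucket, so their cache lines stay hot.
        *counts.entry(word).or_insert(0) += 1;
    }
    let mut pairs: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(w, c)| (w.to_string(), c))
        .collect();
    // Descending by count, ties broken alphabetically for deterministic output.
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    pairs.truncate(k);
    pairs
}

fn main() {
    let top = top_words("the cat sat on the mat the cat", 2);
    assert_eq!(top[0], ("the".to_string(), 3));
    assert_eq!(top[1], ("cat".to_string(), 2));
}
```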

&lt;p&gt;The image-blurring project raises the concern from a different, equally important angle. &lt;br&gt;
The benchmark results demonstrated that the choice of algorithm matters even more than picking between AoS and SoA - the data structure is only one part of the story; how our API interacts with it is another part entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can I learn from all this?
&lt;/h3&gt;

&lt;p&gt;Let me briefly go over the key lessons I cemented over weeks 6 and 7 while covering cache coherence, cache optimisation and how to interpret the relevant benchmark results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Always measure, never assume - as practice showed twice in my case, an approach that looks solid on paper will not necessarily perform that way in the compiled binary. Code up and test each hypothesis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Study professional implementations (like imageproc's separable filter) - there's a reason why certain implementations work the way they do. Studying production code (like Rust crates on crates.io) teaches you patterns you wouldn't discover on your own.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test across different scenarios (small vs large datasets, different patterns) - HashMap dominated my word-counting benchmark because the text had many duplicates. But what if the data were different - unique formulas, random UUIDs, or different sizes? The optimal choice changes with the workload. Test your actual use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data structure choice isn't just about API ergonomics - it directly controls memory layout, which determines cache behavior. This makes it a first-class performance consideration, not an afterthought.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The North Star
&lt;/h3&gt;

&lt;p&gt;If you want your Rust code to be as blazingly fast as some people online market it, stay constantly aware of how your code actually interacts with the CPU and memory - yours AND those of anyone you distribute to. Let that be your north star when picking the right algorithm, the right data structure and the right optimisations to apply.&lt;/p&gt;

&lt;p&gt;No matter where you are in your journey of learning Rust (or any systems programming language) - you are in charge of how your code performs. &lt;br&gt;
No matter the results you see right now - there will always be at least one way to restructure the API and chop off those extra milliseconds / microseconds. Your users (and your future debugging self) will thank you for it!&lt;/p&gt;

</description>
      <category>rust</category>
      <category>hashmap</category>
      <category>cache</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>From 'Why the F@&amp;k Do I Need This?' to 'Oh, That's Why' - My GAT Journey</title>
      <dc:creator>Illia</dc:creator>
      <pubDate>Fri, 07 Nov 2025 17:35:40 +0000</pubDate>
      <link>https://forem.com/lvc1d/from-why-the-fk-do-i-need-this-to-oh-thats-why-my-gat-journey-1i80</link>
      <guid>https://forem.com/lvc1d/from-why-the-fk-do-i-need-this-to-oh-thats-why-my-gat-journey-1i80</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"Understanding GATs isn't about memorizing syntax - it's about that moment when you finally see WHY they exist."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anyone can find the material needed to grasp the essence of Async Rust or some of its advanced types. But what does the process of truly understanding them actually look like? &lt;br&gt;
It's different for everyone - so let me share how it has gone so far from my perspective.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Journey
&lt;/h2&gt;

&lt;p&gt;Before I even started the RBES (Rust Blockchain &amp;amp; Embedded Systems) curriculum - keep in mind that my passion for learning Rust from scratch didn't appear out of thin air.&lt;br&gt;
No no - I actually learned quite a few other languages to test out my passion for programming - JavaScript / TypeScript, Golang, and Swift (where it all started, because I wanted to be in the Apple ecosystem).&lt;/p&gt;

&lt;p&gt;When I heard about Rust, the whole idea of stringent memory safety and low-level programming drew me in - and it was hyped quite a bit, so I decided to grab&lt;br&gt;
the Rust book from the official site and just have a crack at it.&lt;/p&gt;

&lt;p&gt;After SIX iterations of the curriculum design (took several months, now that I remember) - I can proudly say I am now at Week 5 of the RBES program where the prior weeks&lt;br&gt;
are synthesized to fully solidify the material covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What those weeks covered, you might ask?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, let me just briefly go through them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 1 + 2: Async Rust - Fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I didn't just use Tokio - I built a mini runtime with &lt;code&gt;Pin&lt;/code&gt;, &lt;code&gt;Poll&lt;/code&gt;, and custom wakers to understand what &lt;code&gt;.await&lt;/code&gt; actually does under the hood.&lt;/p&gt;

&lt;p&gt;On top of that, I covered the importance of &lt;code&gt;Arc&lt;/code&gt; and &lt;code&gt;Mutex&lt;/code&gt;, passing resources over Tokio-powered channels such as &lt;code&gt;mpsc&lt;/code&gt;, &lt;br&gt;
and the underlying marker traits that let us work safely in async logic: &lt;code&gt;Send&lt;/code&gt; and &lt;code&gt;Sync&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 3 + 4: GATs + HRTBs + PhantomTypes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, we got to know what those types are and how they differ from, say, a normal trait bound in a type definition or a function signature - they let us write more flexible&lt;br&gt;
APIs, juggle several lifetimes, avoid runtime costs and allocate far less data on the heap.&lt;/p&gt;

&lt;p&gt;Check my Week 5 progress to get a better sense of how that learning process went (spoiler alert: NOT as quick and smooth as it sounded in theory).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Week 5 (where we're at)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of the above material has been brought together to attempt to solidify the knowledge accumulated before &lt;br&gt;
moving forward to something a little bit scarier but a whole lot more exciting:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Unsafe Rust!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the week where I got to further strengthen my understanding of, primarily, Week 3 + 4 material, namely the GATs.&lt;br&gt;
And I thought that in the beginning of Week 3 I understood what GATs' true power is - boy was I wrong... &lt;/p&gt;

&lt;p&gt;Let me just illustrate briefly how my concept knowledge went from Week 3 to Week 5:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GATs: "We can work with generics and lifetimes when designing a more expressive trait with more flexible associated types" -&amp;gt; "GATs allow us to &lt;br&gt;
return values borrowed from the implementor's data for as long as the implementor is alive - no memory allocation is done, whereas with a normal associated type the borrow is released because you return an owned value"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HRTBs: That pretty much stayed the same - a function takes a closure (or a generic itself) that can work with ANY lifetime the caller decides on - per call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PhantomTypes (the easiest for me) - an efficient tool that uses zero-sized types to build state machines whose states are validated at compile time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That bit went from "I'm not sure why they are necessary in real code", along with "why the f**k did you give me a task to use a GAT when you told me afterwards it wasn't needed?!", all the way to&lt;br&gt;
"Here is my idea of how a GAT could be useful" (I will show you that example in just a bit).&lt;/p&gt;

&lt;p&gt;But first... Let's talk about some of my [modest] portfolio highlights.&lt;/p&gt;
&lt;h2&gt;
  
  
  Portfolio Highlights
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Async Payment Processor - The GAT "Aha" Moment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where GATs stopped being a theoretical concept and became a necessary tool.&lt;br&gt;
The task was simple on the surface: build a payment processor that could handle transactions asynchronously. But here's the catch - I needed to return borrowed data from the processor's internal state without cloning it on every operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Technical Challenge:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without GATs, I had two bad options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone everything:&lt;/strong&gt; Return owned Strings and Vecs, which means heap allocations on every transaction query. When you're processing thousands of payments, every clone is latency you can't afford.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use lifetimes everywhere:&lt;/strong&gt; But then every caller needs to manage lifetimes explicitly, making the API a nightmare to use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The GAT Solution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AsyncPaymentProcessor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TransactionData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;get_transaction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TxId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TransactionData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;'a&lt;/code&gt; parameter in the associated type? That's the magic. It says: "I'm returning borrowed data that lives as long as the processor does." The caller gets a reference tied to &lt;code&gt;self&lt;/code&gt;'s lifetime - no clones, no unnecessary heap allocations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Proves:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I can identify when zero-cost abstractions actually matter. In systems where performance isn't just nice-to-have (blockchain consensus, game engines, real-time trading), understanding the difference between "elegant code" and "efficient code" is the difference between shipping and failing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Mini Async Runtime - Understanding What .await Actually Does&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Anyone can use Tokio. I built my own minimal runtime to understand what happens &lt;em&gt;under the hood&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Insight:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest "aha" wasn't about Pin or Poll syntax - it was understanding the &lt;em&gt;handoff dance&lt;/em&gt; between futures and the IO driver:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Future yields:&lt;/strong&gt; "I'm waiting on network IO, here's my waker handle"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Executor parks it:&lt;/strong&gt; Future goes into a HashMap (not a queue!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OS completes IO:&lt;/strong&gt; Driver receives notification&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Waker activates:&lt;/strong&gt; Driver looks up the TaskId and wakes that &lt;em&gt;specific&lt;/em&gt; future&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Executor re-polls:&lt;/strong&gt; Future resumes exactly where it left off&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why HashMap vs VecDeque Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an IO operation completes, the OS needs to wake &lt;em&gt;that specific future&lt;/em&gt;, not scan through a queue. HashMap lookup by TaskId is O(1) - this is how real async runtimes scale to thousands of concurrent tasks without wasting CPU cycles polling futures that aren't ready.&lt;/p&gt;
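
&lt;p&gt;A toy sketch of that bookkeeping (a real runtime parks &lt;code&gt;std::task::Waker&lt;/code&gt;s; the &lt;code&gt;Task&lt;/code&gt; type and the ids here are made up for illustration):&lt;/p&gt;

```rust
use std::collections::{HashMap, VecDeque};

type TaskId = u64;

#[derive(Debug, PartialEq)]
struct Task(&'static str);

struct MiniExecutor {
    parked: HashMap<TaskId, Task>, // futures waiting on IO, keyed by id
    ready: VecDeque<TaskId>,       // ids queued for re-polling
}

impl MiniExecutor {
    fn new() -> Self {
        Self { parked: HashMap::new(), ready: VecDeque::new() }
    }

    fn park(&mut self, id: TaskId, task: Task) {
        self.parked.insert(id, task);
    }

    // The IO driver calls this when the OS reports completion for `id`:
    // an O(1) lookup wakes exactly that task, no queue scan needed.
    fn wake(&mut self, id: TaskId) -> bool {
        if self.parked.contains_key(&id) {
            self.ready.push_back(id);
            true
        } else {
            false
        }
    }

    fn next_ready(&mut self) -> Option<Task> {
        let id = self.ready.pop_front()?;
        self.parked.remove(&id)
    }
}

fn main() {
    let mut ex = MiniExecutor::new();
    ex.park(7, Task("read_socket"));
    ex.park(9, Task("write_file"));
    assert!(ex.wake(9)); // only task 9 is woken
    assert_eq!(ex.next_ready(), Some(Task("write_file")));
    assert!(ex.next_ready().is_none());
}
```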

&lt;p&gt;&lt;strong&gt;What This Proves:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I don't just use abstractions - I understand their implementation trade-offs. When debugging production async code or optimizing hot paths, knowing &lt;em&gt;why&lt;/em&gt; the runtime makes certain choices means I can write code that works &lt;em&gt;with&lt;/em&gt; the system, not against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. HTTP Request Builder - Compile-Time State Machines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This one showcases phantom types and the builder pattern, but more importantly, it demonstrates how Rust's type system can &lt;em&gt;prevent entire classes of bugs at compile time&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Design:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those PhantomData markers don't exist at runtime (zero-size types), but at compile time they enforce state transitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Request&amp;lt;M, Initialized&amp;gt;&lt;/code&gt; can call &lt;code&gt;.with_header()&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;Request&amp;lt;M, HeadersSet&amp;gt;&lt;/code&gt; can call &lt;code&gt;.with_body()&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;Request&amp;lt;M, HeadersSet&amp;gt;&lt;/code&gt; can call &lt;code&gt;.send()&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You &lt;em&gt;cannot&lt;/em&gt; call .send() on a request without headers. Not "you shouldn't" - you &lt;em&gt;cannot&lt;/em&gt;. The compiler won't let you compile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you're building APIs at scale - whether it's a blockchain RPC client or a game's network layer - catching state machine errors at compile-time means your users can't misuse your API. It's the difference between "safe by convention" (documentation saying "don't do this") and "safe by construction" (the type system enforcing it).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Proves:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I can design APIs that are both &lt;em&gt;ergonomic&lt;/em&gt; (builder pattern feels natural) and &lt;em&gt;bulletproof&lt;/em&gt; (invalid states are unrepresentable). This is the kind of API design that matters when mistakes are expensive - and in blockchain or embedded systems, mistakes &lt;em&gt;are&lt;/em&gt; expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The GAT Breakthrough (Week 4-5)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;The Setup: Confidence Before the Fall&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Week 3, I learned about GATs. I understood the syntax. I could explain that they're "generic associated types that let you add lifetime parameters." I thought I got it.&lt;br&gt;
Week 4, you (Claude) gave me a task: &lt;em&gt;"Build an HTTP request builder using phantom types. Find a way to incorporate GATs."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cool. I'd mastered GATs, right? Let's do this.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;em&gt;The Struggle: When Theory Meets Reality&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;I tried &lt;em&gt;everything&lt;/em&gt; to make GATs work with the state machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;BuilderStep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Next&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// GAT for state transitions!&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;next_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Next&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But wait... if the GAT method gets to &lt;em&gt;decide&lt;/em&gt; the return type based on the generic parameter S, how do I &lt;strong&gt;enforce&lt;/strong&gt; that .with_header() &lt;em&gt;must&lt;/em&gt; return HeadersSet state?&lt;br&gt;
The whole point of phantom types is compile-time guarantees. &lt;br&gt;
If the method itself is generic over the next state, I've just opened the door to any state transition - which defeats the entire purpose of a state machine.&lt;br&gt;
I kept hitting this wall. The compiler would accept my code, but it was &lt;em&gt;wrong&lt;/em&gt;. Too flexible. No guarantees.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;The Realization: Not Every Problem Needs a Hammer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After what felt like hours of wrestling with syntax, you said something that changed how I think about these tools:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"GATs aren't for strict type transitions. They're for when you need &lt;strong&gt;lifetime flexibility&lt;/strong&gt; or &lt;strong&gt;borrowing from self&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Oh.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Oh.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GATs weren't wrong for state machines because I was bad at using them. They were wrong because &lt;strong&gt;that's not what they're for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The HTTP builder needs &lt;em&gt;specific, enforced transitions&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Method&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Initialized&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Your implementation&lt;/span&gt;
        &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;with_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HeadersSet&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.headers&lt;/span&gt;&lt;span class="nf"&gt;.insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;Request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PhantomData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No GAT. No lifetime parameter. Just: &lt;em&gt;"This state goes to this state, period."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Proof: Finding Where GATs Actually Belong&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Then came the async payment processor.&lt;/p&gt;

&lt;p&gt;This time, the problem was different: I needed to return borrowed data from the processor's internal state - transaction records, account balances - without cloning them on every query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AsyncPaymentProcessor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TransactionData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;get_transaction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TxId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TransactionData&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where GATs shine. The lifetime 'a ties the returned data to how long self is borrowed. &lt;br&gt;
The caller can hold a reference as long as they're holding the processor - no clones, no heap allocations, no unnecessary overhead.&lt;/p&gt;

&lt;p&gt;Without GATs? I'd be returning owned Strings and Vecs on every transaction query. With thousands of concurrent transactions, that's latency I can't afford.&lt;/p&gt;
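
&lt;p&gt;To show the borrow in action, here is a simplified &lt;em&gt;synchronous&lt;/em&gt; sketch of the trait above - &lt;code&gt;InMemoryProcessor&lt;/code&gt; and the &lt;code&gt;Transaction&lt;/code&gt; fields are illustrative, not the project's real types:&lt;/p&gt;

```rust
use std::collections::HashMap;

type TxId = u64;

#[derive(Debug, PartialEq)]
struct Transaction {
    amount_cents: i64,
    memo: String,
}

trait PaymentProcessor {
    // The GAT: the associated type carries a lifetime parameter.
    type TransactionData<'a> where Self: 'a;
    fn get_transaction<'a>(&'a self, id: TxId) -> Option<Self::TransactionData<'a>>;
}

struct InMemoryProcessor {
    ledger: HashMap<TxId, Transaction>,
}

impl PaymentProcessor for InMemoryProcessor {
    // Returning &'a Transaction: no clone, no heap allocation per query.
    type TransactionData<'a> = &'a Transaction where Self: 'a;

    fn get_transaction<'a>(&'a self, id: TxId) -> Option<&'a Transaction> {
        self.ledger.get(&id)
    }
}

fn main() {
    let mut ledger = HashMap::new();
    ledger.insert(1, Transaction { amount_cents: 4200, memo: "coffee".into() });
    let p = InMemoryProcessor { ledger };
    let tx = p.get_transaction(1).unwrap(); // borrows from p, lives while p does
    assert_eq!(tx.amount_cents, 4200);
    assert!(p.get_transaction(2).is_none());
}
```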

&lt;h3&gt;
  
  
  &lt;strong&gt;The Lesson: Judgment Over Memorization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The breakthrough wasn't learning GAT syntax - it was developing &lt;strong&gt;engineering judgment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phantom types:&lt;/strong&gt; When you need compile-time state enforcement with zero runtime cost&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GATs:&lt;/strong&gt; When you need to return borrowed data with flexible lifetimes tied to self&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regular associated types:&lt;/strong&gt; When you just need type-level abstraction without borrowing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
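&lt;p&gt;For the phantom-type case, here's a compact sketch of the zero-cost state machine idea (the Connection type and its states are made up for the example):&lt;/p&gt;

```rust
use std::marker::PhantomData;

// Type-level states: these exist only at compile time.
struct Open;
struct Closed;

// The state parameter costs nothing at runtime thanks to PhantomData.
struct Connection<State> {
    addr: String,
    _state: PhantomData<State>,
}

impl Connection<Closed> {
    fn new(addr: &str) -> Self {
        Connection { addr: addr.to_string(), _state: PhantomData }
    }

    // Consuming `self` enforces the transition: the old state is gone.
    fn open(self) -> Connection<Open> {
        Connection { addr: self.addr, _state: PhantomData }
    }
}

impl Connection<Open> {
    // `send` only exists on an open connection; calling it on a
    // Connection<Closed> is a compile error, not a runtime check.
    fn send(&self, msg: &str) -> usize {
        msg.len() // pretend we wrote `msg` to `self.addr`
    }
}

fn main() {
    let conn = Connection::new("127.0.0.1:8080").open();
    assert_eq!(conn.send("ping"), 4);
    println!("sent to {}", conn.addr);
}
```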

&lt;p&gt;It's not about which tool is "better" - it's about recognizing which problem you're actually solving.&lt;/p&gt;

&lt;p&gt;And honestly? That's the lesson I wish I'd understood in Week 3. &lt;br&gt;
But struggling through the HTTP builder, being wrong, getting frustrated, and &lt;em&gt;then&lt;/em&gt; building the payment processor with the right tool for the right job - that's what made it stick.&lt;/p&gt;

&lt;p&gt;Anyone can memorize syntax. Understanding &lt;em&gt;why&lt;/em&gt; a feature exists? That takes bruises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Week 1-2: Async Rust - From Surface to System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before the RBES curriculum, I thought async Rust was just async/.await syntax and maybe using mpsc channels for message passing. That's it. I'd read the Rust book's async section, understood the surface layer because I had JavaScript experience, and figured that was enough.&lt;/p&gt;

&lt;p&gt;I was &lt;em&gt;painfully&lt;/em&gt; wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Changed:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now I understand the entire runtime executor cycle - how futures are polled, how the IO driver signals the waker when operations complete, how the executor knows which specific future to re-poll. It's not magic syntax - it's a coordinated dance between the executor, futures, and the OS.&lt;/p&gt;
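&lt;p&gt;To see that dance without a runtime at all, here's a hand-driven future (the TwoStep type and the no-op waker are illustrative scaffolding, not real tokio internals):&lt;/p&gt;

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future that needs two polls: the first returns Pending and signals
// the waker, the second is Ready. A real IO driver does the same thing,
// except wake() fires when the OS reports the resource is ready.
struct TwoStep {
    polled: bool,
}

impl Future for TwoStep {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.polled {
            Poll::Ready(42)
        } else {
            self.polled = true;
            // Tell the executor this future should be polled again.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// Minimal no-op waker: enough to drive a future by hand.
unsafe fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &VTABLE)
}
unsafe fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

fn noop_waker() -> Waker {
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = TwoStep { polled: false };
    let mut fut = Pin::new(&mut fut);

    // The "executor loop", unrolled by hand:
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(42));
    println!("future completed");
}
```

&lt;p&gt;A real executor just does this loop for many futures at once, re-polling only the ones whose wakers actually fired.&lt;/p&gt;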

&lt;p&gt;&lt;strong&gt;What I Can Explain Confidently:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Async functions are lazy until polled (via .await, tokio::join!, tokio::select!, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Send/Sync marker traits and when shared resources need to be wrapped in Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt; for multi-threaded safety&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;That TCP server handling 500 concurrent connections? I can explain why the mpsc receiver needed Arc&amp;lt;Mutex&amp;lt;Receiver&amp;gt;&amp;gt; for multiple workers to grab tasks from a single channel&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
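&lt;p&gt;A self-contained sketch of that single-channel, many-workers shape - using std threads and std::sync::mpsc rather than tokio so it runs on its own:&lt;/p&gt;

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Fan tasks out to `workers` threads that all pull from one mpsc receiver.
// The receiver is single-consumer, so sharing it requires Arc<Mutex<...>>.
fn fan_out(tasks: Vec<u32>, workers: usize) -> u32 {
    let (tx, rx) = mpsc::channel();
    for t in tasks {
        tx.send(t).unwrap();
    }
    drop(tx); // close the channel so recv() errors once it's drained

    let rx = Arc::new(Mutex::new(rx));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut local_sum = 0;
                loop {
                    // Lock only long enough to grab one task, then release.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(n) => local_sum += n,
                        Err(_) => break, // channel drained and closed
                    }
                }
                local_sum
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 1 + 2 + ... + 10 = 55, no matter how the work is split across workers.
    assert_eq!(fan_out((1..=10).collect(), 4), 55);
    println!("all tasks processed");
}
```

&lt;p&gt;The tokio version swaps in tokio::sync::Mutex and .await, but the shape - one producer, one guarded receiver, many workers - is the same.&lt;/p&gt;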

&lt;p&gt;&lt;strong&gt;What I Don't Know Yet:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Honestly? I won't know what I need to look up until I hit a project with requirements beyond basic async patterns. The async_trait macro for the payment processor was a perfect example - completely new territory. That's how learning works: you don't know the gaps until you need to fill them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Week 3-4: Type System - Bruises Build Understanding&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Difficulty Ranking (Hardest → Easiest):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HRTBs&lt;/strong&gt; - Even the official docs admit explicit HRTBs are rare (despite recent lifetime elision improvements in Rust 1.85+). I haven't had enough real-world experience with them because needing to handle multiple inputs' lifetimes (or no lifetimes at all) just doesn't come up often.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GATs&lt;/strong&gt; - Still not trivial, but the bruises from Week 5 made them stick. Applying them wrong (HTTP builder), then applying them right (async payment processor) taught me more than any documentation could.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phantom Types&lt;/strong&gt; - Straightforward once you get the zero-cost state machine pattern. These clicked fastest for me.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The False Confidence Trap:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week 3, first encounter with GATs: "Yeah, I get it - generic associated types with lifetime parameters."&lt;br&gt;
Week 5, after the HTTP builder struggle: "Oh... I didn't get it &lt;em&gt;at all&lt;/em&gt;."&lt;br&gt;
That gap between "I understand the syntax" and "I know when and why to use this" is where Week 5 lived. The intuition that something was off in Week 3 was right - I just hadn't done the work yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My 2-Sentence GAT Test:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When should you use a GAT?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you need to return borrowed data tied to the implementor's lifetime, without fixing the exact type up front - the generic associated type leaves that choice to each implementor of the trait, not the trait definition itself. If you need strict type transitions (like state machines), GATs are the wrong tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Week 5: Consolidation - Concepts Meeting Reality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The big shift in Week 5 wasn't learning &lt;em&gt;new&lt;/em&gt; concepts - it was seeing how async and GATs work &lt;em&gt;together&lt;/em&gt; in real systems. Building the async payment processor forced me to combine everything: lifetime management, async traits, borrowing semantics, all in one design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If I Could Rebuild One Project:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The TCP server. Not the basic string-printing version we built in Week 1, but a more sophisticated one using the newtype pattern and GATs to make the connection handling more interesting. Now that I actually understand what GATs are for, I could design something that borrows connection state efficiently instead of cloning data everywhere.&lt;/p&gt;

&lt;p&gt;That's the clearest sign of progress: not "I finished the curriculum," but "I'd do it differently now."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Can Build Now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Current Capabilities (Proven in Portfolio):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrent TCP servers&lt;/strong&gt; handling hundreds of simultaneous connections with proper async runtime usage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type-safe state machines&lt;/strong&gt; using phantom types for compile-time validation (HTTP request builders, protocol state tracking)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-allocation async APIs&lt;/strong&gt; using GATs to return borrowed data from internal state (payment processors, data access layers)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom async patterns&lt;/strong&gt; beyond just using Tokio - understanding &lt;em&gt;why&lt;/em&gt; executors work the way they do&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Realistically, Today:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything that combines the Week 1-5 concepts - basic servers, API clients, concurrent task schedulers, type-safe builders. These are &lt;em&gt;components&lt;/em&gt; of larger systems, not full applications yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Can't Build Yet (And I'm Honest About It):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production blockchain infrastructure&lt;/strong&gt; (RPC load balancers, validator monitoring platforms, MEV optimization - that's Phase 2, Weeks 15-38)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embedded systems with real hardware&lt;/strong&gt; (STM32/ESP32 interfacing, RTOS integration, automotive CAN bus - that's Phase 3, Weeks 39-61)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge AI inference systems&lt;/strong&gt; (on-device ML models, sensor fusion, industrial IoT - that's Phase 3, Weeks 46-60)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production observability stacks&lt;/strong&gt; (Prometheus/Grafana pipelines, distributed tracing, SRE practices - that's Phase 5, Weeks 70-71)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A better answer to "what can I build?" will come after Phase 2 (blockchain infrastructure) and Phase 3 (embedded + edge AI). For now, I can build sophisticated &lt;em&gt;components&lt;/em&gt; - the building blocks that go into larger systems. That's not modest - that's realistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Collaborators/Employers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you need someone to build Rust libraries with strong API design, async concurrency, and compile-time safety guarantees, I can do that. &lt;br&gt;
If you need someone who's shipped production blockchain RPC infrastructure or embedded automotive systems, that's not me... yet.&lt;/p&gt;

&lt;p&gt;I know the difference. And knowing what you don't know is just as valuable as knowing what you do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open for Collaboration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is my first full technical blog post, so any feedback is genuinely appreciated - comments, corrections, questions, even just a "this resonated with me" helps me understand what's landing and what's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who I'm hoping to connect with:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're a Rust developer (whether you're a few weeks ahead or a few years ahead), or if you're working in blockchain infrastructure, embedded systems, or edge AI and want to discuss the technical challenges in these domains, I'd love to hear from you. The best learning happens in conversation, not isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'm&lt;/strong&gt; &lt;em&gt;&lt;strong&gt;not&lt;/strong&gt;&lt;/em&gt; &lt;strong&gt;doing (yet):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm not seeking consulting work or commercial collaborations at this stage - I'm Week 5 of an 85-week curriculum, and I'm being realistic about where I am. As I build more substantial projects and publish more posts, you'll see those calls-to-action appear naturally. For now, I'm just documenting the journey and inviting technical discussion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparency note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My curriculum is commercially focused (blockchain infrastructure + embedded systems + edge AI), so by Month 18-24, you'll absolutely see me positioning for client work. But I'd rather be honest about being in the learning phase than oversell capabilities I haven't proven yet.&lt;/p&gt;

&lt;p&gt;If you want to follow along, you can find my projects on &lt;a href="https://github.com/LVC1D" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, reach me on &lt;br&gt;
&lt;a href="https://discord.com/users/hectorcryo" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, or connect on &lt;a href="https://www.reddit.com/u/Open-Ad-5175/s/oIcNHjKBrA" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;. &lt;br&gt;
I'm documenting everything - the wins, the bruises, the "oh NOW I get it" moments.&lt;/p&gt;

&lt;p&gt;Because understanding GATs isn't about memorizing syntax - it's about that moment when you finally see &lt;em&gt;why&lt;/em&gt; they exist.&lt;/p&gt;

&lt;p&gt;And I'm still collecting those moments.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>async</category>
      <category>gat</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
