Forem: Dharrsan Amarnath

Your Struct is Wasting Memory and You Don't Know It

Dharrsan Amarnath — Sat, 25 Apr 2026 00:46:38 +0000

We write structs by listing fields in whatever order feels readable. Name, then age, then score. It compiles. It runs. The compiler silently bloats it, misaligns it, or both, and you ship it without ever checking.

Here are three structs holding the exact same six fields:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct Good {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};

struct Bad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

struct __attribute__((packed)) PackedBad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

int main() {
    printf("Good:      %zu bytes\n", sizeof(struct Good));
    printf("Bad:       %zu bytes\n", sizeof(struct Bad));
    printf("PackedBad: %zu bytes\n", sizeof(struct PackedBad));
    return 0;
}

Good:      32 bytes
Bad:       40 bytes
PackedBad: 27 bytes

Same fields. 27, 32, and 40 bytes. The difference is not the data. It is the order and whether you let the compiler do its job.

What Happens When You Read One Byte

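Before touching any struct, you need to understand how the CPU actually talks to RAM. In the classic model there are three buses connecting them: an address bus, a data bus, and a control bus that carries the read and write commands. The first two are the ones that matter here.

The address bus carries the memory address the CPU wants to read. Think of it as one wire per address bit; x86-64 allows up to 52 physical address bits. The CPU puts a number on these wires and RAM listens.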

The data bus carries the actual bytes back. It is 64 bits wide, so 8 bytes travel in parallel per transfer. But the CPU does not stop at 8 bytes. It keeps bursting transfers across the data bus until it has filled a full cache line.

A cache line is 64 bytes on x86-64 and most ARM cores (some, such as Apple Silicon, use 128; this post assumes 64). That is the only unit of communication between RAM and your L1/L2 cache. For ordinary cacheable memory the CPU never fetches 1 byte. It never fetches 8 bytes. It always fetches a full 64-byte line. When you read a single char, the CPU puts that char's address on the address bus, pulls the entire 64-byte block containing it across the data bus, stores it in cache, and then gives you your one byte out of it.

Every cache line starts at an address that is a multiple of 64. Cache line 0 covers 0x0000 to 0x003F (0 to 63). Cache line 1 covers 0x0040 to 0x007F (64 to 127). Cache line 2 covers 0x0080 to 0x00BF (128 to 191). The boundaries are fixed and always at multiples of 64.

This is the rule everything else in this post follows.

Natural Alignment and Why It Matters

On mainstream 64-bit ABIs, every primitive type has an alignment requirement equal to its own size. A double (8 bytes) must start at an address divisible by 8. A uint32_t (4 bytes) must start at an address divisible by 4. A char (1 byte) can go anywhere.

When a field sits at a naturally aligned address the CPU reads it in one bus transaction. It fits cleanly inside one cache line fetch.

When a field is misaligned it can straddle a cache line boundary. Say a double starts at 0x003C (60). It is 8 bytes, so it occupies 0x003C to 0x0043 (60 to 67). Cache line 0 ends at 0x003F (63). Cache line 1 starts at 0x0040 (64). Your double is split across both. The CPU issues an address request for cache line 0, waits for the data bus to deliver, then issues a second address request for cache line 1, waits again, and stitches both halves together in hardware. Two full round trips to memory for one field read.

Now think about why address mod data_size == 0 prevents this. Cache line boundaries sit at multiples of 64. A naturally aligned double sits at a multiple of 8. The worst case is a double at 0x0038 (56), occupying bytes 56 to 63. It ends exactly at the cache line boundary, never crossing it. This works because 64 is itself a multiple of 8. A field aligned to its own size mathematically cannot straddle a boundary that is also a multiple of that same size. So address mod data_size == 0 is not a style convention. It is the condition that guarantees your field lives inside exactly one cache line, fetched in exactly one bus transaction, with no possibility of being split.

The compiler inserts padding between fields to maintain this guarantee. Bad field ordering forces it to insert a lot of padding. And packed removes all of it.

Good : 32 bytes, nothing wasted

0x0000      0x0008      0x0010 0x0014 0x0016 0x0017  0x001B  0x001F
(0)         (8)         (16)   (20)   (22)   (23)    (27)    (31)
|-----------|-----------|------|----|--|------|--------|
balance     trans_id    acct   reg  st currency  pad

balance at 0x0000 (0). 0 mod 8 = 0. Aligned.
transaction_id at 0x0008 (8). 8 mod 8 = 0. Aligned.
account_type at 0x0010 (16). 16 mod 4 = 0. Aligned.
region_code at 0x0014 (20). 20 mod 2 = 0. Aligned.
status at 0x0016 (22). Char, goes anywhere.
currency at 0x0017 (23). Char array, goes anywhere.

Every field starts exactly where the previous one ended. Zero internal padding.

The 5 bytes at the end are tail padding. In an array the second element must start at an address divisible by 8, the largest field alignment. Without tail padding the second element begins at 0x001B (27) and its balance field lands there too. 27 mod 8 = 3. Misaligned. So the compiler rounds 27 up to 32. The second element starts at 0x0020 (32). 32 mod 8 = 0. Clean.

Zero bytes wasted internally. The tail padding is structural and unavoidable.

Bad : 40 bytes, 13 bytes of dead space

0x0000 0x0001    0x0008      0x0010 0x0012    0x0018      0x0020 0x0024 0x0028
(0)    (1)       (8)         (16)   (18)      (24)        (32)   (36)   (40)
|------|---------|-----------|------|---------|-----------|------|----|
st     7B pad    balance     reg    6B pad    trans_id    curr   acct

status at 0x0000 (0), one byte. The next field is balance, a double that needs a multiple of 8. After byte 1, the nearest multiple of 8 is 0x0008 (8). The compiler inserts 7 bytes of padding between them that store nothing and do nothing.

Then region_code lands at 0x0010 (16), two bytes, ending at 0x0011 (17). The field after it is transaction_id, which needs a multiple of 8. The nearest is 0x0018 (24). Six more bytes gone.

13 bytes wasted purely from field order: 7 after status and 6 after region_code. In an array of a million of these structs that is 13 MB of RAM holding nothing. The struct is 25% larger than Good (40 bytes versus 32), meaning fewer elements fit per cache line and more trips to RAM on every access pattern.

PackedBad : 27 bytes, zero padding, four misaligned fields

0x0000 0x0001    0x0009 0x000B      0x0013  0x0017 0x001B
(0)    (1)       (9)    (11)        (19)    (23)   (27)
|------|---------|------|-----------|--------|------|
st     balance   reg    trans_id    curr    acct

__attribute__((packed)) removes all padding. Fields sit back to back. 27 bytes. But look at where each field actually lands:

balance at 0x0001 (1). 1 mod 8 = 1. Not 0. Misaligned.
region_code at 0x0009 (9). 9 mod 2 = 1. Not 0. Misaligned.
transaction_id at 0x000B (11). 11 mod 8 = 3. Not 0. Misaligned.
account_type at 0x0017 (23). 23 mod 4 = 3. Not 0. Misaligned.

Four multi-byte fields, none of them aligned. In an array, whether a given element straddles a cache line boundary depends on its index. Assuming the array itself starts on a cache line boundary, you can check with:

(index x struct_size) mod 64 + struct_size > 64

For element 2 of PackedBad: (2 x 27) mod 64 + 27 = 54 + 27 = 81. Since 81 > 64, element 2 straddles. Its bytes run from 0x0036 (54) to 0x0050 (80), crossing the cache line boundary at 0x0040 (64). The CPU issues an address request for cache line 0 (0x0000 to 0x003F), waits for the data bus, issues a request for cache line 1 (0x0040 to 0x007F), waits again, and stitches both halves. Two full round trips for one struct read. You saved 13 bytes on paper and doubled the memory traffic for that element in practice.

Tearing

The straddle is slow. In single-threaded code it is just slower. In multithreaded code it is also wrong.

The CPU guarantees a memory access is atomic, meaning indivisible and instantaneous from every other thread's perspective, only when:

address mod data_size == 0

That condition guarantees the field sits inside one cache line and the CPU fetches it in one bus transaction. One transaction means no window for another thread to slip in.

When balance sits at 0x0001 (1) in PackedBad, 1 mod 8 = 1. The condition fails. The CPU fetches the first portion of balance in one bus transaction, then the second portion in a separate bus transaction. There is a real time gap between them.

If another thread writes to that same balance field inside that gap, the reading thread gets the first half from before the write and the second half from after it. A value assembled from two different points in time. A number that was never logically written anywhere in your program.

No segfault. No assertion. No log line. The field silently reads as garbage. In a monitoring system this corrupts your metrics. In a financial system this is a balance that never existed reaching your business logic.

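Good and Bad are both padded by the compiler so every field satisfies address mod data_size == 0, so a single load or store of a field cannot tear at the hardware level. (The C language still requires _Atomic or a lock for any field that threads share.) PackedBad has four fields that fail this condition in element 0, and with a 27-byte stride every element of a PackedBad array has at least one misaligned 8-byte field.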

All Three Side by Side

Struct	Size	Internal padding	Misaligned fields
Good	32 bytes	0 bytes	0
Bad	40 bytes	13 bytes	0
PackedBad	27 bytes	0 bytes	4

Bad pays in memory. PackedBad pays in correctness. Good pays nothing.

The Fix Is Just Field Order

Order fields from largest alignment requirement to smallest:

struct Good {
    double balance;          // 8 bytes
    uint64_t transaction_id; // 8 bytes
    int32_t account_type;    // 4 bytes
    int16_t region_code;     // 2 bytes
    char status;             // 1 byte
    char currency[4];        // 1 byte alignment
};

The compiler has nothing to pad because each field naturally follows the previous one without any gap. No attributes, no pragmas. Just ordering.

Verify with sizeof. Inspect individual field positions with offsetof(struct Foo, field) from <stddef.h> (GCC and Clang implement it as __builtin_offsetof) when something looks off.

When packed Is Actually Correct

__attribute__((packed)) has one valid use: serializing data onto a network socket or disk, where you control both ends and the CPU never does arithmetic directly on the packed bytes.

You pack the struct, write the raw bytes to the wire, and on the receiving end you copy into a properly aligned struct before reading any field. The packed struct is a transport container, not a data structure your code operates on. The moment you read fields out of a packed struct in a running program you pay the straddle penalty on every access and you are one concurrent write away from tearing.

False Sharing

You fix your field order. You remove packed. Everything is aligned. You go multithreaded and all cores pin at 100% while throughput collapses.

struct Good is 32 bytes. Two of them fit inside one 64-byte cache line. Say your array starts at 0x1000 (4096). arr[0] lives at 0x1000 to 0x101F (4096 to 4127). arr[1] lives at 0x1020 to 0x103F (4128 to 4159). Both sit inside the single cache line spanning 0x1000 to 0x103F (4096 to 4159).

Thread 1 writes to arr[0]. Thread 2 writes to arr[1]. Different structs. No shared fields. No mutex involved. But both live in the same 64-byte cache line.

Every time Thread 1 writes to arr[0], the CPU's MESI cache coherency protocol broadcasts an invalidation across the ring bus to every other core: the cache line at 0x1000 was modified, your copies are stale, drop them. Thread 2 has its L1 cache entry for arr[1] ripped away even though nobody touched arr[1]. It takes an L1 miss, goes out to L3, fetches the 64-byte line again, modifies arr[1], and now Thread 1 gets invalidated. Back and forth. The cores spend the vast majority of their time passing one cache line across the ring bus and almost no time doing actual work.

The fix is to give each struct its own cache line:

struct __attribute__((aligned(64))) NodeMetrics {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};

Now arr[0] owns 0x1000 to 0x103F (4096 to 4159) entirely. arr[1] owns 0x1040 to 0x107F (4160 to 4223) entirely. Thread 1 and Thread 2 never touch the same cache line and the coherency protocol never fires between them. You waste 32 bytes per struct. In exchange, the threads stop stealing the line from each other and throughput can scale with the number of cores.

Takeaway

Order fields largest to smallest. Verify with sizeof. Check offsets with offsetof when something feels off. Use packed only for wire or disk formats where you control both ends. Pad to 64 bytes with aligned(64) only when multiple threads write to adjacent elements of an array.

Why Blockchains Exclude Floating Point at the Architecture Level

Dharrsan Amarnath — Mon, 20 Apr 2026 00:51:11 +0000

I ran the same C program on three machines. Same code. Same inputs. Three different answers. Here's exactly why.

The Experiment

#include <stdio.h>
int main() {
    long double x = 0.1L + 0.2L;
    printf("%.20Lf\n", x);
    unsigned char *p = (unsigned char *)&x;
    for (size_t i = 0; i < sizeof(x); i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}

Three machines, each compiling and running the same source:

Machine	OS	Architecture
A	Linux	AMD x86_64
B	Linux	Raspberry Pi ARMv8
C	macOS	Apple Silicon M4 (ARM64)

The Results

Machine A: AMD x86_64 Linux (GCC)

0.30000000000000001665
9f 93 54 5d e9 52 49 81 ff 3f 00 00 00 00 00 00

sizeof(long double) = 16 bytes on this machine. But only the first 10 bytes hold actual data: the remaining 6 are padding added for alignment. The meaningful precision lives in an 80-bit format called x87 extended precision.

Machine B: Raspberry Pi ARM Linux (GCC)

0.30000000000000004441
34 33 33 33 33 33 33 33 33 33 33 33 33 33 fd 3f

sizeof(long double) = 16 bytes here too, but the byte layout is completely different. On 64-bit ARM Linux, GCC implements long double as software-emulated 128-bit quad precision (IEEE-754 binary128). The bytes are not compatible with Machine A's output, even though both are nominally "16 bytes."

Machine C: Apple M4 (ARM64, Clang)

0.30000000000000004
9a 99 99 99 99 99 d3 3f

sizeof(long double) = 8 bytes. On Apple Silicon, Clang maps long double to the same 64-bit double type. There is no extended precision. What you write is exactly what you compute.

Why They Disagree: The IEEE-754 Representation Problem

This is not a hardware quality issue. It is a representation issue.

The core problem: not all decimals fit in binary

The decimal number 0.1 in binary is:

0.0001100110011001100110011001100110011001100110011001100110...

It repeats infinitely. A computer must cut it off at a finite number of bits and round. In IEEE-754 double (64-bit), that cutoff is at 52 stored bits of mantissa (53 significant bits, counting the implicit leading 1).

The layout of a 64-bit IEEE-754 double is:

┌─────────┬───────────────────┬──────────────────────────────────────────────────────┐
│  Sign   │     Exponent      │                    Mantissa                          │
│  1 bit  │     11 bits       │                    52 bits                           │
└─────────┴───────────────────┴──────────────────────────────────────────────────────┘

So before addition even happens:

0.1  ≈  0.1000000000000000055511151231257827021181583404541015625
0.2  ≈  0.2000000000000000111022302462515654042363166809082031250

These are not 0.1 and 0.2. They are the closest representable binary fractions. The rounding error is baked in before a single arithmetic operation runs.

Why addition makes it worse across machines

When you add the two rounded approximations, the machine has to round again, and where that second rounding happens depends on how wide the intermediate register is.

Machine	Intermediate register width	What this means
x86 Linux (A)	x87 80-bit extended	Computation happens with 64 bits of mantissa; a long double keeps all 80 bits when stored, while a double would be rounded back down
ARM Linux (B)	Software 128-bit	The rounding rules of a software IEEE-754 quad implementation are used; produces a different truncation point
Apple M4 (C)	64-bit strict	No intermediate widening at all; the mantissa is 52 bits throughout, start to finish

The rounding path is different. So the final bit pattern is different.

What the hex reveals

Machine A's 16-byte hex: 9f 93 54 5d e9 52 49 81 ff 3f 00 00 00 00 00 00

Bytes 0–9: the 80-bit extended value
Bytes 10–15: compiler-inserted padding (00 00 ...)

Machine B's 16-byte hex: 34 33 33 33 33 33 33 33 33 33 33 33 33 33 fd 3f

All 16 bytes carry data; this is a real 128-bit float
The repeating 33 pattern is the fraction field 0011 0011 ..., the binary expansion of 0.3 normalized to 1.2 x 2^-2, and the leading 34 (the lowest-order byte) is where the final rounding landed

Machine C's 8-byte hex: 9a 99 99 99 99 99 d3 3f

A standard IEEE-754 double, little-endian
3f d3 99 99 99 99 99 9a in big-endian: sign=0, exponent=01111111101 (1021, which is -2 after subtracting the bias of 1023), and the remaining 52 bits are the mantissa, the binary fraction rounded to fit

Why This Is Catastrophic for Distributed Systems

Consider a simple balance operation repeated across nodes:

balance = balance * 1.000000001

After 10 million such operations, the nodes could report figures like these (illustrative):

Node A (x86): $1,000.00000823...
Node B (ARM): $1,000.00000847...
Node C (M4): $1,000.00000819...

The states have diverged. Each node believes a different truth. There is no consensus.

In a traditional distributed database, this is serious but recoverable: a primary node's value wins, replicas sync. But in a blockchain, there is no primary node. Every node is equal. Every node must independently arrive at the exact same bit-for-bit result. If they don't, the network fractures.

The Blockchain Solution: Integer Arithmetic Only

Blockchains don't try to fix floating point. They remove it.

How integers solve the problem

Integer arithmetic has no mantissa, no exponent, no rounding mode. 100 + 200 = 300 on x86, ARMv8, RISC-V, MIPS, and every other architecture, identically, always. There is nothing to round. There are no intermediate registers with different widths.

Integers are bit-for-bit deterministic across all architectures.

How major chains implement this

Ethereum represents all value in wei, stored as uint256. 1 ETH = 10¹⁸ wei. The Ethereum Virtual Machine (EVM) has explicit opcodes for integer arithmetic and deliberately has no floating-point opcode. Smart contract developers who want decimal semantics must implement fixed-point arithmetic manually using integer scaling.

Solana represents all value in lamports, stored as uint64. 1 SOL = 10⁹ lamports. Programs running in the Sealevel runtime must use integer arithmetic for any computation that enters the ledger.

Polkadot represents all value in planck, stored as u128. 1 DOT = 10¹⁰ planck. Logic runs inside WebAssembly-based runtimes where all balance and governance arithmetic is handled exclusively through Rust's integer types (u128, u64), never floats.

Chain       | Unit      | Type    | Scale
------------|-----------|---------|------------------------
Ethereum    | wei       | uint256 | 10^18 per ETH
Solana      | lamport   | uint64  | 10^9 per SOL
Polkadot    | planck    | u128    | 10^10 per DOT

What about real-world prices? (The oracle problem)

Real-world prices such as ETH/USD and BTC/EUR are inherently decimal data. How do oracle networks like Chainlink handle this without introducing floats?

Floating point exists off-chain, integers cross the boundary.

Price data is collected off-chain from exchanges as human-readable decimals
The values are converted to scaled integers, for example with ethers.js parseUnits(), passing the value as a string, not a float, to avoid precision loss at the conversion step itself
The resulting integer is submitted on-chain
Smart contracts only ever see and operate on the scaled integer

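import { parseUnits } from "ethers";  // ethers v6, where parseUnits returns a bigint

// WRONG: the value passes through a 64-bit float before it ever hits the chain
const viaFloat = 0.1 * 1e18  // a Number, limited to 53 bits of precision

// CORRECT: string-based conversion, no precision loss
const amount = parseUnits("0.1", 18)  // → 100000000000000000n (exact)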

The reverse works the same way: formatUnits() converts the on-chain integer back to a human-readable string for display, without ever passing through a float.

Takeaway

Blockchains reject floating point not because it is inaccurate, but because it is not reproducible across machines at the bit level.