<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aman Prasad</title>
    <description>The latest articles on Forem by Aman Prasad (@amanprasad).</description>
    <link>https://forem.com/amanprasad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3670273%2Fd36dd1b5-380a-4509-8105-e4d21070e012.png</url>
      <title>Forem: Aman Prasad</title>
      <link>https://forem.com/amanprasad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/amanprasad"/>
    <language>en</language>
    <item>
      <title>The Inline Myth: Why the inline Keyword is Just a Suggestion</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Mon, 23 Feb 2026 14:01:49 +0000</pubDate>
      <link>https://forem.com/amanprasad/the-inline-myth-why-the-inline-keyword-is-just-a-suggestion-4gfn</link>
      <guid>https://forem.com/amanprasad/the-inline-myth-why-the-inline-keyword-is-just-a-suggestion-4gfn</guid>
      <description>&lt;p&gt;Inline functions are functions that the compiler &lt;em&gt;may&lt;/em&gt; expand directly at the place where they are called, instead of performing a normal function call.&lt;/p&gt;

&lt;p&gt;Inline functions are often misunderstood, especially by beginners who assume that writing the &lt;code&gt;inline&lt;/code&gt; keyword forces the compiler to inline a function.&lt;/p&gt;

&lt;p&gt;In reality, &lt;strong&gt;inline is only a suggestion&lt;/strong&gt;, and modern compilers are far smarter than we often give them credit for. For optimization purposes it is merely a hint, although in C99 it also carries defined linkage semantics.&lt;/p&gt;

&lt;p&gt;This post explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;code&gt;inline&lt;/code&gt; actually means&lt;/li&gt;
&lt;li&gt;Why it is only a hint&lt;/li&gt;
&lt;li&gt;Inline vs macros&lt;/li&gt;
&lt;li&gt;How modern compilers decide to inline&lt;/li&gt;
&lt;li&gt;Performance and binary size trade-offs&lt;/li&gt;
&lt;li&gt;Real compiler behavior using assembly output&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Inline Is a Suggestion, Not a Command
&lt;/h2&gt;

&lt;p&gt;When you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you are &lt;strong&gt;not instructing&lt;/strong&gt; the compiler to inline this function.&lt;/p&gt;

&lt;p&gt;You are merely &lt;strong&gt;suggesting&lt;/strong&gt; that inlining &lt;em&gt;may&lt;/em&gt; be beneficial.&lt;/p&gt;

&lt;p&gt;The compiler is completely free to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inline the function&lt;/li&gt;
&lt;li&gt;Ignore the suggestion&lt;/li&gt;
&lt;li&gt;Inline it in some call sites but not others&lt;/li&gt;
&lt;/ul&gt;
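&lt;p&gt;If you genuinely need a stronger hint, compilers offer non-standard extensions. A minimal sketch, assuming GCC or Clang (which provide the &lt;code&gt;always_inline&lt;/code&gt; attribute; MSVC has &lt;code&gt;__forceinline&lt;/code&gt;); even these can be refused in some situations, such as recursion:&lt;/p&gt;

```c
/* GCC/Clang extension: a much stronger request than plain `inline`.
   Even this can be refused in some cases (e.g. recursive calls). */
static inline __attribute__((always_inline)) int add(int a, int b) {
    return a + b;
}
```

&lt;p&gt;Whether or not inlining happens, the observable behavior of the program is identical; only the generated machine code differs.&lt;/p&gt;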

&lt;h3&gt;
  
  
  Why?
&lt;/h3&gt;

&lt;p&gt;Because &lt;strong&gt;the C/C++ standards do not require compilers to perform any optimization at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Inlining is an optimization, and therefore it &lt;strong&gt;cannot be mandatory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;inline&lt;/code&gt; were mandatory, optimization itself would no longer be optional, which would contradict the language standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern Compilers Inline Even Without &lt;code&gt;inline&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;A very common misconception is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I don’t write &lt;code&gt;inline&lt;/code&gt;, the function won’t be inlined.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is &lt;strong&gt;false&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Modern compilers (GCC, Clang, MSVC):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perform &lt;strong&gt;automatic inlining&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Analyze function size, call frequency, and context&lt;/li&gt;
&lt;li&gt;Inline functions even if the &lt;code&gt;inline&lt;/code&gt; keyword is not used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With optimizations enabled (&lt;code&gt;-O2&lt;/code&gt;, &lt;code&gt;-O3&lt;/code&gt;), the compiler will &lt;strong&gt;very likely&lt;/strong&gt; inline such a small function.&lt;/p&gt;

&lt;p&gt;Today, &lt;code&gt;inline&lt;/code&gt; is more of a semantic hint than a performance switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Inline Exists at All
&lt;/h2&gt;

&lt;p&gt;Inlining was originally introduced to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce &lt;strong&gt;function call overhead&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Improve performance in tight loops&lt;/li&gt;
&lt;li&gt;Replace unsafe macros&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional function call involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pushing arguments onto the stack&lt;/li&gt;
&lt;li&gt;Saving registers&lt;/li&gt;
&lt;li&gt;Jumping to another memory location&lt;/li&gt;
&lt;li&gt;Returning back&lt;/li&gt;
&lt;/ul&gt;
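&lt;p&gt;The transformation can be pictured directly in source form. A conceptual sketch (the optimizer rewrites the call site internally; the "inlined" function below is written by hand purely to show the idea):&lt;/p&gt;

```c
int add(int a, int b) {
    return a + b;
}

/* Before inlining: the caller pays for the full call sequence. */
int call_version(void) { return add(2, 3); }

/* After inlining: the function body replaces the call...       */
int inlined_version(void) { return 2 + 3; }

/* ...and constant folding then reduces it to: return 5;        */
```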

&lt;p&gt;Inlining eliminates this overhead by &lt;strong&gt;expanding the function body at the call site&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Call Overhead Isn’t That Expensive Anymore
&lt;/h2&gt;

&lt;p&gt;Modern CPUs are highly optimized for function calls through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Branch prediction&lt;/li&gt;
&lt;li&gt;Instruction pipelining&lt;/li&gt;
&lt;li&gt;Speculative execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, the overhead of a &lt;strong&gt;well-predicted function call is often very small&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In many cases, aggressive inlining does not yield significant performance gains and can even &lt;strong&gt;hurt performance&lt;/strong&gt; due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased code size&lt;/li&gt;
&lt;li&gt;Instruction cache pressure&lt;/li&gt;
&lt;li&gt;Register pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, the primary benefit of inlining is &lt;strong&gt;not eliminating the call itself&lt;/strong&gt;, but &lt;strong&gt;enabling further compiler optimizations&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Compilers Decide to Inline
&lt;/h2&gt;

&lt;p&gt;Compilers use heuristics. They compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost of the function call&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Size of the function body&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the &lt;strong&gt;cost of the call&lt;/strong&gt; &amp;gt; &lt;strong&gt;cost of the expanded code&lt;/strong&gt;, the compiler may inline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Likely to Be Inlined
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Very small functions&lt;/li&gt;
&lt;li&gt;Simple calculations&lt;/li&gt;
&lt;li&gt;Getters/setters&lt;/li&gt;
&lt;li&gt;Functions called inside loops&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unlikely to Be Inlined
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Large functions&lt;/li&gt;
&lt;li&gt;Functions with loops&lt;/li&gt;
&lt;li&gt;Functions with static variables&lt;/li&gt;
&lt;li&gt;Functions called via function pointers&lt;/li&gt;
&lt;li&gt;Recursive functions&lt;/li&gt;
&lt;/ul&gt;
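&lt;p&gt;The function-pointer case is easy to see. A minimal illustration (the names here are hypothetical): the compiler can trivially inline the direct call, but a call through a pointer generally survives unless the optimizer can prove what the pointer targets.&lt;/p&gt;

```c
static int add(int a, int b) { return a + b; }

typedef int (*binop)(int, int);

/* Direct call: the target is known, so it is trivially inlinable. */
int call_direct(void) { return add(2, 3); }

/* Indirect call: the target is unknown at this point,
   so the call instruction usually remains in the output. */
int call_indirect(binop fp) { return fp(2, 3); }
```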

&lt;h2&gt;
  
  
  Recursive Functions Cannot Be Fully Inlined
&lt;/h2&gt;

&lt;p&gt;Inlining requires the compiler to expand the function body.&lt;/p&gt;

&lt;p&gt;For recursion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inlining would require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infinite expansion&lt;/li&gt;
&lt;li&gt;Unlimited code generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compilers cannot expand recursive calls infinitely, &lt;br&gt;
though they may still inline calls to a limited depth or optimize tail recursion.&lt;/p&gt;
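&lt;p&gt;Tail recursion is the friendlier case. A sketch of the accumulator rewrite (the rewrite itself is illustrative, not from the factorial above): because the recursive call is the very last operation, an optimizing compiler may turn it into a loop, which can then be inlined like any other code.&lt;/p&gt;

```c
/* Plain recursion: the multiply happens after the call returns,
   so each level needs its own stack frame. */
int fact(int n) {
    return n == 0 ? 1 : n * fact(n - 1);
}

/* Tail-recursive form: nothing is left to do after the call,
   so it can be compiled as a simple loop. */
int fact_tail(int n, int acc) {
    return n == 0 ? acc : fact_tail(n - 1, n * acc);
}
```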
&lt;h2&gt;
  
  
  Inline vs Macros
&lt;/h2&gt;

&lt;p&gt;Macros were the original “inline mechanism,” but they come with serious problems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Macro Example
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define ADD(a, b) a + b
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expansion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;//  Wrong result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
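&lt;p&gt;The usual defensive fix is to parenthesize the macro, though that only cures precedence bugs; arguments with side effects are still evaluated twice. A minimal comparison:&lt;/p&gt;

```c
#define ADD_BAD(a, b)  a + b        /* expands unprotected               */
#define ADD_GOOD(a, b) ((a) + (b))  /* parentheses preserve the grouping */

/* 4 * ADD_BAD(2, 2)  expands to 4 * 2 + 2       ->  10 (wrong)   */
/* 4 * ADD_GOOD(2, 2) expands to 4 * ((2) + (2)) ->  16 (correct) */
```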



&lt;h3&gt;
  
  
  Inline Function Equivalent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;//  Correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Too Much Inline Increases Binary Size
&lt;/h2&gt;

&lt;p&gt;Inlining duplicates code at every call site.&lt;/p&gt;

&lt;p&gt;If a function is used in many places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary size increases&lt;/li&gt;
&lt;li&gt;Instruction cache pressure increases&lt;/li&gt;
&lt;li&gt;Performance may actually degrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phenomenon is known as &lt;strong&gt;code bloat&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Inlining trades space for speed.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Experiment: Verifying Inlining Across Optimization Levels
&lt;/h2&gt;

&lt;p&gt;We test a simple program with and &lt;strong&gt;without the &lt;code&gt;inline&lt;/code&gt; keyword&lt;/strong&gt; to observe how the compiler behaves at different optimization levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Case 1: Compilation with &lt;code&gt;-O0&lt;/code&gt; (No Optimization)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;gcc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;  &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;O0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfs71j9rdwt4qtsf2o13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfs71j9rdwt4qtsf2o13.png" alt="assembly generated for the inline function with O0" width="539" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Assembly Observation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nasm"&gt;&lt;code&gt;&lt;span class="nf"&gt;call&lt;/span&gt;    &lt;span class="nv"&gt;_add&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main()&lt;/code&gt; explicitly &lt;strong&gt;calls &lt;code&gt;_add&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Stack frame setup is visible&lt;/li&gt;
&lt;li&gt;No inlining occurs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Symbol Table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nm a.exe | &lt;span class="nb"&gt;grep &lt;/span&gt;add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="mi"&gt;00401&lt;/span&gt;&lt;span class="n"&gt;b40&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;____w64_mingwthr_add_key_dtor&lt;/span&gt;
&lt;span class="mi"&gt;00403880&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;___mingw_readdir&lt;/span&gt;
&lt;span class="mi"&gt;00401460&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;_add&lt;/span&gt;   &lt;span class="c1"&gt;// we can see the add symbol with -O0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;At &lt;code&gt;-O0&lt;/code&gt;, GCC prioritizes debuggability: almost no inlining happens, even when &lt;code&gt;inline&lt;/code&gt; is written.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Important Warning About &lt;code&gt;inline&lt;/code&gt; at &lt;code&gt;-O0&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;C&lt;/strong&gt;, a function defined with plain &lt;code&gt;inline&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt;, by itself, provide an externally visible definition that the linker can resolve.&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;-O0&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The compiler does not inline&lt;/li&gt;
&lt;li&gt;A function call may still be generated&lt;/li&gt;
&lt;li&gt;No external definition is emitted for the &lt;code&gt;inline&lt;/code&gt; function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to a linker error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Call exists, but function does not.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To avoid this issue in C, always use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or provide a separate external definition.&lt;/p&gt;
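&lt;p&gt;In practice, the header-friendly form looks like this (the header name is illustrative):&lt;/p&gt;

```c
/* math_utils.h (illustrative file name)
   `static inline` gives every translation unit its own internal copy,
   so no external symbol is required and no linker error can occur. */
static inline int add(int a, int b) {
    return a + b;
}
```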

&lt;h3&gt;
  
  
  ⚠️ Important Clarification About C99 &lt;code&gt;inline&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In C99, &lt;code&gt;inline&lt;/code&gt; is not only about optimization — it also affects linkage and symbol emission.&lt;/p&gt;

&lt;p&gt;There are three forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;inline&lt;/code&gt; → provides an inline definition but does not emit an external definition.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extern inline&lt;/code&gt; → forces emission of the external definition in one translation unit.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;static inline&lt;/code&gt; → gives internal linkage (each translation unit gets its own copy).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid linker issues at low optimization levels, you can either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;static inline&lt;/code&gt; in headers (common and simple), or&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;inline&lt;/code&gt; in a header and &lt;code&gt;extern inline&lt;/code&gt; in exactly one .c file (the strict C99 model).&lt;/li&gt;
&lt;/ul&gt;
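&lt;p&gt;The strict C99 model, sketched in a single file for brevity (in a real project the &lt;code&gt;inline&lt;/code&gt; definition lives in a header and the &lt;code&gt;extern inline&lt;/code&gt; declaration in exactly one .c file):&lt;/p&gt;

```c
/* Header part: an inline definition; by itself it emits no external symbol. */
inline int add(int a, int b) {
    return a + b;
}

/* In exactly one .c file: this declaration forces the compiler to emit
   the single external definition of add() for the rest of the program. */
extern inline int add(int a, int b);
```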
&lt;h2&gt;
  
  
  Case 2: Compilation with &lt;code&gt;-O2&lt;/code&gt; (Optimized Build)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcc &lt;span class="nt"&gt;-S&lt;/span&gt; test.c &lt;span class="nt"&gt;-O2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Assembly Observation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyczr9pcmwysgyidekefw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyczr9pcmwysgyidekefw.png" alt="assembly generated for the inline function with O0" width="539" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No &lt;code&gt;call _add&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Function is fully inlined&lt;/li&gt;
&lt;li&gt;Constant folding reduces &lt;code&gt;add(2,3)&lt;/code&gt; to &lt;code&gt;5&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Symbol Table
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nm a.exe | &lt;span class="nb"&gt;grep &lt;/span&gt;add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00401b10 T ____w64_mingwthr_add_key_dtor
00403850 T ___mingw_readdir
00401460 T _add 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;⚠️ &lt;strong&gt;Important Observation&lt;/strong&gt;&lt;br&gt;
 &lt;code&gt;_add&lt;/code&gt; still exists, but it is &lt;strong&gt;never called&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why &lt;code&gt;_add&lt;/code&gt; Exists but Is Never Called
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;core question&lt;/strong&gt;, and the answer is subtle but fundamental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason 1: External Linkage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Functions have external linkage by default, meaning another translation unit might call add().&lt;br&gt;
The compiler must therefore keep the symbol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason 2: No Whole-Program Visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without Link Time Optimization (LTO), the compiler cannot prove the function is unused globally.&lt;/p&gt;
&lt;h3&gt;
  
  
  Forcing Removal of &lt;code&gt;_add&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Make the Function &lt;code&gt;static&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internal linkage allows the compiler to remove the symbol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Enable Link Time Optimization (LTO)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcc &lt;span class="nt"&gt;-O2&lt;/span&gt; &lt;span class="nt"&gt;-flto&lt;/span&gt; test.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚠️ Important Note
&lt;/h3&gt;

&lt;p&gt;Although the example uses the &lt;code&gt;inline&lt;/code&gt; keyword for explanation, I also tested the &lt;strong&gt;same code without &lt;code&gt;inline&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When compiled with &lt;code&gt;-O2&lt;/code&gt;, the compiler &lt;strong&gt;still inlined the function automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This confirms that &lt;strong&gt;inlining at higher optimization levels is driven by the compiler’s heuristics&lt;/strong&gt;, not by the presence of the &lt;code&gt;inline&lt;/code&gt; keyword.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;inline&lt;/code&gt; is a &lt;strong&gt;hint&lt;/strong&gt;, not a guarantee&lt;/li&gt;
&lt;li&gt;The compiler may inline even without &lt;code&gt;inline&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inlining primarily enables optimization, but in C99 &lt;code&gt;inline&lt;/code&gt; also affects linkage&lt;/li&gt;
&lt;li&gt;Macros are unsafe; inline functions are type-safe&lt;/li&gt;
&lt;li&gt;Recursive functions cannot be fully inlined&lt;/li&gt;
&lt;li&gt;Excessive inlining increases binary size&lt;/li&gt;
&lt;li&gt;Modern CPUs reduce the benefit of aggressive inlining&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>discuss</category>
      <category>learning</category>
      <category>c</category>
      <category>beginners</category>
    </item>
    <item>
      <title>char str1[] = "hello world"; vs char *str2 = "hello world"; – The Memory Story Every C Programmer Must Know</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:38:25 +0000</pubDate>
      <link>https://forem.com/amanprasad/char-str1-hello-world-vs-char-str2-hello-world-the-memory-story-every-c-programmer-28hd</link>
      <guid>https://forem.com/amanprasad/char-str1-hello-world-vs-char-str2-hello-world-the-memory-story-every-c-programmer-28hd</guid>
      <description>&lt;p&gt;A &lt;strong&gt;string literal&lt;/strong&gt; is a sequence of characters stored in &lt;strong&gt;read-only memory&lt;/strong&gt; and automatically terminated by a null character &lt;code&gt;'\0'&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;//  str1 is a character array stored on stack&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// str2 is a pointer stored on the stack; it points to a string literal in read-only memory&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although they look similar, these two lines behave &lt;strong&gt;very differently in memory&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  char str1[] = "hello world";
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The compiler allocates an array of size &lt;strong&gt;12 bytes&lt;/strong&gt; (&lt;code&gt;11 characters + '\0'&lt;/code&gt;) &lt;strong&gt;on the stack&lt;/strong&gt; (for local variables)&lt;strong&gt;.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;At runtime it &lt;strong&gt;copies&lt;/strong&gt; the 12 bytes from the string literal (typically stored in &lt;code&gt;.rodata&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you end up with &lt;strong&gt;two copies&lt;/strong&gt; of the string:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One immutable in &lt;code&gt;.rodata&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One mutable on the stack&lt;/li&gt;
&lt;li&gt;&lt;em&gt;(If &lt;code&gt;str1&lt;/code&gt; were global or static, it would be stored in &lt;code&gt;.data&lt;/code&gt; instead of the stack.)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if we try to change it, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sc"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// works fine because it is mutable and located on stack&lt;/span&gt;
&lt;span class="n"&gt;str1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"something"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// compilation error array name is not assignable&lt;/span&gt;
&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// not allowed: array names are non-modifiable lvalues&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the snippet above, &lt;code&gt;str1 = "something"&lt;/code&gt; is not allowed because &lt;code&gt;str1&lt;/code&gt; is an array, and array names cannot be reassigned.&lt;br&gt;
If we want to write new data into the existing array str1, we must copy the contents using &lt;code&gt;strcpy&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;strcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"some"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// '\0' is copied automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ &lt;code&gt;strcpy&lt;/code&gt; assumes the destination array is large enough; otherwise it causes buffer overflow.&lt;/p&gt;

&lt;p&gt;Alternatively, we can copy from another character array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;new_str&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"new value"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;strcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_str&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// (Here str1 is the destination array and new_str is the source string.)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if we check the size of the variable str1, it gives 12 bytes&lt;br&gt;
(11 characters + one null character &lt;code&gt;'\0'&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 12 bytes (11 chars + 1 null char '\0')&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;str1&lt;/code&gt; is a &lt;strong&gt;real array&lt;/strong&gt;, and &lt;code&gt;sizeof&lt;/code&gt; returns the &lt;strong&gt;total allocated size of the array&lt;/strong&gt;, not the length of the string stored in it.&lt;/p&gt;
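&lt;p&gt;The array-size vs string-length distinction is easy to check directly (a minimal sketch, assuming &lt;code&gt;str1&lt;/code&gt; is declared as above):&lt;/p&gt;

```c
#include <string.h>

char str1[] = "hello world";

/* sizeof is resolved at compile time: the whole array object, '\0' included.
   strlen walks the bytes at runtime and stops before the '\0'.              */
/* sizeof(str1) -> 12, strlen(str1) -> 11 */
```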

&lt;h2&gt;
  
  
  Why can’t I do &lt;code&gt;str1 = str2&lt;/code&gt;, but I can do &lt;code&gt;strcpy(str1, str2)&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;At first glance, both statements look like they should “copy” a string. However, they do two very different things in C.&lt;/p&gt;

&lt;p&gt;Internally, &lt;code&gt;strcpy&lt;/code&gt; does something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sc"&gt;'\0'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sc"&gt;'\0'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It &lt;strong&gt;never changes the address of &lt;code&gt;str1&lt;/code&gt;&lt;/strong&gt;; it &lt;strong&gt;writes into the memory owned by &lt;code&gt;str1&lt;/code&gt;&lt;/strong&gt;. That is why it is allowed.&lt;/p&gt;

&lt;h3&gt;
  
  
  char *str2 = "hello world";
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The compiler allocates &lt;strong&gt;only the pointer&lt;/strong&gt; (typically 8 bytes on 64-bit systems, 4 bytes on 32-bit) on the stack.&lt;/li&gt;
&lt;li&gt;The pointer’s value is the address of the string literal (typically stored in &lt;code&gt;.rodata&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We cannot modify the &lt;strong&gt;data that &lt;code&gt;str2&lt;/code&gt; points to&lt;/strong&gt;, because the string literal is typically stored in the read-only &lt;code&gt;.rodata&lt;/code&gt; segment.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sc"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;// undefined behavior (often results in segmentation fault)&lt;/span&gt;
&lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                &lt;span class="c1"&gt;// allowed: moves the pointer, not the string data&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// ello world&lt;/span&gt;
&lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bye world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// allowed — repoints to another literal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;string literal&lt;/strong&gt; &lt;code&gt;"hello world"&lt;/code&gt; is placed in &lt;strong&gt;read-only memory&lt;/strong&gt; (&lt;code&gt;.rodata&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;str2&lt;/code&gt; (a pointer) stores the &lt;strong&gt;address&lt;/strong&gt; of that literal&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;const&lt;/code&gt; Habit We Should Adopt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// this is what we should write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the compiler will give you a compile-time error if you try &lt;code&gt;str2[0] = 'x';&lt;/code&gt;, instead of undefined behavior (often a crash) at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens If Two Variables Use the Same String Literal?
&lt;/h2&gt;

&lt;p&gt;Consider the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%p %p&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On many systems, this program prints &lt;strong&gt;the same address&lt;/strong&gt; for both &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x0040507D  0x0040507D
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most modern compilers perform an optimization called &lt;strong&gt;string literal pooling&lt;/strong&gt; (or &lt;strong&gt;string interning&lt;/strong&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identical string literals are &lt;strong&gt;stored only once&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Multiple pointers reference the &lt;strong&gt;same memory location&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;This saves memory and improves cache usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; point to the &lt;strong&gt;same string literal&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Modifying either (which is illegal anyway) would affect both&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important Standard Note
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The C standard does NOT guarantee this behavior.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compilers are &lt;strong&gt;allowed&lt;/strong&gt; to merge identical literals&lt;/li&gt;
&lt;li&gt;Compilers are also &lt;strong&gt;allowed&lt;/strong&gt; to keep them separate&lt;/li&gt;
&lt;li&gt;You must &lt;strong&gt;never rely on their addresses being equal&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So this is valid C:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;   &lt;span class="c1"&gt;// may be true or false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both outcomes are legal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters in Practice
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Never compare string literals using pointer equality&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;// ❌ wrong&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Always use &lt;code&gt;strcmp&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;// ✅ correct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Never attempt to modify string literals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Treat all string literals as &lt;strong&gt;read-only shared objects&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="cm"&gt;/* =========================================================
                PART 1: char str1[] = "hello world";
    ========================================================= */&lt;/span&gt;

    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* str1 is a CHARACTER ARRAY.
    Memory for the array is allocated on the stack.
    The string literal "hello world" is COPIED into this array.
    Size allocated = 11 characters + 1 null terminator = 12 bytes. */&lt;/span&gt;

    &lt;span class="c1"&gt;// Since str1 owns writable memory, modifying characters is VALID.&lt;/span&gt;
    &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sc"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// changes 'h' to 'a'&lt;/span&gt;

    &lt;span class="c1"&gt;// Prints the modified string stored in stack memory&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"str1: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// Output -&amp;gt; str1: aello world&lt;/span&gt;

    &lt;span class="n"&gt;strcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"aman"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"str1: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// Output-&amp;gt; str1: aman&lt;/span&gt;

    &lt;span class="c1"&gt;// Array names are NOT pointers and are NOT modifiable lvalues.&lt;/span&gt;
    &lt;span class="c1"&gt;// str1++;              //  INVALID: cannot change base address of an array&lt;/span&gt;
    &lt;span class="c1"&gt;// str1 = "bye world";  //  INVALID: array cannot be reassigned&lt;/span&gt;

    &lt;span class="c1"&gt;// sizeof(str1) gives the TOTAL SIZE of the array in bytes&lt;/span&gt;
    &lt;span class="c1"&gt;// because str1 is a real array.&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size of str1: %zu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// 12 bytes&lt;/span&gt;

    &lt;span class="cm"&gt;/* =========================================================
       PART 2: char *str2 = "hello world";
       ========================================================= */&lt;/span&gt;

       &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"hello world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cm"&gt;/* str2 is a POINTER to char.
    The string literal "hello world" is stored in READ-ONLY memory (.rodata).
    str2 only stores the ADDRESS of the first character of the literal. */&lt;/span&gt;

    &lt;span class="cm"&gt;/* Attempting to modify a string literal is UNDEFINED BEHAVIOR.
    On most systems this causes a segmentation fault or crash.
    str2[0] = 'a';    //  SEGMENTATION FAULT

    CORRECT and SAFE declaration for string literals:
    const char *str2 = "hello world";

    Pointer arithmetic is allowed because str2 itself is modifiable.
    This makes str2 point to the second character of the string. */&lt;/span&gt;
    &lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Prints the string starting from the new pointer location&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"str2: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// Output: ello world&lt;/span&gt;

    &lt;span class="c1"&gt;// Reassigning the pointer is allowed.&lt;/span&gt;
    &lt;span class="c1"&gt;// Now str2 points to a DIFFERENT string literal.&lt;/span&gt;
    &lt;span class="n"&gt;str2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bye world"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Prints the new string literal&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"str2: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// Output: bye world&lt;/span&gt;

    &lt;span class="c1"&gt;// sizeof(str2) gives the size of the POINTER itself,&lt;/span&gt;
    &lt;span class="c1"&gt;// NOT the size of the string it points to.&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size of str2: %zu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;   &lt;span class="c1"&gt;// 8 bytes on 64-bit systems&lt;/span&gt;

    &lt;span class="c1"&gt;// All pointer types have the same size on a given architecture&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size of int pointer: %zu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="c1"&gt;// 8 bytes (64-bit)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>discuss</category>
      <category>c</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>MQTT Explained in Simple Terms: The Lightweight Protocol That Powers the Entire IoT World</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Tue, 17 Feb 2026 06:38:47 +0000</pubDate>
      <link>https://forem.com/amanprasad/mqtt-explained-in-simple-terms-the-lightweight-protocol-that-powers-the-entire-iot-world-142p</link>
      <guid>https://forem.com/amanprasad/mqtt-explained-in-simple-terms-the-lightweight-protocol-that-powers-the-entire-iot-world-142p</guid>
      <description>&lt;p&gt;&lt;strong&gt;MQTT (Message Queuing Telemetry Transport)&lt;/strong&gt; is a lightweight, standards-based messaging protocol designed for &lt;strong&gt;machine-to-machine (M2M)&lt;/strong&gt; and &lt;strong&gt;Internet of Things (IoT)&lt;/strong&gt; communication.&lt;/p&gt;

&lt;p&gt;It is optimized for low-bandwidth, high-latency networks and resource-constrained devices such as microcontrollers. Unlike HTTP, MQTT does &lt;strong&gt;not&lt;/strong&gt; use request–response. Instead, it uses a &lt;strong&gt;publish/subscribe&lt;/strong&gt; communication model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycf5212og889e25x662q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycf5212og889e25x662q.jpeg" alt="MQTT" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is the MQTT protocol important?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The MQTT protocol has become a standard for IoT data transmission because it delivers the following benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight and efficient:&lt;/strong&gt; MQTT implementation on the IoT device requires minimal resources, so it can even be used on small microcontrollers. For example, a minimal MQTT control message can be as little as two data bytes. MQTT message headers are also small so that you can optimize network bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable:&lt;/strong&gt; The protocol has built-in features to support communication with a large number of IoT devices. Hence, you can implement the MQTT protocol to connect with millions of these devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable:&lt;/strong&gt; Many IoT devices connect over unreliable cellular networks with low bandwidth and high latency. MQTT has built-in features that reduce the time the IoT device takes to reconnect with the cloud. It also defines three different quality-of-service levels to ensure reliability for IoT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure:&lt;/strong&gt; MQTT makes it easy for developers to encrypt messages and authenticate devices and users using modern security mechanisms, such as OAuth, TLS 1.3, customer-managed certificates, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Principle behind MQTT&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MQTT is built around the &lt;strong&gt;publish-subscribe communication model&lt;/strong&gt;, which is fundamentally different from the traditional client–server approach.&lt;/p&gt;

&lt;p&gt;In a typical client–server system, a client directly requests data from a server, and the server responds with the requested information. This creates a &lt;strong&gt;tight coupling&lt;/strong&gt; between both sides. The client must know where the server is and both must be available at the same time.&lt;/p&gt;

&lt;p&gt;But MQTT removes this direct dependency by introducing an &lt;strong&gt;intermediary called a broker&lt;/strong&gt;. Instead of sending messages directly to each other, devices communicate through the broker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A device that sends data is called a &lt;strong&gt;publisher&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A device that receives data is called a &lt;strong&gt;subscriber&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;broker&lt;/strong&gt; receives all published messages and delivers them to the appropriate subscribers based on topics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this design, publishers and subscribers remain completely independent of each other.&lt;/p&gt;

&lt;p&gt;This creates three beautiful kinds of freedom (called decoupling):&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Space Decoupling: “I don’t need to know where you live”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Publishers and subscribers do not need to know anything about each other’s network details. They don’t exchange IP addresses, port numbers, or device identities.&lt;/p&gt;

&lt;p&gt;Each device only knows the broker address and the topic it publishes to or subscribes to. This makes it easy to add, remove or replace devices without changing the rest of the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Time Decoupling: “I’ll leave a message for you”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In MQTT, publishers and subscribers do not need to be connected at the same time.&lt;/p&gt;

&lt;p&gt;A publisher can send data even when subscribers are offline, and subscribers can receive data later when they reconnect (depending on QoS and session settings). This is especially useful for IoT devices that frequently go into sleep mode or experience unstable connectivity.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Synchronization Decoupling: “No waiting in line”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Publishers and subscribers operate independently and do not block each other.&lt;/p&gt;

&lt;p&gt;A publisher can send messages without waiting for subscribers to process them and subscribers can receive messages whenever they are ready. This asynchronous behavior makes MQTT highly efficient and suitable for real-time systems with limited processing power.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MQTT components&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MQTT follows the publish/subscribe model by defining a small set of core components. The most important ones are &lt;strong&gt;clients&lt;/strong&gt;, the &lt;strong&gt;broker&lt;/strong&gt; and the &lt;strong&gt;connection&lt;/strong&gt; that links them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;MQTT client&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An MQTT client is &lt;strong&gt;any device or application&lt;/strong&gt; that communicates using the MQTT protocol. This can range from a cloud server or mobile app to a small microcontroller running an MQTT library.&lt;/p&gt;

&lt;p&gt;A client can play different roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When it sends data, it acts as a &lt;strong&gt;publisher&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;When it receives data, it acts as a &lt;strong&gt;subscriber&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A single client can do both at the same time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In simple terms, if a device connects to a broker and exchanges messages using MQTT, it is considered an MQTT client.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;MQTT broker&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The MQTT broker is the &lt;strong&gt;central communication hub&lt;/strong&gt; of the system. All MQTT clients connect to the broker and clients never communicate directly with each other.&lt;/p&gt;

&lt;p&gt;The broker’s main responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receiving messages from publishers&lt;/li&gt;
&lt;li&gt;Filtering messages based on topics&lt;/li&gt;
&lt;li&gt;Delivering messages to all subscribed clients&lt;/li&gt;
&lt;li&gt;Managing client connections and sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, the broker often handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client authentication and authorization&lt;/li&gt;
&lt;li&gt;Storing and delivering messages for disconnected clients&lt;/li&gt;
&lt;li&gt;Forwarding data to databases, analytics engines, or cloud services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this, the broker plays a critical role in ensuring reliability, scalability, and security.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;MQTT connection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Communication in MQTT starts when a client establishes a connection with the broker.&lt;/p&gt;

&lt;p&gt;The process works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client sends a &lt;code&gt;CONNECT&lt;/code&gt; message to the broker&lt;/li&gt;
&lt;li&gt;The broker responds with a &lt;code&gt;CONNACK&lt;/code&gt; message to confirm the connection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This communication happens over a &lt;strong&gt;persistent TCP/IP connection&lt;/strong&gt;, which remains open while data is exchanged. All MQTT communication flows through this connection.&lt;/p&gt;

&lt;p&gt;An important rule in MQTT is that &lt;strong&gt;clients only connect to the broker&lt;/strong&gt;, never directly to other clients. This design keeps the system loosely coupled and easy to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MQTT Topics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;topic&lt;/strong&gt; is a structured string that the MQTT broker uses to &lt;strong&gt;route messages&lt;/strong&gt; between clients. Instead of sending messages directly to a specific device, MQTT clients publish messages to a topic and the broker decides which subscribers should receive them.&lt;/p&gt;

&lt;p&gt;Topics are arranged in a &lt;strong&gt;hierarchical format&lt;/strong&gt;, similar to folders in a file system, with each level separated by a forward slash (&lt;code&gt;/&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example topic structure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ourhome/groundfloor/livingroom/light
ourhome/firstfloor/kitchen/temperature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each part of the topic adds context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ourhome&lt;/code&gt; → identifies the system&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;groundfloor&lt;/code&gt; / &lt;code&gt;firstfloor&lt;/code&gt; → identifies the location&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;livingroom&lt;/code&gt; / &lt;code&gt;kitchen&lt;/code&gt; → identifies the room&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;light&lt;/code&gt; / &lt;code&gt;temperature&lt;/code&gt; → identifies the device or data type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hierarchy makes it easy to organize data logically and scale the system as more devices are added.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MQTT Publish&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Publishing&lt;/strong&gt; is the process of sending data to the broker.&lt;/p&gt;

&lt;p&gt;When an MQTT client publishes a message, it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;topic&lt;/strong&gt; (where the message belongs)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;payload&lt;/strong&gt; (the actual data)&lt;/li&gt;
&lt;li&gt;Optional settings like QoS and retain flag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payload is sent as raw bytes, which means the client is free to choose any data format, such as plain text, JSON, or binary data (for example, raw sensor readings).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A smart lamp in a home automation system may publish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic: ourhome/groundfloor/livingroom/light
Payload: ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once published, the message is delivered to &lt;strong&gt;all clients subscribed to that topic&lt;/strong&gt;, based on broker rules.&lt;/p&gt;
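&lt;p&gt;As a concrete sketch (assuming the Mosquitto command-line tools and a broker running on &lt;code&gt;localhost&lt;/code&gt;), watching and publishing this topic looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1: watch every light in the house
# (-v prints the topic with the payload; + in -t matches any single topic level)
mosquitto_sub -h localhost -t "ourhome/+/+/light" -v

# Terminal 2: the smart lamp publishes its state
mosquitto_pub -h localhost -t "ourhome/groundfloor/livingroom/light" -m "ON"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;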

&lt;p&gt;Point to remember: a publisher does not know who receives the message; it only sends data to a topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;MQTT Subscribe&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Subscribing&lt;/strong&gt; is how an MQTT client expresses interest in receiving certain messages.&lt;/p&gt;

&lt;p&gt;To subscribe, a client sends a &lt;code&gt;SUBSCRIBE&lt;/code&gt; request to the broker that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One or more topic filters&lt;/li&gt;
&lt;li&gt;The desired QoS level for each topic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After subscribing, the broker automatically forwards any matching messages to the client whenever data is published on those topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A mobile app that monitors home lighting may subscribe to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ourhome/+/+/light
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;+&lt;/code&gt; is a single-level wildcard, so this filter matches the &lt;code&gt;light&lt;/code&gt; topic on any floor and in any room. Every time a light publishes its state (&lt;code&gt;ON&lt;/code&gt; or &lt;code&gt;OFF&lt;/code&gt;), the app receives the update and can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Display the current status&lt;/li&gt;
&lt;li&gt;Update a counter of active lights&lt;/li&gt;
&lt;li&gt;Trigger notifications or automation rules&lt;/li&gt;
&lt;/ul&gt;
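&lt;p&gt;The broker matches topic filters like this against concrete topics level by level. As an illustrative sketch (not real broker code, and simplified: it does not enforce that &lt;code&gt;#&lt;/code&gt; only appears as the last level), wildcard matching can be written like this, where &lt;code&gt;+&lt;/code&gt; matches exactly one level and &lt;code&gt;#&lt;/code&gt; matches all remaining levels:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;#include &amp;lt;stdbool.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

// Returns true if `topic` matches the subscription `filter`.
// '+' matches exactly one topic level, '#' matches everything that remains.
bool topic_matches(const char *filter, const char *topic) {
    while (*filter) {
        if (*filter == '#')
            return true;              // '#' swallows the rest of the topic
        if (*filter == '+') {
            while (*topic &amp;amp;&amp;amp; *topic != '/')
                topic++;              // skip one whole level
            filter++;
        } else {
            if (*filter != *topic)
                return false;
            filter++;
            topic++;
        }
    }
    return *topic == '\0';            // both must end at the same time
}

int main(void) {
    printf("%d\n", topic_matches("ourhome/+/+/light",
                                 "ourhome/groundfloor/livingroom/light"));   // 1
    printf("%d\n", topic_matches("ourhome/#",
                                 "ourhome/firstfloor/kitchen/temperature")); // 1
    printf("%d\n", topic_matches("ourhome/+/light",
                                 "ourhome/groundfloor/livingroom/light"));   // 0
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;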

&lt;h2&gt;
  
  
  &lt;strong&gt;Quality of Service (QoS)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In MQTT, &lt;strong&gt;Quality of Service (QoS)&lt;/strong&gt; defines &lt;strong&gt;how reliably a message is delivered&lt;/strong&gt; from a publisher to a subscriber.&lt;/p&gt;

&lt;p&gt;Because IoT networks can be slow, unstable, or intermittent, MQTT lets developers choose the &lt;strong&gt;right balance between reliability, speed, and bandwidth usage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MQTT supports &lt;strong&gt;three QoS levels&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;QoS 0 – &lt;em&gt;At most once&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;QoS 1 – &lt;em&gt;At least once&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;QoS 2 – &lt;em&gt;Exactly once&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each level provides a different delivery guarantee.&lt;/p&gt;

&lt;h3&gt;
  
  
  QoS 0
&lt;/h3&gt;

&lt;p&gt;QoS 0 delivers a message &lt;strong&gt;at most once&lt;/strong&gt; (fire and forget). The message is sent &lt;strong&gt;without any acknowledgment&lt;/strong&gt; from the receiver.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Publisher sends the message&lt;/li&gt;
&lt;li&gt;No confirmation is expected&lt;/li&gt;
&lt;li&gt;Message may be lost if the connection fails&lt;/li&gt;
&lt;li&gt;It offers the fastest delivery and the lowest bandwidth usage, with no retry mechanism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu4xhboppesy647ud7lv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu4xhboppesy647ud7lv.png" alt="QoS 0" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Temperature and humidity readings, or live sensor streams where occasional loss is acceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  QoS 1
&lt;/h3&gt;

&lt;p&gt;QoS 1 guarantees that a message is delivered &lt;strong&gt;at least once&lt;/strong&gt;. However, &lt;strong&gt;duplicate messages are possible&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Publisher sends message&lt;/li&gt;
&lt;li&gt;Subscriber (via broker) sends an acknowledgment (&lt;code&gt;PUBACK&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If no acknowledgment is received, the publisher retransmits&lt;/li&gt;
&lt;li&gt;This guarantees delivery but may produce duplicates, at a moderate bandwidth cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friz9qn50xb2f2u4ltsmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friz9qn50xb2f2u4ltsmw.png" alt="QoS 1" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Device control commands, status updates, alerts, and notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  QoS 2
&lt;/h3&gt;

&lt;p&gt;QoS 2 ensures that a message is delivered &lt;strong&gt;exactly once&lt;/strong&gt;, with &lt;strong&gt;no loss and no duplication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It uses a &lt;strong&gt;four-step handshake&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;PUBLISH&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PUBREC&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PUBREL&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PUBCOMP&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensures both sender and receiver agree that the message was delivered once and only once.&lt;/p&gt;

&lt;p&gt;It provides the highest reliability with the cost of increased overhead, higher latency, and greater memory usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqvt2dw46dak75vxhl0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqvt2dw46dak75vxhl0h.png" alt="QoS 2" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Financial transactions, billing data, and critical industrial control messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdmdezokvl8cxprtlx8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdmdezokvl8cxprtlx8f.png" alt="QoS" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Last Will and Testament (LWT)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In MQTT, &lt;strong&gt;Last Will and Testament (LWT)&lt;/strong&gt; is a mechanism that helps detect &lt;strong&gt;unexpected client failures&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It allows an MQTT client to tell the broker in advance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“If I disconnect suddenly or crash, publish this message on my behalf.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This feature is extremely useful in IoT systems where devices may lose power, crash, or disconnect due to unstable networks.&lt;/p&gt;

&lt;p&gt;Without LWT, other systems would have &lt;strong&gt;no way of knowing&lt;/strong&gt; whether a device went offline intentionally or failed unexpectedly.&lt;/p&gt;

&lt;p&gt;LWT solves this problem by automatically informing subscribers about the device’s failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common MQTT question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What Does a Will Message Contain?&lt;/strong&gt;&lt;br&gt;
A will message is just like a normal MQTT message and includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topic&lt;/strong&gt; - where the message will be published&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payload&lt;/strong&gt; - the message content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QoS level&lt;/strong&gt; - reliability of delivery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retain flag&lt;/strong&gt; (optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Port does MQTT Normally Use?&lt;/strong&gt;&lt;br&gt;
The standard port is 1883; MQTT over TLS typically uses port 8883.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can you use MQTT without a broker?&lt;/strong&gt;&lt;br&gt;
No. Standard MQTT is broker-based: every message passes through a broker that routes it between publishers and subscribers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Protocol does MQTT use?&lt;/strong&gt;&lt;br&gt;
The standard version uses TCP/IP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can multiple clients publish to the same topic?&lt;/strong&gt;&lt;br&gt;
Yes, multiple clients can publish messages to the same topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it possible to know the identity of the client that published a message?&lt;/strong&gt;&lt;br&gt;
No, not unless the client includes that information in the topic or payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens to messages that get published to topics that no one subscribes to?&lt;/strong&gt;&lt;br&gt;
They are discarded by the broker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can I find out what topics have been published?&lt;/strong&gt;&lt;br&gt;
You can’t do this easily: the broker doesn’t keep a list of published topics, because topics aren’t permanent objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I subscribe to a topic that no one is publishing to?&lt;/strong&gt;&lt;br&gt;
Yes, subscribing to a topic does not require an active publisher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are messages stored on the broker?&lt;/strong&gt;&lt;br&gt;
Yes, but only temporarily. Once messages are delivered to all subscribers, they are discarded.&lt;br&gt;
(See retained messages below.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are retained messages?&lt;/strong&gt;&lt;br&gt;
When you publish a message with the retain flag set, the broker stores only the &lt;strong&gt;last published message&lt;/strong&gt; for that topic.&lt;/p&gt;

&lt;p&gt;This retained message is immediately sent to new subscribers when they subscribe to the topic. MQTT retains only one message per topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image sources
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://youtu.be/LTAm1R_4YYE?si=LPI4z6zkOzG-Ynw-" rel="noopener noreferrer"&gt;All the images are from this youtube video&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>beginners</category>
      <category>learning</category>
      <category>iot</category>
    </item>
    <item>
      <title>Understanding Endianness: Little-Endian vs Big-Endian</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Thu, 12 Feb 2026 13:55:12 +0000</pubDate>
      <link>https://forem.com/amanprasad/understanding-endianness-little-endian-vs-big-endian-31e</link>
      <guid>https://forem.com/amanprasad/understanding-endianness-little-endian-vs-big-endian-31e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Endianness&lt;/strong&gt; refers to the order in which bytes are arranged and stored in computer memory.&lt;/p&gt;

&lt;p&gt;In simple terms, endianness decides &lt;strong&gt;which byte is stored at the lowest memory address&lt;/strong&gt;: the most significant byte (MSB) or the least significant byte (LSB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simple analogy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of endianness like reading direction.&lt;/p&gt;

&lt;p&gt;Some languages read &lt;strong&gt;left to right&lt;/strong&gt;, while others read &lt;strong&gt;right to left&lt;/strong&gt;. Both convey the same information but only if you know the rule beforehand.&lt;/p&gt;

&lt;p&gt;Similarly, computers need a defined rule to interpret multi-byte values correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Main Types of Endianness
&lt;/h2&gt;

&lt;p&gt;Most modern systems use one of two byte-ordering schemes.&lt;/p&gt;

&lt;p&gt;To illustrate, consider storing the 4-byte hexadecimal value: &lt;code&gt;0x12345678&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38xshgc9p3b5eivdwmjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38xshgc9p3b5eivdwmjh.png" alt="little endian and big endian with memory address diagram" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Big-Endian
&lt;/h3&gt;

&lt;p&gt;In Big-Endian systems, the &lt;strong&gt;most significant byte&lt;/strong&gt; (MSB) is stored at the lowest memory address. &lt;br&gt;
This format is often considered more &lt;em&gt;human-readable&lt;/em&gt; because it matches how we write numbers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Address&lt;/th&gt;
&lt;th&gt;Stored Byte&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Address +0&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0x12&lt;/code&gt; (MSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x34&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x56&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0x78&lt;/code&gt; (LSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It is used in networking protocols, older mainframes, and legacy systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Little-Endian (LE)
&lt;/h3&gt;

&lt;p&gt;In Little-Endian systems, the &lt;strong&gt;least significant byte&lt;/strong&gt; (LSB) is stored at the lowest memory address. This layout aligns well with how CPUs perform arithmetic internally.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Address&lt;/th&gt;
&lt;th&gt;Stored Byte&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Address +0&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0x78&lt;/code&gt; (LSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x56&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x34&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Address +3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0x12&lt;/code&gt; (MSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It is used by Intel (x86) and AMD processors, and by most modern desktops, laptops, and embedded MCUs.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Bi-Endianness&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many modern processors, like &lt;strong&gt;ARM&lt;/strong&gt;, are actually &lt;strong&gt;Bi-endian&lt;/strong&gt;. This means they can be configured to operate in either Big-Endian or Little-Endian mode depending on the operating system's requirements. &lt;br&gt;
In practice, most modern ARM systems run in &lt;strong&gt;little-endian mode&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Why Does It Matter?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In high-level languages like Python or Java, endianness is usually hidden.&lt;/p&gt;

&lt;p&gt;However, it becomes critical in the following cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Networking:&lt;/strong&gt; The internet uses big-endian. Without proper byte conversion, data sent from a little-endian system will be misinterpreted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary File Sharing:&lt;/strong&gt; Opening a binary file created on a big-endian system on a little-endian machine can corrupt values unless handled correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-Level Programming:&lt;/strong&gt; In C, assembly, or embedded systems, incorrect assumptions about byte order lead to subtle and dangerous bugs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Detecting Endianness Using C&lt;/strong&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Little Endian"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Big Endian"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step by step explanation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Variable &lt;code&gt;uint16_t x = 1&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;uint16_t&lt;/code&gt; is a &lt;strong&gt;16-bit (2-byte)&lt;/strong&gt; integer&lt;/li&gt;
&lt;li&gt;The numeric value is &lt;code&gt;1&lt;/code&gt;, but memory must store it using &lt;strong&gt;two bytes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Possible memory layouts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endianness&lt;/th&gt;
&lt;th&gt;Memory Bytes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Big-Endian&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00 01&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Little-Endian&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01 00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The pointer cast&lt;/strong&gt;  &lt;code&gt;(uint8_t*)&amp;amp;x&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This line does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;&amp;amp;x&lt;/code&gt; → gets the memory address of &lt;code&gt;x&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(uint8_t*)&lt;/code&gt; → treats that address as a pointer to a &lt;strong&gt;single byte&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Dereferencing reads the &lt;strong&gt;first byte in memory&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The check&lt;/strong&gt; &lt;code&gt;*(uint8_t*)&amp;amp;x == 1&lt;/code&gt; reads that first byte and compares it to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If the first byte is &lt;code&gt;1&lt;/code&gt;&lt;/strong&gt;: the least significant byte is stored first. The system is &lt;strong&gt;Little-Endian&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If the first byte is &lt;code&gt;0&lt;/code&gt;&lt;/strong&gt;: the most significant byte is stored first. The system is &lt;strong&gt;Big-Endian&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  A Subtle Advantage of Little-Endian
&lt;/h2&gt;

&lt;p&gt;One often-mentioned but rarely explained advantage of little-endian systems is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The same value can be read from memory at different widths using the same base address.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This works because, in little-endian memory, the &lt;strong&gt;least significant byte is stored at the lowest address&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mh"&gt;0x12345678&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Little-endian memory layout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Address&lt;/th&gt;
&lt;th&gt;Stored Byte&lt;/th&gt;
&lt;th&gt;Significance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;amp;x + 0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x78&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Least Significant Byte (LSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;amp;x + 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x56&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;amp;x + 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x34&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;amp;x + 3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x12&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Most Significant Byte (MSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;&amp;amp;x&lt;/code&gt; always points to the &lt;strong&gt;lowest memory address&lt;/strong&gt; (&lt;code&gt;&amp;amp;x + 0&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Now, reading from the same address:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Read size&lt;/th&gt;
&lt;th&gt;Expression&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8-bit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*(uint8_t*)&amp;amp;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x78&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16-bit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*(uint16_t*)&amp;amp;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x5678&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32-bit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*(uint32_t*)&amp;amp;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x12345678&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The starting address never changes; only the read size does. Increasing the width naturally reveals more significant bytes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Do We Care About the &lt;em&gt;Lower&lt;/em&gt; Bytes?
&lt;/h2&gt;

&lt;p&gt;This raises an important question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If big-endian systems expose the upper part of a number first, why do CPUs and programmers care so much about the lower bytes?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer lies in &lt;strong&gt;how arithmetic works&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In any positional number system, the &lt;strong&gt;least significant bits form the foundation of the value&lt;/strong&gt;, while higher bits only add scale.&lt;/p&gt;

&lt;p&gt;For example: &lt;code&gt;0x12345678&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower byte (&lt;code&gt;0x78&lt;/code&gt;) controls changes of ±1&lt;/li&gt;
&lt;li&gt;The upper byte (&lt;code&gt;0x12&lt;/code&gt;) only affects large magnitude&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All arithmetic operations (addition, subtraction, multiplication) &lt;strong&gt;start from the least significant byte&lt;/strong&gt; and propagate upward using carry.&lt;/p&gt;
&lt;h2&gt;
  
  
  Addition and Subtraction
&lt;/h2&gt;

&lt;p&gt;When adding multi-byte numbers, the CPU must process the &lt;strong&gt;least significant byte first&lt;/strong&gt; to determine whether a carry occurs.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Carry:           1 1
  Value A:   0x0 0 F F
+ Value B:   0x0 0 0 1
  --------------------
  Result:    0x0 1 0 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;LSB addition: &lt;code&gt;0xFF + 0x01 = 0x00&lt;/code&gt; (carry = 1)&lt;/li&gt;
&lt;li&gt;Next byte uses the carry to produce &lt;code&gt;0x01&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In little-endian systems, the LSB is fetched first, allowing computation to begin immediately while higher bytes are fetched in parallel. This was especially important on early 8-bit and 16-bit processors and strongly influenced CPU and compiler design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Big-Endian Still Exists
&lt;/h2&gt;

&lt;p&gt;If little-endian fits computation so well, why does big-endian persist?&lt;/p&gt;

&lt;p&gt;The reason is &lt;strong&gt;legacy and standardization&lt;/strong&gt;, not performance.&lt;/p&gt;

&lt;p&gt;Big-endian is used in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking (TCP/IP)&lt;/li&gt;
&lt;li&gt;File formats like &lt;strong&gt;JPEG&lt;/strong&gt; and &lt;strong&gt;PNG&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Older architectures and mainframes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once a format or protocol is defined, changing its byte order would break compatibility with existing data and software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern Reality
&lt;/h2&gt;

&lt;p&gt;On modern systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU arithmetic happens in registers (no endianness)&lt;/li&gt;
&lt;li&gt;Caches and pipelines hide memory order&lt;/li&gt;
&lt;li&gt;High-level languages abstract it away&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, &lt;strong&gt;endianness is mostly a data-format concern&lt;/strong&gt; rather than a CPU performance concern, except in networking, embedded systems, and low-level code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;p&gt;This discussion is inspired by community explanations on Stack Overflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/13926760/the-reason-behind-endianness" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/13926760/the-reason-behind-endianness&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>beginners</category>
      <category>learning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Flash Memory Explained: NAND vs NOR, Architecture, and Memory Organization</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Sat, 07 Feb 2026 13:42:21 +0000</pubDate>
      <link>https://forem.com/amanprasad/flash-memory-explained-nand-vs-nor-architecture-and-memory-organization-3abf</link>
      <guid>https://forem.com/amanprasad/flash-memory-explained-nand-vs-nor-architecture-and-memory-organization-3abf</guid>
      <description>&lt;p&gt;Flash memory is a type of non-volatile semiconductor memory that can be electrically erased and reprogrammed. It is based on floating-gate MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors) where data is stored by trapping electrons in a floating gate, altering the threshold voltage of the transistor to represent binary states. Unlike volatile memory like DRAM, flash retains data even when power is removed, making it ideal for applications requiring persistent storage such as SSDs, USB drives, memory cards and embedded systems. Flash memory evolved from EEPROM, but instead of erasing individual bytes, it erases data in larger blocks, which significantly improves speed, density, and cost efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
Types of Flash Memory: NAND and NOR

&lt;ul&gt;
&lt;li&gt;NOR Flash&lt;/li&gt;
&lt;li&gt;NAND Flash&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Comparison of NAND and NOR&lt;/li&gt;

&lt;li&gt;Memory Organization: Sector, Block, and Page&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Flash Memory: NAND and NOR
&lt;/h2&gt;

&lt;p&gt;The two primary types of flash memory are NAND and NOR, named after the way their memory cells are connected internally, which resembles NAND and NOR logic gates.&lt;/p&gt;

&lt;p&gt;Both use the same basic &lt;strong&gt;floating-gate cell&lt;/strong&gt; design, but they differ in &lt;strong&gt;architecture, access methods, performance, cost, and applications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0moxeizcpbv32fevgdq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0moxeizcpbv32fevgdq8.png" alt="NAND and NOR flash memory diagram with truth table" width="800" height="560"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image source: &lt;a href="https://nexusindustrialmemory.com/guides/what-is-nand-memory/" rel="noopener noreferrer"&gt;nexusindustrialmemory&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  NOR Flash
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0u7lpemf86z376f7as9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0u7lpemf86z376f7as9.png" alt="NOR Flash memory" width="655" height="1304"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image source: &lt;a href="https://www.embedded.com/flash-101-nand-flash-vs-nor-flash/" rel="noopener noreferrer"&gt;embedded.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;NOR flash&lt;/strong&gt;, memory cells are connected in &lt;strong&gt;parallel&lt;/strong&gt;, with the &lt;strong&gt;drain of each cell connected to a bit line&lt;/strong&gt; and the &lt;strong&gt;source connected to a common source line&lt;/strong&gt; (typically ground). This parallel connection resembles the structure of a &lt;strong&gt;NOR logic gate&lt;/strong&gt;, which is the origin of the name &lt;em&gt;NOR flash&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This architecture enables &lt;strong&gt;true random access at the byte level&lt;/strong&gt;, allowing the processor to directly read instructions from flash memory. As a result, code can be executed directly from NOR flash using &lt;strong&gt;Execute-In-Place (XIP)&lt;/strong&gt;, without first copying the code into RAM.&lt;/p&gt;

&lt;p&gt;NOR flash offers &lt;strong&gt;fast read access&lt;/strong&gt;, making it ideal for code storage. However, &lt;strong&gt;write and erase operations are slower&lt;/strong&gt; because erase operations occur at the &lt;strong&gt;sector level&lt;/strong&gt;, and the cell structure requires higher voltages and larger physical area. This leads to &lt;strong&gt;lower memory density&lt;/strong&gt; and a &lt;strong&gt;higher cost per bit&lt;/strong&gt; compared to NAND flash.&lt;/p&gt;

&lt;p&gt;Typical applications of NOR flash include &lt;strong&gt;firmware storage&lt;/strong&gt; in embedded systems such as &lt;strong&gt;bootloaders, BIOS, and microcontroller internal flash&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Cell&lt;/strong&gt;: The basic storage element implemented as a &lt;strong&gt;floating-gate MOSFET&lt;/strong&gt;. Data is stored by trapping or removing charge from the floating gate, representing logic &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word Line&lt;/strong&gt;: Horizontal lines (in black color) that connect to the control gates of the memory cells. These are used to select specific rows of cells for operations like read, program (write), or erase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit Line&lt;/strong&gt;: Vertical orange line at the top which is connected to the drain of the cells. This carries data in and out during read and write operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Line&lt;/strong&gt;: Vertical blue line at the top right. This is typically connected to ground or a reference voltage and is shared among cells.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Structural Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct electrical path per cell:&lt;/strong&gt; Each memory cell is connected directly between the &lt;strong&gt;bit line (drain)&lt;/strong&gt; and the &lt;strong&gt;source line (ground)&lt;/strong&gt;. This one-to-one connection allows the state of a single cell to be sensed without interference from neighboring cells.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent cell access:&lt;/strong&gt; Because cells are not connected in series, selecting a specific &lt;strong&gt;word line&lt;/strong&gt; activates only the targeted cell(s). This independence enables &lt;strong&gt;true random access&lt;/strong&gt; to individual bytes or words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Larger physical cell size:&lt;/strong&gt; Each cell requires its own drain contact, source connection, and routing lines. This increases the silicon area per bit, resulting in &lt;strong&gt;lower storage density&lt;/strong&gt; compared to NAND flash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High reliability for code storage:&lt;/strong&gt; The simple read path and minimal need for complex error correction make NOR flash highly reliable for instruction fetch and execution, which is critical for firmware and boot code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Implications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read Operation&lt;/strong&gt;: Read operations are &lt;strong&gt;fast and byte-addressable&lt;/strong&gt;. The processor can directly fetch instructions from NOR flash using &lt;strong&gt;Execute-In-Place (XIP)&lt;/strong&gt;, eliminating the need to copy code into RAM before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write Operation&lt;/strong&gt;: Programming is &lt;strong&gt;slower&lt;/strong&gt; because it involves injecting charge into the floating gate using precise voltage pulses. Writes typically occur at the &lt;strong&gt;page level&lt;/strong&gt;, even if only a small amount of data is modified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erase Operation:&lt;/strong&gt; Erase operations are performed at the &lt;strong&gt;sector level&lt;/strong&gt;, where a group of memory cells is cleared simultaneously by removing charge from their floating gates. This operation is relatively slow and requires higher voltages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  NAND Flash
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjn1r8eywkqf00pq48qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjn1r8eywkqf00pq48qy.png" alt="NAND flash memory" width="796" height="1302"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image source: &lt;a href="https://www.embedded.com/flash-101-nand-flash-vs-nor-flash/" rel="noopener noreferrer"&gt;embedded.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;NAND flash&lt;/strong&gt;, memory cells are connected &lt;strong&gt;in series&lt;/strong&gt;, forming a &lt;strong&gt;cell string&lt;/strong&gt; typically consisting of &lt;strong&gt;32 to 128 cells&lt;/strong&gt;. The &lt;strong&gt;drain of one cell is connected to the source of the next&lt;/strong&gt;, and the entire string is connected between a &lt;strong&gt;bit line&lt;/strong&gt; at the top and a &lt;strong&gt;common source line&lt;/strong&gt; at the bottom. This serial connection resembles the structure of a &lt;strong&gt;NAND logic gate&lt;/strong&gt;, which is the origin of the name NAND flash.&lt;/p&gt;

&lt;p&gt;This architecture significantly reduces the number of required contacts and routing lines per cell, enabling &lt;strong&gt;much higher storage density&lt;/strong&gt;, &lt;strong&gt;lower cost per bit&lt;/strong&gt; and &lt;strong&gt;larger memory capacities&lt;/strong&gt; than NOR flash. However, because cells are accessed through a series path, NAND flash does &lt;strong&gt;not support true random access&lt;/strong&gt;. Instead, data is accessed in &lt;strong&gt;pages&lt;/strong&gt;, and erase operations are performed in &lt;strong&gt;blocks&lt;/strong&gt;, making random reads slower but bulk data operations highly efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Cell&lt;/strong&gt;: The basic storage element implemented as a &lt;strong&gt;floating-gate MOSFET&lt;/strong&gt;. Data is stored by trapping or removing charge from the floating gate, representing logic 0 or 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word Line&lt;/strong&gt;: Horizontal lines (shown in black) that connect to the control gates of the memory cells. These are used to select specific rows of cells for operations like read, program (write), or erase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit Line&lt;/strong&gt;: Vertical orange line at the top, which is connected to the drain of the top select transistor. This carries data in and out during read and write operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Line&lt;/strong&gt;: Horizontal blue line at the bottom. This is typically connected to ground or a reference voltage and is shared among strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground Line Select Transistor (SL Select):&lt;/strong&gt; A switch transistor at the bottom of the string that connects or isolates the string from the source line.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Structural Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Series-connected cell strings:&lt;/strong&gt; Memory cells are connected in a chain, requiring current to pass through multiple cells to access a target cell.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-density layout:&lt;/strong&gt; Fewer contacts and shared routing allow more cells to fit in the same silicon area, resulting in &lt;strong&gt;significantly higher density&lt;/strong&gt; than NOR flash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared access path:&lt;/strong&gt; Cells do not have independent read paths. All unselected cells in a string must be biased ON to access the selected cell.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex peripheral circuitry:&lt;/strong&gt; NAND flash requires page buffers, sense amplifiers, and &lt;strong&gt;error correction codes (ECC)&lt;/strong&gt; to ensure data integrity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Implications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read Operation:&lt;/strong&gt; Reads are performed at the &lt;strong&gt;page level&lt;/strong&gt;. An entire page is transferred into an internal buffer and the requested data is then output. This makes random reads slower compared to NOR flash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write Operation:&lt;/strong&gt; Programming is &lt;strong&gt;fast and efficient&lt;/strong&gt;, occurring at the page level. NAND flash is well suited for frequent data writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erase Operation:&lt;/strong&gt; Erase operations are performed at the &lt;strong&gt;block level&lt;/strong&gt;, where a block consists of many pages. Block erase in NAND flash is faster and more energy-efficient &lt;strong&gt;per bit erased&lt;/strong&gt; compared to NOR flash.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  NAND Flash Cell Types
&lt;/h3&gt;

&lt;p&gt;NAND flash is further classified based on the number of bits stored per cell:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLC (Single-Level Cell):&lt;/strong&gt; Stores 1 bit per cell. Offers the highest speed, endurance (up to ~100,000 cycles), and reliability, but at higher cost and lower density.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLC (Multi-Level Cell):&lt;/strong&gt; Stores 2 bits per cell. Balances density and endurance (3,000–10,000 cycles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLC (Triple-Level Cell):&lt;/strong&gt; Stores 3 bits per cell. Higher density with reduced endurance (1,000–5,000 cycles); common in consumer SSDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QLC (Quad-Level Cell):&lt;/strong&gt; Stores 4 bits per cell. Very high density with lower endurance (100–1,000 cycles) and slower performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PLC (Penta-Level Cell):&lt;/strong&gt; Stores 5 bits per cell. Emerging technology focused on ultra-high density with increased reliability challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison of NAND and NOR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;NOR Flash&lt;/th&gt;
&lt;th&gt;NAND Flash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cell Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parallel connection (like NOR gate)&lt;/td&gt;
&lt;td&gt;Series connection (like NAND gate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Random byte-level access, supports XIP&lt;/td&gt;
&lt;td&gt;Sequential page/block access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Faster (e.g., 100-200 ns per byte)&lt;/td&gt;
&lt;td&gt;Slower for random reads, faster for sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write/Erase Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slower (sector erase in tens to hundreds of milliseconds, writes in milliseconds)&lt;/td&gt;
&lt;td&gt;Faster (erase in ms, write in µs per page)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Density/Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower density, higher cost per bit&lt;/td&gt;
&lt;td&gt;Higher density, lower cost per bit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Endurance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher than NAND (typically 10⁴–10⁵ erase cycles)&lt;/td&gt;
&lt;td&gt;Varies by type (SLC high, QLC low)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Capacities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to a few GB&lt;/td&gt;
&lt;td&gt;Up to TB-scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power Consumption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Higher for writes/erases&lt;/td&gt;
&lt;td&gt;Lower overall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code storage, firmware, embedded systems&lt;/td&gt;
&lt;td&gt;Data storage, SSDs, USB drives, memory cards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Memory Organization: Sector, Block, and Page
&lt;/h2&gt;

&lt;p&gt;Flash memory is &lt;strong&gt;not organized like RAM&lt;/strong&gt;. Instead of allowing free read, write, and erase operations on individual bytes, flash memory follows a &lt;strong&gt;strict hierarchical structure&lt;/strong&gt;. This structure exists because of the &lt;strong&gt;physical nature of flash memory cells&lt;/strong&gt; and how they are erased.&lt;/p&gt;

&lt;p&gt;Flash memory is organized into &lt;strong&gt;pages&lt;/strong&gt; (read/write units) and &lt;strong&gt;blocks or sectors&lt;/strong&gt; (erase units). This design directly impacts how data is stored, modified, and managed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Flash Memory Uses Pages and Blocks
&lt;/h2&gt;

&lt;p&gt;Flash memory cells store data using &lt;strong&gt;charge trapped in floating gates&lt;/strong&gt;. To erase data, a &lt;strong&gt;high voltage&lt;/strong&gt; must be applied to remove this charge.&lt;/p&gt;

&lt;p&gt;Because applying such high voltage to individual cells is impractical and unsafe, flash memory erases &lt;strong&gt;groups of cells together&lt;/strong&gt;. This leads to the following rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read operations&lt;/strong&gt; occur at the page level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write operations&lt;/strong&gt; occur at the page level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erase operations&lt;/strong&gt; occur at the block (or sector) level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This asymmetry is fundamental to all flash memory technologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hierarchical Structure of Flash Memory
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Page
&lt;/h3&gt;

&lt;p&gt;A page is the smallest unit used for reading or writing data in flash memory. It is a row of memory cells that share a common word line, which is a control signal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In NAND flash: Pages are usually 2KB to 16KB in size, with 4KB being common in modern SSDs. They include a spare area of 64-512 bytes for error correction codes (ECC), metadata, and bad block markers.&lt;/li&gt;
&lt;li&gt;In NOR flash: Pages are smaller, often 256-512 bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Writing to a page involves charging or discharging the floating gates in the cells, which takes microseconds. Pages cannot be overwritten directly; the block that contains the page must be erased first. In NAND, pages have a main data area and a spare area for extra information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block
&lt;/h3&gt;

&lt;p&gt;A block is a group of pages and the smallest unit that can be erased at once. It is a grid of strings (columns of connected cells) and pages (rows). Erasing uses high voltage to set all bits to 1 via Fowler-Nordheim tunneling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In NAND flash: 64-512 pages per block, totaling 128KB to 8MB (e.g., 4MB common).&lt;/li&gt;
&lt;li&gt;In NOR flash: Often called sectors, with erase units of 4KB-256KB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blocks wear out over erase cycles. In NAND-based storage devices, this is managed by a Flash Translation Layer (FTL), which performs wear leveling and garbage collection by relocating valid pages before erasing blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sector
&lt;/h3&gt;

&lt;p&gt;The term &lt;strong&gt;sector&lt;/strong&gt; is used differently depending on the type of flash memory and the context. In &lt;strong&gt;NOR flash&lt;/strong&gt;, a sector refers to the &lt;strong&gt;smallest erasable unit&lt;/strong&gt; of memory and is functionally equivalent to a block in NAND flash. NOR flash sectors typically range from &lt;strong&gt;4 KB to 256 KB&lt;/strong&gt; and contain multiple pages.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;NAND flash&lt;/strong&gt;, the term sector is not a formal physical unit. It is often used informally to describe a &lt;strong&gt;512-byte or 4 KB logical chunk of data&lt;/strong&gt;, a convention inherited from hard disk drives. These logical sectors map to portions of a page but do not represent erase units. In NAND flash, &lt;strong&gt;pages are the smallest read/write units&lt;/strong&gt;, and &lt;strong&gt;blocks are the smallest erase units&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>learning</category>
      <category>beginners</category>
      <category>science</category>
    </item>
    <item>
      <title>Discovering Hall Sensors: The Hidden Tech in Laptops and TWS Earbuds</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Fri, 30 Jan 2026 08:35:17 +0000</pubDate>
      <link>https://forem.com/amanprasad/discovering-hall-sensors-the-hidden-tech-in-laptops-and-tws-earbuds-np7</link>
      <guid>https://forem.com/amanprasad/discovering-hall-sensors-the-hidden-tech-in-laptops-and-tws-earbuds-np7</guid>
      <description>&lt;p&gt;Have you ever wondered why your laptop screen turns off when you close the lid? Or how your True Wireless Stereo (TWS) earbuds, know when the charging case is open or closed? It all comes down to a clever little component called the Hall sensor. In this short post, I'll share a fun experiment I did that uncovers this tech in everyday devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Laptop Trick
&lt;/h2&gt;

&lt;p&gt;It started with a simple curiosity. I placed a magnet near the edges of my laptop base and the display turned off! Why? Laptops use Hall sensors (named after physicist Edwin Hall) to detect magnetic fields. These sensors are typically embedded near the hinge or edges. When you close the lid, a small magnet in the display aligns with the sensor in the base, signaling the system to sleep or turn off the screen. By mimicking that with an external magnet, you can "trick" the laptop into thinking the lid is closed.&lt;/p&gt;

&lt;p&gt;I even made a quick video demonstrating this.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ecl6rif44wd1ca5anq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ecl6rif44wd1ca5anq.gif" alt="gif demonstrate how the laptop screen is off when it comes with the magnet" width="240" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/file/d/1FF5mvIboPbtt53mJtUKB2PfJWi54nwyD/view?usp=sharing" rel="noopener noreferrer"&gt;Watch the same demo in higher quality&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending to TWS Earbuds
&lt;/h2&gt;

&lt;p&gt;Inspired, I dug into my TWS charging case. These cases also detect lid status to pause charging, play audio, or enter sleep mode. Sure enough, after some disassembly, I spotted a Hall sensor inside! It's positioned to react to a magnet in the lid, just like in laptops.&lt;br&gt;
Here's a photo I took of the Hall sensor in the TWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwcfydrr19x28ijize2m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwcfydrr19x28ijize2m.jpg" alt="TWS circuit" width="800" height="1062"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol0giplbu9uv0fb2s3vt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol0giplbu9uv0fb2s3vt.jpg" alt="TWS circuit" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwuwmd7mt11a3jh6hvmq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwuwmd7mt11a3jh6hvmq.gif" alt="hall sensor in TWS" width="426" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/file/d/1omv9YTGlJNzZh3EbewhAbpz1DXRb0cnx/view?usp=sharing" rel="noopener noreferrer"&gt;Watch the same demo in higher quality&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sensor is tiny but powerful, using the Hall effect to measure magnetic field changes and convert them into electrical signals that the device interprets.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Hall Sensors Work
&lt;/h2&gt;

&lt;p&gt;A Hall sensor is a semiconductor that generates a voltage difference when exposed to a magnetic field perpendicular to the current flow.&lt;br&gt;
In devices like laptops and TWS, this voltage triggers actions like screen off/on or power management.&lt;br&gt;
Pro tip: If you're into hardware hacking, tools like a multimeter or Arduino can help you experiment with these sensors safely.&lt;/p&gt;

&lt;p&gt;This discovery shows how universal tech like Hall sensors powers seamless user experiences across gadgets. Next time you close your laptop or pop open your earbuds case, give a nod to Edwin Hall!&lt;/p&gt;

&lt;p&gt;If you've tried similar experiments or have tips on Hall sensor projects, drop a comment below. Thanks for reading! 🚀&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>tutorial</category>
      <category>science</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Memory Layout in Embedded Systems: How C Code Really Ends Up in FLASH and RAM</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Thu, 29 Jan 2026 05:34:20 +0000</pubDate>
      <link>https://forem.com/amanprasad/memory-layout-in-embedded-systems-how-c-code-really-ends-up-in-flash-and-ram-34c0</link>
      <guid>https://forem.com/amanprasad/memory-layout-in-embedded-systems-how-c-code-really-ends-up-in-flash-and-ram-34c0</guid>
      <description>&lt;p&gt;The CPU does not understand variables, types, or sections. It only executes raw commands to "read address X" or "write address Y." It only understands &lt;strong&gt;memory addresses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you declare a variable, you are effectively requesting storage. The &lt;strong&gt;Compiler&lt;/strong&gt; assigns it to a logical section (like &lt;code&gt;.data&lt;/code&gt; or &lt;code&gt;.bss&lt;/code&gt;), and the &lt;strong&gt;Linker&lt;/strong&gt; calculates its final physical address based on the rules defined in your &lt;strong&gt;Linker Script&lt;/strong&gt;.&lt;br&gt;
If you don't understand this mapping, you are blind to the root causes of memory corruption and performance bottlenecks. In embedded systems, correct logic placed in the wrong memory is still a broken system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;From C Code to Binary: Who Decides Memory Placement&lt;/li&gt;
&lt;li&gt;FLASH Memory Layout (Non-Volatile Sections)&lt;/li&gt;
&lt;li&gt;RAM Memory Layout (Volatile Sections)&lt;/li&gt;
&lt;li&gt;Startup Code: The Invisible Hand Before main()&lt;/li&gt;
&lt;li&gt;The Truth Table: Where Does It Go?&lt;/li&gt;
&lt;li&gt;Verifying Memory Placement&lt;/li&gt;
&lt;li&gt;Final Rules to Remember&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  From C Code to Binary: Who Decides Memory Placement
&lt;/h2&gt;

&lt;p&gt;Your C code doesn't just become a binary. It passes through a four-stage transformation. This transformation happens entirely at build time, long before the binary is flashed or executed on the CPU. Understanding this pipeline reveals that C syntax defines &lt;strong&gt;logic&lt;/strong&gt;, while the &lt;strong&gt;Linker Script&lt;/strong&gt; defines &lt;strong&gt;location&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Preprocessor:&lt;/strong&gt; The often-forgotten first step. It handles &lt;code&gt;#include&lt;/code&gt; files and expands &lt;code&gt;#define&lt;/code&gt; macros. It doesn't care about memory or logic; it simply performs text manipulation to prepare a pure C file for the compiler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Compiler:&lt;/strong&gt; The Compiler translates C logic into &lt;strong&gt;Assembly instructions&lt;/strong&gt;. At this stage, the tool works with placeholders (logical categories like &lt;code&gt;.data&lt;/code&gt; or &lt;code&gt;.bss&lt;/code&gt;). It does not decide physical memory locations; it knows &lt;em&gt;what&lt;/em&gt; each symbol is, but not &lt;em&gt;where&lt;/em&gt; it will live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Assembler:&lt;/strong&gt; The Assembler converts those assembly instructions into &lt;strong&gt;Machine Code&lt;/strong&gt;. It produces &lt;strong&gt;relocatable object files&lt;/strong&gt;. These files contain the binary logic, but the addresses are still  &lt;strong&gt;relocatable&lt;/strong&gt;. They are not yet tied to a physical spot in your RAM or FLASH.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Linker:&lt;/strong&gt; The Linker is the architect. It takes all the relocatable object files and uses the &lt;strong&gt;Linker Script (.ld)&lt;/strong&gt; to assign every symbol a fixed, physical address in &lt;strong&gt;FLASH&lt;/strong&gt; or &lt;strong&gt;RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Bottom Line:&lt;br&gt;
You write &lt;code&gt;int x = 10;&lt;/code&gt;, but the linker decides whether that &lt;code&gt;10&lt;/code&gt; lives at address &lt;code&gt;0x20000004&lt;/code&gt; (RAM) or causes a collision. Memory placement is entirely controlled by the linker script.&lt;/p&gt;
&lt;/blockquote&gt;
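&lt;p&gt;Those rules live in the linker script. A minimal, illustrative fragment is shown below; region names, origins, and lengths vary by vendor and part:&lt;/p&gt;

```ld
/* Illustrative only: names and addresses depend on your MCU and toolchain. */
MEMORY
{
  FLASH (rx)  : ORIGIN = 0x00000000, LENGTH = 256K
  RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}

SECTIONS
{
  .isr_vector : { KEEP(*(.isr_vector)) } > FLASH
  .text       : { *(.text*) }   > FLASH
  .rodata     : { *(.rodata*) } > FLASH
  .data       : { *(.data*) }   > RAM AT > FLASH  /* VMA in RAM, LMA in FLASH */
  .bss        : { *(.bss*) }    > RAM
}
```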
&lt;h2&gt;
  
  
  FLASH Memory Layout (Non-Volatile Sections)
&lt;/h2&gt;

&lt;p&gt;Flash is the permanent home for everything your program knows but does not need to change. Its contents survive resets and power loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky8tcewpluhjnksu8s93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky8tcewpluhjnksu8s93.png" alt="FLASH Memory Layout" width="800" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.isr_vector&lt;/code&gt;  (The Map)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Located at the very start of FLASH (typically &lt;code&gt;0x00000000&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;It contains the initial stack pointer and the addresses of the Reset Handler and all Interrupt Service Routines. On reset, the CPU fetches this table first to know how to start execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.text&lt;/code&gt;  (The Instructions)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Contains the compiled machine instructions for the application, libraries, and ISRs. The CPU executes this code directly from FLASH using Execute-In-Place (XIP).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.rodata&lt;/code&gt;  (The Constants)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stores read-only data such as &lt;code&gt;const&lt;/code&gt; global variables, lookup tables, and string literals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;const&lt;/code&gt; Saves RAM:&lt;/strong&gt;&lt;br&gt;
If you write &lt;code&gt;const int table[] = {1, 2, 3};&lt;/code&gt;, the array lives &lt;strong&gt;only&lt;/strong&gt; in Flash. If you forget &lt;code&gt;const&lt;/code&gt;, the array is placed in &lt;code&gt;.data&lt;/code&gt; and copied into RAM at startup (so you can edit it), wasting precious SRAM on data that never changes. &lt;strong&gt;Always use &lt;code&gt;const&lt;/code&gt; for lookup tables.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The String Literal Trap&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;const char *ptr = "Hello";&lt;/code&gt; → The string "Hello" is stored in FLASH (&lt;code&gt;.rodata&lt;/code&gt;) but the pointer &lt;code&gt;ptr&lt;/code&gt; lives in RAM. Safe and RAM-efficient&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;char arr[] = "Hello";&lt;/code&gt; → The string "Hello" is stored in Flash &lt;em&gt;and&lt;/em&gt; copied to &lt;strong&gt;RAM&lt;/strong&gt; at startup. (Costs extra RAM,  allows modification if modification is necessary).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; If you remove &lt;code&gt;const&lt;/code&gt; from the pointer (&lt;code&gt;char *ptr = "Hello";&lt;/code&gt;), the string &lt;em&gt;still&lt;/em&gt; lives in Flash (&lt;code&gt;.rodata&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With &lt;code&gt;const&lt;/code&gt;:&lt;/strong&gt; The compiler gives you an error if you try to write to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without &lt;code&gt;const&lt;/code&gt;:&lt;/strong&gt; The compiler allows the write because the type system no longer enforces read-only access. The underlying memory is still read-only, so when the CPU tries to write to that Flash address, the system triggers a &lt;strong&gt;HARD FAULT&lt;/strong&gt; and crashes.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Rule: Removing const does not move the string to RAM. It only removes protection and makes undefined behavior possible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  The Hidden Data Sections in FLASH: &lt;code&gt;.data&lt;/code&gt; and &lt;code&gt;.bss&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Although &lt;code&gt;.data&lt;/code&gt; and &lt;code&gt;.bss&lt;/code&gt; are runtime RAM sections, FLASH plays a critical role in their initialization. They represent the bridge between storage (Flash) and execution (RAM).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.data&lt;/code&gt;  Initialized Global variables (LMA vs VMA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a global initialized variable such as &lt;code&gt;int score = 100;&lt;/code&gt;. This variable &lt;em&gt;must&lt;/em&gt; live in RAM so you can change it. But RAM is wiped at power loss. So where does the &lt;code&gt;100&lt;/code&gt; come from?&lt;/p&gt;

&lt;p&gt;This section lives a double life.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In Flash (LMA - Load Memory Address):&lt;/strong&gt; The initial value (&lt;code&gt;100&lt;/code&gt;) is stored here to survive power loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In RAM (VMA - Virtual Memory Address):&lt;/strong&gt; The startup code reserves space for the variable here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Mechanism:&lt;/strong&gt; Before &lt;code&gt;main()&lt;/code&gt; runs, the startup code copies the values from Flash (LMA) to RAM (VMA).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.bss&lt;/code&gt; — Zero-Initialized Globals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.bss&lt;/code&gt; section contains global and static variables that are uninitialized or explicitly set to zero &lt;br&gt;
(e.g., &lt;code&gt;int counter;&lt;/code&gt;, &lt;code&gt;static int flag;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;No space is reserved for these variables in FLASH; only RAM is allocated.&lt;br&gt;
At startup, the runtime clears the entire &lt;code&gt;.bss&lt;/code&gt; region to zero before &lt;code&gt;main()&lt;/code&gt; executes.&lt;br&gt;
This avoids wasting FLASH space storing zeros, so &lt;code&gt;.bss&lt;/code&gt; consumes &lt;strong&gt;RAM only&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  RAM Memory Layout (Volatile Sections)
&lt;/h2&gt;

&lt;p&gt;RAM is the system’s &lt;strong&gt;working memory&lt;/strong&gt;. It holds all writable runtime state and is rebuilt on every reset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6ubl9acng5qbn9pib6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6ubl9acng5qbn9pib6t.png" alt="RAM Memory Layout" width="800" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.data&lt;/code&gt;  (Active Variables)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Contains initialized global and static variables copied from FLASH during startup. These variables are freely read and modified during execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.bss&lt;/code&gt;  (Zeroed Variables)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Holds global and static variables without explicit initial values. This entire region is cleared to zero at startup for predictable behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heap  (Dynamic Memory)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts after &lt;code&gt;.bss&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Grows upward&lt;/li&gt;
&lt;li&gt;Used by &lt;code&gt;malloc()&lt;/code&gt; / &lt;code&gt;free()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fragmentation-prone&lt;/li&gt;
&lt;li&gt;No bounds checking&lt;/li&gt;
&lt;li&gt;In embedded systems, uncontrolled heap usage leads to &lt;strong&gt;Fragmentation&lt;/strong&gt;. Many safety-critical systems restrict or forbid heap usage entirely to prevent instability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stack  (Execution Context)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts at the top of RAM&lt;/li&gt;
&lt;li&gt;Grows &lt;strong&gt;downward&lt;/strong&gt; from the end of RAM.&lt;/li&gt;
&lt;li&gt;Stores function call frames, local variables, return addresses, and interrupt context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the Stack grows down and the Heap grows up, they are on a collision course. If the Stack grows too deep (for example, through deep recursion), it will silently overwrite the Heap or &lt;code&gt;.bss&lt;/code&gt; variables. This is the #1 cause of "ghost bugs."&lt;/p&gt;
&lt;h2&gt;
  
  
  Startup Code: The Invisible Hand Before &lt;code&gt;main()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In a standard C course, you are taught that "execution begins at &lt;code&gt;main()&lt;/code&gt;." &lt;strong&gt;On a microcontroller, this is a lie.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Execution actually begins at the &lt;strong&gt;Reset Handler&lt;/strong&gt;, whose address the CPU fetches from the vector table.&lt;/p&gt;

&lt;p&gt;Before user code runs, startup code prepares the execution environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack Pointer Init:&lt;/strong&gt; Loads the Main Stack Pointer (MSP) from the vector table. Without this, functions cannot be called.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.data&lt;/code&gt; Copy:&lt;/strong&gt; Copies initial values from Flash to RAM. If this fails, variables start with garbage values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.bss&lt;/code&gt; Zeroing:&lt;/strong&gt; The entire &lt;code&gt;.bss&lt;/code&gt; region is cleared to zero in RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Initialization:&lt;/strong&gt; Clock and low-level hardware configuration is performed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jump to &lt;code&gt;main()&lt;/code&gt;:&lt;/strong&gt; Only after memory is prepared does execution enter the application.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any of these steps fails, variables contain garbage, the stack corrupts memory, and failures appear unrelated to their real cause.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Truth Table: Where Does It Go?
&lt;/h2&gt;

&lt;p&gt;Here is a quick reference guide to predict where your variables will land.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Variable Declaration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Segment&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int x;&lt;/code&gt; (Global)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.bss&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No initial value. Zeroed by startup code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int x = 10;&lt;/code&gt; (Global)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Needs a non-zero initial value. Copied from Flash.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;const int x = 10;&lt;/code&gt; (Global)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.rodata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read-only. Stays in Flash.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;static int x = 5;&lt;/code&gt; (Local)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;static&lt;/code&gt; means "persist forever." Cannot live on Stack.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int x = 5;&lt;/code&gt; (Local)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Temporary. Exists only while function runs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;char *s = "Text";&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;.rodata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;String is in Flash; Pointer is in RAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;char s[] = "Text";&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Array is on Stack; String is copied into it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;malloc(10)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Heap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requested manually by programmer.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Verifying Memory Placement
&lt;/h2&gt;

&lt;p&gt;Understanding memory layout is meaningless unless it can be &lt;strong&gt;verified&lt;/strong&gt;. Embedded systems do not tolerate assumptions. Use these tools to turn theory into engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Use this minimal snippet to force variables into every section of memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;var_bss&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                    &lt;span class="c1"&gt;// Uninitialized -&amp;gt; .bss&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;var_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// Initialized   -&amp;gt; .data&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;var_rodata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// Read-only     -&amp;gt; .rodata (Flash)&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;memory_map_test&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;var_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// Local         -&amp;gt; Stack&lt;/span&gt;
    &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;var_static&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Static Local  -&amp;gt; .data&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;var_heap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Dynamic       -&amp;gt; Heap&lt;/span&gt;

    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Code (.text):   %p&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_map_test&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var_heap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;memory_map_test&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; High-Level Footprint (&lt;code&gt;size&lt;/code&gt;)&lt;br&gt;
Run &lt;code&gt;size &amp;lt;filename.exe&amp;gt;&lt;/code&gt; to see the total consumption per section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt;     &lt;span class="n"&gt;bss&lt;/span&gt;     &lt;span class="n"&gt;dec&lt;/span&gt;     &lt;span class="n"&gt;hex&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;
&lt;span class="mi"&gt;14696&lt;/span&gt;    &lt;span class="mi"&gt;1560&lt;/span&gt;     &lt;span class="mi"&gt;116&lt;/span&gt;   &lt;span class="mi"&gt;16372&lt;/span&gt;    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;ff4&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Forensic Inspection (&lt;code&gt;nm&lt;/code&gt;)&lt;br&gt;
Use &lt;code&gt;nm&lt;/code&gt; to prove exactly which section each variable occupies by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nm test.exe | &lt;span class="nb"&gt;grep &lt;/span&gt;var_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;nm&lt;/span&gt;  &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="n"&gt;var_&lt;/span&gt;
&lt;span class="mo"&gt;00407070&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="n"&gt;_var_bss&lt;/span&gt;
&lt;span class="mo"&gt;00404004&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="n"&gt;_var_data&lt;/span&gt;
&lt;span class="mo"&gt;00405064&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="n"&gt;_var_rodata&lt;/span&gt;
&lt;span class="mo"&gt;0040400&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;_var_static&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2277&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T&lt;/strong&gt; = Text (Flash)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R&lt;/strong&gt; = Read-only (Flash)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D&lt;/strong&gt; = Data (RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; = BSS (RAM)&lt;/li&gt;
&lt;li&gt;Lowercase letters (such as the &lt;code&gt;d&lt;/code&gt; beside &lt;code&gt;_var_static&lt;/code&gt;) mark &lt;em&gt;local&lt;/em&gt; (static) symbols in the same section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; The Ground Truth (Map File)&lt;/p&gt;

&lt;p&gt;Enable the linker map file (&lt;code&gt;-Wl,-Map=output.map&lt;/code&gt;) in your IDE. This is the final document showing every symbol and its physical address. Use it to verify that your symbols are not colliding and are placed within the correct memory boundaries defined in your &lt;code&gt;.ld&lt;/code&gt; script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Rules to Remember
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The CPU only understands addresses&lt;/li&gt;
&lt;li&gt;The linker decides memory placement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.data&lt;/code&gt; costs FLASH + RAM&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.bss&lt;/code&gt; costs RAM only&lt;/li&gt;
&lt;li&gt;Stack overflows are silent&lt;/li&gt;
&lt;li&gt;Always verify memory with tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You manage the memory, or the memory manages you. By understanding the pipeline from the Compiler to the Linker, and verifying your layout with tools, you transform from a C programmer into an Embedded Engineer.&lt;/p&gt;

</description>
      <category>c</category>
      <category>memory</category>
      <category>learning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Understanding the ABI by Observation</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Fri, 23 Jan 2026 04:29:31 +0000</pubDate>
      <link>https://forem.com/amanprasad/understanding-the-abi-by-observation-5865</link>
      <guid>https://forem.com/amanprasad/understanding-the-abi-by-observation-5865</guid>
      <description>&lt;h3&gt;
  
  
  📌Table of Contents
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What Exactly Is an ABI?&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The ABI Contract: What It Defines&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A. Calling Convention&lt;/li&gt;
&lt;li&gt;B. Data Layout &amp;amp; Alignment&lt;/li&gt;
&lt;li&gt;C. Stack Frame&lt;/li&gt;
&lt;li&gt;D. Name Mangling (C++)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Target Context: ARM Cortex-M (AAPCS)&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Practical Exploration: ABI in Action with C Functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Function with 2 Arguments (&lt;code&gt;add2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Function with 4 Arguments (&lt;code&gt;add4&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Function with 5 Arguments (&lt;code&gt;add5&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Verification: Which Way Does the Stack Grow?&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Data Layout &amp;amp; Alignment: The Offset Proof&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;ELF Symbols and Function Size (&lt;code&gt;nm -S&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Name Mangling: C vs C++&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Key Takeaways&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Exactly Is an ABI?
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Application Binary Interface (ABI)&lt;/strong&gt; is a low-level contract that defines how &lt;em&gt;compiled binaries&lt;/em&gt; interact.&lt;/p&gt;

&lt;p&gt;It ensures that when &lt;strong&gt;Function A&lt;/strong&gt; calls &lt;strong&gt;Function B&lt;/strong&gt;, both sides agree on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where arguments are located&lt;/li&gt;
&lt;li&gt;where return values appear&lt;/li&gt;
&lt;li&gt;how control returns to the caller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This remains true even if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the code was compiled with different compilers&lt;/li&gt;
&lt;li&gt;parts are written in different languages (C, C++, Assembly)&lt;/li&gt;
&lt;li&gt;libraries are precompiled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without an ABI, &lt;code&gt;Module_A.o&lt;/code&gt; might pass an argument in a register while &lt;code&gt;Module_B.o&lt;/code&gt; expects it on the stack. The result is not a compiler error; it is a silent runtime failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ABI Contract: What It Defines
&lt;/h2&gt;

&lt;p&gt;An ABI specifies rules that compiled code must obey so that independently compiled binaries can interoperate correctly at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Calling Convention
&lt;/h3&gt;

&lt;p&gt;The calling convention defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Argument Passing&lt;/strong&gt;
Which arguments go in registers, which go on the stack, and in what order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Values&lt;/strong&gt;
Which register holds the result (e.g., &lt;code&gt;r0&lt;/code&gt; on ARM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register Preservation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caller-saved (volatile)&lt;/strong&gt;: the caller must save them if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callee-saved (non-volatile)&lt;/strong&gt;: the callee must preserve and restore them.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Under AAPCS, registers &lt;code&gt;r0–r3&lt;/code&gt; and &lt;code&gt;r12&lt;/code&gt; are caller-saved, while &lt;code&gt;r4–r11&lt;/code&gt; are callee-saved.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  B. Data Layout and Alignment
&lt;/h3&gt;

&lt;p&gt;The ABI defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type Size&lt;/strong&gt;
For example, &lt;code&gt;int&lt;/code&gt; is 32-bit on ARM EABI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment Rules&lt;/strong&gt;
32-bit data must be aligned to 4-byte boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure Padding&lt;/strong&gt;
Compilers insert padding bytes to preserve alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why &lt;code&gt;sizeof(struct)&lt;/code&gt; is often larger than the sum of its members.&lt;/p&gt;

&lt;h3&gt;
  
  
  C. Stack Frame
&lt;/h3&gt;

&lt;p&gt;The ABI governs stack behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stack growth direction&lt;/li&gt;
&lt;li&gt;Where the return address lives&lt;/li&gt;
&lt;li&gt;How local variables are addressed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ABI specifies stack alignment and what must be preserved at function boundaries, while the compiler decides how to implement the prologue and epilogue.&lt;/p&gt;

&lt;h3&gt;
  
  
  D. Name Mangling (C++)
&lt;/h3&gt;

&lt;p&gt;C++ supports function overloading, so function names must encode type information.&lt;/p&gt;

&lt;p&gt;The ABI standardizes this encoding so binaries can link correctly across compilers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target Context: ARM Cortex-M (AAPCS)
&lt;/h2&gt;

&lt;p&gt;On STM32 and other Cortex-M systems, the ABI is &lt;strong&gt;AAPCS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Key rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First 4 integer arguments → &lt;code&gt;r0–r3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;5th and subsequent arguments → stack&lt;/li&gt;
&lt;li&gt;Return value → &lt;code&gt;r0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Return address → &lt;code&gt;lr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything below is verified against this ABI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Exploration: ABI in Action with C Functions
&lt;/h2&gt;

&lt;p&gt;We examine unoptimized (&lt;code&gt;-O0&lt;/code&gt;) and optimized (&lt;code&gt;-O2&lt;/code&gt;) output to separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ABI rules&lt;/strong&gt; from &lt;strong&gt;compiler implementation details&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Function with 2 Arguments (&lt;code&gt;add2&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;At optimization &lt;code&gt;-O0&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle96kpecvrwwt7npgi7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle96kpecvrwwt7npgi7y.png" alt="add 2 with optimization disabled" width="788" height="625"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;-O0&lt;/code&gt;, the assembly contains stack setup, spills, and reloads.&lt;/p&gt;

&lt;p&gt;This noise exists for debugging — not because the ABI requires it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At optimization &lt;code&gt;-O2&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farhexh0kgfkmrs7c54uf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farhexh0kgfkmrs7c54uf.png" alt="add 2 with optimization -O2" width="644" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;-O2&lt;/code&gt;, only the ABI-mandated behavior remains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arguments arrive in &lt;code&gt;r0&lt;/code&gt;, &lt;code&gt;r1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Result placed in &lt;code&gt;r0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Return via &lt;code&gt;bx lr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything except argument location, return value, and return mechanism is compiler detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Function with 4 Arguments (&lt;code&gt;add4&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;with optimization &lt;code&gt;-O0&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59uv23i2kc8l78y1gh38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59uv23i2kc8l78y1gh38.png" alt="add 4 with optimization disabled" width="800" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;with optimization &lt;code&gt;-O2&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feshqe1knjlega02q7i4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feshqe1knjlega02q7i4d.png" alt="add 4 with optimization -O2" width="729" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ABI Guarantees on Entry
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;r0&lt;/code&gt; → &lt;code&gt;a&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;r1&lt;/code&gt; → &lt;code&gt;b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;r2&lt;/code&gt; → &lt;code&gt;c&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;r3&lt;/code&gt; → &lt;code&gt;d&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This function uses &lt;strong&gt;all available argument registers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No stack access is required to receive arguments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For up to four arguments, the ARM ABI passes all parameters in registers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Function with 5 Arguments (&lt;code&gt;add5&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function crosses an &lt;strong&gt;ABI boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The ABI Rule
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;r0–r3&lt;/code&gt; → first four arguments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5th argument → stack&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the first time the stack becomes mandatory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;with optimization &lt;code&gt;-O0&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvhxsodw41o90nazw04w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvhxsodw41o90nazw04w.png" alt="add 5 with optimization disabled" width="800" height="843"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;with optimization &lt;code&gt;-O2&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzwz3sa91j28pyq26v35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzwz3sa91j28pyq26v35.png" alt="add 5 with optimization -O2" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At &lt;code&gt;-O2&lt;/code&gt;, the instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nasm"&gt;&lt;code&gt;&lt;span class="nf"&gt;ldr&lt;/span&gt; &lt;span class="nv"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;proves that the fifth argument must be fetched from memory.&lt;/p&gt;

&lt;p&gt;This behavior is &lt;strong&gt;ABI law&lt;/strong&gt;, not an optimization artifact or a compiler choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification: Which Way Does the Stack Grow?
&lt;/h2&gt;

&lt;p&gt;Rather than assuming, we verify it directly from assembly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nasm"&gt;&lt;code&gt;&lt;span class="nf"&gt;str&lt;/span&gt; &lt;span class="nv"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="nf"&gt;sub&lt;/span&gt; &lt;span class="nb"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both instructions &lt;strong&gt;subtract from &lt;code&gt;sp&lt;/code&gt;&lt;/strong&gt; to allocate space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;If allocating space requires decrementing the stack pointer, the stack grows &lt;strong&gt;toward lower memory addresses&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Verified: The stack grows downward.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Data Layout &amp;amp; Alignment: The Offset Proof
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claim
&lt;/h3&gt;

&lt;p&gt;A 32-bit &lt;code&gt;int&lt;/code&gt; must be 4-byte aligned.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;char&lt;/code&gt; followed by an &lt;code&gt;int&lt;/code&gt; requires padding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MyPackedStruct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;  &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;get_i&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MyPackedStruct&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Optimized Assembly (&lt;code&gt;O2&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focaay5jom43obx3uixl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focaay5jom43obx3uixl5.png" alt="type size optimized -O2" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization disabled (&lt;code&gt;O0&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukzmkq9224p8lip210nm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukzmkq9224p8lip210nm.png" alt="type size optimization disabled" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Observation
&lt;/h3&gt;

&lt;p&gt;The field &lt;code&gt;i&lt;/code&gt; is accessed at offset &lt;code&gt;#4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If there were no padding, the offset would be &lt;code&gt;#1&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;int&lt;/code&gt; is 4 bytes wide&lt;/li&gt;
&lt;li&gt;The compiler inserted 3 bytes of padding&lt;/li&gt;
&lt;li&gt;This layout is &lt;strong&gt;ABI-mandated&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Optimization does not change it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ELF Symbols and Function Size (&lt;code&gt;nm -S&lt;/code&gt;)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;aman&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;intget_i&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MyPackedStruct&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arm-none-eabi-nm &lt;span class="nt"&gt;-S&lt;/span&gt; main.o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0000000000000004 B aman
0000000000000028 T get_i

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Interpretation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;aman&lt;/code&gt; is 4 bytes → confirms &lt;code&gt;int&lt;/code&gt; size.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;B&lt;/code&gt; (BSS)&lt;/strong&gt;: uninitialized global data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;get_i&lt;/code&gt; occupies 0x28 bytes (40 bytes).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;T&lt;/code&gt; (Text)&lt;/strong&gt;: executable code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At &lt;code&gt;-O0&lt;/code&gt;, this corresponds to 10 ARM instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x28&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;code&gt;-O2&lt;/code&gt;, the function shrinks to 2 instructions (8 bytes).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Function size is a compiler artifact; ABI rules are not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Name Mangling: C vs C++
&lt;/h2&gt;

&lt;h3&gt;
  
  
  C
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// C file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this command shows the unmangled symbol &lt;code&gt;add&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
arm-none-eabi-nm.exe add.o
00000000 T add

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C++
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// C++ file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the same command on the C++ object file shows the mangled symbol &lt;code&gt;_Z3addii&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arm-none-eabi-nm.exe add.o
00000000 T _Z3addii
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mangled name encodes the function name and its parameter types: &lt;code&gt;_Z3addii&lt;/code&gt; decodes to &lt;code&gt;add&lt;/code&gt; taking two &lt;code&gt;int&lt;/code&gt; arguments.&lt;/p&gt;

&lt;p&gt;Using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;extern&lt;/span&gt;&lt;span class="s"&gt;"C"&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;disables mangling and restores the C symbol name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The ABI governs how functions are called, how data is laid out, how the stack behaves, and how symbols are named — all of which can be verified directly from generated binaries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you learn to &lt;strong&gt;observe&lt;/strong&gt; the ABI instead of memorizing it, low-level code stops being mysterious and starts being predictable.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>assembly</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Function Prologue and Epilogue in ARM: What Really Happens When a Function Enters and Exits</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:47:57 +0000</pubDate>
      <link>https://forem.com/amanprasad/function-prologue-and-epilogue-in-arm-what-really-happens-when-a-function-enters-and-exits-34p4</link>
      <guid>https://forem.com/amanprasad/function-prologue-and-epilogue-in-arm-what-really-happens-when-a-function-enters-and-exits-34p4</guid>
      <description>&lt;p&gt;Function prologue and epilogue are the instructions executed at the beginning and end of a function to preserve required CPU state and manage the stack. Although they are not visible in C code, the compiler automatically inserts these sequences to ensure correct function execution. In this article, we examine how ARM compilers use prologue and epilogue to safely handle function calls at the assembly level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why Function Prologue and Epilogue Exist&lt;/li&gt;
&lt;li&gt;The Rulebook: AAPCS&lt;/li&gt;
&lt;li&gt;What Happens at Function Entry: The Prologue&lt;/li&gt;
&lt;li&gt;What Happens at Function Exit: The Epilogue&lt;/li&gt;
&lt;li&gt;
From C Code to Assembly: A Practical Example

&lt;ul&gt;
&lt;li&gt;Understanding the Assembly Output&lt;/li&gt;
&lt;li&gt;Prologue — Setting Up the Stack Frame&lt;/li&gt;
&lt;li&gt;Function Body — Execution of C Logic&lt;/li&gt;
&lt;li&gt;Epilogue — Cleaning Up and Returning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Leaf vs Non-Leaf Functions&lt;/li&gt;

&lt;li&gt;Prologue and Epilogue in Interrupts and Context Switching&lt;/li&gt;

&lt;li&gt;Naked Functions: Skipping Prologue and Epilogue (When and Why)&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Function Prologue and Epilogue Exist
&lt;/h2&gt;

&lt;p&gt;On ARM, function calls reuse the same CPU registers and stack memory. Without a defined calling convention and a mechanism to save and restore this state, operations performed inside a function would corrupt the caller’s execution context. To prevent this, the compiler automatically inserts a function prologue and epilogue that preserve required registers and restore the stack state, ensuring correct program execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Rulebook: AAPCS
&lt;/h2&gt;

&lt;p&gt;Before we look at the assembly, we need to understand why the code is generated this way.&lt;/p&gt;

&lt;p&gt;In the ARM ecosystem, all toolchains follow a strict set of rules called the &lt;strong&gt;AAPCS&lt;/strong&gt; (Procedure Call Standard for the ARM Architecture). This standard defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which registers a function can overwrite freely (Caller-Saved: &lt;code&gt;R0-R3&lt;/code&gt;, &lt;code&gt;R12&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Which registers a function must preserve and restore (Callee-Saved: &lt;code&gt;R4-R11&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;How the stack is managed (Full Descending Stack, 8-byte alignment).&lt;/li&gt;
&lt;li&gt;The AAPCS also defines how function arguments are passed and how return values are delivered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Prologue and Epilogue are simply the compiler's way of enforcing these rules consistently across all functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens at Function Entry: The Prologue
&lt;/h2&gt;

&lt;p&gt;When a function is called on ARM Cortex-M, a short sequence of compiler-generated instructions runs at function entry, known as the prologue. These instructions execute before any user-defined C code in the function and prepare the stack and registers according to the AAPCS. A typical Cortex-M prologue is examined in detail in the example below.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happens at Function Exit: The Epilogue
&lt;/h2&gt;

&lt;p&gt;At function return, the compiler inserts a short sequence of instructions known as the epilogue. Its role is to undo the changes made by the prologue and restore the CPU state so execution can safely resume in the caller.&lt;/p&gt;

&lt;p&gt;The exact instructions used depend on the function, but the epilogue typically releases the stack frame, restores saved registers, and returns control to the caller. These steps are shown in the assembly example below.&lt;/p&gt;




&lt;h2&gt;
  
  
  From C Code to Assembly: A Practical Example
&lt;/h2&gt;

&lt;p&gt;To make this concrete, the following example was compiled for an STM32F407 (ARM Cortex-M4) with optimizations disabled (&lt;code&gt;-O0&lt;/code&gt;). The generated assembly uses the &lt;code&gt;Thumb-2&lt;/code&gt; instruction set, as is standard on Cortex-M cores. We focus on the assembly generated for &lt;code&gt;compute_sum()&lt;/code&gt;, a non-leaf function that calls another function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;compute_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;temp1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;temp2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temp2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assembly generated for &lt;code&gt;compute_sum&lt;/code&gt; function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;080002f8 &amp;lt;compute_sum&amp;gt;:
 80002f8:   b580        push    {r7, lr}
 80002fa:   b086        sub sp, #24
 80002fc:   af00        add r7, sp, #0
 80002fe:   6078        str r0, [r7, #4]
 8000300:   6039        str r1, [r7, #0]
 8000302:   687b        ldr r3, [r7, #4]
 8000304:   005b        lsls    r3, r3, #1
 8000306:   617b        str r3, [r7, #20]
 8000308:   683a        ldr r2, [r7, #0]
 800030a:   4613        mov r3, r2
 800030c:   005b        lsls    r3, r3, #1
 800030e:   4413        add r3, r2
 8000310:   613b        str r3, [r7, #16]
 8000312:   6939        ldr r1, [r7, #16]
 8000314:   6978        ldr r0, [r7, #20]
 8000316:   f7ff ffe1   bl  80002dc &amp;lt;add&amp;gt;
 800031a:   60f8        str r0, [r7, #12]
 800031c:   68fb        ldr r3, [r7, #12]
 800031e:   4618        mov r0, r3
 8000320:   3718        adds    r7, #24
 8000322:   46bd        mov sp, r7
 8000324:   bd80        pop {r7, pc}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function allocates local variables and calls another function, which makes it a non-leaf function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6klt34nkyt9jg4yvywnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6klt34nkyt9jg4yvywnf.png" alt="assembly code for the compute_sum function" width="800" height="1125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Assembly Output
&lt;/h2&gt;

&lt;p&gt;The image above shows the disassembly of the &lt;code&gt;compute_sum()&lt;/code&gt; function. The instructions are visually divided into three regions: Prologue, Function Body, and Epilogue. Each region serves a distinct purpose in the execution of the function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prologue — setting up the stack frame
&lt;/h3&gt;

&lt;p&gt;The prologue appears at the top of the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;push {r7, lr}
sub  sp, #24
add  r7, sp, #0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sequence is the function prologue and it is inserted automatically by the compiler.&lt;br&gt;
At function entry, the compiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saves &lt;code&gt;r7&lt;/code&gt; and &lt;code&gt;lr&lt;/code&gt; so the caller’s frame pointer and return address are not lost.&lt;/li&gt;
&lt;li&gt;Reserves 24 bytes on the stack for local variables and compiler-generated temporaries&lt;/li&gt;
&lt;li&gt;Even though the function defines only three &lt;code&gt;int&lt;/code&gt; variables (12 bytes), extra space is allocated to maintain alignment and to give the compiler room for temporary values, which is common when optimizations are disabled (&lt;code&gt;-O0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Sets up &lt;code&gt;r7&lt;/code&gt; as a frame pointer, allowing all local variables to be accessed using fixed offsets regardless of changes to &lt;code&gt;sp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these steps create a private stack frame for the function, ensuring it can execute and return without disturbing the caller’s state.&lt;/p&gt;
&lt;h3&gt;
  
  
  Function Body — execution of C logic
&lt;/h3&gt;

&lt;p&gt;The middle section of the image corresponds to the actual work performed by &lt;code&gt;compute_sum()&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input parameters (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;) are first stored on the stack so they can be reused&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;temp1&lt;/code&gt; is calculated as &lt;code&gt;x * 2&lt;/code&gt; using a left-shift operation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;temp2&lt;/code&gt; is calculated as &lt;code&gt;y * 3&lt;/code&gt; using a shift followed by an add&lt;/li&gt;
&lt;li&gt;The computed values are loaded into registers and passed to &lt;code&gt;add()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The instruction &lt;code&gt;bl &amp;lt;add&amp;gt;&lt;/code&gt; performs a function call and overwrites the Link Register (&lt;code&gt;lr&lt;/code&gt;). Because of this, &lt;code&gt;lr&lt;/code&gt; must be saved earlier in the prologue. This is what makes &lt;code&gt;compute_sum()&lt;/code&gt; a non-leaf function.&lt;/p&gt;
&lt;h3&gt;
  
  
  Epilogue — cleaning up and returning
&lt;/h3&gt;

&lt;p&gt;This sequence forms the function epilogue and restores the caller’s state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;adds r7, #24
mov  sp, r7
pop  {r7, pc}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The stack space allocated for the function is released&lt;/li&gt;
&lt;li&gt;The original frame pointer (r7) is restored&lt;/li&gt;
&lt;li&gt;The return address is loaded into the program counter (pc), returning execution to the caller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The epilogue exactly mirrors the prologue, ensuring the function exits with the CPU state unchanged.&lt;/p&gt;




&lt;h2&gt;
  
  
  Leaf vs Non-Leaf Functions
&lt;/h2&gt;

&lt;p&gt;Not all functions require the same prologue and epilogue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A leaf function&lt;/strong&gt; is a function that does not call any other function. Since it never executes a BL instruction, the Link Register (LR) is not overwritten. As a result, the compiler may omit saving LR and, in some cases, avoid creating a full stack frame altogether.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A non-leaf function&lt;/strong&gt;, on the other hand, calls one or more functions. Because a BL instruction overwrites LR, the function must save LR in its prologue and restore it in the epilogue. Non-leaf functions almost always require a stack frame to preserve state and manage local variables.&lt;/p&gt;

&lt;p&gt;Whether a function is leaf or non-leaf directly influences how much code the compiler inserts at function entry and exit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prologue and Epilogue in Interrupts and Context Switching
&lt;/h2&gt;

&lt;p&gt;On ARM Cortex-M, a similar mechanism appears in interrupt handling. When an interrupt occurs, the hardware automatically pushes an architecturally defined subset of the CPU state onto the stack and restores it on return. RTOS context switching extends this idea in software. While the mechanisms differ, the goal is the same: preserving execution context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Naked Functions: Skipping Prologue and Epilogue (When and Why)
&lt;/h2&gt;

&lt;p&gt;By default, the compiler generates a prologue and epilogue to manage the stack and preserve registers according to the AAPCS. Using &lt;code&gt;__attribute__((naked))&lt;/code&gt;, this behavior can be disabled entirely.&lt;/p&gt;

&lt;p&gt;A naked function is compiled without any automatically generated prologue or epilogue. The compiler does not save or restore registers, allocate stack space, enforce stack alignment, or generate a return sequence. All responsibility for preserving CPU state and managing the stack falls entirely on the programmer.&lt;/p&gt;

&lt;p&gt;This is only appropriate in very low-level code, such as task context switching, interrupt entry routines, or early boot initialization. Because naked functions bypass the ABI completely, the compiler does not protect register or stack state. Even small mistakes can therefore cause stack corruption or hard faults.&lt;/p&gt;

&lt;p&gt;For this reason, naked functions should not be used in normal application code. They are intended only for situations where compiler-generated prologue and epilogue code must be avoided and the programmer is prepared to manage the CPU state manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Function prologue and epilogue are fundamental to how ARM compilers implement safe and predictable function calls. By following the AAPCS, the compiler ensures registers, stack state, and return flow are preserved across function boundaries. Understanding how these mechanisms work especially at the assembly level makes it easier to analyze stack usage, debug low-level issues, and write reliable embedded software.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>learning</category>
      <category>programming</category>
      <category>coding</category>
    </item>
    <item>
      <title>Bit Fields in C Explained: How They Work and Why They Matter</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Sat, 10 Jan 2026 04:14:33 +0000</pubDate>
      <link>https://forem.com/amanprasad/bit-fields-in-c-explained-how-they-work-and-why-they-matter-34i9</link>
      <guid>https://forem.com/amanprasad/bit-fields-in-c-explained-how-they-work-and-why-they-matter-34i9</guid>
      <description>&lt;p&gt;We often use full integers to store simple flags that need only one bit. Bit fields in C seem like an easy way to save memory by using just the bits we need.&lt;br&gt;
But this simplicity hides compiler and hardware details that can change how the data is actually stored in memory.&lt;/p&gt;
&lt;h2&gt;
  
  
  📌Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What Are Bit Fields?&lt;/li&gt;
&lt;li&gt;Bit Fields vs Normal Structure Members&lt;/li&gt;
&lt;li&gt;How Compilers Actually Store Bit Fields&lt;/li&gt;
&lt;li&gt;Appropriate Uses of Bit Fields&lt;/li&gt;
&lt;li&gt;Rules of Thumb&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What Are Bit Fields?
&lt;/h2&gt;

&lt;p&gt;A bit field is a special &lt;code&gt;struct&lt;/code&gt; member that allows you to specify exactly how many bits a variable should occupy, rather than using the standard byte-aligned sizes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Date&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;day&lt;/span&gt;   &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 5 bits (Range: 0-31)&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 4 bits (Range: 0-15)&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;  &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 11 bits (Range: 0-2047)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of allocating a full &lt;code&gt;int&lt;/code&gt; (typically &lt;code&gt;32 bits&lt;/code&gt;) for each member, the compiler may pack these fields together to reduce memory usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Syntax and basic rules&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bit field is defined by placing a colon &lt;code&gt;:&lt;/code&gt; after a structure member name, followed by the number of bits it should use.&lt;/li&gt;
&lt;li&gt;Bit fields can only be declared inside a &lt;code&gt;struct&lt;/code&gt;. They cannot exist as standalone variables.&lt;/li&gt;
&lt;li&gt;Bit fields are not addressable objects in C, so the address-of operator (&lt;code&gt;&amp;amp;&lt;/code&gt;) cannot be used on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bit Fields vs Normal Structure Members
&lt;/h2&gt;

&lt;p&gt;This behavior contrasts with normal structure members.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normal structure members are aligned to byte boundaries, so each int usually consumes 4 bytes, even if it stores only a small value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bit field members, on the other hand, can be packed into adjacent bits within a machine word, allowing multiple small values to share the same underlying storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is clear: normal members offer predictable layout, while bit fields trade layout guarantees for compactness and expressiveness.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Compilers Actually Store Bit Fields
&lt;/h2&gt;

&lt;p&gt;This is where bit fields stop being simple.&lt;br&gt;
When you write a structure like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Flags&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is natural to assume that these fields are placed in memory one after another, each occupying a single bit in order, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bit 0 -&amp;gt; a
bit 1 -&amp;gt; b
bit 2 -&amp;gt; c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the C standard does not guarantee any such layout.&lt;br&gt;
Bit fields are stored inside a larger storage unit, typically the base type used in their declaration such as &lt;code&gt;unsigned int&lt;/code&gt;. How individual bit fields are placed within that storage unit is largely decided by the compiler.&lt;/p&gt;

&lt;p&gt;In particular, the C standard does not define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ordering of bits within a word (LSB vs MSB)&lt;/li&gt;
&lt;li&gt;how bit fields are packed across bytes&lt;/li&gt;
&lt;li&gt;alignment and padding rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two different compilers targeting the same architecture are therefore allowed to produce different memory layouts for the same bit-field structure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This does not make bit fields useless. It means they are context-sensitive and not a reliable way to control precise bit-level memory layout.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Appropriate Uses of Bit Fields
&lt;/h2&gt;

&lt;p&gt;Bit fields and manual bit masking serve different purposes, even though both operate at the bit level.&lt;/p&gt;

&lt;p&gt;Bit fields are best used to represent logical state inside your program. They improve readability, group related flags naturally, and work well when the exact memory layout does not matter outside the program. This makes them a good fit for internal flags, state machines, and configuration structures.&lt;/p&gt;

&lt;p&gt;Manual bit masking is the correct choice when exact bit positions matter. This includes hardware registers, binary protocols, and any layout defined by a datasheet or specification. Bit masks provide full control over bit positions, behave consistently across compilers, and match hardware documentation exactly.&lt;/p&gt;

&lt;p&gt;For example, when working with hardware registers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define UART_RXNE (1 &amp;lt;&amp;lt; 5)
#define UART_TC   (1 &amp;lt;&amp;lt; 6)
#define UART_TXE  (1 &amp;lt;&amp;lt; 7)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach may look less elegant than bit fields, but it is precise, portable, and unambiguous. In embedded systems, correctness matters more than elegance.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;bool&lt;/code&gt; works for individual flags, multiple &lt;code&gt;bool&lt;/code&gt; members still consume at least one byte each and may introduce padding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules of Thumb
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bit fields express meaning, not layout.&lt;/li&gt;
&lt;li&gt;Never use bit fields for memory-mapped hardware registers.&lt;/li&gt;
&lt;li&gt;Use bit fields for internal flags and logical program state.&lt;/li&gt;
&lt;li&gt;Use bit masks for hardware registers and binary protocols.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>discuss</category>
      <category>programming</category>
      <category>coding</category>
      <category>learning</category>
    </item>
    <item>
      <title>Why Arrays Start at Index 0: A Memory-Level Explanation</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Sun, 04 Jan 2026 14:52:04 +0000</pubDate>
      <link>https://forem.com/amanprasad/why-arrays-start-at-index-0-a-memory-level-explanation-393p</link>
      <guid>https://forem.com/amanprasad/why-arrays-start-at-index-0-a-memory-level-explanation-393p</guid>
      <description>&lt;p&gt;Have you ever wondered why arrays in C/C++ (and many other languages) start with indexing at 0 instead of 1?&lt;br&gt;
To understand this properly, we need to look at how arrays are stored in memory and how the compiler computes element addresses.&lt;/p&gt;
&lt;h2&gt;
  
  
  📌 Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Arrays as Contiguous Memory Blocks&lt;/li&gt;
&lt;li&gt;How &lt;code&gt;arr[i]&lt;/code&gt; Works: Pointer Arithmetic Explained&lt;/li&gt;
&lt;li&gt;Why This Forces Indexing to Start at 0&lt;/li&gt;
&lt;li&gt;What If Arrays Started at Index 1?&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;arr[i]&lt;/code&gt; and &lt;code&gt;i[arr]&lt;/code&gt; Mean the Same Thing&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Arrays as Contiguous Memory Blocks
&lt;/h2&gt;

&lt;p&gt;At its core, an array in C/C++ is a fixed-size collection of elements of the same type, stored in contiguous memory locations. When you declare&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler allocates space for &lt;code&gt;100&lt;/code&gt; consecutive integers.&lt;br&gt;
On most modern systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An int typically occupies &lt;code&gt;4 bytes&lt;/code&gt; (on 32/64-bit architectures).&lt;/li&gt;
&lt;li&gt;So, the array consumes &lt;code&gt;400 bytes&lt;/code&gt;, laid out back-to-back in memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How &lt;code&gt;arr[i]&lt;/code&gt; Works: Pointer Arithmetic Explained
&lt;/h2&gt;

&lt;p&gt;The real reason arrays start at index 0 has nothing to do with counting or convention. It comes from how the compiler rewrites array indexing into pointer arithmetic.&lt;/p&gt;

&lt;p&gt;When you write &lt;code&gt;arr[i]&lt;/code&gt; it is translated directly into &lt;code&gt;*(arr + i)&lt;/code&gt;&lt;br&gt;
This is not an implementation detail. It is how the language defines array subscripting.&lt;br&gt;
This single translation explains why array indexing starts at zero.&lt;/p&gt;

&lt;p&gt;Let’s unpack what each part in &lt;code&gt;*(arr + i)&lt;/code&gt; means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;arr&lt;/code&gt; decays to a pointer to the first element, so it supplies the base address of the array (i.e., &lt;code&gt;&amp;amp;arr[0]&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;+ i&lt;/code&gt; Performs pointer arithmetic. This does not add i bytes.
It adds &lt;code&gt;i × sizeof(element_type)&lt;/code&gt; bytes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*&lt;/code&gt; Dereferences the computed address to read or write the value.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;So &lt;code&gt;arr[i]&lt;/code&gt; literally means: Go i elements away from the start of the array, then access the value stored there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s verify this equivalence with a simple C program&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// Direct array access&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"arr[1]: %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;  &lt;span class="c1"&gt;// Output: 20&lt;/span&gt;

    &lt;span class="c1"&gt;// Equivalent pointer version&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"*(arr + 1): %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  &lt;span class="c1"&gt;// Same: 20&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Forces Indexing to Start at 0
&lt;/h2&gt;

&lt;p&gt;Here’s the key insight: the first element isn’t one step away — it lives at the base address.&lt;br&gt;
There is zero distance and zero bytes to skip.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;distance from base address = 0&lt;/li&gt;
&lt;li&gt;offset = 0&lt;/li&gt;
&lt;li&gt;index = 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why the first element is accessed as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;arr[0] == *(arr + 0)&lt;/code&gt; — no adjustment needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each subsequent element is reached by moving forward in memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;arr[1] == *(arr + 1)&lt;/code&gt; — skip 1 element (4 bytes for &lt;code&gt;int&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;arr[2] == *(arr + 2)&lt;/code&gt; — skip 2 elements (8 bytes)&lt;/li&gt;
&lt;li&gt;and so on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each index represents how many elements to move forward from the base address.&lt;br&gt;
No additional arithmetic or correction is required.&lt;/p&gt;

&lt;p&gt;An index is an offset measured in elements.&lt;br&gt;
Offsets start at 0 because nothing can be closer than zero distance from the origin.&lt;br&gt;
This follows directly from how memory addressing and pointer arithmetic work.&lt;/p&gt;
&lt;h2&gt;
  
  
  What If Arrays Started at Index 1?
&lt;/h2&gt;

&lt;p&gt;Now that we know &lt;code&gt;arr[i]&lt;/code&gt; is just syntactic sugar for &lt;code&gt;*(arr + i)&lt;/code&gt;, let’s imagine a different design.&lt;/p&gt;

&lt;p&gt;Suppose arrays were &lt;strong&gt;1-based indexed&lt;/strong&gt;, as in some mathematical tools (for example, MATLAB), where the first element is accessed as &lt;code&gt;arr[1]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Pointer arithmetic itself does not change.&lt;br&gt;
&lt;code&gt;arr[i]&lt;/code&gt; would still translate to:&lt;br&gt;
&lt;code&gt;*(arr + i)&lt;/code&gt;&lt;br&gt;
If we applied this rule directly:&lt;br&gt;
&lt;code&gt;arr[1]&lt;/code&gt; → &lt;code&gt;*(arr + 1)&lt;/code&gt;&lt;br&gt;
this would actually point to the second element, not the first.&lt;br&gt;
To make 1-based indexing work, the compiler would need to internally rewrite every access as:&lt;br&gt;
&lt;code&gt;arr[i]&lt;/code&gt; → &lt;code&gt;*(arr + (i - 1))&lt;/code&gt;&lt;br&gt;
That subtraction is the key difference.&lt;/p&gt;

&lt;p&gt;While modern compilers can often optimize this subtraction away, one-based indexing still introduces a semantic mismatch with the hardware’s base + offset addressing model. It complicates bounds reasoning and obscures the simple “offset from base” mental model.&lt;/p&gt;

&lt;p&gt;Modern CPU addressing modes operate naturally in terms of base address plus offset, making zero-based indexing a direct and transparent match.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why &lt;code&gt;arr[i]&lt;/code&gt; and &lt;code&gt;i[arr]&lt;/code&gt; Mean the Same Thing
&lt;/h2&gt;

&lt;p&gt;Once you understand that array indexing in C is defined in terms of pointer arithmetic, an interesting (and often surprising) consequence follows.&lt;/p&gt;

&lt;p&gt;The C standard defines the subscript operator as:&lt;br&gt;
&lt;code&gt;a[b]&lt;/code&gt; = &lt;code&gt;*(a + b)&lt;/code&gt;&lt;br&gt;
This definition does not treat &lt;code&gt;a&lt;/code&gt; as “the array” and &lt;code&gt;b&lt;/code&gt; as “the index.”&lt;br&gt;
It simply means: add &lt;code&gt;b&lt;/code&gt; to &lt;code&gt;a&lt;/code&gt;, then dereference the result.&lt;/p&gt;

&lt;p&gt;Now consider the implication of this definition.&lt;br&gt;
Pointer addition is just integer addition under the hood, and addition is commutative:&lt;br&gt;
(a + b) == (b + a)&lt;/p&gt;

&lt;p&gt;Because of this, both of the following expressions compute the same address&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;*(a + b)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;*(b + a)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which means:&lt;br&gt;
a[b] == b[a]&lt;/p&gt;

&lt;p&gt;This is not a trick, a compiler hack, or undefined behavior.&lt;br&gt;
It is a direct and intentional consequence of how the C language defines array subscripting.&lt;/p&gt;

&lt;p&gt;The example below demonstrates this equivalence in practice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Declare an array with 5 elements&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="cm"&gt;/*
     * In C, array access is defined as:
     *   a[b] == *(a + b)
     *
     * Because addition is commutative:
     *   a + b == b + a
     *
     * This means:
     *   arr[3] == 3[arr]
     */&lt;/span&gt;

    &lt;span class="c1"&gt;// Normal array indexing&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"arr[3]  = %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;   &lt;span class="c1"&gt;// Output: 4&lt;/span&gt;

    &lt;span class="c1"&gt;// Equivalent but unusual indexing&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"3[arr]  = %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;   &lt;span class="c1"&gt;// Output: 4&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; While &lt;code&gt;i[arr]&lt;/code&gt; is valid C, it is rarely used in real code because it hurts readability. It exists only because array indexing is defined in terms of pointer arithmetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In C/C++, array indexing is not about counting positions. It is about measuring offsets from a base address.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>discuss</category>
      <category>c</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Structure Padding Isn’t Wastage of Memory — It’s a Hardware Requirement</title>
      <dc:creator>Aman Prasad</dc:creator>
      <pubDate>Wed, 31 Dec 2025 17:02:49 +0000</pubDate>
      <link>https://forem.com/amanprasad/structure-padding-isnt-wastage-of-memory-its-a-hardware-requirement-2gk3</link>
      <guid>https://forem.com/amanprasad/structure-padding-isnt-wastage-of-memory-its-a-hardware-requirement-2gk3</guid>
<description>&lt;p&gt;Have you ever manually calculated the size of a struct, only to find that &lt;code&gt;sizeof&lt;/code&gt; returns a larger number? You aren't crazy, and the compiler isn't broken. In this guide, we’ll decode structure padding: why it happens, why your CPU loves it, and how to optimize it for embedded systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What Is a Structure?&lt;/li&gt;
&lt;li&gt;How Structures Are Stored in Memory&lt;/li&gt;
&lt;li&gt;The Padding Myth&lt;/li&gt;
&lt;li&gt;Why Structure Padding Is Necessary&lt;/li&gt;
&lt;li&gt;Packed Structures: When to Use Them (and When Not To)&lt;/li&gt;
&lt;li&gt;The Trade-Off: Memory vs Performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Is a Structure?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A structure in C is a user-defined data type that allows programmers to group values of different data types under a single name.&lt;/li&gt;
&lt;li&gt;The items in the structure are called its members, and they can be of any valid data type.&lt;/li&gt;
&lt;li&gt;A structure is defined using the &lt;code&gt;struct&lt;/code&gt; keyword followed by the structure’s name, with the members listed inside curly braces &lt;code&gt;{}&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Structures Are Stored in Memory
&lt;/h2&gt;

&lt;p&gt;Consider the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
                &lt;span class="c1"&gt;// 3 byte of padding inserted here&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
                &lt;span class="c1"&gt;// 3 bytes of padding inserted here&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size of struct = %zu bytes&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  &lt;span class="c1"&gt;// prints 12 bytes (instead of 6)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this structure seems simple. It contains two &lt;code&gt;char&lt;/code&gt; fields (1 byte each) and one &lt;code&gt;int&lt;/code&gt; field (4 bytes).&lt;br&gt;
Adding the sizes manually gives 6 bytes.&lt;/p&gt;

&lt;p&gt;Yet &lt;code&gt;sizeof(struct example)&lt;/code&gt; evaluates to 12 bytes.&lt;/p&gt;

&lt;p&gt;This is not a mistake. To understand why, we need to look at how the compiler maps this structure onto actual memory addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Byte-level layout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The compiler places structure members into a contiguous block of memory, assigning each field a fixed offset from the start of the structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Offset&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;+0&lt;/td&gt;
&lt;td&gt;0x00001000&lt;/td&gt;
&lt;td&gt;char a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;0x00001001&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+2&lt;/td&gt;
&lt;td&gt;0x00001002&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+3&lt;/td&gt;
&lt;td&gt;0x00001003&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+4&lt;/td&gt;
&lt;td&gt;0x00001004&lt;/td&gt;
&lt;td&gt;int b (LSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+5&lt;/td&gt;
&lt;td&gt;0x00001005&lt;/td&gt;
&lt;td&gt;int b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+6&lt;/td&gt;
&lt;td&gt;0x00001006&lt;/td&gt;
&lt;td&gt;int b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+7&lt;/td&gt;
&lt;td&gt;0x00001007&lt;/td&gt;
&lt;td&gt;int b (MSB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+8&lt;/td&gt;
&lt;td&gt;0x00001008&lt;/td&gt;
&lt;td&gt;char c&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+9&lt;/td&gt;
&lt;td&gt;0x00001009&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+10&lt;/td&gt;
&lt;td&gt;0x0000100A&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+11&lt;/td&gt;
&lt;td&gt;0x0000100B&lt;/td&gt;
&lt;td&gt;Padding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Byte order shown assumes a little-endian system. Padding behavior is independent of endianness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, we can see padding but we still haven’t explained why it exists.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Padding Myth
&lt;/h2&gt;

&lt;p&gt;To understand why padding exists at all, we need to briefly leave the C language behind and look at how CPUs actually fetch data from memory.&lt;/p&gt;

&lt;p&gt;Let’s trigger the confusion deliberately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;  &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%zu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Many beginners expect this to print 5 (1 byte + 4 bytes).&lt;br&gt;
On most systems, it prints 8.&lt;/p&gt;

&lt;p&gt;At this point, you’ll often hear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The compiler is wasting memory by inserting padding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That conclusion is wrong.&lt;br&gt;
What actually happened is structure padding. The compiler inserted extra bytes between members and possibly at the end to satisfy the hardware’s alignment rules. The requirement comes from the hardware, not from anything peculiar to the C language.&lt;/p&gt;

&lt;p&gt;Visually, the memory layout looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg51u1qxnp3sl05i2a1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg51u1qxnp3sl05i2a1n.png" alt="Memory layout of Struct with padding in banked memory" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram illustrates &lt;strong&gt;how a C structure is laid out in memory&lt;/strong&gt; on a system with a &lt;strong&gt;32-bit data bus&lt;/strong&gt;, using banked memory to visualize the alignment constraints imposed by the hardware.&lt;/p&gt;

&lt;p&gt;At the top-left, the &lt;code&gt;struct example&lt;/code&gt; definition is shown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;char a&lt;/code&gt; → 1 byte&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int b&lt;/code&gt; → 4 bytes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;char c&lt;/code&gt; → 1 byte&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naively, this looks like &lt;strong&gt;6 bytes of data&lt;/strong&gt;.&lt;br&gt;
However, the memory layout above shows why the actual size becomes &lt;strong&gt;12 bytes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Byte-level memory layout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The diagram represents memory &lt;strong&gt;byte by byte&lt;/strong&gt;, organized into four byte lanes (BANK 0 to BANK 3) of the 32-bit data bus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BANK 0&lt;/strong&gt; → D7–D0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BANK 1&lt;/strong&gt; → D15–D8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BANK 2&lt;/strong&gt; → D23–D16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BANK 3&lt;/strong&gt; → D31–D24&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each row represents a &lt;strong&gt;4-byte aligned word&lt;/strong&gt; in memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Placement of &lt;code&gt;char a&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;char a&lt;/code&gt; occupies &lt;strong&gt;only one byte&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It is placed in &lt;strong&gt;BANK 0&lt;/strong&gt; at offset &lt;code&gt;+0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The remaining three byte lanes in that word are unused for data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These unused lanes are shown as &lt;strong&gt;padding bytes&lt;/strong&gt;.&lt;br&gt;
They exist so that the next field can start at a properly aligned address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alignment of &lt;code&gt;int b&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;int b&lt;/code&gt; requires a &lt;strong&gt;4-byte aligned address&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The compiler therefore starts &lt;code&gt;int b&lt;/code&gt; at offset &lt;code&gt;+4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;All four byte lanes (BANK 0–BANK 3) are used to store the integer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the CPU to fetch &lt;code&gt;int b&lt;/code&gt; in &lt;strong&gt;one aligned 32-bit memory access&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Placement of &lt;code&gt;char c&lt;/code&gt; and tail padding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;char c&lt;/code&gt; occupies one byte at offset &lt;code&gt;+8&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The remaining three bytes in that word are again unused&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These final padding bytes are &lt;strong&gt;tail padding&lt;/strong&gt;.&lt;br&gt;
They ensure that the &lt;strong&gt;total structure size is a multiple of 4&lt;/strong&gt;, so that arrays of this structure remain correctly aligned.&lt;/p&gt;
&lt;h3&gt;
  
  
  Final result
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Actual data: &lt;strong&gt;6 bytes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Padding inserted: &lt;strong&gt;6 bytes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Total structure size: &lt;strong&gt;12 bytes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key takeaway illustrated by this diagram is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Padding is not wasted memory — it is the cost of alignment, paid to allow efficient and safe access on real hardware.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Important note&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This diagram intentionally shows &lt;strong&gt;banked memory and a 32-bit data bus&lt;/strong&gt; to emphasize that structure padding is driven by &lt;strong&gt;hardware access rules&lt;/strong&gt;. Different architectures may implement alignment differently, but the principle of alignment remains the same.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Why Structure Padding Is Necessary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5stjdqdmm7nw3oh9h4bf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5stjdqdmm7nw3oh9h4bf.png" alt="with padding and without padding memory representation" width="512" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand why padding is necessary, we need to bridge the gap between how we write code (byte by byte) and how the hardware runs it (word by word).&lt;/p&gt;

&lt;p&gt;A 32-bit CPU does not access memory one byte at a time; that would be inefficient. Instead, it fetches data in 4-byte chunks known as words. The CPU runs fastest when the data it needs starts exactly at the beginning of a word boundary.&lt;/p&gt;

&lt;p&gt;Let's visualize a structure with two chars followed by an int (&lt;code&gt;char a&lt;/code&gt;, &lt;code&gt;char b&lt;/code&gt;, &lt;code&gt;int c&lt;/code&gt;) and see how the CPU handles it with and without padding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Slow Way: Without Padding (Unaligned)&lt;/strong&gt;&lt;br&gt;
Look at the left side of the diagram. If the compiler packed data tightly, here is what happens when the CPU needs to read the 4-byte integer &lt;code&gt;int c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Because the two char fields take up the first two bytes, &lt;code&gt;int c&lt;/code&gt; starts halfway through the first 4-byte word and ends halfway through the second word. It is split across two words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cost:&lt;/strong&gt; To get that single integer &lt;code&gt;int c&lt;/code&gt;, the CPU must perform two memory cycles. It has to fetch Word 1 to get the first half of the integer, then fetch Word 2 to get the second half, and finally stitch them together. This is slow and inefficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Fast Way: With Padding (Aligned)&lt;/strong&gt;&lt;br&gt;
Now look at the right side of the diagram. This is what the compiler actually does to help the CPU: it inserts two unused padding bytes after the &lt;code&gt;char&lt;/code&gt; fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Benefit:&lt;/strong&gt; This forces &lt;code&gt;int c&lt;/code&gt; to start exactly at the beginning of Word 2. When the CPU needs that integer, it can retrieve the entire 4-byte value in a single memory cycle (indicated by the single green arrow).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsjqukg51s40dm2q1c3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsjqukg51s40dm2q1c3s.png" alt="why padding matters" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the bottom of the image summarizes, structure padding is a deliberate trade-off. The compiler sacrifices a small amount of memory space (the padding bytes) to gain a significant boost in execution speed by ensuring data is aligned for single-cycle CPU access.&lt;/p&gt;
&lt;h2&gt;
  
  
  Packed Structures: When to Use Them (and When Not To)
&lt;/h2&gt;

&lt;p&gt;After understanding structure padding, a natural question arises:&lt;br&gt;
If padding costs memory, why not just remove it?&lt;br&gt;
C allows you to do exactly that using packed structures, commonly through &lt;code&gt;#pragma pack&lt;/code&gt; or compiler-specific attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Packing through &lt;code&gt;#pragma pack&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#pragma pack(1)
&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="cp"&gt;#pragma pack()
&lt;/span&gt;&lt;span class="c1"&gt;// total size = 6 bytes, no padding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Packing through compiler-specific attributes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PackedExample&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;__attribute__&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;packed&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells the compiler to ignore natural alignment rules and place structure members back-to-back with no padding.&lt;/p&gt;

&lt;p&gt;At first glance, this looks like an optimization. In practice, it’s a trade-off, and often dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What packing actually does&lt;/strong&gt;&lt;br&gt;
Packing affects only memory layout, not CPU behavior.&lt;br&gt;
When you pack a structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Padding bytes are removed&lt;/li&gt;
&lt;li&gt;Multi-byte fields may become misaligned&lt;/li&gt;
&lt;li&gt;The size reported by &lt;code&gt;sizeof&lt;/code&gt; becomes smaller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What does not change:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the CPU fetches memory&lt;/li&gt;
&lt;li&gt;Alignment requirements of the architecture&lt;/li&gt;
&lt;li&gt;Cost of misaligned access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CPU still expects aligned data.&lt;br&gt;
Packing simply removes the compiler’s safety net.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reducing Padding by Reordering Members&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider this reordered structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;optimized&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="c1"&gt;// 2 bytes of padding at the end&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// total size = 8 bytes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structure contains the same data as the original version, but the total size is reduced from 12 bytes to 8 bytes without using packed attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this layout is better&lt;/strong&gt;&lt;br&gt;
The key change is member ordering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;int&lt;/code&gt; field, which requires 4-byte alignment, is placed first&lt;/li&gt;
&lt;li&gt;The smaller &lt;code&gt;char&lt;/code&gt; fields follow it&lt;/li&gt;
&lt;li&gt;Padding is pushed to the end of the structure, not between members&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layout allows the compiler to satisfy alignment rules with minimal padding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is better than packing&lt;/strong&gt;&lt;br&gt;
Compared to a packed structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All fields remain naturally aligned&lt;/li&gt;
&lt;li&gt;The CPU can access &lt;code&gt;int b&lt;/code&gt; in one aligned memory read&lt;/li&gt;
&lt;li&gt;No risk of misaligned access faults&lt;/li&gt;
&lt;li&gt;Performance and portability are preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the compiler-friendly way to reduce padding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The general rule&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Order structure members from largest alignment requirement to smallest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This simple rule often eliminates most padding automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-Off: Memory vs Performance
&lt;/h2&gt;

&lt;p&gt;In engineering, there is rarely a perfect solution, only trade-offs.&lt;br&gt;
Structure padding exists because software has to choose between two competing goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimize memory usage&lt;/li&gt;
&lt;li&gt;Maximize execution speed and safety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You rarely get both at the same time.&lt;br&gt;
Padding is the compiler’s way of deliberately choosing performance and correctness over absolute memory compactness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when you minimize memory&lt;/strong&gt;&lt;br&gt;
When fields are packed tightly with no padding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structures are smaller&lt;/li&gt;
&lt;li&gt;Cache and RAM usage is reduced&lt;/li&gt;
&lt;li&gt;Memory footprints look efficient on paper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the cost is hidden:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-byte fields may become misaligned&lt;/li&gt;
&lt;li&gt;The CPU may need multiple memory reads for a single variable&lt;/li&gt;
&lt;li&gt;Extra instructions are required to assemble the value&lt;/li&gt;
&lt;li&gt;On some architectures, misaligned access can trap or crash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, you save bytes but pay in cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when you allow padding&lt;/strong&gt;&lt;br&gt;
When padding is introduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structures become slightly larger&lt;/li&gt;
&lt;li&gt;Some memory appears “unused”&lt;/li&gt;
&lt;li&gt;The size reported by &lt;code&gt;sizeof&lt;/code&gt; increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the benefits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is naturally aligned&lt;/li&gt;
&lt;li&gt;The CPU fetches values in one memory access&lt;/li&gt;
&lt;li&gt;Code executes faster and more predictably&lt;/li&gt;
&lt;li&gt;Hardware behavior becomes simpler and safer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You spend a few bytes to save CPU time.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>lowcode</category>
      <category>computerscience</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
