S2S Compilers: Understanding Switch Case Statements
This blog is sponsored and supported by Voxgig.
Motivated and inspired by Decl. In development and soon to be open-sourced.
Table Of Contents
- S2S Compilers: Understanding Switch Case Statements
- About Author
- Introduction
- Motivation
- Source Code
- Prerequisites
- Background
- GCC Options
- Headers and Macros
- Switch case VS If-else
- If-else statement
- Switch Case
- Benchmarking the two
- Out-of-Bounds Benchmark
- Turning a switch case into a jump table
- Is there a limit for the values of a switch case?
- Breaking switch case: Non-packed values
- Multiple Jump Tables
- If-else as a jump table
- Implementing Switch Case
- Jump Table with function pointers
- Computed GOTO: Goto Jump Table with Labels
- Conclusion
- References
About Author
My name is Aleksandar Milenkovic. I am a senior software engineer at Voxgig and a self-driven coder, an autodidact who creates software and improves its quality at any level.
If you have any feedback or questions, feel free to reach out to me via LinkedIn.
Introduction
Source-to-source compilers transpile the source code of a program into another language (the target language). The target is typically a high-level language that is supported by all platforms, so that compatibility is not a problem.
If you are turning your source code into languages such as C or C++, you need a solid understanding of C/C++. Since these languages also have their own compilers, be it the GNU Compiler Collection or Clang, we have to do a lot of digging and researching around their features and functionality. There is a lot of benefit in that once the target codebase grows and developers start reusing the target source as a component rather than referring back to the source language code. For more context, also check out Wikipedia | Source-to-source compiler | Programming language implementations.
In many cases, we end up benchmarking a lot of the target source code generated by our S2S compiler and comparing it to plain, hand-written C/C++. We then find a lot of improvements we can make to the target source itself. The goal is the quality and performance of the target source.
As the first instalment of this series, we dive into the usage of the switch case statement: optimizing it, comparing it to if-else, and implementing our own switch case, with all of the assembly conversions and benchmarks to back it all up. We use GCC (GNU Compiler Collection) and cover some of its options in detail.
Even if you are not interested in S2S compilers, as a C/C++ developer this blog should still come in handy, with all of its examples and information.
This blog series is named S2S (Source-to-source) Compilers, as all of the code and content is motivated by building an S2S compiler that takes in the source language and translates it into C++. By benchmarking and analyzing the target source (C++), I have realized how many improvements there are to make. More on the motivation in the following section.
Motivation
As a senior software engineer at Voxgig, I have been building SDKs in various programming languages together with the team. For our utilities, this brought about a lot of performance concerns: as we port from one language to another, we try to keep the two (or more) implementations as close as possible side by side. That led us into the waters of language-specific implementations, where we have pushed to get the most out of each language so as to make the SDK as performant as possible for the developer using it in the respective language. For example, the C++ implementation puts more focus on performance, since optimizing code takes more deliberate effort in C++ than it does in Python, Ruby, or JavaScript.
At the same time while at Voxgig, I have been building a dynamic programming language that transpiles into C++. Along the way, I have done a lot of research and benchmarked plenty of C++ code in order to hit the maximum runtime performance of the target language (C++).
All of this work was motivated by the dynamic programming language called Decl, soon to be open-sourced.
To be frank, the reason for C++ as the target language is that it is much easier to build a garbage collector by just implementing a reference counter in a wrapper class. The state of dynamically allocated objects is then managed by the implicit destructors of C++, again, through the reference counting implementation. However, we won't cover any of that in this blog, but stay ready for the next instalment :).
All of the knowledge that I have acquired turned out to be very useful for me and, most of all, our team.
If you are interested in building SDKs for one or even more programming languages, please contact Voxgig.
Again, huge thanks to Voxgig for enabling me to put in the time for this research and writing the blog.
Source Code
The source code for this blog is all in one file, in the GitHub repository where it is also hosted. The code contains the main function that runs all the benchmarks in order.
Prerequisites
Background
This blog is intended for C/C++ programmers who have some essential knowledge of assembly. However, if you haven't had any experience with assembly before, read on and see how you get on!
GCC Options
You can use the following gcc
command to dissect assembly code:
gcc src/main.cpp -g -o output.s -masm=intel -fverbose-asm -S -Werror
As the Using the GNU Compiler Collection Documentation indicates, we use -fverbose-asm
to "put extra commentary information in the generated assembly code to make it more readable".
For example,
#include <stdio.h>
int main() {
int i = 0;
printf("%d\n", i);
printf("%d%d\n", i, i);
return 0;
}
Generates the following assembly output with comments indicating the line number and its content.
main:
.LFB0:
.file 1 "main.cpp"
.loc 1 4 12
.cfi_startproc
endbr64
push rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp #,
.cfi_def_cfa_register 6
sub rsp, 16 #,
# main.cpp:5: int i = 0;
.loc 1 5 7
mov DWORD PTR -4[rbp], 0 # i,
# main.cpp:8: printf("%d\n", i);
.loc 1 8 9
mov eax, DWORD PTR -4[rbp] # tmp84, i
mov esi, eax #, tmp84
lea rax, .LC0[rip] # tmp85,
mov rdi, rax #, tmp85
mov eax, 0 #,
call printf@PLT #
# main.cpp:9: printf("%d%d\n", i, i);
.loc 1 9 9
mov edx, DWORD PTR -4[rbp] # tmp86, i
mov eax, DWORD PTR -4[rbp] # tmp87, i
mov esi, eax #, tmp87
lea rax, .LC1[rip] # tmp88,
mov rdi, rax #, tmp88
mov eax, 0 #,
call printf@PLT #
# main.cpp:11: return 0;
.loc 1 11 10
mov eax, 0 # _5,
# main.cpp:12: }
.loc 1 12 1
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
We also use -masm=dialect with intel so that the assembly is generated using Intel syntax rather than the default AT&T syntax. Note that this option only affects the syntax of the output, not processor-specific optimization. I personally prefer this syntax.
For the assembly examples, we use the Compiler Explorer, since demangling of identifiers in the assembly comes built-in and makes the code more readable and easier to follow. Compiler Explorer also allows you to follow the code by hovering over a line or scrolling to source, pointing you directly at the corresponding assembly. Anyway, see what works best for you!
We will have pairs of snippets where we introduce the C high-level source code and then show the generated assembly, so that we can make comparisons and draw conclusions - subtle as they may be in some cases. But don't forget that any assembly instruction can make a difference.
Headers and Macros
For benchmarking, we pretty much use printf and clock(), wrapped in the following macros.
#include <stdio.h>
#include <time.h>
#define start_time clock_t s_t_a_r_t = clock();
#define end_time printf("[CPU Time Used: %f]\n", (double)(clock() - s_t_a_r_t) / CLOCKS_PER_SEC);
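As a minimal sketch of how these macros are used (the full benchmark harness appears later in the blog), we simply bracket the code we want to time:
// Assumes the #includes and the start_time / end_time macros defined above.
int main() {
    start_time;                                   // expands to: clock_t s_t_a_r_t = clock();
    for(volatile int i = 0; i < 1000000; i++) {   // some work we want to time
    }
    end_time;                                     // prints "[CPU Time Used: ...]" in seconds
    return 0;
}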
Switch case VS If-else
Let's start by comparing the if-else to the switch statement. Note that in most of the examples we don't use any optimization flags, since we want the most accurate line-by-line assembly translation and want to keep our code "clean" so that we can also benchmark our implementations accurately.
For example, if we turn on the -O1 flag, the compiler, if it finds it appropriate, will generate a jump table even in the if-else example once it determines that the values are densely packed. For more information, please refer to Using the GNU Compiler Collection | 3.12 Options That Control Optimization.
🚧 Terminology
Beware that the "packed" word is used in relation to jump tables and has nothing to do with C Packed Structures.
You might find the following code examples, for lack of a better word, dumb, since we just have integers that are assigned based on their own value, so why can't we simply go with:
r = type;
or just use the Using the GNU Compiler Collection | 6.18.11 Case Ranges.
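For reference, here is a minimal sketch of what that would look like with GCC's case-ranges extension (the case low ... high: syntax is a GCC extension, not standard C/C++):
// Sketch using GCC's case-ranges extension (not standard C/C++).
void _switch_range(unsigned int type) {
    volatile int r; // NOTE: preventing compiler from optimizing it out
    switch(type) {
        case 0 ... 10: { // matches every value from 0 to 10 inclusive
            r = type;
            break;
        }
        default: {
            r = 11;
            break;
        }
    }
}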
Well, that wouldn't be the point of this blog. The individual r=0, r=1, ..., r=n assignments are supposed to represent the code blocks executed at the point they are reached. Another reason is that we want to trick the compiler into generating a separate, minimal label for each of the conditions.
The whole point is to make assembly output smaller so that we can focus on the important assembly instructions at hand such as CALL
, CMP
, JMP
, etc.
The reason behind naming our parameter type is that, in the context of creating a dynamic programming language where the current member value is wrapped in a union, you would want an enum in the switch case to determine, as efficiently as possible, which member is active so that you can apply certain operations against it. Check out Union-like classes.
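To illustrate that pattern, here is a minimal, hypothetical sketch of a tagged union (the names Value, Tag and the members are made up for illustration and are not taken from Decl): the enum records which union member is active, and the switch dispatches on it.
#include <stdio.h>

// Hypothetical tagged-union sketch; not the actual Decl implementation.
typedef enum { TAG_INT, TAG_DOUBLE, TAG_STRING } Tag;

typedef struct {
    Tag tag;            // which union member is currently active
    union {
        int i;
        double d;
        const char* s;
    } as;
} Value;

static void print_value(const Value* v) {
    switch(v->tag) {    // the enum is what the switch (and its jump table) dispatches on
        case TAG_INT:    printf("%d\n", v->as.i); break;
        case TAG_DOUBLE: printf("%f\n", v->as.d); break;
        case TAG_STRING: printf("%s\n", v->as.s); break;
    }
}

int main() {
    Value v;
    v.tag = TAG_INT;
    v.as.i = 42;
    print_value(&v);
    return 0;
}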
If-else statement
The following is the if-else
chain of statements comparing the unsigned integers 0 through 10.
📘 Note
In examples like this, we use the volatile keyword in order to prevent the compiler from eliminating our code in cases where the -O3 or -O2 optimizations are enabled; each of these is a collection of optimization flags for the C/C++ code. For more information, see Using the GNU Compiler Collection | 3.12 Options That Control Optimization. We are not going to go into too much detail on these flags in this blog, but we will cover them in the next instalment, where we show how to carefully pick and use them according to our needs, giving us more control over our flags rather than enabling a whole collection of them.
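Before the if-else chain itself, here is a small sketch of that volatile point (a sketch only; the exact assembly depends on the GCC version): at -O2 the store into a plain local that is never read is typically eliminated as dead code, while the store into a volatile local has to be kept.
// Without volatile: at -O2 the store to r is dead and is typically removed,
// so the whole body can collapse to little more than a ret.
void _no_volatile(unsigned int type) {
    int r;
    if(type == 0) { r = 0; } else { r = 1; }
    (void)r; // never read afterwards
}

// With volatile: the compiler must assume the store is observable and keep it.
void _with_volatile(unsigned int type) {
    volatile int r;
    if(type == 0) { r = 0; } else { r = 1; }
}
With that out of the way, here is the if-else chain itself.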
void _if_run(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
if(type == 0) {
r = 0;
} else if(type == 1) {
r = 1;
} else if(type == 2) {
r = 2;
} else if(type == 3) {
r = 3;
} else if(type == 4) {
r = 4;
} else if(type == 5) {
r = 5;
} else if(type == 6) {
r = 6;
} else if(type == 7) {
r = 7;
} else if(type == 8) {
r = 8;
} else if(type == 9) {
r = 9;
} else if(type == 10) {
r = 10;
} else {
r = 11;
}
}
To stay consistent with the switch case example, we use unsigned int, which represents the range 0 to 4,294,967,295. The reason is that the switch case values must start from 0 for the jump table to be generated effectively, since the jump table is essentially an array and its subscripts cannot go below 0. Using an unsigned type also means the switch case needs only one CMP instruction before the JMP instructions, so that it can decide whether to jump straight to the default case or go on to the jump table, where the goto jumps to the corresponding label.
For example, this code takes two CMP
instructions.
void _cmp(int t) {
if(t < 0 || t > 5) {
return;
}
}
_cmp:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
cmp DWORD PTR [rbp-4], 0
js .L19
cmp DWORD PTR [rbp-4], 5
jmp .L16
.L19:
nop
.L16:
pop rbp
ret
Whereas
void _cmp(unsigned int t) {
if(t > 5) {
return;
}
}
Only one:
_cmp:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
cmp DWORD PTR [rbp-4], 5
pop rbp
ret
In the assembly output of the C high-level code, we can see that the GCC compiler emits a CMP for each statement. This is of course O(1) in the best case, given that the very first condition matches. However, if none match, falling all the way through to the else gives a worst-case time complexity of O(n). The average time complexity really depends on what the code does and how often the compared value has to proceed further into the chain, which means more CMPs before a condition is met.
This is where the switch case comes in to save the day with an average time complexity of O(1), but we have to make sure that the switch case is optimized into a jump table, which we cover in the next section.
_if_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 0
jne .L32
mov DWORD PTR [rbp-4], 0
jmp .L44
.L32:
cmp DWORD PTR [rbp-20], 1
jne .L34
mov DWORD PTR [rbp-4], 1
jmp .L44
.L34:
cmp DWORD PTR [rbp-20], 2
jne .L35
mov DWORD PTR [rbp-4], 2
jmp .L44
.L35:
cmp DWORD PTR [rbp-20], 3
jne .L36
mov DWORD PTR [rbp-4], 3
jmp .L44
.L36:
cmp DWORD PTR [rbp-20], 4
jne .L37
mov DWORD PTR [rbp-4], 4
jmp .L44
.L37:
cmp DWORD PTR [rbp-20], 5
jne .L38
mov DWORD PTR [rbp-4], 5
jmp .L44
.L38:
cmp DWORD PTR [rbp-20], 6
jne .L39
mov DWORD PTR [rbp-4], 6
jmp .L44
.L39:
cmp DWORD PTR [rbp-20], 7
jne .L40
mov DWORD PTR [rbp-4], 7
jmp .L44
.L40:
cmp DWORD PTR [rbp-20], 8
jne .L41
mov DWORD PTR [rbp-4], 8
jmp .L44
.L41:
cmp DWORD PTR [rbp-20], 9
jne .L42
mov DWORD PTR [rbp-4], 9
jmp .L44
.L42:
cmp DWORD PTR [rbp-20], 10
jne .L43
mov DWORD PTR [rbp-4], 10
jmp .L44
.L43:
mov DWORD PTR [rbp-4], 11
.L44:
nop
pop rbp
ret
Switch Case
In case (no pun intended) you are wondering why we are not using the enum keyword and enumeration values, that is to demonstrate that those enums are essentially unsigned int values starting from 0. So if you are using an enum, the assembly output is going to be exactly the same when it comes to the switch statement. If you are looking for the enum version of the code, please check out _switch_enum_run in the GitHub repository source code or the Default Case: Can we do without it section.
For example,
#include <stdio.h>
#include <assert.h>
enum N {
N0,
N1,
N2,
N3,
};
int main()
{
assert(sizeof(unsigned int) == sizeof(N0));
// Cast to unsigned int
assert((unsigned int)N0 == 0);
// No need to cast if not necessary
assert(N0 == 0);
return 0;
}
🚧 Note
Please understand that I am not advising casting
enums
in any shape or form.If you are using
func(enum T)
, always use one of theenum
values directly - notunsigned int
unless your function isfunc(unsigned int)
. It doesn't necessarily take a toll on the runtime performance but it is a bit unconventional in the high-level code.
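As a tiny sketch of what the note means (the names here are hypothetical, made up for illustration): when the parameter has the enum type, pass the enumerator itself rather than a casted integer.
enum N { N0, N1, N2, N3 };

static void handle(enum N type) { (void)type; } // func(enum T)

void caller(void) {
    handle(N2);         // preferred: pass the enumerator directly
    handle((enum N)2u); // compiles, but casting an integer here is unconventional
}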
The following is a simple switch case of 0 through 10 with the default
case.
void _switch_run(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
case 4: {
r = 4;
break;
}
case 5: {
r = 5;
break;
}
case 6: {
r = 6;
break;
}
case 7: {
r = 7;
break;
}
case 8: {
r = 8;
break;
}
case 9: {
r = 9;
break;
}
case 10: {
r = 10;
break;
}
default: {
r = 11;
break;
}
}
}
_switch_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L2
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L14
.quad .L13
.quad .L12
.quad .L11
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
.L14:
mov DWORD PTR [rbp-4], 0
jmp .L15 # break
.L13:
mov DWORD PTR [rbp-4], 1
jmp .L15
.L12:
mov DWORD PTR [rbp-4], 2
jmp .L15
.L11:
mov DWORD PTR [rbp-4], 3
jmp .L15
.L10:
mov DWORD PTR [rbp-4], 4
jmp .L15
.L9:
mov DWORD PTR [rbp-4], 5
jmp .L15
.L8:
mov DWORD PTR [rbp-4], 6
jmp .L15
.L7:
mov DWORD PTR [rbp-4], 7
jmp .L15
.L6:
mov DWORD PTR [rbp-4], 8
jmp .L15
.L5:
mov DWORD PTR [rbp-4], 9
jmp .L15
.L3:
mov DWORD PTR [rbp-4], 10
jmp .L15
.L2: # Default Case Label
mov DWORD PTR [rbp-4], 11
nop
.L15: # Break Label
nop
pop rbp
ret
These instructions are pretty straightforward to comprehend. First, we have the CMP instruction that checks whether the unsigned integer is within the bound of 10. As noted above, this is the deciding instruction for whether to jump straight to the default case.
cmp DWORD PTR [rbp-20], 10
ja .L2
JA (Jump if Above) is the jump used when the comparison is unsigned. This makes sense given that we are comparing an unsigned int.
Roughly, the C equivalent of this code is the following:
void _switch_run(unsigned int type) {
if(type > 10) {
goto L2;
}
// ...
}
Finally, we get to the jump table:
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L14
.quad .L13
.quad .L12
.quad .L11
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
We can see that the compiler has essentially generated the .L4
, a static
array of labels that represent the code blocks that we defined in our switch case. The cost of this JMP
is constant O(1)
.
In C, this code is roughly equivalent to the following, which is covered thoroughly in the Implementing Switch Case section.
static void* cases[] = {
&&l00,
&&l01,
&&l02,
&&l03,
&&l04,
&&l05,
&&l06,
&&l07,
&&l08,
&&l09,
&&l10,
};
// CMP For the default case: "cmp DWORD PTR [rbp-20], 10"
if(v > 10) {
goto _default;
}
goto *cases[v];
volatile int r;
l00: {
r = 0;
goto defer; // break
}
l01: {
r = 1;
goto defer; // break
}
l02: {
r = 2;
goto defer; // break
}
/* ... */
l10: {
r = 10;
goto defer;
}
_default: {
r = 11;
goto defer; // break
}
// For us, defer serves as the alternative to where the "break" keyword jumps to as far as the switch case and jump table
defer: {
return;
}
Benchmarking the two
Finally, let's put our code to use and benchmark it to demonstrate our assembly breakdown at runtime.
We are going to use the value 10, since that is the last condition in both statements. That way, we force the worst case of if-else and confirm that the switch case has constant time complexity: O(1).
int main() {
volatile unsigned int value = 10; // NOTE: preventing compiler from optimizing it out
{
printf("_switch_run\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_switch_run(value);
}
end_time;
}
printf("_switch_run\n");
}
{
printf("_if_run\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_if_run(value);
}
end_time;
}
printf("_if_run\n");
}
return 0;
}
The code produces the following output:
_switch_run
[CPU Time Used: 0.002417]
[CPU Time Used: 0.002171]
[CPU Time Used: 0.002241]
[CPU Time Used: 0.002250]
[CPU Time Used: 0.002212]
[CPU Time Used: 0.002244]
[CPU Time Used: 0.002201]
[CPU Time Used: 0.002209]
[CPU Time Used: 0.002329]
[CPU Time Used: 0.002238]
_switch_run
_if_run
[CPU Time Used: 0.005600]
[CPU Time Used: 0.005801]
[CPU Time Used: 0.005637]
[CPU Time Used: 0.005700]
[CPU Time Used: 0.005804]
[CPU Time Used: 0.005658]
[CPU Time Used: 0.005693]
[CPU Time Used: 0.005757]
[CPU Time Used: 0.005792]
[CPU Time Used: 0.005832]
_if_run
Out-of-Bounds Benchmark
Let's see what happens if we provide a value that matches none of our conditions.
With the switch case, that means the default label; with the if-else chain, the final else statement.
int main() {
volatile unsigned int value = 11; // NOTE: preventing compiler from optimizing it out
{
printf("_switch_run\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_switch_run(value);
}
end_time;
}
printf("_switch_run\n");
}
{
printf("_if_run\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_if_run(value);
}
end_time;
}
printf("_if_run\n");
}
return 0;
}
In this particular scenario, the switch comes out victorious again, and by an even larger margin, as we can see from the following results.
_switch_run
[CPU Time Used: 0.001884]
[CPU Time Used: 0.001813]
[CPU Time Used: 0.001726]
[CPU Time Used: 0.002118]
[CPU Time Used: 0.001847]
[CPU Time Used: 0.001746]
[CPU Time Used: 0.001817]
[CPU Time Used: 0.001675]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001679]
_switch_run
_if_run
[CPU Time Used: 0.005802]
[CPU Time Used: 0.005815]
[CPU Time Used: 0.005619]
[CPU Time Used: 0.005913]
[CPU Time Used: 0.005581]
[CPU Time Used: 0.005722]
[CPU Time Used: 0.005725]
[CPU Time Used: 0.005770]
[CPU Time Used: 0.005779]
[CPU Time Used: 0.005683]
_if_run
This is due to the fact that, out of all the code generated for _switch_run:
_switch_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L18
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L20[0+rax*8]
jmp rax
.L20:
.quad .L30
.quad .L29
.quad .L28
.quad .L27
.quad .L26
.quad .L25
.quad .L24
.quad .L23
.quad .L22
.quad .L21
.quad .L19
It is only this portion of the code that actually runs:
_switch_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L18
.L18:
mov DWORD PTR [rbp-4], 11
nop
Once the CMP determines that the value is above 10 and JA jumps to .L18, that is the end of our function's lifetime. In other words, it doesn't even touch the jump table.
Turning a switch case into a jump table
The GCC compiler is not always going to turn a switch case into a jump table. Only when there are enough cases does the compiler even consider that optimization.
For example, take the following code and its respective assembly output.
void _switch_test(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
default: {
r = 11;
break;
}
}
}
This is basically an if-else statement, which we can tell by the CMP and JE (Jump if Equal) instructions. There is an interesting detail: the values are compared backwards, starting from 3 rather than from 0, as opposed to the if-else example.
_switch_test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 3
je .L18
cmp DWORD PTR [rbp-20], 3
ja .L19
cmp DWORD PTR [rbp-20], 2
je .L20
cmp DWORD PTR [rbp-20], 2
ja .L19
cmp DWORD PTR [rbp-20], 0
je .L21
cmp DWORD PTR [rbp-20], 1
je .L22
jmp .L19
.L21:
mov DWORD PTR [rbp-4], 0
jmp .L23
.L22:
mov DWORD PTR [rbp-4], 1
jmp .L23
.L20:
mov DWORD PTR [rbp-4], 2
jmp .L23
.L18:
mov DWORD PTR [rbp-4], 3
jmp .L23
.L19:
mov DWORD PTR [rbp-4], 11
nop
.L23:
nop
pop rbp
ret
Even with the -O3 optimizations enabled, this code is not turned into a jump table.
The GCC compiler decides whether to generate a jump table based on whether there are enough densely packed values.
So for the following code, we do get a jump table generated.
void _switch_test(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
case 4: {
r = 4;
break;
}
default: {
r = 11;
break;
}
}
}
There is our jump table by the name of .L20
!
_switch_test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 4
ja .L18
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L20[0+rax*8]
jmp rax
.L20:
.quad .L24
.quad .L23
.quad .L22
.quad .L21
.quad .L19
.L24:
mov DWORD PTR [rbp-4], 0
jmp .L25
.L23:
mov DWORD PTR [rbp-4], 1
jmp .L25
.L22:
mov DWORD PTR [rbp-4], 2
jmp .L25
.L21:
mov DWORD PTR [rbp-4], 3
jmp .L25
.L19:
mov DWORD PTR [rbp-4], 4
jmp .L25
.L18:
mov DWORD PTR [rbp-4], 11
nop
.L25:
nop
pop rbp
ret
With all of that being said, there is a way to force generating a jump table at all times by specifying --param case-values-threshold=1 as one of the GCC command options.
According to Using the GNU Compiler Collection Documentation:
- case-values-threshold - The smallest number of different values for which it is best to use a jump-table instead of a tree of conditional branches. If the value is 0, use the default for the machine.
- -fno-jump-tables - Do not use jump tables for switch statements even where it would be more efficient than other code generation strategies. This option is of use in conjunction with ‘-fpic’ or ‘-fPIC’ for building code that forms part of a dynamic linker and cannot reference the address of a jump table. On some targets, jump tables do not require a GOT and this option is not needed.
📘 Note
The closest "opposite" to case-values-threshold is probably -fno-jump-tables, where, no matter the number of values/cases, the block is turned into a series of CMP and JMP instructions.
As a side note, there is an option called -fjump-tables that I managed to dig out of the Using the GNU Compiler Collection Documentation, but I am assuming it is the default, since I couldn't find a description of it and it is -fno-jump-tables, if specified, that overrides it.
gcc src/main.cpp -Werror -g -o output.s -masm=intel -fverbose-asm -S --param case-values-threshold=1
Even if we have a switch case such as the following.
void _switch_test(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
default: {
r = 11;
break;
}
}
}
The output is:
_switch_test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 1
ja .L18
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L20[0+rax*8]
jmp rax
.L20:
.quad .L21
.quad .L19
.L21:
mov DWORD PTR [rbp-4], 0
jmp .L22
.L19:
mov DWORD PTR [rbp-4], 1
jmp .L22
.L18:
mov DWORD PTR [rbp-4], 11
nop
From this, we can conclude that using a jump table for as few as one, two, or three values is redundant and rather impractical.
Is there a limit for the values of a switch case?
Since a jump table is basically a static array, there should be some limit to how many values we can pack into that jump table.
There is no particular way to explicitly test when the jump table stops being used once there are thousands upon thousands of cases - unless we want to torture our computer (and maybe buy a new one soon) by generating the cases with a script like this.
if __name__ == '__main__':
    code = '''
void __switch_test(unsigned int type) {
    volatile int r; // NOTE: preventing compiler from optimizing it out
    switch(type) {
'''
    for i in range(0, 100000):
        code += 'case ' + str(i) + ': { \n' + '    r = ' + str(i) + ';\n    break;\n}\n'
    code += 'default: {\n    r = 11;\n    break;\n}\n'
    code += '}'
    code += '}'
    print(code)
python3 generate.py > _out.c && gcc _out.c -g -o output.s -masm=intel -fverbose-asm -S
To answer the question in the title: there is no limit at which the GCC compiler stops generating a jump table, as long as all the values are densely packed.
As long as there is memory, the output is going to be a jump table. Please check out Using the GNU Compiler Collection | 4.12 Statements.
There are some options to consider for restricting the usage of jump tables.
- -fno-jump-tables - Do not use jump tables for switch statements even where it would be more efficient than other code generation strategies. This option is of use in conjunction with ‘-fpic’ or ‘-fPIC’ for building code that forms part of a dynamic linker and cannot reference the address of a jump table. On some targets, jump tables do not require a GOT and this option is not needed.
- jump-table-max-growth-ratio-for-size - The maximum code size growth ratio when expanding into a jump table (in percent). The parameter is used when optimizing for size.
- jump-table-max-growth-ratio-for-speed - The maximum code size growth ratio when expanding into a jump table (in percent). The parameter is used when optimizing for speed.
Just as the compiler did not use a jump table in the earlier small example, note that with -fno-jump-tables the GCC compiler creates a set of CMP and JMP instructions, in reverse order with respect to the values as originally defined.
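For reference, the same command as before with -fno-jump-tables added should reproduce that behaviour:
gcc src/main.cpp -Werror -g -o output.s -masm=intel -fverbose-asm -S -fno-jump-tables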
Breaking switch case: Non-packed values
This is a rather weird use case for the switch statement, but it is worth exploring to see what happens when we stray away from densely packed values.
void _switch_unpacked(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
case 4: {
r = 4;
break;
}
case 5: {
r = 5;
break;
}
case 30: {
r = 30;
break;
}
default: {
r = 11;
break;
}
}
}
The GCC compiler handles this code in a very interesting way. It fills all the case slots between 5 and 30 with filler entries, and these all lead to the same .L18 default case jump.
_switch_unpacked:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 30
ja .L18
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L20[0+rax*8]
jmp rax
.L20:
.quad .L26
.quad .L25
.quad .L24
.quad .L23
.quad .L22
.quad .L21
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L18
.quad .L19
.L26:
mov DWORD PTR [rbp-4], 0
jmp .L27
.L25:
mov DWORD PTR [rbp-4], 1
jmp .L27
.L24:
mov DWORD PTR [rbp-4], 2
jmp .L27
.L23:
mov DWORD PTR [rbp-4], 3
jmp .L27
.L22:
mov DWORD PTR [rbp-4], 4
jmp .L27
.L21:
mov DWORD PTR [rbp-4], 5
jmp .L27
.L19:
mov DWORD PTR [rbp-4], 30
jmp .L27
.L18:
mov DWORD PTR [rbp-4], 11
nop
.L27:
nop
pop rbp
ret
It is rational to think that if we use 100 instead of 30, the compiler won't waste space on all the filler entries that point to the default case and will instead use another CMP instruction. This is exactly what happens.
void _switch_unpacked(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
/* ... */
case 5: {
r = 5;
break;
}
case 100: {
r = 100;
break;
}
default: {
r = 11;
break;
}
}
}
We have two CMP instructions:
- First CMP: if the value is above 5, jump to .L18, which then checks (CMP) whether the value is 100.
cmp DWORD PTR [rbp-20], 5
ja .L18
# ...
.L18:
cmp DWORD PTR [rbp-20], 100
je .L27
jmp .L19
- Second CMP: checks again whether the value is above 5 - if it is, jump to the default. Otherwise, proceed with the jump table, where all of the values are still packed.
cmp DWORD PTR [rbp-20], 5
ja .L19
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L21[0+rax*8]
jmp rax
.L21:
.quad .L26
.quad .L25
.quad .L24
.quad .L23
.quad .L22
.quad .L20
# ...
.L19:
mov DWORD PTR [rbp-4], 11
nop
The full assembly output is the following:
_switch_unpacked:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 5
ja .L18
cmp DWORD PTR [rbp-20], 5
ja .L19
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L21[0+rax*8]
jmp rax
.L21:
.quad .L26
.quad .L25
.quad .L24
.quad .L23
.quad .L22
.quad .L20
.L18:
cmp DWORD PTR [rbp-20], 100
je .L27
jmp .L19
.L26:
mov DWORD PTR [rbp-4], 0
jmp .L28
.L25:
mov DWORD PTR [rbp-4], 1
jmp .L28
.L24:
mov DWORD PTR [rbp-4], 2
jmp .L28
.L23:
mov DWORD PTR [rbp-4], 3
jmp .L28
.L22:
mov DWORD PTR [rbp-4], 4
jmp .L28
.L20:
mov DWORD PTR [rbp-4], 5
jmp .L28
.L27:
mov DWORD PTR [rbp-4], 100
jmp .L28
.L19:
mov DWORD PTR [rbp-4], 11
nop
.L28:
nop
pop rbp
ret
These two CMP
s may appear to be a little unoptimized since they do pretty much the same thing. So let's go ahead and turn on the -O1
optimizations:
_switch_unpacked:
cmp edi, 5
ja .L2
ja .L3
mov edi, edi
jmp [QWORD PTR .L5[0+rdi*8]]
.L5:
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L6
.quad .L4
.L2:
cmp edi, 100
jne .L3
mov DWORD PTR [rsp-4], 100
ret
.L3:
mov DWORD PTR [rsp-4], 11
ret
Multiple Jump Tables
If we think a bit more about the non-packed values case, the compiler cannot keep using CMP instructions for ranges such as [0, 1, 2, 3, 4, 5, 100, 101, 102, 103, 104, 105]. We can deduce that there are actually two sets of packed values there: 0...5 and 100...105.
However, because they are two entirely separate sets, there must be an additional CMP instruction between them to check the out-of-bounds ranges and decide whether to jump to the default case.
Let's back up our thinking with the following example.
void _switch_unpacked(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
switch(type) {
case 0: {
r = 0;
break;
}
case 1: {
r = 1;
break;
}
case 2: {
r = 2;
break;
}
case 3: {
r = 3;
break;
}
case 4: {
r = 4;
break;
}
case 5: {
r = 5;
break;
}
case 100: {
r = 100;
break;
}
case 101: {
r = 101;
break;
}
case 102: {
r = 102;
break;
}
case 103: {
r = 103;
break;
}
case 104: {
r = 104;
break;
}
case 105: {
r = 105;
break;
}
default: {
r = 11;
break;
}
}
}
In this assembly output, there are two jump tables - separated by their respective CMP instructions.
- .L29: the 0...5 range jump table
- .L22: the 100...105 range jump table
_switch_unpacked:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 5
ja .L18
jmp .L37
.L35:
mov eax, DWORD PTR [rbp-20]
sub eax, 100
cmp eax, 5
ja .L20
mov eax, eax
mov rax, QWORD PTR .L22[0+rax*8]
jmp rax
.L22:
.quad .L27
.quad .L26
.quad .L25
.quad .L24
.quad .L23
.quad .L21
.L37:
cmp DWORD PTR [rbp-20], 5
ja .L20
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L29[0+rax*8]
jmp rax
.L29:
.quad .L34
.quad .L33
.quad .L32
.quad .L31
.quad .L30
.quad .L28
.L18:
cmp DWORD PTR [rbp-20], 105
ja .L20
cmp DWORD PTR [rbp-20], 100
jnb .L35
jmp .L20
.L34:
mov DWORD PTR [rbp-4], 0
jmp .L36
.L33:
mov DWORD PTR [rbp-4], 1
jmp .L36
.L32:
mov DWORD PTR [rbp-4], 2
jmp .L36
.L31:
mov DWORD PTR [rbp-4], 3
jmp .L36
.L30:
mov DWORD PTR [rbp-4], 4
jmp .L36
.L28:
mov DWORD PTR [rbp-4], 5
jmp .L36
.L27:
mov DWORD PTR [rbp-4], 100
jmp .L36
.L26:
mov DWORD PTR [rbp-4], 101
jmp .L36
.L25:
mov DWORD PTR [rbp-4], 102
jmp .L36
.L24:
mov DWORD PTR [rbp-4], 103
jmp .L36
.L23:
mov DWORD PTR [rbp-4], 104
jmp .L36
.L21:
mov DWORD PTR [rbp-4], 105
jmp .L36
.L20:
mov DWORD PTR [rbp-4], 11
nop
.L36:
nop
pop rbp
ret
If-else as a jump table
With no extra optimizations such as -O1, -O2, or -O3, the if-else statement is a set of CMP and JMP instructions checking each value, as detailed above.
Much to our surprise, if the conditional values are compared via the equal-to operator and are densely packed, the -O1
optimizations can do their magic and turn the if-else
statements into a jump table.
gcc src/main.cpp -g -o output.s -masm=intel -fverbose-asm -S -O1
void _if_run(unsigned int type) {
volatile int r; // NOTE: preventing compiler from optimizing it out
if(type == 0) {
r = 0;
} else if(type == 1) {
r = 1;
} else if(type == 2) {
r = 2;
} else if(type == 3) {
r = 3;
} else if(type == 4) {
r = 4;
} else if(type == 5) {
r = 5;
} else if(type == 6) {
r = 6;
} else if(type == 7) {
r = 7;
} else if(type == 8) {
r = 8;
} else if(type == 9) {
r = 9;
} else if(type == 10) {
r = 10;
} else {
r = 11;
}
}
_if_run:
cmp edi, 10
ja .L85
mov edi, edi
jmp [QWORD PTR .L87[0+rdi*8]]
.L87:
.quad .L97
.quad .L96
.quad .L95
.quad .L94
.quad .L93
.quad .L92
.quad .L91
.quad .L90
.quad .L89
.quad .L88
.quad .L86
.L97:
mov DWORD PTR [rsp-4], 0
ret
.L96:
mov DWORD PTR [rsp-4], 1
ret
.L95:
mov DWORD PTR [rsp-4], 2
ret
.L94:
mov DWORD PTR [rsp-4], 3
ret
.L93:
mov DWORD PTR [rsp-4], 4
ret
.L92:
mov DWORD PTR [rsp-4], 5
ret
.L91:
mov DWORD PTR [rsp-4], 6
ret
.L90:
mov DWORD PTR [rsp-4], 7
ret
.L89:
mov DWORD PTR [rsp-4], 8
ret
.L88:
mov DWORD PTR [rsp-4], 9
ret
.L86:
mov DWORD PTR [rsp-4], 10
ret
.L85:
mov DWORD PTR [rsp-4], 11
ret
In my opinion, this optimization may feel like overkill if you are an old-school person who looks at their assembly after compiling this and says, "Wait! I didn't use a switch case. Why is there a jump table?". For that reason, we must always stay up-to-date with the features of GCC.
Implementing Switch Case
In this section, we are going to cover two possible implementations in C that do not use the switch keyword.
For the sake of simplicity, we won't be embedding assembly via 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands. Even though we don't use assembly directly, we can come pretty close to the jump table implementation as the GCC compiler sees it.
The first example is taken from the Wikipedia | Branch table.
"Another simple example, this time demonstrating a jump table rather than a mere branch table. This allows program blocks outside of the currently active procedure/function to be called:"
Jump Table with function pointers
#include <stdio.h>
#include <stdlib.h>
typedef void (*Handler)(void); /* A pointer to a handler function */
/* The functions */
void func3 (void) { printf( "3\n" ); }
void func2 (void) { printf( "2\n" ); }
void func1 (void) { printf( "1\n" ); }
void func0 (void) { printf( "0\n" ); }
Handler jump_table[4] = { func0, func1, func2, func3 };
int main (int argc, char **argv) {
int value;
/* Convert first argument to 0-3 integer (modulus) */
value = atoi(argv[1]) % 4;
/* Call appropriate function (func0 thru func3) */
jump_table[value]();
return 0;
}
As the quote implies, this is a jump table rather than a mere branch table. This jump table consists of function pointers instead of labels. Labels are reached directly with JMP instructions, while a function call like this (function pointer calls cannot be inlined) uses the CALL instruction. In the simplest terms, CALL pushes the address of the next instruction onto the stack; this value is often referred to as the return address. When the called function finishes, the RET instruction pops the return address off the stack and jumps back to it. As a result, JMP instructions are faster than CALL, since plain jumps don't involve pushing and popping via the LIFO (Last-In-First-Out) stack. For more information, check out x86 assembly language.
🚧 Note
Another major disadvantage is that, compared to labels, we cannot easily use variables from the outer scope unless we pass them as pointer arguments, which generates more MOV instructions before the function CALL.
It is also important to note that this kind of function pointer call cannot be inlined, even with the -O1, -O2 or -O3 optimizations.
To prove this point, let's time the following implementation.
Implementation
typedef void (*Handler)(volatile int*);
void func3 (volatile int* r) { *r = 3; }
void func2 (volatile int* r) { *r = 2; }
void func1 (volatile int* r) { *r = 1; }
void func0 (volatile int* r) { *r = 0; }
void func_pointer_jump_table(unsigned int type) {
// NOTE: we use static here since we don't want this array to be initialized per call. Like this, it is preserved between function invocations.
static Handler jump_table[4] = { func0, func1, func2, func3 };
type = type % 4;
volatile int r;
jump_table[type](&r);
}
int main() {
{
std::cout << "func_pointer_jump_table" << std::endl;
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
func_pointer_jump_table(0);
}
end_time;
}
std::cout << "func_pointer_jump_table" << std::endl;
}
return 0;
}
Benchmark Result 1
func_pointer_jump_table
[CPU Time Used: 0.005723]
[CPU Time Used: 0.005588]
[CPU Time Used: 0.005661]
[CPU Time Used: 0.005637]
[CPU Time Used: 0.005622]
[CPU Time Used: 0.005693]
[CPU Time Used: 0.005615]
[CPU Time Used: 0.005656]
[CPU Time Used: 0.005703]
[CPU Time Used: 0.005668]
func_pointer_jump_table
We can see that this performance is almost identical to the worst case of the if-else statements. However, it can be improved a bit by replacing the modulo operator with an if statement that checks whether the index (unsigned int) value is in bounds.
Benchmark Result 2
void func_pointer_jump_table(unsigned int type) {
// NOTE: We use static here since we don't want this array to be initialized per call. Like this, it is preserved between function invocations.
static Handler jump_table[4] = { func0, func1, func2, func3 };
// type = type % 4; // Replace with the CMP instruction
if(type > 3) {
// ... Default case
return;
}
int r;
jump_table[type](&r);
}
int main() {
{
std::cout << "func_pointer_jump_table" << std::endl;
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
func_pointer_jump_table(0);
}
end_time;
}
std::cout << "func_pointer_jump_table" << std::endl;
}
return 0;
}
func_pointer_jump_table
[CPU Time Used: 0.005522]
[CPU Time Used: 0.005371]
[CPU Time Used: 0.005440]
[CPU Time Used: 0.005628]
[CPU Time Used: 0.005685]
[CPU Time Used: 0.005464]
[CPU Time Used: 0.005547]
[CPU Time Used: 0.005558]
[CPU Time Used: 0.005492]
[CPU Time Used: 0.005472]
func_pointer_jump_table
Given the benchmark results, the bounds check is slightly faster than the modulo operator.
In assembly,
type = type % 4;
translates into
and DWORD PTR [rbp-20], 3
if(type > 3) {
// ... Default case
return;
}
translates into
cmp DWORD PTR [rbp-20], 3
ja .L48
In the end, this is a good example to showcase the concept at a higher level, but it is definitely not the main choice when it comes to implementing our own switch case.
Covered in the next section, the closest implementation to the switch
case is using Using the GNU Compiler Collection | 6.18.3 Labels as Values.
Computed GOTO: Goto Jump Table with Labels
According to Using the GNU Compiler Collection | 6.18.3 Labels as Values, we have the luxury of actually making labels part of our data structures, stored as void* pointers. We take the address of a label with the unary operator &&.
Let's look at the following implementation:
void _switch_go(unsigned int v) {
// As mentioned before, we keep the array preserved between function invocations.
static void* cases[] = { // Jump Table
&&l00,
&&l01,
&&l02,
&&l03,
&&l04,
&&l05,
&&l06,
&&l07,
&&l08,
&&l09,
&&l10,
};
// CMP For the default case - also implemented by the switch case
if(v > 10) {
goto _default;
}
// Jumps to the entry
goto *cases[v];
volatile int r;
l00: {
r = 0;
goto defer; // break
}
l01: {
r = 1;
goto defer; // break
}
l02: {
r = 2;
goto defer; // break
}
l03: {
r = 3;
goto defer; // break
}
l04: {
r = 4;
goto defer; // break
}
l05: {
r = 5;
goto defer;
}
l06: {
r = 6;
goto defer;
}
l07: {
r = 7;
goto defer;
}
l08: {
r = 8;
goto defer;
}
l09: {
r = 9;
goto defer;
}
l10: {
r = 10;
goto defer;
}
_default: {
r = 11;
goto defer; // break
}
// For us, defer serves as the alternative to where the "break" keyword jumps to as far as the switch case and jump table
defer: {
return;
}
}
The assembly of the code above is:
cases.1:
.quad .L49
.quad .L51
.quad .L52
.quad .L53
.quad .L54
.quad .L55
.quad .L56
.quad .L57
.quad .L58
.quad .L59
.quad .L60
_switch_go:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L63
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR cases.1[0+rax*8]
nop
jmp rax
.L49:
mov DWORD PTR [rbp-4], 0
jmp .L50
.L51:
mov DWORD PTR [rbp-4], 1
jmp .L50
.L52:
mov DWORD PTR [rbp-4], 2
jmp .L50
.L53:
mov DWORD PTR [rbp-4], 3
jmp .L50
.L54:
mov DWORD PTR [rbp-4], 4
jmp .L50
.L55:
mov DWORD PTR [rbp-4], 5
jmp .L50
.L56:
mov DWORD PTR [rbp-4], 6
jmp .L50
.L57:
mov DWORD PTR [rbp-4], 7
jmp .L50
.L58:
mov DWORD PTR [rbp-4], 8
jmp .L50
.L59:
mov DWORD PTR [rbp-4], 9
jmp .L50
.L60:
mov DWORD PTR [rbp-4], 10
jmp .L50
.L63:
nop
mov DWORD PTR [rbp-4], 11
nop
.L50:
nop
pop rbp
ret
If we compare this with the switch case assembly from the example above, the two are practically identical in essence:
_switch_run:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L2
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L14
.quad .L13
.quad .L12
.quad .L11
.quad .L10
.quad .L9
.quad .L8
.quad .L7
.quad .L6
.quad .L5
.quad .L3
.L14:
mov DWORD PTR [rbp-4], 0
jmp .L15 # break
.L13:
mov DWORD PTR [rbp-4], 1
jmp .L15
.L12:
mov DWORD PTR [rbp-4], 2
jmp .L15
.L11:
mov DWORD PTR [rbp-4], 3
jmp .L15
.L10:
mov DWORD PTR [rbp-4], 4
jmp .L15
.L9:
mov DWORD PTR [rbp-4], 5
jmp .L15
.L8:
mov DWORD PTR [rbp-4], 6
jmp .L15
.L7:
mov DWORD PTR [rbp-4], 7
jmp .L15
.L6:
mov DWORD PTR [rbp-4], 8
jmp .L15
.L5:
mov DWORD PTR [rbp-4], 9
jmp .L15
.L3:
mov DWORD PTR [rbp-4], 10
jmp .L15
.L2: # Default Case Label
mov DWORD PTR [rbp-4], 11
nop
.L15: # Break Label
nop
pop rbp
ret
After all, the _switch_go
function becomes difficult to maintain once we start having nested switch cases. This is when the programmer needs to be extremely cautious with the goto
statements. If you insist on it, you can always write additional unit tests!
No optimization flags
_switch_go
[CPU Time Used: 0.002202]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002201]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002362]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002203]
[CPU Time Used: 0.002283]
[CPU Time Used: 0.002202]
[CPU Time Used: 0.002202]
_switch_go
With -O2 optimizations
_switch_go
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001662]
[CPU Time Used: 0.001691]
[CPU Time Used: 0.001976]
[CPU Time Used: 0.001709]
[CPU Time Used: 0.001746]
[CPU Time Used: 0.001774]
[CPU Time Used: 0.001760]
[CPU Time Used: 0.001691]
_switch_go
Switch Case with -O1 optimizations
If we inline _switch_run and optimize it with -O1, we can spot a significant difference.
static inline void _switch_run(unsigned int type) {
/* ... */
}
_switch_run
[CPU Time Used: 0.000762]
[CPU Time Used: 0.000784]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000772]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000734]
_switch_run
No Inline
Without inline, we get performance similar to _switch_go.
_switch_run
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001702]
[CPU Time Used: 0.001661]
[CPU Time Used: 0.001987]
[CPU Time Used: 0.001691]
[CPU Time Used: 0.001739]
[CPU Time Used: 0.001660]
[CPU Time Used: 0.001660]
[CPU Time Used: 0.001770]
_switch_run
This, however, depends on whether the compiler is able to inline a function.
The fact of the matter is that the compiler can inline the switch case function, as opposed to the computed goto, which is covered in the next section.
Inlining Computed GOTO
🚧 Optimization Problems
The most critical downside of this implementation is that it is hard for the GCC compiler to optimize
_switch_go
.
To attempt to inline the Computed GOTO, we can use static inline
in the function definition.
static inline void _switch_go(unsigned int v) {
/* ... */
}
After benchmarking this code with -O1 in an attempt to inline it, the performance is, more or less, the same as the function without inline.
int main() {
volatile unsigned int value = 10;
{
printf("_switch_go\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_switch_go(value);
}
end_time;
}
printf("_switch_go\n");
}
}
_switch_go
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001804]
[CPU Time Used: 0.001731]
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001721]
[CPU Time Used: 0.001756]
[CPU Time Used: 0.001711]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001727]
_switch_go
.L112:
mov edi, DWORD PTR [rsp+28]
call _switch_go
sub ebx, 1
jne .L112
call clock
The bottom line is it cannot be inlined.
According to 6.18.3 Labels as Values,
The &&foo
expressions for the same label might have different values if the containing function is inlined or cloned. If a program relies on them being always the same, `attribute((noinline,noclone)) should be used to prevent inlining and cloning. If
&&foo` is used in a static variable initializer, inlining and cloning is forbidden.
We can verify this with the following code:
static inline void _switch_go (unsigned int) __attribute__((always_inline));
error: function 'void _switch_go(unsigned int)' can never be copied because it saves address of local label in a static variable
error: function 'void _switch_go(unsigned int)' can never be inlined because it contains a computed goto
Therefore, we have to consider other approaches such as 6.18.2 Locally Declared Labels and 6.18.1 Statements and Declarations in Expressions in case we need to get the value out of the nested block scope.
Switch Go Macro
Let's wrap _switch_go inside a macro with local labels.
#define __switch_go(type) { \
__label__ l00, l01, l02, _default, defer; \
static void* jtable[] = {&&l00, &&l01, &&l02}; \
if((unsigned int)type > 2) goto _default; \
goto* jtable[type]; \
volatile int r; \
l00: { \
r = 0; \
goto defer; \
} \
l01: { \
r = 1; \
goto defer; \
} \
l02: { \
r = 2; \
goto defer; \
} \
_default: { \
r = 11; \
goto defer; \
} \
defer: { \
} \
}
int main() {
volatile unsigned int value3 = 2;
{
printf("__switch_go\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
__switch_go(value3);
}
end_time;
}
printf("__switch_go\n");
}
}
After benchmarking it, there is a significant speed up.
__switch_go
[CPU Time Used: 0.000851]
[CPU Time Used: 0.000759]
[CPU Time Used: 0.000760]
[CPU Time Used: 0.000714]
[CPU Time Used: 0.000795]
[CPU Time Used: 0.000771]
[CPU Time Used: 0.000789]
[CPU Time Used: 0.000781]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000771]
__switch_go
Getting the value out of the statement expression while using local labels is a bit tricky but possible:
#define switch_go(type) ({ \
__label__ l01, l02, l03, _default, defer; \
static void* jtable[] = { &&l01, &&l02, &&l03 }; \
volatile int r; \
if((unsigned int)type > 2) goto _default; \
goto *jtable[type]; \
l01: {\
r = 1; \
goto defer; \
} \
l02: {\
r = 2; \
goto defer; \
} \
l03: {\
r = 3; \
goto defer; \
} \
_default: { \
r = 11; \
goto defer; \
} \
defer: { \
} \
r; \
})
Computed GOTO: C++ Statement Expression
According to 6.18.1 Statements and Declarations in Expressions,
In G++, the result value of a statement expression undergoes array and function pointer decay, and is returned by value to the enclosing expression. For instance, if A
is a class, then
A a;
({a;}).Foo ()
constructs a temporary A
object to hold the result of the statement expression, and that is used to invoke Foo
. Therefore the this
pointer observed by Foo
is not the address of a
.
class Foo {
public:
Foo() = default;
Foo(const Foo&) {
std::cout << "Foo()::COPY" << std::endl;
}
Foo(Foo&&) {
std::cout << "Foo()::MOVE" << std::endl;
}
Foo&operator=(const Foo&) {
std::cout << "Foo::operator=()::COPY" << std::endl;
return *this;
}
Foo&operator=(Foo&&) {
std::cout << "Foo::operator=()::MOVE" << std::endl;
return *this;
}
};
#define switch_go(type) ({ \
__label__ l01, l02, l03, _default, defer; \
static void* jtable[] = { &&l01, &&l02, &&l03 }; \
Foo r; \
if((unsigned int)type > 2) goto _default; \
goto *jtable[type]; \
l01: {\
r = Foo(); \
goto defer; \
} \
l02: {\
r = Foo(); \
goto defer; \
} \
l03: {\
r = Foo(); \
goto defer; \
} \
_default: { \
r = Foo(); \
goto defer; \
} \
defer: { \
} \
r; \
})
// Output:
// ---switch Go---
// Foo::operator=()::MOVE
// Foo()::COPY
// ---switch Go---
Local Label Concern
Local labels, as the name suggests, are local. That means a jump table is generated per individual block - once for each place we use our macro. The code above generates only one jump table, since the single macro expansion sits inside the loop.
The following, on the other hand, isn't ideal unless we have a particular use case for it; using the macro in extensive loops or frequently called functions, where one expansion is reused many times, is where the trade-off pays off.
int main() {
__switch_go(value);
__switch_go(value);
__switch_go(value);
return 0;
}
We get three jump tables.
jtable.1:
.quad .L124
.quad .L125
.quad .L126
jtable.2:
.quad .L120
.quad .L121
.quad .L122
jtable.3:
.quad .L116
.quad .L117
.quad .L118
Inlining Switch Case
A similar thing occurs when inlining the switch case. When the compiler finds it inappropriate to inline the switch case (which would create another jump table), it simply emits a CALL instruction instead.
🚧 Note
In the case of the
__switch_go
macro, we are forcing the creation of a jump table each time since we are explicitly using local labels whereas with theswitch
case inlining - the compiler makes a better decision and doesn't waste the data segment memory with jump tables at all times. This is particularly significant if the jump table is very large.However, as mentioned if we use the
__switch_go
only when we need to - it is a great trade-off.
Consider the following code.
int main() {
{
printf("_switch_run\n");
for(int i = 0; i < 10; i++) {
start_time;
for(int i = 0; i < 1000000; i++) {
_switch_run(value);
}
end_time;
}
_switch_run(value);
_switch_run(value);
_switch_run(value);
printf("_switch_run\n");
}
return 0;
}
The _switch_run
within the loop generates:
.L125:
mov eax, DWORD PTR [rsp+4]
cmp eax, 10
ja .L111
jmp [QWORD PTR .L113[0+rax*8]]
.L113:
.quad .L123
.quad .L122
.quad .L121
.quad .L120
.quad .L119
.quad .L118
.quad .L117
.quad .L116
.quad .L115
.quad .L114
.quad .L112
Whereas for the three _switch_run calls outside of the loop, we get:
_switch_run:
cmp edi, 10
ja .L36
mov edi, edi
jmp [QWORD PTR .L38[0+rdi*8]]
.L38:
.quad .L48
.quad .L47
.quad .L46
.quad .L45
.quad .L44
.quad .L43
.quad .L42
.quad .L41
.quad .L40
.quad .L39
.quad .L37
.L143:
mov edi, DWORD PTR [rsp+28]
call _switch_run
mov edi, DWORD PTR [rsp+28]
call _switch_run
mov edi, DWORD PTR [rsp+28]
call _switch_run
mov edi, OFFSET FLAT:.LC3
call puts
mov edi, OFFSET FLAT:.LC4
call puts
mov ebp, 10
Customizing _switch_go
🚧 Note
This section is purely experimental and substandard.
You are free to skip it and read on if you are not interested.
What if we could find a way to optimize this implementation even more?
The only instruction that comes to mind is the CMP executed before the assembly gets to the jump table. This was mentioned previously in the Switch Case section at the beginning of this blog.
cmp DWORD PTR [rbp-20], 10
ja .L17
Roughly being equivalent to:
if(type > 10) {
goto L17;
}
Now, if we just remove that if statement - we remove the CMP
. This can result in faster code.
void _switch_go(unsigned int v) {
// As mentioned before, we keep the array preserved between function invocations.
static void* cases[] = {
&&l00,
&&l01,
&&l02,
&&l03,
&&l04,
&&l05,
&&l06,
&&l07,
&&l08,
&&l09,
&&l10,
};
// CMP For the default case - also implemented by the switch case
/*
if(v > 10) {
goto _default;
}
*/
goto *cases[v];
volatile int r;
// ...
}
The output code now contains no CMP
instructions.
_switch_go:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR cases.1[0+rax*8]
nop
jmp rax
Compared to:
_switch_go:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 10
ja .L64
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR cases.1[0+rax*8]
nop
jmp rax
Benchmark
The benchmark, without the CMP
instruction, gives us about 22%
faster code with no optimizations.
_switch_go
[CPU Time Used: 0.001753]
[CPU Time Used: 0.001714]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001747]
[CPU Time Used: 0.001711]
_switch_go
🚧 Warning
Removing the if statement and the CMP is more performant, but the cost on the programmer's side is that the code has to be very strict about covering all the possible values and cases, since this eliminates the default case. Note that if the subscript goes out of bounds, the program ends up with Segmentation fault (core dumped) at runtime. All in all, this can easily turn into premature optimization.
Here is the exact log from valgrind, showing the program jumping to an invalid address:
valgrind -s ./out.out
_switch_go
==49387== Jump to the invalid address stated on the next line
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)
==49387==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==49387==
==49387==
==49387== Process terminating with default action of signal 11 (SIGSEGV)
==49387==  Bad permissions for mapped region at address 0x0
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)
Default Case: Can we do without it?
🚧 Note
This section is purely experimental and substandard.
You are free to skip it and read on if you are not interested.
In the Customizing _switch_go section, we talked about the CMP instruction that determines whether the value is out of bounds and jumps to the default case. To optimize even further, we removed that instruction by implementing a switch case of our own using labels, but now let's attempt to do the same for the switch statement itself. If you are still confused, please make sure to read Implementing Switch Case.
Our strategy here is to cover all the enumeration values so that the default case is not needed and, perhaps, no CMP instruction is generated.
enum N {
N0,
N1,
N2,
N3,
N4,
N5,
N6,
N7
};
void _switch_default_case(enum N type) {
volatile int r;
switch(type) {
case N0: {
r = 0;
break;
}
case N1: {
r = 1;
break;
}
case N2: {
r = 2;
break;
}
case N3: {
r = 3;
break;
}
case N4: {
r = 4;
break;
}
case N5: case N6: case N7: {
r = 1000;
break;
}
}
}
Even after practically covering all of the enumeration values, we still end up with the CMP.
cmp DWORD PTR [rbp-20], 7
ja .L22
Without this CMP, the code would be unsafe (prone to out-of-bounds access), since someone could technically pass an arbitrary unsigned int and push the subscript out of bounds. That results in Segmentation fault (core dumped) at runtime. This is explained in great detail in Computed GOTO: Goto Jump Table with Labels and Customizing _switch_go.
_switch_default_case:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
cmp DWORD PTR [rbp-20], 7
ja .L22
mov eax, DWORD PTR [rbp-20]
mov rax, QWORD PTR .L16[0+rax*8]
jmp rax
.L16:
.quad .L21
.quad .L20
.quad .L19
.quad .L18
.quad .L17
.quad .L15
.quad .L15
.quad .L15
.L21:
mov DWORD PTR [rbp-4], 0
jmp .L14
.L20:
mov DWORD PTR [rbp-4], 1
jmp .L14
.L19:
mov DWORD PTR [rbp-4], 2
jmp .L14
.L18:
mov DWORD PTR [rbp-4], 3
jmp .L14
.L17:
mov DWORD PTR [rbp-4], 4
jmp .L14
.L15:
mov DWORD PTR [rbp-4], 1000
nop
.L14:
.L22:
nop
pop rbp
ret
Also note that for:
case N5: case N6: case N7: {
r = 1000;
break;
}
The GCC compiler recognizes the same label (.L15
) and uses it to fill the three entries in the jump table:
.L16:
.quad .L21
.quad .L20
.quad .L19
.quad .L18
.quad .L17
.quad .L15
.quad .L15
.quad .L15
Conclusion
Although there is a lot to unpack in this blog, the best approach is to use the switch statement, as the GCC compiler has a hard time optimizing the high-level C implementations as discussed above. If you're determined to optimize further and want to remove any redundant ASM
instructions that may impact your code, consider using 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands, which allows you to craft your assembly from the ground up - untouched by the optimizations such as -O1
, -O2
, or -O3
.
If you are interested, we can try to do that as a follow-up blog. But keep in mind that your codebase then becomes a nightmare to maintain, especially with nested switch statements.
Thank you for reading! Feel free to leave any feedback in the comments!
References
- https://en.wikipedia.org/wiki/X86_assembly_language
- https://en.wikipedia.org/wiki/Branch_table
- https://en.wikipedia.org/wiki/Data_segment
- https://www.gnu.org/software/c-intro-and-ref/manual/html_node/
- https://gcc.gnu.org/onlinedocs/gcc
- https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Options-That-Control-Optimization
- https://www.gnu.org/software/c-intro-and-ref/manual/html_node/switch-Statement.html
- https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
- https://gcc.gnu.org/onlinedocs/gcc/Statements-implementation.html#Statements-implementation
- https://en.cppreference.com/w/cpp/language/union