Optimization in C/C++ is a fundamental subject for any developer who wants to maximize the performance of their programs. Understanding how code interacts with the hardware enables you to take full advantage of modern processor capabilities.

In this article, we will explore the CPU pipeline, memory management, compiler options, the efficient use of registers, and the exploitation of SIMD (Single Instruction, Multiple Data) instructions, using as an example a function that everyone has already used: strlen.

Demystifying optimizations

Yes, optimization in low-level languages can seem quite terrifying for several reasons, but it is not that hard.

And please, don’t be this guy:

Indeed, reading low-level source code involves a layer of abstraction that can be scary, but in my opinion it doesn't have to be. The key words are: just learn, take your time, and don't be afraid. Starting from a simple function, I will try to give you the key points to understand that optimization can be approachable (up to a certain point, but we'll get to that later).

yoooo.webp

Let’s talk about the CPU

The most important aspects of optimization are understanding the CPU and memory. The CPU pipeline in x86_64 architecture is a key concept that enables modern processors to execute multiple instructions simultaneously, greatly enhancing efficiency. This pipeline breaks down the execution of each instruction into several stages. Each stage handles a specific part of the instruction's execution—fetching it from memory, decoding it, executing it, accessing memory, and finally writing the result.

In x86_64 processors, the pipeline is typically composed of stages such as:

  1. Fetch: The CPU retrieves the next instruction from memory.
  2. Decode: The instruction is decoded into a format the CPU can understand.
  3. Execute: The CPU performs the actual operation (e.g., addition, multiplication).
  4. Memory Access: If the instruction requires accessing memory (reading or writing), it happens here.
  5. Writeback: The result of the operation is written back to the CPU registers or memory.
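To make these stages a little more concrete, here is a minimal sketch (my own illustration, not taken from a specific compiler run) of how one C statement becomes a handful of machine instructions, each of which flows through the pipeline. The assembly in the comments is simplified x86_64 output, roughly what GCC or Clang would produce at -O1.

```c
/* One C statement, a few machine instructions. */
long add_one(const long *p)
{
    return *p + 1;
}

/*
 * Simplified x86_64 translation (Intel syntax, register allocation assumed):
 *
 *   mov rax, [rdi]   ; needs the Memory Access stage to load *p
 *   add rax, 1       ; pure Execute work
 *   ret              ; the result has already been written back into rax
 *
 * Each instruction is fetched, decoded, executed, may touch memory, and
 * writes back its result; while `add` sits in the Execute stage, the CPU
 * can already be decoding `ret` and fetching whatever comes next.
 */
```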

The benefit of pipelining is that while one instruction is being executed, another can be decoded, and yet another can be fetched. This parallelism increases throughput without increasing the clock speed. However, pipelining is not without its challenges, especially with hazards (situations where one instruction depends on the result of a previous one) and branching (when the program flow changes unexpectedly), which can stall the pipeline and reduce efficiency.
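To give a feel for the branching problem, here is a hedged sketch (my own illustration, with made-up function names, not code from this article): on random input the `if` below is mispredicted often and the pipeline gets flushed, while the branchless rewrite replaces the branch with a data dependency the pipeline handles far better.

```c
#include <stddef.h>

/* Branchy version: on random data the condition is hard to predict,
 * and every misprediction flushes the pipeline. */
long sum_positive_branchy(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 0)
            sum += data[i];
    }
    return sum;
}

/* Branchless version: the condition becomes a mask, so there is nothing
 * to mispredict. (Assumes the usual arithmetic right shift on signed
 * ints, which GCC, Clang and MSVC all provide on x86_64.) */
long sum_positive_branchless(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int mask = ~(data[i] >> 31);   /* all ones if data[i] >= 0, else 0 */
        sum += data[i] & mask;
    }
    return sum;
}
```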

pipeline6.png

In a pipeline, each stage ideally takes one clock cycle to complete. Modern CPUs aim to execute at least one instruction per clock cycle (IPC: Instructions Per Cycle), but in practice this is not always achieved because of factors like data dependencies, memory delays, or branch mispredictions; at best, modern cores sustain between 2 and 4 IPC. If an instruction depends on the result of a previous one, the CPU might need to wait (or "stall") until that result is ready, introducing additional cycles and reducing performance.
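Here is a minimal sketch of such a stall (again my own illustration, with hypothetical function names): in the first loop every addition depends on the previous one, so the CPU keeps waiting for the last result; splitting the sum into independent accumulators gives it several chains it can overlap, which is exactly how the achieved IPC goes up.

```c
#include <stddef.h>

/* Serial dependency chain: each `sum += a[i]` must wait for the
 * previous addition to finish, so the pipeline is never kept full. */
double sum_serial(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent chains the pipeline can interleave.
 * (Sketch only: assumes n is a multiple of 4 and ignores the slight
 * change in floating-point rounding caused by reassociation.) */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```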

This was a “brief” introduction to the CPU pipeline, but now, let’s talk about code.