CPU Kitchen
Introduction
Understanding how CPUs work at a deep level is a valuable resource for any serious programmer. Knowing how a CPU works can both help with performance and take out some of the mystery when it comes to how your code is really executed. Even though CPUs are fairly complicated, the decisions made when designing them are very logical and tend to make sense intuitively.
There are many aspects to a CPU, so the vocabulary can become a little overwhelming. Therefore, I will be explaining some of the most important parts of a CPU through the analogy of something a bit more familiar: a kitchen. Just as a good chef keeps their cooks productive and efficient, programmers can aim to keep the CPU constantly busy.
Memory
Suppose that it is prep time and a cook needs to chop a lot of tomatoes. Where and how they retrieve the tomatoes is an important consideration. Tomatoes can be stored in a few places: the market, the pantry, a bowl beside the cook, or the cutting board itself. When the cook needs another tomato to cut, ideally there would always be one on the cutting board ready to go, to minimize downtime. However, the cook may need to go to their bowl to get another tomato, or worse, the pantry or even the market. One pattern emerges here: the more convenient and closer a storage location is, the less it is able to store.
This is very similar to how the memory hierarchy works in a CPU. The cutting board, bowl, and pantry represent the L1, L2, and L3 caches of a CPU, respectively, while the market represents main memory. Just as with storage in a kitchen, these levels of memory vary in size and speed. The L1 cache is the fastest but smallest, like the cutting board. Main memory can hold the most data but is the slowest to retrieve from, like the market. In rough comparison to the access time of the L1 cache, the L2 cache is 4 times slower, the L3 cache is 16 times slower, and main memory is 100 times slower. By arranging your data so that the fast caches serve most accesses, you can achieve significant performance gains, as the CPU spends less time waiting for data to arrive.
Storage for memory is just like storage for food
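To make this concrete, here is a minimal C sketch (my own illustration, not from the article) of two loops that compute the same sum but use the caches very differently. The names and sizes are arbitrary:

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Row-major traversal: consecutive iterations touch adjacent
 * addresses, so each chunk of memory fetched into the cache is fully
 * used before the next is needed -- the cutting board stays stocked. */
long sum_row_major(const int grid[ROWS][COLS]) {
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Column-major traversal computes the same sum, but each access jumps
 * COLS * sizeof(int) bytes ahead, so it touches a fresh region of
 * memory almost every iteration and spends far more time reaching
 * into the slower levels of the hierarchy. */
long sum_col_major(const int grid[ROWS][COLS]) {
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

Both functions return the same value; the only difference is the order in which memory is visited, which is exactly the kind of detail the cache hierarchy rewards or punishes.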
Latency and Throughput
Every kitchen eventually needs to go to the market, and despite the trip taking a while, it can stay efficient because it doesn't bring back just one tomato but truckloads of them. The time between leaving for the market and returning with tomatoes represents latency. If a tomato run only happened once the tomatoes ran out, no tomatoes would be cut for the entire duration of the trip. The latency would reduce the kitchen's throughput, the number of tomatoes cut per hour, which is what the kitchen cares about most. By sending someone to the market long before the tomatoes run out, the supply stays stocked and throughput stays high, even though the latency of a market trip remains the same.
Even though main memory retrieval times are high, that figure only represents latency. The CPU has a couple of ways to offset high latency in favor of throughput, which is often what matters more. Like how the kitchen uses trucks to carry more tomatoes when it does go to the market, the CPU retrieves memory from RAM in blocks (cache lines) that are larger than the amount requested, betting on spatial locality: data near what you just used is likely to be needed soon. The CPU can also predict what memory will be needed in the near future and start retrieving it early, a technique called prefetching. Both help reduce the occasions when the CPU is simply waiting for data to arrive. Note that none of these strategies reduce latency; they mitigate its effect on the throughput of the program. CPUs aren't magic, however, so writing code with good spatial and temporal locality helps ensure that throughput stays high.
Several memory requests can be in flight at a time
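As a sketch of the "send someone to market early" idea, here is an illustrative C example (mine, not the article's). The first loop scans sequentially, a pattern the hardware prefetcher handles on its own; the second chases indices, where a manual hint can help. `__builtin_prefetch` is a GCC/Clang extension, and the lookahead distance of 8 is an arbitrary choice:

```c
#include <stddef.h>

/* Sequential scan: the hardware prefetcher spots the linear pattern
 * and starts fetching upcoming cache lines before they are asked
 * for -- the trip to the market begins before the tomatoes run out. */
long sum_sequential(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Indexed access: the hardware can't guess an irregular pattern, but
 * a manual hint can request a future element early so the fetch
 * overlaps with the current work. The latency is unchanged; it is
 * simply hidden behind useful computation. */
long sum_indexed(const int *data, const size_t *idx, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]); /* look 8 ahead */
#endif
        total += data[idx[i]];
    }
    return total;
}
```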
Out of Order Execution
Even though a cook is managed by the chef, the chef doesn't micromanage the cook's every movement. Suppose the chef tells the cook to mix lettuce, then tomatoes, then onions in a large bowl. Because the chef issued multiple tasks at once, the cook is able to make some decisions about exactly what happens. The cook understands that this is to make a salad and may subtly defy the chef, mixing the ingredients in a different order as long as the end result is the same. Sometimes a different order is faster; sometimes it simply doesn't matter what the order is.
Sometimes the order of actions doesn’t matter
CPUs work very similarly, executing instructions in a different order than the one in which they arrive. A CPU can analyze several instructions at a time to make informed decisions about what to do each clock cycle. If one instruction is waiting for memory to be retrieved, the CPU can execute a different instruction that is ready to go, postponing the original one in order to stay productive. The CPU will only do this when it is sure that both orderings produce exactly the same outcome. So the exact order of independent instructions ends up not mattering much, because they are likely to get shuffled anyway. Reducing the dependencies between instructions gives the CPU more freedom to shuffle and overlap them, which can lead to drastically better performance.
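A classic illustration of dependencies limiting this freedom is a running sum. The following C sketch (my example, not from the article) contrasts a single dependency chain with independent accumulators:

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous
 * add, so even an out-of-order CPU must perform them one after
 * another -- there is nothing independent to reorder. */
long sum_chained(const long *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Four independent accumulators: the four adds in each iteration
 * don't depend on one another, so the CPU is free to execute them in
 * whatever order its execution units allow, overlapping their work. */
long sum_unrolled(const long *v, size_t n) {
    long t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += v[i];
        t1 += v[i + 1];
        t2 += v[i + 2];
        t3 += v[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        t0 += v[i];
    return t0 + t1 + t2 + t3;
}
```

Both functions return the same total; the second simply hands the CPU instructions it can shuffle, which is exactly the freedom described above.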
Specialization
A cook is able to do multiple things at once, with varying degrees of success. It might be relatively easy to flip a steak and stir a pot at the same time, but cutting a tomato while peeling a carrot could be beyond the cook's ability. A good chef gives their cooks tasks that can be done simultaneously, increasing the productivity of the kitchen.
A CPU also aims to perform several tasks at the same time, and it has limitations of its own. It has multiple execution ports, each of which can only perform a finite set of operations. A common distinction is between ports that handle memory operations (loads and stores) and ports that perform arithmetic (addition, multiplication, etc.). Programs can only fully harness the CPU if they feed it a mix of memory and arithmetic instructions at any given moment. For example, a program that only moves memory around leaves its arithmetic ports idle. By understanding what your CPU is able to do, you can better grasp its throughput limits and know how close you are to the theoretical peak of performance.
Different dishes can be prepared at the same time, just like instructions
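As a small C sketch of this mix (my illustration, not the article's), a dot product issues two loads per iteration for the memory ports and a multiply plus an add for the arithmetic ports, so both kinds of unit have work to do. The port layout described in the comments is typical but varies by microarchitecture:

```c
#include <stddef.h>

/* Each iteration performs:
 *   - two loads (a[i], b[i])        -> memory ports
 *   - one multiply and one add      -> arithmetic ports
 * Both categories of execution port stay busy. A loop that only
 * copies memory, by contrast, would leave the arithmetic ports idle,
 * like a cook peeling carrots while the stove sits unused. */
double dot(const double *a, const double *b, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i] * b[i];
    return total;
}
```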
Conclusion
CPUs may be very complicated, but for most programmers, the fine details aren’t crucial to get a significant benefit. Knowing about some of the features the CPU has can lead to better performance of programs by highlighting areas of improvement and signalling when to stop optimizing. It can also lead to clearer decisions as some small changes in your code may have no impact on performance at all. I highly recommend reading more about how CPUs work as it has been very helpful and interesting for me.