NUMA-aware memory allocation

NUMA?

Non-Uniform Memory Access (NUMA) has become a real buzzword nowadays. To fully understand what it means, we need to go back to the old days, when a single board held a single processor.
Processor architects then put multiple cores on each processor, and memory bandwidth soon became a bottleneck. To increase parallelism, and with it the memory bandwidth available to the chip, they added multiple memory controllers. This worked out pretty well, and for most software it did not matter which memory controller it was actually using. Nowadays the hardware has become more sophisticated still: multiple processors sit on a single board, running the same SMP software. Because software developers are usually lazy, processor architects invented memory coherence protocols that present all of this memory as a single uniform array, so existing software would not have to be ported. This was a great technological achievement: the software does not need to be aware of the memory partitioning (which partition belongs to which processor), and all “the magic” is done by the cache and memory coherence protocols between the chips.

[Figure: NUMA architecture]

This, of course, is only one side of the story. There is a big performance hit when memory allocation is not done correctly. As every programmer knows, abstraction has its costs, and this case is no different: accessing memory attached to a remote chip from a core has a higher latency than accessing memory attached to the core’s own processor.

Guidelines

To achieve the best performance, I recommend following these rules (a short sketch follows the list):

  1. Allocate memory (pay attention – I’m referring to physical memory, of course) for a process on the same NUMA node that the requesting process is running on.
  2. Try to keep a running process on the same CPU.
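
Here is a minimal sketch of both rules on Linux, assuming libnuma (numa.h, linked with -lnuma) and the GNU scheduling API are available; the 64 MiB buffer size and the choice to pin to whichever CPU the process currently runs on are just for illustration.

/* Pin the process to one CPU, then allocate memory on that CPU's NUMA node.
 * Build with: gcc numa_local.c -o numa_local -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    /* Rule 2: pin the process to a single CPU so the scheduler
     * cannot migrate it to a CPU on a different NUMA node later. */
    int cpu = sched_getcpu();
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Rule 1: allocate physical memory on the NUMA node that owns
     * the CPU we are now pinned to. */
    int node = numa_node_of_cpu(cpu);
    size_t size = 64 * 1024 * 1024;              /* 64 MiB, for illustration */
    void *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }

    printf("pinned to CPU %d, allocated %zu bytes on node %d\n",
           cpu, size, node);

    numa_free(buf, size);
    return EXIT_SUCCESS;
}

Pinning first and allocating second is deliberate: once the process can no longer migrate, the node returned by numa_node_of_cpu() stays local for the lifetime of the buffer.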

If we take Linux as an example, it relies on heuristics: the scheduler might move the process to a different CPU, and a call to malloc() is not guaranteed to return memory from the process’s NUMA node, although by default the kernel will try to allocate memory from the node the process is running on.
Every OS has its own characteristics, and some need more guidance from the developers themselves. If you want to be as deterministic as possible, you should not rely on any of these heuristics.
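
If you do want that determinism on Linux, libnuma lets you bind both execution and memory explicitly instead of relying on the kernel’s “prefer local” default. The sketch below is only meant to show the strict-binding idea and assumes node 0 exists.

/* Strict, deterministic placement with libnuma.
 * Build with: gcc strict_bind.c -o strict_bind -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    /* Bind both the CPUs we may run on and the memory we may use to node 0. */
    struct bitmask *nodes = numa_parse_nodestring("0");
    if (nodes == NULL) {
        fprintf(stderr, "failed to parse node string\n");
        return EXIT_FAILURE;
    }
    numa_bind(nodes);

    /* Ask for strict placement: do not silently fall back to another node. */
    numa_set_strict(1);

    size_t size = 1 << 20;                       /* 1 MiB, for illustration */
    char *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "allocation on node 0 failed\n");
        numa_bitmask_free(nodes);
        return EXIT_FAILURE;
    }
    memset(buf, 0, size);                        /* touch the pages to fault them in */

    numa_free(buf, size);
    numa_bitmask_free(nodes);
    return EXIT_SUCCESS;
}

The same effect can be had from the command line, without touching the code, by launching the program under numactl --cpunodebind=0 --membind=0.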
