In the packet processing world, the usual goal is to incur a single cache miss throughout the “life of the packet” in the system. Not many know it, but the latest Intel architectures include a feature called “DCA”, which stands for Direct Cache Access.
This feature helps with I/O bandwidth management and increases the throughput that a core can handle. That is very helpful nowadays, when a single socket can handle tens of gigabits per second of traffic.
DCA enables the NIC (let’s call it the producer, as it “produces” packets into the core’s memory) to push its data directly into the cache. This greatly reduces cache misses for the consumer, the core to which the packet is addressed. More on Intel’s I/O acceleration and DCA can be found here.
When working with high-rate traffic, say a couple of 10Gig NICs on a single socket, you’ll usually get tens of millions of packets per second, each requiring both buffer-descriptor and data transactions to and from the memory associated with the core. On a DCA-enabled architecture, the consumer is expected to process the packets directly from cache.
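To put “tens of millions” into numbers, here is a quick back-of-the-envelope calculation in Python. It assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap). The function names are mine, chosen for illustration only.

```python
# Per-frame overhead on the wire, beyond the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate at line rate for a given frame size (FCS included)."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def per_packet_budget_ns(pps: float) -> float:
    """Average time available per packet, in nanoseconds, at a given rate."""
    return 1e9 / pps


if __name__ == "__main__":
    one_port = packets_per_second(10e9, 64)   # one 10Gig NIC, 64-byte frames
    two_ports = 2 * one_port                  # "a couple of 10Gig NICs"
    print(f"one port : {one_port:,.0f} packets/s")
    print(f"two ports: {two_ports:,.0f} packets/s")
    print(f"budget if one core handled both: {per_packet_budget_ns(two_ports):.1f} ns/packet")
```

That works out to roughly 14.88 million packets per second per port, so close to 30 million for two ports. If a single core had to handle both, it would have about 33.6 ns per packet, which is why a single trip to main memory per packet is already expensive.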
This expectation does not always hold, and it depends heavily on the architecture of the software.
When working with interrupts, one should take into consideration the latency until the core starts processing the packet. Sometimes a lot of code runs before the actual handling, raising the risk that the packet data will be evicted from the cache (a “cache out”) in the meantime. Another issue may arise if the interrupts are coalesced and generated for a large number of packets. Again, this raises the risk that packet data is cached out (sometimes only for the later packets in the burst). Interrupts are the less preferable method for packet processing, as they are not deterministic and reduce the PCI bandwidth.
The more common method for processing high-rate traffic is polling. Usually there is a main thread on each core, polling the associated queue on the NIC. Cache outs are less likely here, but polling by its nature usually handles packets in bursts, so one should choose the burst length carefully to avoid cache outs of packet data. Packet processing itself raises the risk of cache outs and, combined with a large burst length, may cause them very early in the burst processing stage.
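As a first rough estimate of a safe burst length, one can divide the amount of cache the burst is allowed to occupy by the number of cache lines each packet touches. The sketch below does this in Python. The 64-byte line size is typical for x86, while the cache budget and the per-packet byte counts are made-up placeholders that you would replace with figures for your own platform and code. It ignores associativity and other cores sharing the cache, so treat the result as an upper bound and not as a guarantee.

```python
CACHE_LINE_BYTES = 64  # typical x86 cache line size


def cache_lines(nbytes: int) -> int:
    """Number of cache lines needed to hold nbytes (rounded up)."""
    return -(-nbytes // CACHE_LINE_BYTES)


def max_burst(cache_budget_bytes: int,
              pkt_bytes_touched: int,
              desc_bytes: int,
              scratch_bytes_per_pkt: int) -> int:
    """Upper bound on the burst length that fits in the given cache budget.

    All four parameters are hypothetical inputs:
      cache_budget_bytes    - cache the burst may occupy
      pkt_bytes_touched     - packet bytes the code actually reads (e.g. headers)
      desc_bytes            - size of one buffer descriptor
      scratch_bytes_per_pkt - other data touched while processing one packet
    """
    lines_per_pkt = (cache_lines(pkt_bytes_touched)
                     + cache_lines(desc_bytes)
                     + cache_lines(scratch_bytes_per_pkt))
    return cache_budget_bytes // (lines_per_pkt * CACHE_LINE_BYTES)


if __name__ == "__main__":
    # Placeholder numbers: 256 KiB budget, 128 header bytes read,
    # 16-byte descriptor, 192 bytes of per-packet scratch data.
    print(max_burst(256 * 1024, 128, 16, 192))
```

With these placeholder numbers each packet costs six cache lines (384 bytes), so the bound comes out at 682 packets. In practice you would start well below such a bound and confirm the choice by profiling.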
DCA is not a “magic” feature that eliminates cache misses. While implementing the software, profile it to discover whether packet data is being cached out. This can easily be done with a profiler, by detecting where in the code cache misses appear: misses on packet data accesses indicate cache outs. Sometimes it is wise to unroll the processing loop, to better understand at which stage of the burst processing the cache outs start to occur.
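To get a feel for what such a profile might show, here is a toy simulation in Python. It is only a sketch under strong assumptions: a fully associative LRU cache, one cache line per packet pushed by the NIC before processing starts, and a fixed number of unrelated “work” lines touched while handling each packet. Real caches and real DCA behave differently, and all names and numbers here are hypothetical. The function reports the position in the burst at which the first packet-data miss occurs.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity_lines: int):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # oldest entry first

    def access(self, tag) -> bool:
        """Touch a line. Returns True on a hit, False on a miss (then fills it)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[tag] = None
        return False


def first_miss_position(capacity_lines: int, burst: int, work_lines_per_pkt: int):
    """Index of the first packet whose data misses, or None if all packets hit."""
    cache = LRUCache(capacity_lines)
    # Producer: the NIC pushes one line per packet into the cache.
    for i in range(burst):
        cache.access(("pkt", i))
    # Consumer: the core reads each packet, then touches its own working data.
    for i in range(burst):
        if not cache.access(("pkt", i)):
            return i
        for w in range(work_lines_per_pkt):
            cache.access(("work", i, w))
    return None


if __name__ == "__main__":
    for burst in (8, 16, 32, 64, 128):
        print(burst, first_miss_position(64, burst, 4))
```

With a 64-line cache and 4 work lines per packet, a burst of 8 is processed without a single packet-data miss, while bursts of 16, 32 and 64 start missing at positions 13, 9 and 1. A burst of 128 does not even fit, so the very first packet misses. The longer the burst, the earlier the cache outs begin, which is the pattern to look for in a real profile.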
With proper design, packet processing with zero cache misses is achievable.