State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
A trick to parallelize seemingly sequential code
State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing, Qiongwen Xu et al., NSDI'25
It is difficult for me to say if this idea is brilliant or crazy. I suspect it will force you to change some intuitions.
State, Skew, Parallelism
The goal is laudable: a framework to support generic packet processing on CPUs with the following properties:
Shared state between packets (this state can be read and written when processing each packet)
High performance under skewed flow distributions (a few flows may dominate the total amount of work, so receive-side scaling won't balance the load)
Throughput increases with the number of CPU cores used
As I see it, there are two options for parallel processing of network packets:
Distribute packets across cores
Pipeline parallelism
Both pose different problems for handling shared state. This paper advocates for distributing packets among cores and has a fascinating technique for solving the shared state problem.
Fast-Forward State Replication
In this model, each core has a private replica of the shared state. Incoming packets are distributed across the cores in round-robin fashion. And now you are totally confused: how can this ever work? Enter the sequencer.
The sequencer is a hardware component (it could be a feature of the NIC or switch) which processes all incoming packets sequentially and appends a “packet history” to each packet immediately before it is sent to be processed by a core.
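The sequencer has to keep up with line rate, but the work per packet is tiny. Here is a minimal C sketch of the idea, using four cores and per-packet source IPs as the history entries to match the DDoS example that follows. All struct and function names are hypothetical, and in the paper this runs as a hardware feature of the NIC or switch, not as software:

```c
#include <stdint.h>

#define NUM_CORES 4
#define HISTORY   (NUM_CORES - 1)

/* Hypothetical digest: the few packet fields the state update needs
 * (the source IP, for the DDoS example below). */
struct pkt_digest {
    uint32_t src_ip;
};

/* Hypothetical packet representation with room for the appended history. */
struct pkt {
    uint32_t          src_ip;            /* parsed from the header */
    struct pkt_digest history[HISTORY];  /* filled in by the sequencer */
    /* ... rest of the headers and payload ... */
};

/* Digests of the last HISTORY packets, kept by the sequencer. */
static struct pkt_digest ring[HISTORY];
static uint64_t seq; /* global packet sequence number */

/* Append the history to the packet and pick its target core. */
int sequence_and_dispatch(struct pkt *p)
{
    /* Oldest-first digests of the previous HISTORY packets. */
    for (int j = 0; j < HISTORY; j++)
        p->history[j] = ring[(seq + j) % HISTORY];

    /* Record this packet's digest for the next HISTORY packets to carry. */
    ring[seq % HISTORY] = (struct pkt_digest){ .src_ip = p->src_ip };

    int core = (int)(seq % NUM_CORES);
    seq++;
    return core; /* round-robin dispatch */
}
```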
Say that four cores are used to implement a DDoS mitigator, and that the crux of the DDoS mitigator's state update depends only on the IP addresses of incoming packets. Each core receives a full copy of every fourth packet. Along with every fourth packet, each core also receives the IP addresses of the previous three packets (the packets that were sent to the other cores).
The per-core DDoS mitigator first updates its local state replica using the IP addresses of the previous three packets from the packet history. The paper calls this fast-forwarding the local state replica. At this point, the core has a fresh view of the state and can process the incoming packet.
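Continuing the sketch above, here is what the per-core logic might look like, with a toy counter table standing in for the mitigator's real state (again, all names and the rate-limit policy are hypothetical):

```c
#define TABLE_SIZE 65536
#define RATE_LIMIT 1000 /* hypothetical per-IP packet budget */

/* Per-core private replica of the shared state. */
struct state_replica {
    uint64_t pkts_per_ip[TABLE_SIZE];
};

void drop(struct pkt *p);    /* packet I/O provided elsewhere */
void forward(struct pkt *p);

/* The cheap state update: depends only on a few packet fields. */
static void update_state(struct state_replica *s, uint32_t src_ip)
{
    s->pkts_per_ip[src_ip % TABLE_SIZE]++; /* toy hash for illustration */
}

/* Runs on the one core that was handed packet p. */
void handle_packet(struct state_replica *local, struct pkt *p)
{
    /* 1. Fast-forward: replay the state updates for the packets that
     *    went to the other cores, using the appended history. */
    for (int j = 0; j < HISTORY; j++)
        update_state(local, p->history[j].src_ip);

    /* 2. Apply this packet's own state update. */
    update_state(local, p->src_ip);

    /* 3. The expensive step, done only on this core: consult the
     *    now-fresh state and process the full packet. */
    if (local->pkts_per_ip[p->src_ip % TABLE_SIZE] > RATE_LIMIT)
        drop(p);
    else
        forward(p);
}
```

Because dispatch is strict round-robin, each replica applies every packet's state update exactly once and in global arrival order, which is what keeps the replicas consistent with one another.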
Another way of thinking about this is that network functions can be decomposed into two steps:
Shared state update, which is relatively inexpensive and depends only on a few packet fields
Packet processing, which is relatively expensive and consumes both the shared state and the full packet as input
The shared state update computation is performed redundantly by all cores; there is no parallel speedup associated with it. The speedup comes entirely from dividing the expensive packet-processing step across cores, as the sketch below suggests.
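In code terms, one could imagine a framework built on this idea asking each network function to supply the two halves separately. This is a hypothetical interface, not the paper's actual API:

```c
struct nf_state;   /* replicated privately on each core */
struct nf_digest;  /* the few packet fields the update needs */
struct pkt;

struct nf_ops {
    /* Cheap: executed by every core for every packet, either directly
     * or by replaying the sequencer-appended history. */
    void (*update_state)(struct nf_state *s, const struct nf_digest *d);

    /* Expensive: executed only by the one core the packet was
     * dispatched to, e.g. because it ends with programming the NIC
     * to transmit the result. */
    void (*process_packet)(struct nf_state *s, struct pkt *p);
};
```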
The authors argue that many network functions can be decomposed in this manner, and that packet processing is fundamentally expensive because it involves the CPU overhead of programming a NIC to send an outgoing packet.
Results
Fig. 6 of the paper has the results. The baseline (orange) implementation suffers in many cases because of cross-core synchronization. The sharding implementations suffer under flow skew (a few flows dominating the total cost).
Dangling Pointers
There are heterodox and orthodox assumptions underpinning this paper.
An orthodox assumption is that the packet sequencer must run in hardware because it must process packets in order, and that is the sort of thing that dedicated hardware is much better at than a multi-core CPU.
A heterodox assumption is that many network functions can be expressed with the fast-forward idiom.
I wonder about the orthodox assumption. Hardware isn't magic: is there a fundamental reason a multi-core CPU cannot handle 100 Gbps packet processing with only pipeline parallelism?