Enabling Packet Spraying over Commodity RNICs with In-Network Support
Poor man's packet spraying
Enabling Packet Spraying over Commodity RNICs with In-Network Support. Xiangzhou Liu, Wenxue Li, Zihao Wang, and Kai Chen. EuroSys '26.
This paper proposes changes to top-of-rack (ToR) switch hardware to enable packets from a single flow to utilize many network paths (Falcon offers a similar benefit via changes to the NIC rather than the switch). The Falcon approach is robust but more invasive. The switch-based approach from this paper is a more incremental change.
Packet Loss and Packet Spraying
The sweet spot for packet spraying is a data center that has a large number of network paths compared to network flows (i.e., connections). In such an environment, there is an incentive to spray the packets associated with one flow across multiple paths. The trouble with packet spraying is that packets will commonly arrive out of order, so the system has to be able to distinguish the out-of-order case from genuine packet loss.
Section 2.2 of the paper describes three techniques for handling packet loss:
PFC (Priority Flow Control) - the trouble is that this has scalability limits, and it only addresses packet loss due to buffer overflows (not angels flying down and flipping your bits)
Timeouts - the trouble is that practical timeout values have to be large, so recovery from a loss is slow
Selective Repeat
Selective Repeat
Selective repeat is a feature of modern RDMA NICs which is similar to the bitmaps tracked by Falcon hardware. The idea is that a receiving NIC tracks an expected sequence number (ePSN) for each flow. If a packet arrives with a sequence number greater than (but not too much greater than) the ePSN, the NIC accepts it and records this fact in a per-flow bitmap. The receiving NIC then sends a NACK to the sender, requesting that the sender resend the packet corresponding to the ePSN. In the out-of-order case, this NACK is unnecessary as the expected packet will arrive soon enough.
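The receiver-side behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular NIC implements it: the class name, the `window` limit of 64, and the choice to emit a NACK on every out-of-order arrival are all hypothetical simplifications, and a set stands in for the per-flow bitmap.

```python
from dataclasses import dataclass, field


@dataclass
class SelectiveRepeatReceiver:
    """Toy model of one flow's receive state (all names are hypothetical)."""
    window: int = 64   # assumed limit on how far ahead of ePSN a packet may be
    epsn: int = 0      # next in-order packet sequence number expected
    received: set = field(default_factory=set)  # stands in for the bitmap

    def on_packet(self, psn: int):
        """Return the PSN to NACK, or None if no NACK is sent."""
        if psn < self.epsn or psn >= self.epsn + self.window:
            return None  # duplicate or too far ahead: ignored in this sketch
        self.received.add(psn)
        if psn == self.epsn:
            # Advance ePSN past every packet that has already been buffered.
            while self.epsn in self.received:
                self.received.discard(self.epsn)
                self.epsn += 1
            return None
        # Gap detected: ask the sender to resend the packet at ePSN.
        return self.epsn


rx = SelectiveRepeatReceiver()
nacks = [rx.on_packet(psn) for psn in [0, 1, 5, 4]]
```

With the arrival order [0, 1, 5, 4], the sketch NACKs PSN 2 when packets 5 and 4 arrive, even though packets 2 and 3 may simply still be in flight on another path. That is the unnecessary NACK the paper wants to filter.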
The question asked by this paper is: can one easily modify switch hardware to filter the unnecessary NACKs?
Switch Modifications
The core idea proposed by this paper is that the NICs and switches agree on the number of paths for a particular flow. Sending NICs use a packet’s sequence number to determine which path to use (e.g., PathIndex = PSN % PathCount).
The switch can then track information for each path associated with each flow. When a NACK arrives at a switch, the switch can drop the NACK as necessary (thus avoiding unnecessary retransmissions).
For example, say there is a single flow mapped to 4 paths. Packets with PSNs [0, 4, 8, 12, …] will travel over path 0. Packets with PSNs [1, 5, 9, 13, …] will travel over path 1.
If packets arrive at the receiving NIC in the order [0, 1, 5, 4], the NIC will send a NACK when it receives packet 5. However, the switch will drop that NACK because it “knows” that no packet has been received out of order with respect to its path: packets 0 and 4 arrived in order on path 0, and packets 1 and 5 arrived in order on path 1.
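Here is a minimal sketch of how such a filter could work. It is my own guess at one plausible rule, not the paper's exact mechanism: assuming packets on a single path stay in order, a NACK for a missing PSN is forwarded only if a later packet on that PSN's own path has already passed the switch. The names `NackFilter`, `on_data`, and `should_forward_nack` are hypothetical, and PSN wraparound is ignored.

```python
from collections import defaultdict

PATH_COUNT = 4  # agreed between NICs and switches (value from the example)


def path_index(psn: int, path_count: int = PATH_COUNT) -> int:
    """Path chosen by the sending NIC: PathIndex = PSN % PathCount."""
    return psn % path_count


class NackFilter:
    """Hypothetical per-flow state on the receiver-side ToR switch."""

    def __init__(self, path_count: int = PATH_COUNT):
        self.path_count = path_count
        # Highest PSN forwarded to the receiver so far, per (flow, path).
        self.highest = defaultdict(lambda: -1)

    def on_data(self, flow: str, psn: int) -> None:
        key = (flow, path_index(psn, self.path_count))
        self.highest[key] = max(self.highest[key], psn)

    def should_forward_nack(self, flow: str, epsn: int) -> bool:
        # Assumed rule: the missing packet is treated as lost only if a later
        # packet on its own path has already gone by; otherwise drop the NACK.
        key = (flow, path_index(epsn, self.path_count))
        return self.highest[key] > epsn


switch = NackFilter()
for psn in [0, 1, 5]:
    switch.on_data("flow-a", psn)
drop_first_nack = not switch.should_forward_nack("flow-a", 2)
```

In the [0, 1, 5, 4] example, the NACK asks for PSN 2, which belongs to path 2. Nothing has been seen on path 2 yet, so the sketch drops the NACK. If PSN 6 later passes the switch while PSN 2 is still missing, the same NACK would be forwarded.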
Results
Fig. 12 has simulated performance results for collective operations common in AI workloads. Themis is the work described in this paper. CCT is a measure of how long the collective operation took.
Dangling Pointers
This feels like an engineering solution to a business problem: how to get many NIC vendors to align on a packet spraying solution.
I suspect there are many applications where it would not be too difficult to introduce multiple flows. For example, in a machine learning workload, the weights/deltas associated with layer N could be assigned to flow N % 4. This would increase network path utilization without any hardware changes.
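As a rough illustration of that idea, the layer-to-flow assignment might look like the sketch below. The function names and the flow count of 4 are hypothetical, and the sketch only computes the grouping. Opening the connections and sending the data is left to whatever communication library the application uses.

```python
from collections import defaultdict

FLOW_COUNT = 4  # number of flows the application opens (from the example)


def flow_for_layer(layer: int, flow_count: int = FLOW_COUNT) -> int:
    """Assign layer N to flow N % flow_count."""
    return layer % flow_count


def group_layers_by_flow(num_layers: int, flow_count: int = FLOW_COUNT) -> dict:
    """Map each flow index to the list of layers whose traffic it carries."""
    groups = defaultdict(list)
    for layer in range(num_layers):
        groups[flow_for_layer(layer, flow_count)].append(layer)
    return dict(groups)


assignment = group_layers_by_flow(10)
```

Each flow is still pinned to a single path, so this only helps to the extent that the flows hash onto different paths and the layers carry similar amounts of data.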


