State-Compute Replication: Parallelizing…

Sep 22

A trick to parallelize seemingly sequential code

3 Comments

Hi Blake, thanks for this post! I'm one of the authors of this paper and also the faculty member advising the lead grad student author.

I'll attempt to respond to your last question: why can pipelines process packets quickly while CPU cores can't, even if the latter are allowed to use pipelined parallelism? The answer lies in how computation is partitioned between the pipeline and the CPU, and not specifically in whether something is done on line-rate hardware or on the CPU.

In pipelined parallelism, each stage of a pipeline finishes executing an operation over a piece of data within a single clock cycle, i.e., the time budget for each stage. Pipelines that are clocked faster have higher throughput, since they can admit a new piece of data to process in the pipeline in the next clock cycle.
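To make the throughput argument concrete, here is a minimal sketch that counts clock cycles for an idealized pipeline with no stalls. The function names are made up for illustration, and the model assumes exactly one new item is admitted per cycle.

```python
def pipelined_cycles(num_stages: int, num_items: int) -> int:
    """Cycles until the last item leaves an ideal pipeline (no stalls).

    The first item needs num_stages cycles to traverse the pipeline;
    every later item completes one cycle after the one before it.
    """
    return num_stages + num_items - 1


def unpipelined_cycles(num_stages: int, num_items: int) -> int:
    """Cycles if each item must fully finish before the next one starts."""
    return num_stages * num_items


if __name__ == "__main__":
    stages, items = 4, 1000
    print("pipelined:  ", pipelined_cycles(stages, items))
    print("unpipelined:", unpipelined_cycles(stages, items))
```

In this model the steady-state rate is one item per cycle no matter how many stages there are, so a faster clock translates directly into higher throughput.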

However, stateful operations are challenging to clock fast because of dependencies. If the very next packet could require the most updated state from the previous packet, reading the state (from memory), computing the modification to the state, and writing it back (to memory) must together complete within a single clock cycle. You cannot break this operation into smaller computations spread over multiple pipeline stages. Fundamentally, when you want deterministic sequential semantics, fast pipelines (found in line-rate switch and NIC hardware, clocked at 1 GHz or higher, like RMT) must severely limit the kinds of permissible stateful operations.
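A toy simulation can show why the read-modify-write cannot simply be spread over several stages. This is only a sketch under simplifying assumptions: one packet arrives per cycle, every packet increments a single shared counter, and there is no forwarding or stalling logic. The function name and the `rmw_cycles` parameter are hypothetical.

```python
def run_counter(num_packets: int, rmw_cycles: int) -> int:
    """Simulate back-to-back packets that each increment one counter.

    A packet reads the counter in the cycle it arrives and its write
    lands at the end of cycle (arrival + rmw_cycles - 1). A write only
    becomes visible to reads in the following cycle.
    """
    state = 0
    pending = {}  # cycle at whose end a write lands -> value written
    for cycle in range(num_packets + rmw_cycles):
        if cycle < num_packets:
            # The arriving packet reads whatever state is visible now.
            pending[cycle + rmw_cycles - 1] = state + 1
        if cycle in pending:
            # The write scheduled for this cycle lands.
            state = pending.pop(cycle)
    return state


if __name__ == "__main__":
    print(run_counter(6, rmw_cycles=1))  # every update is seen by the next packet
    print(run_counter(6, rmw_cycles=3))  # packets read stale state, updates are lost
```

With `rmw_cycles=1` the counter ends at the number of packets. With a longer read-modify-write, later packets read stale state and some increments are silently lost, which is the violation of sequential semantics described above.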

In SCR, we do not perform complex stateful operations in the sequencer. Instead, we run the complex stateful computations (network functions) only on CPU cores. The main stateful operation we add to the line-rate pipeline is a ring buffer, something we show to be doable efficiently at high clock speeds.
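The division of labor can be sketched in software. This is a simplified illustration and not the paper's implementation: it assumes round-robin dispatch, a history of exactly `num_cores - 1` packets attached to each packet, and a per-flow packet counter standing in for the network function. The `Sequencer`, `Core`, and `run` names are hypothetical.

```python
from collections import deque


class Sequencer:
    """Keeps a ring buffer of recent packets and sprays packets across cores."""

    def __init__(self, num_cores: int):
        self.num_cores = num_cores
        # Ring buffer: holds the packets a core missed since its last turn.
        self.history = deque(maxlen=num_cores - 1)
        self.next_core = 0

    def dispatch(self, pkt):
        core = self.next_core
        out = (core, list(self.history), pkt)  # attach history to the packet
        self.history.append(pkt)               # oldest entry is overwritten
        self.next_core = (core + 1) % self.num_cores
        return out


class Core:
    """Runs the stateful computation on a private replica of the state."""

    def __init__(self):
        self.count = {}

    def _update(self, flow):
        self.count[flow] = self.count.get(flow, 0) + 1

    def process(self, history, pkt):
        for missed in history:  # replay packets that went to other cores
            self._update(missed)
        self._update(pkt)
        return self.count[pkt]


def run(flows, num_cores):
    seq = Sequencer(num_cores)
    cores = [Core() for _ in range(num_cores)]
    results = []
    for flow in flows:
        core, history, pkt = seq.dispatch(flow)
        results.append(cores[core].process(history, pkt))
    return results


if __name__ == "__main__":
    print(run(["a", "b", "a", "a", "b", "a"], num_cores=3))
```

In this sketch the only per-packet stateful work in the sequencer is an append to a fixed-size buffer, while each core catches up on the packets it missed and so never needs to share state with the other cores.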

Blake Pelton

Oct 27

Thanks for the explanation. The comments for this blog have been pretty quiet.

I wonder how fast a handful of CPU cores could implement something like the Tofino sequencer from Fig. 4(b). It seems like there is no feedback between Tofino pipeline stages, so if one CPU core is too slow, then the pipeline could be spread across multiple CPU cores, right?

Srinivas Narayana

Oct 28

You're welcome.

It is certainly possible to implement the sequencer in software. It isn't trivial to do an efficient multi-core implementation, however, since there is shared global state updated by each packet (the packet history in the ring buffer).

Dangling Pointers
