Discussion about this post

Neural Foundry

The AccDirect optimization is clever because it eliminates the producer core bottleneck by moving packet enqueuing off to the NIC entirely. The power savings make sense, but I'm surprised the latency improvement is so modest given that you're cutting out multiple hops through CPU memory. Maybe the PCIe peer-to-peer write latency is eating most of those gains? Also wondering how this scales with multiple producer NICs all trying to write to the DLB simultaneously.

Zhiyuan Zhang

Nice summary! Was just reading the paper today, gonna try to recreate it on my server env.
