Presenter: Chuanxiong Guo
Co-Authors: Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, Marina Lipshteyn
TCP is still the dominant protocol for moving packets in datacenters, but it suffers from high latency and CPU overhead. RDMA avoids both problems in hardware, but it needs a lossless network; in RoCEv2 this is provided by Priority-based Flow Control (PFC), which pauses upstream senders to prevent buffer overflow. In this talk, the authors discuss their experience deploying RDMA over commodity Ethernet (RoCEv2) in Microsoft's datacenters, and how they overcame the challenges they encountered.
PFC is a hop-by-hop flow control protocol that uses the priority each packet carries in its VLAN tag. The authors argue that this does not scale well: it breaks PXE boot, which is needed for OS provisioning, and there is no standard way to carry VLAN tags across layer 3 networks. They instead carry the priority in the DSCP field of the IP header, removing the VLAN-related issues entirely.
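The move from VLAN priority to DSCP is just a change in where the same 3-bit priority is encoded. The sketch below is illustrative (the function names are mine, not from the talk): it packs a priority into the VLAN tag's PCP field and, alternatively, maps it to a DSCP class selector that needs no VLAN tag at all.

```python
# Illustrative encoding of a 3-bit priority, two ways (names are hypothetical):
# (1) in the PCP bits of the 16-bit VLAN TCI, (2) as a DSCP class selector.

def priority_to_vlan_tci(priority: int, vlan_id: int = 0) -> int:
    """Pack a 3-bit priority into the VLAN TCI: PCP (3b) | DEI (1b) | VID (12b)."""
    assert 0 <= priority < 8 and 0 <= vlan_id < 4096
    return (priority << 13) | vlan_id  # DEI bit left as 0

def priority_to_dscp(priority: int) -> int:
    """Map a priority to a DSCP class selector (CS0..CS7); low 3 bits are zero."""
    assert 0 <= priority < 8
    return priority << 3  # e.g. priority 3 -> DSCP 24 (CS3)

def dscp_to_tos(dscp: int) -> int:
    """DSCP is the top 6 bits of the IPv4 ToS byte; the low 2 bits are ECN."""
    assert 0 <= dscp < 64
    return dscp << 2

# Priority 3 without any VLAN tag: DSCP 24, ToS byte 0x60.
print(priority_to_dscp(3), hex(dscp_to_tos(priority_to_dscp(3))))  # 24 0x60
```

Because the DSCP class selector lives in the IP header, it survives layer 3 forwarding and works before any VLAN configuration exists, which is the scalability point the authors make.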
In addition to this, they observed transport livelock despite low packet drop rates: the NIC's go-back-0 recovery restarted transmission from the very first packet of a message after every loss. To overcome this, they switched to go-back-N, retransmitting from the first unacknowledged packet instead. They do not observe deadlocks, since packets travel up and then down the network tree without looping, so the cyclic buffer dependencies that cause PFC deadlock cannot form.
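The difference between the two recovery policies can be shown in a few lines. This is a hedged sketch of the resume-point logic only (names and structure are mine, not the NIC implementation):

```python
# Sketch of the two retransmission policies discussed: go-back-0 restarts the
# whole message after any loss, while go-back-N resumes from the first
# unacknowledged packet. Illustrative only, not the NIC firmware logic.

def retransmit_start(policy: str, first_unacked: int) -> int:
    """Return the sequence number retransmission resumes from."""
    if policy == "go-back-0":
        return 0               # restart the entire transfer: livelock-prone
    if policy == "go-back-n":
        return first_unacked   # redo only the unacknowledged tail
    raise ValueError(f"unknown policy: {policy}")

# One packet lost near the end of a 10,000-packet message: go-back-0 resends
# everything, go-back-n resends almost nothing.
assert retransmit_start("go-back-0", 9_999) == 0
assert retransmit_start("go-back-n", 9_999) == 9_999
```

Under go-back-0, even a tiny loss rate makes a long transfer restart repeatedly and never finish, which is exactly the livelock the authors saw.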
Further, a malfunctioning NIC may block the entire network: the pause frames it emits hold packets in upstream buffers, which in turn emit their own pause frames, cascading in a domino effect. The authors install watchdogs that monitor for such pause storms and disable lossless mode on the affected ports to stop the propagation.
A latency comparison between RoCEv2 and TCP shows that the former's 99.9th percentile beats even the 99th percentile of the latter. RoCEv2 also handles incast directly, since the network is lossless. On a network of 500 servers across two podsets, RDMA achieves 3 Tb/s of aggregate throughput. Similar measurements were performed while shuffling data: latency increases as the shuffle proceeds, and the authors observe that, in practice, low latency and high throughput cannot be achieved at once, since even a small amount of congestion inflates latency.
In the future, the authors hope to explore RDMA for inter-datacenter communication, gain a deeper understanding of deadlocks, and develop more applications that use RDMA instead of TCP.
Q: You haven't
talked about the classical problem of lossless network (tree-saturation) -
localized congestion that spreads out and blocks other innocent bystander
traffic at other ports. Do you notice this?
A: We do experience this in our production networks, and we have several approaches to deal with it. DCQCN can be used, buffer sizes are tuned so that we can use dynamic buffering, and we have ways to deal with pause storms.
Q: Your network is lossless. Why did you still need retransmission?
A: Lossless means packets shouldn't be dropped due to congestion. But there can be other kinds of packet drops, such as those caused by switch hardware issues or corruption. Hence, we need packet retransmission to recover from those.
Q: Are those frequent?
A: The network is large scale, so we do need to handle them.
Q: Can you share
details on workload on top of RDMA? Why are these latency sensitive and what
are the latency requirements?
A: We currently have two types of workloads. The first is Bing traffic; indexing is very important there, and the requirement is to reduce latency to the millisecond level. The other type, usually storage, is throughput-centric and moves a lot of traffic, so there we need to reduce CPU overhead.
Q: Are there a lot
of NIC firmware changes or OS level changes?
A: To make the NIC work, the transport layer is implemented in NIC firmware, together with a driver. But for mechanisms like PFC and the transport protocols, we avoid changing NIC firmware ourselves; our providers supply the NIC firmware we need.