Authors: Danyang Zhuo (University of Washington), Monia Ghobadi (Microsoft Research), Ratul Mahajan (Microsoft Research & Intentionet), Klaus-Tycho Förster (Aalborg University), Arvind Krishnamurthy (University of Washington), Thomas Anderson (University of Washington)
Presenter: Danyang Zhuo
[Link to the paper]
Several techniques for reducing packet loss have been studied, but they mostly focus on congestion, which is only one source of packet loss. Packet corruption, another source, has received little attention.
Danyang Zhuo et al. studied packet corruption in data centre networks (DCNs). They monitored around 350K switch-to-switch optical links across 15 data centres of a major cloud provider over seven months.
Based on their measurements and analyses, they found that:
- Packet corruption is a significant source of packet loss;
- Packet corruption has distinct symptoms and root causes, and its characteristics differ from those of congestion losses;
- Corruption affects fewer links than congestion but imposes heavier loss rates;
- Unlike congestion, the corruption rate on a link is stable over time and is not correlated with its utilization.
To mitigate corruption, they proposed CorrOpt, a system that disables corrupting links while ensuring that each top-of-rack switch retains a minimum number of paths to reach other switches.
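The constraint that each top-of-rack (ToR) switch keeps a minimum number of paths can be illustrated with a small check. This is only a sketch under stated assumptions, not the algorithm from the paper: the topology is a hypothetical two-tier tree, and the names `count_paths`, `can_disable`, and `min_fraction` are invented for illustration. It counts the upward paths from each ToR to the spine layer and allows a link to be disabled only if every ToR keeps at least a given fraction of its original paths.

```python
def count_paths(uplinks, node, spines, disabled):
    """Count upward paths from `node` to any spine switch, skipping disabled links."""
    if node in spines:
        return 1
    return sum(
        count_paths(uplinks, parent, spines, disabled)
        for parent in uplinks.get(node, [])
        if (node, parent) not in disabled
    )


def can_disable(link, uplinks, tors, spines, disabled, min_fraction):
    """Return True if disabling `link` leaves every ToR with at least
    `min_fraction` of its original paths to the spine layer."""
    candidate = set(disabled) | {link}
    for tor in tors:
        total = count_paths(uplinks, tor, spines, frozenset())
        remaining = count_paths(uplinks, tor, spines, candidate)
        if remaining < min_fraction * total:
            return False
    return True


# Hypothetical topology: two ToRs, two aggregation switches, two spines.
# Each entry maps a switch to the switches one layer above it.
uplinks = {
    "tor1": ["agg1", "agg2"],
    "tor2": ["agg1", "agg2"],
    "agg1": ["spine1", "spine2"],
    "agg2": ["spine1", "spine2"],
}
tors = ["tor1", "tor2"]
spines = {"spine1", "spine2"}

# Each ToR starts with 4 paths. Disabling one ToR uplink halves that,
# which passes a 50% threshold but fails a 75% threshold.
ok_at_half = can_disable(("tor1", "agg1"), uplinks, tors, spines, set(), 0.5)
ok_at_three_quarters = can_disable(("tor1", "agg1"), uplinks, tors, spines, set(), 0.75)
```

In this toy topology a corrupting ToR uplink could be turned off under a 50% threshold but would have to stay up under a 75% threshold, which is the kind of trade-off between corruption loss and remaining capacity that such a constraint encodes.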
CorrOpt also recommends specific actions to repair disabled links:
- Clean dirty connectors
- Replace damaged fibres
- Replace transceivers with dying lasers
There was a discussion after the talk.
Q1: What percentage of links have corruption?
A1: I can't comment on that. I can only report the ratio relative to links with congestion.
Q2: What other remedies are there to mitigate corruption, other than bringing down links, e.g. reducing the rate or changing the error code, …?
A2: What we found in our data centres is that the loss rate is not correlated with traffic. So, I think reducing the rate wouldn’t remedy those hardware failures. We are targeting a deployable solution. As for error correction codes and end-to-end approaches, I don’t think people do that. I think 100G technology has it; 10G and 40G usually don’t have it on the switch.
Q2: How about changing the modulation?
A2: For data centre transceivers there is no option to change the modulation.
Q2: Really?
A2: It is fixed, but for long haul I’m not sure (maybe it is changeable).
Q3: How often do you have multiple corrupting links on the same path?
A3: The details are in the paper. What we found is that for contamination (when you have dirt on the fibre), the affected links tend to be scattered across the entire DC, but failures such as optical breakout cable or switch hardware problems affect multiple links on the same switch.