Authors: Amin Vahdat, Keith Marzullo
This paper is about fault recovery in data
center networks.
A single failure can disconnect a large part of
the network.
When a switch fails, it takes time for the
network to propagate the new updates and reconnect all parts of the
network.
Failures happen very frequently (80%), and
they are impactful and far-reaching.
Their effects are lasting: the network can take 10 sec to recover.
One solution is to add extra links, but
how can we find the extra ports needed for those links?
1. Increase the number of ports on each switch and
double the link for each connection. This solution is expensive.
2. Build a bigger network: add more switches at
the top level and add one more layer of switches. However, more
switches make paths longer.
3. Free up ports by removing some of the switches'
existing links, which makes room for more redundant links. This
scheme has a scalability problem, because the network loses some hosts.
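To make the cost of options 2 and 3 concrete, here is a minimal Python sketch. It relies on the standard result that a fat tree with `levels` layers of k-port switches supports 2*(k/2)^levels hosts. The function names and the `dup_factor` parameter are hypothetical, and the rule that duplicating links by some factor divides the host count by the same factor is a simplifying assumption for illustration, not a formula taken from these notes.

```python
def fat_tree_hosts(k: int, levels: int) -> int:
    """Hosts supported by a standard fat tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count >= 2")
    return 2 * (k // 2) ** levels


def longest_path_links(levels: int) -> int:
    """Links on the longest host-to-host path: up to a root and back down."""
    return 2 * levels


def hosts_with_redundancy(k: int, levels: int, dup_factor: int) -> int:
    """Option 3 (assumed model): ports spent on duplicate links are taken
    away from hosts, so the host count shrinks by dup_factor."""
    return fat_tree_hosts(k, levels) // dup_factor


if __name__ == "__main__":
    k, levels = 16, 3
    print("baseline hosts:", fat_tree_hosts(k, levels))
    # Option 2: one more layer of switches, so more hosts but longer paths.
    print("extra layer hosts:", fat_tree_hosts(k, levels + 1),
          "longest path:", longest_path_links(levels + 1))
    # Option 3: same switches and depth, doubled links at one level.
    print("doubled links hosts:", hosts_with_redundancy(k, levels, 2),
          "longest path:", longest_path_links(levels))
```

Under these assumptions, option 2 keeps or grows the host count at the price of two more links on the longest path, while option 3 keeps paths short but gives up hosts.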
Aspen trees:
Multi-rooted trees with extra links at one
or more levels (e.g. VL2).
The contribution of this paper is to characterize
the tradeoffs between fault tolerance, scalability and network size.
The evaluation results are compared against
OSPF.
Q: There is a lot of existing work on fault
tolerance, so what's new?
A: What's new is a slightly different point
in the design space. They add a little bit of latency instead of adding hardware.
Some topologies designed for supercomputers are Aspen trees or a subset of
Aspen trees, for example ones that split links or double the number of links at
all levels. The goal here is to work out what the tradeoffs are and what the math
behind them is, so that designers don't need to guess what the fault tolerance
option they choose will cost.