Presenter: Yu-Wei Eric Sung
Co-Authors: Xiaozheng Tie, Starsky H.Y. Wong, Hongyi Zheng
Managing a network the size of Facebook's involves tasks such as router/switch
turn-up and circuit migration, and these usually need a lot of human
interaction. This is challenging because a task may involve low-level
configuration across distributed devices, and the network may span multiple
domains and sub-networks with different versions across them. As a result,
incorrect management can cause a lot of issues for users and impact Facebook
significantly too. This becomes particularly challenging as the network grows
bigger.
Facebook's network is set up so that there are multiple POPs near the user,
each of which may or may not be able to satisfy an incoming request from the
user. If the POP cannot satisfy the request, it is sent through the backbone
to a datacenter (DC) and the response is sent back. Management tasks in this
kind of network include adding or migrating circuits or routers in the
backbone, and building clusters or upgrading capacity in the datacenter. It is
also possible, especially in datacenters, for multiple generations of cluster
architectures to co-exist. This makes these tasks particularly challenging as
the network evolves.
Facebook implements this top-down network management essentially through a
model of the network called FbNet. This process is extended to model the
entire network. There are also read and write APIs to interact with the model,
where a write consists of atomic tasks. Storage consists of a single primary
database and multiple secondary databases, some of which are locally
available.
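
To make the read/write split concrete, here is a minimal sketch of a model store with a read API and an atomic write API. It is not FbNet's actual API: the `ModelStore` class, its method names, and the "every object needs a `type` field" validation rule are all hypothetical, and it assumes "atomic" means that a batch of changes is applied all-or-nothing. It also ignores the primary/secondary database replication entirely.

```python
import copy


class ModelStore:
    """Hypothetical in-memory stand-in for a network model store."""

    def __init__(self):
        self._objects = {}  # object id -> dict of fields

    def read(self, obj_id):
        # Read API: return a copy so callers cannot mutate the store directly.
        return copy.deepcopy(self._objects[obj_id])

    def write(self, changes):
        """Write API: apply a list of (obj_id, fields) changes atomically.

        Either every change is applied or, if any change fails
        validation, none of them are.
        """
        staged = copy.deepcopy(self._objects)
        for obj_id, fields in changes:
            # Made-up validation rule, for illustration only.
            if "type" not in fields:
                raise ValueError(f"object {obj_id!r} is missing a 'type' field")
            staged[obj_id] = dict(fields)
        # Commit only after every change has been staged successfully.
        self._objects = staged


store = ModelStore()
store.write([
    ("sw1", {"type": "switch"}),
    ("lc1", {"type": "linecard", "switch": "sw1"}),
])

try:
    store.write([("sw2", {"type": "switch"}), ("bad", {"name": "no type"})])
except ValueError:
    pass  # the whole batch is rejected, so "sw2" was not written either
```

Staging the changes on a copy and swapping it in at the end is the simplest way to get all-or-nothing behaviour in memory; a real deployment would presumably rely on database transactions instead.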
Network design intent is conveyed through the creation of distinct objects.
For example, for the above 4-pop cluster, FbNet creates distinct objects for
the network switch, the linecard, the physical interfaces, prefixes, circuit,
etc., with relationship fields pointing to other objects. Following this,
design intent is translated into device configs, starting from templates for a
cluster, which are validated and then converted into FbNet objects and then
into per-device config objects. All of these are vendor-agnostic. After this,
they are converted into vendor-specific configs and deployed.
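
The flow from objects with relationship fields, to a vendor-agnostic per-device config, to vendor-specific text can be sketched roughly as follows. Everything here is an assumption made for illustration: the class names, the fields, the `to_device_config` and `render` helpers, and especially the two "vendor" syntaxes, which are made up and do not correspond to any real device CLI or to Robotron's real templates.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class Switch:
    name: str


@dataclass
class Linecard:
    slot: int
    switch: Switch  # relationship field pointing to another object


@dataclass
class PhysicalInterface:
    port: int
    linecard: Linecard  # relationship field pointing to another object
    prefix: str


def to_device_config(iface):
    """Build a vendor-agnostic per-device config object (a plain dict here)."""
    return {
        "device": iface.linecard.switch.name,
        "interface": f"{iface.linecard.slot}/{iface.port}",
        "address": iface.prefix,
    }


# Made-up vendor syntaxes, for illustration only.
VENDOR_TEMPLATES = {
    "vendor_a": Template("interface et-$interface\n  address $address\n"),
    "vendor_b": Template("set interfaces eth$interface ip $address\n"),
}


def render(config, vendor):
    """Turn the vendor-agnostic config into vendor-specific text."""
    return VENDOR_TEMPLATES[vendor].substitute(config)


sw = Switch("psw1")
lc = Linecard(1, sw)
iface = PhysicalInterface(3, lc, "10.0.0.1/31")

cfg = to_device_config(iface)
print(render(cfg, "vendor_a"))
print(render(cfg, "vendor_b"))
```

The point of the intermediate dict is that the same design objects can feed any number of vendor templates without the model knowing about vendor syntax.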
By monitoring the usage of FbNet, the authors observed that the FbNet model
changes a lot over time and keeps changing with new models and relationships,
as opposed to becoming stable. POP and DC designs change a lot, with a single
design change affecting almost 1000 devices on average. Backbone designs also
change, but do not affect more than 10 devices on average. Similar results are
seen with Robotron when looking at the number of configuration lines that
change as designs are changed.
Robotron has evolved bottom-up over time and has been largely driven by
experience, from 2008, when FbNet was modeled, to active and passive
monitoring in 2011, with deployment happening over time. Over this period, the
authors have gained key insights into the process of modelling, the coupling
between network design, config generation and deployment, and the need for
manual involvement at times. These came directly from experiences involving
egress link saturation and failing to deploy after designing a new rack.
Q: You talked about your network model: there are some fields, and some of
them have values. You talked about the syntax, but not the semantics. How
would you define the meaning so that some formal analysis could be conducted
on it?
A: There is no
formal semantics. We would like to work more on that too.
Q: About your claim that this is the first paper of this kind: isn't there a
conference called NOMS for network management related papers, including
carrier networks?
A: I am not very aware of it, but will check it out.
Q: At the end of your talk, you mentioned you needed a manual override. Do you
have any characterization of the types of situations where this was necessary?
A: One experience was when the version of the model in use didn't model BGP
policies correctly, and we couldn't wait for that.
Q: Are there
fundamentally some things Robotron cannot handle or is it just lack of
anticipation?
A: It is usually
lack of anticipation. But, sometimes, performance isn't guaranteed even if
Robotron can handle something, so it becomes important to manually handle some
things.