Sunday, November 24, 2013
HotNets '13: Active Security
Today's security systems have very limited programmability. They can work individually to detect and respond to attacks, but they cannot actively collect information, attribute attacks, and adjust their configuration based on observations to prevent future attacks.
We propose an active security architecture. It has a programmable control interface that automates attack detection, data collection, attack attribution, and system reconfiguration in reaction to attacks. The new architecture has four major components: 1) protection: a security infrastructure that can monitor and react to common attacks; 2) sensing: collecting alerts from a variety of sensors such as intrusion detection systems or individual firewalls; 3) data collection: in case of a potential attack, collecting data from across the whole system (e.g., routers, firewalls, and individual devices in the network) for attribution of the attack; 4) adjustment: a programmable interface to reconfigure the network to react to the attack and to monitor future attacks.
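To make the loop concrete, here is a minimal sketch of how these four components might fit together. All class and method names are our own illustration, not the paper's design.

```python
# Hypothetical sketch of the active-security feedback loop described above;
# every name below is illustrative, not from the paper.

class ActiveSecurityController:
    def __init__(self, sensors, collectors, network):
        self.sensors = sensors        # e.g. IDS alert feeds, firewall logs
        self.collectors = collectors  # routers, firewalls, end hosts
        self.network = network        # programmable reconfiguration interface

    def run_once(self):
        # Sensing: gather alerts from the deployed sensors.
        alerts = [a for s in self.sensors for a in s.poll()]
        for alert in alerts:
            # Data collection: pull forensic data related to the alert.
            evidence = [c.collect(alert) for c in self.collectors]
            # Attribution: decide where the attack is coming from.
            attacker = self.attribute(alert, evidence)
            # Adjustment/protection: reconfigure to block and keep watching.
            if attacker is not None:
                self.network.quarantine(attacker)
                self.network.add_monitor(attacker)

    def attribute(self, alert, evidence):
        # Placeholder: correlate the evidence to a suspected source.
        return alert.suspected_source
```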
Comment:
Q: You said existing systems are often a step behind the attacks, but it seems your architecture also builds on the hard work done by others.
A: Some work has been done on dynamic malware analysis. If it can be done in near real-time, we can do a better job.
Q: What kind of data do you need to do this active security? Can you speculate on the datasets that could be useful?
A: Probably not at this time. This is ongoing work; we can't comment on that fully right now.
Q: Why would a victim host trust someone to pull its memory?
A: Trust needs to be built between the supervisor and the host
HotNets '13: On the Risk of Misbehaving RPKI Authorities
Presenter: Danny Cooper
RPKI is a new security architecture for BGP that can prevent malicious parties from originating routes for IP prefixes that don't belong to them, attacks also known as prefix and subprefix hijacks. The idea is to build a trusted mapping between ASes and their IP prefixes.
However, serious risks arise if the RPKI authorities are misconfigured or compromised. This paper explores such risks from different angles and shows that the new architecture gives authorities (such as the owner of a superset of IP prefixes) arbitrary power to unilaterally reclaim IP prefixes, while leaving the targeted ASes little power to protect themselves. The problem is even more severe when the authority and the targeted AS are in different countries.
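As a toy illustration of that power imbalance (our simplification, not the paper's model), consider a validator in which a covering ROA determines a route's validity; whoever controls the ROA controls the fate of every covered route, and the affected AS cannot override its parent:

```python
# Simplified sketch: deleting a ROA strips a route of its RPKI protection.
import ipaddress

roas = {  # prefix -> authorized origin AS (simplified: one ROA per prefix)
    "198.51.100.0/24": 64500,
}

def covered(roa_prefix, route_prefix):
    p = ipaddress.ip_network(roa_prefix)
    r = ipaddress.ip_network(route_prefix)
    return r.subnet_of(p)

def validate(route_prefix, origin_as):
    for prefix, asn in roas.items():
        if covered(prefix, route_prefix):
            return "valid" if asn == origin_as else "invalid"
    return "unknown"  # no covering ROA

print(validate("198.51.100.0/24", 64500))  # -> valid
del roas["198.51.100.0/24"]                # parent unilaterally revokes
print(validate("198.51.100.0/24", 64500))  # -> unknown; protection is gone
```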
In conclusion, this study shows that the new architecture brings with it a completely new set of problems, which researchers in this area should take note of.
Q: What would happen if multiple trees were not allowed and there were only a single root of authority?
A: It doesn't change much, as someone in an upper layer can still manipulate you.
Q: What happens if you have two valid certificates claiming the same segment?
A: Currently it doesn't provide mechanisms to deal with conflicts.
Q: Who will be the entities that do the monitoring work?
A: Anyone can run our current monitoring tools: universities, institutes.
Saturday, November 23, 2013
Routing/Management Panel discussion
Q: John Wroclawski pointed to the main themes in the session: reconfigurability (Merlin), adaptability (RC3), and resilience (Plinko), and wondered if any of the three systems could benefit from any of the other themes.
Brent Stephens and Robert Soule both agreed that their systems could benefit from reconfigurability and resilience respectively.
Q (John Wroclawski): Would incremental recompilation benefit Merlin?
Robert Soule: That is definitely something we want to look at. If we look at just the modified subset of the network and policies, maybe we can adapt to changes in network configuration more quickly.
HotNets' 13: How to Improve Your Network Performance by Asking Your Provider for Worse Service
This paper makes two main observations to motivate its design. First, network providers typically over-provision their networks (for resiliency). Second, TCP's congestion control is cautious and ramps up slowly in Slow Start. These two behaviors interact adversely and, in an Internet where most flows never leave Slow Start, waste capacity.
To address this, the authors step back and look at the goals of congestion control: to fill the pipe for high throughput, and to do no harm to other flows. Traditionally, a single mechanism pursues these two conflicting goals. The key insight of their system, RC3, is to decouple the two: run regular TCP at high priority, and fill up the pipe at lower priority. They call this worse quality of service (WQoS) and argue that the priority queuing mechanisms already present in switches can be re-purposed for this task, by providing several layers of worse quality of service.
Next, they develop an idealized mathematical model to analyze the performance gains of RC3 relative to vanilla TCP as a function of flow size. Broadly, for flow sizes less than the initial window, RC3 gives no gains. For flow sizes between the initial window and the bandwidth-delay product, RC3's gains increase monotonically; beyond the bandwidth-delay product, they start falling off. They also observe that the maximum possible gain increases with increasing bandwidth-delay product, making it future-proof.
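To see why the gains take this shape, here is a back-of-the-envelope version of the model in our own notation (the paper's analysis is more careful). For a flow of N packets, initial window IW, round-trip time RTT, and capacity C, on an otherwise idle path:

```latex
\[
\mathrm{FCT}_{\mathrm{TCP}} \approx
  \Big\lceil \log_2\!\big(\tfrac{N}{IW}+1\big) \Big\rceil \cdot \mathrm{RTT}
  + \tfrac{N}{C},
\qquad
\mathrm{FCT}_{\mathrm{RC3}} \approx \mathrm{RTT} + \tfrac{N}{C}.
\]
```

For N <= IW both expressions reduce to one RTT plus the transmission time, so there is no gain; for IW < N <= BDP the slow-start term grows with N while RC3 stays at one RTT; beyond the BDP the shared transmission term N/C dominates both, so the relative gain falls off.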
RC3's design has two control loops: the standard TCP control loop, and RC3's own control loop. RC3 tries to minimize overlap between packets sent by the two control loops by transmitting the flow's packets in reverse order for RC3's control loop, and in the standard forward order for TCP's control loop. To make this scheme feasible with a fixed number of priority levels, RC3 transmits (say) 4 packets at priority level 1, 40 at priority level 2, 400 at level 3, and so on, until it crosses over into the standard loop's transmissions. Furthermore, to ensure flow-level fairness, every flow gets to transmit exactly the same number of packets at each priority level, ensuring that long flows can't squeeze out the shorter ones. They leave loss recovery to TCP's loss recovery mechanism, but require SACK to selectively acknowledge packets that are received through RC3's control loop.
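The priority assignment is simple enough to sketch in a few lines of Python. The per-level counts follow the "4, 40, 400" example above; everything else is our illustration, not the paper's implementation.

```python
# Sketch of RC3's two control loops over one flow's packets.

def rc3_schedule(num_pkts, init_window=4, base=4, levels=3):
    """Return (packet index, priority) pairs. Priority 0 is the standard
    TCP loop (front of the flow, in order); priorities 1..levels are RC3's
    low-priority loop (back of the flow, in reverse order)."""
    sched = [(i, 0) for i in range(min(init_window, num_pkts))]
    next_pkt = num_pkts - 1
    count = base
    for prio in range(1, levels + 1):
        for _ in range(count):
            if next_pkt < init_window:   # met the TCP loop: stop
                return sched
            sched.append((next_pkt, prio))
            next_pkt -= 1
        count *= 10
    return sched

# A 60-packet flow: packets 0-3 at priority 0, 59-56 at priority 1,
# 55-16 at priority 2, and 15-4 at priority 3 (truncated at the TCP loop).
print(rc3_schedule(60)[:8])
```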
Their evaluations show that their performance gains track their theoretical model well. The authors also compare RC3 with increasing the Initial Window, RCP, and traditional QoS and see that it improves flow completion times in all cases.
Q&A
1. How is this different from pFabric?
Ans: They focused their design largely on the data-center context. Our gains, on the other hand, are much more pronounced in wide-area settings with high bandwidth-delay products, and much less so in the data-center context due to smaller bandwidth-delay products.
2. Does the application set priorities for the packets?
Ans: No, the OS does this in the kernel. It happens automatically and it ensures that the longer flows do not starve the shorter ones.
3. Floyd and Allman proposed a similar mechanism a while back. They had trouble implementing this using the scheduler mechanisms in Linux.
Ans: We implemented it in Linux and it worked fine for us.
4. What about per-packet delay? Since RC3 sends the initial window + 4 + 40 + ... packets in one burst, doesn't it lead to large per-packet delays for other flows that are not interested in flow completion times, but individual per-packet delays?
Ans: The flows interested in per-packet delay would be sent at the same priority as standard TCP. Hence their packets would pre-empt packets that are sent at the lower priority levels by RC3, and would only compete with packets sent by the standard TCP control loop, just like they would today.
5. Did you explore any congestion control for the RC3 loop itself, instead of sending all the packets at each priority level all at once?
Ans: No. We stuck with this idea because it was simple and worked.
6. Let's say two users want to start a connection simultaneously. Does RC3 give any gains to either user over standard TCP?
Ans: RC3 doesn't do any worse than TCP in the worst case. But if there is enough spare capacity for both users to fill the pipe quickly, RC3 will allow both flows to complete faster than standard TCP.
HotNets '13: Plinko: Building Provably Resilient Forwarding Tables
The Internet is designed to be robust with redundant paths between nodes so that data can be delivered in spite of failures. However, most networks today drop data, at least for some time, when a link fails. This paper introduces a new forwarding model, Plinko, which can tolerate up to t failures with no packet drops.
Plinko's algorithm initially chooses a set of default routes for all source/destination pairs. Once these routes are chosen, the algorithm iteratively increases resilience by installing new alternate routes to protect the paths built in the previous round. The main challenge with this approach is that the backup routes can end up forming loops. Plinko solves this problem by performing matching on the reverse path of a packet, which is accumulated in the packet's header.
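A greatly simplified sketch of this forwarding model (ours, not the paper's formal one): rules match on the destination plus a suffix of the path recorded in the packet header, so a packet bounced back by a failure can select a backup route instead of looping.

```python
# Toy Plinko-style switch: rules are (dst, reverse-path suffix, next hop).

class Packet:
    def __init__(self, dst):
        self.dst = dst
        self.path = []   # switches traversed, accumulated in the header

class Switch:
    def __init__(self, name, links_up):
        self.name = name
        self.links_up = links_up   # next hops whose links are currently up
        self.rules = []            # (dst, rev_path_suffix, next_hop)

    def forward(self, pkt):
        rev = tuple(reversed(pkt.path))
        # Try the most specific rule (longest reverse-path suffix) first.
        for dst, suffix, nh in sorted(self.rules, key=lambda r: -len(r[1])):
            if dst == pkt.dst and rev[:len(suffix)] == suffix \
                    and nh in self.links_up:
                pkt.path.append(self.name)
                return nh
        return None  # no live rule: the packet is dropped

# Default route for h2 goes via s2; the backup (installed in a later
# round of the algorithm) detours via s3 when the s1-s2 link is down.
s1 = Switch("s1", links_up={"s3"})
s1.rules = [("h2", (), "s2"), ("h2", (), "s3")]
print(s1.forward(Packet("h2")))   # -> s3, via the backup rule
```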
In order to reduce the amount of state that needs to be stored, Plinko also employs compression, grouping together rules with the same output path. A nice property of Plinko's compression is that the achieved compression ratios are much higher for larger numbers of nodes and increased resiliency requirements, exactly the cases where the state would otherwise be correspondingly higher. This makes Plinko fairly scalable; it can be used to build a 4-resilient network for topologies with up to about ten thousand hosts.
Q) In the presence of failures, we may want to prioritize some paths over others. How will this affect you?
A) Perhaps we can use a reservation or bandwidth aware version of Plinko. Typically, in datacenters, you only need best-effort.
Friday, November 22, 2013
HotNets '13: Managing the Network with Merlin
Presented by: Robert Soulé
Enforcing network management policies is complicated with current techniques. Many diverse devices have to be configured in multiple ways to enforce even relatively simple policies. When these rules need to be modified, it is even harder to reason about the effect of the changed rules on the different devices and on the network as a whole. Further, current techniques assume that networks belong to a single administrative authority and do not allow delegation of authority.
The paper addresses the above challenges. First, in order to make policies easier to express and easier to implement, the paper offers an abstraction for global networking policies using a high-level declarative language, which is then mapped to a constraint problem. Second, the paper introduces a framework where authority can be delegated to "tenants". To make sure that tenants are actually obeying their constraints, the paper also introduces a technique for verification.
Policies are expressed using a combination of logical predicates, regular expressions, and bandwidth constraints, which allows the administrator to specify the intended behavior of the network at a high level of abstraction. A compiler evaluates these policies and generates the code: the problem is mapped, via an NFA, into a constraint problem, which is solved using linear programming.
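As a toy illustration of the compilation idea (our simplification; the real compiler builds an NFA/topology product and solves a linear program), a policy can be viewed as a predicate, a regular expression over switch names, and a bandwidth bound:

```python
# Hypothetical mini-Merlin: enumerate topology paths and keep only those
# whose node sequence matches the policy's regular expression.
import re

topology = {"in": ["fw", "core"], "fw": ["core"], "core": ["out"]}

def paths(src, dst, path=None):
    path = (path or []) + [src]
    if src == dst:
        yield path
    for nxt in topology.get(src, []):
        if nxt not in path:
            yield from paths(nxt, dst, path)

policy = {"pred": "tcp.dst_port = 80",          # which traffic it governs
          "regex": r"in( \w+)* fw( \w+)* out",  # must traverse the firewall
          "min_bw": 100e6}                      # enforced by the LP (elided)

for p in paths("in", "out"):
    if re.fullmatch(policy["regex"], " ".join(p)):
        print("feasible path:", p)   # -> ['in', 'fw', 'core', 'out']
```

The bandwidth constraints are where the linear program comes in: each feasible path becomes a variable, and the minimum/maximum bandwidth bounds become constraints over those variables.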
To evaluate Merlin, the authors ran a TCP trace and a UDP trace in two network configurations - one with Merlin and another without. They noted that the TCP trace performed much better in the configuration with Merlin (which was configured to grant higher priority to TCP). The authors also evaluated Merlin on a testbed with 140 nodes to check how well the system scales. They noted that it took 6 seconds to generate and run the linear programming model, which they think is fairly low.
Q) What happens when there are failures?
A) One possible approach is to model the failures as input to the constraint problem.
Q) How scalable is Merlin? Constraint solving may not scale well with a higher number of devices and with different types of devices?
A) We looked at networks with 140 nodes. We think it's fairly good to look at numbers of this scale. Google B4 was doing the same, but they used only a dozen classes of traffic, whereas we handle about a hundred classes of traffic on 140 different nodes. Scaling to even larger sizes may be possible by coming up with engineering approaches to further reduce the time to solve the linear programming problem. Alternately, perhaps we can adopt appoximation solutions which could result in much faster run times at a small accuracy tradeoff
Q) What does the NFA abstraction buy you?
A) We are language researchers and want to program in abstractions. We are getting powerful abstractions such as verification of policies that are farmed out to tenants, something that nobody has done before.
HotNets '13: Cross-Path Inference Attacks on Multipath TCP
HotNets '13: CryptoBook
Authors: John Maheswaran, David Isaac Wolinsky, Bryan Ford (Yale University).
CryptoBook attempts to provide cross-site authentication in a privacy preserving way. Cross-site authentication is increasingly widespread: users can use OAuth and their Facebook accounts to log in to Pinterest, StackOverflow, etc. The reasons why users prefer this form of authentication are clear: they only have to maintain one account (their Facebook account), and don't have to sign up or create passwords at every other site.
The problem is that everything a user does on a third-party site can now be associated with their Facebook account, which is often not what the user wants.
CryptoBook acts as an intermediate login service. It is presented as a second login option, next to "Log In with Facebook" for example. Facebook issues an OAuth token to the intermediate CryptoBook service, rather than to the visited site (e.g., Wikipedia). The CryptoBook service vouches that the user's identity *does* correspond to *a* Facebook account, without revealing which one. In fact, the user provides a list of other Facebook accounts to act as the anonymity set - the website does not learn which of these users is signing in.
CryptoBook's backend consists of a collection of federated servers, of which only one (not necessarily a majority) needs to follow the protocol in order to maintain security and privacy. Standard ring signatures are used so that the token is provably signed by one user among this set, but it could be any of them.
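For readers unfamiliar with ring signatures, the property being relied on is roughly the following (standard notation, not this paper's):

```latex
\[
\sigma \leftarrow \mathrm{Sign}(sk_i, m, R), \quad R = \{pk_1, \dots, pk_n\},
\qquad
\mathrm{Verify}(R, m, \sigma) = \mathrm{accept},
\]
```

where the verifier learns only that the signer holds the secret key for some public key in R, but not which one. In CryptoBook, R is the anonymity set of Facebook accounts chosen by the user.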
Q. Why go through the effort of using crypto? Why not have the user's browser simply sign up, automatically, on their behalf, to various websites?
A. Part of the function of cross-site authentication is to protect the website against anonymous user accounts. In other words, the website may not want to accept anonymous signups, but instead only accepts legitimate Facebook accounts.
Comment: It was suggested that this work may duplicate existing work, as many state governments are busy implementing very similar things.
Q. What happens when a Facebook account is deactivated?
A. The token can be revoked. This could conceivably introduce a correlation attack.
HotNets '13: Patch Panels in the Sky: A Case for Free-Space Optics in Data Centers.
HotNets '13: Trevi: Watering Down Storage Hotspots with Cool Fountain Codes.
HotNets '13: Towards Minimal-Delay Deadline-Driven Data Center TCP
Paper: http://conferences.sigcomm.org/hotnets/2013/papers/hotnets-final92.pdf
The key idea in the paper is to formulate the problem explicitly as a stochastic network optimization problem and derive an end-to-end scheme using a standard technique called the "drift-plus-penalty method." See the paper for more details.
What I think is interesting about the paper is that the authors formulated the problem explicitly and derived the end-to-end window update algorithm to achieve the optimal rates needed to meet as many flow deadlines as possible.
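For context, the generic drift-plus-penalty recipe (the standard form from the stochastic network optimization literature; the paper's exact formulation may differ) looks like this. Given queue backlogs Q_i(t), define

```latex
\[
L(t) = \tfrac{1}{2}\sum_i Q_i(t)^2, \qquad
\Delta(t) = \mathbb{E}\big[\, L(t+1) - L(t) \mid Q(t) \,\big],
\]
```

and in each time slot greedily minimize a bound on

```latex
\[
\Delta(t) + V \cdot \mathbb{E}\big[\, \mathrm{penalty}(t) \mid Q(t) \,\big],
\]
```

where V trades off queue stability against the penalty (here, delay or deadline misses). Minimizing this bound slot by slot is what yields the end-to-end window update rule.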
Q: You are adapting both the rate over time and the starting rate. Have you quantified the benefit of each modification? How much of the benefit is just due to the right starting rate?
A: It has to start at the expected rate, or the dynamics would take a long time to converge. If flows just stick to the starting rate, the network will be unstable.
Q: Is the objective to maximize the number of deadlines met?
A: No, it is to minimize per-packet delays.
Q: Recent works look at flow-level metrics (mean FCT, etc.). What impact do these metrics have on application performance (e.g. MapReduce)?
A: Most work in this area is on flow-level performance, so we used the same. For specific applications, I think there is room for improvement.
Q: In your graphs, what is "optimal"? Why is it different from the line labeled "throughput"?
A: Optimal is computed centrally using per-hop information. It is hard for a stochastic program to always operate at the optimal point -- we just proved convergence to optimality in the stochastic sense.
HotNets '13: On Consistent Updates in Software-Defined Networks.
They argue that one problem with SDN lies in the transition between the current state and the future state during rule changes. When the SDN controller wants to change some rules and sends updates into the network, which is asynchronous, the state does not change atomically, and the transition can produce problems such as forwarding loops.
One approach is to make the state transition atomic by propagating version numbers and not telling nodes to start using a version until all nodes have received it. This slows down updates when some nodes aren't affected by the rule change. Assuming that all nodes' rules depend on one another gives stronger packet coherence, but if we instead target only the nodes that actually depend on one another, we can try to achieve minimal SDN updates.
It is possible to find a minimal set of nodes that need to be involved for a given set of properties for various kinds of consistency, but the difficulty is not just how to compute new rules, but how to get from the current configuration to the new configuration gracefully.
Suppose you have a set of new rules and run it through an update plan generator that creates an update DAG defining when and where the various rule updates need to be applied. If the graph is cyclic, the cycle has to be broken, and an optimizer can then reduce the graph to a smaller set of operations. At this point, the executor can apply the updates.
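A minimal sketch of such an executor (the names and the print stand-in are ours): apply updates in an order consistent with the DAG, so that no switch starts using a rule before the rules it depends on are installed.

```python
# Toy update-plan executor: topological order over the update DAG.
from collections import defaultdict, deque

def execute_plan(updates, deps):
    """updates: {update_id: (switch, rule)}; deps: (before, after) edges,
    meaning `before` must be installed prior to `after`."""
    indeg, out = defaultdict(int), defaultdict(list)
    for before, after in deps:
        out[before].append(after)
        indeg[after] += 1
    ready = deque(u for u in updates if indeg[u] == 0)
    done = 0
    while ready:
        u = ready.popleft()
        switch, rule = updates[u]
        print(f"install {rule!r} on {switch}")  # stand-in for the real call
        done += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if done != len(updates):
        raise RuntimeError("cycle in the update plan: break it first")

# Install the downstream rule before the upstream rule that relies on it.
execute_plan({1: ("s2", "dst=10/8 -> port2"),
              2: ("s1", "dst=10/8 -> s2")},
             deps=[(1, 2)])
```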
Q: There have been a couple of papers looking at program synthesis. You can view these as instantiations of your update planner. Can you say more about what is in the planner and how it works?
A: We only have it for some kind of properties. In this paper we only look at loop freedom. It doesn't work in general for any property that you define. For us we want to look at specific properties and understand them better.
Q: Work has been done in this area for distributed networks (pre-SDN). Is there a formal way to map the distributed approaches to the centralized approach?
A: We don't know yet. On the surface they look different, because here you can exactly control which node you are sending an update to, which isn't true in the distributed case. It is possible those algorithms can be used in some places.
Q: Is there a small set of properties that are broadly useful enough or do you have to fall back to something like version numbers for the general case?
A: I wouldn't characterize the version numbers as a general solution, because they don't work universally. I would choose the fastest thing you can do that has the correctness property.
Observation (from audience): Depending on the application, people face different data consistency problems. Maybe all you care about is loop freedom.
HotNets '13: Software Defined Networking II: Panel Discussion
Anirudh: Several queue-management schemes have non-trivial logic, and most of them share little of that logic. A better approach is to optimize the control flow for each of them independently. We are not sure which instruction set we will end up with.
Q: Can you also verify queue management and processing steps?
Mihai: Not looked into it yet, but why not.
Q: The main cost factor is not so much the instruction set but rather the hardware gadgets. Other costs come with managing the flexibility at all (someone has to make sure the program is verified). Maybe the next step should be to articulate and quantify these costs. What do you think is expensive and what is not?
Anirudh: Yes, there is a need to quantify this. We have no strong ideas on deployment incentives yet.
Q: Have you thought about the underlying hardware primitives needed for the implementation, or is general-purpose hardware enough?
Ratul: Support from switches could help (e.g. if the switch tells neighbors when done after update).
Q: Instead of unrolling loops, could you just look at the loop invariant? The programmer does what they want in the loop but just states the invariants. Have you seen code that is simple enough to do this?
Mihai: It depends on the invariant and the loop, but there are not a lot of algorithms in the data plane yet, so this is still ongoing research.
HotNets '13: No Silver Bullet: Extending SDN to the Data Plane
Speaker: Anirudh Sivaraman
In the past, several queue-management and scheduling algorithms have been proposed for different scenarios. Among these are RED and, more recently, XCP, RCP, and PDQ. All of these optimize for different objectives and are effective and suitable in their specific domain. The idea of this work is to apply the principles of SDN to the data plane in order to find a more flexible and general solution instead of implementing ever more scheduling and queueing schemes. This would allow one to optimize for different networks and different objectives, such as low latency, high bandwidth, or low loss, independently of the actual application and the operating system, in a more dynamic manner.
The paper works on quantifying the non-universality of existing in-network methods. In particular, the authors look at three different queuing and scheduling schemes, namely CoDel+FCFS, CoDel+FQ, and Bufferbloat+FQ, each of which again is well suited to different, distinct objectives. Additionally, three different traffic types are taken into consideration: bulk transfers, web traffic (optimizing for completion time), and interactive communication. In essence, since the selection of the right scheme depends highly on various factors, mainly the underlying network and the application, there is no silver-bullet solution in the form of a single good network queueing and scheduling configuration.
To overcome this problem, the paper proposes architecting the data plane itself for flexibility and extensibility, in order to use different schemes dynamically. This means that network operators should be able to specify the scheduling and queuing logic dynamically, based on rules, through an interface to the switch's data plane that exposes the head and tail of the queue. The approach relies on hardware adaptation, customized I/O interfaces, state maintenance, and a domain-specific instruction set. The hardware side is implemented using an FPGA attached to the switch. The instruction set is put in place to express control flow and to implement functionality that is not available in the hardware directly. In conclusion, the work proposes an open interface for queuing and scheduling that is portable across vendors, a significant step toward a programmable data plane.
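To illustrate the flavor of such an interface (entirely our invention, not the paper's API): a queue that exposes its head and tail to small operator-supplied programs, so the discipline can be swapped without new hardware.

```python
# Hypothetical programmable queue: operators supply enqueue/dequeue logic.

class ProgrammableQueue:
    def __init__(self, enq_logic, deq_logic, limit=100):
        self.pkts, self.limit = [], limit
        self.enq_logic, self.deq_logic = enq_logic, deq_logic

    def enqueue(self, pkt):
        if self.enq_logic(self, pkt):   # admit/drop decision at the tail
            self.pkts.append(pkt)

    def dequeue(self):
        return self.deq_logic(self) if self.pkts else None  # pick at head

# Example "programs": drop-tail admission with FIFO service...
drop_tail = lambda q, pkt: len(q.pkts) < q.limit
fifo = lambda q: q.pkts.pop(0)
# ...or shortest-flow-first service, swapped in dynamically:
sff = lambda q: q.pkts.pop(min(range(len(q.pkts)),
                               key=lambda i: q.pkts[i]["flow_size"]))

q = ProgrammableQueue(drop_tail, sff)
q.enqueue({"flow_size": 900})
q.enqueue({"flow_size": 10})
print(q.dequeue())   # -> the packet of the 10-packet flow is served first
```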
Q: The proof methodology seems odd. They don't seem to prove the non-existence of a silver bullet.
A: We tried 3 schemes that are typical. Yes, there could be other ones that we did not work on. We believe these schemes are representative.
Q: Thinking about Click (being one way of doing SDN): can't all this be done in Click?
A: Click is purely a software router. Nobody uses it in production. We wanted to work with actual hardware for performance and deployability in real networks.
Q: Instead of specifying all the logic, why don’t you use existing schemes in hardware and parameterize them?
A: We believe the better approach is to expose an API and programmability rather than parameterizing existing schemes.
Q: Could you compare this to FCP (flexible control plane - specifies set of operations on the forwarding plane)?
A: Not aware of this work.
Q: What is the usage scenario for this approach? Just a programmable data plane? Adapting scheduling and queuing over time (based on workload)? Every single packet possibly treated differently?
A: Our system can support any of these thanks to its minimal interface.
Q: There have been approaches doing this at the network edge (e.g. Fabric). What is the motivation to have two different architectures? Isn’t processing at the edge more the silver bullet?
A: No, if there are bottlenecks in the network core; if at the edge then possibly.
Q: If a bottleneck is in the core, can the edge identify and transfer control to the core?
A: Possibly.
HotNets '13: Toward a Verifiable Software Dataplane
Authors: Mihai Dobrescu and Katerina Argyraki (EPFL).
Software dataplanes give operators more flexibility, allowing more rapid deployment of changes. Recent advances have increased their performance to tens of Gbps, making them more practical. However, hardware manufacturers have been reluctant to let network operators install software in the data plane, for fear of bugs and performance effects.
The goal of this paper is to work toward verifiable software dataplanes. Properties that are taken for granted, like crash freedom, bounded per-packet latency, and network reachability, are all goals of this verification system.
A tool that can verify these properties would reduce testing time, speed time to market, and allow a market to develop for dataplane software that is verifiable.
Unfortunately, proving properties of general-purpose software is difficult, due to shared state and the difficulty of reasoning about code behavior in the face of problems like path explosion. In the general software case, this problem has been made easier by sacrificing flexibility or performance. Dataplane software, however, has specific characteristics that remove the need for these sacrifices, making verification feasible.
In dataplane software, the input to an element is a packet, and the output of an element is a modified packet. By treating dataplane software as a collection of pipelines of independent packet-processing components with well-defined, narrow interfaces and no shared state, verification time can be reduced.
Verification consists of two main steps. First, verify each element in isolation and identify suspect paths with respect to the property one is trying to prove. Then verify the target property by composing these possible paths. The assumption is that individual elements are verifiable. Occasionally this can be challenging, for instance when an element loops through the packet and modifies it depending on various options, causing an internal path explosion. Decomposing the loop and treating each step of the loop as a separate element simplifies reasoning and reduces path explosion.
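A toy rendering of this model (illustrative only; the real tool uses symbolic execution over all paths rather than testing on samples): elements are pure packet-in/packet-out functions with no shared state, so per-element results compose.

```python
# Pipeline of stateless elements; verify each in isolation, then compose.

def check_element(element, sample_packets, max_steps):
    """Stand-in for per-element verification: here we only test that the
    element stays within a work bound on sample inputs."""
    for pkt in sample_packets:
        out, steps = element(dict(pkt))
        assert steps <= max_steps, f"{element.__name__} exceeded its bound"

def decrement_ttl(pkt):
    pkt["ttl"] -= 1
    return pkt, 1          # (modified packet, abstract cost of this path)

def rewrite_dst(pkt):
    pkt["dst"] = "10.0.0.2" if pkt["dst"] == "1.2.3.4" else pkt["dst"]
    return pkt, 1

pipeline = [decrement_ttl, rewrite_dst]
for elem in pipeline:                    # step 1: verify each element alone
    check_element(elem, [{"ttl": 64, "dst": "1.2.3.4"}], max_steps=1)
total_bound = len(pipeline)              # step 2: compose per-element bounds
print("per-packet work bounded by", total_bound, "steps")
```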
Another difficulty is dealing with mutable state. Use of verified data structures can help ameliorate this problem. For more details see the paper.
They built a prototype based on S2E, a symbolic execution engine, and proved properties such as crash-freedom and a bounded number of instructions on Click pipelines in tens of minutes.
Q: Do you assume it is single-threaded?
A: You can parallelize across multiple cores but they don't share data.
Q: Can you compare this to other verification efforts, like seL4?
A: That was mostly a manual effort. We are trying to develop an automatic approach. We try to bring together the advancements from the verification community and combine it with the needs of the networking community. For example, the pipeline programming model affords us opportunities that aren't available for general programs.
Q: My impression is the formal verification community has made a lot of progress in the past few years. That leads me to the question of whether you would advocate for stylized programming methodology even if verification tools could do these transforms for you.
A: If there is a tool that can do this, sure, but I'm not aware of anything for C or C++. Our tool needs minor modifications to your code to provide hints for the tool. You write it in a structured way and it allows the tool to do the job.
Q: How do you train the programmers to use the tool?
A: The tool lets you know. It says you can verify something or you can't, and here's why. We take more of an interactive approach and try to create guidelines around the common needs.
Q: When you talk about performance do you make assumptions about graph properties of the pipelines? The network of these modules can be highly non-trivial. It depends on how you schedule individual packets. Do you restrict topology?
A: We analyze the performance of an individual element, and then compose the results to analyze the entire pipeline. We've experimented with simple pipelines, but I think you can have an arbitrary path.
Q: Including loops?
A: As long as you put a bound on how many times you feed it back in.
Q: Is Click the right kind of abstraction for pipeline modules or do we need something more finely grained?
A: For the evaluation we used Click, so we think it's sufficient. We think its granularity is good.
Q: I question the possibility of saying there isn't shared state between modules. For instance you can implement NAT in Click, which is stateful.
A: We looked into NAT, so basically you're right, but you can look into why you need this state. You can do static allocation of the ports available and then you don't need to share state. There are imbalances, but this is feasible.
Thursday, November 21, 2013
HotNets '13: On the Validity of Geosocial Mobility Traces
HotNets '13: Towards Comprehensive Social Sharing of Recommendations: Augmenting Push with Pull
Authors: Harsha V. Madhyastha (UC Riverside), Megha Maiya
Speaker: Harsha V. Madhyastha
Session 4 Discussion: Social
A: [Ben] I'm agnostic. My work doesn't say anything about privacy, and since the truth isn't out there, it's almost irrelevant.
A: [Anmol] It's difficult to enforce privacy, so our approach is to provide transparency as a "different" approach to privacy preservation.
Q: [to Anmol] is your chrome extension going to be available?
A: [Anmol] Yes, maybe next week.
A: [Ben] We can all take proactive approaches to preserving privacy. For example, if I were a FourSquare user, I'd add a script to obfuscate my own data.
A: [Anmol] Most tools that help anonymize things today are point solutions. There is a SIGCOMM paper that talks about unifying these tools across the "user profile" stack (this type of work), which might help.
A: [Karen] There is some work on payment schemes where users (those targeted) can monetize the ads themselves.
Q: An observation that the pull model might actually help simplify privacy: you can just ask for, say, any 1000 samples of something, without attributing those samples.
A: [Harsha] Good thought, but latency might be a concern (see his talk).
Q: As computer scientists, we already understand some of these things, but the average user might not be as sophisticated. We might want to be careful about giving these users another tool with which they may mistakenly reveal something private.
A: [Harsha] Overall, we are targeting users who are currently sharing nothing, and the pull model may help with this. But the proof will be in the pudding (upon deployment).
A: [Ben] Where should responsibility lie: users or providers? For the former, you're assuming sophistication, and for the latter, you're assuming the right incentives, etc. It's unclear if one answer will work.
HotNets '13: Network Stack Specialisation for Performance
Bonus: Panel discussion for Session 3 at the end!
Speaker: Ilias Marinos
Authors: Ilias Marinos, Robert N. M. Watson (University of Cambridge), Mark Handley (University College London).
Traditionally, servers and OSes have been built to be general-purpose. However, we now have a high degree of specialization: in a big web service, you might have thousands of machines dedicated to one function, so there is scope for specializing. This paper looks at a specific opportunity in that space. Network stacks today are good at high throughput with large transfers, but not with small files (which are common in web browsing). For example, one of their experiments with 8 KB files showed ~85% CPU load for just ~50% throughput (5 Gbps).
The goal of this paper is to design a specialized network stack that addresses this problem by reducing system overhead and providing free flow of memory between the NIC and applications. The paper introduces a zero-copy web server called Sandstorm that accomplishes this.
Key design decisions include putting applications and the network stack in the same address space; pre-segmenting static content into packets; batching data; and bufferless operation. In the performance evaluation, the most impressive improvements are in flow completion time when there are a small number of connections, and a CPU load reduction of about 5x. When moving to 6x10GbE NICs, the throughput improvement also becomes impressive: about 3.5x higher throughput than FreeBSD+nginx, and an even bigger improvement over Linux+nginx. Overall, the paper's contribution is a new programming model that tightly couples the application with the network stack, reaching 55 Gbps at 72% load.
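The pre-segmentation idea is easy to sketch (our illustration, not Sandstorm's actual code): cut static content into MSS-sized segments once, so the per-request send path only hands out references.

```python
# Pre-segmented static content: the hot path never copies payload bytes.

MSS = 1448  # an assumed TCP payload size (e.g. 1500 MTU minus headers)

def presegment(content: bytes, mss: int = MSS):
    # Done once at startup, off the hot path.
    return [content[i:i + mss] for i in range(0, len(content), mss)]

class StaticFile:
    def __init__(self, content: bytes):
        self.segments = presegment(content)

    def send(self, transmit):
        for seg in self.segments:   # zero-copy in spirit: pass references
            transmit(seg)

index_html = StaticFile(b"x" * 8192)          # an 8 KB file, as in the talk
index_html.send(lambda seg: print(len(seg)))  # -> 1448 x 5, then 952
```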
Q&A
Q: Was this with interrupt coalescing on the NIC turned on?
A: Yes.
Q: Would it be possible to turn it off?
A: We can poll.
Q: Does this stack optimization perhaps suggest changes in TCP -- things like, say, selective ACKs, which we maybe thought was a good idea, but is just too complex?
A: Not so far. The real problem is the system stack not the protocols themselves.
Q: At what level do you demultiplex, if you have two different stacks, legacy and Sandstorm?
A: Could share stack if you are careful, or use two queues and put apps on different stacks.
Q: But current NICs generally don't have enough demultiplexing flexibility to send certain traffic types to different stacks.
A: There are a few NICs that can do that.
Q: Could you fix the stack overhead with presegmentation?
A: Presegmentation is only one part of the problem.
Q: It would be interesting to see a graph showing how much benefit comes from each of these specializations.
Panel discussion for all papers in this session
Q: How does big data affect these works on the data plane?
A: Facebook wouldn't be supporting Open Compute Project if it thought it didn't matter.
Q (for the Disaggregation paper [Han et al.]): The RAMCloud project also had tight latency constraints, but they're trying to treat servers as a pool of memory (not full disaggregation). How do the approaches compare? Why isn't RAMCloud the solution?
A: RAMCloud doesn't allow you to change the CPU/memory ratio. (Additional discussion not recorded.)
Q: Why not push functionality near to the storage device?
Q: Does it make sense to combine the visions, using an approach like tiny packet programs to get functionality closer to the network?
Q: There are several reasons you have hierarchies of latency. One is because we just don't know how to build really low latency memory. But also there are others, e.g. fast memory is more expensive. The cross-cutting vision of these papers seems to be removing implementation barriers, which then allows you to create a more optimized hierarchy (based on fundamental cost tradeoffs rather than implementation). Is this how you think about this?
A: May depend on workload.
HotNets '13: AdReveal: Improving Transparency Into Online Targeted Advertising
HotNets '13: Pharos: Enable Physical Analytics through Visible Light based Indoor Localization
HotNets '13: Give in to Procrastination and Stop Prefetching
HotNets '13: Network Support for Resource Disaggregation in Next-Generation Data Centers.
Traditionally, data centers have organized their computing resources into collections of individual servers. There is an ongoing trend, with systems such as HP Moonshot and AMD SeaMicro, to disaggregate some of these resources. This paper looks at what data centers would look like in the future if this trend continues.
As this trend continues, all resources would become accessible as standalone blades connected by a unified interconnect. This would increase resource modularity dramatically, allowing operators to update or replace hardware in tune with each component's upgrade cycle with less difficulty. It would also allow operators to expand capacity in a more granular way: they could purchase only the resources they need, in aggregate.
While this is a significant conceptual change, the hardware and software changes can be made incrementally. On the hardware side, the fundamentals would not need to change; the chief accommodation would be the addition of a network controller. Individual software applications would need no changes, though some minor changes would have to be made in the VM.
These modified virtual machines would then achieve higher efficiency than in a traditional data center, because of the increased flexibility in acquiring resources and because fewer resources would be left unused on individual servers.
On a larger scale, instead of being connected by an internal bus, resources would be connected by a unified network. While a unified interconnect may seem radically different from a traditional internal interconnect, comparing PCIe to Ethernet shows that they have similar requirements.
A significant difference, though, is communication latency, which is particularly a factor for memory. This cost can be mitigated by expanding the memory hierarchy with a layer of local memory next to the CPU that acts as a cache. In an experiment, the authors found that a 10-40 Gbps network link is sufficient, with average link utilization between 1-5 Gbps, and that a latency of less than 10 microseconds kept the overhead under 20%. Keeping latency low is a key ingredient of a performant disaggregated data center.
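A back-of-the-envelope model (our numbers and formula, not the paper's experiment) shows why the local memory layer and low network latency matter so much:

```python
# Effective access time with a local cache in front of remote memory.

def effective_access_ns(hit_rate, local_ns=100, remote_rtt_ns=10_000):
    """local_ns ~ DRAM access; remote_rtt_ns ~ a 10 us network round trip."""
    return hit_rate * local_ns + (1 - hit_rate) * (local_ns + remote_rtt_ns)

for hit in (0.90, 0.99):
    t = effective_access_ns(hit)
    print(f"hit rate {hit:.0%}: {t:,.0f} ns per access "
          f"({t / 100:.1f}x local DRAM)")
```

With a 99% hit rate the slowdown is about 2x local DRAM, but at 90% it is already about 11x, which is consistent with the talk's point that sub-10-microsecond latency is needed to keep the overhead modest.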
Q: This reminds me a lot of multiprocessor systems from the past used in mainframes.
A: They are similar in that they try to act as a big computer, but the main difference is how tightly coupled the resources are. A primary goal of a disaggregated data center is a decoupling of these resources. Previous systems coupled resources tightly to achieve higher performance.
Q: Instead of a proprietary bus you want to use a more open protocol?
A: Yes
Q: Twenty years ago, the desk area network and some other projects had the idea that you can push the network very far into the device, and very low-latency interconnects were necessary. Are there common models or lessons?
A: Data centers are very big now and can achieve high economies of scale. This changes the economics of this approach, which was a problem with the earlier approach.
Q: What is the relationship between disaggregation and high-performance computing? It seems like this approach starts with commodity components. What if you started from supercomputing?
A: Modularity is everything. Vendor lock-in increases costs.
Q: What is the overhead in hardware cost?
A: Network controllers at scale are very cheap.