Nalanda

November 25, 2008

A Policy Aware Switching Layer for Data Centers

Filed under: Networks — Tags: , — Ashwin @ 12:36 pm

Joseph, D. A., Tavakoli, A., Stoica, I. 2008. A Policy Aware Switching Layer for Data Centers. UC Berkeley Technical Report No. UCB/EECS-2008-82.

The authors deal with the problem of configuring middleboxes in datacenters. Current architectures call for middleboxes to be placed on the physical network path, which leads to a number of sticky configuration problems. To make sure traffic actually crosses a middlebox, administrators resort to removing physical connectivity paths that do not cross it, manipulating link costs, and separating the network into VLANs. All these approaches carry penalties: loss of fault tolerance, difficulty of predicting behaviour, fate-sharing of flows with middleboxes, and the loss of the ability to run clustering and virtual server mechanisms which require layer 2 connectivity.

The authors propose a new approach, PLayer, consisting of policy-aware switches (pswitches), which allow middleboxes to be taken off the physical network path and allow middlebox routing policy to be specified explicitly, rather than through the implicit mechanisms currently in use. Though conceptually simple, this is a difficult problem in practice, since a principal, though unstated, design goal is to not require any changes to the middleboxes themselves. Even for simple middleboxes, Ethernet frames have to be encapsulated for delivery. More complex middleboxes that require layer 3 and layer 4 data need assurance that streams are always directed to the same middlebox instance; this is achieved using consistent hashing to choose instances.
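The instance-selection idea can be illustrated with a minimal consistent-hashing sketch. This is not the authors' code: the `InstanceRing` class, the 5-tuple layout `(src_ip, src_port, dst_ip, dst_port, proto)`, the number of virtual points, and the choice to hash both directions of a flow to the same key are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class InstanceRing:
    """Consistent-hash ring over middlebox instances (hypothetical helper)."""

    def __init__(self, instances, replicas=64):
        # Each instance gets several virtual points to even out the load.
        self._points = sorted(
            (_hash(f"{name}#{i}"), name)
            for name in instances
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def pick(self, flow):
        """Return the instance responsible for a flow 5-tuple."""
        # Assumption: sort the two endpoints so that both directions
        # of a flow produce the same key.
        src, dst = sorted([(flow[0], flow[1]), (flow[2], flow[3])])
        key = f"{src}|{dst}|{flow[4]}"
        # First ring point clockwise from the key, wrapping around.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._points[idx][1]


ring = InstanceRing(["fw1", "fw2", "fw3"])
flow = ("10.0.0.1", 1234, "10.0.0.2", 80, "tcp")
print(ring.pick(flow))
```

The property that matters here is that removing one instance only remaps the flows that instance was serving; flows assigned to the surviving instances stay where they were.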

An interesting problem that the authors deal with is the dissemination of policy updates. Each pswitch maintains a copy of all policy rules for the datacenter, to allow for continued correct function in the event of any failures elsewhere on the network. When policies are updated, the new rules must be adopted at the same time by all pswitches. The authors propose a mechanism where policies are pushed out to pswitches but not immediately adopted; a separate small control packet is then dispatched to signal each switch to move to the new policy. As the packet is small, there is a greater likelihood that it will reach all switches at nearly the same time.
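A rough sketch of this stage-then-activate pattern, under simplifying assumptions: the `PSwitch` class, its method names, the version numbers, and the dictionary representation of policy rules are all hypothetical, and the network is modelled as plain function calls rather than packets.

```python
class PSwitch:
    """Toy pswitch holding versioned policy tables (hypothetical model)."""

    def __init__(self, initial_version, initial_rules):
        self.tables = {initial_version: initial_rules}
        self.active = initial_version

    def stage(self, version, rules):
        # Phase 1: store the (large) policy table without using it yet.
        self.tables[version] = rules

    def activate(self, version):
        # Phase 2: the small control message flips the active version.
        if version not in self.tables:
            return False  # never staged here; keep the old policy
        self.active = version
        return True

    def next_hops(self, traffic_class):
        """Middlebox sequence the active policy requires for this traffic."""
        return self.tables[self.active].get(traffic_class, [])


def roll_out(switches, version, rules):
    """Stage everywhere first, then send the short activation signal."""
    for sw in switches:
        sw.stage(version, rules)
    results = [sw.activate(version) for sw in switches]
    return all(results)


old = {"web": ["firewall"]}
new = {"web": ["firewall", "load_balancer"]}
switches = [PSwitch(1, old) for _ in range(3)]
roll_out(switches, 2, new)
print(switches[0].next_hops("web"))
```

Separating the bulky transfer from the switch-over keeps the window in which switches disagree as short as the delivery time of the activation signal, though it does not close that window entirely.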

Even with this mechanism in place, there are several scenarios under which flows will be processed by different policies, and avoiding them calls for very specific approaches to policy configuration in order to enable reliable and consistent dissemination. This did make me wonder about how such a mechanism might be deployed to a real datacenter. Could it be that, depending on topology, different portions of a network could be taken offline and updated independently of one another? For smaller, controlled functionality (e.g., load balancers and firewalls dedicated to a web server farm), it may be that this could provide a more reliable, albeit also more manual, update mechanism.

Improving MapReduce Performance in Heterogeneous Environments

Filed under: Networks — Tags: , , — Ashwin @ 12:11 pm

Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I. 2008. Improving MapReduce Performance in Heterogeneous Environments.

The authors examine the performance of the MapReduce implementation in Hadoop, and find several flaws in Hadoop’s scheduler which can cause severe degradation in performance. The principal problems analysed are the identification of laggard tasks for speculative execution, and the identification of nodes to which these tasks should be assigned. These problems arise in Hadoop due to assumptions of homogeneity in tasks, and in the network itself.

To remedy these problems, the authors propose a new scheduling algorithm, LATE, which is sensitive both to the variance in tasks and also to variance in node performance. LATE chooses tasks for speculative execution based on estimated time to completion, rather than the simple score metric that Hadoop uses. Only nodes with performance above a specified threshold are chosen for the execution of speculative tasks. In addition, a cap is maintained on the total number of speculative tasks that may be run at a time. Testing on various configurations of EC2, and also on a testbed with virtual machines, demonstrates that LATE provides significant performance benefits.
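The three ingredients described above (ranking by estimated time to completion, a node-speed threshold, and a cap on speculative tasks) can be sketched as follows. This is an illustrative sketch only: it assumes time to completion is estimated as remaining progress divided by the observed progress rate, and the `Task` fields, the `cap` and `min_node_speed` parameters, and their default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress: float  # fraction complete, in [0, 1]
    elapsed: float   # seconds the task has been running


def time_left(task):
    """Estimate seconds to completion from the observed progress rate."""
    if task.progress <= 0 or task.elapsed <= 0:
        return float("inf")  # no progress yet: treat as arbitrarily slow
    rate = task.progress / task.elapsed
    return (1.0 - task.progress) / rate


def pick_speculative(tasks, node_speed, running_speculative,
                     cap=2, min_node_speed=0.5):
    """Return the task to re-run on the requesting node, or None."""
    if running_speculative >= cap:
        return None  # too many speculative copies already running
    if node_speed < min_node_speed:
        return None  # do not launch speculative work on a slow node
    candidates = [t for t in tasks if t.progress < 1.0]
    if not candidates:
        return None
    # Speculate on the task expected to finish farthest in the future.
    return max(candidates, key=time_left)


tasks = [Task("a", 0.9, 90), Task("b", 0.2, 100), Task("c", 0.5, 20)]
choice = pick_speculative(tasks, node_speed=1.0, running_speculative=0)
print(choice.name)
```

Note that task "b" is chosen even though it is further along than nothing in particular: what matters is that its slow rate puts its finish time furthest out, which a plain progress score would not capture as directly.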

The one question I have is with regard to the choice of virtual machines as testbeds. While I understand that these could be useful to simulate a heterogeneous environment, it also seems like they are a worst case scenario. MapReduce is itself a virtualization scheme that makes certain assumptions about the locality of data; layering this on top of another virtualization scheme seems like overkill. It would be interesting to know how much of an advantage LATE delivers in a carefully planned data center, even one with a degree of heterogeneity.
