Wednesday, April 10, 2013

The software defined data center - part 2: compute

This is the second post in the series, which started with an overview. This one covers compute.

Introduction

Compute virtualization arrived in the x86 world at the start of the millennium with VMware, though it had been around long before on mainframes and then midrange systems (such as large RISC-based Unix boxes).

Virtualization offers the opportunity to consolidate workloads, and hence reduce costs (including administration costs), but it also carries the threat of increased administrative overhead, which can come from two directions:
  1. I started out with 10 physical servers, and moved them onto virtual servers running on a single box, so now I have 11 things to manage.
  2. Servers just became cheaper and easier to obtain, so I can have more - lots more.
By around 2004 it was becoming clear that virtualization threatened to overwhelm already hard-pressed systems administrators by giving them an exploding volume of stuff to manage. Automation was going to be needed - after all, the job of any good systems administrator is to replace themselves with a script.

Tool evolution

The evolution of automation tools generally went through three phases:

  1. Development and test. Until people got comfortable with virtualization (often by using it extensively in development and test) it was usual for it to be deployed only to non-critical environments. One of the first products was Akimbi Lab Manager (Akimbi itself became an early VMware acquisition). A typical usage pattern here was self-service, with VMs requested through a portal and delivered in a matter of minutes (the limiting factor normally being how quickly an image could be copied).
  2. Production usage - single environment. Once confidence grew that virtualization could be used for production workloads, the automation tools followed. This usually entailed a change in usage ethos, with self-service being replaced by integration with ticketing and configuration management tools. Typically these tools would (initially) only work with a single type of virtualization platform (usually ESX).
  3. Multi-environment. These tools add the ability to work with multiple virtualization platforms (e.g. Xen, Hyper-V, KVM) and/or various (public) cloud platforms (usually starting with Amazon's AWS) - a sketch of what that kind of platform abstraction might look like follows below.
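
To make the multi-environment idea concrete, here's a minimal sketch of the sort of adapter layer such a tool might put between its provisioning workflows and the underlying platforms. The class and function names below are hypothetical - they aren't drawn from VDE, VMPS, VRM or any of the other products mentioned.

```python
# Hypothetical sketch of a multi-platform provisioning abstraction.
# The names are illustrative, not taken from any real product.
from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """Common interface the automation layer codes against."""

    @abstractmethod
    def provision(self, name: str, cpus: int, ram_gb: int) -> str:
        """Create a VM/instance and return its identifier."""


class EsxProvider(ComputeProvider):
    def provision(self, name: str, cpus: int, ram_gb: int) -> str:
        # A real tool would call the hypervisor management API here.
        return f"esx-vm:{name}"


class Ec2Provider(ComputeProvider):
    def provision(self, name: str, cpus: int, ram_gb: int) -> str:
        # A real tool would call the public cloud API here.
        return f"ec2-instance:{name}"


PROVIDERS = {
    "esx": EsxProvider(),
    "aws": Ec2Provider(),
}


def request_vm(platform: str, name: str, cpus: int = 2, ram_gb: int = 4) -> str:
    """A portal or ticketing integration only needs to know the platform name."""
    return PROVIDERS[platform].provision(name, cpus, ram_gb)


if __name__ == "__main__":
    print(request_vm("esx", "build-server-01"))
    print(request_vm("aws", "load-test-worker"))
```

The appeal of this shape is that supporting another hypervisor or cloud means adding another adapter class, while the portal, ticketing and workflow layers above it stay unchanged.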
I saw one tool develop through these stages at close quarters. When I worked next to Leslie Muller I became a guinea pig for his Virtual Developer Environment (VDE) - a system that allowed Windows desktop and server VMs (running on Microsoft Virtual Server) to be requested. Provisioning took around 8 minutes, which was an amazing improvement on the months it would take to get physical servers. Renewable one-month leases on VMs were used to help with capacity management.
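
The renewable lease is a simple but effective capacity management device: a VM that nobody bothers to renew goes away by default. Here's a minimal sketch of the idea, assuming a 30-day term and made-up field names rather than anything from VDE itself.

```python
# Hypothetical sketch of renewable VM leases for capacity management.
# The 30-day term and field names are illustrative, not taken from VDE.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

LEASE_TERM = timedelta(days=30)


@dataclass
class VmLease:
    owner: str
    vm_name: str
    expires: datetime = field(
        default_factory=lambda: datetime.utcnow() + LEASE_TERM
    )

    def renew(self) -> None:
        """Owner actively opts in to keep the VM for another term."""
        self.expires = datetime.utcnow() + LEASE_TERM

    @property
    def expired(self) -> bool:
        return datetime.utcnow() >= self.expires


def reap_candidates(leases: list[VmLease]) -> list[VmLease]:
    """Return leases that have lapsed, so the VMs behind them can be reclaimed."""
    return [lease for lease in leases if lease.expired]


if __name__ == "__main__":
    lease = VmLease(owner="dev-team", vm_name="win2003-test-01")
    print(lease.expired)   # False: the lease has just been created
    lease.renew()          # extend for another 30 days
```

The important property is the default: unused capacity is reclaimed automatically unless someone actively asks to keep the machine.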

Process re-engineering

Some time later (and after a complete rewrite) VDE morphed into the Virtual Machine Provisioning System (VMPS), which switched its target platform to ESX and allowed for permanent VMs (for production) in addition to leases for test/dev. Many sacred cows had to be slaughtered to adapt provisioning workflows to being completed in minutes rather than weeks. The roadblocks typically fell into two categories:
  1. It's my job to... Provisioning tends to make heavy use of the singleton pattern to ensure that things (such as IP addresses) are properly allocated and not duplicated. This is why provisioning is often so slow, as each singleton is a process bottleneck. In real life this often boils down to a person with an Excel spreadsheet keeping track of things. Nobody likes being replaced by a database table (even if it does free them up to do something more interesting and useful) - there's a sketch of that database table approach after this list.
  2. Something terrible once happened... (and it took out the trading floor). People make mistakes, and often the runbooks that systems administrators have to follow have been designed specifically to avoid the mistakes (and mistypes) of the past. This usually involves pushing changes outside of working hours (more process bottlenecks). A properly automated system doesn't make those mistakes, which means that changes during working hours should be possible (or changes outside of working hours can be queued up and left to run).
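
To illustrate the first point, here's a minimal sketch of what replacing the person-with-a-spreadsheet can look like: an IP pool held in a database table, with allocation done inside a transaction so the same address can't be handed out twice. SQLite is used only to keep the example self-contained; the schema and function names are assumptions, not taken from any of the tools above.

```python
# Hypothetical sketch: an IP address pool as a database table rather than a
# spreadsheet. SQLite keeps the example self-contained; the schema is illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ip_pool (address TEXT PRIMARY KEY, allocated_to TEXT)"
)
db.executemany(
    "INSERT INTO ip_pool (address, allocated_to) VALUES (?, NULL)",
    [(f"10.0.0.{i}",) for i in range(10, 20)],
)


def allocate_ip(vm_name: str) -> str:
    """Claim the next free address for a VM inside a single transaction."""
    with db:  # select + update commit (or roll back) together
        row = db.execute(
            "SELECT address FROM ip_pool "
            "WHERE allocated_to IS NULL ORDER BY address LIMIT 1"
        ).fetchone()
        if row is None:
            raise RuntimeError("IP pool exhausted")
        db.execute(
            "UPDATE ip_pool SET allocated_to = ? WHERE address = ?",
            (vm_name, row[0]),
        )
        return row[0]


if __name__ == "__main__":
    print(allocate_ip("build-server-01"))  # e.g. 10.0.0.10
    print(allocate_ip("build-server-02"))  # e.g. 10.0.0.11
```

A real deployment would add proper locking for concurrent requests and integrate with whatever IP address management already exists, but the principle is the same: the singleton becomes a fast, auditable table rather than a person and a spreadsheet.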

Once it was spun out of Credit Suisse as DynamicOps, the tool grew up some more to become Virtual Resource Manager (VRM) - a multi-platform system that supported private and public clouds. DynamicOps has since been acquired by VMware to become vCloud Automation Center (joining Akimbi, Dunes and a bunch of other automation technologies).

Hitting the limits

Compute virtualization and automation go a long way towards having on-demand systems, and at a small scale it can seem like everything is taken care of. Sadly, things can become difficult at scale, usually due to limitations in storage and networking:
  • Storage has traditionally forced a choice between direct attached storage (DAS), network attached storage (NAS) or a storage area network (SAN). Each of these three involves a trade-off between speed, flexibility and resilience (and cost).
  • Networking has (until very recently) been constrained by an inability to move VMs (for capacity management purposes) outside of VLANs, and the definition of those VLANs is pretty static. This has meant that network boundaries constrain capabilities that would otherwise be possible with the underlying compute platform.

Conclusion

Compute virtualization has been with us for some time, and after it threatened to further overwhelm already busy systems administrators, a crop of automation tools has emerged to make things quicker and easier. At small scale, compute automation shows what is possible with a software defined data center, but it's only with recent changes in the networking and storage arenas that things can be made to work at larger scale.
