Design for Failure
The Amazon model is the “design for failure” model. Under the “design for failure” model, combinations of your software and management tools
take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.
…from The AWS Outage: The Cloud’s Shining Moment, O’Reilly, April 2011
Back in the golden days of my youth, my father worked as a technical writer, producing brochure copy for computer companies. I can’t have been much more than ten when I read some of his work: a brochure advertising the benefits of a computer architecture (I forget which one, but it might have been Tandem NonStop) that could continue operating in the event of any single, and some multiple, component failures by having redundant links between redundant storage and processor units.
This looked rather fun, but it was only much later, when I learnt about computer networks, that I began to realise you didn’t need special hardware to do this. Although the special hardware is built to be more reliable than commodity equipment, the magic is all in the software that fails over responsibilities to other hardware when part of it breaks. With a little more work, those software techniques can be applied to off-the-shelf hardware joined together by a network.
Back to square one?
But designing software for failure requires new skills. Some parts are easy: any software component that doesn’t hold state is easily replicated across a bunch of servers, as the loss of one merely means losing the state of an in-progress operation. Ideally, protocols like HTTP would do more to help us. If Web browsers that find multiple servers for the same Web site kept trying until one worked, rather than giving up after the first failure (there’s a standard for this already: SRV records), we wouldn’t need tricks like anycast DNS and front-end load balancers to provide decent failover of front-end application servers.
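The client-side failover idea is simple enough to sketch. Here is a minimal illustration in Python of trying each candidate server (in the order an SRV lookup might return them) until one answers; the names `fetch_with_failover` and `fake_fetch` are mine, and the stub transport stands in for a real HTTP client:

```python
# Sketch of client-side failover: try each candidate server in turn,
# moving on to the next one if a connection fails. Candidate lists like
# this are what DNS SRV records (RFC 2782) are designed to provide.

class AllServersFailed(Exception):
    pass

def fetch_with_failover(candidates, fetch):
    """Return the first successful result from the candidate servers."""
    last_error = None
    for host in candidates:
        try:
            return fetch(host)
        except ConnectionError as exc:
            last_error = exc  # this server is down; try the next one
    raise AllServersFailed(last_error)

# Stub transport: pretend the first server is unreachable.
def fake_fetch(host):
    if host == "web1.example.com":
        raise ConnectionError("connection refused")
    return f"response from {host}"

print(fetch_with_failover(["web1.example.com", "web2.example.com"],
                          fake_fetch))   # → response from web2.example.com
```

A browser doing this natively would make the front-end load balancer redundant for availability purposes; today the retry loop has to live in infrastructure instead.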
Designing software for failure is an extra barrier to overcome, but isn’t too hard, and it certainly pays off. Largely, it boils down to making sure that operations do not leave the system in an unstable state if they are aborted partway through for some reason. This is mainly a challenge for the frameworks and infrastructure upon which applications are built; with the infrastructure for retrying failed operations built into the system, application developers only really need to worry about the areas where the system can’t automate recovery from failure (such as operations that trigger real-world actions).
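One common way to make retries safe is to make each operation idempotent, so redelivering it after a failure has no extra effect. A minimal sketch, with hypothetical names (`apply_payment`, `seen_ids`) standing in for a real durable store:

```python
# Sketch of an idempotent operation: each operation carries a unique ID,
# and applying the same ID twice is a no-op. This is what lets the
# infrastructure blindly retry after a failure without corrupting state.

seen_ids = set()          # stand-in for a durable record of applied operations
balance = {"acct": 0}     # stand-in for the real application state

def apply_payment(op_id, acct, amount):
    """Idempotent: a retried or redelivered operation changes nothing."""
    if op_id in seen_ids:
        return            # already applied; the retry is harmless
    balance[acct] += amount
    seen_ids.add(op_id)

apply_payment("op-1", "acct", 50)
apply_payment("op-1", "acct", 50)   # a retry after a failure: no extra effect
print(balance["acct"])              # → 50
```

In a real system the applied-IDs record and the state change would be committed together (in one transaction), so a crash between the two can’t cause a double-apply; operations with real-world side effects, as noted above, can’t be automated away like this.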
Porting legacy applications
But there’s a lot of legacy software out there that hasn’t been designed for failure. What do we do about that? A lot of cloud users are trying to migrate their existing systems “into the cloud”, and they are the ones who will suffer a worse quality of service because of it. Cloud servers and links have a high failure rate. An application that can take advantage of what the cloud offers (easy access to a large number of unreliable nodes, and very rapid provisioning of new ones) can provide uptimes that put expensive mainframe hardware to shame; but applications written to expect expensive mainframe hardware will not fare well.
But you shouldn’t need to rewrite an application to make it failure-tolerant. Web applications, by their largely stateless nature, are often structured in ways that make building in failure-tolerance quite practical; the code that handles HTTP requests, nine times out of ten, will only have two effects – writing into a database and generating a response. A suitably smart load balancer can wait until a complete response is generated before sending it back to the client, and retry on another server if the response aborts partway through. If there is no storage of state other than in the database (and I’m looking at you, Java servlet session storage…), then the problem really boils down to:
- A fault-tolerant place to store your data.
- Managing a pool of stateless Web servers with request failover via load balancers.
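The first half of that, keeping every scrap of state in the database, can be sketched briefly. This uses SQLite purely as a stand-in for the shared fault-tolerant database, and the handler names are illustrative:

```python
# Sketch of "no state outside the database": every web server loads the
# session from the shared database by token on each request, so any
# server in the pool can handle any request, and a retried request
# lands safely on a different server.

import sqlite3

db = sqlite3.connect(":memory:")   # stand-in for the shared database
db.execute("CREATE TABLE sessions (token TEXT PRIMARY KEY, user TEXT)")

def handle_login(token, user):
    # Session state goes to the database, never to server-local memory.
    db.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)", (token, user))
    db.commit()

def handle_request(token):
    # Any server can reconstruct the session from the database alone.
    row = db.execute("SELECT user FROM sessions WHERE token = ?",
                     (token,)).fetchone()
    return f"hello, {row[0]}" if row else "please log in"

handle_login("t-123", "alice")
print(handle_request("t-123"))   # → hello, alice
```

Container-managed session storage (the Java servlet kind mentioned above) breaks exactly this property: the retried request arrives at a server that has never heard of the session.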
But database fault tolerance is pretty tricky. The market is starting to swell with databases claiming to offer fault tolerance, but most of them merely survive individual server failures. What if the link between datacentres fails, splitting your system into a European cluster and a US cluster? Few databases can operate in a partitioned environment, and most of those that can are exotic NoSQL databases that your app would need rewriting to use.
As usual, there’s no silver bullet
Of course, at this point, you’d expect me to say “…except GenieDB!”, but there’s often more to the problem than just the database. You still need to audit the app to make sure it’s not squirreling data away anywhere else, and work out how to manage that “pool of Web servers with request failover via load balancers”.
Rolling out updates to the application becomes more of a challenge, and all these new servers need monitoring. Fault-tolerant systems are built to hide failures from normal users; with GenieDB being extremely tolerant of almost complete system failure, measures need to be taken to make sure somebody is alerted before you’re down to a single working server!
The audit process is quite simple, though. We recently audited WordPress, and found only one issue: it puts uploaded images into the filesystem. We fixed this by writing a WordPress plugin that copies every image or other media file uploaded into a post to all the other servers, along with a mechanism for servers to catch up on uploads they missed during an outage.
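The replicate-and-catch-up pattern behind that plugin can be sketched abstractly. This is not the plugin itself (which is PHP inside WordPress); it is a toy Python model, with invented names, of pushing each upload to every peer and letting a recovered peer copy whatever it missed:

```python
# Sketch of upload replication with catch-up: an upload is pushed to
# every peer on a best-effort basis; a server that was down during an
# upload later reconciles against a healthy peer.

class Server:
    def __init__(self, name):
        self.name = name
        self.files = {}      # path -> file contents
        self.up = True

    def receive(self, path, data):
        if self.up:          # a down server simply misses the push
            self.files[path] = data

    def catch_up(self, source):
        # After an outage, copy any files we missed from a healthy peer.
        for path, data in source.files.items():
            self.files.setdefault(path, data)

def upload(origin, peers, path, data):
    origin.files[path] = data
    for peer in peers:       # best-effort push to every other server
        peer.receive(path, data)

a, b = Server("a"), Server("b")
b.up = False                             # b is down during the upload
upload(a, [b], "2011/04/logo.png", b"...")
b.up = True
b.catch_up(a)                            # b recovers the missed upload
print("2011/04/logo.png" in b.files)     # → True
```

The interesting part is the catch-up step: without it, a server that was down for even a minute would serve broken image links forever after.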
We’ve put a lot of work into helping migrate legacy apps into fault-tolerant environments, by finding ways to support the kinds of behaviour they expect from a single-node database, in a distributed environment. Porting WordPress to run on a NoSQL database would be a huge amount of work!
So I think we will see many exciting new things happening as application developers figure out how to take advantage of cloud systems. Applications will start to appear that are fault-tolerant from the start. New tools to help build these applications, with support for failure tolerance built in, will continue to appear. And new tips, tricks, and techniques for handling common problems are out there, waiting to be rediscovered…