Wednesday, March 23, 2016

Redundancy and Fault Tolerance

Individual nodes can fail without causing system failure.

This tends to be a contentious topic, because all of a sudden, project managers feel like they're being pressured into buying at least two of everything, driving system cost and complexity up.  I'll admit that I'd rather use one device that's simpler and 10x more reliable than a paired cluster of devices that hits a failure mode n times more often.  But we also have to weigh the consequences and impact of that one device failing against the cost of having a backup.  Unfortunately, people are bad at statistics when it comes to estimating when, not if, something critical to their system will fail.

Let me start with one counterexample.  Most commercial airplanes have at least two engines.  The idea is that in the unlikely event that one of the engines fails in flight, the remaining engine will still be able to get everyone safely to an emergency landing at the nearest serviceable airport instead of whatever field or body of water is within gliding distance.  The largest single-engine aircraft, such as the Cessna Grand Caravan or Pilatus PC-12, get by with one highly reliable Pratt & Whitney PT6 turboprop engine.  With regular maintenance, this turbine engine is orders of magnitude more reliable than piston engines in the same power range, so a single-turbine aircraft can maintain a safety record comparable to competing twin-engine piston aircraft.  Of course, turbine engines are a radically different technology from piston engines, with lower complexity and fewer moving parts.  The higher price is eventually offset by higher operating efficiency.  But the point stands: if the technology is radically different, one highly reliable device can sometimes be a better use of your money than a bunch of older, less reliable tech.  You're more likely to successfully drive across the country in one modern car than in a pair of '60s VW Beetles.  But that's about the difference in technology levels you'd need to be looking at.

Computers are such a commodity now, however, that it shouldn't hurt as much as it once did to buy two or more to do a job instead of throwing all your money at one big beefy server.  You could configure one high-reliability server (often with redundant disks, NICs, fans, and power supplies, the components most likely to fail) for about $2000, or you could configure two cheaper servers with similar specs for about $1000 each and build a fully redundant system where any single part, including the mainboard, could break and the service would continue running.

The simplest way to make sure the service keeps running is to set up a Primary/Standby or Master/Backup style system.  Many would consider this a waste, since the standby machine is doing nothing most of the time.  Most people also forget to test the standby machine (particularly if it's turned off as a "cold" standby, as opposed to a "hot" standby that stays up and running and keeps its data in sync with the primary), so it's entirely possible that the standby breaks first, nobody notices, and only when the primary machine fails do they discover the standby doesn't work either!  Backup systems can end up giving you a false sense of security unless you go to the trouble of testing them regularly.  The best approach is to make this part of regular operations and exercise your fault recovery by actually switching between the two machines every week or so.  It helps to break out of the "primary/standby" terminology and just call them something more generic, like systems A and B.
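To make that concrete, here's a rough sketch in Python of what a weekly switchover drill might look like.  The node names, health-check port, and the switch mechanism are placeholders, not real infrastructure; promoting the other node could mean moving a virtual IP, updating DNS, or repointing a load balancer, depending entirely on your setup.

```python
#!/usr/bin/env python3
"""Sketch of a weekly A/B switchover drill (hypothetical hosts and port)."""
import socket

NODES = ["node-a.example.com", "node-b.example.com"]   # placeholder hostnames
HEALTH_PORT = 8080                                      # placeholder health-check port


def is_healthy(host, port=HEALTH_PORT, timeout=3.0):
    """Return True if the host accepts a TCP connection on the service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def switch_active_node(new_active):
    """Stand-in for the site-specific switchover (VIP move, DNS flip, LB change)."""
    print(f"(site-specific switchover to {new_active} would happen here)")


def exercise_failover(current_active):
    standby = next(n for n in NODES if n != current_active)
    # Refuse to switch onto a broken standby -- catching that silent
    # failure is half the point of running this drill at all.
    if not is_healthy(standby):
        raise RuntimeError(f"standby {standby} failed its health check; fix it first")
    switch_active_node(standby)
    if not is_healthy(standby):
        raise RuntimeError(f"{standby} stopped answering after promotion")
    return standby


if __name__ == "__main__":
    print("Now active:", exercise_failover("node-a.example.com"))
```

Run something like this from cron or by hand every week; the drill deliberately bails out if the standby is already broken, which is exactly the silent failure you want to catch before the primary dies.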

For running network services, having A and B systems opens up a lot of capability that the more expensive single-host system simply can't offer, such as zero-downtime A/B deployments and upgrades.  But those are topics to dwell on in future sections.  For now, we just want to make sure our system is fully fault tolerant.  How can we be sure?

Well, you need to be able to take any individual component of your system out and have the system continue functioning.  The easy way to spot vulnerable components is to look for redundancy: there should be at least two of everything.  If there's only one of something, chances are it's a potential single point of failure (SPoF).  You should be able to invite someone to walk up to your system, grab any single box or cable, and remove it with no ill effect (a sketch of a probe to run during that kind of drill follows the list below).  Ideally, this should include:

  • network cables
  • disks
  • entire computer nodes
  • power cables
  • UPS
  • switches and routers
  • network uplinks (can't be much of a service provider if you only have one ISP)
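Here's a rough sketch of the kind of probe you could leave running during such a pull-the-plug drill; the service hostname and port are placeholders for whatever your system actually exposes.  It hammers the service once a second while someone yanks a cable, disk, node, or switch, and reports any window where the service stopped answering, so you can tell whether that removal actually mattered.

```python
#!/usr/bin/env python3
"""Sketch of a pull-the-plug drill watcher (hypothetical service endpoint)."""
import socket
import time

SERVICE = ("service.example.com", 443)   # placeholder host and port


def probe(timeout=1.0):
    """Return True if the service accepts a TCP connection right now."""
    try:
        with socket.create_connection(SERVICE, timeout=timeout):
            return True
    except OSError:
        return False


def watch(duration=300, interval=1.0):
    """Probe the service for `duration` seconds and report any outage windows."""
    outage_started = None
    for _ in range(int(duration / interval)):
        if probe():
            if outage_started is not None:
                print(f"service recovered after {time.time() - outage_started:.1f}s outage")
                outage_started = None
        else:
            if outage_started is None:
                outage_started = time.time()
                print("service stopped answering")
        time.sleep(interval)
    if outage_started is not None:
        print("service still down at end of drill")


if __name__ == "__main__":
    watch()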

Monitoring and alerting are also key.  If a backup component silently fails, you no longer have redundancy.  Any component that fails needs to generate an alert so it gets fixed, and that needs to be part of the system's fault-tolerance validation checklist.
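As a sketch of what that check might look like (again, the node names, port, and alert mechanism are placeholders for whatever your monitoring setup actually uses), the key idea is that the service merely being up isn't good enough; losing a redundant member should page somebody too:

```python
#!/usr/bin/env python3
"""Sketch of a redundancy check: both members of the pair must be healthy."""
import socket

PAIR = ["node-a.example.com", "node-b.example.com"]   # placeholder hostnames
HEALTH_PORT = 8080                                     # placeholder health-check port


def is_healthy(host, port=HEALTH_PORT, timeout=3.0):
    """Return True if the host accepts a TCP connection on the service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def alert(message):
    # Stand-in for mailing, paging, or opening a ticket.
    print("ALERT:", message)


def check_redundancy():
    down = [n for n in PAIR if not is_healthy(n)]
    if len(down) == len(PAIR):
        alert("OUTAGE: all nodes down: " + ", ".join(down))
    elif down:
        # Service is still up, but we are now one failure away from an outage.
        alert("Redundancy lost: " + ", ".join(down) + " failed health check")


if __name__ == "__main__":
    check_redundancy()
```

Hook a check like this into whatever monitoring you already run, and testing it becomes part of the same validation checklist as the fault-tolerance drills above.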

Isn't having two of everything a waste?  Only if the downtime from a component failure would cost you less in time, effort, and equipment than the spare did.  Parts break.  Think of it as replacing broken parts in advance.  Yes, if you're lucky enough that the parts don't break during the product lifecycle, you've spent up to twice as much money.  But that's only a waste if you're not creative enough to put the "working spare" parts to good use.

As a side benefit, a redundant system can be maintained without downtime, with components replaced or even upgraded while the system continues to operate in a degraded fashion.  This enables High Availability, which we will discuss in the next section.