Thursday, May 28, 2015

Logically Distributed

"Logically distributed" simply means that nodes are capable of sharing their work.  No node is ever the only node that can perform a specific task -- other nodes must be able to come in and perform the same function.  They can either help (preferred) or take over completely.  A more practical and familiar way of looking at this is that the system should have no single point of failure (SPoF).

While this sounds kinda silly, this quality of logical distribution pretty much enables everything else we need for distributed operations.
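
To make this concrete, here's a minimal sketch in Python (my own illustration, not code from any particular system): work flows through a shared queue, and any worker can pick up any task, so no task is ever bound to a single node.

    import queue
    import threading

    tasks = queue.Queue()

    def worker(name):
        # Any worker can service any task; no task is owned by one node.
        while True:
            task = tasks.get()
            if task is None:               # shutdown sentinel
                break
            print(f"{name} handled {task}")
            tasks.task_done()

    # Three interchangeable workers -- if one dies, the others drain the queue.
    workers = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(3)]
    for w in workers:
        w.start()

    for job in range(10):
        tasks.put(f"job-{job}")

    tasks.join()                           # wait until every job is handled
    for _ in workers:
        tasks.put(None)                    # tell each worker to exit
    for w in workers:
        w.join()

Of course, the single queue here is itself a SPoF; in a real system the queue would need to be replicated too. But the worker side shows the property we're after: any node can do the work.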

Thursday, May 21, 2015

Geographically Distributed

Assuming we've made the step to a distributed model, it's necessary to consider where we're distributing our resources.  Two things can't occupy the same space at the same time, so if we're going to have more than one of something, where should we put them all?

Sometimes we want things as close to each other as possible.  With a distributed model, we often want to try to push them as far apart as practical.

Colocated Things

Performance and convenience are the main drivers that push the nodes of your system close together, or colocated.  Proximity makes communication faster, cheaper, and lower-latency.  Plus, it's easier to maintain everything if it's all in one place.  But the primary reason to consolidate and centralize resources is probably to minimize overhead.  Here are some things that designers tend to centralize without really thinking about it too much:
  • Backups - yes, it's faster to make a local copy for disaster recovery, but you probably want to ship your backups as far away as practical if they're going to survive whatever kills your primary working copy.
  • Command HQ - everyone wants to hobnob with the big cheese, to the detriment of satellite offices.  But when the goal is to have clear and authoritative leadership and top-down communications, people haven't really figured out how to do it better yet.
  • Inventory and maintenance - bus and rail systems typically have a single station or depot to consolidate spares and specialists for repairs.
  • Databases and storage - network-attached storage is placed into large banks for centralized management and provisioning.  Stateful databases also tend to be the most performance- and security-critical elements of an information system, so we try to keep them locked away in a secure central facility for compliance with various laws.
  • Network switches - ironically, the major piece of equipment that makes distributed operations possible also tends to get consolidated into huge backplanes in a central switching room.  But this ensures that high network performance is always available to throw at problems with ill-defined or emerging requirements.  Throwing more bandwidth at a problem can often be a decent substitute for planning.

Dispersed Things

For distributed systems, you will often find yourself pushing nodes out as far as practical.

The network enables decentralization.  From a philosophical standpoint, the DoD has plenty of insight (yes, the DoD pays people to wax philosophical about the uses of ARPANET).  DoD white papers on Effects-Based Operations talk about how data networks can push the power to the edge, allowing the decision-making to occur where it is most needed.  Instead of long feedback and control loops where all sensors must report data to a central command for analysis and synthesis of a response to be transmitted back to the effectors, the network allows the effectors themselves to understand the situation and take the appropriate action on the spot.

With this in mind, let's consider some of the ways where geographical dispersion of system components is beneficial.
  • Backups - disasters take place on different scales.  For business continuity, the farther your backups live from your computer systems, the better they'll be able to survive progressively more catastrophic events:
    • Backup disk in your computer:  a virus or a simple power surge could corrupt both your system and your backup volume in one fell swoop
    • Backup disk next to your computer: a thief or fire sprinkler could wipe out your electronic equipment
    • Backup disk in another room: an actual fire could destroy your building
    • Backup disk in another building: a flood or earthquake could wipe out your city
    • Backup disk in another city: probably will take a cyberattack or government legal action to shut you down at this point
    • Backup disk in another country: good luck complying with all applicable export laws
    • Backup disk on another planet: what we all aspire to
  • Web servers:  sure, they're on the internet in the "cloud", so it shouldn't matter.  But studies by Amazon and other retailers have shown that server responsiveness does affect sales, and tenths of a second count.  The speed of light may be fast at 300,000 km per second, but light in fiber only manages about two-thirds of that, so a single coast-to-coast round trip across the US costs tens of milliseconds, and a secure web request needs several round trips before the first byte even arrives.  Add in all of the data transfer times, encryption, and backend API and database calls, each with their own set of handshaking delays, and you're easily counting web interaction response time in seconds.  For reference, video gaming lag becomes painfully noticeable above about 0.2 seconds, and two-way human voice conversation becomes tortured above just 0.5 seconds, with people mistakenly talking over each other.  So with internet services, the extra gains in responsiveness from locating your servers geographically close to your customers are measurable and significant.  The back-of-the-envelope sketch below works through the numbers.
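
A quick back-of-the-envelope sketch in Python (the distances and round-trip counts are my own illustrative assumptions, not measurements):

    # All rough numbers: light in fiber travels at about two-thirds of c,
    # and a fresh HTTPS request needs several round trips (TCP handshake,
    # TLS handshake, then the request itself) before the first byte of
    # the response arrives.

    FIBER_KM_PER_S = 200_000      # ~2/3 the vacuum speed of light
    COAST_TO_COAST_KM = 4_500     # New York to Los Angeles, very roughly
    ROUND_TRIPS = 4               # TCP (1) + TLS 1.2 (2) + HTTP request (1)

    rtt_s = 2 * COAST_TO_COAST_KM / FIBER_KM_PER_S
    print(f"one round trip: {rtt_s * 1000:.0f} ms")                    # ~45 ms
    print(f"time to first byte: {ROUND_TRIPS * rtt_s * 1000:.0f} ms")  # ~180 ms

And that's pure propagation delay over an ideal straight fiber run -- real routes meander, and every backend call the server makes adds round trips of its own.  Move the server from 4,500 km away to 500 km away and that first-byte figure drops by roughly a factor of nine.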

Monday, March 30, 2015

One

1. Distributed means "more than one".
Like the Buddha going to the hot dog stand and asking the vendor to "make me one with everything," let us meditate upon the meaning of this title.
Now, more than ever, we live in a binary world.  Almost all digital logic can be expressed as a seemingly endless string of ones and zeros.  Computers can perform any operation and calculation imaginable in base-2.  People could be divided into the "haves" and the "have-nots."  Has the necessity for anything more become an outdated relic of the past?  A historical footnote of a simpler culture, like the aboriginal language that only had words for the concepts of none, one, and "more than one"?  Of course not.

But let us first consider the special circumstances conferred by 0 and 1.

e^(iπ) + 1 = 0
Euler's equation.  Well, one of them, anyway.  Notable for including most of the important numbers used in math.  Who would need anything more?

Is there any advantage to considering a third option, "many"?  That could add extra complexity, overhead, and waste.  And some things are impossible to duplicate.  After all, you only live once.

How can we justify investing extra energy and resources in redundancy?  Well, maybe it's not always worthwhile.  That's the first decision you need to be prepared to make right after deciding to build a product or capability -- that is, right after going from none to one: should you go from one to many at the outset?

I would argue yes... inherently you will always be faced with a multiplicity of things that you will need to maintain throughout their life cycle, so you might as well plan for handling the many from the beginning.   It can be extremely difficult to transition a system that was only ever designed to be singular into anything else.  You will hurt yourself more in the long run by taking the short view.  It isn't terribly difficult to plan for handling the many from the outset if you have an organizational framework.  That's what this blog is all about.

But first, let us take a moment to consider what kinds of situations actually make sense for there to be only one.

We mentioned "one life".  Marley would add "one love", but the animal kingdom gives us several examples where there's an evolutionary advantage to being... flexible with that rule.

What else would be better if you only had one?  One car?  Sure, if it breaks down you can telecommute for a while or take public transit, but eventually that might take a toll on your job performance or personal time.  You'd probably end up renting a car while your only means of transportation is in the shop.  But that's really just having more than one car available to you.

One house?  True, it doesn't make sense to purchase more than one house just in case one of them burns down, needs to be fumigated, or simply loses power or some other utility for a week or so after a storm.  But chances are that if one of those things happens, you have friends or family nearby that you can crash with, or can at least stay in a hotel or shower at the gym on occasion.  We share what we own in times of need; that's distributed.

One phone?  We could certainly disappear into the mountains, away from contact, for a while, but eventually our answering machine fills up, and we miss bills and lose friends.  There's only so long you can leave things to buffer up.  But you can certainly let many things buffer up for a few days while you replace a broken handset.  No financial harm done, unless you missed a job interview during that time or couldn't help a family member in need of emergency assistance.

Which brings us to computers.  Perhaps you only use your computer for entertainment, and you can only play one game or watch one movie at a time, so it doesn't make sense to have more than one, and if it breaks you just find some other pastime to keep your fun-meter pegged.  If you actually use your computer to do anything important, though, you may have found that losing it due to a disk failure can range from annoying to debilitating depending upon how much time it takes to set up a replacement.

So the common thread in all of this is that equipment loss or failure is, in most cases, a recoverable interruption in service that just means you need to spend some extra time, money, and apologies while you tinker with getting the replacement up and running again.  If you have extra time and money and reputation to spare, then by all means, do not worry too much about accounting for ever having more than one of a capability on hand.  Chances are, it won't happen at the worst possible moment.  As the old pilots' saying goes, "I'd rather be lucky than good."

There are several reasons why going the distributed route may not be necessary for your situation.  First, if you have the ability to convince your boss or customer to buy your excuse for delay or failure, that's great!  Second, if you are the only person inconvenienced by an equipment failure, and you're willing to just grin and bear it, then you're not hurting anyone but yourself.  Third, perhaps you've poured substantial resources into this one house or commercial vehicle, and you just have to risk your livelihood on its continued reliable functioning.  If it does need maintenance, you simply drop everything and focus on getting that critical component up and running again.  Finally, perhaps you are blessed with a monopoly on some product or service.  Then if your ability to deliver is interrupted, your clients will just buffer up and be forced to wait, because there is no competition.  Sure, you can go ahead and cut costs by neglecting fault tolerance or even preventative maintenance if you're not going to be losing any money in the long term.  That could work fine if you're some sort of government bureaucracy or sought-after artist, but it probably won't make you popular.

So far, we've only been talking about accounting for failure modes, which is likely only interesting to insurance actuaries.  There are plenty of other, more interesting benefits to using the distributed model.  Let's engage in a few thought exercises now in order to save time and resources during crises in the future.  Then we'll be better equipped to decide if it's worth the extra complexity to design for distributed operation up front.

Tuesday, February 17, 2015

RoDS - Properties of Distributed Systems

What are some of the properties and capabilities enabled by distributed architectures?

This entry provides a brief outline to be expanded upon in this blog.
  1. "Distributed" means more than one.
  2. "Geographically Distributed" is a corollary of 1.
  3. "Logically Distributed" means the system is never partitioned so that a node is the only one that can perform a certain function or service part of the problem set; other nodes must be able to come in and perform the same function, which enables everything else:
  4. Redundancy and Fault Tolerance: individual nodes can fail without causing system failure. As a side benefit, the system can be maintained without downtime, with components getting replaced or even upgraded while the system continues to operate in a degraded fashion.
  5. High Availability: with redundancy, the system can continue to operate without interruption. For certain critical systems, such as air traffic control, scheduling downtime for repairs or upgrades is not cost-effective, practical, or acceptable.
  6. Load Balancing: The additional nodes providing fault tolerance are not always standing by or duplicating effort to provide redundancy. Under normal operating conditions, we want the extra nodes to be sharing the workload so the system can more effectively get its job done in parallel (see the sketch after this list).
  7. Scalability: A flat organization of nodes can only grow so far. Eventually we need to provide a better means of organizing the nodes into groups to overcome various bottlenecks and other coordination limitations that prevent the system from tackling larger problem sets.
  8. Heterogeneity: The system must be able to grow and evolve gracefully over time as well, to meet availability goals. Inevitably, this means that the system will need upgrades. Rather than scheduling downtime to shut down the old system and turn on the new system, we should be able to introduce upgraded components to the running system and have them interoperate... simultaneously running the older components in concert with the newer ones, gradually transitioning to newer protocols as necessary. Heterogeneity also increases the system's ability to survive total system failure due to one fatal flaw or bug that affects all components of a homogeneous system at once (e.g., while a computer virus might take down all computers running one particular operating system, hopefully the system was designed to allow other computer clusters with different operating systems to resume those functions).
  9. Interoperability: In order to swap out and upgrade components of a system to achieve all of these worthy system features and capabilities, well-defined and standardized interoperability between components is key.
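
To make items 4 through 6 concrete, here's a minimal sketch in Python (the replica names and the send function are hypothetical stand-ins, not any real client library) of client-side load balancing with failover: requests rotate round-robin across interchangeable replicas, and a dead replica is simply skipped.

    import itertools

    class ReplicaPool:
        """Rotate requests across interchangeable replicas, skipping dead ones."""

        def __init__(self, replicas):
            self._ring = itertools.cycle(replicas)   # round-robin load balancing
            self._count = len(replicas)

        def call(self, request, send):
            # Try each replica at most once; any live node can serve the request.
            last_error = None
            for _ in range(self._count):
                replica = next(self._ring)
                try:
                    return send(replica, request)    # success on any node is enough
                except ConnectionError as err:
                    last_error = err                 # this node failed; fail over
            raise RuntimeError("all replicas down") from last_error

    # Fake transport for demonstration: replica "b" is down the whole time.
    def fake_send(replica, request):
        if replica == "b":
            raise ConnectionError("b is unreachable")
        return f"{replica} served {request}"

    pool = ReplicaPool(["a", "b", "c"])
    for i in range(4):
        print(pool.call(f"req-{i}", fake_send))

Under normal conditions all three replicas share the load (item 6); when one fails, the others quietly absorb its traffic (items 4 and 5), and the caller never notices.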

If you've read this far and you "get it", then great!  No need to go any further!
In each section, I'll just be sharing some anecdotes, examples, and free association for that topic, building upon the previous ones.

Rumblings on Distributed Systems


RoDS is about scalability, load balancing, and fault tolerance

But first a disclaimer: I am by no means an acknowledged expert in the field of reliability engineering.  This is merely a topic I've spent a fair amount of time reading, thinking, and practicing with, so hopefully someone might benefit from some random insight.

What are distributed systems?

Before coming to work for Boeing, I rode the crest of the 1990s tech wave at Ocean City, working at a supercomputing cluster start-up. We designed high-availability, fault-tolerant, scalable computing systems with a rather unique recursive network topology. We would often compare our reliability goals with the legendary reputation of Boeing aircraft, where double and triple redundancy allows the plane to keep flying after multiple equipment failures. So I was kind of surprised when I did start working for Boeing and did not find that philosophy pervasive in a lot of the work we were doing.

At the supercomputing company, we would perform a demonstration where we'd start the cluster on an intensive task such as parallel raytracing. As the nodes were working, we'd walk up to the machine and pull out various components -- network cables, power supplies, entire computing blades -- and show how the system would keep on running. The render would continue -- maybe hiccup a bit, but then go back and fill in the missing data.

A lot of my understanding and perhaps obsession with distributed systems was shaped by studying and designing for these types of computing components: RAID arrays, redundant power supplies, load balancing, etc. However, a lot of these patterns and considerations can be applied to many other fields, including products, systems, and people.

When most people make mention of Distributed Operations (and yes, that was the actual name of my workgroup), they generally mean it in the geographical sense: the network allows us to decouple people, tools, and resources from particular locations. Let's spend some time musing over some of the many other senses of distributed architectures, however.