Tuesday, February 17, 2015

RoDS - Properties of Distributed Systems

What are some of the properties and capabilities enabled by distributed architectures?

This entry provides a brief outline to be expanded upon in this blog.
  1. "Distributed" means more than one.
  2. "Geographically Distributed" is a corollary of 1.
  3. "Logically Distributed" means the system is never partitioned so that one node is the only one that can perform a certain function or service part of the problem set; other nodes must be able to step in and perform the same function, which enables everything else:
  4. Redundancy and Fault Tolerance: individual nodes can fail without causing system failure. As a side benefit, the system can be maintained without downtime, with components getting replaced or even upgraded while the system continues to operate in a degraded fashion.
  5. High Availability: with redundancy, the system can continue to operate without interruption. For certain critical systems, such as air traffic control, scheduling downtime for repairs or upgrades is not cost-effective, practical, or acceptable.
  6. Load Balancing: The additional nodes providing fault tolerance are not always standing by or duplicating effort to provide redundancy. Under normal operating conditions, we want the extra nodes to be sharing the workload so the system can more effectively get its job done in parallel.
  7. Scalability: A flat organization of nodes can only grow so far. Eventually we need to provide a better means of organizing the nodes into groups to overcome various bottlenecks and other coordination limitations that prevent the system from tackling larger problem sets.
  8. Heterogeneity: The system must also be able to grow and evolve gracefully over time to meet availability goals. Inevitably, this means the system will need upgrades. Rather than scheduling downtime to shut down the old system and turn on the new one, we should be able to introduce upgraded components and have them interoperate with the existing ones... running the older components in concert with the newer, gradually transitioning to newer protocols as necessary. Heterogeneity also improves the system's ability to survive total failure from one fatal flaw or bug that hits every component of a homogeneous system at once (e.g., while a computer virus might take down all computers running one particular operating system, hopefully the system was designed so that clusters running different operating systems could resume those functions).
  9. Interoperability: In order to swap out and upgrade components of a system to achieve all of these worthy system features and capabilities, well-defined and standardized interoperability between components is key.
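The redundancy, failover, and load-balancing ideas in items 4 through 6 can be sketched in a few lines of Python. This is a toy model, not any real cluster's code: the `Node` class and `dispatch` helper are hypothetical names invented for illustration, assuming each node can either handle a request or raise a connection error.

```python
import random

class Node:
    """A hypothetical worker node that may be up or down."""
    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"

def dispatch(nodes, request):
    """Spread load and tolerate faults: start at a random node so work
    is shared, then fail over to the next node until one succeeds.
    The system only fails if every node is down."""
    start = random.randrange(len(nodes))
    for i in range(len(nodes)):
        node = nodes[(start + i) % len(nodes)]
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # failover: this node is down, try the next one
    raise RuntimeError("all nodes failed")
```

The point of the sketch: marking a node as down (the software equivalent of yanking a blade out of the chassis) degrades the system without stopping it, because every node can perform the same function.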

If you've read this far and you "get it", then great!  No need to go any further!
In each section I'll just be sharing some anecdotes, examples, and free association for each topic, building upon the previous ones.

Rumblings on Distributed Systems


RoDS is about scalability, load balancing, and fault tolerance

But first a disclaimer: I am by no means an acknowledged expert in the field of reliability engineering.  This is merely a topic I've spent a fair amount of time reading, thinking, and practicing with, so hopefully someone might benefit from some random insight.

What are distributed systems?

Before coming to work for Boeing, I rode the crest of the 1990's tech wave at Ocean City, working at a supercomputing cluster start-up. We designed high-availability, fault-tolerant, scalable computing systems with a rather unique recursive network topology. We would often compare our reliability goals with the legendary reputation of Boeing aircraft, where double and triple redundancy would allow the plane to keep flying after multiple equipment failures. So I was somewhat surprised, when I did start working for Boeing, not to find that philosophy pervasive in a lot of the work we were doing.

At the supercomputing company we would perform this demonstration where we'd start the cluster on an intensive task such as parallel ray tracing. As the nodes were working, we'd walk up to the machine and pull out various components -- network cables, power supplies, entire computing blades -- and show how the system would keep on running. The process would continue rendering -- perhaps hiccup for a bit, but then go back and fill in the missing data.

A lot of my understanding and perhaps obsession with distributed systems was shaped by studying and designing for these types of computing components: RAID arrays, redundant power supplies, load balancing, etc. However, a lot of these patterns and considerations can be applied to many other fields, including products, systems, and people.

When most people mention Distributed Operations (and yes, that was the actual name of my workgroup), they generally mean it in the geographical sense, in that the network allows us to decouple people, tools, and resources from particular locations. Let's spend some time musing over some of the many other senses of distributed architectures, however.