Tuesday, February 17, 2015

Rumblings on Distributed Systems


RoDS is about scalability, load balancing, and fault tolerance

But first, a disclaimer: I am by no means an acknowledged expert in the field of reliability engineering.  This is merely a topic I've spent a fair amount of time reading about, thinking about, and practicing, so hopefully someone might benefit from some random insight.

What are distributed systems?

Before coming to work for Boeing, I rode the crest of the 1990s tech wave at Ocean City, working at a supercomputing cluster start-up. We designed high-availability, fault-tolerant, scalable computing systems with a rather unique recursive network topology. We would often compare our reliability goals with the legendary reputation of Boeing aircraft, where double and triple redundancy would allow the plane to keep flying after multiple equipment failures. So I was rather surprised, when I did start working for Boeing, not to find that philosophy pervasive in a lot of the work we were doing.

At the supercomputing company we would perform this demonstration where we'd start the cluster on an intensive task such as parallel raytracing. As the nodes were working, we'd walk up to the machine and pull out various components -- network cables, power supplies, entire computing blades -- and show how the system would keep on running. The process would continue rendering -- maybe hiccup a bit, but then go back and fill in the missing data.
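
For the curious, here is a toy sketch in Python of the general idea behind that demo: work is split into tiles, and any tile that was on a node when it got pulled simply goes back in the queue for a surviving node. To be clear, this is not how our cluster actually did it; the function name, the round-based scheduling, and the fail_at map are all made up for illustration.

```python
from collections import deque


def render_with_failures(num_tiles, nodes, fail_at):
    """Simulate rendering tiles on a cluster where nodes get pulled mid-job.

    fail_at is a hypothetical map of node name -> round in which that node
    is pulled. Returns (tile -> node that finished it, number of retries).
    """
    pending = deque(range(num_tiles))  # tiles still waiting to be rendered
    alive = list(nodes)                # nodes that have not been pulled yet
    done = {}
    retries = 0
    round_no = 0
    while pending:
        if not alive:
            raise RuntimeError("no nodes left to finish the job")
        # Nodes that get pulled during this round.
        failed = {n for n in alive if fail_at.get(n) == round_no}
        # Hand one tile to each surviving node (fewer if tiles run short).
        assignments = [(n, pending.popleft()) for n in alive[:len(pending)]]
        for node, tile in assignments:
            if node in failed:
                # The tile was lost with its node: requeue it (the "hiccup").
                pending.append(tile)
                retries += 1
            else:
                done[tile] = node
        alive = [n for n in alive if n not in failed]
        round_no += 1
    return done, retries
```

Calling render_with_failures(6, ["a", "b", "c"], {"b": 1}) pulls node "b" in the second round; its tile is requeued and picked up by a survivor, so all six tiles still finish. Pull every node and the job fails, which is the one thing redundancy cannot fix.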

A lot of my understanding and perhaps obsession with distributed systems was shaped by studying and designing for these types of computing components: RAID arrays, redundant power supplies, load balancing, etc. However, a lot of these patterns and considerations can be applied to many other fields, including products, systems, and people.

When most people mention Distributed Operations (and yes, that was the actual name of my workgroup), they generally mean it in the geographical sense, in that the network allows us to decouple people, tools, and resources from particular locations. Let's spend some time musing over some of the many other senses of distributed architectures, however.
