RoDS is about scalability, load balancing, and fault tolerance.
But first, a disclaimer: I am by no means an acknowledged expert in the field of
reliability engineering. This is merely
a topic I've spent a fair amount of time reading about, thinking about, and
practicing, so hopefully someone might benefit from some random insight.
What are distributed systems?
Before coming to work for Boeing, I rode the crest of the
1990s tech wave at Ocean City, working at a supercomputing cluster start-up.
We designed high-availability, fault-tolerant, scalable computing systems with
a rather unique recursive network topology. The company would often compare its
reliability goals with the legendary reputation of Boeing aircraft, where
double and triple redundancy would allow the plane to keep flying after
multiple equipment failures. So I was kind of surprised, when I did start working
for Boeing, not to find that philosophy pervasive in a lot of the work we
were doing.
At the supercomputing company we would perform this
demonstration where we'd start the cluster on an intensive task such as
parallel ray tracing. As the nodes were working, we'd walk up to the
machine and pull out various components -- network cables, power supplies,
entire computing blades -- and show how the system would keep on running. The
process would continue rendering -- maybe hiccup a bit, but then go back and
fill in the missing data.
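For the curious, here is a rough sketch of the "go back and fill in the missing data" idea. It is not how our cluster actually worked; it is just a toy Python simulation under my own assumptions, where the image is split into numbered tiles, workers take tiles round-robin, and a worker that gets "unplugged" has its tile put back on the queue. The names render_tile, render_with_failures, and failed_after are all made up for illustration.

```python
from collections import deque


def render_tile(tile):
    # Stand-in for real ray tracing work: any pure function of the tile id.
    return tile * tile


def render_with_failures(num_tiles, workers, failed_after):
    """Render every tile even if some workers disappear along the way.

    failed_after maps a worker name to the number of tiles it completes
    before it is "unplugged" (hypothetical knob for this simulation).
    """
    pending = deque(range(num_tiles))
    alive = list(workers)
    done_count = {w: 0 for w in workers}
    image = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all workers failed")
        worker = alive[turn % len(alive)]
        tile = pending.popleft()
        if done_count[worker] >= failed_after.get(worker, float("inf")):
            # Worker was pulled mid-task: drop it and re-queue its tile
            # so a surviving worker can fill in the gap later.
            alive.remove(worker)
            pending.append(tile)
            continue
        image[tile] = render_tile(tile)
        done_count[worker] += 1
        turn += 1
    return image


if __name__ == "__main__":
    # Worker "b" dies after finishing one tile; the image still completes.
    picture = render_with_failures(8, ["a", "b", "c"], {"b": 1})
    print(sorted(picture))
```

The point is only that the work list, not any single worker, is the source of truth: as long as one worker survives, every tile eventually gets rendered.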
A lot of my understanding of, and perhaps obsession with,
distributed systems was shaped by studying and designing for these types of
computing components: RAID arrays, redundant power supplies, load balancing,
etc. However, a lot of these patterns and considerations can be applied to many
other fields, including products, systems, and people.
When most people mention Distributed Operations (and
yes, that was the actual name of my workgroup), they generally mean it in the
geographical sense, in that the network allows us to decouple people, tools, and
resources from particular locations. Let's spend some time, however, musing over
some of the many other senses of distributed architectures.