This entry provides a brief outline to be expanded
upon in this blog.
- "Distributed" means more than one.
- "GeographicallyDistributed" is a corollary of 1.
- "LogicallyDistributed" means the system is never partitioned so that a node is the only one that can perform a certain function or service part of the problem set; other nodes must be able to come in and perform the same function, which enables everything else:
- Redundancyand Fault Tolerance: individual nodes can fail without causing system failure. As a side benefit, the system can be maintained without downtime, with components getting replaced or even upgraded while the system continues to operate in a degraded fashion.
- High Availability: with redundancy, the system can continue to operate without interruption. For certain critical systems, such as air traffic control, scheduling downtime for repairs or upgrades is not cost-effective, practical, or acceptable.
- Load Balancing: The additional nodes providing fault tolerance are not always standing by or duplicating effort to provide redundancy. Under normal operating conditions, we want the extra nodes to be sharing the workload so the system can more effectively get its job done in parallel.
- Scalability: A flat organization of nodes can only grow so far. Eventually we need to provide a better means of organizing the nodes into groups to overcome various bottlenecks and other coordination limitations that prevent the system from tackling larger problem sets.
- Heterogeneity: The system must be able to grow and evolve gracefully over time as well, to meet availability goals. Inevitably, this means that the system will need upgrades. Rather than scheduling downtime to shut down the old system and turn on the new system, we should be able to introduce upgraded components to the new system and have them be able to interoperate... simultaneously running the older components in concert with the newer ones, gradually transitioning to newer protocols as necessary. Heterogeneity also increases the system's ability to survive total systems failure due to one fatal flaw or bug that affects all components of a homogeneous system at once (e.g., while a computer virus might take down all computers on one particular operating system, hopefully the system was designed to allow other computer clusters with different operating systems to resume those functions)
- Interoperability: In order to swap out and upgrade components of a system to achieve all of these worthy system features and capabilities, well-defined and standardized interoperability between components is key.
If you've read this far and you "get it", then great! No need to go any further!
In each section I'll just be sharing some anecdotes, examples, and free association for each topic that builds upon the previous.