This post is a work in progress.
Distributed computing systems are an important reality to wrestle with at many levels of technical exposure. Similar to the concept of a cluster I discussed in this blog post, a distributed system is “a collection of autonomous computing elements that appears to its users as a single coherent system” (van Steen, Tanenbaum). Distributed computing environments introduce new dynamics which can affect application development. Enterprise-level corporations (< 500-1000 employees) are best known to operate in distributed environments, and any organization larger than that is guaranteed to use distributed computing. To contrast a cluster from a distributed system, I would say that a cluster is a galaxy and a distributed computing environment is a universe.
It is common for someone new to a distributed system to make some basic assumptions which will cause more issues given time. These assumptions are known as the Fallacies of Distributed Computing.
The fallacies of distributed computing are:
The reading list below is a copy of the currently available version written by Christopher Meiklejohn.
The problems of establishing consensus in a distributed system.
In Search of an Understandable Consensus Algorithm
2013
A Simple Totally Ordered Broadcast Protocol
2008
Paxos Made Live - An Engineering Perspective
2007
The Chubby Lock Service for Loosely-Coupled Distributed Systems
2006
2001
Impossibility of Distributed Consensus with One Faulty Process
1985
The Byzantine Generals Problem
1982
Types of consistency, and practical solutions to solving ensuring atomic operations across a set of replicas.
Highly Available Transactions: Virtues and Limitations
2013
Consistency Tradeoffs in Modern Distributed Database System Design
2012
CAP Twelve Years Later: How the “Rules” Have Changed
2012
Calvin: Fast Distributed Transactions for Partitioned Database Systems
2012
2005
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
2002
Harvest, Yield, and Scalable Tolerant Systems
1999
Linearizability: A Correctness Condition for Concurrent Objects
1990
Time, Clocks, and the Ordering of Events in a Distributed System
1978
Studies on data structures which do not require coordination to ensure convergence to the correct value.
A Comprehensive Study of Convergent and Commutative Replicated Data Types
2011
A Commutative Replicated Data Type For Cooperative Editing
2009
CRDTs: Consistency Without Concurrency Control
2009
Languages aimed towards disorderly distributed programming as well as case studies on problems in distributed programming.
Logic and Lattices for Distributed Programming
2012
Dedalus: Datalog in Time and Space
2011
MapReduce: Simplified Data Processing on Large Clusters
2004
A Note On Distributed Computing
1994
Implemented and theoretical distributed systems.
Spanner: Google’s Globally-Distributed Database
2012
ZooKeeper: Wait-free coordination for Internet-scale systems
2010
A History Of The Virtual Synchrony Replication Model
2010
Cassandra — A Decentralized Structured Storage System
2009
Dynamo: Amazon’s Highly Available Key-Value Store
2007
Stasis: Flexible Transactional Storage
2006
Bigtable: A Distributed Storage System for Structured Data
2006
2003
Lessons from Giant-Scale Services
2001
Towards Robust Distributed Systems
2000
Cluster-Based Scalable Network Services
1997
The Process Group Approach to Reliable Distributed Computing
1993
Overviews and details covering many of the above papers and concepts compiled into single resources.