Scalable Systems

Distributed Computing Challenges

Ideally, adding N servers allows the service to support N times as many users.

Linear scalability is hard to achieve because…

Challenges

Scalability

Fault Tolerance

Failure Recovery Methods

High Availability

Downtime Guarantee    Max Downtime per Year    Estimated Costs
99.9%                 8.77 h                   ~$3,000,000
99.99%                52.6 min                 ~$300,000
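The downtime figures follow directly from the availability percentage; a quick check in Python:

```python
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

def max_downtime_hours(availability: float) -> float:
    """Maximum downtime per year allowed by an availability guarantee."""
    return (1 - availability) * HOURS_PER_YEAR

print(round(max_downtime_hours(0.999), 2))        # three nines: 8.77 h
print(round(max_downtime_hours(0.9999) * 60, 1))  # four nines: 52.6 min
```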

Consistency

Gmail’s email sending is strongly consistent, but marking a message (e.g., as read) is only eventually consistent.

Performance

Remember that overall latency >= latency of the slowest component.
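This can be illustrated with a hypothetical request that fans out to several components in parallel: even in the best case, the response cannot return before the slowest component does. The component names and latencies below are made-up.

```python
# Hypothetical per-component latencies in milliseconds.
latencies_ms = {"cache": 2, "db": 45, "search": 120, "auth": 8}

# When a request fans out to all components in parallel, the overall
# latency is bounded below by the slowest component.
overall_ms = max(latencies_ms.values())
print(overall_ms)  # the 120 ms search call dominates
```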

Design Principles of Cloud Applications

Typical Design of Scalable Service

  1. Find the requirements and goals of the system (e.g., functional, non-functional)
  2. Figure out the workloads the system should be optimized for (e.g., is it a read-heavy workload?)
  3. Do back-of-the-envelope calculations for estimated storage capacity needs
  4. Sketch the high-level system design
  5. Design the database schema based on the functional requirements
  6. Design the large-scale system based on the non-functional requirements
  7. How do you scale the system?
  8. How can you make it reliable and redundant?
  9. How would you shard the data?
  10. Where do caching and load balancing fit in?
  11. How can you implement the functional compute requirements in the scaled system?
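Step 3 (back-of-the-envelope estimation) might look like this in practice. All numbers below are made-up assumptions for a hypothetical photo-sharing service, not figures from any real system:

```python
# Hypothetical workload assumptions for a photo-sharing service.
daily_active_users = 10_000_000
uploads_per_user_per_day = 2
avg_photo_size_mb = 2
retention_years = 5

# Daily ingest in GB, then total retained storage in PB.
daily_storage_gb = daily_active_users * uploads_per_user_per_day * avg_photo_size_mb / 1024
total_storage_pb = daily_storage_gb * 365 * retention_years / (1024 * 1024)
print(f"{daily_storage_gb:,.0f} GB/day, ~{total_storage_pb:.1f} PB over {retention_years} years")
```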

Additional Content

Amdahl’s Law

In general terms, Amdahl’s Law states that if P is the proportion of a system or program that can be made parallel, and 1 - P is the proportion that remains serial, then the maximum speedup S(N) that can be achieved using N processors is:

S(N) = 1 / ((1 - P) + P/N)

As N grows, the speedup tends to 1/(1 - P).

Speedup is limited by the total time needed for the sequential (serial) part of the program. For 10 hours of computing, if 9 hours can be parallelized and 1 hour cannot, then the maximum speedup is limited to 10 times as fast. Even as processors get faster, this maximum speedup stays the same.

http://www.umsl.edu/~siegelj/CS4740_5740/Overview/Amdahl's.html
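The formula and the 10x limit from the example above can be checked numerically:

```python
def speedup(p: float, n: int) -> float:
    """Amdahl's Law: maximum speedup with parallel fraction p on n processors."""
    return 1 / ((1 - p) + p / n)

# 9 of 10 hours parallelizable => p = 0.9
print(speedup(0.9, 10))    # ~5.26 with 10 processors
print(speedup(0.9, 1000))  # ~9.91, approaching the 1/(1 - P) = 10 limit
```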

A straggler slows everything down.

  1. Overhead for parallelization
  2. Synchronization is required
  3. Load imbalances
  4. Amdahl’s Law

Causes of load imbalance:

  1. Popular content
  2. Poor hash functions
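A poor hash function concentrates keys on a few shards, creating hot spots. A toy sketch (the shard count and key format are assumptions for illustration):

```python
from collections import Counter

keys = [f"user_{i}" for i in range(10_000)]
num_shards = 4

# Poor hash: uses only the first character, so every key with the same
# prefix lands on the same shard.
poor = Counter(ord(k[0]) % num_shards for k in keys)

# Better hash: Python's built-in hash spreads the keys across shards.
good = Counter(hash(k) % num_shards for k in keys)

print(poor)  # all 10,000 keys on one shard
print(good)  # roughly 2,500 keys per shard
```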

Super-linear scalability: per resource added, more than one extra user can use the service (per server > 1 user).

Linear scalability: per resource added, exactly one extra user can additionally use the service.

Sub-linear scalability: per resource added, less than one extra user can additionally use the service.

Making everything fully redundant is too expensive.

  1. Replication: replicate the data and the service, and have the job run again on the replica.
  2. Recomputation: remember the data lineage for a job and re-run the failed task. Easier for stateless tasks. May introduce data consistency problems (e.g., data already consumed, new data pushed).
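Lineage-based recomputation (re-running a failed task from its recorded inputs) can be sketched minimally as follows, assuming stateless tasks and hypothetical task names:

```python
# Each task records its inputs (its lineage) and a pure function to run.
lineage = {
    "raw":     ([],          lambda: [1, 2, 3, 4]),
    "doubled": (["raw"],     lambda raw: [x * 2 for x in raw]),
    "total":   (["doubled"], lambda doubled: sum(doubled)),
}

def recompute(task, cache=None):
    """Re-run a task by first recomputing its inputs from the lineage.

    Easy here because every task is stateless: re-running it on the
    same inputs always yields the same output.
    """
    cache = {} if cache is None else cache
    if task not in cache:
        deps, fn = lineage[task]
        cache[task] = fn(*(recompute(d, cache) for d in deps))
    return cache[task]

print(recompute("total"))  # 20
```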