Independent parallel processing of tasks / sub-requests ==> Can add additional servers for further concurrency
Must be able to handle and recover from both software and hardware failures. Services and data must be replicated for redundancy.
Service must be up and responsive 24/7, with uptime as close to 100% as possible
Data stored/produced by the services must be consistent
Consistency, Availability, Partition Tolerance
Cost of additional latency
Better performance, applications are harder to write
Predictable low latency processing with high throughput
The slowest 0.X% of the request latency distribution (tail latency, e.g. p99.9)
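As a sketch, a tail percentile can be computed with the nearest-rank method over a sample of latencies; the values below are invented for illustration.

```python
# Sketch: nearest-rank percentile over a latency sample (values in ms
# are made up). Most requests are fast; the tail is dominated by a few
# slow ones.
def percentile(samples, p):
    """Return the value below which p percent of the samples fall."""
    ordered = sorted(samples)
    # nearest-rank method: index of the p-th percentile
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

latencies = [10] * 997 + [250, 300, 900]  # 1000 requests total
p50 = percentile(latencies, 50)     # typical request: 10 ms
p999 = percentile(latencies, 99.9)  # tail request: 300 ms
```

Note how the median stays at 10 ms while the p99.9 is 30x larger: the tail, not the average, determines the worst user experience.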
In a distributed system, failures happen all the time. Design the application to be self-healing
Build redundancy into your application to avoid having single points of failure.
Minimize coordination between application services to achieve better scalability
Design your application so that it can scale horizontally, adding or removing instances on demand.
Use partitioning to work around database, network and compute limits.
Scaling a stateless service is trivial; state is what makes scaling hard
Latency is king. Caching helps to significantly reduce the job’s latency
Pick the storage technology that is the best fit for your data and how it will be used.
The partition/aggregate compute pattern is one that scales well
An evolutionary design is key for continuous innovation
Service Level Agreement
High-cost, powerful servers with more cores and more memory
Add more lower-cost servers
Power Usage Effectiveness is the ratio of the total energy used by the data center to the energy delivered to computing equipment
All IT Systems (servers, networking etc.) + Other Equipment (cooling, UPS, generators, lights, fans)
Resources owned and operated by one org, usable by any paying customer
Resources owned, operated and used by a single org.
Note: Can be on-prem and/or hosted
Combine Public and Private cloud. Better control over sensitive data, whilst taking advantage of scale of public cloud provider.
e.g. patient database in private cloud, but API in public cloud
Use multiple clouds for an application / service. Redundant, but introduces complexity (API differences, migration)
Resources located locally (or in a data center operated by company)
Resources hosted and managed by third-party
Rent IT infrastructure (servers and VMs), as well as corresponding networking and security measures.
Get an on-demand environment for development/deployment
Build apps without managing servers; the cloud vendor provisions and manages them. Functions (e.g. AWS Lambda) scale automatically with demand.
Cloud hosted software, cloud provider handles all software and infrastructure.
Keeping a copy of the same data on multiple machines that are connected via network
Leaders are the replicas that accept writes
Followers are the nodes that apply the replication log they receive from the leader
The leader waits until all replicas confirm a change before reporting success to the user. Followers are guaranteed to have up-to-date data. If a follower does not answer, no writes can be done.
The leader sends the update message to all replicas, but reports success as soon as it has made the change locally
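The synchronous vs. asynchronous replication behaviour above can be sketched as a toy model; all class and method names here are illustrative, not a real API, and the network is simplified away.

```python
# Hypothetical sketch of synchronous vs. asynchronous leader-based
# replication. Followers and the replication log are reduced to
# in-memory dicts; names are illustrative only.
class Follower:
    def __init__(self):
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            return False  # follower does not acknowledge
        self.data[key] = value
        return True

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write_sync(self, key, value):
        # Synchronous: wait for every follower's ack before reporting success.
        self.data[key] = value
        acks = [f.apply(key, value) for f in self.followers]
        return all(acks)  # one unreachable follower blocks the write

    def write_async(self, key, value):
        # Asynchronous: report success immediately, replicate best-effort.
        self.data[key] = value
        for f in self.followers:
            f.apply(key, value)  # follower failures are ignored here
        return True
```

If one follower stops answering, `write_sync` returns failure (no writes can complete) while `write_async` still reports success, illustrating the durability/availability trade-off between the two modes.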
Running the same query on the leader and on a follower may return different results (replication lag). There is no limit to how far a replica can fall behind
Avoid jumps back in time (e.g. a comment appearing and then disappearing because it has not reached another replica yet) by always reading the same data from the same replica
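One way to always read the same data from the same replica is sticky routing: pick the replica by a stable hash of the user id, so repeated reads land on the same node. A minimal sketch, with made-up replica names:

```python
# Sketch: sticky read routing via a stable hash of the user id.
# Replica names are invented for illustration.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str) -> str:
    # sha256 is stable across processes and machines, unlike hash(),
    # so the same user always maps to the same replica.
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLICAS)
    return REPLICAS[index]

replica_for("alice")  # always the same replica for "alice"
```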
The client detects a stale response and writes the newer value back to the stale replica
A background process checks for inconsistencies and fixes them (unlike the replication log, order is not preserved)
A copy of the dataset
The ability to read your own changes right after making them
For large datasets, or very high throughput, we need to break the data up into partitions. Each piece of data belongs to exactly one partition
Splitting a table into multiple segments, where each segment holds a different subset of the rows.
Splitting a table into multiple tables, with some columns in one table and the rest in another. Ideally, each resulting table holds columns that are used together.
When it is possible to partition data functionally, e.g. into a read-only table and a read-write table, one should do so.
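As a sketch, horizontal (row) partitioning can assign each row to exactly one partition by key range; the boundaries below are invented for illustration.

```python
# Sketch: range-based horizontal partitioning on the first letter of the
# key. The boundaries are made up; real systems often rebalance them.
PARTITION_BOUNDS = [("a", "h"), ("i", "p"), ("q", "z")]

def partition_for(key: str) -> int:
    """Return the index of the single partition this key belongs to."""
    first = key[0].lower()
    for i, (lo, hi) in enumerate(PARTITION_BOUNDS):
        if lo <= first <= hi:
            return i
    raise ValueError(f"no partition covers key {key!r}")
```

Range partitioning keeps adjacent keys together (good for range scans) but risks hot spots; hashing the key instead spreads load evenly at the cost of losing key order.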
Message latency is no greater than a known upper bound; nodes execute at a known speed
System is asynchronous for a finite time, synchronous otherwise
Messages may be delayed arbitrarily, could be paused, no timing guarantee
Crash at any time, stops executing
Crash at any time, may resume at some point in time
Deviates from algorithm, could crash, could be malicious
System as a whole is not working
Some part of the system is not working
crash (crash-stop/crash-recovery), deviating from algorithm (Byzantine)
dropping or significantly delaying messages
System as a whole continues working, despite faults (some maximum number of faults assumed)
Algorithm that detects whether another node is faulty
labels a node as faulty if and only if it has crashed
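A perfect failure detector cannot be built from timeouts alone in an asynchronous system; what is used in practice is a timeout-based detector, which may wrongly suspect a node that is merely slow. A minimal sketch with injected timestamps so the behaviour is deterministic (all names are illustrative):

```python
# Sketch: timeout-based failure detection. A node is suspected if no
# heartbeat has arrived within the timeout. This is only *eventually*
# accurate: a slow-but-alive node can be suspected too.
class FailureDetector:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, node: str, now: float):
        self.last_heartbeat[node] = now

    def suspected(self, node: str, now: float) -> bool:
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout

fd = FailureDetector(timeout=5.0)
fd.heartbeat("node-a", now=0.0)
fd.suspected("node-a", now=3.0)   # False: heartbeat is recent
fd.suspected("node-a", now=10.0)  # True: heartbeat too old
```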
Mean Time to Recovery
Mean Time Between Failures
Service Level Objective: percentage of requests that must return a correct response within a specific time
e.g. 99.9% of requests in a day get a response within 200ms
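An SLO like the example above can be checked against a sample of observed latencies; a sketch with made-up values:

```python
# Sketch: checking a latency SLO ("99.9% of requests within 200 ms")
# against observed latencies. The sample values are invented.
def meets_slo(latencies_ms, threshold_ms=200, target=0.999):
    within = sum(1 for l in latencies_ms if l <= threshold_ms)
    return within / len(latencies_ms) >= target

sample = [50] * 9990 + [500] * 10  # exactly 99.9% within 200 ms
meets_slo(sample)                  # True: the SLO is just met
```

A single extra slow request in this sample would push the fraction below 99.9% and violate the SLO, which is why tail latency matters so much for SLO compliance.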
Service Level Agreement: contract specifying an SLO, typically with penalties for violation
Count number of seconds elapsed
Difference between two clocks
Time since a fixed date (e.g. Unix Epoch)
Time since arbitrary point (e.g. boot)
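In Python's standard library the two clock types above map to `time.time()` (time-of-day: seconds since the Unix epoch, may jump when NTP adjusts the system clock) and `time.monotonic()` (time since an arbitrary point, e.g. boot, never goes backwards):

```python
# Sketch: the two clock types in Python's standard library.
import time

wall = time.time()        # time-of-day clock: comparable across machines
start = time.monotonic()  # monotonic clock: only differences are meaningful
time.sleep(0.01)
elapsed = time.monotonic() - start  # always >= 0, immune to clock jumps
```

Use the monotonic clock for measuring elapsed time (timeouts, durations) and the time-of-day clock only for timestamps that must be compared across machines.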
Count Events (messages sent)
Parts per Million: used to measure quartz drift
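Converting a ppm drift figure into seconds of error per day is simple arithmetic; the 50 ppm value below is assumed purely for illustration.

```python
# Sketch: quartz drift in ppm converted to seconds of error per day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

def drift_seconds_per_day(ppm: float) -> float:
    # A clock that drifts by `ppm` parts per million gains or loses
    # that fraction of every elapsed second.
    return SECONDS_PER_DAY * ppm / 1_000_000

drift_seconds_per_day(50)  # 4.32 seconds of drift per day
```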
When several nodes come to an agreement about a single value
no two nodes decide differently
no node decides twice
if a node decides value v, then v was proposed by some node.
every node that does not crash, eventually decides some value.
A time period during which there is at most one leader (sometimes none)
Atomicity, Consistency, Isolation, Durability
Combine Serializability and Linearizability
The outcome of a set of transactions executed in parallel is the same as if they had been executed serially (one after another, without overlapping in time)
If there are no more updates, replicas will eventually converge to the same state (this may take a long time)
If two replicas communicate with one another, they will converge towards the same state, even if updates occur. Consists of…
Replicas process operations based on local state only
Any two replicas that have processed the same set of updates are in the same state (even if processing order differs)
Last Writer Wins
Order of delivery does not matter
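A last-writer-wins register illustrates both cards above: each write carries a timestamp, the highest timestamp wins, and merges commute, so delivery order does not matter. A minimal sketch (tie-breaking between equal timestamps, e.g. by node id, is omitted):

```python
# Sketch: a last-writer-wins (LWW) register. merge() keeps whichever
# value has the higher timestamp, so merging in any order converges.
class LWWRegister:
    def __init__(self, value=None, timestamp=0):
        self.value = value
        self.timestamp = timestamp

    def set(self, value, timestamp):
        # Only accept writes that are newer than the current state.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

    def merge(self, other):
        # Commutative: merge(a, b) and merge(b, a) reach the same state.
        self.set(other.value, other.timestamp)

a = LWWRegister()
b = LWWRegister()
a.set("x", timestamp=1)
b.set("y", timestamp=2)
a.merge(b)  # a now holds "y"
b.merge(a)  # b still holds "y": both replicas converge
```

The cost of LWW is that concurrent writes are silently discarded in favour of the highest timestamp, which is exactly the trade-off the "Last Writer Wins" card names.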