Distributed systems

Encore seems nice.

Notes

Getting a million users is infinitely harder than scaling a system to handle a million users. Most systems could run comfortably on a Raspberry Pi.
Fault-tolerant designs treat failures as routine. In large-scale systems, the assumption is that every component will fail sooner or later: any individual failure must be presumed imminent, and across the whole system component failures must be expected to occur continuously.
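
One common way to treat failures as routine is to retry transient errors with capped exponential backoff and jitter. The sketch below is only an illustration of that idea, not a technique taken from any of the links on this page. `TransientError`, `call_with_retries`, and `make_flaky` are hypothetical names introduced here, and the default delays are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a timeout or dropped connection."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying transient failures with capped backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The backoff ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)


def make_flaky(failures_before_success):
    """Build a hypothetical operation that fails N times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated component failure")
        return "ok after %d calls" % state["calls"]

    return operation


if __name__ == "__main__":
    # A no-op sleep keeps the demo instant.
    print(call_with_retries(make_flaky(2), sleep=lambda seconds: None))
```

Retries only help when the operation is safe to repeat, so in practice this pattern is usually paired with idempotent operations and an overall deadline.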

Links

Setting up containers, load balancing, and service discovery on light hardware
Ask HN: Any recommended resources to develop system thinking? (2018)
Distributed Systems in One Lesson by Tim Berglund (2017)
Traefik - Modern HTTP reverse proxy and load balancer that makes deploying microservices easy. (Hello World with Traefik) (Awesome) (Helm Chart)
Traefik Training course resources (Web)
Kit - Standard library for microservices written in Go. (kit-auth)
Fear and Loathing in Lock-Free Programming (2017)
Reliable Systems Series: Model-Based Testing (2018)
Awesome Distributed Systems
Awesome Distributed Systems 2
Kong - Cloud-Native API Gateway & Service Mesh.
Disque - Distributed message broker.
Mesh - Tool for building distributed applications.
Raft - Raft distributed consensus algorithm implemented in Rust.
hraftd - Hashicorp's Raft implementation.
In Search of an Understandable Consensus Algorithm (HN)
libp2p specification - Technical specifications for the libp2p networking stack.
Class materials for a distributed systems lecture series
Raft Consensus Algorithm (Code)
Qri - Global dataset version control system (GDVCS) built on the distributed web.
Project Oak - Meaningful control of data in distributed systems.
mudb - Collection of modules for building realtime client-server networked applications.
Verdi - Framework for formally verifying distributed systems implementations in Coq.
PingCAP Talent Plan - Series of training courses about writing distributed systems in Go and Rust.
Protocol Labs - Build protocols, systems, and tools to improve the internet.
Dark Crystal - Open source R&D affinity. Exploring the potential of new and existing technologies in crypto-space to encourage horizontal group collaboration.
Protozoa - Web developers, facilitators, crypto-engineers. Experts in Node.js & distributed systems.
Akka - Build highly concurrent, distributed, and resilient message-driven applications on the JVM. (Web) (Reddit) (Reddit)
Distributed Components - Provides reusable infrastructure for formally verifying distributed systems using the Coq proof assistant.
Practical Networked Applications in Rust, Part 1: Non-Networked Key-Value Store (HN)
LF - Fully Decentralized Fully Replicated Key/Value Store.
Awesome Consensus - Curated selection of artisanal consensus algorithms and hand-crafted distributed lock services.
Rezolus - Tool for collecting detailed systems performance telemetry and exposing burst patterns through high-resolution telemetry.
Cadence - Distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
Pilosa - Open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
Finagle - Fault tolerant, protocol-agnostic RPC system. (Scaling out a Rails app with Finagle) (Twitter) (Tweet)
How To Build A Modern Distributed Compute Platform (2018)
Chaos Monkey - Resiliency tool that helps applications tolerate random instance failures.
Faust - Python Stream Processing.
"Consistency without consensus in production systems" by Peter Bourgon (2014)
Distributed consensus reading list
Titanoboa - Community version of fully distributed, highly scalable and fault tolerant workflow orchestration platform for JVM.
Buoyant - Helps you deploy and run Linkerd, the fully open source, ultralight service mesh.
Grappa - Runtime system for scaling irregular applications on commodity clusters.
MIT Distributed Systems course (2020) (Videos) (Notes) (HN) (Discord)
Correctness proofs of distributed systems with Isabelle/HOL (2019)
Apache Mesos - Cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.
Gleam - Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.
Learning Distributed Systems - Cloud Native Podcast
etcd - Distributed reliable key-value store for the most critical data of a distributed system.
etcdadm - Command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster.
Learning to build distributed systems (2019) (Lobsters)
SwarmKit - Toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
How to get started with infrastructure and distributed systems (2016)
Advanced Napkin Math: Estimating System Performance from First Principles (2019) (Code)
Golimit - Uber ringpop based distributed and decentralized rate limiter.
System Design lectures (2020)
Awesome Scalability - Patterns of Scalable, Reliable, and Performant Large-Scale Systems.
LeetCode System Design Questions
Grokking the System Design Interview (Code)
Amazon Builders' Library - How Amazon builds and operates software.
Distributed Systems Wiki (Code)
Jepsen - Distributed Systems Safety Research.
ION - Distributed RTC system written in pure Go and Flutter.
Challenges with distributed systems (HN)
Systems design for Advanced Beginners (2020)
Performance Under Load (2018)
Veneur - Distributed, fault-tolerant pipeline for runtime data.
Going multi-region
List of distributed systems reading lists
Complexities of Capacity Management for Distributed Services (2020)
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol (2020)
WormSpace: A Modular Foundation for Simple, Verifiable Distributed Systems
Paxos vs Raft: Have we reached consensus on distributed consensus? (2020) (HN)
Debugging Distributed Systems (HN)
Distributed systems for fun and profit
Temporal - Open source microservices orchestration engine for running mission critical code at any scale. (Code) (Docs) (Why I joined Temporal) (Go SDK) (Talk)
Temporalite - Distribution of Temporal that runs as a single process with zero runtime dependencies.
Stateright - Model checker for implementing distributed systems. (HN)
Arvind Krishnamurthy's research
Distributed Services with Go
Fully asynchronous C implementation of the Raft consensus protocol
Notes on Distributed Systems for Young Bloods (2013) (HN)
Paxakos - Rust implementation of a distributed consensus algorithm based on Leslie Lamport's Paxos.
Riemann - Network event stream processing system, in Clojure.
Collection of the papers, conference talks, articles, blog posts, interesting Twitter threads, HN/reddit comments on systems engineering
Tess Rinearson - All Together Now: An Introduction to Distributed Consensus (2019)
Slurm - Open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. (Code) (Docs) (Set up Slurm across Multiple Machines)
Submitit - Lightweight tool for submitting Python functions for computation within a Slurm cluster.
CAP FAQ
Readings in Distributed Systems
Control theory for fun and profit (2020) (HN)
Understanding Replication in Databases and Distributed Systems (2018)
A plain English introduction to CAP theorem
Debugging Incidents in Google's Distributed Systems (2020) (HN)
Odin - Programmable, observable and distributed job orchestration system which allows for the scheduling, management and unattended background execution of user created tasks on Linux based systems. (HN)
Verifying Strong Eventual Consistency in Distributed Systems (2017)
Patterns of Distributed Systems (2020) (HN)
Keeping CALM: When Distributed Consistency Is Easy (2020)
Distributed Systems Notes
Avoiding fallback in distributed systems
The Reactive Principles - Design Principles for Distributed Applications.
Paxi - Framework that implements WPaxos and other Paxos protocol variants.
Rafting Trip - Learn about network programming, concurrency, distributed systems, and more as you tackle the challenge of implementing the Raft distributed consensus algorithm.
Resources for learning distributed systems (2020)
Workload isolation using shuffle-sharding (2020)
Consensus is Harder Than It Looks (2020)
The Little Strangler (Lobsters)
A Review of Consensus Protocols (2020) (HN)
Disel: Distributed Separation Logic - Separation-style logic for compositional verification of distributed systems.
raft-zero - Implementation of the Raft consensus algorithm on top of the act-zero actor framework.
raft-playground - Application to simulate and test a Raft cluster, using raft-zero.
Building Netflix’s Distributed Tracing Infrastructure (2020)
Wikipedia's self-hosted CDN (2020)
Infinite Parallel Universes: State at the Edge (2020) (Summary)
Awesome Chaos Engineering
How you could have come up with Paxos yourself (2020) (HN)
Grafana Tempo - Open source, easy-to-use and high-scale distributed tracing backend. (Web) (Announcement) (HN)
Principles of chaos engineering (Code) (HN)
Chaos Experimentation, an open-source framework built on top of Envoy Proxy (2021)
Testing Distributed Systems - Curated list of resources on testing distributed systems. (Code) (HN)
Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories (2020) (Summary)
Notes on Paxos (2020) (HN)
This is why distributed systems are useful (and I am building one) (2020)
Distributed Systems lecture series by Martin Kleppmann (2020) (Lecture Notes)
Dkron - Distributed, fault tolerant job scheduling system for cloud native environments. (Web)
Braft - Industrial-grade C++ implementation of the RAFT consensus algorithm.
Distributed Systems course (2020) (NotesG)
MirBFT Library - Consensus library implementing the Mir consensus protocol.
Fairness in multi-tenant systems (2020)
Advanced Distributed Systems Design course
Raft implementation in Go
Loading Shedding Strategies - Demonstration of load shedding and how it can make your services more resilient in outages and come back online quicker.
A Byzantine failure in the real world (2020)
Byzantine Eventual Consistency
Interval Tree Clocks (2020)
Distributed Systems Reading List (HN)
Raft Visualization (HN)
Meld - Decentralized shared state.
Understanding Connections & Pools (2021) (HN)
Fission Whitepaper (Code)
Awesome distributed transactions
Rystsov's Blog on distributed systems
Compartmentalized Paxos - Scaling Replicated State Machines with Compartmentalization. (Tweet)
DistSys Reading Group
CASPaxos: Replicated State Machines without logs (2018) (Code)
Consensus: Bridging Theory and Practice - PhD dissertation on the Raft consensus algorithm.
The Fundamental Mechanism of Scaling (2021)
Ray - Simple, universal API for building distributed applications. Accelerating machine learning workloads. (Code) (Docs)
Jepsen - Framework for distributed systems verification, with fault injection. Clojure library.
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (2019)
Distributed Systems in Rust - Training course about distributed systems in Rust.
rsraft - Raft implementation in Rust.
Implementing Raft's Leader Election in Rust (2021)
Effective Fallbacks (2020)
Ask HN: Recommended books and papers on distributed systems? (2021)
Raft implementation in Rust language
Porcupine - Fast linearizability checker for testing the correctness of distributed systems.
Testing Distributed Systems for Linearizability (2017)
Namazu - Programmable Fuzzy Scheduler for Testing Distributed Systems.
Engineering Dependability and Fault Tolerance in a Distributed System (2021)
Autopilot: workload autoscaling at Google (2020)
Byztime - Byzantine-fault-tolerant protocol for synchronizing time among a group of peers, without reliance on any external time authority.
Foundational Distributed Systems Papers (2021) (HN)
Making reliable distributed systems in presence of software errors by Joe Armstrong (2003)
unitalk - Distributed chat system which can be used as chat rooms or state synchronization.
Maelstrom - Workbench for learning distributed systems by writing your own.
An introduction to lockless algorithms (2021) (HN)
Clio - Functional, distributed programming language that compiles to JavaScript. (Code)
Distributed Systems Course (HN)
Sundial: Fault-tolerant Clock Synchronization for Data Centers (2021)
Achieving reliable dual writes in distributed systems (2021)
Paxos Made Simple (2016)
Fiber - Distributed Computing for AI Made Simple. (Web)
Raft Implementation & CLI Visualization in Rust
Ask HN: Learning Distributed Systems as a Junior Engineer (2021)
The Distributed Reading List
Launchpad - Library that simplifies writing distributed programs by seamlessly launching them on a variety of different platforms.
The Problem of Distributed Consensus (2021)
A robust distributed locking algorithm based on Google Cloud Storage (2021)
Sealer - Build, share, and run your distributed applications.
Scalability - Guides, Articles, Podcasts, Videos and Notes to Build Reliable Large-Scale Distributed Systems.
Building a Raft (2021)
Time, clocks, and order. (2020) - Look at the notion of time in a distributed system, and its effects on ordering.
The Generals (2020) - Look at the Two Generals' and Byzantine Generals' problem, two popular consensus problems.
Impossibility of Distributed Consensus with One Faulty Process (2020)
The CAP Theorem (2020)
Metastability and Distributed Systems (2021)
Distributed Systems Course (2021) (Tweet)
Metastable Failures in Distributed Systems (2021)
Distributed Systems Engineering Course Notes (2015)
Emitter - High performance, distributed and low latency publish-subscribe platform. (Web)
Patterns of Distributed Systems: Lamport Clock (2021)
Make your cluster SWIM (2020)
Systemizer - Tool for designing complex distributed systems, allowing you to simulate data flow with customizable components. (Web)
Patterns of Distributed Systems: Follower Reads (2021)
Getting To Know Logical Clocks By Implementing Them (2021)
Paxos vs Raft: Have we reached consensus on distributed consensus? (2021) (HN)
Consistency and Consensus – How Do Paxos and Raft Work? (2021)
Summer Blog Backlog: Distributed Systems (2021)
Fanouts and Percentiles (2020)
Distributed Tracing — we’ve been doing it wrong (2019)
How To Design A Reliable Distributed Timer (2021)
raft-engine - WAL-is-data engine used to store multi-raft logs.
Three Clocks are Better than One
RAMP up your distributed transactions (2021)
Errors found in distributed protocols
Python for Distributed Systems (2021)
FastPay - High-Performance Byzantine Fault Tolerant Settlement.
Distributed consensus made simple (for real this time!) (2021)
Hints and Principles for Computer System Design (2021) (HN)
Guide To Prepare for the Gremlin Certified Chaos Engineering Practitioner Exam
Balsam - High throughput workflows and automation for HPC.
Hypercore - Secure, distributed append-only log.
Hypercore Next - Append only log with multi-writer primitives built in.
"Waterpark: Distributed Actors vs the Pandemic" by Bryan Hunter (2021) - Building reliable, actor-based systems.
P language - Modular and Safe Programming for Distributed Systems. (Docs) (Tweet)
Raft Consensus Protocol (HN)
Paper review: Scaling Large Production Clusters with Partitioned Synchronization (2021)
MadSim - Magical Deterministic Simulator for distributed systems in Rust.
Deep dive into Yrs architecture (2021)
fantoch - Framework for evaluating (planet-scale) consensus protocols.
MultiPaxos made Simple (2021)
Paxos made Abstract (2021)
Unbase - Distributed database/application framework that is fundamentally reactive, fault tolerant, and decentralized.
Beating the CAP Theorem Checklist
Paper review: Paxos vs Raft
Shardz (2021) (HN)
microcosm - Prototype of distributed task scheduler.
Canary - Distributed systems library for making communications through the network easier, while keeping minimalism and flexibility. (Code)
Components Contrib - Community driven, reusable components for distributed apps in Go.
Paxos explained
Consistency Models Explained (2021)
Fault - Modeling language for building system dynamic models and checking them using a combination of first order logic and probability.
Events, Event Sourcing, and the Path Forward (2022)
How to make distributed system available (2022)
Best resources to learn about data and distributed systems (2022)
Lock-Free Locks Revisited (2022) (Lobsters) (HN)
ljepsen - Framework for distributed systems verification, with fault injection.
NATS.io - Cloud Native, Open Source, High-performance Messaging. (Code) (NATS 2.0 and Connectivity)
RustDDS - Rust implementation of Data Distribution Service.
Evolving clock sync for distributed databases (2022) (HN)
Ask HN: Do you find working on large distributed systems exhausting? (2022)
Life Beyond Distributed Transactions / Space-efficient Static Trees and Graphs (Video Overview)
Delicate - Lightweight and distributed task scheduling platform written in Rust.
Practical Byzantine Fault Tolerance (PBFT) algorithm in Go
chaosd - Chaos Engineering toolkit.
dcache - CoreDNS Plugin: Asynchronous Distributed Cache for Distributed System.
Consensus that unifies Paxos, Raft, 2PC, etc.
Your computer is a distributed system (HN)
Consul at Fly.io (2022) (Lobsters) (HN)
MatrixCube - Fundamental Building Block for Elastic Storage With Strong Consistency and Reliability.
A Brief History of High Availability (2021) (HN)
Artillery - Fire-forged cluster management & Distributed data protocol.
Principles of Distributed Computing (lecture collection)
Distributed Systems Shibboleths (2022) (HN)
minicache - Distributed cache with client-side consistent hashing, distributed leader-elections, and dynamic node discovery. Supports both HTTP/gRPC interfaces secured with mTLS.
Sprinkle - Run jobs on distributed machines easily.
Fallacies of distributed systems (2022) (HN)
Distributed systems for fun and profit (Code)
Bistro - Fast, flexible toolkit for scheduling and running distributed tasks.
A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers (2019)
Federation vs. Clustering: Self-determination vs. distributed computing? (2022)
Ask HN: Why are distributed systems so polarizing? (2022)
Surviving Continuous Deployment in Distributed Systems (2021)
Raft Consensus Animated (HN)
