"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson
Strange Loop Conference
40 min, 20 sec
The talk discusses the benefits, challenges, and techniques of simulation testing in distributed systems.
Summary
- Will, from FoundationDB, discusses the difficulty of debugging distributed systems and introduces simulation testing as a solution.
- He explains how FoundationDB first wrote a deterministic simulation of the database, so that the design could be debugged exhaustively before the real implementation was built.
- The talk covers the software engineering challenges faced and the strategies employed to address them, including creating a custom simulation framework called 'Flow'.
- Simulation testing makes it possible to artificially raise failure rates and to manipulate the Hurst exponent (how strongly failures cluster in time), so it explores more potential bugs faster than real-world testing.
- Despite the effectiveness of simulation testing, testing on real hardware (through 'sinkhole') is still necessary to catch bugs that the simulation cannot reproduce.
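As a rough illustration of the failure-rate point above, the sketch below generates failure traces from a seeded random source and makes a failure more likely right after another failure. This is only a crude stand-in for "raising the Hurst exponent": it does not compute or set a real Hurst exponent, and the function names and probabilities are assumptions chosen for the example, not anything from the talk.

```python
import random

def failure_trace(seed, steps, base_p, after_failure_p):
    """Return a list of booleans: did a failure occur at each step?

    after_failure_p is the failure probability on the step right after a
    failure; setting it equal to base_p gives independent failures.
    """
    rng = random.Random(seed)
    trace = []
    prev_failed = False
    for _ in range(steps):
        p = after_failure_p if prev_failed else base_p
        failed = rng.random() < p
        trace.append(failed)
        prev_failed = failed
    return trace

def back_to_back(trace):
    """Count steps where a failure immediately follows another failure."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a and b)

# Hypothetical parameters: same base rate, with and without clustering.
independent = failure_trace(seed=1, steps=100_000, base_p=0.01, after_failure_p=0.01)
clustered = failure_trace(seed=1, steps=100_000, base_p=0.01, after_failure_p=0.5)
print(back_to_back(independent), back_to_back(clustered))
```

With clustering turned on, consecutive failures become far more common, which is the kind of condition that tends to expose recovery bugs and that a real cluster would take a very long time to produce on its own.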
Chapter 1

Will introduces himself, FoundationDB, and the concept of simulation testing.
- Will works at FoundationDB, which provides a scalable and fault-tolerant database with ACID transactions.
- He introduces the topic of simulation testing and its potential to make debugging distributed systems easier.

Chapter 2

Will talks about the inherent difficulties in debugging distributed systems compared to simple systems.
- Debugging distributed systems is complicated and is only slightly preferable to painful experiences like sticking a fork in one's eye.
- Will acknowledges that distributed systems are complex, but argues that the real challenge is the non-determinism that networks introduce.

Chapter 3

Details on how simulation testing can address the debugging challenges in distributed systems.
- Will uses a simple packet transfer example between two servers to illustrate how random network conditions can lead to rare but critical bugs.
- The inability to repeat these conditions reliably in a real-world scenario highlights the need for a controlled testing environment.
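The packet-transfer point can be sketched with a toy protocol, assuming a lossy link and a receiver with a deliberate bug (it does not deduplicate retries). The protocol, the drop probability, and the function name are all hypothetical; the sketch only shows that when every random choice flows from a seed, a rare failure can be found and then replayed at will.

```python
import random

def transfer(seed, drop_p=0.05, max_tries=5):
    """Simulate one 'increment' request over a lossy link; return the counter.

    The receiver has a deliberate bug: it applies every copy of the request
    it receives, so a retry after a lost ack is applied twice.
    """
    rng = random.Random(seed)   # every random choice flows from the seed
    counter = 0
    for _ in range(max_tries):
        if rng.random() < drop_p:
            continue            # request lost; sender times out and retries
        counter += 1            # receiver applies the request (no dedup: bug)
        if rng.random() < drop_p:
            continue            # ack lost; sender retries an applied request
        break                   # ack received; sender stops
    return counter

# Search many seeds for the rare double-apply, then replay one of them.
failing = [s for s in range(2000) if transfer(s) > 1]
first = failing[0]
print("failing seed:", first, "counter:", transfer(first))
print("replay gives the same result:", transfer(first) == transfer(first))
```

On a real network the lost ack would depend on timing that cannot be recreated on demand; here the failing seed is a complete description of the bad run.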

Chapter 4

Will discusses the creation of a deterministic simulation environment for FoundationDB.
- FoundationDB started by writing a deterministic simulation of their database to debug exhaustively before actual implementation.
- The simulation is built with 'Flow', which makes it possible to simulate a network of communicating processes and their environment within a single physical process.
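A minimal sketch of the idea of simulating a network of processes inside one physical process, assuming a priority queue of pending deliveries and a virtual clock. `SimNetwork`, `ping`, and `pong` are hypothetical names for this example and are not part of Flow, which is a C++ framework.

```python
import heapq
import random

class SimNetwork:
    """Hypothetical single-threaded network simulator with a virtual clock."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.now = 0.0        # simulated time, never wall-clock time
        self.queue = []       # heap of (deliver_at, seq, dest, payload)
        self.seq = 0          # tie-breaker so ordering is fully defined
        self.handlers = {}    # process name -> callback(net, payload)
        self.log = []

    def add_process(self, name, handler):
        self.handlers[name] = handler

    def send(self, dest, payload):
        delay = self.rng.uniform(0.001, 0.1)   # simulated latency
        heapq.heappush(self.queue, (self.now + delay, self.seq, dest, payload))
        self.seq += 1

    def run(self):
        # Deliver messages in simulated-time order until none remain.
        while self.queue:
            self.now, _, dest, payload = heapq.heappop(self.queue)
            self.log.append((round(self.now, 6), dest, payload))
            self.handlers[dest](self, payload)
        return self.log

def ping(net, n):
    if n < 3:
        net.send("pong", n)

def pong(net, n):
    net.send("ping", n + 1)

def simulate(seed):
    net = SimNetwork(seed)
    net.add_process("ping", ping)
    net.add_process("pong", pong)
    net.send("ping", 0)
    return net.run()

for event in simulate(42):
    print(event)
```

Because there is one thread, one clock, and one seeded random source, two runs with the same seed produce identical logs.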

Chapter 5

The process of generating a simulated database and the software engineering challenges involved.
- FoundationDB's simulation creates virtual processes within a single physical process to avoid non-determinism.
- Flow allows writing actor-based concurrency in C++ using a syntactic extension that transforms actor definitions into callback-based code.
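The actor-to-callback transformation can be approximated in Python with generators, as an analogy only: Flow itself is a C++ extension, and nothing below reflects its actual syntax or generated code. In this sketch an actor is a generator that yields a future, and a small scheduler registers the rest of the actor as a callback on that future. `Future`, `spawn`, `run`, and `adder` are hypothetical names.

```python
from collections import deque

class Future:
    """Minimal single-assignment future with callback registration."""

    def __init__(self):
        self.done = False
        self.value = None
        self.callbacks = []

    def set(self, value):
        self.done, self.value = True, value
        for cb in self.callbacks:
            cb(value)

    def on_ready(self, cb):
        if self.done:
            cb(self.value)
        else:
            self.callbacks.append(cb)

ready = deque()   # run queue of (actor, value to resume it with)

def spawn(actor):
    ready.append((actor, None))

def run():
    """Run every runnable actor until it suspends or finishes."""
    while ready:
        actor, value = ready.popleft()
        try:
            fut = actor.send(value)   # runs until the next `yield future`
        except StopIteration:
            continue
        # The rest of the actor becomes a callback on the future.
        fut.on_ready(lambda v, a=actor: ready.append((a, v)))

def adder(a, b, out):
    x = yield a    # reads like blocking code, but only suspends this actor
    y = yield b
    out.set(x + y)

def demo():
    a, b, out = Future(), Future(), Future()
    spawn(adder(a, b, out))
    run()        # adder suspends, waiting on `a`
    a.set(2)     # the callback re-queues adder
    run()        # adder resumes, then suspends on `b`
    b.set(3)
    run()        # adder finishes and fulfils `out`
    return out.value

print(demo())  # 5
```

The actor body stays sequential and readable, while everything that executes is callbacks on a single thread, which is what keeps scheduling under the simulator's control.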

Chapter 6

Explains the detailed approaches and techniques used in simulation testing.
- Test files declare objectives and potential failure scenarios, including random clogging and network attrition.
- Various failure modes, including hardware issues and network disruptions, are simulated to expose bugs.
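A sketch of a declarative test with injected faults, assuming a spec that names a workload and two fault types loosely modelled on the clogging and attrition mentioned above. The spec fields, probabilities, and durations are invented for illustration and are not FoundationDB's test-file format.

```python
import random

# Hypothetical test spec; field names are illustrative only.
SPEC = {
    "workload": "send_messages",
    "faults": {"clog_p": 0.2, "kill_p": 0.05},
    "steps": 200,
}

def run_test(spec, seed):
    """Send one message per step while a fault injector clogs and kills."""
    rng = random.Random(seed)
    procs = ["a", "b", "c"]
    clogged_until = dict.fromkeys(procs, 0)   # step at which the link unclogs
    down_until = dict.fromkeys(procs, 0)      # step at which the process reboots
    stats = {"delivered": 0, "clogged": 0, "down": 0}
    for step in range(spec["steps"]):
        target = rng.choice(procs)
        # Random clogging: stall the target's link for a few steps.
        if rng.random() < spec["faults"]["clog_p"]:
            clogged_until[target] = step + rng.randint(1, 5)
        # Attrition: kill the target; it reboots some steps later.
        if rng.random() < spec["faults"]["kill_p"]:
            down_until[target] = step + rng.randint(1, 10)
        if step < down_until[target]:
            stats["down"] += 1
        elif step < clogged_until[target]:
            stats["clogged"] += 1
        else:
            stats["delivered"] += 1
    return stats

print(run_test(SPEC, seed=7))
```

A real test would pair the fault injector with a workload that checks an invariant of the database; here the only check available is that every message is accounted for.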

Chapter 7

Will highlights the crucial role of determinism in the simulation process.
- Determinism is essential to ensure that simulations are repeatable, with the same input leading to the same output.
- A small percentage of simulation runs are executed twice to ensure determinism.
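The run-it-twice check can be sketched as follows, assuming each simulation returns a list of events that can be hashed. The two toy simulations and the 10% recheck fraction are assumptions for the example; the talk only says that a small percentage of runs are repeated.

```python
import hashlib
import os
import random

def trace_digest(sim, seed):
    """Hash the full event trace of one simulation run."""
    h = hashlib.sha256()
    for event in sim(seed):
        h.update(repr(event).encode())
    return h.hexdigest()

def is_deterministic(sim, seed):
    """Run the same seed twice and compare the trace digests."""
    return trace_digest(sim, seed) == trace_digest(sim, seed)

def good_sim(seed):
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(10)]

def leaky_sim(seed):
    rng = random.Random(seed)
    events = [rng.randint(0, 99) for _ in range(10)]
    events.append(os.urandom(8).hex())   # entropy from outside the simulation
    return events

def run_batch(sim, seeds, recheck_fraction=0.1, picker_seed=0):
    """Re-run a random fraction of seeds; return those that diverged."""
    picker = random.Random(picker_seed)   # chooses which runs get repeated
    unstable = []
    for seed in seeds:
        if picker.random() < recheck_fraction and not is_deterministic(sim, seed):
            unstable.append(seed)
    return unstable

print(run_batch(good_sim, range(100)))          # []
print(len(run_batch(leaky_sim, range(100))))    # number of rechecked seeds
```

Any seed reported by `run_batch` means some source of randomness or ordering has leaked in from outside the simulator, which has to be fixed before failing seeds can be trusted to replay.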

Chapter 8

Discusses the debugging challenges that arise with simulation testing.
- Debugging code laced with callbacks within a simulation environment is difficult, leaving printf debugging as a primary tool.
- Deterministic simulation facilitates debugging by ensuring the same sequence of events upon reruns.

Chapter 9

Addresses the limitations of simulation testing and the need for real-world testing on hardware.
- Simulation cannot account for all real-world scenarios, particularly those involving other people's software or hardware-specific issues.
- FoundationDB uses a real hardware cluster, 'sinkhole', to test against power failures and hardware malfunctions.

Chapter 10

Will discusses the ongoing efforts to improve simulation testing and address its pitfalls.
- Will worries that relying on simulation trains programmers to write bugs that slip past the simulation tests, a dynamic he compares to antibiotic resistance.
- Potential solutions include having multiple simulation frameworks, more real-world testing, and additional hardware to reduce debugging cycle time.
