I've recently been deep diving into Fly.io's infrastructure, particularly their flyd orchestration server and the superfly/fsm library that powers its stateful operations. To truly grasp the operational challenges, I built an interactive simulation game: flyd Operator Sim.
Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim
🤔 Why Build a Simulation?
Fly.io's platform is impressive. Reading their insightful blog posts and their public infra-log revealed the complexities of flyd. The superfly/fsm library also highlighted their focus on robust state management.
I wanted to explore:
What kind of incidents can actually occur on a worker node running flyd?
How does an operator diagnose and respond to these issues?
What's the impact of different actions on system health and application uptime?
How do Finite State Machines (FSMs) play a role in managing complex operations like machine migrations, even if they're abstracted away from the operator in a crisis?
Building a sim felt like the best way to learn.
✨ Introducing: flyd Operator Sim!
In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your goal:
- Monitor worker health (CPU, memory, flyd status).
- Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
- Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
- Maintain Uptime over a simulated period.
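To make the monitoring loop a bit more concrete, here is a minimal TypeScript sketch of how a worker and its incidents could be modeled. Every name in it (WorkerNode, IncidentKind, needsAttention, triage) and the 90% pressure threshold are my own assumptions for illustration; they are not taken from the sim's source or from flyd itself.

```typescript
// Hypothetical model: names and thresholds are illustrative assumptions.
type IncidentKind =
  | "flyd_stall"
  | "containerd_sync"
  | "network_partition"
  | "storage_corruption";

interface WorkerNode {
  id: string;
  cpu: number; // utilization as a fraction, 0..1
  memory: number; // utilization as a fraction, 0..1
  flydRunning: boolean;
  incidents: IncidentKind[];
}

// Assumed threshold above which CPU or memory counts as "under pressure".
const PRESSURE_THRESHOLD = 0.9;

// A worker needs attention if flyd is down, an incident is open,
// or either resource is above the assumed threshold.
function needsAttention(w: WorkerNode): boolean {
  return (
    !w.flydRunning ||
    w.incidents.length > 0 ||
    w.cpu > PRESSURE_THRESHOLD ||
    w.memory > PRESSURE_THRESHOLD
  );
}

// Return only the workers that need attention, most open incidents first.
function triage(workers: WorkerNode[]): WorkerNode[] {
  return workers
    .filter(needsAttention)
    .sort((a, b) => b.incidents.length - a.incidents.length);
}
```

Sorting by open incident count is only one possible priority rule; a real operator would likely weigh incident severity as well.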
Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
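The progression rule is simple enough to write down. The sketch below assumes the first extra worker arrives at the start of day 2 and treats "about 5 minutes" as exactly 5; the constant and function names are hypothetical and not taken from the sim's code.

```typescript
// Hypothetical constants mirroring the rules described above.
const TOTAL_DAYS = 7;
const MAX_WORKERS = 4;
const REAL_MINUTES_PER_DAY = 5; // "about 5 minutes", rounded for the sketch

// Number of workers you manage on a given simulated day (1-based).
// Assumes one worker on day 1 and one more each day until the cap.
function workersOnDay(day: number): number {
  if (!Number.isInteger(day) || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError(`day must be an integer in 1..${TOTAL_DAYS}`);
  }
  return Math.min(day, MAX_WORKERS);
}

// Approximate real-time length of a full run, in minutes.
function totalRealMinutes(): number {
  return TOTAL_DAYS * REAL_MINUTES_PER_DAY;
}

// Worker count for each of the seven days: [1, 2, 3, 4, 4, 4, 4]
const workerSchedule = Array.from({ length: TOTAL_DAYS }, (_, i) =>
  workersOnDay(i + 1)
);
```

Under these assumptions you reach the four-worker cap on day 4, and a full run takes roughly 35 minutes of real time.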
🎓 What I Learned
- Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
- State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd and why a solid FSM library like superfly/fsm is essential, especially seeing potential containerd desync issues.
- Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
- Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.
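To illustrate the state-management point and the quick-fix trade-off, here is a toy finite state machine for a machine migration. This is not the superfly/fsm API and does not reflect how flyd models migrations; the state names, the transition table, and the forceState override are all assumptions made up for this sketch.

```typescript
// Hypothetical migration states; not taken from flyd or superfly/fsm.
type MigrationState =
  | "pending"
  | "stopping"
  | "copying"
  | "starting"
  | "done"
  | "failed";

// Allowed transitions: each state lists the states it may move to.
const allowedTransitions: Record<MigrationState, readonly MigrationState[]> = {
  pending: ["stopping", "failed"],
  stopping: ["copying", "failed"],
  copying: ["starting", "failed"],
  starting: ["done", "failed"],
  done: [],
  failed: [],
};

class MigrationFsm {
  private current: MigrationState = "pending";
  // Every state the machine has been in, in order.
  readonly visited: MigrationState[] = ["pending"];

  get state(): MigrationState {
    return this.current;
  }

  // Safe path: only moves allowed by the table succeed.
  advance(next: MigrationState): void {
    if (allowedTransitions[this.current].indexOf(next) === -1) {
      throw new Error(`illegal transition: ${this.current} -> ${next}`);
    }
    this.current = next;
    this.visited.push(next);
  }

  // Risky path: bypasses the table, like an operator override.
  // The recorded state may no longer match what is actually running.
  forceState(next: MigrationState): void {
    this.current = next;
    this.visited.push(next);
  }
}
```

The contrast between advance and forceState is the whole point: the table keeps recorded state honest, while an override is faster but can leave the record out of step with reality, which is the kind of desync the game asks you to untangle.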
🤓 Tech Stack
Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (shadcn), and React Context.
💠Try It & Share Your Thoughts!
This was a personal learning project, but I hope others find it useful or fun.
- Play: https://flydsim.wsoltani.com/
- Repo: https://github.com/wSoltani/flyd-operator-sim (give it a 🌟!)
What incidents should I add next? How can it be a better learning tool? Let me know!
Thanks for reading 💖
Top comments (6)
Love how you turned complex infra ops into an interactive sim - I feel like more platforms need stuff like this for onboarding.
Have you thought about adding cascading failures or partial network outages to make it even closer to real-life chaos?
That sounds like the perfect next step!
I'd love to keep adding incident types and gameplay mechanics that can potentially help teach more infra concepts and mirror real-life chaos.
It would be awesome if more people decide to jump in and help improve the sim!
This project really flies above and beyond! I had a bug-tastic time simulating those incidents. 🪰
Not the fly emoji 😂
This is honestly genius, props for building something to actually see how it works in practice.
Thanks for the comment! Would love to see more projects like this. I think it makes learning quicker, easier and a lot more fun!