I Built a Game to Understand Fly.io's Orchestrator: flyd Operator Sim!

#webdev #gamedev #learning #fly

I've recently been deep diving into Fly.io's infrastructure, particularly their flyd orchestration server and the superfly/fsm library that powers its stateful operations. To truly grasp the operational challenges, I built an interactive simulation game: flyd Operator Sim.

Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim

🤔 Why Build a Simulation?

Fly.io's platform is impressive. Their blog posts and public infra-log show how much complexity flyd has to handle, and the superfly/fsm library shows how much weight they put on robust state management.

I wanted to explore:

What kind of incidents can actually occur on a worker node running flyd?
How does an operator diagnose and respond to these issues?
What's the impact of different actions on system health and application uptime?
How do Finite State Machines (FSMs) help manage complex operations like machine migrations, even if they're abstracted away from the operator in a crisis?
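To make the FSM question a bit more concrete, here is a minimal TypeScript sketch of a migration modeled as a state machine. The state names and allowed transitions are my own illustrative assumptions; they are not taken from flyd, and this does not use the superfly/fsm API (which is a Go library).

```typescript
// Hypothetical migration states, invented for illustration only.
type MigrationState =
  | "pending"
  | "draining"
  | "copying"
  | "starting"
  | "done"
  | "failed";

// Each state lists the states it is allowed to move to (assumed, not flyd's real graph).
const transitions: Record<MigrationState, readonly MigrationState[]> = {
  pending: ["draining", "failed"],
  draining: ["copying", "failed"],
  copying: ["starting", "failed"],
  starting: ["done", "failed"],
  done: [],
  failed: [],
};

// True when moving from `from` to `to` is listed in the transition table.
function canTransition(from: MigrationState, to: MigrationState): boolean {
  return transitions[from].indexOf(to) !== -1;
}

// Returns the new state, or throws when the move is not allowed.
function transition(from: MigrationState, to: MigrationState): MigrationState {
  if (!canTransition(from, to)) {
    throw new Error(`invalid transition: ${from} -> ${to}`);
  }
  return to;
}
```

The point of the table is that an illegal jump (say, from "done" back to "pending") fails loudly instead of silently corrupting state, which is roughly why a forced FSM override feels risky in the game.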

Building a sim felt like the best way to learn.

✨ Introducing: flyd Operator Sim!

In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your goal:

Monitor worker health (CPU, memory, flyd status).
Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
Maintain Uptime over a simulated period.
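As a rough sketch of what the loop above tracks, here is one way to model a worker and compute uptime in TypeScript. Every name here (WorkerNode, IncidentKind, isUp, uptimePercent) and the rule for what counts as "up" are assumptions for illustration, not the game's actual source.

```typescript
// Hypothetical model; names and rules are illustrative assumptions.
type FlydStatus = "running" | "stalled" | "restarting";

type IncidentKind =
  | "flyd-stall"
  | "containerd-desync"
  | "network-partition"
  | "storage-corruption";

interface WorkerNode {
  id: string;
  cpuPercent: number;
  memoryPercent: number;
  flydStatus: FlydStatus;
  activeIncidents: IncidentKind[];
}

// Assumed rule: a worker is "up" when flyd is running and no incident is open.
function isUp(worker: WorkerNode): boolean {
  return worker.flydStatus === "running" && worker.activeIncidents.length === 0;
}

// Uptime as the percentage of sampled (worker, tick) pairs that were up.
function uptimePercent(samples: boolean[]): number {
  if (samples.length === 0) return 100; // nothing sampled yet, nothing down
  const upCount = samples.filter(Boolean).length;
  return (upCount / samples.length) * 100;
}
```

Sampling isUp for each worker on every tick and feeding the results to uptimePercent gives a single number to defend over the simulated period.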

Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
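The progression rule can be sketched in a few lines of TypeScript. The constants come from the description above; the function name and the assumption that a new worker arrives at the start of each day are mine.

```typescript
const TOTAL_DAYS = 7;
const MAX_WORKERS = 4;
const DAY_LENGTH_MINUTES = 5; // approximate real-time length of one simulated day

// Number of workers under your care on a given day (1-based).
// Assumes one worker on day 1 and one more at the start of each later day, capped at four.
function workersOnDay(day: number): number {
  if (Math.floor(day) !== day || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError(`day must be an integer from 1 to ${TOTAL_DAYS}`);
  }
  return Math.min(day, MAX_WORKERS);
}

// Approximate real-time length of a full run, in minutes.
const totalRunMinutes = TOTAL_DAYS * DAY_LENGTH_MINUTES;
```

Under these assumptions, the fleet reaches its full size on day 4, and a complete run takes roughly 35 minutes.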

🎓 What I Learned

Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd and why a solid FSM library like superfly/fsm is essential, especially after simulating containerd desync incidents.
Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.

🤓 Tech Stack

Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (shadcn), and React Context.

💭 Try It & Share Your Thoughts!

This was a personal learning project, but I hope others find it useful or fun.

Play: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim (give it a 🌟!)

What incidents should I add next? How can it be a better learning tool? Let me know!

Thanks for reading 💖
