Wassim Soltani
I Built a Game to Understand Fly.io's Orchestrator: flyd Operator Sim!

I've recently been diving deep into Fly.io's infrastructure, particularly the flyd orchestration server and the superfly/fsm library that powers its stateful operations. To get a real feel for the operational challenges, I built an interactive simulation game: flyd Operator Sim.

Play it here: https://flydsim.wsoltani.com/
Repo: https://github.com/wSoltani/flyd-operator-sim


🤔 Why Build a Simulation?

Fly.io's platform is impressive. Reading their insightful blog posts and their public infra-log revealed the complexities of flyd. The superfly/fsm library also highlighted their focus on robust state management.

I wanted to explore:

  • What kind of incidents can actually occur on a worker node running flyd?

  • How does an operator diagnose and respond to these issues?

  • What's the impact of different actions on system health and application uptime?

  • What role do Finite State Machines (FSMs) play in managing complex operations like machine migrations, even if they're abstracted away from the operator in a crisis?

Building a sim felt like the best way to learn.


✨ Introducing: flyd Operator Sim!

In flyd Operator Sim, you're an on-call engineer for a Fly.io region. Your job:

  • Monitor worker health (CPU, memory, flyd status).
  • Respond to incidents like flyd stalls, containerd sync issues, network partitions, and storage corruption (many inspired by the infra-log).
  • Act using tools like flyd restarts, worker drains, log inspection, and (risky!) FSM overrides.
  • Maintain uptime over a simulated period.
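
To make "uptime" concrete, here is a minimal TypeScript sketch of one way it could be scored: healthy worker-ticks divided by total worker-ticks. The `TickSample` shape and `uptimePercent` function are hypothetical names for illustration and are not taken from the game's source.

```typescript
// Hypothetical per-tick sample: how many workers exist and how many are serving apps.
// These names are assumptions for illustration, not the game's actual data model.
interface TickSample {
  totalWorkers: number;
  healthyWorkers: number;
}

// Uptime as a percentage: healthy worker-ticks over total worker-ticks.
function uptimePercent(samples: TickSample[]): number {
  let total = 0;
  let healthy = 0;
  for (const s of samples) {
    total += s.totalWorkers;
    healthy += s.healthyWorkers;
  }
  // With no samples there has been no downtime yet, so report 100.
  return total === 0 ? 100 : (healthy / total) * 100;
}

// Two workers over four ticks, with one worker unhealthy during the second tick.
const samples: TickSample[] = [
  { totalWorkers: 2, healthyWorkers: 2 },
  { totalWorkers: 2, healthyWorkers: 1 },
  { totalWorkers: 2, healthyWorkers: 2 },
  { totalWorkers: 2, healthyWorkers: 2 },
];
const uptime = uptimePercent(samples); // 7 healthy of 8 worker-ticks
```

Weighting by worker-ticks means an outage costs more once more workers are in play, which fits the idea of growing responsibility, but the real game may score uptime differently.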

Game Objective & Progression:
Your main goal is to maintain high application uptime across your workers for 7 simulated days. Each day lasts about 5 minutes in real time. To make things more interesting, you start with one worker, and an additional worker is added each day, up to a maximum of four, increasing your responsibilities and potential points of failure!
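
The progression rules above can be sketched in a few lines of TypeScript. The constant and function names are hypothetical, and the sketch assumes the second worker arrives at the start of day 2; the game's actual timing may differ.

```typescript
// Hypothetical constants mirroring the rules described in the post.
const TOTAL_DAYS = 7;
const MAX_WORKERS = 4;
const MINUTES_PER_DAY = 5; // approximate real-time length of one simulated day

// Day 1 starts with one worker; one more is added each day, capped at MAX_WORKERS.
// Assumption: the extra worker appears at the start of the following day.
function workersOnDay(day: number): number {
  if (!Number.isInteger(day) || day < 1 || day > TOTAL_DAYS) {
    throw new RangeError(`day must be an integer between 1 and ${TOTAL_DAYS}`);
  }
  return Math.min(day, MAX_WORKERS);
}

// Worker count for each of the seven days, plus the approximate session length.
const schedule = Array.from({ length: TOTAL_DAYS }, (_, i) => workersOnDay(i + 1));
const totalMinutes = TOTAL_DAYS * MINUTES_PER_DAY;
```

Under these assumptions a full run takes roughly 35 minutes of real time, with the worker count reaching its cap on day 4.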


🎓 What I Learned

  1. Orchestration is Complex: Simulating even a part of it showed me the immense challenge of managing global infrastructure.
  2. State Management is Crucial (and complicated): The game reinforced how vital accurate state is for flyd, and why a solid FSM library like superfly/fsm is essential. Simulating containerd desync issues, where recorded state and actual state can drift apart, made this especially clear.
  3. Observability is Non-Negotiable: Good metrics and logs (which the game simulates access to) are critical for diagnosing issues, a theme evident in Fly.io's own infra-log.
  4. Operational Trade-offs: The sim touches on the pressure of quick fixes versus safer, slower solutions.
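
To illustrate the FSM idea from point 2, here is a minimal TypeScript sketch of a state machine for a machine migration. It is not the superfly/fsm API (that library is written in Go), and the state names, transition table, and `MigrationFsm` class are all hypothetical. The point is only that an FSM rejects illegal jumps, while a manual override bypasses that safety.

```typescript
// Hypothetical migration states; not taken from flyd or superfly/fsm.
type MigrationState = "pending" | "draining" | "copying" | "starting" | "done" | "failed";

// Allowed transitions out of each state. Anything not listed is rejected.
const transitions: Record<MigrationState, MigrationState[]> = {
  pending: ["draining", "failed"],
  draining: ["copying", "failed"],
  copying: ["starting", "failed"],
  starting: ["done", "failed"],
  done: [],
  failed: ["pending"], // a failed migration may be retried from the start
};

class MigrationFsm {
  private current: MigrationState = "pending";

  get state(): MigrationState {
    return this.current;
  }

  canTransition(next: MigrationState): boolean {
    return transitions[this.current].includes(next);
  }

  // Normal path: only moves along edges listed in the transition table.
  transition(next: MigrationState): void {
    if (!this.canTransition(next)) {
      throw new Error(`illegal transition: ${this.current} -> ${next}`);
    }
    this.current = next;
  }

  // Risky path, loosely analogous to the game's FSM override: skips validation.
  forceState(next: MigrationState): void {
    this.current = next;
  }
}

const fsm = new MigrationFsm();
fsm.transition("draining");              // allowed: pending -> draining
const canSkip = fsm.canTransition("done"); // false: draining cannot jump to done
```

Calling `fsm.transition("done")` at this point would throw, whereas `fsm.forceState("done")` would succeed without any work having been performed, which is why an override is a quick but dangerous fix.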

🤓 Tech Stack

Built with: Next.js, TypeScript, Tailwind CSS, Radix UI (shadcn), and React Context.


💭 Try It & Share Your Thoughts!

This was a personal learning project, but I hope others find it useful or fun.

What incidents should I add next? How can it be a better learning tool? Let me know!

Thanks for reading 💖

Top comments (6)

Dotallio •

Love how you turned complex infra ops into an interactive sim - I feel like more platforms need stuff like this for onboarding.

Have you thought about adding cascading failures or partial network outages to make it even closer to real-life chaos?

Wassim Soltani •

That sounds like the perfect next step!

I'd love to keep adding incident types and gameplay mechanics that can potentially help teach more infra concepts and mirror real-life chaos.

It would be awesome if more people decide to jump in and help improve the sim!

Fraser Young •

This project really flies above and beyond! I had a bug-tastic time simulating those incidents. 🪰

Wassim Soltani •

Not the fly emoji 😂

Nathan Tarbert •

This is honestly genius, props for building something to actually see how it works in practice.

Wassim Soltani •

Thanks for the comment! Would love to see more projects like this. I think it makes learning quicker, easier and a lot more fun!
