<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jose Soares</title>
    <description>The latest articles on Forem by Jose Soares (@jose_soares).</description>
    <link>https://forem.com/jose_soares</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842988%2Fdda451c9-7880-4c9c-a399-a77dccebe28a.JPG</url>
      <title>Forem: Jose Soares</title>
      <link>https://forem.com/jose_soares</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jose_soares"/>
    <language>en</language>
    <item>
      <title>We built an AI agent for DevOps engineers. They didn't want it.</title>
      <dc:creator>Jose Soares</dc:creator>
      <pubDate>Thu, 26 Mar 2026 12:46:21 +0000</pubDate>
      <link>https://forem.com/jose_soares/we-built-an-tool-for-devops-that-wasnt-useful-heres-what-we-are-building-instead-20gi</link>
      <guid>https://forem.com/jose_soares/we-built-an-tool-for-devops-that-wasnt-useful-heres-what-we-are-building-instead-20gi</guid>
      <description>&lt;p&gt;&lt;strong&gt;We spent months building an AI agent for Terraform. When we did user interviews with SREs and DevOps engineers, their reaction was pretty unanimous: "This looks great but not very useful to me."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This completely changed the trajectory of our startup.&lt;/p&gt;

&lt;p&gt;Here is the story of how we accidentally built the wrong product for the wrong people, and how it led us to build Grafos v2: a tool designed to help founders and SWEs survive the transition from MVP to production-grade.&lt;/p&gt;




&lt;h2&gt;It started as a visualisation tool&lt;/h2&gt;

&lt;p&gt;During a hackathon a few months ago, we built the foundational blocks of Grafos.ai. Initially, it was just an infrastructure visualisation tool.&lt;/p&gt;

&lt;p&gt;As a frontend engineer, I knew nothing about Terraform. I had to go through a hardcore introduction to IaC in just one week. By week two, we had built this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkl14m5k2urq3tyaw0gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkl14m5k2urq3tyaw0gi.png" alt="First version of our terraform visualisation" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was, let's say, "rough", but it had potential. As far as we knew, there were only a couple of decent infra visualisation tools around — Brainboard and Pluralith. The latter wasn't even maintained anymore, and users in its community were practically begging for updates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50vgb4f9xoydus3je8vl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50vgb4f9xoydus3je8vl.png" alt="Updated version of the visualiser" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But we knew a read-only diagram wasn't the end goal. Now that we understood how to visualise the infrastructure, how hard could it be to build a "Lovable for infra"? LLMs know how to write code, so plugging one in to let you edit the graph and translate those edits into Terraform shouldn't be too complicated.&lt;/p&gt;

&lt;p&gt;During a second hackathon, we built it. Then we productionised it and shipped Grafos v1. (You can read about the &lt;a href="https://dev.to/jose_soares/im-a-frontend-engineer-let-me-spin-up-a-scalable-gcp-backend-real-quick-4n6o"&gt;second hackathon here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofnmsxqe2fkhm43au6dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofnmsxqe2fkhm43au6dg.png" alt="The Lovable For Infra" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We thought we had built a superpower for DevOps, but when we talked to actual SREs, their natural skepticism kicked in. There will never be an AI that can claim to be a better SRE than an actual senior engineer. To them, our tool was just another abstraction layer to babysit. This forced us to ask who actually needs this.&lt;/p&gt;

&lt;p&gt;The answer was staring me in the face. Our ideal user wasn't an SRE. Our ideal user was a founding engineer like me.&lt;/p&gt;

&lt;p&gt;Think about the typical founding engineer today. You build a product, you host it on Vercel or Heroku, and you get some traction.&lt;/p&gt;

&lt;p&gt;You have a couple of hundred users, and they start complaining about downtime or speed dips. You know you need to migrate to a real GCP or AWS setup. You know of these clouds, but you have no idea what you actually need, let alone how to provision it safely. And you definitely cannot afford to hire a $150k/year infrastructure person yet. Your infra isn't good enough to scale, but you are stuck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd956dx21bhg01mwerwcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd956dx21bhg01mwerwcw.png" alt="Grafos v2 onboarding screen" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Grafos v2&lt;/h2&gt;

&lt;p&gt;This is why we threw out our original assumptions and started working on Grafos v2.&lt;/p&gt;

&lt;p&gt;From the learnings of our first iteration, we are building a product explicitly to help founders who don't have an SRE productionise their applications. Grafos v1 already knows how to analyze and decide what your application needs. Grafos v2 actually sets up that infrastructure for you based on those requirements.&lt;/p&gt;

&lt;p&gt;Because we know the dangers of AI hallucinating cloud infrastructure, we are building v2 with a very strict, opinionated philosophy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly deterministic: The system relies on hard logic for as much of the process as possible.&lt;/li&gt;
&lt;li&gt;LLMs in their lane: We only use LLMs to do what they are actually good at. Stuff like reading massive amounts of documentation and interpreting the user's plain-English intent.&lt;/li&gt;
&lt;li&gt;Fail fast to a human: Every step is transparent. If the agent isn't sure, it stops and asks the user, rather than guessing and breaking things.&lt;/li&gt;
&lt;/ul&gt;
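&lt;p&gt;As a rough illustration of that philosophy (a minimal sketch with made-up names like &lt;code&gt;classify_intent&lt;/code&gt;, not the actual Grafos internals), the routing boils down to deterministic guardrails around a single LLM call:&lt;/p&gt;

```python
# Hypothetical sketch of "fail fast to a human" routing; every name here
# is illustrative and not part of the real Grafos codebase.

KNOWN_INTENTS = {"add_resource", "remove_resource", "explain"}
CONFIDENCE_THRESHOLD = 0.8

def classify_intent(message: str) -> tuple[str, float]:
    # Stand-in for the LLM call: map plain-English intent to a known action.
    # A real implementation would return the model's label and confidence.
    if "add" in message.lower():
        return "add_resource", 0.95
    return "unknown", 0.2

def route(message: str) -> dict:
    intent, confidence = classify_intent(message)
    # Deterministic guardrails: only act on intents we explicitly support,
    # and hand back to the user instead of guessing when confidence is low.
    if intent not in KNOWN_INTENTS or confidence < CONFIDENCE_THRESHOLD:
        return {"action": "ask_user",
                "question": f"I'm not sure what you meant by {message!r}. Can you clarify?"}
    return {"action": intent}
```

&lt;p&gt;The point of the sketch: the LLM only ever maps text to a label, and anything outside the allowlist or below the confidence bar goes back to the human instead of being guessed at.&lt;/p&gt;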

&lt;p&gt;We aren't trying to replace infrastructure engineers. We are empowering founders to continue their journey, scale their apps, and survive until they reach the point where they can hand it over to a DevOps team.&lt;/p&gt;




&lt;p&gt;We're a team of 4 engineers currently deep in the trenches building this. If you are a founder or engineer dreading your infrastructure migration, we are opening up an early alpha for v2 soon to help us test it. Drop your email &lt;a href="https://grafos.ai/early-access/?utm_source=dev_to" rel="noopener noreferrer"&gt;here to get on the waitlist&lt;/a&gt;, or leave a comment below.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>devops</category>
      <category>startup</category>
      <category>terraform</category>
    </item>
    <item>
      <title>I’m a Frontend Engineer. Let me spin up a scalable GCP backend real quick.</title>
      <dc:creator>Jose Soares</dc:creator>
      <pubDate>Wed, 25 Mar 2026 11:24:12 +0000</pubDate>
      <link>https://forem.com/jose_soares/im-a-frontend-engineer-let-me-spin-up-a-scalable-gcp-backend-real-quick-4n6o</link>
      <guid>https://forem.com/jose_soares/im-a-frontend-engineer-let-me-spin-up-a-scalable-gcp-backend-real-quick-4n6o</guid>
      <description>&lt;p&gt;&lt;strong&gt;I'm a frontend engineer who had to build an AI backend, and later debug a collapsed GCP environment. Here is what those two weekends taught me about the context gap between code and infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq9zrd5tb46f04n1th0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq9zrd5tb46f04n1th0r.png" alt="Peran summiting mount Toubkal" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During a hackathon, our data engineer was near the summit of Mount Toubkal in Morocco (image above), and I was left alone to build an end-to-end AI backend. A few months later, when our CTO was away for the weekend, our staging environment collapsed.&lt;/p&gt;

&lt;p&gt;So naturally, I decided to spin up a scalable GCP backend and fix our infrastructure real quick.&lt;/p&gt;

&lt;p&gt;Okay, I didn’t build a distributed backend from scratch. But over those two weekends, I did build a working LLM agent from scratch, debug a cascade of GCP failures I’d never encountered before, untangle IAM permissions, and learn how to run production database migrations. A year ago, any one of those things would have taken me weeks, but look at me now.&lt;/p&gt;

&lt;h2&gt;The Hackathon Project&lt;/h2&gt;

&lt;p&gt;We’d just shipped the MVP of Grafos.ai as an infrastructure visualiser - a clean, interactive graph of your Terraform. Looking at it, I had a thought that felt obvious: we had a beautiful way to see infrastructure. How hard could it be to add a chat interface that allowed users to change it?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fominzos9jmfym1w55s7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fominzos9jmfym1w55s7h.png" alt="Infrastructure visualiser" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The barrier to building something like this isn’t the code. It’s the time it takes to acquire the context. I’d never touched the Gemini API or written an LLM agent. I barely knew how our own FastAPI backend was wired up: which endpoints existed, how authentication worked, or how the Terraform data was stored and accessed. Under normal circumstances, that’s a week of reading documentation before writing a single line that does anything useful.&lt;/p&gt;

&lt;p&gt;So before I opened my editor, I opened Gemini and asked for a crash course. Not a general “explain LLMs to me”, I needed a specific, dense, 60-minute conversation on the Gemini API, context management, intent classification, and retry logic. The kind of briefing you’d get from a senior engineer who had 45 minutes before a flight. By the time I opened Cursor, I had enough of a mental model to know what to build.&lt;/p&gt;

&lt;p&gt;Then Cursor did something I still find uncanny. I pointed it at our FastAPI backend and started describing what I needed: an endpoint that takes a user message, loads the right Terraform context, classifies the intent, and returns a response. Cursor already knew how our authentication middleware worked, where our Terraform data lived, and how our existing endpoints were structured. I’d describe the logic, the conventions to follow, the existing code to draw from, and it’d write the implementation. I spent my two days on the design decisions, not fighting the boilerplate.&lt;/p&gt;
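&lt;p&gt;For readers who want the shape of that endpoint, here is a heavily simplified sketch. Every name is illustrative; the real backend stores Terraform differently and calls Gemini where the stubs sit:&lt;/p&gt;

```python
# Illustrative shape of the hackathon endpoint, not the real Grafos code.

def load_terraform_context(project_id: str) -> dict:
    # Stub: stands in for reading the user's parsed Terraform from storage.
    return {"project": project_id, "resources": ["google_cloud_run_service.api"]}

def classify_intent(message: str) -> str:
    # Stub: stands in for the LLM intent-classification call.
    return "question" if message.rstrip().endswith("?") else "change_request"

def handle_chat(message: str, project_id: str) -> dict:
    context = load_terraform_context(project_id)
    intent = classify_intent(message)
    # The real endpoint would now prompt the model with the context and
    # either answer the question or draft a Terraform change.
    return {"intent": intent, "context_resources": context["resources"]}
```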

&lt;p&gt;I gave the UI to Lovable. Ten minutes for a chat interface I’d have spent an hour on myself. With two days on the clock, an hour was too expensive to spend on something I already knew how to build. Not worth the pride.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jcmdin10sy1igdohbm0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jcmdin10sy1igdohbm0.gif" alt="Grafos AI interface" width="600" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the end of the second day, a user could ask a question about their Terraform, get a sensible answer, and request a PR. The agent had no memory between messages, and the JSON parsing was so brittle that a malformed response could break the whole flow. When Peran came back from the mountain and looked at the code, he was not delighted. His post on &lt;a href="https://grafosai.substack.com/p/productionising-our-ai-agent-why" rel="noopener noreferrer"&gt;making it production ready&lt;/a&gt; is well worth reading if you want to know how he improved this.&lt;/p&gt;
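&lt;p&gt;One common hardening pattern for that brittle JSON step (a generic sketch, not necessarily what Peran shipped) is to extract the JSON span defensively and fail soft instead of crashing the whole flow:&lt;/p&gt;

```python
import json

def parse_agent_reply(raw: str) -> dict:
    # LLMs often wrap JSON in prose or markdown fences; extract the first
    # {...} span before parsing, and fall back to a safe default on failure.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {"ok": False, "error": "no JSON object found"}
    try:
        return {"ok": True, "data": json.loads(raw[start:end + 1])}
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": str(exc)}
```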

&lt;p&gt;But it worked, and it took two days instead of two weeks. That gap comes down to how fast I was able to filter through research and documentation for the specific problem in front of me, without having to wade through everything that wasn’t relevant.&lt;/p&gt;

&lt;h2&gt;The Infrastructure Outage&lt;/h2&gt;

&lt;p&gt;A few months later, our CTO was away for the weekend and our staging environment collapsed. Fresh off the hackathon, I felt I could do it. Cursor had made me feel like I could squash any bug. So I pulled up GCP and started debugging.&lt;/p&gt;

&lt;p&gt;It was painful in a way the hackathon hadn’t been. Not because the problems were harder in isolation, but because I had no map. Every LLM was happy to help but none of them knew what our infrastructure actually looked like. Cursor had our codebase indexed. It didn’t have our GCP project, our IAM roles, our Terraform state, or any concept of what would break if I changed an Ingress setting.&lt;/p&gt;

&lt;p&gt;What followed was about six hours of whack-a-mole, where fixing one problem would immediately surface the next one hiding behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first problem was that I couldn’t see any problems.&lt;/strong&gt; Our FastAPI service was returning 500 errors, but GCP Logs showed nothing. Just a clean HTTP request log with a status code and silence. It took a conversation with Gemini to understand why: a try/except block in our middleware was catching every unexpected exception and converting it into a tidy HTTPException. FastAPI received a neat error response, decided the developer had handled it, and logged nothing. GCP’s Error Reporting was completely blind. The fix was one line, but finding that it was even needed took an hour.&lt;/p&gt;
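&lt;p&gt;A stand-alone sketch of that failure mode, with a stub &lt;code&gt;HTTPException&lt;/code&gt; in place of FastAPI's so it runs anywhere:&lt;/p&gt;

```python
import logging

logger = logging.getLogger("api")

class HTTPException(Exception):
    # Minimal stand-in for fastapi.HTTPException, to keep the sketch self-contained.
    def __init__(self, status_code: int, detail: str):
        self.status_code, self.detail = status_code, detail

def handle(request_fn):
    try:
        return request_fn()
    except HTTPException:
        raise  # deliberate errors pass through untouched
    except Exception:
        # The one-line fix: record the real traceback BEFORE converting the
        # error into a tidy HTTPException. Without this line, the framework
        # sees a "handled" error and Error Reporting stays blind.
        logger.exception("unhandled error")
        raise HTTPException(500, "Internal Server Error")
```

&lt;p&gt;Without the &lt;code&gt;logger.exception&lt;/code&gt; line, every crash arrives at the framework pre-wrapped and looks deliberate, which is exactly why Error Reporting saw nothing.&lt;/p&gt;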

&lt;p&gt;&lt;strong&gt;Once I could see errors, I hit the second problem: one of our services wasn’t responding.&lt;/strong&gt; The traceback pointed to a &lt;code&gt;ConnectionRefusedError&lt;/code&gt; trying to reach &lt;code&gt;localhost:80&lt;/code&gt;. After working through the networking layer with Gemini, I eventually found it: the GitHub Action that built and deployed that service was running &lt;code&gt;docker build&lt;/code&gt; from the root of the monorepo instead of inside the service folder. It had been deploying the wrong Docker image to the wrong service, probably for longer than I’d realised.&lt;/p&gt;

&lt;p&gt;Fixing the Docker context fixed the build. &lt;strong&gt;But now our main app was getting a 404 back from the service, and the GCP logs were empty.&lt;/strong&gt; Not a single entry. Google’s load balancer was silently swallowing requests before they ever touched the container, because the service’s Cloud Run Ingress was set to Internal. From the outside it looked like a dead service, but from the inside, nothing had happened. No logs, no errors, no trace of any request ever arriving. I only figured it out by noticing a column in the Cloud Run dashboard that said “Internal” next to the service name, compared it to our production service set to “All”, and changed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now I was getting a 403 instead of a 404.&lt;/strong&gt; This was actually progress: the request was reaching the service, but being rejected because our main app had no authentication token. I had to write Python logic to fetch a Google Identity Token and grant the Cloud Run Invoker role to our service account. That got the services talking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then the app crashed because a database column didn’t exist.&lt;/strong&gt; Running the migrations job appeared to succeed, green checkmark and all, but the logs showed it had connected to &lt;code&gt;localhost:5432&lt;/code&gt; the entire time. The job’s environment variables had never been configured to point at the actual SQL instance. I ran migrations against the right database, and after a full day of debugging, the staging environment was back up.&lt;/p&gt;
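&lt;p&gt;The general lesson from that last one: a job should refuse to run with incomplete configuration rather than silently fall back to defaults like &lt;code&gt;localhost&lt;/code&gt;. A minimal sketch (the variable names are hypothetical, not our actual env schema):&lt;/p&gt;

```python
import os

REQUIRED = ("DATABASE_HOST", "DATABASE_NAME", "DATABASE_USER")

def database_url(env=os.environ) -> str:
    # Fail fast instead of defaulting to localhost: a migrations job with no
    # configuration should crash loudly, not "succeed" against the wrong database.
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"migration job misconfigured, missing: {missing}")
    return f"postgresql://{env['DATABASE_USER']}@{env['DATABASE_HOST']}/{env['DATABASE_NAME']}"
```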

&lt;p&gt;What struck me was how differently the two weekends had felt. The hackathon was hard but navigable. Every time I hit a wall, I could show Cursor the exact file it needed to understand the problem. The infrastructure outage was disorienting in a different way. I wouldn’t have been able to do it without LLMs but every conversation started from scratch. I had to describe our setup, copy-paste logs, explain what IAM roles existed and what they were supposed to do. None of them could see the blast radius of changing that Ingress rule. None of them knew which Docker image was deployed where, or that our migrations job had never been pointed at the right database. I was the only source of context, and also the one who didn’t understand the system.&lt;/p&gt;

&lt;p&gt;Six months earlier, a frontend engineer attempting to debug a Cloud Run IAM lockout would have just waited for the CTO to come back from his weekend. The reason I could do it at all was Gemini explaining GCP’s logging architecture step by step and Cursor helping me write Python I’d never written. The context gap made it painful and slow in a way that the hackathon wasn’t.&lt;/p&gt;

&lt;h2&gt;What this means for Grafos&lt;/h2&gt;

&lt;p&gt;The two weekends are basically the same story told twice. In the first one, the tools worked because the context existed. Cursor had the codebase, Gemini could give me a crash course on a well-documented API. In the second one, the same tools were harder to use because the context was split. Gemini knew GCP inside out but it didn’t know our specific setup. It couldn’t see our IAM configuration, our Terraform state, or which services depended on which. Every time I hit a new failure, I had to reconstruct a new picture from scratch before Gemini could help me reason about it. The general knowledge was there but the specific context wasn’t.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcj1cypymn3u2dh46ao3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcj1cypymn3u2dh46ao3.png" alt="Grafos AI" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what Grafos is built around. The whole product is a bet that infrastructure has a context problem, and that if you can make that context legible, you can give any developer the same footing with their cloud environment that Cursor gives them with their codebase. Turning thousands of lines of Terraform into an interactive graph, with an AI assistant that actually understands your state, is just solving the same problem that made the hackathon work and made the outage so much harder than it needed to be.&lt;/p&gt;

&lt;p&gt;I didn’t come out of that weekend a cloud architect but I did come out of it more certain that the era of “I can’t touch that, I’m a frontend dev” is ending.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We originally published this on our &lt;a href="https://grafosai.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;. I'm part of a team of 4 engineers building &lt;a href="https://grafos.ai/?utm_source=dev_to" rel="noopener noreferrer"&gt;Grafos AI&lt;/a&gt; - check it out if you're tired of debugging Terraform infrastructure in the dark without context.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>infrastructureascode</category>
      <category>gcp</category>
    </item>
  </channel>
</rss>
