<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ajay Agrawal</title>
    <description>The latest articles on Forem by Ajay Agrawal (@ajayagrawal).</description>
    <link>https://forem.com/ajayagrawal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3578486%2F9dd59f43-c18e-47ad-8b20-b9fad7e32e15.jpeg</url>
      <title>Forem: Ajay Agrawal</title>
      <link>https://forem.com/ajayagrawal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ajayagrawal"/>
    <language>en</language>
    <item>
      <title>I Built an AI to Monitor Servers. Then I Built a Chaos Proxy to Break Them 💥</title>
      <dc:creator>Ajay Agrawal</dc:creator>
      <pubDate>Wed, 29 Apr 2026 11:55:59 +0000</pubDate>
      <link>https://forem.com/ajayagrawal/i-built-an-ai-to-monitor-servers-then-i-built-a-chaos-proxy-to-break-them-pla</link>
      <guid>https://forem.com/ajayagrawal/i-built-an-ai-to-monitor-servers-then-i-built-a-chaos-proxy-to-break-them-pla</guid>
      <description>&lt;p&gt;It’s 3:00 AM. Your phone is buzzing furiously. Your Grafana dashboard looks like a Jackson Pollock painting done entirely in red. A CPU on &lt;code&gt;server-04&lt;/code&gt; is screaming at 99%. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cool graph,&lt;/em&gt; you think, rubbing your eyes. &lt;em&gt;But what do I actually &lt;strong&gt;do&lt;/strong&gt; about this?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We don’t have a data problem in modern DevOps. We have an &lt;strong&gt;Actionable Intelligence&lt;/strong&gt; problem. We've built massive pipelines to funnel petabytes of Redfish server telemetry into time-series databases... just so we can set up Slack alerts that everyone inevitably mutes.&lt;/p&gt;

&lt;p&gt;What if we put an AI in the loop? Not just a chatbot that spits out generic stack-overflow tips, but an &lt;strong&gt;Agentic AI&lt;/strong&gt; ... a digital colleague that can reach out, inspect the infrastructure, and say: &lt;em&gt;"Hey, Server 3 is melting down due to a runaway memory leak. I suggest a graceful reboot. Want me to pull the trigger?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But there was a catch. To test a server-healing AI, I needed broken servers. And I &lt;em&gt;really&lt;/em&gt; didn't want to explain to my hosting provider why I intentionally deep-fried my bare-metal rig.&lt;/p&gt;

&lt;p&gt;So, I built &lt;a href="https://github.com/ajayagrawalgit/NeurOps" rel="noopener noreferrer"&gt;&lt;strong&gt;NeurOps&lt;/strong&gt;&lt;/a&gt;: half infrastructure intelligence, half intentional sabotage. &lt;/p&gt;

&lt;p&gt;Here is the story of how I built an AI agent to monitor my servers, and a Chaos Proxy designed specifically to lie to it.&lt;/p&gt;




&lt;h2&gt;😈 Meet the Chaos Proxy: My Digital Gremlin&lt;/h2&gt;

&lt;p&gt;In the enterprise world, servers talk via the &lt;strong&gt;Redfish API&lt;/strong&gt;. It's the standard RESTful way to ask a motherboard, &lt;em&gt;"Hey, are you on fire?"&lt;/em&gt;&lt;/p&gt;
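&lt;p&gt;If you have never poked a BMC before: a Redfish health check is just an authenticated GET against a standard path. A minimal sketch (the host, credentials, and chassis ID are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# DMTF-standard thermal endpoint; BMC address and credentials are placeholders
BMC = "https://10.0.0.4"
resp = requests.get(
    f"{BMC}/redfish/v1/Chassis/1/Thermal",
    auth=("admin", "password"),
    verify=False,  # lab only: BMCs usually ship with self-signed certs
    timeout=5,
)
for sensor in resp.json().get("Temperatures", []):
    print(sensor["Name"], sensor["ReadingCelsius"], sensor["Status"]["Health"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;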

&lt;p&gt;Instead of hooking my AI monitoring tool directly to the servers, I built a &lt;code&gt;FastAPI&lt;/code&gt; middleware called the &lt;strong&gt;Chaos Management Proxy&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Normally, this proxy is a model citizen. It intercepts the Redfish request, grabs the real JSON payload from the server, and passes it along. But hit the right endpoint, and it turns into an absolute gremlin. With a simple &lt;code&gt;POST&lt;/code&gt; request, it intercepts the payload mid-flight and injects a "Deep Merge" override.&lt;/p&gt;

&lt;p&gt;Take a look at this snippet from the proxy router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/simulate/{server_id}/memory/leak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory_leak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ServerEnum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Deep merge this dict into the actual live Redfish API response!
&lt;/span&gt;    &lt;span class="n"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;server_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UsagePercent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory leak injected for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
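&lt;p&gt;The endpoint only stores the override. The actual lying happens on the next GET, when the proxy recursively merges the override into the live payload. A minimal deep-merge helper might look like this (the repo’s real helper may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def deep_merge(base, override):
    """Recursively overlay `override` onto `base`, leaving sibling keys untouched."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Only Memory.UsagePercent and Memory.Status change; every other field of
# the live Redfish payload passes through untouched.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;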



&lt;p&gt;With one API call, the proxy alters reality. The monitoring system &lt;em&gt;thinks&lt;/em&gt; the server is dying. The actual hardware is sipping a digital piña colada. We can simulate thermal spikes, disk failures, or even a slow, torturous CPU degradation ... all safely in software.&lt;/p&gt;




&lt;h2&gt;🧠 The LLM is a Routing Engine (Wait, That's Clever)&lt;/h2&gt;

&lt;p&gt;So the servers are (virtually) melting. How does the AI step in?&lt;/p&gt;

&lt;p&gt;I used the &lt;strong&gt;Google Agent Development Kit (ADK)&lt;/strong&gt; and Gemini to build &lt;code&gt;NeuroTalk&lt;/code&gt;. Here’s the secret sauce: a good AI agent isn’t just a clever prompt. It’s about giving the AI the right tools and explicitly teaching it &lt;em&gt;when&lt;/em&gt; to use them.&lt;/p&gt;

&lt;p&gt;Here is the actual configuration of my AI Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NeuroTalk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;get_live_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Hits the live Redfish API via Chaos Proxy
&lt;/span&gt;        &lt;span class="n"&gt;get_past_issues&lt;/span&gt;     &lt;span class="c1"&gt;# Queries BigQuery for historical telemetry
&lt;/span&gt;    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Tool selection strategy:
    1. Real-time Status: When asked about &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, ALWAYS use get_live_status().
    2. Historical Analysis: Only use get_past_issues() when explicitly asked for trends.
    3. Combined Analysis: Use both if you need to compare live data with history.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM doesn't just guess; it acts as an intelligent router.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask it: &lt;em&gt;"Why is server-02 acting weird right now?"&lt;/em&gt; ➡️ It calls &lt;code&gt;get_live_status()&lt;/code&gt; against the live Chaos Proxy API.&lt;/li&gt;
&lt;li&gt;Ask it: &lt;em&gt;"Has server-02 been running hot all week?"&lt;/em&gt; ➡️ It calls &lt;code&gt;get_past_issues()&lt;/code&gt;, which runs a SQL query against BigQuery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It investigates before it speaks.&lt;/p&gt;
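&lt;p&gt;For the curious, the tools themselves are just plain Python functions whose docstrings tell the agent what they do. A rough sketch of the two tools above (the proxy path and BigQuery table name are placeholders; the real implementations do more):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests
from google.cloud import bigquery

PROXY = "http://localhost:8000"  # the Chaos Management Proxy

def get_live_status(server_id):
    """Fetch the current Redfish status for a server via the Chaos Proxy."""
    return requests.get(f"{PROXY}/redfish/{server_id}/status", timeout=5).json()

def get_past_issues(server_id):
    """Query BigQuery for recent anomalies recorded for a server."""
    client = bigquery.Client()
    query = """
        SELECT ts, metric, value, anomaly_tag
        FROM `my-project.neurops.telemetry`
        WHERE server_id = @server_id AND anomaly_tag IS NOT NULL
        ORDER BY ts DESC LIMIT 50
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("server_id", "STRING", server_id)
    ])
    return [dict(row) for row in client.query(query, job_config=job_config).result()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;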




&lt;h2&gt;🚧 The Statefulness Trap&lt;/h2&gt;

&lt;p&gt;It wasn't all smooth sailing. I quickly ran into a major problem: &lt;strong&gt;State&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a CPU hits 90%, is it a 2-second spike because a cron job started, or is the server entering a death spiral? LLMs are notoriously bad at analyzing high-frequency time-series data on the fly. &lt;/p&gt;

&lt;p&gt;To solve this, I had to build a fast, localized &lt;code&gt;deque&lt;/code&gt;-based ring buffer into the polling collector (&lt;code&gt;Neurosight&lt;/code&gt;) just to track the last 5 intervals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Trend check over the ring buffer: is the whole window strictly increasing?
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_increasing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;TREND_WINDOW&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the temperature goes up 5 times in a row, the collector flags a &lt;code&gt;TEMP_TREND_UP&lt;/code&gt; anomaly &lt;em&gt;before&lt;/em&gt; the server actually hits the critical threshold. It attaches this tag to the payload sent to BigQuery. The AI simply reads this tag, bypassing the need to do any complex math. &lt;/p&gt;
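&lt;p&gt;Wiring that check into the collector is straightforward: keep a &lt;code&gt;deque(maxlen=TREND_WINDOW)&lt;/code&gt; per metric, append on every poll, and tag the outgoing payload when the window is strictly increasing. A condensed sketch (the real collector tracks more than one metric):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

TREND_WINDOW = 5
temps = deque(maxlen=TREND_WINDOW)  # ring buffer: old readings fall off the back

def is_increasing(arr):
    # Same trend check as shown above
    return len(arr) == TREND_WINDOW and all(x &amp;lt; y for x, y in zip(arr, list(arr)[1:]))

def on_poll(reading_celsius):
    temps.append(reading_celsius)
    tags = []
    if is_increasing(temps):
        tags.append("TEMP_TREND_UP")  # flag the climb before it turns critical
    return {"temperature": reading_celsius, "anomaly_tags": tags}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;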




&lt;h2&gt;🎭 The 5-Step Dance of Destruction and Salvation&lt;/h2&gt;

&lt;p&gt;When you boot up NeurOps, here is the wild sequence of events that happens in seconds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Target:&lt;/strong&gt; We spin up Redfish emulators (or connect to real servers).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Sabotage:&lt;/strong&gt; We hit the Chaos Proxy and inject a fake &lt;code&gt;95°C&lt;/code&gt; thermal event on &lt;code&gt;server-01&lt;/code&gt; (a single &lt;code&gt;POST&lt;/code&gt;; see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Detection:&lt;/strong&gt; The Neurosight Collector polls the proxy, sees the 95°C spike, flags a &lt;code&gt;TEMP_CRITICAL&lt;/code&gt; anomaly, and fires the data via &lt;strong&gt;Google Pub/Sub&lt;/strong&gt; into &lt;strong&gt;BigQuery&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Investigation:&lt;/strong&gt; An engineer opens the Streamlit UI and asks NeuroTalk: &lt;em&gt;"What just happened to server-01?"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Salvation:&lt;/strong&gt; The AI Agent queries BigQuery, sees the thermal spike, reads the Redfish status, and responds: &lt;em&gt;"Server-01 has experienced a critical thermal event. I recommend triggering the &lt;code&gt;/heal/server-01/reboot&lt;/code&gt; webhook to attempt a recovery."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;
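&lt;p&gt;Step 2 really is one HTTP call. Assuming a thermal route analogous to the memory-leak endpoint shown earlier (the exact path here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Hypothetical thermal route, mirroring /simulate/{server_id}/memory/leak
resp = requests.post(
    "http://localhost:8000/simulate/server-01/thermal/spike",
    json={"ReadingCelsius": 95},
    timeout=5,
)
print(resp.json())  # e.g. {"message": "Thermal spike injected for server-01"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;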




&lt;h2&gt;🛠️ If You Want to Build This...&lt;/h2&gt;

&lt;p&gt;If you are looking to build agentic AI into your own DevOps workflows, here are my biggest takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't let the AI guess.&lt;/strong&gt; Give it strict tools. An LLM without access to a live API or a database is just a very confident hallucinator. Treat it like a junior dev ... give it read-only API keys and watch what it does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Engineering is mandatory.&lt;/strong&gt; You cannot trust your AI if you have never watched it panic. Build a proxy, intercept payloads, and break things on purpose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start stupid simple.&lt;/strong&gt; You don't need a massive Kubernetes cluster to test this. A simple FastAPI proxy and a Python polling script will get you 90% of the way there (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
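&lt;p&gt;To make "stupid simple" concrete: the entire polling half can start out this small. It hits the proxy on an interval, tags anomalies, and prints the payload (NeurOps ships it to Pub/Sub instead; the endpoints here are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import requests

PROXY = "http://localhost:8000"   # the Chaos Proxy
SERVERS = ["server-01", "server-02"]
POLL_INTERVAL = 10                # seconds

while True:
    for server_id in SERVERS:
        status = requests.get(f"{PROXY}/redfish/{server_id}/status", timeout=5).json()
        health = status.get("Memory", {}).get("Status", {}).get("Health", "OK")
        tags = ["MEM_CRITICAL"] if health == "Critical" else []
        # The real collector publishes to Pub/Sub; printing keeps the sketch self-contained
        print({"server": server_id, "memory": status.get("Memory"), "anomaly_tags": tags})
    time.sleep(POLL_INTERVAL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;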

&lt;h2&gt;🏁 Wrapping Up&lt;/h2&gt;

&lt;p&gt;We are entering a wildly exciting era where AI doesn't just help us write code; it actively manages the infrastructure the code runs on. By combining standard protocols (Redfish), robust data pipelines (BigQuery), and Agentic AI, we can stop staring at dashboards at 3 AM and start actually fixing problems.&lt;/p&gt;

&lt;p&gt;If you thought this was interesting, drop a comment! How are you using AI in your DevOps workflows? Or better yet... &lt;strong&gt;what is the most creative way you've ever broken a server on purpose?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Let me know below! 👇&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>🚨 Why Production-Grade Logging Isn’t Optional: A Technical Deep Dive 🔍</title>
      <dc:creator>Ajay Agrawal</dc:creator>
      <pubDate>Wed, 22 Oct 2025 09:23:37 +0000</pubDate>
      <link>https://forem.com/ajayagrawal/why-production-grade-logging-isnt-optional-a-technical-deep-dive-1m93</link>
      <guid>https://forem.com/ajayagrawal/why-production-grade-logging-isnt-optional-a-technical-deep-dive-1m93</guid>
      <description>&lt;p&gt;In today’s fast-paced software world, logging often gets treated as an afterthought—a few lines sprinkled here and there before a release. But when a production incident strikes at 3 AM, those logs become your North Star ✨ for making sense of chaos.&lt;/p&gt;

&lt;p&gt;After years in backend engineering and incident response, it’s clear: &lt;strong&gt;logging isn’t just about recording events—it’s about building observability into your system from day one.&lt;/strong&gt; 💡&lt;/p&gt;

&lt;h2&gt;The Hidden Cost of Poor Logging 💸&lt;/h2&gt;

&lt;p&gt;Research shows developers spend &lt;strong&gt;35–50% of their time debugging&lt;/strong&gt;, and a big chunk of that time is wasted digging through incomplete logs or guessing what really happened. In production, where you can’t just “add a print statement,” logs become your system’s black box 📦.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider the real-world impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster incident fixes&lt;/strong&gt;: Teams with great logs resolve production issues 60–80% faster 🚑
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower resource overhead&lt;/strong&gt;: Efficient logging prevents CPU and memory slowdowns ⚡
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost control&lt;/strong&gt;: Smart logging keeps cloud costs predictable and minimized 📉
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why Logging Matters at Every Stage 🛠️&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;During Development&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📝 Interactive documentation for onboarding and code understanding
&lt;/li&gt;
&lt;li&gt;🧩 Faster debugging (no more guesswork!)
&lt;/li&gt;
&lt;li&gt;⏱️ Built-in profiling to catch bottlenecks early
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In Production&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️ Rapid incident response
&lt;/li&gt;
&lt;li&gt;📊 Real-time monitoring and proactive alerts
&lt;/li&gt;
&lt;li&gt;🔒 Compliance for audits and standards
&lt;/li&gt;
&lt;li&gt;📈 Performance tuning, based on real usage patterns
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Microservices Challenge 🤹‍♂️&lt;/h2&gt;

&lt;p&gt;Modern architectures often see requests span 10+ services, scattering logs everywhere. Without context propagation or smart correlation, root cause analysis becomes a detective saga 🕵️‍♀️.&lt;/p&gt;

&lt;p&gt;To stay on top, you need:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✍️ Automatic context propagation
&lt;/li&gt;
&lt;li&gt;🔗 Correlation IDs (see the sketch after this list)
&lt;/li&gt;
&lt;li&gt;📚 Centralized, queryable structured logs
&lt;/li&gt;
&lt;/ul&gt;
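&lt;p&gt;Correlation IDs are easier than they sound in Python: a &lt;code&gt;contextvars.ContextVar&lt;/code&gt; set once at the edge of a request flows through async calls automatically. A minimal stdlib-only sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
from contextvars import ContextVar
from uuid import uuid4

correlation_id = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every record with the current request's correlation ID."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(asctime)s %(correlation_id)s %(levelname)s %(message)s")
logger = logging.getLogger("svc")
logger.addFilter(CorrelationFilter())

# At the edge of each request (e.g. in middleware), set the ID once:
correlation_id.set(str(uuid4()))
logger.warning("payment service timed out")  # every line now carries the ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;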

&lt;h2&gt;Best Practices for Pro Logging 🧙&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🧾 &lt;strong&gt;Structured (JSON) logs&lt;/strong&gt; (sketched after this list)
&lt;/li&gt;
&lt;li&gt;🧑‍💻 Context-rich entries (who, what, where, when, why)&lt;/li&gt;
&lt;li&gt;🚀 Async, non-blocking writes&lt;/li&gt;
&lt;li&gt;⚙️ Granular log levels (&lt;code&gt;DEBUG&lt;/code&gt;, &lt;code&gt;INFO&lt;/code&gt;, &lt;code&gt;WARNING&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;🛡️ Never log sensitive data&lt;/li&gt;
&lt;/ul&gt;
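&lt;p&gt;Most of these practices fall out of one decision: emit machine-readable records. A minimal JSON formatter on top of the stdlib (real setups usually reach for a library instead):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

logging.getLogger("checkout").info("order placed")
# {"ts": "...", "level": "INFO", "logger": "checkout", "msg": "order placed"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;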

&lt;h2&gt;Modern Libraries to the Rescue 🛟&lt;/h2&gt;

&lt;p&gt;Python’s built-in &lt;code&gt;logging&lt;/code&gt; module works, but scaling it for production takes more.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MickTrace&lt;/strong&gt; is a lightweight, modern library I’ve recently explored that brings subtle superpowers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔌 Zero-config setup—just works&lt;/li&gt;
&lt;li&gt;⚡ Async-native (built for FastAPI, etc.)&lt;/li&gt;
&lt;li&gt;⏱️ Sub-microsecond overhead&lt;/li&gt;
&lt;li&gt;🛠️ Auto context propagation across async&lt;/li&gt;
&lt;li&gt;🌩️ Cloud and CI/CD-friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation is just:&lt;/strong&gt;  &lt;code&gt;pip install micktrace&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quickstart example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import micktrace
logger = micktrace.get_logger(__name__)
logger.info("User login", user_id=12345, ip_address="192.168.1.1", success=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Community &amp;amp; Contribution 🤝&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Try out MickTrace: &lt;code&gt;pip install micktrace&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;⭐ Star the repo if it saves you time: &lt;a href="https://github.com/ajayagrawalgit/MickTrace" rel="noopener noreferrer"&gt;https://github.com/ajayagrawalgit/MickTrace&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Got ideas or want to contribute? PRs welcome!&lt;/li&gt;
&lt;li&gt;Share your logging adventures in the comments&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; robust logging is your insurance in production. Make it your friend, not your afterthought. Your future self—and your teammates—will thank you! 😊&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What are your thoughts on modern logging practices? Have you faced challenges with logging in production environments? Let’s discuss below!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclaimer: This article reflects my personal experiences and technical perspective. MickTrace is one of several excellent logging solutions in the Python ecosystem.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
