Forem: Kanak Waradkar

I put a lot of effort into this one. I hope people appreciate it.

Kanak Waradkar — Fri, 24 Apr 2026 16:11:11 +0000

Google Cloud NEXT '26 Challenge Submission

Kanak Waradkar

Apr 24

The Boring Protocol That Quietly Ate the Cloud

#devchallenge #cloudnextchallenge #googlecloud


The Boring Protocol That Quietly Ate the Cloud

Kanak Waradkar — Fri, 24 Apr 2026 16:09:52 +0000

This is a submission for the Google Cloud NEXT Writing Challenge

Act I — A Scene You've Already Lived

It is 2:47 AM.

You have been staring at the same error trace for three hours. Your AI agent — the one you spent two weeks fine-tuning, the one your manager called "the future of the team's workflow" — is broken. Not because the model is wrong. Not because your logic is flawed. It is broken because the tool it needs to call speaks a different dialect than the one you wrote the connector for. The authentication handshake is slightly off. The schema your agent expects doesn't quite match what the API returns. You wrote a wrapper, then a wrapper for the wrapper, then a helper function to normalize the wrapper's output.

You are not building intelligence. You are a translator in a room where everyone is shouting in a different language.

This is not a fringe experience. This is the defining condition of modern AI development, and almost nobody is talking about it. Because while the conference halls of Las Vegas erupted for a new Gemini model and eighth-generation TPUs, something quieter happened at Google Cloud NEXT '26 that will matter far longer than any benchmark score.

A protocol became the law of the land.

Act II — The Tower of Babel, Serialized

To understand why the Model Context Protocol (MCP) is the most important announcement from Google Cloud NEXT '26, you have to understand the problem it kills. And to understand that problem, you have to go back further than you think.

In the beginning, there were APIs. This was supposed to solve everything.

APIs were the handshake that let software systems talk to each other. They were elegant in theory: you expose endpoints, you document them, and any system in the world can connect. For a while, this worked. The internet was built on it.

Then something changed. The internet didn't get smaller. It got bigger. Exponentially, chaotically bigger. And as it grew, every platform, every service, every product started developing its own dialect. REST. SOAP. GraphQL. gRPC. Webhooks. OAuth 1.0. OAuth 2.0. API keys in headers. API keys in query strings. Rate limits that differ per endpoint. Error codes that mean different things across different services. Every integration you built was, in reality, a custom translation layer — a bespoke diplomatic relationship between two systems that fundamentally did not understand each other's native tongue.

This was tolerable when software was built by humans, for humans, at human speed.

Then came the agents.

Act III — The Shatter Point (December 2025, Largely Unnoticed)

Here is the moment historians will circle back to.

In December 2025, before the spotlights, before the Vegas keynote stage, Google quietly launched fully managed remote MCP servers for Google Maps, BigQuery, Compute Engine, and Kubernetes Engine. No major press cycle. No splashy product announcement. Just a protocol, extended to four services, slipped into the changelog.

That was the shatter point.

Because here is what MCP actually is, stripped of the marketing language: it is a universal adapter for AI agents. Instead of an agent needing a custom-written integration for every tool it wants to use — instead of you, the developer, writing that integration — MCP defines a single, standardized way for agents to discover, connect to, and operate tools. One protocol. One handshake. Every tool that speaks MCP is instantly available to every agent that speaks MCP.
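
To make the "discover, then operate" flow concrete: MCP messages are JSON-RPC 2.0, and a client usually lists tools with `tools/list` and invokes one with `tools/call`. The sketch below only builds the request payloads. The tool name `list_objects` and its `bucket` argument are hypothetical, and no transport or authentication is shown.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict, the envelope MCP messages use."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


def list_tools_request():
    # Discovery: ask the server which tools it exposes.
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    # Operation: invoke one discovered tool by name with JSON arguments.
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    print(json.dumps(list_tools_request()))
    # "list_objects" and "bucket" are hypothetical, for illustration only.
    print(json.dumps(call_tool_request("list_objects", {"bucket": "my-project-bucket"})))
```

The point of the sketch is that both calls have the same shape for every server, so the agent-side code does not change when a new tool appears.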

It is the USB port for artificial intelligence. And in December 2025, Google plugged it into four of its services and said nothing about it.

By April 2026, at Google Cloud NEXT '26, they plugged it into everything.

Act IV — What Actually Happened in Las Vegas

While the crowd was applauding the Gemini 3.1 Pro benchmark numbers — and they are genuinely impressive numbers — the structural revolution was being announced three slides earlier, in language so dry it barely registered.

Quote from the Opening Keynote, Thomas Kurian:

"We've used the Model Context Protocol to turn every Google Cloud service into a tool that agents can orchestrate directly, enabling them to troubleshoot our own infrastructure using decades of our own telemetry."

Read that sentence again. Not some Google Cloud services. Not the popular ones. Every Google Cloud service is now an MCP server. Google turned its entire cloud infrastructure — the same infrastructure that runs YouTube, Search, and Android — into a unified, agent-readable tool surface.

The implication is almost too large to process.

If you are building an agent today on Google Cloud, you no longer write integration code for BigQuery. You do not write a connector for Cloud Storage. You do not negotiate with the IAM API. You point your agent at the MCP server and it figures out the available tools, their schemas, and their capabilities: automatically, in a standard way, and with enterprise security baked in by default.

The Workspace MCP Server, announced in preview at NEXT '26, extends this further. Agents can now synthesize Drive documents, draft Gmail responses, and manage Calendar logic — all through a single, standardized, open framework. The new Workspace CLI will allow agents to trigger these capabilities directly.

And then there is Apigee.

Act V — The Apigee Revelation (The One Nobody Talked About)

Apigee is Google Cloud's API management platform. Most people think of it as a gateway — something that sits in front of your APIs and handles rate limiting and authentication. Boring infrastructure. Background noise.

At NEXT '26, Apigee became something else entirely.

Apigee now functions as an MCP bridge. What this means in practice: any standard API — not just Google's, any API — can be translated into a discoverable, agent-ready MCP tool. Existing security controls and governance policies carry over automatically.

Think about what that sentence means. Every legacy REST API your organization has ever built. Every third-party service you've integrated. Every custom internal tool that runs on some vendor's proprietary endpoint. All of it can now be surfaced as a standardized, agent-accessible MCP server.
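
As a rough picture of what such a translation involves, here is a minimal sketch that maps one OpenAPI-style operation to an MCP-style tool descriptor (a `name`, a `description`, and a JSON Schema under `inputSchema`). This is not Apigee's implementation. The function name `operation_to_mcp_tool` and the `/invoices` endpoint are hypothetical, and only simple query parameters are handled.

```python
def operation_to_mcp_tool(path, method, operation):
    """Map one OpenAPI-style operation dict to an MCP-style tool descriptor."""
    properties = {}
    required = []
    for param in operation.get("parameters", []):
        properties[param["name"]] = {
            "type": param.get("schema", {}).get("type", "string"),
            "description": param.get("description", ""),
        }
        if param.get("required", False):
            required.append(param["name"])

    # Prefer the API's own operationId; otherwise derive a name from the route.
    fallback = f"{method.lower()}_{path.strip('/').replace('/', '_')}"
    return {
        "name": operation.get("operationId", fallback),
        "description": operation.get("summary", ""),
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


if __name__ == "__main__":
    # A hypothetical legacy endpoint: GET /invoices?customer_id=...&limit=...
    legacy_operation = {
        "operationId": "listInvoices",
        "summary": "List invoices for a customer",
        "parameters": [
            {"name": "customer_id", "required": True, "schema": {"type": "string"}},
            {"name": "limit", "schema": {"type": "integer"}},
        ],
    }
    print(operation_to_mcp_tool("/invoices", "GET", legacy_operation))
```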

The fragmentation problem doesn't get solved by replacing all your APIs. It gets solved by giving every existing API a universal translator. Apigee is that translator. And it quietly went live while everyone was watching the TPU benchmark charts.

Act VI — The Part Where It Gets Dangerous (And Why That's the Point)

Here is the fear that lives at the center of every serious conversation about agentic AI: who controls what the agent can do?

If an agent can call any tool, what stops it from calling the wrong one? What stops it from writing to a database it should only be reading? What stops a compromised agent from exfiltrating data through a tool it was never supposed to have access to in the first place?

The old answer was: application logic. You write guardrails in code. You validate inputs. You build permissions into the software layer.

The problem is that application logic is fragile. It can be bypassed. It can drift. In a world where agents are autonomous — where they run for days, chain tasks across dozens of tools, spawn sub-agents — trusting the application layer to enforce security is like trusting someone to lock the vault and also trusting them to not write the combination on the door.

MCP, as implemented at Google Cloud NEXT '26, solves this at a different layer entirely.

The Agent Gateway, announced as the "air traffic control" system for the Gemini Enterprise Agent Platform, is the enforcement point. It understands MCP natively. Every tool call, every agent action, every data access request routes through the Gateway. And the Gateway is connected directly to Google Cloud's Identity and Access Management (IAM) system.

The demo shown in the Developer Keynote made this tangible. A Planner Agent attempting to call a Finance MCP server — to read and write financial records — was stopped dead. Not by application code. Not by a hard-coded rule buried in a function somewhere. It was stopped because a developer had applied a single conditional IAM policy to the MCP server connection: read-only = true. The write privilege evaporated, instantly, at the protocol layer.

This is what zero-trust security looks like when it grows up. You do not write permission checks into every agent. You enforce permissions at the protocol that every agent must use. Change the policy once, and every agent that touches that MCP server is immediately governed by it.
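
The idea of enforcing a rule once, at the point every call passes through, can be sketched with a toy gateway. Everything here is hypothetical (`ToyGateway`, `ServerPolicy`, the tool names); it illustrates the pattern only and is not the Agent Gateway or IAM API.

```python
from dataclasses import dataclass, field


@dataclass
class ServerPolicy:
    """Policy attached to one MCP server connection (hypothetical shape)."""
    read_only: bool = False


class PolicyDenied(Exception):
    """Raised when a tool call violates the policy for its server."""


@dataclass
class ToyGateway:
    # Server name -> policy. Servers with no entry are denied by default.
    policies: dict = field(default_factory=dict)
    # Tool names treated as mutating. A real system would take this from
    # tool metadata rather than a hard-coded set.
    write_tools: frozenset = frozenset({"write_record", "delete_record"})

    def authorize(self, server, tool):
        """Return True if the call may proceed, else raise PolicyDenied."""
        policy = self.policies.get(server)
        if policy is None:
            raise PolicyDenied(f"no policy for server {server!r}")
        if policy.read_only and tool in self.write_tools:
            raise PolicyDenied(f"{tool!r} blocked: {server!r} is read-only")
        return True


if __name__ == "__main__":
    gateway = ToyGateway(policies={"finance": ServerPolicy(read_only=True)})
    print(gateway.authorize("finance", "read_record"))
    try:
        gateway.authorize("finance", "write_record")
    except PolicyDenied as err:
        print("denied:", err)
```

No agent contains a permission check in this sketch; flipping `read_only` on the one policy object changes what every caller can do.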

One rule. Every agent. No exceptions.

Act VII — The Ghost in the Protocol

There is a concept in systems theory called antifragility — the idea, coined by Nassim Nicholas Taleb, that some systems don't just survive stress and chaos; they actively grow stronger from it. They are built such that every shock to the system makes the structure more resilient, not less.

The history of software integration has been, by every measure, fragile. Every new service added to your stack was a new point of failure. Every new agent you deployed was a new attack surface. Every new API you connected was a new diplomatic negotiation that could break at any time, without warning, for reasons entirely outside your control.

MCP is not just a protocol. It is an antifragile architecture for the agentic era.

Every new tool that adopts MCP makes every other MCP-compatible agent immediately more capable. Every new security policy applied at the Agent Gateway makes every connected agent immediately more secure. Every new service Google exposes as an MCP server makes the entire ecosystem more interconnected — not more fragile, but more robust. The network effect runs in the direction of stability, not entropy.

Google adopted MCP, a protocol originally designed by Anthropic, as the foundation of its entire agentic infrastructure. Google's A2A protocol, now at version 1.2 and running in production at 150 organizations, is designed as a complement to MCP, handling agent-to-agent communication while MCP handles agent-to-tool communication. Salesforce runs A2A. ServiceNow runs it. SAP runs it. The Linux Foundation now governs A2A. The ecosystem is converging.

This is what the end of the fragmentation era looks like. Not a single vendor winning. Not one cloud eating the others. A protocol becoming the shared grammar of an industry.

Act VIII — What This Means For You (The Practical Part)

If you are building AI agents today — whether on Google Cloud, another platform, or your own infrastructure — MCP should change how you think about your architecture.

Stop writing custom connectors. Every hour you spend writing a custom integration between your agent and a tool is an hour you are spending on a problem that has already been solved. If the tool you need speaks MCP, you connect to it with a standard call. If it doesn't, Apigee can translate it.

Govern at the protocol layer. Do not write security logic into your agents. Apply IAM policies to your MCP server connections. This gives you a single, auditable, enforceable control point for every agent in your ecosystem, regardless of which model is running underneath.

Think in tool surfaces, not in integrations. The mental model shift is from "this agent connects to these specific APIs" to "this agent has access to this governed surface of tools." The tools are discoverable. The capabilities are standardized. The security is inherited.

Here is a minimal example of what connecting to a remote MCP server looks like in the Gemini Agent SDK today:

from google.cloud import agent

# Connect your agent to the Cloud Storage MCP server
# No custom connector. No schema negotiation. No auth wrapper.
agent_client = agent.AgentClient()

response = agent_client.run(
    agent_id="your-agent-id",
    message="List all objects in my-project-bucket and summarize the largest file.",
    mcp_servers=["https://storage.googleapis.com/mcp/v1"]
)

print(response.text)

No custom authentication. No wrapper library. No three-hour debugging session at 2:47 AM. The agent discovers the tools, understands their schemas, and calls them — governed by whatever IAM policy you've already applied to your Cloud Storage resources.

That is the reality MCP makes possible.

The Full Circle

We began in a room where everyone was shouting in a different language. A developer, exhausted, writing wrappers for wrappers, trying to make a system that was supposed to be intelligent do something as simple as access the right tool in the right way.

The Tower of Babel was not destroyed by building a better tower. It was dismantled by giving everyone a shared language.

Google Cloud NEXT '26 was, on the surface, a showcase of computational power. Eighth-generation TPUs. Gemini 3.x models. New data center fabrics. A new IDE. The hardware was extraordinary. The models are genuinely capable.

But the announcement that will define this era — the one that will matter in five years when we look back and ask how AI agents went from fascinating experiments to the operational backbone of every enterprise — was a protocol. A dry, specification-level decision about how agents should communicate with tools.

The Gemini models get the headlines. The benchmarks get the LinkedIn posts.

MCP gets the future.

And while you were watching the keynote for the part with the big numbers, the boring protocol was quietly wiring itself into everything. Every service. Every tool. Every agent. Every organization.

Not with a bang. Never with a bang.

Just a handshake. A standard one. The same one, every time.

Written in response to Google Cloud NEXT '26. All technical details sourced from official Google Cloud keynotes and blog posts, April 22–23, 2026.


ChronoAgent: I Built an Agent That Reads My WhatsApp So I Stop Missing Deadlines

Kanak Waradkar — Fri, 24 Apr 2026 06:07:39 +0000

This is a submission for the OpenClaw Challenge.

What I Built

I'm a second-year computer engineering student, and I often joke that my degree runs on WhatsApp. Most students don't attend lectures; they just check what someone has uploaded to WhatsApp and work from there. I'm going to be honest with you about something embarrassing: I missed a lab submission last semester not because I didn't do the work, but because the deadline came through a WhatsApp group at 11 PM and I just didn't see it in time.

The professor sent a reminder email three days before. Someone in the group chat forwarded it with "guys don't forget!!". Someone else sent a voice note I never opened. By the time I remembered, the portal was closed.

This is not a unique experience. If you're a student in India — or honestly anyone whose professional life runs partially through WhatsApp — you know that deadlines don't live in one place. They're fragmented across group chats, Gmail threads, DMs, and the occasional Instagram message from a friend who remembered you hadn't registered for something yet.

The "fix" people suggest is: "just use Google Calendar." Sure. Do you manually create an event every time someone texts you about something? I don't. Nobody does.

So here comes the hero of our story: OpenClaw.

How I Used OpenClaw

I've played with n8n, with custom Python cron scripts, with Zapier. None of them had native WhatsApp access without either paying for a business API (which requires a separate phone number and a whole approval process) or running a fragile Selenium scraper.

OpenClaw has WhatsApp built into its Gateway via Baileys. One command, one QR scan. Done. The agent immediately has access to every message I receive, treated as a first-class channel.

The other thing that made OpenClaw the right fit was Standing Orders. This is not a feature you'd expect to care about until you actually use it. The idea is simple: you write a file called AGENTS.md in your workspace and the agent reads it every session. You define programs — "here's what you're authorized to do, here's when to do it, here's when to stop and ask me."

For ChronoAgent, the Standing Order looks roughly like this:

## Program: WhatsApp Deadline Monitor

**Authority:** Read all inbound WhatsApp messages. Extract and store deadline/event data.
Send review messages back to the user for ambiguous extractions.
**Trigger:** Every inbound WhatsApp message
**Approval gate:** Auto-write calendar for confidence ≥ 0.80. Request approval below that.
**Escalation:** If extraction fails 3 times in a row, alert me and pause.

This single file is why it is an agent and not a script. It has defined scope, defined escalation rules, defined approval gates. It knows when to act and when to ask. I wrote those rules once and now they govern every message that comes through, forever, without me thinking about it again.

Demo

So I built ChronoAgent.

ChronoAgent runs in the background on my laptop as an OpenClaw Gateway daemon. It has a WhatsApp channel connected (OpenClaw supports this natively through Baileys — no external service, no webhook nonsense, just scan a QR code). Every message that comes in gets silently processed. If it contains a deadline, a due date, a meeting, an exam — it gets extracted, deduplicated, and written to Google Calendar automatically.

I don't prompt it. I don't open a chat window. It just runs.

When it adds something to my calendar, it sends me a message back: "📅 Added: Assignment 3 submission on April 28, 11:59 PM". I can reply NO to undo it. That's the entire user interaction for 80% of cases.

For things it's less sure about — "sometime next week we should meet bro" — it holds the event in a pending queue and asks me to confirm with a YES or NO reply. No calendar write until I say so.
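
The routing described here is small enough to sketch directly. The thresholds are the ones the post gives for its confidence router (0.80 and 0.50); the function name `route` and the returned labels are hypothetical.

```python
AUTO_WRITE = 0.80  # at or above: write to the calendar and notify
ASK_USER = 0.50    # at or above (but below AUTO_WRITE): hold and ask


def route(event):
    """Return 'write', 'pending', or 'discard' for one extracted event."""
    confidence = float(event.get("confidence", 0.0))
    if confidence >= AUTO_WRITE:
        return "write"
    if confidence >= ASK_USER:
        return "pending"
    return "discard"


if __name__ == "__main__":
    for score in (0.95, 0.70, 0.30):
        print(score, route({"title": "example", "confidence": score}))
```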

Source code and setup instructions: github.com/Labreo/openclaw-calendar-agent

What I Learned

Here's the mistake I almost made: trying to have the LLM do everything.

My first draft had the LLM reading messages, extracting dates, checking the calendar for duplicates, deciding whether to write, writing the event, formatting the confirmation message — all in one big chain of reasoning.

It was slow. It was expensive. And it hallucinated duplicates constantly.

The version that actually works is much dumber-looking on the surface, but it's solid:

Incoming Message
    │
    ▼
[Python: normalize to standard envelope]
{ source, sender, timestamp, raw_text }
    │
    ▼
[LLM: ONLY job is translation]
Input: raw text + today's date
Output: JSON array of extracted events with confidence scores
    │
    ▼
[Python: resolve relative dates]
"next Friday" → 2026-05-01T00:00:00 (with today = 2026-04-24)
    │
    ▼
[Python: deduplicate]
Fuzzy title match + date proximity + semantic similarity
    │
    ▼
[Confidence router]
≥ 0.80 → write to calendar, notify me
0.50–0.79 → pending queue, ask me
< 0.50 → silent discard

The LLM's job is exactly one thing: read messy human text and output clean structured JSON. That's it. Every other decision in the pipeline is deterministic Python.
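
As one example of a deterministic step, relative dates can be resolved with plain `datetime` arithmetic. This is a minimal sketch, not the author's resolver. It assumes "next <weekday>" means the first such weekday strictly after the reference date, and it handles only a handful of phrases.

```python
from datetime import date, datetime, time, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_relative(phrase, today):
    """Resolve a few relative phrases to an ISO 8601 string, else None."""
    text = phrase.strip().lower()
    if text == "today":
        target = today
    elif text == "tomorrow":
        target = today + timedelta(days=1)
    elif text.startswith("next ") and text[5:] in WEEKDAYS:
        wanted = WEEKDAYS.index(text[5:])
        # Days until the first such weekday strictly after `today` (1 to 7).
        delta = (wanted - today.weekday() - 1) % 7 + 1
        target = today + timedelta(days=delta)
    else:
        return None  # anything else is left for the caller to handle
    return datetime.combine(target, time.min).isoformat()


if __name__ == "__main__":
    print(resolve_relative("next Friday", date(2026, 4, 24)))
```

Returning `None` for unrecognized phrases keeps the function honest: the pipeline can then lower the confidence or ask the user instead of guessing.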

The extraction prompt that actually worked

Getting the LLM to reliably output JSON (and only JSON) took more iteration than I expected. The trick was treating the model like a data parser in the system prompt, not like a conversationalist:

EXTRACTION_SYSTEM_PROMPT = """You are a deadline and event extraction engine.
Your ONLY output is a valid JSON array. No prose. No markdown. No explanation.

Given a message, extract every actionable deadline, due date, meeting, or event.
For each item output:
{
  "title": string,
  "date_raw": string,       // verbatim from the message
  "date_iso": string|null,  // resolved ISO 8601 if possible, else null
  "confidence": float,      // 0.0 to 1.0
  "source_quote": string,   // the exact fragment that contains the deadline
  "event_type": string      // deadline | meeting | exam | submission | event | other
}

Confidence guide:
  1.0  — explicit date + time + clear action
  0.85 — explicit date, no time
  0.70 — relative date that can be resolved
  0.55 — vague but likely actionable
  0.30 — might be an event, very unclear
  0.10 — references a past deadline

If NO events found, return: []
Today's date is injected in the user message."""

The source_quote field was an afterthought that turned out to be the most useful field in the whole schema. Every Google Calendar event created by ChronoAgent has the original message fragment in its description. When I look at a calendar event three weeks later and don't remember what it's about, I can see "Original: 'bro the quiz is moved to Thursday 10am right?' — Source: WhatsApp, Sender: Abdullah". That's enough context.
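
Even with a strict prompt, the model's output is worth validating before it reaches the rest of the pipeline. The sketch below checks the response against the schema in the prompt above; the function name `parse_extraction` and the decision to skip bad items silently are assumptions, not the author's code.

```python
import json

REQUIRED_KEYS = {"title", "date_raw", "date_iso", "confidence",
                 "source_quote", "event_type"}


def parse_extraction(raw):
    """Parse model output into a list of valid event dicts.

    Returns [] for anything that is not a JSON array, and skips items
    that miss a required key or carry an out-of-range confidence.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, list):
        return []

    events = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        confidence = item["confidence"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue
        if not 0.0 <= confidence <= 1.0:
            continue
        events.append(item)
    return events


if __name__ == "__main__":
    print(parse_extraction("Sure! Here is the JSON you asked for."))
```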

The deduplication problem

This was the hardest part of the whole project by a significant margin.

The naive approach: ask the LLM "is this event already in my calendar?" Tried it. Terrible. Slow, expensive, and the model would confidently say "no duplicate found" when there obviously was one.

The approach that works is a 2-out-of-3 vote between three deterministic checks:

Fuzzy title match (Levenshtein ratio ≥ 0.85)
Date proximity (within ±24 hours)
Semantic similarity (cosine similarity using a local sentence-transformers model, threshold 0.80)

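from datetime import datetime

# lev_ratio, cosine_similarity, and get_or_compute_embedding are helpers
# that are not shown in this snippet.
def is_duplicate(candidate, existing, embeddings_cache):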
    votes = 0

    # Check 1: title similarity
    if lev_ratio(candidate["title"].lower(), existing["title"].lower()) >= 0.85:
        votes += 1

    # Check 2: date proximity
    if candidate.get("date_iso") and existing.get("date_iso"):
        a = datetime.fromisoformat(candidate["date_iso"])
        b = datetime.fromisoformat(existing["date_iso"])
        if abs((a - b).total_seconds()) <= 86400:  # 24 hours
            votes += 1

    # Check 3: semantic similarity
    vec_a = get_or_compute_embedding(candidate["title"], embeddings_cache)
    vec_b = get_or_compute_embedding(existing["title"], embeddings_cache)
    if cosine_similarity(vec_a, vec_b) >= 0.80:
        votes += 1

    return votes >= 2

"Assignment 3 due Friday" and "A3 submission end of week" — different titles, so Levenshtein fails. But they're semantically similar and the dates are within 24 hours, so they correctly merge as duplicates.

The semantic model I used (all-MiniLM-L6-v2) is 80MB and runs locally. No API call, no cost, ~10ms inference time. The LLM is completely out of the loop for dedup.
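
The `is_duplicate` function calls two helpers, `cosine_similarity` and `get_or_compute_embedding`, whose bodies the post does not show. A minimal sketch of what they could look like follows. To stay dependency-free it uses `toy_embed`, a character-bigram counter that stands in for a real model and is not semantic at all; the extra `embed` parameter is also an assumption. With sentence-transformers the vectors would be NumPy arrays and the cosine would be computed on arrays instead of dicts.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: character-bigram counts. Not a semantic model."""
    lowered = text.lower()
    return Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(key, 0) for key, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def get_or_compute_embedding(text, cache, embed=toy_embed):
    """Return the cached vector for `text`, computing it on first use."""
    if text not in cache:
        cache[text] = embed(text)
    return cache[text]


if __name__ == "__main__":
    cache = {}
    a = get_or_compute_embedding("Assignment 3 due Friday", cache)
    b = get_or_compute_embedding("Assignment 3 due on Friday", cache)
    print(round(cosine_similarity(a, b), 3), len(cache))
```

The cache matters because the same existing calendar titles are compared against every new candidate, so each title only needs to be embedded once.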

The 3 AM quota problem (a real thing that happened)

Midway through building this, I burnt through my initial API credits testing the extraction pipeline on WhatsApp group traffic. A college group chat is... a lot of messages. Most of them "ok", "haha", "send notes pls" — but the extraction call fires on all of them before it knows they're non-events.

I had two options: add pre-filtering, or switch to a cheaper model.

I did both. Added a simple length and keyword pre-filter that drops messages under 15 words with no date-adjacent terms before they even hit the LLM. Dropped API usage by about 70% immediately.
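
A pre-filter along these lines takes only a few lines of Python. The 15-word threshold comes from the description above. The keyword list is a guess, since the post does not show which "date-adjacent terms" it uses, and the function name `should_extract` is hypothetical.

```python
import re

MIN_WORDS = 15  # threshold mentioned in the post

# Hypothetical "date-adjacent" terms; the author's real list is not shown.
DATE_HINT = re.compile(
    r"\b(today|tomorrow|tonight|monday|tuesday|wednesday|thursday|friday|"
    r"saturday|sunday|deadline|due|submit|submission|exam|quiz|meeting|"
    r"\d{1,2}(:\d{2})?\s?(am|pm)|\d{1,2}[/-]\d{1,2})\b",
    re.IGNORECASE,
)


def should_extract(text):
    """Return True if the message is worth sending to the LLM."""
    if len(text.split()) >= MIN_WORDS:
        return True  # long messages always go through
    return DATE_HINT.search(text) is not None


if __name__ == "__main__":
    for message in ("ok", "send notes pls", "quiz moved to Thursday 10am"):
        print(repr(message), should_extract(message))
```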

Then switched from Claude Sonnet to Gemini 2.5 Flash for the extraction step. Flash is significantly cheaper for this kind of structured output task and the JSON reliability was equivalent in my testing. I still use Claude for anything that requires more complex reasoning — the confidence routing logic and the digest generation — but for the high-volume extraction pass, Flash works well.

The lesson: the best agentic systems aren't the ones throwing the strongest model at every task. They're the ones with smart orchestration about when to call what.

What "passive operation" actually means in practice

The phrase "runs in the background" is easy to say and actually kind of hard to build well.

OpenClaw's Gateway is a daemon — you run openclaw onboard --install-daemon and it sets up a launchd/systemd service that starts automatically on login and stays running. The WhatsApp channel maintains a persistent Baileys session. Cron jobs handle the Gmail polling every 4 hours.

The Standing Order hook fires on every inbound WhatsApp message without any scheduler. The agent just... responds to events. It's not polling. It's not a loop. It's closer to how a web server handles requests — always listening, processes when something arrives, goes quiet otherwise.

The Gmail part is a Python script (ingest_email.py) that the OpenClaw cron calls every 4 hours:

openclaw cron add \
  --name email-ingestion \
  --cron "0 */4 * * *" \
  --timeout-seconds 120 \
  --message "Execute email ingestion per standing orders. Run ingest_email.py, process queue, run extraction and dedup on each envelope, route by confidence, write calendar entries, report summary."

The cron message references the Standing Order rather than duplicating the logic. The agent reads its AGENTS.md, knows the full procedure, and executes it. I don't need to write a long prompt into the cron command because the authority is already defined.

What I'd build next if I had more time

The biggest gap right now is thread context. When someone says "bro don't forget about the thing on Friday" — that's a 0.3 confidence extraction at best. But if I had the last 5 messages of that WhatsApp thread, I'd probably know what "the thing" is.

OpenClaw's session history tools make this possible but I didn't have time to implement it cleanly before the deadline. It's the next feature.

The other thing is cross-source entity resolution. Right now if the same event appears in email and WhatsApp, the dedup engine usually catches it. But it runs independently per message — there's no weekly "let me look at everything I've collected and find clusters" pass. I have a consolidation script written but not wired into the cron yet.

ClawCon Michigan

I didn't attend ClawCon Michigan this time around, but building ChronoAgent has made me want to.

The honest summary

ChronoAgent is not a product. It's a personal tool I built because I was genuinely failing at calendar management and the existing solutions didn't work for how I actually communicate (mostly WhatsApp, some email, basically no structured calendar input).

OpenClaw made the WhatsApp part possible without fighting through Business API approvals. The Standing Orders made the "passive, always-on, doesn't need prompting" part possible without writing a custom daemon. The exec tool made it easy to keep the heavy lifting in plain Python scripts that I can test independently.

The thing I keep coming back to from this project: the right job for the LLM in an agentic system is usually much smaller than you initially think. Translation, intent recognition, natural language output — yes. Calendar math, deduplication logic, date resolution — no. The moment I moved those out of the LLM and into deterministic code, the whole system became faster, cheaper, and more reliable.

Since I set it up, I've had zero missed deadline incidents. That's the only metric that matters.

How I Hijacked YouTube Music's DOM to Build a Custom Mini-Player 🎵

Kanak Waradkar — Thu, 23 Apr 2026 01:58:18 +0000

The Challenge: Dealing with Aggressive UI "Pruning"

YouTube Music is built with a highly responsive engine. When you narrow the window to a "mini" size (essential for a browser extension mini-player), YTM starts aggressively pruning the interface. Among the first things to go are the Like and Dislike buttons. They aren't just hidden; they are often completely stripped from the active layout to save space.

For a multitasking user, having to expand the player just to "Like" a song is a major friction point.

The Strategy: Native Reparenting (DOM Hijacking)

Initially, I tried simulating keyboard shortcuts (Shift + =), but found it unreliable when the window lost focus. I also tried creating custom buttons, but syncing their state (Blue for liked, Red for disliked) with YTM's internal player was complex and prone to breaking.

The solution? Native Reparenting. Instead of building new buttons, I "stole" the official ones and moved them into my own persistent UI.

Step 1: Locating the Native Renderer

Even when hidden, the native ytmusic-like-button-renderer component often persists in the DOM (hidden in the background player page). We simply need to find it.

const nativeRenderer = document.querySelector('ytmusic-like-button-renderer');

Step 2: Creating a Custom Pill Container

To give the buttons a modern, floating feel, I created a custom "Pill Bar" with glassmorphism effects.

const container = document.createElement('div');
container.id = 'ytm-pill-container';
document.body.appendChild(container);

Step 3: The "Hijack"

This is the magic line. By appending the native renderer to our new container, we move the actual functional component—including its click handlers and state management—out of its original location and into our floating UI.

if (nativeRenderer && nativeRenderer.parentElement !== container) {
    // Clear any existing content to prevent duplicates
    container.innerHTML = ''; 
    container.appendChild(nativeRenderer);
}

Step 4: Overriding the Style Engine

Because YTM's CSS still thinks these buttons should be hidden in narrow windows, we have to use high-priority !important rules to force them back into visibility.

#ytm-pill-container ytmusic-like-button-renderer {
    display: flex !important;
    visibility: visible !important;
}

The Result: A Robust, Draggable UI

By using the native components, we get 100% reliable Like/Dislike functionality. We even added draggability so users can move the pill bar anywhere in the window.

🚀 Links

GitHub Repository: Labreo/ytm-miniplayer
Install Firefox: YTM Mini Mode
Chrome: (Coming Soon)

Happy coding!

Tracking ML Experiments with MLflow: A Simple Guide for Beginners

Kanak Waradkar — Thu, 10 Jul 2025 23:36:41 +0000

Originally published on Hashnode

Introduction:

This blog draws inspiration from the excellent MLflow tutorial by CodeBasics, which clearly demonstrates the core concepts we will be discussing here. If you are looking for a more detailed or visual walkthrough, I highly recommend checking it out.

This post is written from the perspective of a beginner and aims to offer a more hands-on, less theoretical explanation of MLflow, based on my own experience implementing it in a real project. Think of it as a beginner learning out loud.

What is MLflow?

MLflow is an open-source MLOps tool that helps you track, log, and manage everything involved in machine learning experiments including metrics, parameters, models, and other useful artifacts.

It is especially useful when you are trying out different models or hyperparameters and need a way to compare them without losing track. Instead of manually noting results, MLflow stores everything in one place, helping you stay organized and productive.

For example, in my Titanic survival prediction project, I used MLflow to log different models such as Logistic Regression and Random Forest, track accuracy scores, and compare results using the MLflow UI. This made it easy to identify the best-performing model for deployment.

How to install and set up MLflow?

Installing MLflow into your Python environment is pretty straightforward. Just run:

pip install mlflow

To set up MLflow, we will use my titanic-model code, available in my GitHub repository, as a reference. For this example, I trained and compared three different models: Logistic Regression, Random Forest, and Gradient Boosting Classifier.

Here's a simplified version of the model setup:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

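# Each entry: (run name, params to log, estimator built with those same params)
models = [
    (
        "Logistic Regression",
        {
            "class_weight": None,
            "random_state": 8888,
            "solver": "lbfgs",
            "max_iter": 100
        },
        LogisticRegression(class_weight=None, random_state=8888, solver="lbfgs", max_iter=100),
    ),
    (
        "Random Forest",
        {
            "n_estimators": 100,
            "random_state": 42
        },
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
    (
        "Gradient Boosting Classifier",
        {
            "n_estimators": 100,
            "learning_rate": 1.0,
            "max_depth": 1,
            "random_state": 0
        },
        GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0),
    ),
]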

Now we integrate MLflow to log the model parameters and performance metrics for each run.

import mlflow
import mlflow.sklearn

# Set the tracking URI (for local use; an MLflow server must already be running on this port)
mlflow.set_tracking_uri("http://localhost:5000")

# Set the experiment name
mlflow.set_experiment("Accuracy Model v3")

for i, element in enumerate(models):
    model_name = element[0]
    params = element[1]
    model = element[2]
    report = reports[i]  # Assumed: the classification_report(..., output_dict=True) result for this model

    with mlflow.start_run(run_name=model_name):
        mlflow.log_params(params)
        mlflow.log_metrics({
            'accuracy': report['accuracy'],
            'recall_class_1': report['1']['recall'],
            'recall_class_0': report['0']['recall'],
            'f1_score_macro': report['macro avg']['f1-score']
        })

        # Log the trained model
        mlflow.sklearn.log_model(model, "model", registered_model_name=model_name)
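The loop above assumes a reports list already exists. One way it could be built is sketched below. The synthetic dataset from make_classification is a stand-in I chose so the snippet runs on its own (the actual project uses the Titanic data), and the two-model list is a shortened, illustrative version of the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your own features and labels).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr_params = {"solver": "lbfgs", "max_iter": 100, "random_state": 8888}
rf_params = {"n_estimators": 100, "random_state": 42}

# Same (name, params, estimator) shape as the models list in the post.
models = [
    ("Logistic Regression", lr_params, LogisticRegression(**lr_params)),
    ("Random Forest", rf_params, RandomForestClassifier(**rf_params)),
]

reports = []
for model_name, params, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # output_dict=True returns a nested dict instead of a formatted string,
    # with keys such as "0", "1", "accuracy", and "macro avg".
    reports.append(classification_report(y_test, y_pred, output_dict=True))

print(sorted(reports[0].keys()))
```

Because reports is filled in the same order as models, reports[i] lines up with models[i] in the logging loop.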

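To view your runs visually, launch the MLflow tracking UI. Because the script above logs to http://localhost:5000, this server needs to be running before you execute it: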

mlflow ui

Then open your browser at http://127.0.0.1:5000. You will see a dashboard that shows all your logged experiments, parameters, metrics, and models.

Logging Parameters, Metrics, and Models

MLflow provides a clean API to track everything that matters in your experiments.

Here are the three main functions you will use:

Logging Parameters

mlflow.log_param("learning_rate", 0.1)
mlflow.log_param("n_estimators", 100)

These are the hyperparameters of your model that you might want to compare across runs.

Logging Metrics

mlflow.log_metric("accuracy", 0.875)

You can log accuracy, precision, recall, or any other custom metric you calculate.

Logging the Model

mlflow.sklearn.log_model(model, "model")

This saves your trained model in a format you can reload or even serve later.

What Gets Stored and Where?

Once you run your experiment, MLflow's default file-based store keeps everything inside a folder called mlruns in your project directory. Each experiment is given a unique ID, each run gets its own subfolder within it, and inside a run's folder you will find:

A params folder containing your logged parameters
A metrics folder for evaluation scores
An artifacts folder which includes the saved model

You can compare runs side by side in the UI, which makes model selection and debugging much easier. All of this is visible in the web application, and MLflow does most of the heavy lifting of managing and consolidating the data. Feel free to explore the UI and draw your own conclusions.
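If you are curious about what is actually on disk, the folder can also be inspected directly. The sketch below assumes the file-based layout mlruns/<experiment_id>/<run_id>/params/<param_name>, with one small text file per parameter. The helper name read_logged_params is hypothetical, and the layout may differ if you use a database backend.

```python
from pathlib import Path


def read_logged_params(mlruns_dir):
    """Map "experiment_id/run_id" to that run's {param_name: value} dict."""
    results = {}
    # Assumed layout: mlruns/<experiment_id>/<run_id>/params/<param_name>
    for params_dir in sorted(Path(mlruns_dir).glob("*/*/params")):
        run_dir = params_dir.parent
        key = f"{run_dir.parent.name}/{run_dir.name}"
        results[key] = {
            p.name: p.read_text().strip()
            for p in params_dir.iterdir()
            if p.is_file()
        }
    return results


# Only runs if an mlruns folder exists in the current directory.
if Path("mlruns").is_dir():
    for run_key, params in read_logged_params("mlruns").items():
        print(run_key, params)
```

For day-to-day work the UI is the more comfortable option; this is mainly useful for understanding that the tracking data is just files.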

Things I Learned (or Struggled With)

Like any beginner working with a new tool, I ran into a few bumps while setting up MLflow and learned a lot in the process.

Confusion About MLflow Runs

At first, I didn’t fully understand what a “run” was in MLflow. I was running multiple models, but everything was showing up under the same run or being overwritten. I realized I had forgotten to wrap each model in its own mlflow.start_run() call with a unique run_name. This may fix it for you:

with mlflow.start_run(run_name="Gradient Boosting"):
    ...

MLflow UI Not Launching

I tried running mlflow ui and nothing happened. It turned out I had a port conflict on 5000, as another local server was already running. Even after I addressed that, it still wouldn’t start. The command below got it running for the first time, after which it launched normally with mlflow ui.

mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./artifacts \
  --host 0.0.0.0 \
  --port 5000

Then you should be able to access everything as normal.
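A quick way to check for the kind of port conflict described above is to try connecting to the port from Python. This is a rough sketch using only the standard library, and the helper name port_in_use is my own.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0


if port_in_use(5000):
    print("Port 5000 is busy; try another one, e.g. mlflow ui --port 5001")
else:
    print("Port 5000 looks free")
```

If you do switch to a different port, remember that the URL passed to mlflow.set_tracking_uri has to use the same port.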

Next Steps

Now that I’ve successfully set up MLflow and tracked multiple models in my Titanic prediction project, here’s what I plan to learn next:

DVC (Data Version Control): To manage data and model files across versions and collaborators.
Prefect: To automate my training pipeline and possibly run it on a schedule.

I’m also working on a blog series covering each of these topics as I learn them.

Want to Follow Along?

You can check out my code and future updates here:

https://github.com/Labreo/titanic-ml

I also post progress daily on Twitter:

https://x.com/Kaker0th

If you're just starting with MLOps too, feel free to reach out or share what you're building. Let's learn and grow together!