<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kaslin Fields</title>
    <description>The latest articles on Forem by Kaslin Fields (@kaslin).</description>
    <link>https://forem.com/kaslin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1269621%2Ffe9f8b89-5146-4e26-9128-e5b77781955a.jpeg</url>
      <title>Forem: Kaslin Fields</title>
      <link>https://forem.com/kaslin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kaslin"/>
    <language>en</language>
    <item>
      <title>How I Made an MCP Server That Saves Me an Hour per Week</title>
      <dc:creator>Kaslin Fields</dc:creator>
      <pubDate>Wed, 29 Oct 2025 18:09:55 +0000</pubDate>
      <link>https://forem.com/googleai/how-i-made-an-mcp-server-that-saves-me-an-hour-per-week-3k8k</link>
      <guid>https://forem.com/googleai/how-i-made-an-mcp-server-that-saves-me-an-hour-per-week-3k8k</guid>
      <description>&lt;p&gt;Recently, I’ve been travelling around the country to help engineers learn how to build MCP Servers and AI Agents serverlessly on Cloud Run in our &lt;a href="//goo.gle/accelerate-ai"&gt;Accelerate AI with Cloud Run workshops&lt;/a&gt;. Attendees often ask, how can I use what I learned in &lt;strong&gt;&lt;em&gt;my&lt;/em&gt;&lt;/strong&gt; use case? This blog tells the story of how I used the &lt;a href="//goo.gle/aaiwcr-1"&gt;first hands-on lab from that workshop&lt;/a&gt; to build something that saves me time and effort in my real day-to-day work!&lt;/p&gt;

&lt;p&gt;As a co-host of the &lt;a href="https://kubernetespodcast.com/" rel="noopener noreferrer"&gt;Kubernetes Podcast from Google&lt;/a&gt;, I love the conversations and learning about cool technology and use cases. But like with any content series, there’s so much time and effort that goes into running and publishing each episode. In this article, I explain &lt;strong&gt;how&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;why&lt;/em&gt;&lt;/strong&gt; I built an MCP server to simplify and speed up our publishing process. You can also try out my solution yourself by checking out the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/containers/podcast-assistant-mcp" rel="noopener noreferrer"&gt;code for my podcast-assistant-mcp server on GitHub&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Podcast Publishing Faster and Easier
&lt;/h2&gt;

&lt;p&gt;There are a lot of steps involved with running a podcast: picking guests and topics, scheduling interviews, editing episodes, and so on. You might think that publishing should be the fastest part of the process: you just click a button and it’s out, right?! Alas, dear reader, how I wish it were so quick and easy.&lt;/p&gt;

&lt;p&gt;For every episode, our publishing workflow looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Listen to the episode for quality and write show notes. &lt;em&gt;(This is perhaps the most time-consuming, highest-effort step in the whole process.)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Upload the final audio to our publishing platform and fill out all the metadata fields (show notes, tags, etc.)&lt;/li&gt;
&lt;li&gt;Update &lt;a href="//kubernetespodcast.com"&gt;kubernetespodcast.com&lt;/a&gt; by opening a pull request in the repo (our site is built with &lt;a href="https://gohugo.io/" rel="noopener noreferrer"&gt;Hugo&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Promote the new episode on social media (X, LinkedIn, BlueSky, Reddit, etc.).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The sheer number of manual steps in our publishing process makes it a significant time sink. I needed a better way, and I saw an opportunity for AI to help.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: An AI-Based Script
&lt;/h3&gt;

&lt;p&gt;AI excels at summarizing and transforming content. Many of the most time-consuming aspects of our publishing workflow are essentially summarization tasks. This presented a clear opportunity to leverage AI for drafting publishing artifacts such as show notes and social media posts.&lt;/p&gt;

&lt;p&gt;To test out my theory that AI could save me time with these summarization steps, I created a simple Python script that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Took an audio file as input.&lt;/li&gt;
&lt;li&gt;Used AI to convert the audio to a transcript.&lt;/li&gt;
&lt;li&gt;Used the transcript to generate drafts of show notes and social media posts.&lt;/li&gt;
&lt;/ul&gt;
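
&lt;p&gt;As a rough sketch, the script’s flow looked something like this (hypothetical names; the &lt;code&gt;generate&lt;/code&gt; callable stands in for the actual google-genai calls to Gemini):&lt;/p&gt;

```python
from typing import Callable

def run_publishing_pipeline(audio_uri: str, generate: Callable[[str, str], str]) -> dict:
    """Sketch of the original script's flow: audio, then transcript, then drafts.

    `generate(task, source)` stands in for the google-genai call to Gemini;
    the real script sent the audio file and transcript text to the model.
    """
    transcript = generate("transcribe with speaker labels and timestamps", audio_uri)
    return {
        "transcript": transcript,
        "shownotes": generate("draft markdown show notes", transcript),
        "social_posts": generate("draft social media posts", transcript),
    }
```

&lt;p&gt;One nice property of injecting the model call as a parameter is that each step can be exercised with a stub model before spending any API quota.&lt;/p&gt;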

&lt;p&gt;With just a quick Python script, I significantly reduced the time I was spending meticulously listening to episodes to write down show notes. While I still review and edit everything I publish, the AI-drafted show notes give me something to work off of, minimizing the need for frequent note-taking pauses while listening to the episode for quality. Additionally, the social media drafts streamlined the preparation of posts for publishing. I began using the script regularly in early 2025 and it consistently saved 1.5-2 hours per episode, becoming an integral part of my publishing process over the last several months.&lt;/p&gt;

&lt;p&gt;But there was a problem. You see, I’m not the only host for the Kubernetes Podcast, and I’m not always the one doing the publishing. The issue with my wonderful time-saving script was: &lt;strong&gt;I was the only one using it!&lt;/strong&gt; Sure, I shared the script with my co-hosts, but its lack of integration into their existing publishing workflow meant they weren't leveraging it to reclaim valuable publishing time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: From Local Script to Shareable MCP Server
&lt;/h3&gt;

&lt;p&gt;With a new goal aimed at saving my co-hosts time, I started to explore ways to share my automation more effectively, and to expand the initial script to further optimize our publishing workflow.&lt;/p&gt;

&lt;p&gt;Initially, I envisioned a comprehensive, wizard-style web application that would manage the entire process, incorporating a "human in the loop" review step. This approach would centralize our publishing tools and processes, streamlining onboarding for new hosts.&lt;/p&gt;

&lt;p&gt;Then I had to deal with reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I’m a Kubernetes/Cloud Infrastructure expert - I don't have the front-end skills to build a multi-step web app quickly!&lt;/li&gt;
&lt;li&gt;Many of the APIs required to publish to all our platforms are not readily available. 😞&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, I lacked the time to develop such a comprehensive solution. Thus, I came back to the age-old wisdom: prioritizing simplicity. To solve my problem, I really needed a system that my team could easily adopt to achieve immediate time savings. And thinking about the longer term, I sought an extensible architecture, something that would allow me to incrementally build out the ideal publishing process as time permits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; Create an MCP (Model Context Protocol) server! This approach had several key benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It's a simple user interface&lt;/strong&gt;, mostly implemented in backend code (no need to learn frontend languages or libraries).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It integrates with the &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;&lt;/strong&gt;, which our team already uses. This meant no new tools for them to learn or add to their workflows!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's super extensible!&lt;/strong&gt; I can add new tools (like generate_hugo_post or publish_to_platform_x) one by one, incrementally, when I have the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leveraged existing work&lt;/strong&gt;: I utilized two existing pieces of content to build this MCP server fast:

&lt;ul&gt;
&lt;li&gt;I was able to reuse the MCP Server design and deployment methodology from the first hands-on lab from our &lt;a href="http://goo.gle/accelerate-ai" rel="noopener noreferrer"&gt;Accelerate AI with Cloud Run workshop series&lt;/a&gt;, which demonstrates &lt;a href="http://goo.gle/aaiwcr-1" rel="noopener noreferrer"&gt;running a secure MCP server on Cloud Run&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;And to customize it to my use case, I converted the Python script I had already developed into an MCP server. The conversion was amazingly easy: I just integrated the FastMCP library and modified the output mechanism to ensure each function independently saved its output to a storage bucket.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
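
&lt;p&gt;To give a feel for how small that conversion can be, here’s a minimal sketch (not the actual server.py: &lt;code&gt;_call_model&lt;/code&gt; is a stand-in for the google-genai call, and the FastMCP import is deferred so the generation logic stays runnable without the MCP runtime installed):&lt;/p&gt;

```python
from typing import Callable

def _call_model(prompt: str) -> str:
    # Stand-in for the real google-genai call to Gemini; replace with
    # client.models.generate_content(...) in an actual server.
    return f"[draft for prompt of {len(prompt)} chars]"

def generate_shownotes(transcript: str, episode_name: str) -> str:
    """Formerly a plain script function; the logic itself is unchanged."""
    prompt = f"Draft markdown show notes for '{episode_name}':\n{transcript}"
    return _call_model(prompt)

def build_server():
    # Deferred import keeps the generation logic above testable even
    # when the MCP runtime is not installed.
    from fastmcp import FastMCP  # assumed dependency, as in the repo
    mcp = FastMCP("podcast-assistant")
    mcp.tool(generate_shownotes)  # register the existing function as an MCP tool
    return mcp
```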

&lt;h2&gt;
  
  
  How the Podcast Assistant MCP Server Works
&lt;/h2&gt;

&lt;p&gt;I built the server, which I call my podcast-assistant-mcp, using FastMCP and Google's GenAI libraries, all containerized via Docker and ready for deployment to &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x4cdf54f9_default_b455571401&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Stack
&lt;/h3&gt;

&lt;p&gt;The project is straightforward. Just like in &lt;a href="//goo.gle/aaiwcr-1"&gt;the lab&lt;/a&gt;, I use the &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;command line tool uv&lt;/a&gt; to manage my dependencies, which means I have a &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml file&lt;/a&gt; that defines the key dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener noreferrer"&gt;FastMCP&lt;/a&gt;: For creating the MCP server itself.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://googleapis.github.io/python-genai/" rel="noopener noreferrer"&gt;google-genai&lt;/a&gt;: To interact with the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash?utm_campaign=CDR_0x4cdf54f9_default_b455571401&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini 2.5 Flash model on Vertex AI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/python/docs/reference/storage/latest?utm_campaign=CDR_0x4cdf54f9_default_b455571401&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;google-cloud-storage&lt;/a&gt;: For reading audio files and writing the text outputs to a Google &lt;a href="https://cloud.google.com/storage/docs/json_api/v1/buckets?utm_campaign=CDR_0x4cdf54f9_default_b455571401&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Storage Bucket&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
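
&lt;p&gt;A minimal pyproject.toml along these lines captures those dependencies (an illustrative sketch; the real file in the repo is the source of truth for names and versions):&lt;/p&gt;

```toml
[project]
name = "podcast-assistant-mcp"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "fastmcp",
    "google-genai",
    "google-cloud-storage",
]
```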

&lt;h3&gt;
  
  
  The MCP Server Tools
&lt;/h3&gt;

&lt;p&gt;When I created the code for this project as a Python script, it was designed to be sequential, where the output of one tool serves as input for the next. In an MCP server, each component can be called independently, which offers more flexibility. For instance, if we’ve already created a shownotes file manually (which we do occasionally), or if you already have a transcript and you want to use it to generate social media &amp;amp; blog posts (which we might do for old episodes), the AI Podcast Assistant is flexible enough to assist with any specific part of the publishing process it supports. It can also still do everything it has tools for, and knows how to do them in a sensible order! This flexibility will be even more useful as we develop more tools for the server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/containers/podcast-assistant-mcp" rel="noopener noreferrer"&gt;This project&lt;/a&gt; consists of 3 main files, a &lt;a href="http://server.py" rel="noopener noreferrer"&gt;server.py&lt;/a&gt; that defines the MCP server itself, and a &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; and &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml file&lt;/a&gt; for deployment.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/server.py" rel="noopener noreferrer"&gt;server.py file&lt;/a&gt;&lt;/code&gt; defines 4 main tools, and the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/README.md" rel="noopener noreferrer"&gt;README&lt;/a&gt; includes instructions to make them &lt;strong&gt;accessible via the &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;&lt;/strong&gt;. As a rule, we &lt;strong&gt;&lt;em&gt;treat all generated content as a first draft&lt;/em&gt;&lt;/strong&gt;: we make sure to carefully review and edit anything we’re going to publish.&lt;/p&gt;

&lt;h4&gt;
  
  
  Function 1. generate_transcript(audio_file_uri: str, episode_name: str) -&amp;gt; str
&lt;/h4&gt;

&lt;p&gt;This is the starting point. It takes a GCS URI for an .mp3 or .wav file and uses a detailed prompt to transcribe it. The prompt is specific, asking for speaker labels and timestamps, which is crucial for making show notes that I can then &lt;em&gt;&lt;span&gt;confirm the quality of&lt;/span&gt;&lt;/em&gt;, either by hand, or with a quality evaluation tool. It then saves the transcript to GCS and returns the new URI.&lt;/p&gt;
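
&lt;p&gt;The pure pieces of a tool like this, the prompt text and the output naming, might be factored out roughly like so (hypothetical helpers, not the actual server.py code):&lt;/p&gt;

```python
def transcription_prompt() -> str:
    # The production prompt is more detailed; the essential requirements
    # are speaker labels and timestamps so the output can be spot-checked.
    return (
        "Transcribe this podcast audio. Label every speaker and add a "
        "[HH:MM:SS] timestamp at each speaker change."
    )

def transcript_uri(bucket: str, episode_name: str) -> str:
    # Each tool writes its output to GCS and returns the new URI so the
    # next tool in the chain can consume it.
    return f"gs://{bucket}/{episode_name}/transcript.txt"
```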

&lt;h4&gt;
  
  
  Function 2. generate_shownotes(transcript_gcs_uri: str, episode_name: str) -&amp;gt; str
&lt;/h4&gt;

&lt;p&gt;This tool takes the transcript's GCS URI. In our production server, it uses a prompt specifically tailored to our podcast's style, which means a shownotes file in markdown format, that essentially consists of a set of links to relevant additional materials for anyone who wants to learn more about the technologies mentioned in the episode.&lt;/p&gt;

&lt;p&gt;For the GitHub repo, I used a slightly more generic take on the show notes concept that summarizes the episode. If you want to try using this for yourself, you can always edit the prompt to make it work differently! Whether in our production environment or if deployed from the repo, this function saves the markdown show notes file to GCS and returns that file's URI.&lt;/p&gt;

&lt;h4&gt;
  
  
  Function 3. generate_blog_post(transcript_gcs_uri: str, episode_name: str) -&amp;gt; str
&lt;/h4&gt;

&lt;p&gt;The third function takes the transcript and generates a full, engaging blog post in Markdown format, saving the result to GCS. This is where the MCP server’s extensibility comes into play, because &lt;a href="http://kubernetespodcast.com" rel="noopener noreferrer"&gt;kubernetespodcast.com&lt;/a&gt; does not currently host a blog; we want to add one at some point in the future. When that blog page is ready, this function’s drafts will give us a starting point, so we can spend our time on ensuring accuracy, linking to relevant reference material, and making sure the main points of the episode come across in blog form.&lt;/p&gt;

&lt;h4&gt;
  
  
  Function 4. generate_social_media_posts(transcript_gcs_uri: str, episode_name: str) -&amp;gt; str
&lt;/h4&gt;

&lt;p&gt;Finally, the fourth tool in this MCP server generates formatted drafts for X (under 280 characters) and LinkedIn, based on the transcript. And as with all the outputs from this MCP server, we review these and edit them before publishing.&lt;/p&gt;
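
&lt;p&gt;A tiny pre-review check along these lines can flag drafts that blow past X’s limit (a simplified sketch; note that X counts most URLs as 23 characters, which a plain length check ignores):&lt;/p&gt;

```python
X_CHAR_LIMIT = 280

def over_x_limit(draft: str) -> bool:
    """Flag X drafts that exceed the 280-character limit before review.

    A plain len() check is only an approximation for drafts containing
    links, since X wraps URLs and counts most of them as 23 characters.
    """
    return len(draft) > X_CHAR_LIMIT
```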

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;The whole point of implementing this as an MCP server is to share it with my co-hosts, which means deploying it on Cloud Run with authentication required: anyone we authorize can use it, as long as we tell them how! This should also make onboarding new hosts easier, especially as we implement more pieces of our publishing process as tools. For deployment, we have a simple Dockerfile for containerization and a pyproject.toml file for handling dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Right: Design Choices
&lt;/h2&gt;

&lt;p&gt;There are several choices that I made intentionally when designing this project. If you’re considering building your own MCP server, you might want to keep these considerations in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Right Model
&lt;/h3&gt;

&lt;p&gt;My initial script utilized Gemini 2.5 Pro, a powerful model for content generation. However, I encountered an issue where the Gemini CLI would time out while awaiting the tool’s output. After unsuccessfully troubleshooting timeout configuration changes, I switched to Gemini 2.5 Flash on Vertex AI. Flash proved fast enough to complete generation and save-to-bucket operations within the client’s timeout limit, enabling reliable workflow execution. While other models are viable, Gemini 2.5 Flash is currently my preferred choice for this application.&lt;/p&gt;

&lt;h3&gt;
  
  
  FastMCP Tools vs Prompts
&lt;/h3&gt;

&lt;p&gt;As a key dependency, FastMCP offers a feature allowing developers to register a prompt instead of a full tool. For a workflow like this, where the primary action involves taking a single input (the transcript URI) and generating a single, non-chainable output (e.g., show notes, blog posts, social media updates), a configured prompt might appear to be an ideal solution. Given that these tools are already integrated with an AI Model, it would logically follow to leverage that AI for the desired content generation.&lt;/p&gt;

&lt;p&gt;However, the primary objective of the Podcast Assistant extends beyond content generation to long-term content persistence in a storage bucket. The prompt feature is designed to output generated text directly to the user, typically via the Gemini CLI. For this server, generated text must be saved to a Google Cloud Storage (GCS) bucket, enabling its use in subsequent steps or easy download by co-hosts. Therefore, I opted to implement full tools—despite their increased complexity—as they provide the explicit control required for output storage and saving the GCS URI for the next step in the workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security in the Form of Authentication via Cloud Run
&lt;/h3&gt;

&lt;p&gt;The ability to share this MCP server with my co-hosts is what makes it a great solution - but I don’t want to pay the bill for everyone in the world to use it! That’s why I wanted to maintain the spirit of the original codelab, “How to deploy a &lt;strong&gt;secure&lt;/strong&gt; MCP server on Cloud Run,” and require authentication to use my MCP server.&lt;/p&gt;

&lt;p&gt;There are ways to set up auth at the MCP server level, but I’m lazy and that sounded like more code, so I wanted to make use of the authentication at the Cloud Run level, which I did by deploying with the gcloud CLI flag &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt;.&lt;/p&gt;
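
&lt;p&gt;The deploy command ends up looking something like this (service name and region are illustrative):&lt;/p&gt;

```shell
# --no-allow-unauthenticated makes Cloud Run reject requests that lack
# valid Google credentials carrying the roles/run.invoker IAM role.
gcloud run deploy podcast-assistant-mcp \
  --source . \
  --region us-central1 \
  --no-allow-unauthenticated
```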

&lt;p&gt;In the codelab, authentication is handled essentially on a per-session basis by having the end user configure an ID Token that expires after 1 hour. This is fine for my podcast use case too, since I can add the ID Token creation as part of our publishing workflow, and actually using the server’s tools shouldn’t take more than a few minutes. But I also wanted to explore options that would be smoother for my limited number of intended users.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/containers/podcast-assistant-mcp/README.md" rel="noopener noreferrer"&gt;the README for my podcast-assistant-mcp server&lt;/a&gt;, I outline three different authentication options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option 1: Using an authentication token with Gemini CLI:&lt;/strong&gt; This method is quick to set up and is useful for temporary access to the MCP server. The token is valid for one hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option 2: Using a proxy with Gemini CLI:&lt;/strong&gt; This method is more robust and is recommended for continuous development. The proxy automatically handles authentication and forwards requests to the MCP server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[Example] Option 3: Using a service account&lt;/strong&gt;: Options 1 &amp;amp; 2 involve authenticating as a user via Gemini CLI. This method involves authenticating as a service or agent (rather than a user) via a service account. This project does not include such a service or agent, so this option is provided as an example for reference, and is not immediately usable just from deploying this project.&lt;/li&gt;
&lt;/ul&gt;
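
&lt;p&gt;For Option 1, the Gemini CLI configuration boils down to pointing an &lt;code&gt;mcpServers&lt;/code&gt; entry at the Cloud Run URL with the ID token in a header, something like this (a sketch; the exact settings keys can vary by Gemini CLI version, so follow the README):&lt;/p&gt;

```json
{
  "mcpServers": {
    "podcast-assistant": {
      "httpUrl": "https://YOUR-CLOUD-RUN-URL/mcp",
      "headers": {
        "Authorization": "Bearer YOUR-ID-TOKEN"
      }
    }
  }
}
```

&lt;p&gt;where the token comes from &lt;code&gt;gcloud auth print-identity-token&lt;/code&gt; and expires after an hour.&lt;/p&gt;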

&lt;p&gt;For my use case, either Option 1 or 2 will work fine. I might test out both with my co-hosts and see what they like better. Our preference and needs might also change as we continue to expand the MCP server’s capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A Workflow That Works
&lt;/h2&gt;

&lt;p&gt;While I still aspire to build a fully integrated, comprehensive solution, this MCP server is a practical tool that &lt;strong&gt;works&lt;/strong&gt;, &lt;strong&gt;integrates into our team's usual workflow&lt;/strong&gt;, and is &lt;strong&gt;extensible&lt;/strong&gt;, allowing me or my co-hosts to add more functionality as needed, benefiting everyone.&lt;/p&gt;

&lt;p&gt;This project has consistently saved significant time for me over the last several months. Now, my co-hosts can readily utilize it directly from their Gemini CLI, collectively saving valuable time that can be redirected to producing future episodes. Check out the straightforward, three-file implementation of this MCP server in &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/containers/podcast-assistant-mcp" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt;, and get started building your own useful, time-saving MCP servers!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>serverless</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Intro to Ray on GKE</title>
      <dc:creator>Kaslin Fields</dc:creator>
      <pubDate>Thu, 12 Sep 2024 19:28:31 +0000</pubDate>
      <link>https://forem.com/googlecloud/intro-to-ray-on-gke-38ee</link>
      <guid>https://forem.com/googlecloud/intro-to-ray-on-gke-38ee</guid>
      <description>&lt;p&gt;This post continues my AI exploration series with a look at the open source solution Ray and how it can support AI workloads on Google Kubernetes Engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Ray?
&lt;/h2&gt;

&lt;p&gt;Initially created at UC Berkeley in 2018, &lt;a href="https://www.ray.io/" rel="noopener noreferrer"&gt;Ray&lt;/a&gt; is an open-source unified compute framework to scale AI and Python workloads. The Ray project is predominantly managed and maintained by Anyscale. &lt;/p&gt;

&lt;p&gt;At a high level, Ray is made up of 3 layers: Ray AI Libraries (Python), Ray Core, and Ray Clusters. These layers enable Ray to act as a solution that spans the gap between data scientists/machine learning engineers and infrastructure/platform engineers. Ray enables developers to run their AI applications more easily and efficiently across distributed hardware via a variety of libraries. Ray enables data scientists/machine learning (ML) engineers to focus on the applications and models they’re developing. And finally, Ray provides infrastructure/platform engineers with tools to more easily and effectively support the infrastructure needs of these highly performance-sensitive applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs780xzv3tnwxfxw4a9ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs780xzv3tnwxfxw4a9ef.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This diagram from the &lt;a href="https://docs.ray.io/en/latest/ray-overview/index.html#ray-framework" rel="noopener noreferrer"&gt;Ray documentation&lt;/a&gt; shows how the Ray Libraries relate to Ray Core, while Ray Clusters are a distributed computing platform that can be run on top of a variety of infrastructure configurations, generally in the cloud.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Through its various Python libraries in the Ray AI Libraries and Ray Core layers, Ray provides ML practitioners with tools that simplify the challenge of running highly performance-sensitive distributed machine learning-style applications on hardware accelerators. Ray Clusters are a distributed computing platform where worker nodes run user code as Ray &lt;a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#tasks" rel="noopener noreferrer"&gt;tasks&lt;/a&gt; and &lt;a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#actors" rel="noopener noreferrer"&gt;actors&lt;/a&gt;. These worker nodes are managed by a head node which handles tasks like autoscaling the cluster and scheduling the workloads. Ray Cluster also provides a dashboard that gives a status of running jobs and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ray on Kubernetes with KubeRay
&lt;/h2&gt;

&lt;p&gt;Ray Clusters and Kubernetes clusters pair very well together. While Ray Clusters have been developed with a focus on enabling efficient distributed computing for hardware-intensive ML workloads, Kubernetes has a decade of experience in more generalized distributed computing. By running a Ray Cluster on Kubernetes, both Ray users and Kubernetes Administrators benefit from the smooth path from development to production that Ray’s Libraries combined with the Ray Cluster (running on Kubernetes) provide. &lt;a href="https://github.com/ray-project/kuberay?tab=readme-ov-file#kuberay" rel="noopener noreferrer"&gt;KubeRay&lt;/a&gt; is an operator which enables you to run a Ray Cluster on a Kubernetes Cluster.&lt;/p&gt;

&lt;p&gt;KubeRay adds 3 Custom Resource Definitions (CRDs) to provide Ray integration in Kubernetes. The RayCluster CRD enables the Kubernetes cluster to manage the lifecycle of its custom &lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html" rel="noopener noreferrer"&gt;RayCluster&lt;/a&gt; objects, including managing RayCluster creation/deletion, autoscaling, and ensuring fault tolerance. The custom &lt;a href="https://docs.ray.io/en/master/cluster/kubernetes/getting-started/rayjob-quick-start.html" rel="noopener noreferrer"&gt;RayJob&lt;/a&gt; object enables the user to define Ray jobs (&lt;a href="https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html" rel="noopener noreferrer"&gt;Ray Jobs&lt;/a&gt;, with a space!) and a submitter, which can either be directed to run the job on an existing RayCluster object, or to create a new RayCluster to be deleted upon that RayJob’s completion. The &lt;a href="https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html" rel="noopener noreferrer"&gt;RayService&lt;/a&gt; custom resource encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest.&lt;/p&gt;
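
&lt;p&gt;For a sense of what these look like in practice, here’s a minimal RayCluster manifest (illustrative values; see the KubeRay docs for a production-ready configuration):&lt;/p&gt;

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-raycluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
  workerGroupSpecs:
    - groupName: worker-group
      replicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```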

&lt;h2&gt;
  
  
  KubeRay on GKE with the Ray Operator Add-On
&lt;/h2&gt;

&lt;p&gt;The Ray Operator is an add-on for Google Kubernetes Engine (GKE) which is based on KubeRay and provides a smooth, native way to deploy and manage KubeRay resources on GKE. &lt;a href="https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke" rel="noopener noreferrer"&gt;Enabling the Ray Operator on your GKE Cluster&lt;/a&gt; automatically installs the KubeRay CRDs (RayCluster, RayJob, and RayService), enabling you to run Ray workloads on your cluster. You can enable the operator at cluster creation, or add it to an existing cluster via the &lt;a href="https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke#console" rel="noopener noreferrer"&gt;console&lt;/a&gt;, &lt;a href="https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke#gcloud" rel="noopener noreferrer"&gt;gcloud CLI&lt;/a&gt;, or IaC such as &lt;a href="https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/how-to/enable-ray-on-gke#terraform:~:text=gcloud-,Terraform,-The%20following%20Terraform" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;
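
&lt;p&gt;With the gcloud CLI, enabling the add-on looks roughly like this (cluster name is illustrative; check the GKE docs for the flags supported by your cluster version):&lt;/p&gt;

```shell
# Enable the Ray Operator when creating a new cluster:
gcloud container clusters create my-cluster --addons=RayOperator

# Or enable it on an existing cluster:
gcloud container clusters update my-cluster --update-addons=RayOperator=ENABLED
```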

&lt;p&gt;The Ray Operator Add-On is hosted by Google and does not run on your GKE nodes, so it adds no overhead to the cluster. You can also run the KubeRay operator on your GKE cluster without the Add-On, in which case the operator runs on your nodes and may add some slight overhead. In other words, the Add-On lets users run Ray on their GKE clusters without any added overhead from the operator itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=fJvCA3JU53I" rel="noopener noreferrer"&gt;This short video&lt;/a&gt; (~3.5 minutes) shows you how to add the Ray Operator to your cluster and demonstrates creating a RayCluster and running a Ray Job on that cluster running on GKE.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Ray on GKE
&lt;/h2&gt;

&lt;p&gt;In many organizations, the people creating AI applications and the people running and managing the GKE clusters and other infrastructure are different people, because the skills to do these types of activities well are specialized in nature. &lt;/p&gt;

&lt;p&gt;As a platform engineer, you may want to consider encouraging use of Ray as a single scalable ML platform that members of your organization could use to simplify the development lifecycle of AI workloads. If AI practitioners are using Ray, you can use the Ray Operator for GKE to simplify onboarding and integration into your existing GKE ecosystem. As a practitioner who is building AI applications, you may want to consider using or advocating for Ray if your organization is already using GKE and you want to reuse the same code between development and production without modification, and to leverage Ray’s ML ecosystem with multiple integrations. &lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Alternatives
&lt;/h3&gt;

&lt;p&gt;Ray does not exist in a vacuum, and there are numerous other tools that can support the same types of AI and Python workloads. Ray's layered approach solves challenges from development to production that alternatives may not holistically address. Understanding how Ray relates to other tools can help you understand the value it provides.&lt;/p&gt;

&lt;p&gt;The Python Library components of Ray could be considered analogous to solutions like &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;numpy&lt;/a&gt;, &lt;a href="https://scipy.org/" rel="noopener noreferrer"&gt;scipy&lt;/a&gt;, and &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; (which is most analogous to the &lt;a href="https://docs.ray.io/en/latest/data/data.html" rel="noopener noreferrer"&gt;Ray Data library&lt;/a&gt; specifically). As a framework and distributed computing solution, Ray could be used in place of a tool like &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; or &lt;a href="https://www.dask.org/" rel="noopener noreferrer"&gt;Python Dask&lt;/a&gt;. It’s also worthwhile to note that Ray Clusters can be used as a distributed computing solution within Kubernetes, as we’ve explored here, but Ray Clusters can also be created independent of Kubernetes.&lt;/p&gt;

&lt;p&gt;To learn more about the relationships between Ray and alternatives, check out the &lt;a href="https://kubernetespodcast.com/episode/235-ray/" rel="noopener noreferrer"&gt;“Ray &amp;amp; KubeRay, with Richard Liaw and Kai-Hsun Chen” episode of the Kubernetes Podcast from Google&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As in many other areas of tech, the right answer to the question “which tool is best for me” is, “it depends.” The best solution for your use case will depend on a variety of factors including which tools you’re comfortable with, how you divide work, what's in use in your environment, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try out the Ray Operator on GKE today!
&lt;/h2&gt;

&lt;p&gt;Ray is a powerful solution that bridges the gap between those developing AI applications, and those running them on infrastructure. The Ray Operator for GKE makes it easy to take advantage of Ray on GKE clusters.&lt;/p&gt;

&lt;p&gt;Learn more about Ray and how to use it in the &lt;a href="https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/concepts/overview" rel="noopener noreferrer"&gt;Google Cloud “About Ray on Google Kubernetes Engine (GKE)” docs page.&lt;/a&gt; And check out the “&lt;a href="https://youtu.be/fJvCA3JU53I?si=OuUboitaHcpjjCnd" rel="noopener noreferrer"&gt;Simplify Kuberay with Ray Operator on GKE&lt;/a&gt;” video on YouTube to see an example of the Ray Operator on GKE in action.&lt;/p&gt;

&lt;p&gt;At the time of publishing, &lt;a href="https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/eventsite" rel="noopener noreferrer"&gt;Ray Summit 2024&lt;/a&gt; is just around the corner! Join Anyscale and the Ray community in San Francisco from September 30 through October 2 for three days of Ray-focused training, collaboration, and exploration. The &lt;a href="https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001" rel="noopener noreferrer"&gt;schedule&lt;/a&gt; is available now!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>googlecloud</category>
      <category>aiops</category>
    </item>
    <item>
      <title>AI &amp; Kubernetes</title>
      <dc:creator>Kaslin Fields</dc:creator>
      <pubDate>Mon, 13 May 2024 14:35:00 +0000</pubDate>
      <link>https://forem.com/googlecloud/ai-kubernetes-1957</link>
      <guid>https://forem.com/googlecloud/ai-kubernetes-1957</guid>
      <description>&lt;p&gt;Just like so many in the tech industry, Artificial Intelligence (AI) has come to the forefront in my day-to-day work. I've been starting to learn about how "AI" fits into the world of Kubernetes - and vice versa. This post will start a series where I explore what I'm learning about AI and Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Types of AI Workloads on Kubernetes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To describe the AI workloads engineers are running on Kubernetes, we need some terminology. In this post I’m going to describe two major types of workloads: &lt;strong&gt;training&lt;/strong&gt; and &lt;strong&gt;inference&lt;/strong&gt;. Each term describes a different stage in an AI model's path from concept to production. I’ll also highlight two roles along that path: Platform Engineers bridge the gap between the Data Scientists who design models, and the end users who interact with trained implementations of the models those Data Scientists designed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdpl7i2kp726wgr7xhdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdpl7i2kp726wgr7xhdp.png" alt="A Data Scientist (woman in an orange shirt, white lab coat, and gray pants) sits at a drafting table working on blueprints for a robot. A Platform Engineer (a man in a teal sweater-vest, white long sleeved shirt, and brown pants) works on building the robot from the Data Scientist's blueprint."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Scientists design models while Platform Engineers have an important role to play in making them run on hardware.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a lot of work that happens before we get to the stage of running an AI model in production. Data scientists choose the model type, implement the model (the structure of the "brain" of the program), choose the objectives for the model, and likely gather training data. Infrastructure engineers manage the large amounts of compute resources needed to train the model and to run it for end users. The first step between designing a model and getting it to users is training.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: AI training workloads are generally a type of Stateful Workload, which you can &lt;a href="https://kaslin.rocks/stateful-kubernetes/" rel="noopener noreferrer"&gt;learn more about in my post about them&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Training Workloads&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;"Training" a model is the process for creating or improving the model for its intended use. It's essentially the learning phase of the model's lifecycle. During training, the model is fed massive amounts of data. Through this process, the AI "learns" patterns and relationships within the training data through algorithmic adjustment of the model's parameters. This is the main workload folks are usually talking about when discussing the massive computational and energy requirements of AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbwr4hbogccg14so0ebj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbwr4hbogccg14so0ebj.png" alt="An AI model in the training phase is still learning. A robot stands poised to drop a book into a fish bowl, with books scattered haphazardly on the floor and on the bookshelf behind it. A platform engineer facepalms while a data scientist looks on with concern."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;During training, the AI model is fed massive amounts of data, which it "learns" from, algorithmically adjusting its own parameters.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s becoming a common strategy for teams to use pre-trained models instead of training their own from scratch. However, a generalized AI is often not well-equipped to handle specialized use cases. For scenarios that require a customized AI, a team is likely to perform a similar “training” step that customizes the model without fully retraining it. We call this “fine-tuning.” I’ll dive deeper into fine-tuning strategies another time, but &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-gemini-overview" rel="noopener noreferrer"&gt;this overview of model tuning for Google Cloud’s Gemini model&lt;/a&gt; is a good resource to start with.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Why Kubernetes for Training&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Kubernetes makes a lot of sense as a platform for AI training workloads. As a distributed system, Kubernetes is designed to manage a huge amount of distributed infrastructure and the networking challenges that come with it. Training workloads have significant hardware requirements, which Kubernetes can support with GPUs, TPUs, and other specialized hardware. The scale of a model can vary greatly, from fairly simple to very complex and resource-intensive. Scaling is one of Kubernetes' core competencies, meaning it can manage the variability of training workloads' needs as well.&lt;/p&gt;

&lt;p&gt;Kubernetes is also very extensible, meaning it can integrate with additional useful tools, for example, for observability/monitoring of massive training workloads. A whole ecosystem has emerged, full of useful tools for AI/Batch/HPC workloads on Kubernetes. &lt;a href="https://kueue.sigs.k8s.io/" rel="noopener noreferrer"&gt;Kueue&lt;/a&gt; is one such tool: a Kubernetes-native open source project for managing the queueing of batch workloads on Kubernetes. To learn more about batch workloads on Kubernetes with Kueue, you might check out &lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/kueue-intro" rel="noopener noreferrer"&gt;this tutorial&lt;/a&gt;. You can also learn more about running batch workloads on Kubernetes with GPUs in &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/provisioningrequest" rel="noopener noreferrer"&gt;this guide about running them on GKE in Google Cloud&lt;/a&gt;.&lt;/p&gt;
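&lt;p&gt;As a rough sketch of how Kueue fits in: a batch Job opts into queueing by starting suspended and carrying a queue-name label. The queue name, Job name, and image below are hypothetical placeholders, and this assumes a Kueue LocalQueue has already been set up by a cluster admin:&lt;/p&gt;

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-training-job        # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # hypothetical LocalQueue
spec:
  suspend: true   # Kueue unsuspends the Job once resources are admitted
  template:
    spec:
      containers:
        - name: trainer
          image: python:3.11      # placeholder training image
          command: ["python", "-c", "print('training...')"]
          resources:
            requests:
              cpu: "1"
      restartPolicy: Never
```

&lt;p&gt;The key idea is that the Job waits in a queue rather than contending for resources immediately; Kueue decides when it is admitted based on the quotas the platform team has configured.&lt;/p&gt;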

&lt;h4&gt;
  
  
  &lt;strong&gt;Inference Workloads&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;You could say that training makes the AI into as much of an "expert" as it's going to be. Running a pre-trained model is its own type of workload. These "inference" workloads are generally much less resource-intensive than "training" workloads, though their resource needs can still vary significantly. IBM defines “AI inferencing” as “the process of running live data through a trained AI model to make a prediction or solve a task.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy62s4n7p960vvep5vvl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy62s4n7p960vvep5vvl2.png" alt="A skilled robot shows off its work: a perfectly organized bookshelf. The data scientist and platform engineer who created it express their approval with warm expressions, thumbs up, and clapping."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An "Inference Workload" describes running a trained model. This model should be able to do its expected tasks relatively well.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Inference workloads can range from fairly simple, lightweight implementations to much more complex and resource-intensive ones. The term "inference workload" can describe a standalone, actively running implementation of a pre-trained AI model. Or it can describe an AI model that functions essentially as a backend service within a larger application, often in a microservice-style architecture. The term seems to be used as a catch-all for any case where a trained AI model is being run; I’ve heard it used interchangeably with the terms “serving workload” and “prediction workload.”&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Why Kubernetes for Inference&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Inference workloads can have diverse resource needs. Some might be lightweight and run on CPUs, while others might require powerful GPUs for maximum performance. Kubernetes excels at managing heterogeneous hardware, allowing you to assign the right resources to each inference workload for optimal efficiency.&lt;/p&gt;
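&lt;p&gt;For example, matching an inference workload to specialized hardware in Kubernetes comes down to declaring the resource in the Pod spec. The Pod and image names below are placeholders, and this sketch assumes a GPU-enabled node pool with the NVIDIA device plugin (as provided on GKE GPU nodes):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server          # hypothetical name
spec:
  containers:
    - name: model-server
      image: my-inference-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # schedules the Pod onto a node with a free GPU
```

&lt;p&gt;A lightweight CPU-only inference service would simply omit the GPU limit and request CPU and memory instead, and Kubernetes places each workload on appropriate nodes accordingly.&lt;/p&gt;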

&lt;p&gt;Kubernetes provides flexibility in the way an inference workload is used. Users may interact with it directly as a standalone application. Or they may go through a separate frontend as part of a microservice. Whether the workload is standalone or one part of a whole, we call the workload that runs an AI model, "inference."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;On Terminology&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In writing this blog post, I learned that the terminology of AI workloads is still actively being determined. “Inference” is currently used interchangeably with “serving,” “prediction,” and maybe more. Words have meaning, but when meanings are still being settled, it’s especially important to be as clear as possible about what you mean when you use a term.&lt;/p&gt;

&lt;p&gt;One area which I think is not well-served by existing AI terminology is the difference between running a model, and running a full AI application. I have also seen the “inference” family of terms used to describe not just the running model, but the full application it is a part of, when the speaker is focusing on the AI aspects of that application.&lt;/p&gt;

&lt;p&gt;It’s worth noting that Kubernetes is good not just for running the AI part, but also for running the application that AI model is part of, as well as related services. Serving frameworks like Ray are useful tools for managing not just AI models, but the applications around them. I’ll likely dive deeper into Ray in a future blog post. If you’d like to learn more about Ray, you might check out &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/use-ray-on-kubernetes-with-kuberay" rel="noopener noreferrer"&gt;this blog post about Ray on GKE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsay5xd07a9grala5n87x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsay5xd07a9grala5n87x.png" alt="A trained libriarian bot stands welcomingly in front of an organized bookshelf. A user requests a specific book."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI models often fill a role within a larger application that serves users.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, be careful when you talk about AI workloads! Try to explain what you mean as clearly as you can so we can all learn and understand together!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Customizing AI: It's All About Context&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I'm enjoying learning about the ways "AI" fits into the world of "Kubernetes," and there's a lot more to learn! In this post, we explored AI training, inference, and serving workloads and why to run them on Kubernetes. These workload types are great for understanding what it means to run AI models on Kubernetes. But the real value of AI is in its ability to understand and convey information in-context. To make a generic AI model useful in many use cases, it needs to be made aware of the context it's operating in, and what role it's fulfilling. "Fine-tuning" refers to the techniques for customizing a generic model, often by partially retraining it. There are also other techniques like RAG and prompt engineering that can be used to customize a generic model’s responses without altering the model itself. I’ll dive deeper into these techniques in a future blog post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
