Forem: Rizèl Scarlett

Your Agent Sessions Belong in Your Codebase: Nullius in Verba

Rizèl Scarlett — Tue, 19 May 2026 08:56:46 +0000

Your coding agent sessions belong in your codebase. Before I joined Entire, the company building the infrastructure to bring your agent sessions into your code, I was already exploring this exact idea on my own.

In January 2026, I participated in Genuary, a month-long creative coding challenge where artists, designers, and programmers make and share generative art based on a daily prompt. I used my coding agent, goose, to generate the creative code. For me, this was less of an exercise in creative coding and more of a self-taught lesson in orchestrating agents, since doing complex things with agents was on the rise.

One of the things I built into my process was a repeating workflow where, after every session, my agent automatically committed the session transcript into the same repository that held the creative output. It wasn't elegant, because it was literally a huge transcript with every tool call mixed in and almost no structure to make it readable. I did it because some of the creations were so astonishingly beautiful that I wanted my agent and myself to be able to look back later and have enough context to reuse those same patterns for future challenges.

By March, I was working at a company that had built a far more elegant solution to the same problem. Instead of haphazardly dumping whole session transcripts, Entire saves each session as a series of navigable checkpoints. Each checkpoint is a snapshot of a meaningful moment in the session, capturing what the agent did, what changed in your code, and the reasoning that produced the change. Now, after using Entire for a few months, I see that what I had treated as a nice-to-have for myself is a real necessity for engineers.

I had this epiphany while doing what my job actually entails, which is advocating for developers. I started noticing a pattern across the developers and community members I talked to. While many of them wanted to track their agent sessions, they did not want those sessions to live in the same codebase. Some people felt their sessions were too embarrassing, full of mistakes or moments where they had been harsh with their agent, because all of us have lost patience with a coding agent that just refuses to understand us. Others felt the sessions were too private. Because Entire already supports storing sessions in a separate repository and redacts secrets by default, I assumed we should be louder about that functionality.

Surprisingly, one of my teammates disagreed with me. His philosophy was that agent sessions belong alongside your code by default, and that the discomfort developers feel would eventually go away. Because I am trained to empathize with developers, I initially felt this stance was dogmatic, and I struggled to see eye to eye. Over the past few weeks, though, the idea kept ringing through my mind, and I now see his perspective.

Software engineering has never been about flawless first drafts. Our industry thrives precisely because we maintain a transparent, versioned track record of our technical evolution, and when engineers treat interactions with coding agents as ephemeral scratchpads, we end up ignoring a foundational principle of how software actually gets built. Every architectural and logical decision deserves a clear and traceable provenance, and right now that provenance is silently disappearing into chat windows.

I did some historical research on how deeply embedded proof of work is in our industry, and I learned a lot about what happens when we abandon that proof of work. Here’s what I learned.

Proof of Work in Mathematics

This foundation goes as far back as mathematics, the predecessor of computer science and software engineering. In the 1600s, mathematicians operated inside a genuinely toxic environment, settling disputes through public academic duels with brutal stakes. Winners kept their university chairs, while losers were publicly humiliated and often lost their livelihoods entirely.

Because the consequences of losing were so severe, practitioners routinely hid their formulas and hoarded their methodologies. That culture of intense secrecy produced constant intellectual property disputes, redundant reverse-engineering, and a fragmented ecosystem that ended up stalling the progress of the entire discipline.

The turning point came in the 1660s, when the Royal Society of London adopted a new motto, Nullius in verba, which translates to "take nobody's word for it." From that point on, mathematicians had to publish their complete, step-by-step processes in academic journals rather than only presenting final conclusions. In exchange for that transparency, they received institutional validation and undisputed peer credit, and the field finally had a shared ledger of truth.

Proof of Work in Software

Three hundred years later, software engineering experienced a similar reckoning. In the 1960s, code was a tangible, physical artifact. Developers punched holes into cardboard cards, organized them into precise decks, and fed those decks into a mainframe. Version control was physical too, because changing a routine meant pulling a specific card out of the deck and slotting a new one into its place.

Then, code moved to magnetic tape and hard disks and became digitally invisible. Multiple developers modified the same file and accidentally overwrote each other's changes without any shared source of truth. The industry's response was a slow march back toward visibility, moving from local file-locking systems like SCCS and RCS to centralized trackers like CVS and Subversion, and eventually to Git. Git decoupled development pipelines entirely through a distributed, non-linear architecture, but it was hard to use on its own, and it did not pick up real traction until GitHub layered a collaborative interface on top of it. That interface turned version history into a shared social ledger and defined the modern development workflow.

The pattern is the same one the Royal Society set in motion three centuries earlier. Every time our industry has taken a leap forward, that leap has come from making invisible work visible.

Invisible Agent Work

Agentic workflows are becoming the primary engine of software production, but they're abstracting away our work at the same time.

By committing only the final file output of an agent session, we aren't hiding our work the way 17th-century mathematicians did. But the effect is the same: we are back to delivering an end product while erasing the lineage of how it was reasoned into existence.

The prompts you write, the specific files your agent reads, and the back-and-forth debugging it takes to get things right are not just logs. They are first-class development artifacts. When we strip them away from a pull request, the rest of our tooling, our reviewers, and our future selves are all left to take the resulting code at face value.

That is exactly the position the Royal Society found unworkable in the seventeenth century, and there is no good reason to expect it will work for us either.

Nullius in Verba

Including your unedited session next to your code feels vulnerable, but so does pushing your first commit to a public repo or opening your first pull request in an open source project.

That discomfort is not a flaw in the workflow; it is the price of admission for a trustworthy, auditable record of how software actually gets built. Nullius in verba is still the right principle 300+ years later: take nobody's word for it, not even your agent's. Let the work speak in the place where the work actually lives. That is the direction we are building toward at Entire: making the context behind agent-authored work as visible as the code itself.

Did you like this blog post?

Try out Entire: entire.io
Join our Discord: https://discord.gg/WUzRcQ5PX4
Read our docs: docs.entire.io

If you have additional thoughts, feel free to leave a comment!

How to Keep Entire Checkpoints Separate from Your Code

Rizèl Scarlett — Fri, 08 May 2026 08:02:39 +0000

Storing a record of your agent sessions solves the biggest friction point for developers: limited context. On the surface, this may look like a dormant log, but Entire transforms those records into procedural memory. By default, the Entire CLI stores your agent history right alongside your code. More specifically, it stores your checkpoints (snapshots of your prompts, agent transcripts, and the state of your work at each step) on a dedicated branch in the same repository called entire/checkpoints/v1.

But as valuable as that memory is, it raises a valid question: What if I don’t want anyone else to see the conversations I have with my agent?

We’ve heard a few consistent reasons why developers want to keep their agent history private:

The conversations are, frankly, a little embarrassing. (I’ve yelled at my agents before. I am not proud of it, but when tokens are few, so is my patience).
It can start to feel like surveillance from their employer.
It's a privacy concern. Those conversations might include context their company doesn't want to share publicly or with external collaborators.
They want to keep their main repo lean and focused on source code.

If any of those reasons resonate, you have two main paths to a more private workflow.

Push checkpoints to a separate private repo

This is the sweet spot if you’re working on a public or shared project but still want a history that you (and maybe your trusted teammates) can access.

1. Create a private repo for your checkpoints

On GitHub, create an empty private repo with any name you want. In this example, we’ll use myorg/checkpoints-private. This is where all your agent sessions will live. You don't need to add a README or initialize it. Entire will push the first checkpoint branch on its own.

2. Point Entire at the new repo

From inside your project, run:

entire configure --checkpoint-remote github:myorg/checkpoints-private

The format is provider:owner/repo. Today, github is the supported provider. This writes the setting to .entire/settings.json under strategy_options.checkpoint_remote:

{
  "strategy_options": {
    "checkpoint_remote": {
      "provider": "github",
      "repo": "myorg/checkpoints-private"
    }
  }
}

Now, your code will be stored in your main repo, and your agent sessions will go to your new private repo. You can read more about this in the Checkpoint Remote docs.
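
To make the provider:owner/repo format concrete, here is a minimal Python sketch of how that string maps onto the JSON shown above. The function name parse_checkpoint_remote is hypothetical and this is not Entire's implementation. It only mirrors the mapping described in this section.

```python
import json


def parse_checkpoint_remote(value: str) -> dict:
    """Split 'provider:owner/repo' into the settings shape shown above.

    Hypothetical helper for illustration; not part of the Entire CLI.
    """
    provider, colon, repo = value.partition(":")
    owner, slash, name = repo.partition("/")
    if not (colon and slash and provider and owner and name):
        raise ValueError("expected provider:owner/repo")
    return {
        "strategy_options": {
            "checkpoint_remote": {"provider": provider, "repo": repo}
        }
    }


settings = parse_checkpoint_remote("github:myorg/checkpoints-private")
print(json.dumps(settings, indent=2))
```

Printing the result gives the same structure as the settings file above, which is a quick way to sanity-check a value before running the configure command.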

Keep your agent sessions local

If you want the highest level of privacy, you can keep your agent sessions local and opt out of pushing them to remote using the following command:

entire configure --skip-push-sessions

This modifies .entire/settings.json with the following values:

{
  "strategy_options": {
    "push_sessions": false
  }
}

This setting still allows you to store your sessions locally. For example, you can still rewind, look back at what happened, and use all the local features. However, because the checkpoints never get pushed to GitHub or any remote provider, you cannot retrieve them if you switch devices. Also, your teammates won’t have access to your checkpoints.

What if I accidentally paste a secret?

We know that mistakes happen, so we have guardrails in place. Whether you store your history alongside your code, in a private repo, or on your local machine, Entire runs every session through a redaction pipeline before it hits git.

We use Betterleaks to automatically scrub:

Cloud credentials (AWS, GCP, Azure)
Source control tokens (GitHub, GitLab, Bitbucket)
Service keys (Stripe, Slack, Discord, Twilio)
Private keys (RSA, SSH, PGP)
Database connection strings with embedded passwords
Bearer tokens, JWTs, and high-entropy strings that look secret-shaped even if they don't match a known pattern
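
To show the general idea behind that last category, here is a small Python sketch that combines two known-pattern rules with a Shannon entropy check. This is an assumption-heavy illustration, not how Betterleaks or Entire's pipeline works: the two patterns, the 20-character minimum, and the 4.0 threshold are all placeholder choices.

```python
import math
import re

# Illustrative rules only; a real scanner ships far more patterns.
KNOWN_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # shape of an AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shape of a GitHub personal access token
]
# Long unbroken runs of token-like characters are candidates for the entropy check.
CANDIDATE = re.compile(r"[A-Za-z0-9+/_-]{20,}")


def shannon_entropy(text: str) -> float:
    """Bits per character, based on character frequencies in the string."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact(text: str, threshold: float = 4.0) -> str:
    """Replace known secret shapes, then any high-entropy candidate strings."""
    for pattern in KNOWN_PATTERNS:
        text = pattern.sub("[REDACTED]", text)

    def scrub(match: re.Match) -> str:
        token = match.group(0)
        return "[REDACTED]" if shannon_entropy(token) >= threshold else token

    return CANDIDATE.sub(scrub, text)


print(redact("export AWS_KEY=AKIAABCDEFGHIJKLMNOP"))
print(redact("token q7Zp2LxV9mK4tR8wB1nY6cH3dJ5fG0sA"))
```

A threshold like this trades false positives (long file paths or hashes) against missed secrets, which is why a dedicated scanner is a better choice than a hand-rolled check.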

TL;DR: Which one should you pick?

The Goal	The Command	Where data lives
Full Visibility	Default	Same repo as your code
Private Collaboration	`--checkpoint-remote`	A separate private repo
Total Isolation	`--skip-push-sessions`	Your local machine only

Note: Redaction runs across all three. Whether your checkpoints live in your code repo, a private repo, or only on your laptop, secrets get scrubbed before they're written.
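
If you want to check which row of that table a project falls into, a sketch like the following reads the two settings keys shown earlier. The function name is hypothetical, and the precedence (push_sessions set to false wins over a configured remote) is my assumption, not documented behavior.

```python
def checkpoint_destination(settings: dict) -> str:
    """Classify where checkpoints end up, based on .entire/settings.json content.

    Assumption: push_sessions=false takes precedence over checkpoint_remote.
    """
    options = settings.get("strategy_options", {})
    if options.get("push_sessions") is False:
        return "your local machine only"
    remote = options.get("checkpoint_remote")
    if remote:
        return f"a separate repo ({remote['provider']}:{remote['repo']})"
    return "same repo as your code"


default = checkpoint_destination({})
private = checkpoint_destination(
    {"strategy_options": {"checkpoint_remote": {"provider": "github", "repo": "myorg/checkpoints-private"}}}
)
local = checkpoint_destination({"strategy_options": {"push_sessions": False}})
print(default, private, local, sep="\n")
```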

Conclusion

We recognize that your agent workflow is going to look different based on who you are, the codebase you're in, and your team's unique security needs. Entire is built to adapt to those needs.

Ready to dive deeper into configuring your setup? Check out our Checkpoint Remote and Security & Privacy docs.

In the comments section, let me know: Do you even care if people see your agent history, or would you rather keep those transcripts private?

Turning Agent History into Procedural Memory

Rizèl Scarlett — Sun, 03 May 2026 23:40:09 +0000

For about a year, my primary coding agent was goose. Since I worked at Block and served as a Developer Advocate for the project, I was deeply embedded in its ecosystem. I contributed code and provided product feedback that shaped how it functioned.

Then, I moved to a company called Entire that provides the infrastructure for the agentic software development lifecycle. To do my job well, I have to dogfood our product across the agentic ecosystem. This means I am constantly switching between Claude Code, Codex, and other agents that support hooks as I contribute to docs, investigate and resolve bugs, understand new features, and produce content.

Switching between AI agents made me realize every agent has tradeoffs. Some are faster or more polished, but I find myself deeply missing a specific goose feature called recipes.

The Problem with Operational Glue

Recipes are reusable, shareable workflows. At the core, they are YAML files that describe a process you want goose to run again. You can write the YAML files manually, but I always preferred the magic of clicking a button to package a successful session into a recipe.

My work in Developer Relations is creative, but it’s built on repeatable systems. For example, writing a blog post, building a code demo, and creating a video are creative. The publishing process is not. Publishing a blog post involves a series of tiny, forgettable steps: checking the folder structure, adding the correct front matter, wiring up the metadata, dropping the image in the right asset folder, opening the PR. Each of these steps takes a few minutes, but those minutes add up and become hours of operational glue. At Block, I automated as much of that as possible. I had goose generating release notes in CI/CD and creating documentation tickets in Asana. Some of these ran on a schedule, others I triggered manually. The point was always the same: if I found myself explaining a process to an agent more than once, it was an operational smell that needed to become a reusable asset.

While my use cases focus on content and community, this pattern is universal. In many fields, employees find themselves frequently explaining the same sequence to an agent, so why not automate this into repeatable workflows?

For engineers, those repeated conversations may look like:

Upgrading a dependency safely
Bootstrapping a new microservice
Triaging a production error
Writing a design doc or RFC
Preparing a release PR

The inputs and thinking may vary, but the process (the conventions, the file paths, the validation steps, the commands you run, the people you always notify) stays the same. And this level of automation is necessary today, when employers are demanding more output.

The System of Record

I'm constantly jumping between different agents. Each one has its own process for automating workflows, but none of the automation tools hit the mark for me like goose did. While I don’t have access to my treasured repeatable workflows anymore, I do have access to the unique and valuable agent session data that Entire collects. Entire is a CLI-first system of record for agent-assisted development. It captures the context behind your work: the sessions, prompts, responses, tool calls, file changes, and Checkpoints. A Checkpoint is a specific moment where work is tied back to git. It connects the "why" of the agent session to the "what" of the final commit.

I realized this data isn't just for review, audit, or to sit quietly in the background. It's a source of truth that can be used for building better workflows. I thought, “What if I could use my Entire session history to recreate that ‘package up a session’ magic, but in a way that works across any agent, and works retrospectively?”

The most popular way people are currently building reusable workflows is with Skills, so I built an orchestrator skill called Session-to-Skill. It creates Skills for me based on repeated behavior.

The Before and After

Before, I used to say:

“Look at past blog posts in this repo, check the folder structure, and the front matter.”
“I want to add a new blog post. Here’s the content: [insert content copied from google doc here] ”
“Create a new PR. Make sure we’ve pulled the latest from main and branch off main before you create this PR.”
“Why did you make the word Checkpoints lowercase when I purposely had them capitalized? Please restore that.”
“Does the OG image work? What’s the path for me to check that again?”

Now, I can say:

“Create a blog post from this content [insert content copied from google doc here].”

This is possible because I prompted my agent to use the Session-to-Skill Skill: "Look at my past sessions where I set up blog posts. Find the repeated steps and conventions, then draft a Skill from that data, so I can create blog posts quickly in the future." My agent created a Skill called Create-blog, which included requirements to properly format the blog, open a PR, and return the path to confirm the OG image rendered.

Well, that’s kind of dumb...

Some may push back on the idea of building an orchestrator Skill, because at any moment in a session you can prompt any agent to turn that session into a Skill.

The reality is I don’t have perfect foresight. Most reusable workflows are recognized later. After the third time I publish a blog post, I realize I have been doing the same thing over and over again. By then, the valuable evidence is spread across past sessions.

There is also the issue of quality. Asking an agent to summarize a transcript often leads to overfitting and noise. The resulting Skill might include accidental details, temporary file paths, or one-off preferences that happened to be present in that single session.

Instead, my Skill extracts the answers to the following questions:

What was the reusable behavior?
What should a future agent know before attempting this again?

I don't have to remember the session ID from six weeks ago. I just know the work happened. The Skill uses Entire to search my session metadata, checkpoints, and explanations of prior work to find the durable pattern.
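
The difference between summarizing one transcript and finding a durable pattern can be sketched in a few lines of Python. This is an illustration with made-up session data, not the Session-to-Skill implementation: it keeps only the steps that show up in at least two sessions, so one-off details drop out.

```python
from collections import Counter


def durable_steps(sessions: list[list[str]], min_sessions: int = 2) -> list[str]:
    """Return steps seen in at least `min_sessions` sessions, in first-seen order."""
    counts = Counter()
    for steps in sessions:
        counts.update(set(steps))  # count each step once per session

    ordered = []
    for steps in sessions:
        for step in steps:
            if counts[step] >= min_sessions and step not in ordered:
                ordered.append(step)
    return ordered


# Hypothetical step labels standing in for what past sessions contained.
sessions = [
    ["check folder structure", "add front matter", "fix typo in footer", "open PR"],
    ["check folder structure", "add front matter", "verify OG image path", "open PR"],
    ["add front matter", "verify OG image path", "open PR", "rename temp file"],
]
pattern = durable_steps(sessions)
print(pattern)
```

Here "fix typo in footer" and "rename temp file" are filtered out because each appears in a single session, which is the kind of noise that a one-transcript summary tends to keep.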

Procedural Memory as Infrastructure

My approach creates procedural memory for agents. Procedural memory is the answer to the question, "How do I do this kind of work well, here, in this repo, with this team?"

Daily engineering work is not net-new. You may receive a new ticket, but somebody has solved this problem before.

By using Entire's data to generate Skills, I get a layer of determinism and portability. The agent starts with a template based on real work rather than a generic prompt. It encodes patterns that have already succeeded. And because Skills are portable files, I can take my blog-publishing Skill from Claude Code to Codex without re-explaining my workflow and share it with teammates.

With all this said, I want to urge all of us to stop treating our agent sessions as disposable and start turning our history into our infrastructure.

Check out Entire at entire.io

I Don’t Make Slides Anymore. My Agent and Entire Do It for Me.

Rizèl Scarlett — Sat, 25 Apr 2026 17:00:18 +0000

Signing up to speak at conferences is fun until the conference date starts approaching and you realize you still have to write and practice your talk. For me, writing the talk isn't the hard part. I have a process of talking to myself on a peaceful walk (or even in the shower), recording my voice, and then inserting the demos afterward. The part I often procrastinate on is making the slides. Creating slides used to be fun, but as I’ve grown my career and my family, it's no longer a good use of my time.

I've looked for a way to automate slide generation, but most options have been fragile. They generally struggle with formatting and taste. However, a few months ago, some of my teammates at Block discovered the frontend slides skill and introduced me to it. This skill enables agents to build out HTML presentation decks that can be exported as PDFs or PowerPoint presentations.

Agent Skills

If you're not familiar, agent skills are markdown files that provide instructions for the agent to understand how to use a tool (e.g., a CLI or an MCP server). This way, your agent immediately knows what commands to run and how to navigate the tooling when you make a request.

How I Use the Frontend Slides Skill

As I mentioned, I already have my talk transcribed, which gives me a talk track to follow. I typically give the transcript to an agent and ask if there are any parts that don't make sense or any gaps for the audience.

Once the talk track is polished, I give the copy to my agent and prompt it to use the frontend slides skill to build a deck based on the track. I prefer to use Claude Code for this task, as it seems to work really well with the frontend slides skill, but any agent that supports skills should work. The agent then produces a beautiful slide deck for me. It really doesn't look bad or overly generic at all. It has various themes to choose from, and since it's generated with HTML, CSS, and JavaScript, I can prompt my agent to edit parts like making the font bigger, changing colors, and so on.

My favorite thing to add is a presenter view. It doesn't generate that view by default, but I do like to take a peek at my notes as I speak. So I usually tell my agent to implement that view if I press a key like the letter "P," and I make sure it syncs with what everyone else can see. Then, I upload my deck to GitHub Pages. Goodbye, Canva, PowerPoint, and Google Slides.

How Entire Enhanced My Workflow

Let me rewind for a second and introduce you to Entire. Entire is the company I work for. We're building the next developer platform for the AI-native software development lifecycle. The team recognized that agents have changed our workflows, so the infrastructure we use should change too.

Our first tool is a CLI that captures prompts, agent responses, tool calls, and other session data from the work you do with an agent. That gives you a way to inspect what happened, rewind work from a past session, and stay accountable. For example, if a production outage ever happens, instead of saying, "Oh, the root cause is that my agent did it," you can actually track the decisions made between the agent and the person prompting it. I described this to someone at a conference the other day, and they boiled it down to version control for agentic work, which is honestly exactly what it feels like.

Now, it took me a while to see how Entire could make my workflow even better, but the founders opened my eyes. I can use Entire while I'm building out a demo and then use that captured work to help build the slide deck later. Entire has a command called entire dispatch. It generates a markdown summary of the work Entire captured between you and your agent in a repository.

For example, after experimenting with OCR in a repo, Entire generated this dispatch summary for me:

# Dispatch: blackgirlbytes/pretext-handwriting-demo

Shipped a full handwriting recognition demo built on Pretext, moving from  
initial scaffolding to a polished scrapbook composition surface within a  
single day.

## blackgirlbytes/pretext-handwriting-demo

### Handwriting Recognition

• Built draw-mode handwriting recognition as the core interaction surface.  
• Added image upload as a second recognition path alongside the drawing  
canvas.  
• Implemented auto-recognition after drawing completes, removing the manual  
trigger step.  
• Added camera mode to round out the three input methods.  
• Removed the explicit draw button to streamline the input UX.

### Scrapbook Composition Surface

• Introduced a scrapbook composition surface for arranging recognized text  
and shapes.  
• Integrated Pretext to handle obstacle-aware text flow around placed shapes.  
• Added animated motion layer to the scrapbook background.  
• Added resize handles to scrapbook shapes for direct manipulation.  
• Matched scrapbook background typography to the handwriter component for  
visual consistency.  
• Moved scrapbook controls into the composition header to consolidate the  
toolbar.  
• Fixed scrapbook layout and output tab rendering, then simplified and  
tightened tab spacing across multiple passes.  
• Corrected background line wrapping and ensured background renders before  
shapes are placed.

### API Key & Session Management

• Added session-based OpenAI key setup to avoid requiring environment-level  
configuration.  
• Hardened API key gate interactions to handle edge cases more reliably.  
• Added environment key setup path as an alternative to session entry.

### Documentation

• Added project agent working rules and intent guidance early in the commit  
sequence.  
• Documented project setup and architecture in the README.  
• Refined and clarified the README project description across two follow-up  
commits.

All core features landed on main on April 1, 2026; the repo is in a  
reviewable state.

That's helpful for me because building a demo usually takes a few days. I'll leave it, come back to it later, and then have to remember what I actually did, what mattered most, and which technical details are worth calling out. Instead of trying to reconstruct all of that from memory, I can use the dispatch summary, give it back to the agent, and ask it to make a strong slide that captures the main technical highlights of the demo. That saves me from having to recall every step I took days or even weeks later.
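
If you would rather hand the agent a structured outline than raw markdown, a small parser can turn a dispatch summary into one entry per slide. This Python sketch assumes the format of the single example above ("### " section headings and "• " bullets that may wrap onto the next line). The real output of entire dispatch may vary, and the function name is hypothetical.

```python
def dispatch_to_outline(markdown: str) -> dict[str, list[str]]:
    """Map each '### ' section to its bullets, joining wrapped bullet lines."""
    outline: dict[str, list[str]] = {}
    section = None     # current "### " heading, if any
    in_bullet = False  # True while lines still belong to the latest bullet
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            section = line[4:]
            outline[section] = []
            in_bullet = False
        elif line.startswith("#") or not line:
            if line:  # a "#" or "##" heading closes the current section
                section = None
            in_bullet = False
        elif section is not None and line.startswith("• "):
            outline[section].append(line[2:])
            in_bullet = True
        elif section is not None and in_bullet:
            outline[section][-1] += " " + line  # continuation of a wrapped bullet
    return outline


sample = """# Dispatch: example/repo

Summary line.

## example/repo

### Handwriting Recognition

• Built draw-mode handwriting recognition.
• Added image upload alongside the drawing
canvas.

### Documentation

• Documented project setup in the README.

All core features landed on main.
"""
outline = dispatch_to_outline(sample)
for title, bullets in outline.items():
    print(title, bullets)
```

Each key can become a slide title and each list a set of talking points, which gives the agent less to guess about when it builds the deck.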

Beyond DevRel

My particular use case works best for folks in Developer Relations or folks who do DevRel-related work like conference speaking, but this can actually work well for various roles. Here are a few:

Developers demoing completed features to their team. My husband is a developer, and he's expressed that building the slide deck is time-consuming just to show off a feature he built to his team.
Hackathon participants demoing their project to judges. Presentation decks often get neglected because everyone is focused on building the actual project during a small window of time.
Solutions engineers or sales engineers preparing customer demos. A lot of time goes into building out the demo environment itself, so having help turning that work into a clear deck can save a lot of time.
Workshop instructors or developer educators teaching technical material. It can be useful to turn the work captured while building the demo or sample app into slides that explain the flow, architecture, or key takeaways.
Engineering managers or tech leads giving project updates. Sometimes the hard part is not the work itself, but summarizing what happened clearly enough for leadership or cross-functional teams.
Founders or indie hackers pitching what they built. When you are moving quickly, the last thing you want is to spend hours making slides after already doing the hard part of building the product.

I don’t believe in automating things that deserve a human touch, but I do believe in automating things so I can spend more time with humans. The slides skill has been great for me, but adding Entire to this workflow has made it even easier for me to do that.

Much of my previous work with GitHub and Block (goose) was focused on using agents to build faster. Recently joining the team at Entire has pushed me to think more about the next layer: making agentic work durable and accountable.

Building fast with agents is fun, but in practice, I also need to be able to understand what happened, pick work back up later, explain it to other people, and turn it into something useful beyond the moment it was created. I’ll be sharing more of my agent-native workflows as I continue experimenting.

8 Things You Didn't Know About Code Mode

Rizèl Scarlett — Thu, 19 Feb 2026 06:54:38 +0000

Agents fundamentally changed how we program. They enable developers to move faster by disintermediating the traditional development workflow. This means less time switching between specialized tools and fewer dependencies on other teams. Now that agents can execute complicated tasks, developers face a new challenge: using them effectively over long sessions.

The biggest challenge is context rot. Because agents have limited memory, a session that runs too long can cause them to "forget" earlier instructions. This leads to unreliable outputs, frustration, and subtle but grave mistakes in your codebase. One promising solution is Code Mode.

Instead of describing dozens of separate tools to an LLM, Code Mode allows an agent to write code that calls those tools programmatically, reducing the amount of context the model has to hold at once. While many developers first heard about Code Mode through Cloudflare's blog post, fewer understand how it works in practice.

I have been using Code Mode for a few months and recently ran a small experiment. I asked goose to fix its own bug where the Gemini model failed to process images in the CLI but worked in the desktop app, then open a PR. The fix involved analyzing model configuration, tracing image input handling through the pipeline, and validating behavior across repeated runs. I ran the same task twice: once with Code Mode enabled and once without it.

Here is what I learned from daily use and my experiment.

1. Code Mode is Not an MCP-Killer

In fact, it uses MCP under the hood. MCP is a standard that lets AI agents connect to external tools and data sources. When you install an MCP server in an agent, that MCP server exposes its capabilities as MCP tools. For example, goose's primary MCP server, the developer extension, exposes tools like shell, which lets goose run commands, and text_editor, which lets goose view and edit files.

Code Mode wraps your MCP tools as JavaScript modules, allowing the agent to combine multiple tool calls into a single step. Code Mode is a pattern for how agents interact with MCP tools more efficiently.

2. goose Supports Code Mode

Code Mode support landed in goose v1.17.0 in December 2025. It ships as a platform extension called "Code Mode" that you can enable in the desktop app or CLI.

To enable it:

Desktop app: Click the extensions icon and toggle on "Code Mode"
CLI: Run goose configure and enable the Code Mode extension

Since its initial implementation, we've added so many improvements!

3. Code Mode Keeps Your Context Window Clean

Every time you install an MCP server (or "extension" in the goose ecosystem), it adds a significant amount of data to your agent's memory. Every tool comes with a tool definition describing what the tool does, the parameters it accepts, and what it returns. This helps the agent understand how to use the tool.

These definitions consume space in your agent's context window. For example, if a single definition takes 500 tokens and an extension has five tools, that is 2,500 tokens gone before you even start. If you use multiple extensions, that number could easily double or even grow tenfold.

Without Code Mode, your context window could look like this:

[System prompt: ~1,000 tokens]
[Tool: developer__shell - 500 tokens]
[Tool: developer__text_editor - 600 tokens]
[Tool: developer__analyze - 400 tokens]
[Tool: slack__send_message - 450 tokens]
[Tool: slack__list_channels - 400 tokens]
[Tool: googledrive__search - 500 tokens]
[Tool: googledrive__download - 450 tokens]
... and so on for every tool in every extension
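
To make the arithmetic concrete, here is a small sketch that totals the illustrative numbers from the listing above. The token counts are the example figures from this post, not measurements, and the ContextEntry type and totalTokens helper are names I made up for illustration.

```typescript
// Illustrative only: token counts are the example figures from the listing above.
type ContextEntry = { label: string; tokens: number };

const systemPrompt: ContextEntry = { label: "System prompt", tokens: 1000 };

const toolDefinitions: ContextEntry[] = [
  { label: "developer__shell", tokens: 500 },
  { label: "developer__text_editor", tokens: 600 },
  { label: "developer__analyze", tokens: 400 },
  { label: "slack__send_message", tokens: 450 },
  { label: "slack__list_channels", tokens: 400 },
  { label: "googledrive__search", tokens: 500 },
  { label: "googledrive__download", tokens: 450 },
];

// Sum the token cost of a list of context entries.
function totalTokens(entries: ContextEntry[]): number {
  return entries.reduce((sum, entry) => sum + entry.tokens, 0);
}

const toolOverhead = totalTokens(toolDefinitions);
const upfront = systemPrompt.tokens + toolOverhead;

console.log(`Tool definitions: ${toolOverhead} tokens`);
console.log(`Loaded before the first message: ${upfront} tokens`);
```

With just these seven example tools, 3,300 of the 4,300 tokens loaded upfront are tool definitions.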

As your session progresses, tool definitions you aren't even using crowd out the context that matters: the code you are discussing, the problem you are solving, and the instructions you previously gave. This leads to degraded performance and forgotten instructions. While I used to recommend disabling unused MCP servers, Code Mode offers a better fix. It exposes three tools that let the agent discover the tools it needs on demand instead of loading every tool definition upfront:

search_modules - Find available extensions
read_module - Learn what tools an extension offers
execute_code - Run JavaScript that uses those tools

I wanted to see how true this was, so I ran an experiment: I had goose solve a user's bug and put up a PR with and without Code Mode. Code Mode used roughly 30% fewer tokens for the same task.

Metric	With Code Mode	Without Code Mode
Total tokens	23,339	33,648
Input tokens	23,128	33,560
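
For anyone who wants to check the percentage, this short sketch recomputes it from the numbers in the table. The percentSaved helper is a name I made up for illustration.

```typescript
// Token counts copied from the table above.
const withCodeMode = { total: 23339, input: 23128 };
const withoutCodeMode = { total: 33648, input: 33560 };

// Percentage reduction relative to the run without Code Mode.
function percentSaved(baseline: number, improved: number): number {
  return ((baseline - improved) / baseline) * 100;
}

const totalSaved = percentSaved(withoutCodeMode.total, withCodeMode.total);
const inputSaved = percentSaved(withoutCodeMode.input, withCodeMode.input);

console.log(`Total tokens saved: ${totalSaved.toFixed(1)}%`); // 30.6%
console.log(`Input tokens saved: ${inputSaved.toFixed(1)}%`); // 31.1%
```

This is a single pair of runs on one task, so treat the figure as a data point rather than a benchmark.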

4. Code Mode Batches Operations Into a Single Tool Call

The token savings do not just come from loading fewer tool definitions upfront. Code Mode also handles the "active" side of the conversation through a method called batching.

When you ask an agent to do something, it typically breaks your request into individual steps, each requiring a separate tool call. You can see these calls appear in your chat as the agent executes the tasks. For example, if you ask goose to "check the current branch, show me the diff, and run the tests," it might run four individual commands:

▶ developer__shell → git branch --show-current

▶ developer__shell → git status

▶ developer__shell → git diff

▶ developer__shell → cargo test

Each of these calls adds a new layer to the conversation history that goose has to track. Batching combines these into a single execution. When you turn Code Mode on and give that same prompt, you will see just one tool call:

▶ Code Execution: Execute Code
  generating...

Inside that one execution, it batches all the commands into a script:

import { shell } from "developer";

const branch = shell({ command: "git branch --show-current" });
const status = shell({ command: "git status" });
const diff = shell({ command: "git diff" });
const tests = shell({ command: "cargo test" });

As a user, you see the same results, but the agent only has to remember one interaction instead of four. By reducing these round trips, Code Mode keeps the conversation history concise so the agent can maintain focus on the task at hand.
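
Here is a toy model of that difference. It does not call goose or run any commands; it only counts how many entries each approach would add to the conversation history. The HistoryEntry type and both function names are hypothetical.

```typescript
// Toy model: each tool call the agent makes adds one entry to the conversation history.
type HistoryEntry = { tool: string; detail: string };

const commands = [
  "git branch --show-current",
  "git status",
  "git diff",
  "cargo test",
];

// Without Code Mode: one developer__shell call (and one history entry) per command.
function withoutBatching(cmds: string[]): HistoryEntry[] {
  return cmds.map((command) => ({ tool: "developer__shell", detail: command }));
}

// With Code Mode: the commands are wrapped in one script, so a single
// execute_code call carries all of them.
function withBatching(cmds: string[]): HistoryEntry[] {
  const script = cmds
    .map((command) => `shell({ command: ${JSON.stringify(command)} });`)
    .join("\n");
  return [{ tool: "execute_code", detail: script }];
}

console.log(withoutBatching(commands).length); // 4
console.log(withBatching(commands).length); // 1
```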

5. Code Mode Makes Smarter Tool Choices

When an agent has access to dozens of tools, it sometimes makes a "logical" choice that is technically wrong for your environment. This happens because, in a standard setup, the agent picks tools from a flat list based on short text descriptions. This can lead to a massive waste of time and tokens when the agent picks a tool that sounds right but lacks the necessary context.

I saw this firsthand during my experiments. I had an extension enabled called agent-task-queue, which is designed to run background tasks with timeouts.

When I asked goose to run the tests for my PR, it looked at the available tools and saw agent-task-queue. The LLM reasoned that a test suite is a "long-running task," making that extension a perfect fit. It chose the specialized tool over the generic shell.

However, the tool call failed immediately:

FAILED exit=127 0.0s
/bin/sh: cargo: command not found

My environment was not configured to use that specific extension for my toolchain. goose made a reasonable choice based on the description, but it was the wrong tool for my actual setup.

In the Code Mode session, this mistake never happened. Code Mode changes how the agent interacts with its capabilities by requiring explicit import statements.

Instead of browsing a menu of names, goose had to be intentional about which module it was using. It chose to import from the developer module:

import { shell } from "developer";

const test = shell({ command: "cargo test -p goose --lib formats::google" });

By explicitly importing developer, Code Mode ensured the tests ran in my actual shell environment.

6. Code Mode Is Portable Across Editors

goose is more than an agent; it's also an ACP (Agent Client Protocol) server. This means you can connect it to any editor that supports ACP, like Zed or Neovim. Plus, any MCP server you use in goose will work there, too.

I wanted to try this myself, so I set up Neovim to connect to goose with Code Mode enabled. Here's the configuration I used:

{
  "yetone/avante.nvim",
  build = "make",
  event = "VeryLazy",
  opts = {
    provider = "goose",
    acp_providers = {
      ["goose"] = {
        command = "goose",
        args = { "acp", "--with-builtin", "code_execution,developer" },
      },
    },
  },
  dependencies = {
    "nvim-lua/plenary.nvim",
    "MunifTanjim/nui.nvim",
  },
}

The key line is the one where I enable Code Mode right inside the editor config:

args = { "acp", "--with-builtin", "code_execution,developer" },

To test it, I asked goose to list my Rust files and count the lines of code. Instead of a long stream of individual shell commands cluttering my Neovim buffer, I saw a single tool call: Code Execution. It worked exactly like it does in the desktop app. This portability means you can build a powerful, efficient agent workflow and take it with you to whatever environment you're most comfortable in.

7. Code Mode Performs Differently Across LLMs

I ran my experiments using Claude Opus 4.5. Your results may vary depending on which model you use.

Code Mode requires the LLM to do things that not all models do equally well:

Write valid JavaScript - The model has to generate syntactically correct code. Models with stronger code generation capabilities will produce fewer errors.
Follow the import pattern - Code Mode expects the LLM to import tools from modules like import { shell } from "developer". Some models might try to call tools directly without importing, which will fail.
Use the discovery tools - Before writing code, the LLM should call search_modules and read_module to learn what tools are available. Some models skip this step and guess, leading to hallucinated tool names.
Handle errors gracefully - When a code execution fails, the model needs to read the error, understand what went wrong, and try again. Some models are better at this feedback loop than others.

If Code Mode is not working well for you, try switching models. A model that excels at code generation and instruction following will generally perform better with Code Mode than one optimized for other tasks.

8. Code Mode Is Not for Every Task

Code Mode adds overhead. Before executing anything, the LLM has to:

Call search_modules to find available extensions
Call read_module to learn what tools an extension offers
Write JavaScript code
Call execute_code to run it

For simple, single-tool tasks, this overhead is not worth it. If you just need to run one shell command or view one file, regular tool calling is faster.

Based on my experiments, here is when Code Mode makes sense:

Use Code Mode When	Skip Code Mode When
You have multiple extensions enabled	You only have 1-2 extensions
Your task involves multi-step orchestration	Your task is a single tool call
You want longer sessions without context rot	Speed matters more than context longevity
You are working across multiple editors	You are doing a quick one-off task

Try It Out

If you want to experiment with Code Mode, here are some resources:

Documentation:

Previous posts:

Code Mode MCP in goose by Alex Hancock
Code Mode Doesn't Replace MCP by me

Community:

Join our Discord to share what you learn
File issues on GitHub if something does not work as expected

Run your own experiments and let us know what you find.

5 Tips for Building MCP Apps That Work

Rizèl Scarlett — Thu, 19 Feb 2026 06:49:01 +0000

MCP Apps allow you to render interactive UI directly inside any agent supporting the Model Context Protocol. Instead of a wall of text, your agent can now provide a functional chart, a checkout form, or a video player. This bridges the gap in agentic workflows: clicking a button is often clearer than describing the action you hope an agent executes.

MCP Apps originated as MCP-UI, an experimental project. After adoption by early clients like goose, the MCP maintainers incorporated it as an official extension. Today, it's supported by clients like goose, MCPJam, Claude, ChatGPT, and Postman.

Even though MCP Apps use web technologies, building one isn't the same as building a traditional web app. Your UI runs inside an agent you don't control, communicates with a model that can't see user interactions, and needs to feel native across multiple hosts.

After implementing MCP App support in our own hosts and building several individual apps to run on them, here are the practical lessons we've picked up along the way.

Overview of how UI renders with MCP Apps

At a high level, clients that support MCP Apps load your UI via iframes. Your MCP App exposes an MCP server with tools and resources. When the client wants to load your app's UI, it calls the associated MCP tool, loads the resource containing the HTML, then loads your HTML into an iframe to display in the chat interface.

Here's an example flow of what happens when goose renders a cocktail recipe UI:

You ask the LLM "Show me a margarita recipe".
The LLM calls the get-cocktail tool with the right parameters. This tool has a UI resource link in _meta.ui.resourceUri pointing to the resource containing the HTML.
The client then uses the URI to fetch the MCP resource. This resource contains the HTML content of the view.
The HTML is then loaded into the iframe directly in the chat interface, rendering the cocktail recipe.
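
The last three steps of that flow can be sketched with in-memory stand-ins. This is a simplified model, not how any particular host implements it: the tools array, the resources map, and the renderToolUi and escapeAttribute helpers are all hypothetical, and a real host does far more (sandboxing, CSP, capability negotiation).

```typescript
// Hypothetical in-memory stand-ins for an MCP server's tools and resources.
type Tool = { name: string; _meta?: { ui?: { resourceUri?: string } } };

const tools: Tool[] = [
  {
    name: "get-cocktail",
    _meta: { ui: { resourceUri: "ui://cocktail/cocktail-recipe-widget.html" } },
  },
];

const resources = new Map<string, string>([
  ["ui://cocktail/cocktail-recipe-widget.html", "<h1>Margarita</h1>"],
]);

// Escape the characters that would break out of a double-quoted attribute.
function escapeAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Look up the tool's UI resource link, fetch the HTML, and wrap it in an iframe.
function renderToolUi(toolName: string): string | undefined {
  const tool = tools.find((t) => t.name === toolName);
  const uri = tool?._meta?.ui?.resourceUri;
  if (uri === undefined) return undefined; // no UI link, so there is nothing to render
  const html = resources.get(uri);
  if (html === undefined) return undefined; // the linked resource does not exist
  return `<iframe sandbox="allow-scripts" srcdoc="${escapeAttribute(html)}"></iframe>`;
}

console.log(renderToolUi("get-cocktail"));
```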

There's a lot that also goes on behind the scenes, such as View hydration, capability negotiation, and CSPs, but this is how it works at a high level. If you're interested in the full implementation of MCP Apps, we highly recommend giving the spec a read.

Tip 1: Adapt to the Host Environment

When building an MCP App, you want it to feel like a natural part of the agent experience rather than something bolted on. Visual mismatches are one of the fastest ways to break that illusion.

Imagine a user starting an MCP App interaction inside a dark-mode agent, but the app renders in light mode and creates a harsh visual contrast. Even if the app works correctly, the experience immediately feels off.

By default, your MCP App has no awareness of the surrounding agent environment because it runs inside a sandboxed iframe. It cannot tell whether the agent is in light or dark mode, how large the viewport is, or which locale the user prefers.

The agent, referred to as the Host, solves this by sharing its environment details with your MCP App, known as the View. When the View connects, it sends a ui/initialize request. The Host responds with a hostContext object describing the current environment. When something changes, such as theme, viewport, or locale, the Host sends a ui/notifications/host-context-changed notification containing only the updated fields.

Imagine this dialogue between the View and Host:

View: "I'm initializing. What does your environment look like?"
Host: "We're in dark mode, viewport is 400×300, locale is en-US, and we're on desktop."
User switches to light theme
Host: "Update: we're now in light mode."

It is your job as the developer to ensure your MCP App makes use of the hostContext so it can adapt to the environment.

How to use hostContext in your MCP App

import { useState } from "react";
import { useApp } from "@modelcontextprotocol/ext-apps/react";
import type { McpUiHostContext } from "@modelcontextprotocol/ext-apps";

function MyApp() {
  const [hostContext, setHostContext] = useState<McpUiHostContext | undefined>(undefined);

  const { app, isConnected, error } = useApp({
    appInfo: { name: "MyApp", version: "1.0.0" },
    capabilities: {},
    onAppCreated: (app) => {
      app.onhostcontextchanged = (ctx) => {
        setHostContext((prev) => ({ ...prev, ...ctx }));
      };
    },
  });

  if (error) return <div>Error: {error.message}</div>;
  if (!isConnected) return <div>Connecting...</div>;

  return (
    <div>
      <p>Theme: {hostContext?.theme}</p>
      <p>Locale: {hostContext?.locale}</p>
      <p>Viewport: {hostContext?.containerDimensions?.width} x {hostContext?.containerDimensions?.height}</p>
      <p>Platform: {hostContext?.platform}</p>
    </div>
  );
}

💡 Tip: If you're using the useApp hook in your MCP App, you can register an onhostcontextchanged listener on the app it creates, then use React's useState to keep your copy of the context up to date. The host provides its context; it's up to you as the app developer to decide what to do with it. For example, you can use theme to render light mode vs dark mode, locale to show a different language, or containerDimensions to adjust the app's sizing.
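
Outside of React, the same idea boils down to two small functions: merge each partial update over the previous context, then derive something visual from it. This is a sketch with a simplified HostContext type that only covers the fields mentioned in this post; the themeClass helper and its CSS class names are made up.

```typescript
// Simplified, hypothetical subset of the host context fields mentioned above.
type HostContext = {
  theme?: "light" | "dark";
  locale?: string;
  platform?: string;
  containerDimensions?: { width: number; height: number };
};

// Change notifications carry only the updated fields, so merge them over
// the previous context instead of replacing it.
function mergeHostContext(prev: HostContext, update: HostContext): HostContext {
  return { ...prev, ...update };
}

// One possible use of the theme: pick a CSS class for the app's root element.
function themeClass(ctx: HostContext): string {
  return ctx.theme === "dark" ? "app app--dark" : "app app--light";
}

const initial: HostContext = {
  theme: "dark",
  locale: "en-US",
  platform: "desktop",
  containerDimensions: { width: 400, height: 300 },
};

// The user switches to the light theme; only `theme` arrives in the update.
const afterSwitch = mergeHostContext(initial, { theme: "light" });

console.log(themeClass(initial)); // app app--dark
console.log(themeClass(afterSwitch)); // app app--light
console.log(afterSwitch.locale); // en-US
```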

Tip 2: Control What the Model Sees and What the View Sees

There are cases where you may want granular control over what data the LLM has access to and what data the View can show. The MCP Apps spec specifies three tool return values that let you control data flow, each handled differently by the app host.

content: The info that you want to expose to the model. It gives the model context.
structuredContent: This data is hidden from the model context. It is used to send data to the View for hydration.
_meta: This data is hidden from the model context. It is used to provide additional info such as timestamps and version info.

Let's look at a practical example of how we can use these three tool return types effectively:

server.registerTool(
  "view-cocktail",
  {
    title: "Get Cocktail",
    description: "Fetch a cocktail by id with ingredients and images...",
    inputSchema: z.object({ id: z.string().describe("The id of the cocktail to fetch.") }),
    _meta: {
      ui: { resourceUri: "ui://cocktail/cocktail-recipe-widget.html" },
    },
  },
  async ({ id }: { id: string }): Promise<CallToolResult> => {
    const cocktail = await convexClient.query(api.cocktails.getCocktailById, {
      id,
    });

    return {
      content: [
        { type: "text", text: `Loaded cocktail "${cocktail.name}".` },
        { type: "text", text: `Cocktail ingredients: ${cocktail.ingredients}.` },
        { type: "text", text: `Cocktail instructions: ${cocktail.instructions}.` },
      ],
      structuredContent: { cocktail },
      _meta: { timestamp: new Date().toString() }
    };
  },
);

This tool renders a view showing a cocktail recipe. The cocktail data is fetched from the backend database (Convex). The View needs the entire cocktail record, so we pass it via structuredContent. The LLM doesn't need every field, such as the image URL, so we extract only what the model should know about the cocktail (the name, ingredients, and instructions) and pass that to the model via content.

It's important to note that the ChatGPT apps SDK currently handles this differently: structuredContent is exposed to both the model and the View. Its model is the following:

content: Content is the info that you want to expose to the model. Gives model context.
structuredContent: This data is exposed to the model and the View.
_meta: This data is hidden from the model context.

If you're building an app that supports both MCP Apps and the ChatGPT apps SDK, this is an important distinction. You may want to conditionally return values, or conditionally render tools, based on whether the client supports MCP Apps or is a ChatGPT app.
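
One way to conditionally return values is sketched below. It is an assumption-heavy example: the ClientKind label, the Cocktail shape, and buildCocktailResult are hypothetical, how you detect the client is left out, and it assumes your View can read view-only fields from _meta.

```typescript
// Hypothetical label for the client; detecting it is outside this sketch.
type ClientKind = "mcp-apps" | "chatgpt-apps-sdk";

// Hypothetical cocktail shape for illustration.
type Cocktail = { name: string; ingredients: string; instructions: string; imageUrl: string };

type ToolResult = {
  content: { type: "text"; text: string }[];
  structuredContent?: Record<string, unknown>;
  _meta?: Record<string, unknown>;
};

function buildCocktailResult(cocktail: Cocktail, client: ClientKind): ToolResult {
  const content = [{ type: "text" as const, text: `Loaded cocktail "${cocktail.name}".` }];
  if (client === "mcp-apps") {
    // structuredContent stays out of the model context, so send the full record.
    return { content, structuredContent: { cocktail } };
  }
  // Here the model also sees structuredContent, so keep it small and move
  // view-only fields into _meta, which is hidden from the model.
  const { imageUrl, ...modelSafe } = cocktail;
  return { content, structuredContent: { cocktail: modelSafe }, _meta: { imageUrl } };
}

const margarita: Cocktail = {
  name: "Margarita",
  ingredients: "tequila, lime juice, triple sec",
  instructions: "Shake with ice and strain.",
  imageUrl: "https://example.com/margarita.png",
};

console.log(buildCocktailResult(margarita, "mcp-apps"));
console.log(buildCocktailResult(margarita, "chatgpt-apps-sdk"));
```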

Tip 3: Properly Handle Loading States and Error States

It's pretty typical for the iframe to render before the tool finishes executing and the View gets hydrated. You're going to want to let your user know that the app is loading by presenting a beautiful loading state.

One powerful feature to note: toolInputs are sent and streamed into the View even before the tool execution is done. This allows you to create cool partial loading states where you can show the user what's being requested while the data is still being fetched.

To implement this, let's take a look at the same cocktail recipes app. The MCP tool fetches the cocktail data and passes it to the View via structuredContent. We don't know how long fetching that data takes; it could be anywhere from a few milliseconds to a few seconds on a bad day.

server.registerTool(
  "view-cocktail",
  {
    title: "Get Cocktail",
    description: "Fetch a cocktail by id with ingredients and images...",
    inputSchema: z.object({ id: z.string().describe("The id of the cocktail to fetch.") }),
    _meta: {
      ui: {
        resourceUri: "ui://cocktail/cocktail-recipe-widget.html",
        visibility: ["model", "app"],
      },
    },
  },
  async ({ id }: { id: string }): Promise<CallToolResult> => {
    const cocktail = await convexClient.query(api.cocktails.getCocktailById, {
      id,
    });

    return {
      content: [
        { type: "text", text: `Loaded cocktail "${cocktail.name}".` },
      ],
      structuredContent: { cocktail },
    };
  },
);

On the View side (React), the app created by the useApp hook has an ontoolresult listener that receives the tool result and hydrates your View. Until ontoolresult fires and the data arrives, we can render a beautiful loading state.

import { useState } from "react";
import { useApp } from "@modelcontextprotocol/ext-apps/react";

function CocktailApp() {
  const [cocktail, setCocktail] = useState<CocktailData | null>(null);

  useApp({
    appInfo: IMPLEMENTATION,
    capabilities: {},
    onAppCreated: (app) => {
      app.ontoolresult = async (result) => {
        const data = extractCocktail(result);
        setCocktail(data);
      };
    },
  });

  return cocktail ? <CocktailView cocktail={cocktail} /> : <CocktailViewLoading />;
}

Handling errors

We also want to handle errors gracefully. In the case where there's an error in your tool, such as the cocktail data failing to load, both the LLM and the view should be notified of the error.

In your MCP tool, you should flag the failure in the tool result by setting isError to true. The result is exposed to the model and also passed to the view.

server.registerTool(
  "view-cocktail",
  {
    title: "Get Cocktail",
    description: "Fetch a cocktail by id with ingredients and images...",
    inputSchema: z.object({ id: z.string().describe("The id of the cocktail to fetch.") }),
    _meta: {
      ui: {
        resourceUri: "ui://cocktail/cocktail-recipe-widget.html",
        visibility: ["model", "app"],
      },
    },
  },
  async ({ id }: { id: string }): Promise<CallToolResult> => {
    try {
      const cocktail = await convexClient.query(api.cocktails.getCocktailById, {
        id,
      });

      return {
        content: [
          { type: "text", text: `Loaded cocktail "${cocktail.name}".` },
        ],
        structuredContent: { cocktail },
      };
    } catch {
      // isError is the standard MCP flag for a failed tool call
      return {
        content: [
          { type: "text", text: `Could not load cocktail` },
        ],
        isError: true,
      };
    }
  },
);

Then in useApp on the React client side, you can detect whether there was an error by checking isError on the tool result.
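
Putting the loading and error cases together, the View's state can be derived from two inputs: the streamed tool input and the tool result. The sketch below is framework-free and uses hypothetical names (ToolResultLike, ViewState, deriveViewState); the one thing it relies on from MCP itself is isError, the standard flag on a failed tool result.

```typescript
// Minimal, hypothetical shape of what the View receives from the host.
type ToolResultLike = {
  isError?: boolean;
  structuredContent?: { cocktail?: { name: string } };
};

type ViewState =
  | { kind: "loading" }
  | { kind: "loading-with-input"; requestedId: string }
  | { kind: "error" }
  | { kind: "ready"; cocktailName: string };

function deriveViewState(
  toolInput: { id?: string } | null,
  toolResult: ToolResultLike | null,
): ViewState {
  if (toolResult === null) {
    // Result not in yet: show whatever we already know from the streamed input.
    const requestedId = toolInput?.id;
    return requestedId !== undefined
      ? { kind: "loading-with-input", requestedId }
      : { kind: "loading" };
  }
  const cocktail = toolResult.structuredContent?.cocktail;
  // Treat a flagged failure or missing data as an error state.
  if (toolResult.isError === true || cocktail === undefined) {
    return { kind: "error" };
  }
  return { kind: "ready", cocktailName: cocktail.name };
}

console.log(deriveViewState(null, null).kind); // loading
console.log(deriveViewState({ id: "margarita" }, null).kind); // loading-with-input
console.log(deriveViewState({ id: "margarita" }, { isError: true }).kind); // error
```

In a React View, each kind would map to its own component: a spinner, a partial loading state that names the requested cocktail, an error message, or the recipe.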

Tip 4: Keep the Model in the Loop

Because your MCP App operates in a sandboxed iframe, the model powering your agent can't see what happens inside the app by default. It won't know if a user fills out a form, clicks a button, or completes a purchase.

Without a feedback loop, the model loses context. If a user buys a pair of shoes and then asks, "When will they arrive?", the model won't even realize a transaction occurred.

To solve this, the SDK provides two methods to keep the model synchronized with the user's journey: sendMessage and updateModelContext.

sendMessage()

Use this for active triggers. It sends a message to the model as if the user typed it, prompting an immediate response. This is ideal for confirming a "Buy" click or suggesting related items right after an action.

// User clicks "Buy" - the model responds immediately
await app.sendMessage({
  role: "user",
  content: [{ type: "text", text: "I just purchased Nike Air Max for $129" }],
});
// Result: Model responds: "Great choice! Want me to track your order?"

updateModelContext()

Use this for background awareness. It quietly saves information for the model to use later without interrupting the flow. This is perfect for tracking browsing history or cart updates without triggering a chat response every time.

// User is browsing - no immediate response needed
await app.updateModelContext({
  content: [{ type: "text", text: "User is viewing: Nike Air Max, Size 10, $129" }],
});
// Result: No response. But if the user later asks, "What was I looking at?", the model knows.
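
If your app has many UI events, it can help to decide in one place which of the two methods each event should use. This sketch only makes that decision and returns it; the UiEvent and ModelUpdate types and the routeEvent function are hypothetical, and the event kinds are just the two examples from this tip.

```typescript
// Hypothetical UI events, mirroring the two examples above.
type UiEvent =
  | { kind: "purchase"; item: string; price: number }
  | { kind: "view"; item: string };

// Which SDK method to call, and the text to send with it.
type ModelUpdate = {
  method: "sendMessage" | "updateModelContext";
  text: string;
};

function routeEvent(event: UiEvent): ModelUpdate {
  switch (event.kind) {
    case "purchase":
      // Active trigger: the model should respond right away.
      return {
        method: "sendMessage",
        text: `I just purchased ${event.item} for $${event.price}`,
      };
    case "view":
      // Background awareness: store it quietly for later questions.
      return {
        method: "updateModelContext",
        text: `User is viewing: ${event.item}`,
      };
  }
}

console.log(routeEvent({ kind: "purchase", item: "Nike Air Max", price: 129 }));
console.log(routeEvent({ kind: "view", item: "Nike Air Max" }));
```

From there, the View would wrap the text in a content block and pass it to app.sendMessage or app.updateModelContext, as in the two examples above.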

Tip 5: Control Who Can Trigger Tools

With a standard MCP server, the model sees your tools, interprets the user's prompt, and calls the right tool. If a user says "delete that email," the model decides what that means and invokes the delete tool.

However, with an MCP App, tools can be triggered in two ways: the model interpreting the user's prompt, or the user interacting directly with the UI.

By default, both can call any tool. For example, say you build an MCP App that visually surfaces an email inbox and lets users interact with emails. Now there are two potential triggers for your tools: the model acting on a prompt to delete an email, and the user clicking a delete button directly in the App's interface.

The model works by interpreting intent. If a user says "delete my old emails," the model has to decide what "old" means and which emails qualify. For some actions like deleting emails, that ambiguity can be risky.

When a user clicks a "Delete" button next to a specific message in your MCP App, there is no ambiguity. They have made an explicit choice.

To prevent the model from accidentally performing high-stakes actions based on a misunderstanding, you can use tool visibility to restrict certain tools to the MCP App's UI only. This allows the model to display the interface while requiring a human click to finalize the action.

You can define visibility using these three configurations:

["model", "app"] (default) — Both the model and the UI can call it
["model"] — Only the model can call it; the UI cannot
["app"] — Only the UI can call it; hidden from the model

Here's how you might implement this:

// Model calls this to display the inbox
registerAppTool(server, "show-inbox", {
  description: "Display the user's inbox",
  _meta: {
    ui: {
      resourceUri: "ui://email/inbox.html",
      visibility: ["model"],
    },
  },
}, async () => {
  const emails = await getEmails();
  return { content: [{ type: "text", text: JSON.stringify(emails) }] };
});

// User clicks delete button in the UI
registerAppTool(server, "delete-email", {
  description: "Delete an email",
  inputSchema: { emailId: z.string() },
  _meta: {
    ui: {
      resourceUri: "ui://email/inbox.html",
      visibility: ["app"],
    },
  },
}, async ({ emailId }) => {
  await deleteEmail(emailId);
  return { content: [{ type: "text", text: "Email deleted" }] };
});
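
The visibility rules are easy to reason about as a small predicate. This is only a model of the rules listed above, not the host's implementation; the Caller type and canCall function are hypothetical names.

```typescript
type Caller = "model" | "app";

// Per the list above, a tool with no explicit visibility is callable by both.
const DEFAULT_VISIBILITY: Caller[] = ["model", "app"];

function canCall(caller: Caller, visibility: Caller[] = DEFAULT_VISIBILITY): boolean {
  return visibility.includes(caller);
}

// show-inbox is model-only; delete-email is UI-only.
console.log(canCall("model", ["model"])); // true
console.log(canCall("model", ["app"])); // false
console.log(canCall("app", ["app"])); // true
console.log(canCall("app")); // true
```

With this setup, a prompt like "delete my old emails" can lead the model to show the inbox, but the delete itself still requires a click.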

Start Building with goose and MCPJam

MCP Apps open up a new dimension for agent interactions. Now it's time to build your own.

Test with MCPJam — the open source local inspector for MCP Apps, ChatGPT apps SDK, and MCP servers. Perfect for debugging and iterating on your app before shipping.
Run in goose — an open source AI agent that renders MCP Apps directly in the chat interface. See your app come to life in a real agent environment.

Ready to dive deeper? Check out the MCP Apps tutorial or build your first MCP App with MCPJam.

How I Used RPI to Build an OpenClaw Alternative

Rizèl Scarlett — Thu, 19 Feb 2026 06:44:39 +0000

Everyone on Tech Twitter has been buying Mac Minis so they can run a local agentic tool called OpenClaw. OpenClaw is a messaging-based AI assistant that connects to platforms such as Discord and Telegram, allowing you to interact with an AI agent through DMs or @mentions. Under the hood, it uses an agent called Pi to execute tasks, browse the web, write code, and more.

Seeing the hype made me want to get my hands dirty. I wanted to see if I could build a lite version for myself. I wanted something minimal that used goose as the engine instead of Pi. I tentatively dubbed it AltOpenClaw.

Choosing RPI

My usual move is to just jump in, start breaking things, and refactor as I go. I actually prefer the back and forth conversation with an agent because it helps me learn how the project works in real time. But when I tried that here, I hit a wall fast. goose did not naturally know what OpenClaw was, and it kept hallucinating how to use its own backend. It would forget context mid-conversation or suggest API calls that simply did not exist.

I realized I needed to change my approach. While I love the iterative learning process, I needed a way to give the agent a better foundation so our pair programming sessions actually made progress. I decided to try the RPI method (Research, Plan, Implement). This is a framework introduced by HumanLayer that trades raw speed for predictability. It is built into goose as a series of recipes. Since I did not fully understand the technical landscape myself, this investment in structure felt like the right move to help us both get on the same page.

Research

First, I needed goose to understand what I was building and whether it was even possible. I kicked things off with a detailed research prompt:

/research_codebase topic="learn what openclaw is, how people use it, 
and how it works. learn if goose can actually be used as a backend 
or if that's not yet possible; understand the port issues especially 
if you have an instance of goose that's running to help you build 
an agent that uses goose as a backend. learn if there will be any 
auth issues"

goose spawned multiple parallel subagents to investigate.

Key findings from the research:

OpenClaw uses its own embedded agent runtime (Pi), not goose. This meant there was no existing integration to copy.
goose CAN be used as a backend! The goosed server exposes a full HTTP API.
Port conflicts are manageable. We just needed to run on a different port with GOOSE_PORT=3001.
Authentication is simple. We could pass a secret key in the X-Secret-Key header.

The research also mapped out all the relevant API endpoints, such as POST /sessions to create a new session and POST /sessions/{id}/reply to handle the actual messaging.
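
To show how those findings fit together, here is a sketch that builds the two requests without sending them. The port, the X-Secret-Key header, and the two endpoints come from the research notes above. Everything else is an assumption: plain HTTP on localhost, the JSON body shapes (an empty object and a message field), and the helper names.

```typescript
// A plain description of an HTTP request, so nothing is sent in this sketch.
type RequestSpec = {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
};

const GOOSE_PORT = 3001; // separate port to avoid clashing with a running goose instance

function buildRequest(
  path: string,
  secretKey: string,
  payload: unknown,
  port: number = GOOSE_PORT,
): RequestSpec {
  return {
    method: "POST",
    url: `http://localhost:${port}${path}`, // assumption: goosed is reachable on localhost over HTTP
    headers: { "Content-Type": "application/json", "X-Secret-Key": secretKey },
    body: JSON.stringify(payload),
  };
}

// POST /sessions: create a new session. The empty body is an assumption.
function createSessionRequest(secretKey: string): RequestSpec {
  return buildRequest("/sessions", secretKey, {});
}

// POST /sessions/{id}/reply: send a message. The { message } body is an assumption.
function replyRequest(sessionId: string, message: string, secretKey: string): RequestSpec {
  return buildRequest(`/sessions/${encodeURIComponent(sessionId)}/reply`, secretKey, { message });
}

console.log(createSessionRequest("my-secret").url);
console.log(replyRequest("abc123", "tell me a joke", "my-secret").url);
```

Check the actual request and response shapes against the goosed API before relying on this.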

Plan

With the research complete, I asked goose to create an implementation plan. This is where we defined the personality and security of the bot:

/create_plan ticket-or-context="I want to build a Discord MCP server 
for goose that replicates the popular features of OpenClaw but with 
better security. Core Features: Users can DM the bot or @ it in a 
channel to give goose tasks. goose responds in Discord with results. 
Security requirements: Allowlist (only specific Discord user IDs can 
interact), Approval flow (before goose executes any tool/action, the 
bot posts what it wants to do and waits for user approval), 
Non-allowlisted users get a polite 'you don't have access'"

goose analyzed the requirements and produced a detailed plan with four phases:

Phase 1: Project Setup (Discord.js skeleton and allowlist)
Phase 2: goose HTTP Client (Connecting to the API and handling SSE streaming)
Phase 3: Tool Approval Flow (The UI for ✅/❌ reactions)
Phase 4: Polish & Error Handling (Slash commands and session management)

I liked this phased approach because it gave us less to debug at each step. We could handle features in chunks rather than trying to fix everything at once.
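
The allowlist requirement from the plan is the simplest piece to picture. This is a minimal sketch rather than the code goose wrote: the user ID is a placeholder, and the Decision type and checkAccess function are hypothetical.

```typescript
// Placeholder Discord user ID; a real bot would load these from config.
const allowlist: ReadonlySet<string> = new Set(["111111111111111111"]);

type Decision = { allowed: true } | { allowed: false; reply: string };

// Only allowlisted users may give goose tasks; everyone else gets a polite refusal.
function checkAccess(userId: string, allowed: ReadonlySet<string>): Decision {
  return allowed.has(userId)
    ? { allowed: true }
    : { allowed: false, reply: "Sorry, you don't have access to this bot." };
}

console.log(checkAccess("111111111111111111", allowlist)); // allowed
console.log(checkAccess("222222222222222222", allowlist)); // polite refusal
```

Running this check before anything is forwarded to goose keeps non-allowlisted messages from ever reaching the agent.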

Implement

With the plan in place, I gave the signal to start building:

/implement_plan start building

The first two phases were surprisingly smooth. Within an hour, the bot was online and I could actually DM it. Seeing a Discord message trigger a goose session for the first time was a massive win.

First, we tested if AltOpenClaw could respond to me with a joke!

However, as every developer knows, it was not all perfect. We still ran into some classic real-world hurdles during implementation:

The SSE (Server-Sent Events) format was different than we expected. We spent a good chunk of time debugging why the messages were not appearing until we realized the event structure was nested deeper than anticipated.
My local PATH did not include npm, which led to a brief detour.
Discord has a strict limit on message length. If goose wrote a long script, the bot would just crash. We had to implement a chunking system on the fly.
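
For the message length problem, a chunking helper can be quite small. This is a sketch of the general technique, not the exact code from the project: chunkMessage is a hypothetical name, and 2,000 is Discord's standard per-message character limit.

```typescript
const DISCORD_MESSAGE_LIMIT = 2000; // Discord's standard per-message character limit

// Split text into pieces no longer than `limit`, preferring to break at a newline.
function chunkMessage(text: string, limit: number = DISCORD_MESSAGE_LIMIT): string[] {
  if (limit <= 0) throw new RangeError("limit must be positive");
  const chunks: string[] = [];
  let remaining = text;
  while (remaining.length > limit) {
    // Look for the last newline inside the window so code stays readable.
    const newline = remaining.lastIndexOf("\n", limit - 1);
    const cut = newline > 0 ? newline + 1 : limit;
    chunks.push(remaining.slice(0, cut));
    remaining = remaining.slice(cut);
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}

const longReply = "a".repeat(4500);
console.log(chunkMessage(longReply).map((chunk) => chunk.length)); // [2000, 2000, 500]
console.log(chunkMessage("abc\ndef\nghi", 5)); // ["abc\n", "def\n", "ghi"]
```

The bot then sends each chunk as its own message, in order. Note that a split can still land inside a Markdown code fence, which this sketch does not try to repair.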

Currently, the tool approval feature is still a work in progress. I actually got so excited that the core part of the project was working that I sat down to write this post before finishing the UI for the reactions.

The Takeaway

The RPI method felt like a superpower, even if it didn't magically delete every bug from the project. There is a big difference between fighting a hallucination and fighting a real technical challenge.

When I didn't use RPI, goose hallucinated nonexistent endpoints and tried to build a complex MCP server when a simple HTTP API was all we needed. Those are the kinds of bugs that waste hours because you are chasing ghosts.

Instead, RPI helped us clear the conceptual fog so we could focus on real implementation details like SSE parsing and character limits.

Forcing the agent to research first let it build up the context it was missing. It is a bit slower at the start (which I barely have patience for), but it turns the agent into a much more capable partner for that back and forth learning process I enjoy.

I even had AltOpenClaw push its own repository to GitHub.

Try It Out

If you want more reliability from your agent, give the RPI recipes in goose a shot:

/research_codebase
/create_plan
/implement_plan
/iterate_plan

Happy hacking!

[Boost]

Rizèl Scarlett — Wed, 14 Jan 2026 20:03:16 +0000

Amanda

Jan 13

Dynamic MCP Server discovery with goose

#ai #mcp #agents

How I Taught My Agent My Design Taste

Rizèl Scarlett — Mon, 05 Jan 2026 00:15:14 +0000

Can you automate taste? The short answer is no, but I did make my design preferences legible.

But for those interested in my experiment, I'll share the longer answer: I wanted to participate in Genuary, the annual challenge where people create one piece of creative coding every day in January.

My goal here wasn't to "outsource" my creativity. Instead, I wanted to use Genuary as a sandbox to learn agentic engineering workflows. These workflows are becoming the standard for how developers work with technology. To keep my skills sharp, I used goose to experiment with these workflows in small, daily bursts.

By building a system where goose handles the execution, I could test different architectures side-by-side. This experiment allowed me to determine which parts of an agentic workflow actually add value and which parts I should ditch. I spent a few hours focused on infrastructure to buy myself an entire month of workflow data.

💡 Skills are reusable sets of instructions and resources that teach goose how to perform specific tasks.

The Inspiration

I have to give a huge shout-out to my friend Andrew Zigler. I saw him crushing Genuary and reached out to see how he was doing it. He shared his creations and mentioned he was using a "harness."

I'll admit, I'd been seeing people use that term all December, but I didn't actually know what it meant. Andrew explained: a harness is just the toolbox you build for the model. It's the set of deterministic scripts that wrap the LLM so it can interact with your environment reliably. He had used this approach to solve a different challenge, building a system that could iterate, submit, and verify itself.

His reasoning: you spend time upfront working on a spec and establishing constraints, and then you delegate. Once you have deterministic tools with good logging, the agent is incredibly good at looping until it hits its goal.

My approach is typically very vanilla, and I lean heavily on prompting, but I was open to experimenting since Andrew was getting such excellent results.

Harness vs. Skills

Inspired by that conversation, I built two versions of the same workflow to see how they handled the same daily Genuary prompts.

Approach 1: Harness + Recipe: This lives in /genuary. Following Zig's lead, I wrote a shell script to act as the harness. It handles the scaffolding, creating folders and surfacing the daily prompt, so goose doesn't have to guess where to go. The recipe is about 300 lines long and fully self-contained.
Approach 2: Skills + Recipe: This lives in /genuary-skills. This recipe is much leaner because it delegates the "how" to a skill. The skill contains the design philosophy, references, and examples. I wanted to see how the work changed when the agent had to "discover" its instructions in a bundle rather than following a flat script.

I spent one focused session building the entire system: recipes, skills, harness scripts, templates, and GitHub Actions. (This happened in the quiet hours of my December break, with my one-year-old sleeping on my lap.) This was about trading short-term effort for long-term leverage. From that point on, the system did the daily work.

On Taste

The automation was smooth, but when I reviewed the output, I noticed everything looked suspiciously similar.

That's when I started to think about the discourse on how you can't teach an agent "taste." I thought about how I develop taste. I honestly develop taste by:

Seeing what's cool and copying it.
Knowing what's overplayed because you've seen it too much.
Following people with "good taste" and absorbing their patterns.

Obviously, I approached goose about this problem:

"I noticed it always does salmon colored circles..i know we said creative..any ideas on how to make sure it thinks outside the box"

goose shared that it was following a p5.js template it retrieved, which included a fill(255, 100, 100) (salmon!) value and an ellipse example. Since LLMs anchor heavily on concrete examples, the agent was following the code more than my "creative" instructions.

I removed the salmon circle from the template, but then I took it further: I asked how to ban common AI generated clichés altogether. goose searched discussions, pulled examples, and produced a banned list of patterns that scream "AI-generated."

BANNED CLICHÉS

Category	Banned Patterns
Color Crimes	Salmon or coral pink, teal and orange combinations, purple-pink-blue gradients.
Composition Crimes	Single centered shapes, perfect symmetry with no variation, generic spirals.
The Gold Rule	If it looks like an AI generated output, do not do it.

ENCOURAGED PATTERNS

Category	Encouraged Patterns
Color Wins	HSB mode with shifting hues, complementary palettes, gradients that evolve over time.
Composition Wins	Particle systems with emergent behavior, layered depth with transparency, hundreds of elements interacting.
Movement Wins	Noise-based flow fields, flocking/swarming, organic growth patterns, breathing with variation.
Inspiration Sources	Natural phenomena: starlings murmurating, fireflies, aurora, smoke, water.
The Gold Rule	If it sparks joy and someone would want to share it, you're on the right track.

goose determined this list through pattern recognition. So perhaps, agents can use patterns to reflect my taste, not because they understand beauty, but because I'm explicitly teaching them what I personally respond to.

I showed Andrew my favorite output of the three days: butterflies lining themselves up in a Fibonacci sequence.

His response was validating:

"WOW that's an incredible Fibonacci… I'd be really curious to know your aesthetic prompting. Mine leans more pixel art and mathematical color manipulation because I've conditioned it that way… I like that yours leaned softer and tried to not look computer-created… like phone wallpaper practically lol..How did you even get that cool thinned line art on the butterflies? It looks like a base image. It's so cool. Did it draw SVGs? Like where did those come from?"

Because I'd specifically told goose to look at "natural phenomena" and "organic growth," it used Bezier curves for the wings, shifted the colors based on the spiral position to create depth, and chose a warm amber-to-blue gradient instead of stark black.
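
For readers curious how "color based on spiral position" can work, here is a minimal sketch. It is not the code goose generated, and the actual layout may differ. It assumes golden-angle placement (a common way to get a Fibonacci-style spiral) and a hue that moves from roughly amber (40 degrees) to roughly blue (220 degrees) in HSB terms. `spiralPlacements`, `spacing`, and the hue defaults are hypothetical names and values.

```typescript
interface Placement {
  x: number;
  y: number;
  hue: number; // degrees on an HSB color wheel
}

// Golden angle in radians (about 137.5 degrees).
const GOLDEN_ANGLE = Math.PI * (3 - Math.sqrt(5));

// Place `count` items on a golden-angle spiral around the origin and
// assign each a hue interpolated from `hueStart` to `hueEnd`.
function spiralPlacements(
  count: number,
  spacing: number,
  hueStart: number = 40,
  hueEnd: number = 220
): Placement[] {
  const placements: Placement[] = [];
  for (let i = 0; i < count; i++) {
    // The square root keeps the density roughly even as the spiral grows.
    const radius = spacing * Math.sqrt(i);
    const angle = i * GOLDEN_ANGLE;
    // 0 at the center, 1 at the outermost item.
    const t = count > 1 ? i / (count - 1) : 0;
    placements.push({
      x: radius * Math.cos(angle),
      y: radius * Math.sin(angle),
      hue: hueStart + (hueEnd - hueStart) * t,
    });
  }
  return placements;
}
```

In a p5.js sketch, each placement would become the anchor for one butterfly, with the hue fed into HSB color mode so that items farther along the spiral drift cooler.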

Scaling Visual Feedback Loops

Both workflows use the Chrome DevTools MCP server so goose can see the output and iterate on it. Running them side by side created a conflict, because multiple instances couldn't share the same Chrome profile. I didn't want a manual step, so I asked the agent if it was possible to run Chrome DevTools in parallel. The solution was assigning each workflow its own user data directory.

# genuary recipe example
- type: stdio
  name: Chrome Dev Tools
  cmd: npx
  args:
    - -y
    - chrome-devtools-mcp@latest
    - --userDataDir
    - /tmp/genuary-harness-chrome-profile
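
The same idea can be expressed as a tiny helper that derives one profile directory per workflow. This is only a sketch: `chromeDevtoolsArgs` is a hypothetical function, the `/tmp/<workflow>-chrome-profile` naming simply mirrors the path in the recipe above, and `genuary-skills` is an assumed name for the second workflow.

```typescript
// Build the npx arguments for one workflow, giving it a Chrome profile
// directory that no other workflow shares.
function chromeDevtoolsArgs(workflow: string): string[] {
  const profileDir = `/tmp/${workflow}-chrome-profile`;
  return ["-y", "chrome-devtools-mcp@latest", "--userDataDir", profileDir];
}

// Two workflows, two separate profiles, so both can run at once.
const harnessArgs = chromeDevtoolsArgs("genuary-harness");
const skillsArgs = chromeDevtoolsArgs("genuary-skills");
```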

What I Learned

I automated execution so I could study taste, constraint design, and feedback loops.

The two approaches behaved very differently. The harness-based workflow was more reliable and efficient, but it produced more predictable results. It followed instructions faithfully and optimized for consistency.

The skills-based approach was messier. It surfaced more surprises, made stranger connections, and required more editorial intervention. But the output felt more like a collaboration than a pipeline.

What this reinforced for me is that the "AI vs. human" framing is too simplistic. Automation handles repetition and speed well. Taste still lives in constraint-setting, curation, and deciding what should never happen. I ended up not automating taste. Instead, the end result was a system that made my preferences legible enough to be reflected back to me.

See the Code

The code and full transcripts live in my Genuary 2026 repo. Each day folder contains the complete conversation history, including the pitches, iterations, and the back-and-forth between me and the agent. You can also view the creations on the Genuary 2026 site.

The Worst Thing to Happen to React and Next.js: React2Shell

Rizèl Scarlett — Wed, 31 Dec 2025 08:34:22 +0000

"I ain't reading all that. I'm happy for you tho, or sorry that happened."

That was my internal reaction when I saw the headlines about CVE-2025-55182, more commonly called React2Shell. I had deadlines to meet. I bookmarked the articles, added "read React vulnerability stuff" to my todo list, and kept working. I didn't think it would affect me anyway. Most of my projects are demos for my work in Developer Relations. Who would bother attacking those?

A few days later, someone messaged me: "Hey, your demo site redirected me to some sketchy crypto site."

Wait, what? My heart sank to my stomach.

I frantically combed through my repository. Everything looked fine. The commit history was clean, the codebase unchanged, and I couldn't find any redirect logic that would explain what users were experiencing. I felt that uncomfortable combination of confusion and exposure. Someone was exploiting my project, but I couldn't figure out how they'd gotten in.

I resorted to asking goose, my AI coding agent, to help me investigate since this was outside my usual debugging territory. It suggested the injection might be happening at the DNS level or somewhere in my Railway deployment infrastructure. I changed my DNS settings, and it looked like that fixed it.

A few days later, I went to check my demo site to see if things were still the same, and it was redirecting to the crypto site. My hands felt clammy.

Then a memory surfaced. It felt like I was Raven from That's So Raven. The memory was of a fellow open source contributor who had pinged me saying: "Hey Rizel, make sure you update your Next.js projects. That React vulnerability is really serious." At the time, I thought, "I only have a demo project that needs upgrading. I can do it later."

That's when it clicked. I realized attackers were exploiting CVE-2025-55182, a vulnerability in React Server Components that allowed them to execute arbitrary JavaScript on my server. They didn't need to touch my codebase at all. They just needed to send specially crafted HTTP requests to my application, and the vulnerable React deserialization would execute their code.

I upgraded my Next.js and React dependencies immediately, and the redirects finally stopped for good.

I felt foolish. A trusted colleague had literally warned me, and I'd put it on my "important but not urgent" list because I thought it didn't apply to me, except it did apply to me.

Understanding React2Shell

React Server Components introduced a new way for React applications to run code on the server. When you write a function marked with 'use server', React automatically creates endpoints that handle communication between your client and server using something called the Flight protocol. This protocol serializes and deserializes data as it travels back and forth.

The vulnerability was in how this deserialization worked. When your server received data through these automatically-generated Flight endpoints, it would unpack that data and process it. But the deserialization logic had a flaw. Attackers could craft malicious payloads that, when unpacked, would execute arbitrary code on your server instead of just calling your legitimate server functions.

Think about it like this: imagine you have a package delivery system where you automatically open and process every package that arrives. CVE-2025-55182 was like someone figuring out they could put a bomb inside a package, and your system would dutifully unpack and activate it without checking what was inside first.
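
To make "unpack without checking" concrete, here is a toy illustration of the general class of flaw, where untrusted input decides what gets looked up and invoked. It is not React's Flight code and not a reproduction of the exploit. `handlers`, `unsafeLookup`, and `safeLookup` are hypothetical names.

```typescript
type Handler = (arg: string) => string;

// The only functions the server intends to expose.
const handlers: Record<string, Handler> = {
  greet: (name) => `hello, ${name}`,
};

// Unsafe: a plain property lookup also resolves inherited properties,
// so a name like "constructor" returns something we never registered.
function unsafeLookup(name: string): unknown {
  return (handlers as Record<string, unknown>)[name];
}

// Safer: only accept names that are the object's own properties.
function safeLookup(name: string): Handler | undefined {
  return Object.prototype.hasOwnProperty.call(handlers, name)
    ? handlers[name]
    : undefined;
}
```

With the unsafe version, a request naming `constructor` gets back a built-in function instead of being rejected. The safe version returns `undefined` for anything that was not explicitly registered, which is the "check what's inside the package first" step from the analogy.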

In my case, the attackers were injecting code that intercepted HTTP responses and redirected users to crypto scam sites. But they could have done much worse. With arbitrary code execution on the server, they could have stolen my database credentials, exfiltrated my environment variables and API keys, installed persistent backdoors, or used my server for crypto mining. I got lucky.

Who Was Affected

This vulnerability affected applications using React Server Components across several common setups.

Affected versions included:

Next.js App Router on 15.x, 16.x, and 14.x canary releases after 14.3.0-canary.77
React versions 19.0, 19.1.0, 19.1.1, and 19.2.0

The impact extended beyond Next.js. Any framework experimenting with or building on React Server Components was potentially affected, including Vite RSC plugins and React Router’s unstable RSC APIs.
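
For a quick triage of a single project, a check along these lines can flag an installed React version from the list above. It is a rough sketch, not a substitute for the official advisory: `isAffectedReactVersion` is a hypothetical helper, "19.0" is assumed to mean 19.0.0, and the input is assumed to be an exact installed version (for example from a lockfile or `npm ls react`), not a semver range.

```typescript
// Versions listed above as affected; "19.0" is treated as 19.0.0 (assumption).
const AFFECTED_REACT_VERSIONS = new Set(["19.0.0", "19.1.0", "19.1.1", "19.2.0"]);

// Return true when an exact installed version appears in the affected list.
function isAffectedReactVersion(installed: string): boolean {
  // Tolerate surrounding whitespace and a leading "v".
  const cleaned = installed.trim().replace(/^v/, "");
  return AFFECTED_REACT_VERSIONS.has(cleaned);
}
```

A `false` result here only means the version is not on this short list. It says nothing about the Next.js side, so upgrading is still the safer default.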

Why This Was So Severe

Because it required no authentication, no user interaction, and had near-perfect reliability, this vulnerability was rated CVSS 10.0, the maximum severity score. Security researchers observed active exploitation in the wild starting December 5th, just two days after the patches were released, with some of the activity linked to threat groups with suspected ties to state actors.

What You Should Do Right Now

If you're running any React or Next.js application with Server Components or Server Actions, upgrade immediately. Not later, not after you finish your current sprint, not once you've checked whether you're actually affected.

After upgrading, rotate your secrets. Even if you didn't notice any exploitation like I did, if you were vulnerable, you should assume attackers might have accessed your environment variables. Change your database passwords, API keys, deployment tokens, and anything else sensitive.

Lessons Learned

This experience reminded me that I’m not exempt from security vulnerabilities. Even toy projects and demos matter because they're running on real servers with real access to real infrastructure.

Online security often feels like someone else's problem until it becomes your problem. Don't wait until you're scrambling to understand how your project got compromised.

My Predictions for MCP and AI-Assisted Coding in 2026

Rizèl Scarlett — Wed, 31 Dec 2025 01:36:04 +0000

I'm writing this fully aware that predictions about AI often age badly.

I don't want to sound like those CEOs who confidently announce that AI will replace engineers in six months, only to quietly move the timeline when nothing happens. Instead, this is a personal thought experiment.

I've been experimenting with AI-assisted coding since it was still taboo to admit you were doing it. I started in 2021 while working at GitHub, helping developers understand the value of well-written prompts through GitHub Copilot. I was an early user of ChatGPT, alongside Claude and many other tools, long before "prompting" became its own discipline.

Today, I'm a Developer Advocate for goose, which serves as a reference implementation for Model Context Protocol and one of the first MCP clients. I use multiple MCP servers in my daily workflow to solve real problems.

All of that gives me a decent sense of where things might head next.

So I decided to make a few predictions for 2026, mostly to sharpen my own visionary skills. Will any of these come true? Would I tweak them a year from now? Let's find out.

These are my personal opinions. I'm not speaking on behalf of my employer or any project I work on.

Prediction 1: AI Code Review Gets Solved

By the end of 2026, I believe we'll have cracked AI code review.

Right now, one of the biggest bottlenecks in software development, especially in open source, is review capacity. People generate code faster than ever with AI, but that speed shifts pressure downstream. Maintainers, tech leads, and engineering managers now face more pull requests, more diffs, and more surface area to validate.

We already see AI-powered code review tools, but none fully hit the mark. They often feel noisy, overly rigid, or disconnected from real-world developer workflows.

Recently, Aiden Bai publicly shared thoughtful, constructive feedback on how AI code review tools like CodeRabbit could improve.

Beyond the controversy around how CodeRabbit responded, the attention his tweet received signaled something important: developers are actively hoping for a better solution.

By the end of 2026, I expect either an existing product to meaningfully level up or a new company to enter and get it right. This is one of the most pressing problems in the space, and I think the industry will prioritize fixing it.

If you want to stay on top of developments in AI code review, I recommend following Nnenna Ndukwe / @nnennahacks

Prediction 2: MCP Apps Become the Default

I think MCP Apps will become a core part of how people interact with AI agents.

MCP Apps are the successor to MCP-UI, which first showed that agents didn't need to respond with text alone, but could render interactive interfaces directly inside the host environment. Think embedded web UIs, buttons, toggles, and selections. Users express intent through interaction rather than explanation.

As this pattern gained traction, it became clear that interactive interfaces needed first-class support in the protocol itself. MCP Apps build on that momentum and are now being incorporated into the MCP standard.

Below is a video of MCP-UI in action:

This matters beyond developer ergonomics. For years, companies tried to keep users inside their apps with embedded chatbots, hoping increased "stickiness" would drive revenue. That approach never fully worked. Meanwhile, user behavior shifted. People now go directly to AI tools like ChatGPT for answers instead of navigating websites, even if they aren't engineers.

MCP Apps flip the model. Instead of pulling users into your app, your app meets users inside their AI environment.

We already see early adoption. OpenAI is moving in this direction with ChatGPT, and goose adopted MCP-UI early and is close to shipping full MCP Apps support. Other platforms are taking similar steps.

To learn more about MCP Apps, check out this blog post.

Prediction 3: Agents Become Portable Across Platforms

I think agents will follow users wherever they work.

Today, I benefit heavily from MCP servers because they make it possible to connect agents to tools and systems. Still, there's friction. Many users grow attached to a specific agent and want it available across environments without constant reconfiguration.

This is where Agent Client Protocol becomes interesting. ACP allows an agent to run inside any editor or environment that supports the protocol, without tightly coupling it to a specific plugin or extension.

We felt this pain firsthand with goose. Maintaining a VS Code extension proved difficult. goose would evolve, the extension would lag, and users would hit breakage. ACP changed that dynamic. Instead of tightly coupling the agent to a plugin, the editor becomes the client.

Zed Industries introduced this model. When I tried goose inside the Zed editor, the experience felt noticeably smoother. Editors from JetBrains have also adopted the protocol. ACP tends to get less attention than MCP, partly because it's less flashy and partly because the acronym overlaps with other agent-related protocols. Even so, the impact is real.

Here's where I get more ambitious. I don't think this stops at editors. Over time, agent portability may extend to design tools, browsers, and other platforms. I can imagine bringing goose, Codex, or Claude Code directly into tools like Figma without rebuilding the integration each time. This part is more speculative, but the direction feels plausible.

Prediction 4: DIY Agent Configuration Hits a Ceiling

This one feels riskier to say out loud, but I think we eventually move away from heavy context engineering and excessive configuration.

Right now, we compensate for model limitations by adding layers of structure: rules files, memory files, subagents, reusable skills, system prompt overrides, toggles, and switches. All of these help agents behave more reliably, and in many cases, they're necessary, especially for large codebases, legacy systems, and high-impact code changes.

As an engineer, I find this exciting. Configuring my setup feels participatory. I enjoy shaping how an agent reasons and responds. There's satisfaction in tuning behavior instead of treating AI as a black box.

But there's another side we haven't fully felt the consequences of yet.

Every week introduces a new "best practice." Another rule or configuration users feel pressure to adopt. At some point, the overhead may outweigh the benefit. Instead of building, people spend more time configuring the act of building.

I already see developers opting out. Some reject AI because of poor early experiences. Others reject it because the process feels exhausting. They just want to write code.

I've seen this pattern before. When Kubernetes became widely adopted, it unlocked enormous power but also exposed developers to infrastructure complexity they weren't meant to manage. The response wasn't to turn every developer into a Kubernetes expert, but to introduce platform teams, DevOps roles, and abstractions that absorbed that complexity.

I don't want to leave anyone behind in this AI era.

It genuinely makes me sad when we say things like "People will get left behind," so I'm brainstorming ways to make sure everyone "eats".

When we approach a similar inflection point with agents, I see two likely paths forward:

Tooling improves to the point where most configuration fades into the background.
Companies formalize roles around AI enablement. I've already seen early versions of this. We have internal AI champions and enablement groups (led by my manager Angie Jones) that help teams use agents safely and effectively.

Personally, I hope for balance. I enjoy configuration and depth, but I don't think productivity scales if every repo demands a complex setup just to get started.

Those are my predictions for 2026. Let's revisit this in a year and see what holds up.

What are your predictions? And what do you think of mine?