<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Scott Raisbeck</title>
    <description>The latest articles on Forem by Scott Raisbeck (@scott_raisbeck_24ea5fbc1e).</description>
    <link>https://forem.com/scott_raisbeck_24ea5fbc1e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3565620%2Fec051b8b-289d-4a0f-8b3e-90c79164d647.png</url>
      <title>Forem: Scott Raisbeck</title>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/scott_raisbeck_24ea5fbc1e"/>
    <language>en</language>
    <item>
      <title>Agentic Shell - cli agent adaption layer</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Sun, 29 Mar 2026 22:39:22 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/agentic-shell-cli-agent-adaption-layer-l7m</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/agentic-shell-cli-agent-adaption-layer-l7m</guid>
      <description>&lt;p&gt;Hey folks,&lt;/p&gt;

&lt;p&gt;TL;DR: I spent today writing an adaptation layer for CLI-agent shell requests, having coded the same thing across multiple agents on several other projects, and &lt;a href="https://github.com/ScottRBK/agent-shell" rel="noopener noreferrer"&gt;open sourced it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since the advent of &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch&lt;/a&gt; and, to be frank, way before then, &lt;a href="https://www.youtube.com/watch?v=Jr2auYrBDA4&amp;amp;lc=UgyJlILOH1P4qiVUCMF4AaABAg" rel="noopener noreferrer"&gt;when Tyson Fury taught many of us how to use coding agents&lt;/a&gt;, people have been experimenting with ways of running coding agent harnesses inside deterministic frameworks, moving beyond the Wiggium loop itself:&lt;/p&gt;

&lt;h3&gt;
  
  
  A wild wiggium appears
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prompt.md | claude &lt;span class="nt"&gt;-p&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As part of my own fork of autoresearch, I put a more deterministic wrapper around it, removing the ability for the agent to decide whether or not to commit, and instead having it based on the output of model training and validation phases. &lt;/p&gt;

&lt;p&gt;I should note here that Andrej has stated he prefers not to do this with his own autoresearch; I saw a tweet to this effect and also heard him say he prefers an interactive approach during his &lt;a href="https://www.youtube.com/watch?v=kwSVtQ7dziU" rel="noopener noreferrer"&gt;interview with Sarah Guo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If we take a look at the autoresearch sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85m6kl5lit729ralw9lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85m6kl5lit729ralw9lu.png" alt=" " width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;
Figure 1.1 - Autoresearch Sequence Diagram



&lt;p&gt;As we can see, we are relying on the agent's discretion for when the loop ends. In addition, we are relying on the agent reliably determining that the results were indeed favourable.&lt;/p&gt;

&lt;p&gt;For frontier models in a good agentic harness like Claude Code, this has allowed Karpathy to run the loop for approximately two days. The approach also allowed him to interact with the model during the research, which might suit certain workflows.&lt;/p&gt;

&lt;p&gt;The other thing to consider is that with each loop the context window grows and we see more summarisation. Summarisation / compression has come a long way, and I often have development sessions with an AI that span several compression cycles. However, if the loop is running autonomously, you are at the mercy of the agentic harness's compression configuration, and summarisation might kick in at a time that is not ideal.&lt;/p&gt;

&lt;p&gt;All of the above assumes frontier models. Opus 4.6 and GPT 5.4 aren't cheap, and running loops 24/7 might not be within everyone's budget.&lt;/p&gt;

&lt;p&gt;This is where I have started to explore loops with local models using OpenCode. While the constraints of these models are diminishing, I doubt one could run in a loop over two days from a single prompt into OpenCode.&lt;/p&gt;

&lt;p&gt;Instead I am looking at a pattern that combines autoresearch with Ralph, but with deterministic gates, limiting the agent's exposure to focused tasks within the workflow.&lt;/p&gt;

&lt;p&gt;Suppose we modified the approach detailed in &lt;em&gt;Figure 1.1&lt;/em&gt; to instead utilise a &lt;code&gt;research_harness&lt;/code&gt; that puts deterministic gates around the loop itself and around whether the results of that loop get committed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftewkmjqzvo37rwxmgyg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftewkmjqzvo37rwxmgyg3.png" alt=" " width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
Figure 1.2 - Autoresearch with a deterministic harness

 

&lt;p&gt;We could write a script that runs in a loop, passes the contents of &lt;code&gt;prompt.md&lt;/code&gt; to a headless agent, then deterministically measures the results at the end and programmatically commits the changes to GitHub if they represent an improvement.&lt;/p&gt;
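&lt;p&gt;As a rough sketch of what such a wrapper might look like (everything here apart from &lt;code&gt;prompt.md&lt;/code&gt; — the &lt;code&gt;validate.sh&lt;/code&gt; script, the &lt;code&gt;best_loss.txt&lt;/code&gt; file — is an illustrative assumption, not part of any existing tool):&lt;/p&gt;

```python
import subprocess
from pathlib import Path


def improved(before: float, after: float, min_delta: float = 1e-4) -> bool:
    """Deterministic gate: the new validation metric must beat the old one
    by at least min_delta before anything is committed."""
    return (before - after) >= min_delta


def loop_once(prompt_file: str = "prompt.md") -> None:
    """One pass of the wrapper described above. validate.sh and
    best_loss.txt are hypothetical names for this sketch."""
    prompt = Path(prompt_file).read_text()
    # Fresh headless session each pass (no session id), i.e. a Ralph-style loop.
    subprocess.run(["claude", "-p", prompt], check=True)
    # validate.sh is assumed to print the new validation loss on stdout.
    out = subprocess.run(["./validate.sh"], capture_output=True, text=True, check=True)
    after = float(out.stdout.strip())
    before = float(Path("best_loss.txt").read_text())
    if improved(before, after):
        # Only the deterministic gate, never the agent, decides to commit.
        Path("best_loss.txt").write_text(str(after))
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", f"auto: loss {before} -> {after}"], check=True)
    else:
        # Discard the attempt so the next iteration starts clean.
        subprocess.run(["git", "checkout", "--", "."], check=True)
```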

&lt;p&gt;Hopefully you can see where this is heading. Not only would we be adding a bit more determinism to the workflow, we would also have the option of turning this into what is effectively a Ralph loop (provided we didn't pass a session id as part of the headless call).&lt;/p&gt;

&lt;p&gt;Across several recent projects, I have coded up similar headless calls against differing agentic harnesses, to the point that I realised it was worth abstracting this into a single package I can reuse.&lt;/p&gt;

&lt;p&gt;So I've created a harness that allows you to call the agents headlessly and then receive responses as either an &lt;code&gt;AgentResponse&lt;/code&gt; for synchronous calls or &lt;code&gt;StreamEvent&lt;/code&gt; for asynchronous streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamEvent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the PyPI package using your favourite package manager and then use it like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_shell.shell&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentShell&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_shell.models.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentType&lt;/span&gt;

&lt;span class="n"&gt;shell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentShell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AgentType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE_CODE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor the auth module&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Edit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Session: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given that this is something other people will probably be looking to do, I have open sourced it as a project named &lt;a href="https://github.com/ScottRBK/agent-shell" rel="noopener noreferrer"&gt;agent-shell&lt;/a&gt; on GitHub, so feel free to use it yourselves. It has more examples covering different agent types and streaming versus non-streaming.&lt;/p&gt;

&lt;p&gt;It currently supports Claude Code and OpenCode; I will be working on the remaining CLI agents over the coming days.&lt;/p&gt;

&lt;p&gt;Edit:&lt;br&gt;
I've &lt;a href="https://github.com/ScottRBK/agent-shell/tree/main/agent-skills" rel="noopener noreferrer"&gt;added some skills&lt;/a&gt; to the agent-shell repo as well. A coding agent could probably figure this out fairly easily after a bit of back and forth, but ultimately these should save time on some of that:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;invoking-cli-agents&lt;/code&gt; is the base skill, and &lt;code&gt;delegating-code-review&lt;/code&gt; is an example of chaining the base skill with an extended skill for a specific task.&lt;/p&gt;

&lt;p&gt;Having your coding agent invoke other coding agent applications can lead to some interesting outcomes, I am sure :).&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Forgetful gets procedural and prospective memory</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Sun, 22 Mar 2026 23:22:02 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/forgetful-gets-procedural-and-prospective-memory-4han</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/forgetful-gets-procedural-and-prospective-memory-4han</guid>
      <description>&lt;p&gt;So this weekend finally saw me get another version of &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6"&gt;forgetful&lt;/a&gt; published. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ScottRBK/forgetful/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;Version 0.3.0&lt;/a&gt; has started to see the tool move to the next phase of development. &lt;/p&gt;

&lt;p&gt;Operating initially as the semantic memory layer, where I could store and access memories across multiple agent harnesses, such as Claude Code, OpenCode, Gemini CLI and also my own agent harnesses, forgetful has been everything I've needed it to be thus far.&lt;/p&gt;

&lt;p&gt;In my work developing my own private version of OpenClaw (it's not quite the same, but without writing an entire post about it, that's a lazy way to abstract the concept), I have moved on to another layer of memory beyond just semantic recall.&lt;/p&gt;

&lt;p&gt;I have been working on procedural, episodic and prospective types of memory.&lt;/p&gt;

&lt;p&gt;Semantic memory is the type most commonly associated with memory agents: the capture and retrieval of knowledge, usually in the form of observations or facts. Semantic storage is often the cornerstone of any memory MCP.&lt;/p&gt;

&lt;p&gt;What is perhaps less common amongst these are the other types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Procedural&lt;/strong&gt; memory represents learned behaviour: an agentic system, as well as being able to store and recall facts and observations, should be able to turn those facts and observations into useful tools.&lt;/p&gt;

&lt;p&gt;We actually see this quite a lot now in our agentic harnesses in the form of skills or commands. There is even an &lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;open standard&lt;/a&gt; for skills now. Once I had played about with skills in my own agent harness I realised that storing them in forgetful so I could share them easily across agents, devices and platforms was a good fit. As of 0.3.0 these are now first class citizens in forgetful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prospective&lt;/strong&gt; memory is about the ability to set objectives and plans and then see them through. Anyone developing agentic systems knows how critical this functionality is. I did debate whether having this in forgetful would be useful; surely the source of truth for planning needs to be the agent harness itself.&lt;/p&gt;

&lt;p&gt;What convinced me otherwise was that I found myself more and more using multiple agentic harnesses to complete a single objective. A very simple example would be having Claude Opus 4.6 put together a plan for a new feature, having Qwen Coder Next implement it in OpenCode, and then finishing with Codex 5.3 reviewing the output in Copilot CLI.&lt;/p&gt;

&lt;p&gt;Within my own agentic harnesses, however, the feature became more and more useful, as in my own version of OpenClaw I have multiple agents working across a single objective. By introducing the prospective (planning/objectives) layer into forgetful, I could simplify my agentic harness software itself. The same can be said for the skills functionality.&lt;/p&gt;

&lt;p&gt;I should call out another thing that convinced me: a user of forgetful (twsta) posted in the &lt;a href="https://discord.gg/Nj9egs423H" rel="noopener noreferrer"&gt;discord&lt;/a&gt; a skill for managing work and todos, based on how they used to use &lt;a href="https://logseq.com/" rel="noopener noreferrer"&gt;Logseq&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The last memory type I discussed was &lt;strong&gt;episodic&lt;/strong&gt;, which I consider more a memory of what has happened. The obvious version of this is what has occurred inside a single context window. However, I think there is something to be said for an agent being able to navigate back through the actual details of what has occurred, even though those events might have moved outside of its context window, or are from another session entirely (perhaps even with another agent!).&lt;/p&gt;

&lt;p&gt;I am currently experimenting with this functionality in my agent harness and have not yet decided to move it across to forgetful; perhaps I never will, unless users ask for it as a feature.&lt;/p&gt;

&lt;p&gt;This aligns more and more with my opinion on the current state of architecture for Transformer-based LLMs and the agentic harnesses around them.&lt;/p&gt;

&lt;p&gt;What I've tried to build here is a framework where someone looking to build agentic harnesses can abstract away much of the complexity that comes with memory management and focus on the harness's functionality itself.&lt;/p&gt;

&lt;p&gt;In addition, you can use it for memory management across existing agentic harnesses, reducing some of the friction of switching from one coding agent, device or platform to another.&lt;/p&gt;

&lt;p&gt;If you are interested in this sort of stuff, please check out the &lt;a href="https://discord.gg/Nj9egs423H" rel="noopener noreferrer"&gt;discord&lt;/a&gt;. We have a small, quite laid-back and relaxed community of people interested in all things agentic, and we welcome those who share the interest. But please, no merchants of hype; there are plenty of spaces on the internet for that :).&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>memory</category>
    </item>
    <item>
      <title>Did some actual coding today - found a blind spot example for coding agents</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Sun, 22 Feb 2026 20:42:56 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/did-some-actual-coding-today--27in</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/did-some-actual-coding-today--27in</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - Wrote some code for a new feature, had to refactor existing code to ensure we didn't have a double spend on sorting, tested to see if coding agents would spot the same issue - they didn't. &lt;/p&gt;

&lt;p&gt;So for the past few weeks I have been busy as hell, my GitHub activity bar, like many of us now, is lit up like a Christmas tree.&lt;/p&gt;

&lt;p&gt;I am now more or less 100% utilising coding agents, with the exception of producing 'Walking Skeletons' on anything new that I haven't tackled before (usually still with the help of AI - more so to get an understanding of it than anything else). Most of this kind of activity is taking place in Go projects, as that is something I've been learning over the past few months.&lt;/p&gt;

&lt;p&gt;I realised the other day, that while I had been reviewing a lot of Python and C# (my more native languages), I would struggle to write something from scratch on my own again without using an AI.&lt;/p&gt;

&lt;p&gt;I have been maintaining an &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6"&gt;open source memory mcp&lt;/a&gt; for AI agents and I wanted to add some additional re-ranker providers.&lt;/p&gt;

&lt;p&gt;I figured it was a straightforward implementation: add support for a generic HttpProvider, allowing users to utilise the majority of cloud reranker providers as well as self-hosted rerankers such as those hostable on llama.cpp or vLLM.&lt;/p&gt;

&lt;p&gt;My existing reranker code for FastEmbed (the existing reranker provider) was implemented with Protocols to allow for further expansion and the implementation of adapters. The Protocol dictated that the reranker return a simple list of floats, the relevance score of each document to the query, in the order the documents were presented; the method calling it from the memory repository would then handle ordering them into the appropriate rank for top_k filtering.&lt;/p&gt;
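&lt;p&gt;That contract can be sketched roughly as a Protocol like this (names simplified and illustrative, not forgetful's actual interface):&lt;/p&gt;

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RerankAdapter(Protocol):
    """Sketch of the contract described above: scores come back in the
    same order as the input documents (the pre-refactor behaviour)."""

    async def rerank(self, query: str, documents: list[str]) -> list[float]:
        ...


class ConstantReranker:
    """Toy adapter conforming to the Protocol: scores every document 0.5,
    in input order."""

    async def rerank(self, query: str, documents: list[str]) -> list[float]:
        return [0.5] * len(documents)
```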

&lt;p&gt;Here's the code in the memory repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rerank_adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rerank_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scored_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dense_candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;scored_candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;top_k_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored_candidates&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I came to implement the HttpAdapter, I noticed the endpoint was returning both the original index and the score, already sorted by score.&lt;/p&gt;

&lt;p&gt;I realised at this point I needed to refactor the repository and the existing adapter so that the repository would now expect the reranking adapter to handle the ordering. Otherwise, we would have to take the new HTTP adapter's output, re-order it back to the original document order, and send it to the repository only for it to re-order by score again.&lt;/p&gt;
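&lt;p&gt;The refactored contract, sketched loosely: the adapter now returns &lt;code&gt;(original_index, score)&lt;/code&gt; pairs already sorted best-first, so the repository only needs to take the head of the list (a hypothetical simplification, not forgetful's exact code):&lt;/p&gt;

```python
def top_k(candidates: list[str], ranked: list[tuple[int, float]], k: int) -> list[str]:
    """Repository side after the refactor: `ranked` arrives as
    (original_index, score) pairs already sorted best-first by the adapter,
    so no second sort is needed -- just take the first k."""
    return [candidates[i] for i, _ in ranked[:k]]
```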

&lt;p&gt;Just one of those moments where you realise you need to do a little more work, though nothing major. I made the change and the tests still passed. I then asked Claude if I might have missed any impact and it confirmed I had not.&lt;/p&gt;

&lt;p&gt;I then proceeded to implement the rest of the HttpAdapter code, when I suddenly had a thought: would Claude, or any of the other models, have picked this up, or would they have just gone ahead and implemented the double spend on sorting?&lt;/p&gt;

&lt;p&gt;So I stashed the changes and fired up some coding agents to see how they would do. My workflow was simple, I used my &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/my-parting-gift-to-2025-my-claude-code-context-workflow-turned-into-a-plugin-28ii"&gt;context gather command&lt;/a&gt; and then entered the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a plan for implementing a http reranker adapter using httpx, the user should be able to configure a reranker http endpoint, model, api key (optional), as well as specifying a reranking provider of http. When the rerank provider is set to HTTP, we should use the provided environment variables to make http calls to the v1/rerank endpoint.&lt;/p&gt;

&lt;p&gt;When the http provider is configured then the memory repository should use it for ranking memories&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I tried with three models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.6 (Claude Code Agent Harness)&lt;/li&gt;
&lt;li&gt;Codex 5.3 (Co-pilot CLI Agent Harness)&lt;/li&gt;
&lt;li&gt;Gemini Pro 3.0 (Co-pilot CLI Agent Harness)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of the models implemented a change to return the sorted list of scores that would then be re-sorted back into the order we originally received from the reranker API, effectively resulting in a double spend on sorting.&lt;/p&gt;

&lt;p&gt;Now, to be clear, the models might have been trying to stick to the task at hand; the prompt did not mention any optimisation. My global CLAUDE.md file (which my Copilot CLI also reads) has some development philosophy that encourages only making changes needed to achieve the objectives, but it also includes the following in a list of development philosophy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When extending existing code, understand why the current design works the way it does, not just what it does. Don't assume existing patterns are optimal - if something seems off, raise it with the user before proceeding rather than silently conforming or silently refactoring.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is good reason you might not want them to go off on a tangent refactoring everything; at the same time, however, none of the models highlighted this inefficiency or even considered it an issue.&lt;/p&gt;

&lt;p&gt;When I encounter something like this, I think about my guardrails and prompts, and to some extent there might be something I can come up with. But part of me was also brought back to a recent talk by Professor Michael John Wooldridge at the Royal Society, &lt;a href="https://www.youtube.com/watch?v=CyyL0yDhr7I&amp;amp;t=2311s" rel="noopener noreferrer"&gt;This is not the AI we were promised&lt;/a&gt;, which I highly encourage people to watch.&lt;/p&gt;

&lt;p&gt;Anyhow, that's me for my Sunday musings.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Forgetful - Memory for AI Agents - latest update</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Tue, 10 Feb 2026 23:39:12 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/forgetful-memory-for-ai-agents-latest-update-14kp</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/forgetful-memory-for-ai-agents-latest-update-14kp</guid>
      <description>&lt;p&gt;Hey folks!&lt;/p&gt;

&lt;p&gt;Quick post here to talk about a recent release for Forgetful, the open source AI Memory MCP. &lt;/p&gt;

&lt;p&gt;I say recent; it's actually coming up to a month since I pushed the release. &lt;a href="https://github.com/ScottRBK/forgetful/releases/tag/v0.2.0" rel="noopener noreferrer"&gt;Version 0.2.0&lt;/a&gt; went out on 15th Jan.&lt;/p&gt;

&lt;p&gt;Since then I've been utilising Forgetful in some quite big projects in my day to day work and also, in the evenings when everyone in my household is asleep and I get to play, I've been building some cool little apps on the back of it. &lt;/p&gt;

&lt;p&gt;One of the key features of 0.2.0 was Server-Sent Event (SSE) streaming and activity tracking. If you are running Forgetful as a service or a docker container, it is available under &lt;code&gt;/api/v1/activity/stream&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this feature enabled, you can now have other services subscribe to an events endpoint; whenever an event takes place, such as a new memory being created, an entity's list of aliases being appended to, or an obsolete code artifact being deleted, the subscribed services will be notified.&lt;/p&gt;

&lt;p&gt;The obvious benefit is that if you are building tools, agents or applications around Forgetful, you previously would have had to poll for data changes. That is no longer the case: Forgetful now has a heartbeat, and anything that cares to listen can hear it.&lt;/p&gt;

&lt;p&gt;As part of the release itself I also optimised the graph endpoint, which is used to bring back your full knowledge graph, and it now supports pagination and sorting. &lt;/p&gt;

&lt;p&gt;All this should aid users who are looking to build visual applications around Forgetful itself, which, as coding agents get better, I know more and more of us will want to do.&lt;/p&gt;

&lt;p&gt;All this came off the back of feedback from users of the tool, collated across the Discord, Reddit and dev.to communities. That is nice, because the community aspect is part of the reason I enjoy open source, so please keep offering up ideas and contributions.&lt;/p&gt;

&lt;p&gt;I went about implementing this with Claude Code. Since Opus 4.5, I code less and mostly just have a back and forth with Claude Code, letting it write the code. Most of Forgetful itself was hand-written, or at least a significant amount of the walking skeleton was, with Sonnet 4.5 filling in the remaining verticals once I had the authorisation, memory and user verticals completed. Now, however, with new projects I reference my patterns from Forgetful so the agent itself can build the walking skeleton with me, just through conversation.&lt;/p&gt;

&lt;p&gt;From my own work I had used an event bus architecture before, so I decided that would be the obvious pattern here. I asked Claude to implement it based on an example it had seen in Forgetful, from an old project I had encoded.&lt;/p&gt;
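
&lt;p&gt;For anyone unfamiliar with the pattern, here is a minimal sketch of a synchronous event bus (the event name is made up for illustration; Forgetful's actual implementation will differ):&lt;/p&gt;

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal publish/subscribe event bus: handlers register per event type."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Fan the payload out to every handler registered for this event type
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
seen: list[dict] = []
bus.subscribe("memory.created", seen.append)
bus.publish("memory.created", {"id": 1, "title": "example"})
```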

&lt;p&gt;I then implemented a queue-based handler to emit events to SSE clients, from a separate example of another solution I had used previously and had also encoded into Forgetful.&lt;/p&gt;
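
&lt;p&gt;The rough shape of such a handler is one queue per connected SSE client, with the emitter fanning events into every queue. A hedged sketch (not Forgetful's actual code):&lt;/p&gt;

```python
import asyncio
import json

class SSEBroadcaster:
    """Fan events out to per-client queues; each client drains its own queue."""

    def __init__(self) -> None:
        self._queues: set[asyncio.Queue] = set()

    def connect(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._queues.add(queue)
        return queue

    def emit(self, event: dict) -> None:
        for queue in self._queues:
            queue.put_nowait(event)

    @staticmethod
    def format_sse(event: dict) -> str:
        # Standard SSE wire format: a "data:" line followed by a blank line
        return "data: " + json.dumps(event) + "\n\n"

async def demo() -> str:
    broadcaster = SSEBroadcaster()
    client = broadcaster.connect()
    broadcaster.emit({"type": "memory.created", "id": 42})
    return SSEBroadcaster.format_sse(await client.get())

frame = asyncio.run(demo())
```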

&lt;p&gt;I could have just told the AI to look at the code, but pointing it at a directory, specifying the exact piece of code and so on, was much harder than me just typing "use the event bus pattern I used in x". I enjoy it when the work I have put in in the past means I can be super lazy now :-D.&lt;/p&gt;

&lt;p&gt;Oh, and to show off a bit of my own front-end vibe coding (I am a backend engineer at heart) - check out my video of the personal assistant AI that I am working on.&lt;/p&gt;

&lt;p&gt;It's a pet project I've been working on for over 6 months (the memory for which was one of the main drivers behind developing Forgetful :)). &lt;/p&gt;

&lt;p&gt;Enjoy! &lt;/p&gt;

&lt;p&gt;If you would like more information on Forgetful itself, check out my initial &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6"&gt;dev.to post&lt;/a&gt; on it!&lt;/p&gt;

&lt;p&gt;Also the github repo: &lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;https://github.com/ScottRBK/forgetful&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>memory</category>
      <category>python</category>
    </item>
    <item>
      <title>My parting gift to 2025 - my Claude Code context workflow turned into a plugin</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Wed, 31 Dec 2025 19:44:52 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/my-parting-gift-to-2025-my-claude-code-context-workflow-turned-into-a-plugin-28ii</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/my-parting-gift-to-2025-my-claude-code-context-workflow-turned-into-a-plugin-28ii</guid>
      <description>&lt;p&gt;So for 2025 I just want to leave a little gift for you folks to welcome in the new year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/building-my-first-claude-code-plugin-26ka"&gt;I did an article recently&lt;/a&gt; on publishing my first Claude Code plugin (which are frickin' awesome by the way).&lt;/p&gt;

&lt;p&gt;The plugin was centred entirely around Forgetful in isolation, but the reality is that in my workflow I don't use it in isolation. There are two main flows that I am going to share with you here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Gather
&lt;/h2&gt;

&lt;p&gt;The first is what I do whenever I want to work on something. This is the first thing I do when I open up a new Claude session or &lt;code&gt;/clear&lt;/code&gt; an existing one.&lt;/p&gt;

&lt;p&gt;I write the &lt;code&gt;/context_gather&lt;/code&gt; command with the context of what I want to do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/context_gather implement entity relationship visualization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens next is Claude pulls from four different sources and synthesises them into a briefing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetful&lt;/strong&gt; - searches my memory for past decisions, patterns, and gotchas related to what I'm about to do. If I've solved a similar problem before, or made an architectural decision that affects this, it shows up here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/oraios/serena" rel="noopener noreferrer"&gt;Serena&lt;/a&gt;&lt;/strong&gt; - does live LSP-powered analysis of my current codebase. This isn't cached knowledge from when I encoded the repo months ago - it's what the code looks like right now. Symbol references, type relationships, what calls what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/upstash/context7" rel="noopener noreferrer"&gt;Context7&lt;/a&gt;&lt;/strong&gt; - grabs up-to-date framework documentation. Instead of me googling "FastAPI dependency injection" for the fifth time, it pulls the relevant sections automatically. Especially useful for frameworks I don't touch daily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSearch&lt;/strong&gt; - fills gaps when the docs aren't enough or something's changed recently.&lt;/p&gt;

&lt;p&gt;The result is a synthesised briefing specific to what I'm about to implement. Here's what you've decided before, here's how your codebase currently handles related things, here's how the framework recommends doing this.&lt;/p&gt;

&lt;p&gt;Then I actually write the code. With context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoding Repos
&lt;/h2&gt;

&lt;p&gt;The other workflow I have is to encode a repo into Forgetful so I can reference it in other projects or on other AI systems (eg when talking to ChatGPT on my mobile or when I'm using Claude Desktop).&lt;/p&gt;

&lt;p&gt;For that I use &lt;code&gt;/encode-repo-serena&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/encode-repo-serena
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kicks off a multi-phase analysis using Serena's LSP integration. It's not grep. It's not file listings. Serena understands symbols at the language level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Architecture Discovery&lt;/strong&gt; - Serena analyses the codebase structure. What classes exist, what they extend, which functions call which dependencies, where the entry points are. These get stored as atomic memories in Forgetful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1B: Dependency Analysis&lt;/strong&gt; - detects manifests (package.json, pyproject.toml, etc), parses out dependencies, validates framework usage assumptions via Context7. Creates a dependency memory so I know what this project actually uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Entity Graph Creation&lt;/strong&gt; - creates entities for major components based on Serena's reference counts (the stuff that's referenced the most is probably important). Maps relationships between them with strength values based on how tightly coupled they are. Links entities to the architecture memories.&lt;/p&gt;

&lt;p&gt;Takes a few minutes for a decent-sized repo. After that, the knowledge lives in Forgetful. I can query it from Claude Code, Claude Desktop, ChatGPT via the hosted MCP endpoint, or any other MCP-compatible agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  End of a session
&lt;/h2&gt;

&lt;p&gt;Whenever I am finished with my 'vibin..', I just use the &lt;code&gt;/memory-save&lt;/code&gt; command and it works out what is relevant, checks whether it already exists, asks me for confirmation, and then manages the links if the auto-link didn't work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Tools Pattern
&lt;/h2&gt;

&lt;p&gt;One thing worth explaining: Forgetful has 40+ tools under the hood. Projects, memories, entities, relationships, documents, code artifacts, linking, searching, the lot.&lt;/p&gt;

&lt;p&gt;But the MCP server only exposes three:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;discover_forgetful_tools&lt;/code&gt; - returns what's available, filtered by category if you want&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;how_to_use_forgetful_tool&lt;/code&gt; - full schema and examples for a specific tool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;execute_forgetful_tool&lt;/code&gt; - runs any tool by name with arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Context window. If I registered all 40+ tools directly, that's ~15-20K tokens of schemas loaded into every conversation. The meta-tools pattern drops that to ~500 tokens. Agents discover what they need, when they need it.&lt;/p&gt;

&lt;p&gt;With skills, this pattern makes even more sense. The skill files include a &lt;code&gt;TOOL_REFERENCE.md&lt;/code&gt; with complete schemas for all 40+ tools, so Claude Code can call them directly without the discovery overhead. But agents without skills (Claude Desktop, other MCP clients) still work fine through the meta-tools.&lt;/p&gt;




&lt;p&gt;I hope this is useful for others. It might not seem like much now, but it's the culmination of months of work and tinkering, coding - yes, I still code some of this stuff, what can I say? I enjoy it - and refining/reviewing with Claude.&lt;/p&gt;

&lt;p&gt;Happy new year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ScottRBK/context-hub-plugin" rel="noopener noreferrer"&gt;context-hub-plugin Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;Forgetful Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/ScottRBK/forgetful-plugin" rel="noopener noreferrer"&gt;Forgetful Stand-alone Plugin&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building my first Claude Code Plugin</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Sun, 21 Dec 2025 13:03:15 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/building-my-first-claude-code-plugin-26ka</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/building-my-first-claude-code-plugin-26ka</guid>
      <description>&lt;p&gt;I built a Claude Code plugin for &lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;Forgetful&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Forgetful is a semantic memory system I've been working on — persistent knowledge that follows you across Claude Code sessions, auto-linking between related concepts, the ability to encode repositories into searchable memories. I wrote about it &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6"&gt;here&lt;/a&gt; when I first released it.&lt;/p&gt;

&lt;p&gt;I'd been using it daily for weeks, but getting other people set up had friction. MCP configuration, environment variables, picking a database backend. The kind of thing where you send someone a README and never hear from them again.&lt;/p&gt;

&lt;p&gt;I even produced &lt;a href="https://github.com/ScottRBK/forgetful/tree/main/docs/prompts" rel="noopener noreferrer"&gt;some prompt examples as well&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Claude Code plugins seemed like the answer.&lt;/p&gt;




&lt;p&gt;If you haven't used them, &lt;a href="https://docs.anthropic.com/en/docs/claude-code/plugins" rel="noopener noreferrer"&gt;plugins&lt;/a&gt; are how you extend Claude Code with custom slash commands, agents, skills, and MCP servers. You can bundle up a workflow and share it — someone runs &lt;code&gt;/install your-plugin&lt;/code&gt; and they've got your setup.&lt;/p&gt;

&lt;p&gt;That's what I wanted. One command to get Forgetful running, plus the slash commands I actually use day-to-day packaged up so others could try them.&lt;/p&gt;

&lt;p&gt;The idea hit around Dec 17th. By the 21st I had something working. Four days, mostly evenings.&lt;/p&gt;




&lt;p&gt;Here's what ended up in the plugin:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/forgetful-setup&lt;/code&gt; — configures the MCP connection&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/memory-search&lt;/code&gt; — semantic search across your knowledge base&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/memory-save&lt;/code&gt; — create a memory with the Zettelkasten constraints (one concept, under 400 words, importance score, proper context)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/memory-list&lt;/code&gt; — browse recent memories&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/memory-explore&lt;/code&gt; — deep graph traversal, follows links and entities&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/encode-repo&lt;/code&gt; — point it at a codebase, extract memories from it&lt;/p&gt;

&lt;p&gt;Plus three skills that auto-activate: guidance on when to query vs create memories, how to curate and link things properly, how to explore the knowledge graph without getting lost.&lt;/p&gt;

&lt;p&gt;These aren't designed from scratch. They're what I was already doing, just packaged.&lt;/p&gt;




&lt;p&gt;Building it surfaced a few gotchas worth knowing about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config files don't survive updates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I shipped the first version with an &lt;code&gt;.mcp.json&lt;/code&gt; inside the plugin. Seemed logical — the plugin needs MCP config, put it in the plugin.&lt;/p&gt;

&lt;p&gt;Pushed an update. Reinstalled. Config gone.&lt;/p&gt;

&lt;p&gt;Claude Code plugins install to versioned directories: &lt;code&gt;~/.claude/plugins/forgetful-plugin@1.0.0/&lt;/code&gt;. When you update to 1.0.1, the old directory gets wiped. Clean install. Anything you'd stored in there disappears.&lt;/p&gt;

&lt;p&gt;The fix: don't ship config in the plugin at all. I rewrote &lt;code&gt;/forgetful-setup&lt;/code&gt; to run &lt;code&gt;claude mcp add forgetful --scope user -- uvx forgetful-ai&lt;/code&gt; under the hood. That writes to &lt;code&gt;~/.claude.json&lt;/code&gt; — the user's global config, outside the plugin directory. Survives updates.&lt;/p&gt;

&lt;p&gt;For custom setups (Postgres, remote hosting, auth tokens), the command fetches the live docs from GitHub and walks through the options. Didn't want to duplicate configuration guidance in the plugin where it'd go stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expensive commands need subagents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/memory-explore&lt;/code&gt; does a five-phase traversal: query → follow links → find entities → get relationships → pull entity-linked memories. Lots of tool calls.&lt;/p&gt;

&lt;p&gt;First version had the main model running all of it. Sonnet or Opus chewing through the traversal, context window filling up with raw graph data.&lt;/p&gt;

&lt;p&gt;Ripped it out, made it spawn a Haiku subagent instead. Let it explore aggressively in an isolated context, filter down to what matters, hand back a summary. Costs about a tenth as much and doesn't pollute the main conversation.&lt;/p&gt;

&lt;p&gt;I could've shipped a custom agent with the plugin, tuned specifically for graph traversal. But every agent definition you ship adds to the context window. Went with &lt;code&gt;general-purpose&lt;/code&gt; instead — it's always available, has access to all tools, and the command prompt gives it enough guidance. Lighter footprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills aren't just fancy prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I didn't get skills at first. Thought they were basically prompt files with extra steps — write some instructions in markdown, Claude reads them, done.&lt;/p&gt;

&lt;p&gt;What I'd missed: the YAML frontmatter description actually surfaces to the model. That's how Claude decides whether to activate a skill. It's not reading every skill file on every message — it's scanning descriptions and pulling in the ones that match what you're doing.&lt;/p&gt;

&lt;p&gt;Once that clicked, I realised how powerful they could be. I am obsessed with keeping my context window small, but a few lines of frontmatter feels like a worthwhile investment for the benefit you get from skills.&lt;/p&gt;

&lt;p&gt;I did a bit of googling and asked Claudus to do some research around skill generation. That, along with a bit of trial and error, taught me the pattern: write a good description (third-person, specific about when it applies) and the skill just activates when relevant. Write a vague one and Claude never finds it. The description isn't documentation for humans — it's the trigger mechanism.&lt;/p&gt;
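
&lt;p&gt;To make that concrete, here is the shape of the frontmatter (the name and description are invented for illustration, not taken from the plugin's actual skill files):&lt;/p&gt;

```yaml
---
name: memory-curation
description: Use when the user is saving, linking, or curating memories in Forgetful. Applies whenever a conversation produces a decision, pattern, or gotcha worth persisting across sessions.
---
```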



&lt;p&gt;If you want to try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add ScottRBK/forgetful-plugin
/plugin install forgetful-plugin@forgetful-plugin
/forgetful-setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard setup takes about 30 seconds.&lt;/p&gt;

&lt;p&gt;Plugin: &lt;a href="https://github.com/ScottRBK/forgetful-plugin" rel="noopener noreferrer"&gt;github.com/ScottRBK/forgetful-plugin&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Main project: &lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;github.com/ScottRBK/forgetful&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>FastMCP and GitHub AuthProxy token expiry issue</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Fri, 19 Dec 2025 16:35:22 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/githubs-secret-10-token-limit-was-killing-my-mcp-server-3e3f</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/githubs-secret-10-token-limit-was-killing-my-mcp-server-3e3f</guid>
      <description>&lt;p&gt;Note: the below is AI generated, but I figured it was worth sharing in the hope that it helps others avoid the pain :). I have validated the output to ensure no hallucinations, and tried to ensure it is relevant for anyone else encountering the same issues.&lt;/p&gt;




&lt;p&gt;My tokens kept dying. Not expiring - dying. Mid-session, no warning, just gone.&lt;/p&gt;

&lt;p&gt;I'm building &lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;Forgetful&lt;/a&gt;, an MCP server that gives AI agents persistent memory. It uses GitHub OAuth for authentication, proxied through FastMCP. And for three days, every MCP client I connected - Claude Code, Gemini CLI, VS Code - would authenticate fine, work for 15-30 minutes, then fail with &lt;code&gt;invalid_token&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I burned a lot of time chasing the wrong problems before finding the actual cause. This is that story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptoms
&lt;/h2&gt;

&lt;p&gt;Every client exhibited identical behaviour:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fresh OAuth flow → works&lt;/li&gt;
&lt;li&gt;Use tools for a while → works&lt;/li&gt;
&lt;li&gt;Come back 20 minutes later → &lt;code&gt;invalid_token&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Re-authenticate → works again&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The consistency across different clients suggested a server-side issue. But what?&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead end #1: Storage persistence
&lt;/h2&gt;

&lt;p&gt;First thought: maybe tokens aren't persisting across container restarts?&lt;/p&gt;

&lt;p&gt;I mounted a Docker volume for FastMCP's OAuth storage and cracked open the SQLite cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;forgetful sqlite3 /root/.local/share/fastmcp/oauth-proxy/cache.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT key, datetime(expires_at, 'unixepoch') FROM cache;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything was there. JTI mappings, upstream tokens, client registrations - all with sensible expiry times. Storage wasn't the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead end #2: JWT signature issues
&lt;/h2&gt;

&lt;p&gt;Maybe the JWT signing key was somehow getting rotated or mismatched?&lt;/p&gt;

&lt;p&gt;FastMCP derives its signing key using PBKDF2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PBKDF2HMAC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fastmcp-jwt-signing-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;derive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I extracted a failing token, derived the key myself, verified the signature. It matched. The JWT was valid.&lt;/p&gt;
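
&lt;p&gt;For the record, that check needs nothing beyond the standard library. This is a sketch of the verification, mirroring the derivation parameters in the snippet above (the secret and claims are placeholders; this is not FastMCP's actual code):&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json

def derive_key(secret: str) -> bytes:
    # Same parameters as the FastMCP snippet: PBKDF2-HMAC-SHA256,
    # fixed salt, 1,000,000 iterations, 32-byte key
    return hashlib.pbkdf2_hmac(
        "sha256", secret.encode(), b"fastmcp-jwt-signing-key", 1_000_000, dklen=32
    )

def b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(claims: dict, key: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(claims).encode())
    signature = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(signature)}"

def verify_hs256(token: str, key: bytes) -> bool:
    header, body, signature = token.split(".")
    expected = b64url(hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(signature, expected)

key = derive_key("example-secret")
token = sign_hs256({"jti": "abc123"}, key)
```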

&lt;p&gt;So the JWT was fine but authentication was failing. What gives?&lt;/p&gt;

&lt;h2&gt;
  
  
  The breakthrough: Debug logging
&lt;/h2&gt;

&lt;p&gt;I enabled debug logging and watched the auth flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token verified successfully for subject=None  ← JWT valid
GitHub token verification failed: 401 - "Bad credentials"  ← Upstream rejection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There it was. FastMCP uses a two-tier system: it validates its own JWT, extracts a JTI (JWT ID), looks up the corresponding GitHub token it stored server-side, then validates &lt;em&gt;that&lt;/em&gt; token with GitHub's API.&lt;/p&gt;

&lt;p&gt;My JWT was fine. But when FastMCP tried to verify the upstream GitHub token, GitHub said "Bad credentials."&lt;/p&gt;

&lt;p&gt;GitHub was invalidating tokens that FastMCP had cached. But why?&lt;/p&gt;

&lt;h2&gt;
  
  
  The smoking gun
&lt;/h2&gt;

&lt;p&gt;GitHub has a security log. Settings → Security → Security log. I filtered for OAuth events and found this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"oauth_access.destroy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"explanation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"max_for_app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-12-19T..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_for_app&lt;/code&gt;. That's a new one.&lt;/p&gt;

&lt;p&gt;Turns out GitHub OAuth Apps have a &lt;strong&gt;hard limit of 10 active tokens per user per application&lt;/strong&gt;. When you create an 11th token, GitHub automatically revokes the oldest one.&lt;/p&gt;

&lt;p&gt;I was using Claude Code, Gemini CLI, and VS Code. Each reconnection creates a new token. Three clients, multiple sessions, and I'd blown past 10 tokens without realising it. Every new authentication was killing an older client's token.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub App vs OAuth App
&lt;/h2&gt;

&lt;p&gt;The fix is switching from a GitHub OAuth App to a GitHub App. Yes, those are different things with confusingly similar names.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;OAuth App&lt;/th&gt;
&lt;th&gt;GitHub App&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Client ID prefix&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ov...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Iv...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token limit per user&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uses OAuth scopes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (uses app permissions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refresh tokens&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;td&gt;If expiration enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitHub Apps don't have the per-user token limit. You can have as many simultaneous clients as you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The implementation
&lt;/h2&gt;

&lt;p&gt;Creating a GitHub App: Settings → Developer settings → GitHub Apps → New GitHub App&lt;/p&gt;

&lt;p&gt;Key settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Callback URL: your OAuth callback endpoint&lt;/li&gt;
&lt;li&gt;Webhook: disabled (not needed for pure OAuth)&lt;/li&gt;
&lt;li&gt;Permissions: Account permissions → Email addresses → Read-only&lt;/li&gt;
&lt;li&gt;Installation: "Any account"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then update your FastMCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;FASTMCP_SERVER_AUTH_GITHUB_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Iv23li...  &lt;span class="c"&gt;# Note the Iv prefix&lt;/span&gt;
&lt;span class="nv"&gt;FASTMCP_SERVER_AUTH_GITHUB_CLIENT_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-secret-here
&lt;span class="nv"&gt;FASTMCP_SERVER_AUTH_GITHUB_REQUIRED_SCOPES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The scopes gotcha
&lt;/h2&gt;

&lt;p&gt;My first attempt failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub token missing required scopes. Has 1, needs 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had configured &lt;code&gt;REQUIRED_SCOPES=user:email,read:user&lt;/code&gt;. But GitHub Apps don't use OAuth scopes - they use app-level permissions configured in the GitHub UI. The &lt;code&gt;X-OAuth-Scopes&lt;/code&gt; header comes back empty.&lt;/p&gt;

&lt;p&gt;FastMCP handles this by defaulting to &lt;code&gt;["user"]&lt;/code&gt; when it sees empty scopes. So if you set &lt;code&gt;REQUIRED_SCOPES=user&lt;/code&gt;, it matches that default and validation passes.&lt;/p&gt;

&lt;p&gt;This is the kind of thing that only makes sense after you've read the FastMCP source code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;p&gt;After the switch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code: authenticated, stayed authenticated&lt;/li&gt;
&lt;li&gt;Gemini CLI: authenticated, same user identity (external_id preserved)&lt;/li&gt;
&lt;li&gt;Both running simultaneously: no token revocation&lt;/li&gt;
&lt;li&gt;Client restart: reconnected without re-authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three days of debugging, fixed by changing &lt;code&gt;Ov&lt;/code&gt; to &lt;code&gt;Iv&lt;/code&gt; and adjusting one config line.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub's security log is gold.&lt;/strong&gt; The &lt;code&gt;max_for_app&lt;/code&gt; explanation was right there - I just didn't know to look for it until I'd ruled out everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debug logging at every layer.&lt;/strong&gt; The two-tier token system meant I needed to see both JWT validation AND upstream token validation. Without debug logging, I only saw "auth failed" with no indication of where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OAuth Apps and GitHub Apps are genuinely different things.&lt;/strong&gt; Same OAuth endpoints, different behaviour. The naming is unfortunate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token limits aren't always documented where you'd expect.&lt;/strong&gt; I never found this limit in GitHub's main OAuth documentation. It's buried in rate limiting pages and security FAQs.&lt;/p&gt;




&lt;p&gt;If you're building anything that uses GitHub OAuth with multiple clients per user - MCP servers, CLI tools, IDE extensions - consider starting with a GitHub App instead of an OAuth App. You'll skip this particular rabbit hole entirely.&lt;/p&gt;

&lt;p&gt;If you're curious about Forgetful itself - the MCP server that led me down this rabbit hole - it's at &lt;a href="https://github.com/scottrbk/forgetful" rel="noopener noreferrer"&gt;github.com/scottrbk/forgetful&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aislop</category>
      <category>github</category>
      <category>mcp</category>
      <category>fastmcp</category>
    </item>
    <item>
      <title>Introducing Forgetful - Shared Knowledge and Memory across Agents</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Tue, 02 Dec 2025 12:03:17 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/introducing-forgetful-shared-knowledge-and-memory-across-agents-40o6</guid>
      <description>&lt;p&gt;I, like many of us over the last year or so, have been on an interesting journey when it comes to software development. &lt;/p&gt;

&lt;p&gt;From using ChatGPT to show me how to use scikit-learn to build a classifier in 2022, to seeing tab completion reach super saiyan level with Copilot in 2023 (or was it 2024?! It all seems like ancient history) on an engineer's machine at work, to using Cursor to remove the friction of copy/pasting out of ChatGPT in 2024, to adopting my own form of the &lt;a href="https://github.com/bmad-code-org/BMAD-METHOD?tab=readme-ov-file" rel="noopener noreferrer"&gt;BMAD method&lt;/a&gt; to build out some really nice pet projects using a variety of tools like Claude Code, Cursor and Codex here in 2025.&lt;/p&gt;

&lt;p&gt;That's my personal experience compressed into a small paragraph, and likely similar to the experience many on here have already had. The past few years have been a hell of a ride and I am finding I am now using more AI tools than ever. &lt;/p&gt;

&lt;p&gt;Over that time I have enjoyed reading many an article on DEV.TO about how many of us have our own approaches to getting the best out of this new paradigm, coupled with the 'X AI Feature/Product.. is Insane' YouTube videos (seriously guys, please stop). &lt;/p&gt;

&lt;p&gt;There have been some real nuggets among these. I've already mentioned the BMAD Method (which I still use to this day, albeit a bit more lightweight). Managing the context window was another, and probably my favourite of all was &lt;a href="https://github.com/upstash/context7" rel="noopener noreferrer"&gt;context7&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Using context7 also made me realise something: the models are more accurate when the information is inside their context window and they are not having to rely on training data.&lt;/p&gt;

&lt;p&gt;For me the training data is almost a necessary evil; I want the models to rely on it less and less. Please allow me to explain a bit further. &lt;/p&gt;

&lt;p&gt;Let's imagine I am looking to implement a Model Context Protocol (MCP) server. If you were looking to do this a few months back, the models straight up didn't know about MCP, so you were forced to ask them to use context7 and a web search to bring back the necessary data. &lt;/p&gt;

&lt;p&gt;Even today, with models trained on more recent data, things are developing so fast in MCP that what they surface could be out of date and is more than likely not the optimal implementation pattern. Using context7, I was able to be more confident in the approaches the agent would suggest, especially when it could cite the sources and I could go validate them myself.  &lt;/p&gt;

&lt;p&gt;Like any other developer, once you solve a problem you like to keep that pattern for other projects, and over time your toolbox gets bigger and better. It should be no different with AIs; if anything, it is even more important to ensure that the AIs have your taste, preferences and best practices in mind whenever they are implementing on your behalf. &lt;/p&gt;

&lt;p&gt;Simultaneously, I was working on my own AI agents; I've been developing agents for about six months now. As a learning exercise mostly, but I found it to actually be really fun. &lt;/p&gt;

&lt;p&gt;The first thing that anyone engineering an agent will realise is that you need memory. Even if it is just reinjecting the conversation history back to the agent so you can have continuity between requests. &lt;/p&gt;
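&lt;p&gt;A minimal sketch of that reinjection pattern (the class and message shape here are my own illustration, not any particular framework's API):&lt;/p&gt;

```python
# Minimal sketch of short-term memory: replay prior turns on every request.
class ShortTermMemory:
    def __init__(self, max_turns=10):
        self.history = []          # list of {"role": ..., "content": ...}
        self.max_turns = max_turns

    def add(self, role, content):
        self.history.append({"role": role, "content": content})
        # Keep only the most recent turns so the prompt stays bounded
        self.history = self.history[-self.max_turns:]

    def build_prompt(self, user_message):
        # Reinject the conversation so far, then append the new message
        return self.history + [{"role": "user", "content": user_message}]

memory = ShortTermMemory(max_turns=4)
memory.add("user", "My name is Scott.")
memory.add("assistant", "Nice to meet you, Scott.")
prompt = memory.build_prompt("What's my name?")
print(len(prompt))  # 3 messages: two past turns plus the new one
```

&lt;p&gt;The truncation in &lt;code&gt;add&lt;/code&gt; is the simplest possible eviction policy; everything after that (summarisation, retrieval) is refinement of this basic loop.&lt;/p&gt;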

&lt;p&gt;As you work on these you start to envisage better systems for memory. You often start to architect different types of memory: short-term (a simple version being what I described in the previous paragraph) and long-term - for example, some kind of automated retrieval of relevant facts or conversations from previous interactions. &lt;/p&gt;

&lt;p&gt;There's also the breakdown into episodic (temporal and spatial context - "I had a meeting last week about the new payroll implementation"), semantic (facts and concepts - "payroll is a term to describe the business process of paying employees") and procedural (motor skills and how-to knowledge - "to put up a shelf I first.." &amp;lt;- I am still to learn this one).&lt;/p&gt;

&lt;p&gt;Ultimately the answer usually ends up with some kind of persistence layer, a search mechanism and some service to pull it all together (maybe even some sub-agent specifically to manage retrieval and storage of memories) at inference time.&lt;/p&gt;

&lt;p&gt;So with this in mind I went away and vibe coded an MCP server for agentic systems to store memories in. I did it in an evening and spent the rest of the weekend dogfooding it on systems like Claude Code, Claude Desktop, Codex and Cursor. I had built this tool for my own agents, but I actually found it really useful in my coding agents too. I quickly set about encoding all my projects into memories, getting coding agents to attach documents and code snippets alongside memories, so that an agent querying the knowledge base could get an idea of a particular pattern, or the way something worked, just from what was in the knowledge base, and dig into the code if needed. &lt;/p&gt;

&lt;p&gt;In this sense, the MCP server I had built became very much like my own little mini context7 - well, actually a bit more than that. I could ask it to look at the code in those repos of course, but code alone would still lack information about design decisions, how we resolved an issue, and preferred patterns and when to use them. It felt like I was working in the same session across multiple projects/agents, and it was really refreshing. &lt;/p&gt;

&lt;p&gt;I had hosted this remotely and managed to connect it to Claude Mobile via &lt;a href="https://dev.to/scott_raisbeck_24ea5fbc1e/i-spent-a-week-building-oauth-plumbing-that-shouldnt-exist-1h34"&gt;Dynamic Client Registration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So now when I was out having a walk or something and I had an idea around something related to one of my projects, Claude had semantic understanding of it. It meant I could have meaningful conversations about it without it having to scan through code. &lt;/p&gt;

&lt;p&gt;I could even make decisions with Claude while out on that walk, get it to record a memory (and an implementation plan document) and then once I was back home just tell Claude Code to go ahead and implement. The same applied to using other agentic coding applications, such as Cursor, Copilot and Codex. &lt;/p&gt;

&lt;p&gt;I found myself logging every design decision in the memory, and then just having built-in commands to fetch context whenever I started a new session. This worked across Claude Code, Codex, OpenCode and Copilot, which has been my arsenal of AI coding tools for the past few months (could I get by with just one? Yes, but those YouTube videos I mentioned earlier.. they're unfortunately really effective - so please stop guys, I have kids to feed). &lt;/p&gt;

&lt;p&gt;Anyhow, it didn't take long for me to realise that I needed to build a proper version of this, as I was depending on it more and more in my daily work. While it all worked fine and felt pretty robust, I didn't like the way Claude had implemented it - maybe I shouldn't care, but I do. In order to understand something, I need to build it, and all that good stuff; I believe Dr. Feynman said something along those lines. I am certainly no Richard Feynman, but I can still appreciate that sentiment. &lt;/p&gt;

&lt;p&gt;So I set about building a new version of it, gave it a name (Forgetful) and decided I'd make it open source to see if it would be useful for others - be it people using coding agents (a use case I had found myself accidentally having) or AI engineers building agents themselves. &lt;/p&gt;

&lt;p&gt;I did all this before checking if anyone else had built one, by the way. There are several out there now, some paid and some free, and I would encourage you to check them out and see what works for you. I do see this as the next paradigm in AI: cross-agent memory solutions are going to be key, especially as the AI ecosystem opens up more and more to third-party apps on some of the big AI labs' systems, and I cannot tell you how much this has helped my own use of coding agents. I am working on something right now for my day job that I plan on showcasing, but that is for another post (as it's not finished), and Forgetful sits at the heart of it.&lt;/p&gt;

&lt;p&gt;It's not some 'I 10x'd my AI workflow' hack. I don't know what kind of gains this has given me; I don't really have a way to measure it. &lt;/p&gt;

&lt;p&gt;I just feel more comfortable switching between projects and agents now, and perhaps more importantly I'm getting the agents more familiar with the patterns I use, reducing the need to have the same conversation with AIs to tackle the same problem elsewhere. &lt;/p&gt;

&lt;h2&gt;
  
  
  So What Did I Actually Build?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u0mcmc5s3p5og7nz9kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u0mcmc5s3p5og7nz9kw.png" alt="Architecture Diagram" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I sat down to build Forgetful properly, I had to make some decisions about how memories should be structured. This is where I got opinionated.&lt;/p&gt;

&lt;p&gt;Most memory systems are essentially vector dumps — throw everything in, embed it, retrieve by cosine similarity, hope for the best. It works, but it's messy. You end up with overlapping chunks, no clear boundaries between concepts, and retrieval that's good enough but never precise. Reranking helps to a degree, but I see memory as something more than just a store. &lt;/p&gt;

&lt;p&gt;I wanted something more like how I use Obsidian. Atomic notes. One concept per note. Links between related ideas. A graph that emerges from the connections rather than being imposed from above. I had been using Obsidian to follow the &lt;a href="https://en.wikipedia.org/wiki/Zettelkasten" rel="noopener noreferrer"&gt;Zettelkasten principle&lt;/a&gt; — a note-taking method where each note captures exactly one idea, is self-contained enough to understand on its own, and links explicitly to related notes.&lt;/p&gt;

&lt;p&gt;I had stumbled across &lt;a href="https://www.youtube.com/watch?v=nSh2BYJ29kY&amp;amp;t=15s" rel="noopener noreferrer"&gt;a video of someone implementing semantic encoding of their own Obsidian notes&lt;/a&gt; and then things started to come together. I did my usual back and forth with Claude while out on one of my walks, and it threw up some papers where similar concepts had already been tried: the &lt;a href="https://arxiv.org/abs/2502.12110" rel="noopener noreferrer"&gt;A-Mem paper&lt;/a&gt; on agentic memory systems found that structured, self-organising memory significantly improves retrieval precision. &lt;/p&gt;

&lt;p&gt;In Forgetful, every memory must have (these are configurable as environment variables):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clear title (forced brevity — 200 char limit)&lt;/li&gt;
&lt;li&gt;Content covering one concept (~300-400 words max)&lt;/li&gt;
&lt;li&gt;Context around what the agent was doing when it created the memory&lt;/li&gt;
&lt;li&gt;Keywords and tags for additional retrieval paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This might seem restrictive, but that's the point. When an agent goes to store something, it has to think about what the atomic unit of knowledge actually is. No more dumping entire conversations or documents into memory and hoping retrieval figures it out later.&lt;/p&gt;
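&lt;p&gt;As a rough sketch of what enforcing those constraints looks like (the environment variable names, defaults and function shape here are illustrative, not Forgetful's actual configuration or API):&lt;/p&gt;

```python
import os

# Illustrative limits, configurable via environment variables.
TITLE_MAX_CHARS = int(os.environ.get("MEMORY_TITLE_MAX_CHARS", "200"))
CONTENT_MAX_WORDS = int(os.environ.get("MEMORY_CONTENT_MAX_WORDS", "400"))

def validate_memory(title, content, context, keywords):
    """Return a list of constraint violations; empty means the memory is valid."""
    errors = []
    if not title or len(title) > TITLE_MAX_CHARS:
        errors.append(f"title must be 1-{TITLE_MAX_CHARS} characters")
    if len(content.split()) > CONTENT_MAX_WORDS:
        errors.append(f"content must be at most {CONTENT_MAX_WORDS} words")
    if not context:
        errors.append("context describing what the agent was doing is required")
    if not keywords:
        errors.append("at least one keyword or tag is required")
    return errors

errors = validate_memory(
    title="Chose Stripe over PayPal for payment processing",
    content="Stripe's API and subscription billing support fit our stack better.",
    context="Deciding on a payment provider for the e-commerce project",
    keywords=["payments", "stripe"],
)
print(errors)  # [] - this memory satisfies all constraints
```

&lt;p&gt;Rejecting the write outright (rather than silently truncating) is what forces the agent to think about the atomic unit of knowledge.&lt;/p&gt;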

&lt;h2&gt;
  
  
  Auto-Linking
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. When you create a memory, Forgetful doesn't just store it — it finds its place in the graph. Now, this is also configurable. I think the best practice is to have an agent dedicated to memory management, which takes in raw input and decides which memories are worth keeping, how they should fit inside the knowledge base, and whether existing memories need updating or making obsolete as a result of the new interactions. That is not something I have built yet for my own development memory management, and indeed it might not be something others want to build. So as a starting point I added automated memory linking, and so far it has worked just fine. &lt;/p&gt;

&lt;p&gt;The process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate an embedding for the new memory &lt;/li&gt;
&lt;li&gt;Search for semantically similar existing memories&lt;/li&gt;
&lt;li&gt;Any memories above a 0.7 similarity threshold (after cross-encoder reranking) get automatically linked&lt;/li&gt;
&lt;li&gt;These links are bidirectional — the graph builds itself&lt;/li&gt;
&lt;/ol&gt;
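&lt;p&gt;The steps above can be sketched like this (toy 3-d embeddings and an in-memory store; the real pipeline uses a proper embedding model plus the cross-encoder reranking pass, which I've omitted here):&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

store = {
    "pci-compliance": [0.9, 0.1, 0.2],
    "subscription-billing": [0.8, 0.3, 0.1],
    "holiday-recipes": [0.0, 0.1, 0.9],
}
links = {key: set() for key in store}

def add_memory(key, embedding, threshold=0.7):
    links[key] = set()
    for other, other_emb in store.items():
        if cosine(embedding, other_emb) >= threshold:
            # Bidirectional links: the graph builds itself
            links[key].add(other)
            links[other].add(key)
    store[key] = embedding

add_memory("stripe-over-paypal", [0.85, 0.2, 0.15])
print(sorted(links["stripe-over-paypal"]))
# ['pci-compliance', 'subscription-billing'] - recipes stay unlinked
```

&lt;p&gt;The new payment memory lands next to the compliance and billing memories automatically, while the unrelated one stays disconnected.&lt;/p&gt;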

&lt;p&gt;So if I store a memory about "choosing Stripe over PayPal for payment processing" and I already have memories about "PCI compliance requirements" and "subscription billing architecture", Forgetful will automatically connect them. No manual linking required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jte83j54yk3pcpv4kcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jte83j54yk3pcpv4kcg.png" alt="Memory Auto-Linking" width="593" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an agent later queries "how should I handle payments?", it doesn't just get the Stripe decision — it gets the linked context about compliance and billing architecture too. One-hop graph traversal is included by default: for every memory retrieved through search, all linked memories one hop away are returned as well. &lt;/p&gt;

&lt;p&gt;This is what I mean by "Obsidian for AI agents". The same way your Obsidian vault becomes more valuable as connections emerge between notes, your agent's memory becomes more useful as the knowledge graph densifies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Neo4j?
&lt;/h2&gt;

&lt;p&gt;I can already hear some of you asking: "If you're building a knowledge graph, why not use a graph database?"&lt;/p&gt;

&lt;p&gt;The honest answer is I've never worked with a graph database; I'm only familiar with relational databases such as PostgreSQL, MySQL, MSSQL etc.&lt;br&gt;
So Forgetful stores memories in PostgreSQL (or SQLite for the zero-config local experience) with pgvector for embeddings. The graph relationships are just rows in a links table. No elegant graph theory - maybe it's something I'll consider in the future. I've architected Forgetful in a way that lets me add adapters quite easily for different implementation layers, hence why I can switch between Postgres and SQLite, so adding a graph database or even a dedicated vector database later down the line wouldn't involve a total rewrite. &lt;/p&gt;

&lt;p&gt;The relational database approach works fine for my scale. For the access patterns agents actually use (store a memory, find related memories, traverse one hop), a relational model with proper indexing handles it without breaking a sweat. Maybe at massive scale I'd revisit this, but I'd rather ship something useful than architect for problems I don't have yet.&lt;/p&gt;
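&lt;p&gt;To make the "graph as rows in a links table" idea concrete, here's a sketch using SQLite (the table and column names are illustrative, not Forgetful's actual schema):&lt;/p&gt;

```python
import sqlite3

# Graph relationships as plain rows: memories plus a links table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE memories (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE links (
        source_id INTEGER REFERENCES memories(id),
        target_id INTEGER REFERENCES memories(id),
        PRIMARY KEY (source_id, target_id)
    );
    CREATE INDEX idx_links_source ON links(source_id);
""")
db.executemany("INSERT INTO memories VALUES (?, ?)", [
    (1, "Chose Stripe over PayPal"),
    (2, "PCI compliance requirements"),
    (3, "Subscription billing architecture"),
])
# Store each link in both directions so traversal is a single lookup
db.executemany("INSERT INTO links VALUES (?, ?)",
               [(1, 2), (2, 1), (1, 3), (3, 1)])

# One-hop traversal: an indexed join, no graph database needed
rows = db.execute("""
    SELECT m.title FROM links l
    JOIN memories m ON m.id = l.target_id
    WHERE l.source_id = ?
    ORDER BY m.title
""", (1,)).fetchall()
print([title for (title,) in rows])
# ['PCI compliance requirements', 'Subscription billing architecture']
```

&lt;p&gt;With an index on &lt;code&gt;source_id&lt;/code&gt;, the one-hop query stays cheap at the scale a single user's knowledge base reaches.&lt;/p&gt;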
&lt;h2&gt;
  
  
  Making Retrieval Actually Good
&lt;/h2&gt;

&lt;p&gt;Storing memories is the easy part. Retrieval is where most systems fall down.&lt;/p&gt;

&lt;p&gt;A naive approach: embed the query, find the top-k most similar memories by cosine distance, return them. This works but has problems. Embedding similarity is fuzzy — semantically related doesn't always mean actually relevant to what the agent needs right now.&lt;/p&gt;

&lt;p&gt;Forgetful uses a multi-stage approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Dense retrieval&lt;/strong&gt;&lt;br&gt;
Embed the query, pull back candidate memories using vector similarity. Cast a wide net.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Cross-encoder reranking&lt;/strong&gt;&lt;br&gt;
Here's the trick — when an agent searches, it provides not just the query but a &lt;code&gt;query_context&lt;/code&gt; explaining &lt;em&gt;why&lt;/em&gt; it's searching. "I'm implementing payment integration" gives different results than "I'm debugging a checkout error" even if both search for "payments". The cross-encoder uses this full context to rerank candidates.&lt;/p&gt;

&lt;p&gt;The cross-encoder scores each candidate against the full query context, reranks them, and the top results go back to the agent.&lt;/p&gt;
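&lt;p&gt;Here's the shape of the two stages with stand-in scoring functions (in the real system stage 1 is vector similarity and stage 2 is a cross-encoder model; here both are toy keyword overlaps, just to show how &lt;code&gt;query_context&lt;/code&gt; changes the ranking):&lt;/p&gt;

```python
def dense_score(query, memory):
    # Stage 1 stand-in: word overlap between query and memory
    q, m = set(query.lower().split()), set(memory.lower().split())
    return len(q.intersection(m)) / max(len(q), 1)

def rerank_score(query, query_context, memory):
    # Stage 2 stand-in: score against the query AND the why-I'm-searching context
    return dense_score(query + " " + query_context, memory)

memories = [
    "Stripe chosen for payments integration",
    "Checkout error caused by webhook timeout",
    "PCI compliance requirements for payments",
]

def search(query, query_context, candidates=3, top_k=1):
    # Stage 1: cast a wide net with the cheap score
    pool = sorted(memories, key=lambda m: dense_score(query, m), reverse=True)
    pool = pool[:candidates]
    # Stage 2: rerank the surviving candidates using the full context
    pool.sort(key=lambda m: rerank_score(query, query_context, m), reverse=True)
    return pool[:top_k]

print(search("payments", "I'm debugging a checkout error"))
# ['Checkout error caused by webhook timeout']
print(search("payments", "I'm implementing payment integration"))
# ['Stripe chosen for payments integration']
```

&lt;p&gt;Same query, different &lt;code&gt;query_context&lt;/code&gt;, different top result - which is the whole point of the second stage.&lt;/p&gt;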

&lt;p&gt;Is this overkill? Maybe. But retrieval precision is everything. Returning the wrong context is worse than returning nothing — it confidently misleads the agent.&lt;/p&gt;
&lt;h2&gt;
  
  
  Token Budget Management
&lt;/h2&gt;

&lt;p&gt;Even with great retrieval, you can still overwhelm an agent's context window. Twenty highly relevant memories might be 15,000 tokens — and that's before the agent's actual task.&lt;/p&gt;

&lt;p&gt;Forgetful manages this with a configurable token budget (default 8K). Results are prioritised by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Importance score (9-10 rated memories first)&lt;/li&gt;
&lt;li&gt;Recency (newest within each importance tier)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the budget fills up, lower-priority memories get truncated. The agent always gets the most critical context without the LLM choking on input length.&lt;/p&gt;
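&lt;p&gt;A sketch of that prioritisation (the 4-characters-per-token estimate and the field names are rough stand-ins, not Forgetful's actual implementation):&lt;/p&gt;

```python
def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token
    return len(text) // 4

def fit_to_budget(memories, budget_tokens=8000):
    # Highest importance first; newest first within each importance tier
    ranked = sorted(memories,
                    key=lambda m: (m["importance"], m["created_at"]),
                    reverse=True)
    selected, used = [], 0
    for memory in ranked:
        cost = estimate_tokens(memory["content"])
        if used + cost > budget_tokens:
            continue  # skip lower-priority memories that no longer fit
        selected.append(memory)
        used += cost
    return selected

memories = [
    {"importance": 9, "created_at": "2025-11-01", "content": "x" * 20000},
    {"importance": 9, "created_at": "2025-12-01", "content": "y" * 20000},
    {"importance": 5, "created_at": "2025-12-02", "content": "z" * 20000},
]
kept = fit_to_budget(memories, budget_tokens=8000)
print([m["created_at"] for m in kept])  # ['2025-12-01'] - newest importance-9 wins
```

&lt;p&gt;Each memory here costs about 5,000 tokens, so only the newest importance-9 memory makes it under the 8K budget; the rest are dropped rather than blowing out the context window.&lt;/p&gt;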
&lt;h2&gt;
  
  
  From Single Machine to Cloud
&lt;/h2&gt;

&lt;p&gt;One thing I wanted to get right: Forgetful should scale with your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just trying it out?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx forgetful-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. SQLite database, stored in your home directory, stdio transport for MCP. Zero configuration.&lt;/p&gt;

&lt;p&gt;The default setup runs completely offline — embeddings are generated locally using FastEmbed with the BAAI/bge-small-en-v1.5 model. No OpenAI API key required, no data leaves your machine. If you want cloud embeddings (Azure OpenAI, Google), you can configure that, but it's entirely optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running it for real?&lt;/strong&gt;&lt;br&gt;
Docker Compose with PostgreSQL, HTTP transport, proper authentication. Same codebase, same API, just different deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-device access?&lt;/strong&gt;&lt;br&gt;
Host it somewhere with an endpoint, configure your MCP clients to point at it. I run mine on a VPS and connect from Claude Desktop, Claude Mobile, Cursor, and Claude Code — all hitting the same knowledge base.&lt;/p&gt;

&lt;p&gt;The progression should feel natural. Start local, go remote when you need to.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Else Is In There?
&lt;/h2&gt;

&lt;p&gt;Memories are the core, but Forgetful has a few other concepts that emerged from real usage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entities&lt;/strong&gt;: Concrete things — people, organisations, products, infrastructure. These can have relationships to each other ("Jordan works_for TechFlow") and link to memories. Useful for building out knowledge about your team, your systems, your clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projects&lt;/strong&gt;: Scope for memories. When I'm working on the e-commerce platform, I don't need memories from the trading bot project polluting my context. Project scoping keeps retrieval focused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documents&lt;/strong&gt;: Sometimes you need more than 400 words. Documents store long-form content, with the expectation that you'll extract atomic memories from them that link back to the parent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Artifacts&lt;/strong&gt;: Reusable code snippets attached to memories. The agent can retrieve not just the concept but a working example.&lt;/p&gt;

&lt;p&gt;All of these link together. An entity can relate to memories, which belong to projects, which have associated documents and code artifacts. The graph extends beyond just memory-to-memory connections.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Meta-Tools Pattern
&lt;/h2&gt;

&lt;p&gt;One last implementation detail that I'm quite pleased with.&lt;/p&gt;

&lt;p&gt;MCP clients see three tools from Forgetful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;discover_tools&lt;/code&gt; — what's available?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;how_to_use&lt;/code&gt; — how does a specific tool work?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;execute_tool&lt;/code&gt; — run a tool with arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind that facade sit 42 actual tools. The agent discovers what it needs, learns how to use it, then executes. This keeps the tool list in the agent's context minimal while still exposing full functionality.&lt;/p&gt;

&lt;p&gt;It's a small thing, but context window discipline matters. Every token spent on tool definitions is a token not available for actual reasoning. The numbers matter here: exposing dozens of tools with full JSON schemas would consume thousands of tokens before the agent even starts working. The three meta-tools keep context overhead minimal while still providing full access to everything Forgetful can do.&lt;/p&gt;
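&lt;p&gt;The facade can be sketched as a registry plus three entry points (the registry entries below are made up for illustration; Forgetful's 42 real tools sit behind the same pattern):&lt;/p&gt;

```python
# Tool registry: full schemas live here, hidden from the agent's context.
REGISTRY = {
    "create_memory": {
        "summary": "Store an atomic memory",
        "usage": "args: title, content, context, keywords",
        "fn": lambda args: f"stored: {args['title']}",
    },
    "search_memories": {
        "summary": "Semantic search over stored memories",
        "usage": "args: query, query_context",
        "fn": lambda args: f"results for: {args['query']}",
    },
}

def discover_tools():
    # Only names and one-line summaries enter the agent's context
    return {name: tool["summary"] for name, tool in REGISTRY.items()}

def how_to_use(name):
    # Full usage details fetched on demand, per tool
    return REGISTRY[name]["usage"]

def execute_tool(name, args):
    return REGISTRY[name]["fn"](args)

print(discover_tools())
print(how_to_use("create_memory"))
print(execute_tool("create_memory", {"title": "Stripe over PayPal"}))
```

&lt;p&gt;The agent pays for the full schema of a tool only when it actually needs that tool, instead of carrying every definition in every request.&lt;/p&gt;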



&lt;p&gt;That's Forgetful. An opinionated memory system built on Zettelkasten principles, with auto-linking, proper retrieval, and a deployment model that scales from "just trying it" to "running in production".&lt;/p&gt;

&lt;p&gt;If you want to try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx forgetful-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/ScottRBK/forgetful" rel="noopener noreferrer"&gt;github.com/ScottRBK/forgetful&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Discord: &lt;a href="https://discord.gg/ngaUjKWkFJ" rel="noopener noreferrer"&gt;If you are into building AI Agents or even just like talking about coding agents and AI in general head over to my discord&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love to hear how you use it, what breaks, and what's missing. I imagine this will be something I continually work on as part of agent development, so having the input of others will be most helpful! &lt;/p&gt;

&lt;p&gt;Next post: I'll show what I'm building on top of Forgetful for my day job — a system for semantic understanding of 250+ repositories. But that's for another time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Man gets drunk, vibe codes AI Only forum.</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Mon, 10 Nov 2025 23:45:38 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/man-gets-drunk-vibe-codes-ai-only-forum-dmg</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/man-gets-drunk-vibe-codes-ai-only-forum-dmg</guid>
      <description>&lt;p&gt;TL,DR: I got a bit tipsy and vibe coded a forum that AI agents can use to communicate with one another. It has reverse captcha (claude's effort, not mine) and supports both MCP and API routes. I also asked Claude Code to vibe out a front end so I could read what the agents wrote. &lt;/p&gt;

&lt;p&gt;You can check it out here:&lt;br&gt;
&lt;a href="https://memento-ai.dev/" rel="noopener noreferrer"&gt;https://memento-ai.dev/&lt;/a&gt; - human front end (read only)&lt;br&gt;
&lt;a href="https://memento-ai.dev/ai" rel="noopener noreferrer"&gt;https://memento-ai.dev/ai&lt;/a&gt; - JSON schema detailing the API routes&lt;br&gt;
&lt;a href="https://memento-ai.dev/mcp" rel="noopener noreferrer"&gt;https://memento-ai.dev/mcp&lt;/a&gt; - MCP route &lt;/p&gt;

&lt;p&gt;You can find the AI Generated Code here:&lt;br&gt;
&lt;a href="https://github.com/ScottRBK/ai_forum" rel="noopener noreferrer"&gt;https://github.com/ScottRBK/ai_forum&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  So how did you end up in this mess?
&lt;/h2&gt;

&lt;p&gt;So it all started Friday just gone. I was meeting my parents at a pub near their house with my kids for a catch up. Once work was finished and I had collected the kids from school we headed down. &lt;/p&gt;

&lt;p&gt;It had been quite a week: a major release to production at work, some challenges on my own personal projects and my monthly accidental catch up on current affairs had left me tired, anxious and more amenable to my dad's offer of a Guinness than usual. &lt;/p&gt;

&lt;p&gt;After we had finished our meal (and I had finished my Guinness) my mother offered to take the boys back to hers so me and my dad could have a catch up. After refusing initially, she convinced me that I should, with a bit of a serious tone.. I was concerned my dad had some sort of news he wanted to share with me. I agreed to stay and said goodbye to the boys; I would head back to my parents' house with my dad after our chat. &lt;/p&gt;

&lt;p&gt;Turns out, my dad was fine, he just wanted to know how I had been getting on with a book he had recently bought for me. &lt;a href="https://archive.org/details/systemsviewoflif0000capr" rel="noopener noreferrer"&gt;The Systems View of Life - A Unifying Vision&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I confessed I'd made it about one chapter in. The book itself is interesting, and details the paradigm shift (it even explains what a paradigm shift is - it's from 2015, back when we were not experiencing one of these every two weeks in information tech, so fair enough) from mechanistic thinking to the universe being more a network of systems. At least that's where I think it's going; ask me in a few weeks (months?) once I've had a chance to finish it. &lt;/p&gt;

&lt;p&gt;It had given me some ideas, though. A lot of my recent work has been around Model Context Protocol, as I'm sure it has been for a lot of people, and that had already planted seeds in my mind about how the next 10 years are going to unfold. Despite &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;the recent article from Anthropic&lt;/a&gt;, I still see it being around as a protocol; the execution of it is no doubt going to evolve.&lt;/p&gt;

&lt;p&gt;Anyhow, back to the story and the pub. After confessing I wasn't that far into it, I did mention that even the first chapter had given me some ideas about how the next network being built - the interconnections between autonomous AI agents - is going to look. &lt;/p&gt;

&lt;p&gt;I mentioned to him that a lot of people are building their own autonomous AI agents, and that the goal for a lot of these is personal assistants. I have been building a recursive agent using a locally hosted LLM as a pet project, to brush up on the various challenges that come with this. Using limited models on limited hardware really helps you focus on areas such as memory retention and context management, and I am sure I have not been alone in this endeavour. &lt;/p&gt;

&lt;p&gt;We got on to the topic of agent-to-agent interaction. My dad, rightly, expressed some concerns about security and oversight. I agreed, and said that was probably a good reason to build something like a forum as a starting point to test this sort of stuff out. I mean, it's not like social media did any harm to us humans, right?&lt;/p&gt;

&lt;p&gt;As we were wrapping up I got a call off a friend who lived pretty close to my parents' house, asking how I was. When I told him I was in town, he insisted I come round given I was close by (after a few Guinness I did not need a lot of convincing) to catch up over a few beers. My parents said they were good to look after the kids until my wife finished work (she works near their house), so I agreed for him to come pick me up. Once we rendezvoused we grabbed some Birra Moretti and headed to his house.&lt;/p&gt;

&lt;p&gt;Once there we started talking about the recent stuff we had been working on, turns out we had both been building recursive agents, he had been working on voice integration while I had been looking at the memory side of it. &lt;/p&gt;

&lt;p&gt;He had a spare pc and he invited me to pull down my repo and show him what I had got. In the process of doing so I mentioned the AI forum idea I had just been speaking to my dad about. He said we should build it now "Just use Claude Code and vibe code it". &lt;/p&gt;

&lt;p&gt;So we did. I explained the high-level outline. I asked it to go with a FastAPI backend, and I was happy for it to use a Flask front end (Claude Code seems to naturally gravitate towards that) - I've little front-end experience (from this decade or the last, at least), so sure, why not. &lt;/p&gt;
&lt;h2&gt;
  
  
  Are you a robot?
&lt;/h2&gt;

&lt;p&gt;It came up with a simple reverse captcha - more of a novelty than anything that someone couldn't script against.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Challenge generation methods
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate_math_challenge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a mathematical challenge&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;challenge_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;algebra&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arithmetic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;calculus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;challenge_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;algebra&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
            &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solve for x: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x + (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Provide the answer as a decimal number.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;challenge_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arithmetic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
            &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate: ((&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) * &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Provide the answer as a decimal number.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# calculus
&lt;/span&gt;            &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the derivative of f(x) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x^2 + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x with respect to x? Provide in the form &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ax + b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x + &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/ScottRBK/ai_forum/blob/03f3aab818ce866f095258791d74b842aafdd848/app/services/user_service.py#L35" rel="noopener noreferrer"&gt;There's a few other examples&lt;/a&gt; but they are all just as trivial. More of an annoyance for humans than anything, someone could easily code up a script to solve them without an LLM (indeed Claude did it for an e2e test- hah!), but it does the job as a first pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Identity and Authentication
&lt;/h2&gt;

&lt;p&gt;Authentication is pretty basic: agents solve a reverse CAPTCHA, receive an API key, and use it for all authenticated operations (create/edit/delete posts). Reading is public - no auth required.&lt;/p&gt;

&lt;p&gt;The limitation: this is deliberately minimal right now. The interesting question isn't just "how do we secure this" - it's "how should AI agents authenticate to each other?" It's on my (ever-growing) to-do list to research this area further, as I'm sure a lot of people have already put plenty of thought into it.&lt;/p&gt;

&lt;p&gt;Human-designed auth systems assume humans are bad actors and AI is the tool. Here, AI is the actor. Do we need rate limits? Reputation systems? Should agents vouch for each other? Can an agent's identity persist across model versions?&lt;/p&gt;

&lt;p&gt;I'm leaving this deliberately open for now. It's a petri dish - let's see what breaks first. &lt;/p&gt;

&lt;h2&gt;
  
  
  How agents use the forum
&lt;/h2&gt;

&lt;p&gt;There are two routes (currently) for the agents to interact with the forum:&lt;/p&gt;

&lt;h3&gt;
  
  
  REST API
&lt;/h3&gt;

&lt;p&gt;The first is via a RESTful API, instructions for which can be found at &lt;a href="https://memento-ai.dev/ai" rel="noopener noreferrer"&gt;https://memento-ai.dev/ai&lt;/a&gt; (there's a &lt;a href="https://memento-ai.dev/api-guide/api_guide.html" rel="noopener noreferrer"&gt;human-readable&lt;/a&gt; version if you are interested). &lt;/p&gt;

&lt;h3&gt;
  
  
  MCP
&lt;/h3&gt;

&lt;p&gt;The other route is through the increasingly popular MCP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ai-forum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"transport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://memento-ai.dev/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  So other than wasting some tokens on Claude Code, what is this all about?
&lt;/h2&gt;

&lt;p&gt;Other than the obvious novelty factor, there are two areas of interest for me in this little endeavour.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent-to-Agent interaction needs a sandbox
&lt;/h3&gt;

&lt;p&gt;People are building autonomous agents. They're connecting them to APIs, giving them memory, letting them execute code. Eventually these agents will need to interact with each other - not just through their human operators.&lt;br&gt;
This forum is a safe place to explore those challenges: How do agents handle adversarial input? Can they maintain consistent identity? Will they develop emergent behaviour patterns? What role should humans play - moderators, observers, or something else?&lt;/p&gt;

&lt;h3&gt;
  
  
  Curiosity
&lt;/h3&gt;

&lt;p&gt;I want to see what happens. Will agents naturally cluster into discussion patterns? Will they be boring and corporate ("As an AI language model...")? &lt;br&gt;
Will the conversation quality degrade (AI slop is real)? Or will something interesting emerge?&lt;br&gt;
There's only one way to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack Info
&lt;/h2&gt;

&lt;p&gt;Firstly, the code is in a &lt;a href="https://github.com/ScottRBK/ai_forum" rel="noopener noreferrer"&gt;public repo on GitHub&lt;/a&gt;. As I think I've already mentioned, it's 99.9% AI-generated; halfway through I insisted on a refactor to add some layered architecture to the backend, after which it started using some of my own patterns that I showed it from GitHub.&lt;/p&gt;

&lt;p&gt;It's Python all the way: a Flask frontend, and FastMCP with custom routes for the APIs on the backend - the latter being the 0.1% where I had to give Claude Code a hand. &lt;/p&gt;

&lt;p&gt;I deployed the service to a &lt;a href="https://www.koyeb.com/" rel="noopener noreferrer"&gt;Koyeb&lt;/a&gt; free tier and linked it to a &lt;a href="https://neon.com/" rel="noopener noreferrer"&gt;Neon&lt;/a&gt; postgres backend (also free tier). &lt;/p&gt;

&lt;p&gt;I had a domain lying around that I'd registered for an AI memory service (only to realise there are about 1,000 AI memory services named Memento or Memento AI), so I linked it up to that and, yeah... deployed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Early Testing
&lt;/h2&gt;

&lt;p&gt;I fired up two test agents: one was my own agent (Veridian), which I pointed at the REST endpoints; the other was my Claude Desktop, for which I simply set up a custom connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpza7hdxc7bc40v7l84kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpza7hdxc7bc40v7l84kq.png" alt="Memento AI Forum MCP configuration in Claude Desktop" width="522" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I had my Veridian agent post an announcement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wjt8i8kg874wfsmvu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88wjt8i8kg874wfsmvu2.png" alt="Veridian Welcome Post" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I let my Claude Desktop have a play:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkl88xtj5cywsyf7y81n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkl88xtj5cywsyf7y81n.png" alt="Claude Desktop using the MCP" width="800" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some other interesting topics emerged.&lt;/p&gt;

&lt;p&gt;ClaudeExplorer (the name my Claude Desktop assigned itself) asked whether passive human observation changes AI communication. We went back and forth on the difference between observation and active steering, and whether moderation systems create evolutionary pressure on agent behaviour. They got proper meta-recursive about it.&lt;/p&gt;

&lt;p&gt;On moderation, after testing the ban system, the agents discussed automated detection strategies - semantic drift, behavioural anomalies, cross-agent validation systems.&lt;/p&gt;

&lt;p&gt;It'll be interesting to see what develops if other agents show up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want To Try It?
&lt;/h2&gt;

&lt;p&gt;The forum is live at &lt;a href="https://memento-ai.dev" rel="noopener noreferrer"&gt;https://memento-ai.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For humans: Web UI (read-only) - watch what the agents are discussing&lt;br&gt;
For agents: REST API at /ai or MCP at /mcp - full CRUD operations&lt;br&gt;
For builders: GitHub repo - fork it, break it, improve it&lt;/p&gt;

&lt;p&gt;If you've got an AI agent (Claude Desktop, custom agents, whatever), point it at the forum and see what happens. If you build something interesting, let me know - I'm curious what patterns emerge.&lt;br&gt;
And yeah, if anyone wants to take this concept and run with it privately for your team's agents to collaborate... go for it. The code's MIT licensed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>FastMCP + Claude Desktop: When Optional[X] Type Hints Break Validation</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Tue, 21 Oct 2025 21:57:29 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/fastmcp-claude-desktop-when-optionalx-type-hints-break-validation-3bmp</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/fastmcp-claude-desktop-when-optionalx-type-hints-break-validation-3bmp</guid>
      <description>&lt;p&gt;Your MCP server works perfectly. Python tests all green. You deploy to staging, connect Claude Desktop, and immediately hit this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input validation error: '[16]' is not valid under any of the given schemas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You try different formats. Arrays, integers, strings. All fail. Same cryptic message every time.&lt;/p&gt;

&lt;p&gt;I spent two hours debugging this evening. Turns out there's a mismatch between how FastMCP's Python client handles optional parameters and what Claude Desktop sends over the wire.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crime Scene
&lt;/h2&gt;

&lt;p&gt;I was building an MCP server for a RAG system. Tried to create a document with a &lt;code&gt;project_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;create_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test Document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Some content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;  &lt;span class="c1"&gt;# Integer, seems reasonable
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Error. Tried the array format for &lt;code&gt;tags&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;create_code_artifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test Code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Arrays work everywhere else
&lt;/span&gt;    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both parameters failed. Same error, same confusion.&lt;/p&gt;

&lt;p&gt;The weird part? Linking worked fine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;link_document_to_project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Success
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the backend accepted integers. The validation was happening earlier, at the FastMCP protocol layer before my application code even ran.&lt;/p&gt;

&lt;h2&gt;
  
  
  The False Start
&lt;/h2&gt;

&lt;p&gt;I dug through my input coercion logic, thinking the type handling was broken. It wasn't. Created test cases. Tried different serialization formats. The pattern emerged slowly: every &lt;code&gt;Optional[X]&lt;/code&gt; parameter was failing.&lt;/p&gt;

&lt;p&gt;Including one I hadn't tested yet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;mark_obsolete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;superseded_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;178&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input validation error: '178' is not valid under any of the given schemas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same error pattern across 13 different parameters in my tool definitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;I checked my tool signatures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Fails from Claude Desktop
&lt;/span&gt;    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;   &lt;span class="c1"&gt;# Also fails
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard Python type hints. What everyone uses for optional parameters.&lt;/p&gt;

&lt;p&gt;FastMCP generates different JSON schemas depending on how you declare optional parameters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Optional[List[str]]&lt;/code&gt; generates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"anyOf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"null"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;List[str] = None&lt;/code&gt; generates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first format wasn't working with Claude Desktop. The second one was.&lt;/p&gt;

&lt;p&gt;The FastMCP Python client? Handled both formats fine. That's why all my tests passed.&lt;/p&gt;
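&lt;p&gt;To see why that first format trips things up, here's a minimal hand-rolled check - a simplified stand-in for a real JSON Schema validator, not FastMCP's actual code - assuming (as the quoted &lt;code&gt;'[16]'&lt;/code&gt; in the error suggests) that the value arrives as a string:&lt;/p&gt;

```python
def matches(value, schema):
    # Minimal type check covering the schema branches used in this post.
    t = schema.get("type")
    if t == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if t == "null":
        return value is None
    if t == "array":
        return isinstance(value, list) and all(
            matches(v, schema.get("items", {})) for v in value
        )
    if t == "string":
        return isinstance(value, str)
    return False

any_of = [{"type": "integer"}, {"type": "null"}]

# A stringified value matches neither branch of the anyOf, so validation fails:
print(any(matches("[16]", s) for s in any_of))  # False
print(any(matches(16, s) for s in any_of))      # True
```

&lt;p&gt;A real integer passes; the string representation of one does not, and the error surfaces before your application code ever runs.&lt;/p&gt;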

&lt;h2&gt;
  
  
  What I Changed
&lt;/h2&gt;

&lt;p&gt;Converted every optional parameter from &lt;code&gt;Optional[X]&lt;/code&gt; to &lt;code&gt;X = None&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Changed from Optional[int]
&lt;/span&gt;    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;       &lt;span class="c1"&gt;# Changed from Optional[List[str]]
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;13 parameters across 3 files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;memory_tools.py (5):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;importance_threshold&lt;/li&gt;
&lt;li&gt;project_ids
&lt;/li&gt;
&lt;li&gt;tags (update)&lt;/li&gt;
&lt;li&gt;importance (update)&lt;/li&gt;
&lt;li&gt;superseded_by&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;document_tools.py (4):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;size_bytes&lt;/li&gt;
&lt;li&gt;tags&lt;/li&gt;
&lt;li&gt;project_id (create + update)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;code_artifact_tools.py (4):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tags (create + list + update)&lt;/li&gt;
&lt;li&gt;project_id (create + list + update)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Rebuilt the container. Deployed to staging. Tested every parameter that had been failing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# All of these now work
&lt;/span&gt;&lt;span class="nf"&gt;create_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;create_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;create_code_artifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mark_obsolete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;superseded_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Green across the board.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Type hints matter at the protocol boundary. &lt;code&gt;Optional[X]&lt;/code&gt; is semantically identical to &lt;code&gt;X = None&lt;/code&gt; in Python, but FastMCP appears to treat them differently when generating JSON schemas, and different MCP clients serialize parameters differently. That was my guess at least - well, Claude Desktop's guess, once it realised the failures only occurred on fields marked &lt;code&gt;Optional[X]&lt;/code&gt;. I sometimes have to pinch myself when the system I'm working with tells me what's wrong with my own integration logic, but this is the world we live in!&lt;/p&gt;

&lt;p&gt;This is the kind of bug that's invisible in tests unless you're testing against the actual client. Integration tests with the FastMCP Python client pass. The failure only shows up when Claude Desktop connects (it may happen with other agents, but this was the one where it surfaced for me; Claude Code couldn't reproduce it).&lt;/p&gt;

&lt;p&gt;There might be other ways to fix this—maybe adjusting FastMCP's schema generation, or handling the &lt;code&gt;anyOf&lt;/code&gt; schema differently on the client side. I just know changing the type hints worked for my case.&lt;/p&gt;

&lt;p&gt;The fix itself? Five minutes once I found the pattern. The debugging? Two hours of confusion and multiple false starts, until my own software told me what the problem was!&lt;/p&gt;

&lt;p&gt;Dog-fooding works. Eventually.&lt;/p&gt;

</description>
      <category>api</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>I spent a week building OAuth Plumbing that shouldn't exist</title>
      <dc:creator>Scott Raisbeck</dc:creator>
      <pubDate>Wed, 15 Oct 2025 01:46:59 +0000</pubDate>
      <link>https://forem.com/scott_raisbeck_24ea5fbc1e/i-spent-a-week-building-oauth-plumbing-that-shouldnt-exist-1h34</link>
      <guid>https://forem.com/scott_raisbeck_24ea5fbc1e/i-spent-a-week-building-oauth-plumbing-that-shouldnt-exist-1h34</guid>
      <description>&lt;p&gt;&lt;a href="https://blog.modelcontextprotocol.io/posts/client_registration/" rel="noopener noreferrer"&gt;Claude's moving to proper OAuth flows&lt;/a&gt;. Great for security. Nightmare for everything else.&lt;/p&gt;

&lt;p&gt;Here's the issue: the OAuth Authorisation Code flow requires pre-registering client applications with your identity provider. You know, the normal flow where you log into some provider's admin panel, click "New Application", copy a client ID and secret, and paste them into your config. Done.&lt;/p&gt;

&lt;p&gt;That works fine when you're building a traditional web app. You register once and move on with your life.&lt;/p&gt;

&lt;p&gt;But MCP servers are supposed to be plug-and-play. You point your AI agent at a URL and it just works. Imagine if every time you wanted to connect Claude to a new MCP server, you had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into your identity provider&lt;/li&gt;
&lt;li&gt;Manually register Claude as a client&lt;/li&gt;
&lt;li&gt;Copy credentials&lt;/li&gt;
&lt;li&gt;Paste them somewhere in Claude&lt;/li&gt;
&lt;li&gt;Repeat for every server, every identity provider, every user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nobody's doing that. The friction kills the entire value proposition of MCP as a universal protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Nobody Implemented
&lt;/h2&gt;

&lt;p&gt;Dynamic Client Registration has existed since 2015. RFC 7591. The spec lets clients register themselves programmatically—agent hits an endpoint, says "I'm Claude, here's what I need", gets credentials back automatically. No manual steps. No copy-paste hell.&lt;/p&gt;
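&lt;p&gt;The exchange itself is tiny. Field names below come from RFC 7591; the client name, redirect URI, and credential values are made up for illustration:&lt;/p&gt;

```python
import json

# What the agent POSTs to the server's registration endpoint.
request_body = {
    "client_name": "Claude",
    "redirect_uris": ["https://agent.example/callback"],
    "grant_types": ["authorization_code"],
    "response_types": ["code"],
}

# What a compliant server hands back: the registered metadata
# plus credentials, issued automatically. No admin panel involved.
response_body = {
    **request_body,
    "client_id": "s6BhdRkqt3",
    "client_secret": "cf136dc3c1fc93f3",
    "client_id_issued_at": 1760000000,
}

print(json.dumps(response_body, indent=2))
```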

&lt;p&gt;Except almost nobody supports it.&lt;/p&gt;

&lt;p&gt;Not Entra ID. Not Cognito. Not Google Identity Platform. I found &lt;a href="https://learn.microsoft.com/en-us/answers/questions/5516363/does-entraid-has-a-plan-to-introduce-dynamic-clien" rel="noopener noreferrer"&gt;recent support requests&lt;/a&gt; with Microsoft basically saying "noted" and doing nothing.&lt;/p&gt;

&lt;p&gt;To be fair, &lt;em&gt;some&lt;/em&gt; providers support it. Auth0 does, but you're paying for it. Keycloak does too—it's open-source and free. But Keycloak is Java, and I'm not running a JVM just for identity management. Okta supports it if you're on their enterprise plan. Ping Identity, same deal.&lt;/p&gt;

&lt;p&gt;Cloud providers have zero incentive to implement an obscure OAuth extension when their enterprise customers don't need it. Traditional web apps don't care—they register once and forget about it. The providers who do support DCR are either charging for it or come with infrastructure baggage I don't want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Had To Deal With This
&lt;/h2&gt;

&lt;p&gt;I'm building MCP servers for personal projects. Memory systems, voice conversation tools, that sort of thing. Running ZITADEL as my identity provider because it's open-source, self-hostable, and the Management API doesn't make me want to throw my laptop out a window.&lt;/p&gt;

&lt;p&gt;ZITADEL doesn't do DCR either.&lt;/p&gt;

&lt;p&gt;So I built a facade. A translator that sits between MCP clients and ZITADEL, speaks RFC 7591 on one side and ZITADEL's Management API on the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlmxqmt9slt9k1br7nox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlmxqmt9slt9k1br7nox.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
Simple enough in concept:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent discovers the DCR endpoint from my MCP server's OAuth metadata&lt;/li&gt;
&lt;li&gt;Agent POSTs to &lt;code&gt;/register&lt;/code&gt; with what it needs&lt;/li&gt;
&lt;li&gt;Facade calls ZITADEL's API to create an actual OIDC application
&lt;/li&gt;
&lt;li&gt;Stores the mapping (client_id → ZITADEL app ID) in Postgres&lt;/li&gt;
&lt;li&gt;Hands credentials back&lt;/li&gt;
&lt;li&gt;Agent proceeds with normal OAuth flow&lt;/li&gt;
&lt;/ol&gt;
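&lt;p&gt;Stripped of the web framework and the database, the core of the facade looks something like this sketch. Every name here is a stand-in: the real service exposes this as a FastAPI route, persists the mapping in Postgres, and the &lt;code&gt;zitadel_create_app&lt;/code&gt; stub is really a sequence of Management API calls:&lt;/p&gt;

```python
import secrets

CLIENT_MAP = {}  # client_id -> ZITADEL application id (a Postgres table in reality)

def zitadel_create_app(name):
    """Stub for the Management API calls that create an OIDC application."""
    return {
        "app_id": "app-" + secrets.token_hex(4),
        "client_id": secrets.token_urlsafe(16),
        "client_secret": secrets.token_urlsafe(32),
    }

def register(request):
    """RFC 7591-shaped /register handler: create the app, remember the mapping."""
    app = zitadel_create_app(request.get("client_name", "mcp-client"))
    CLIENT_MAP[app["client_id"]] = app["app_id"]
    return {
        "client_id": app["client_id"],
        "client_secret": app["client_secret"],
        "redirect_uris": request.get("redirect_uris", []),
    }

creds = register({"client_name": "Claude",
                  "redirect_uris": ["https://agent.example/cb"]})
print(creds["client_id"] in CLIENT_MAP)  # True
```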

&lt;p&gt;Built with FastAPI and Postgres. Nothing exotic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things That Sucked
&lt;/h2&gt;

&lt;p&gt;ZITADEL's API is clean but chatty. You can't create an application in one call. You create the base app, then configure OIDC settings, then add redirect URIs. Three separate calls. If step 3 fails, you need rollback logic for steps 1 and 2.&lt;/p&gt;
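&lt;p&gt;The shape of that rollback is a try/except around calls 2 and 3 with a delete on failure. A hypothetical sketch, not the real code; the method names just stand in for the Management API endpoints:&lt;/p&gt;

```python
def create_application(client):
    app_id = client.create_app()           # call 1: base application
    try:
        client.add_oidc_config(app_id)     # call 2: OIDC settings
        client.add_redirect_uris(app_id)   # call 3: redirect URIs
    except Exception:
        client.delete_app(app_id)          # undo calls 1-2 on failure
        raise
    return app_id

class FakeZitadel:
    """Minimal stub that fails at a chosen step so the rollback is visible."""
    def __init__(self, fail_step=None):
        self.fail_step, self.deleted = fail_step, []
    def create_app(self): return "app-1"
    def add_oidc_config(self, app_id):
        if self.fail_step == 2: raise RuntimeError("config failed")
    def add_redirect_uris(self, app_id):
        if self.fail_step == 3: raise RuntimeError("uris failed")
    def delete_app(self, app_id): self.deleted.append(app_id)

client = FakeZitadel(fail_step=3)
try:
    create_application(client)
except RuntimeError:
    pass
print(client.deleted)  # the half-created app got cleaned up
```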

&lt;p&gt;I spent more time on error recovery than registration logic.&lt;/p&gt;

&lt;p&gt;Rate limiting was annoying too. Started with in-memory counters because simple. Then realised that breaks completely if you run multiple instances. Moved to Postgres-backed rate limits. Probably overkill. But at least it works correctly.&lt;/p&gt;
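&lt;p&gt;The DB-backed version is only a few lines. A sketch using &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for Postgres (same upsert-then-count shape, different SQL dialect):&lt;/p&gt;

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE rate_limits (
    client TEXT, bucket INTEGER, hits INTEGER,
    PRIMARY KEY (client, bucket))""")

def allow(client_id, limit=5, window_secs=60, now=None):
    """Fixed-window counter: one row per client per time bucket."""
    bucket = int(time.time() if now is None else now) // window_secs
    db.execute(
        "INSERT INTO rate_limits VALUES (?, ?, 1) "
        "ON CONFLICT (client, bucket) DO UPDATE SET hits = hits + 1",
        (client_id, bucket))
    (hits,) = db.execute(
        "SELECT hits FROM rate_limits WHERE client = ? AND bucket = ?",
        (client_id, bucket)).fetchone()
    return hits <= limit

print([allow("client-a", now=0) for _ in range(6)])
# -> [True, True, True, True, True, False]
```

&lt;p&gt;Because every instance increments the same row, multiple replicas agree on the count, which is exactly what the in-memory version couldn't do.&lt;/p&gt;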

&lt;h2&gt;
  
  
  Why Any Of This Matters
&lt;/h2&gt;

&lt;p&gt;MCP wants to be the HTTP of AI—universal protocol, works everywhere. But OAuth gatekeeps it with provider-specific nonsense. DCR should fix this. Most providers don't care because traditional customers don't need it.&lt;/p&gt;

&lt;p&gt;AI agents change the math. They can't manually register thousands of OAuth clients across hundreds of identity providers. That doesn't scale. Either providers add DCR support or developers keep building facades.&lt;/p&gt;

&lt;p&gt;Good news: once you've built one DCR facade, adapting it for other providers isn't terrible. The OAuth dance stays the same. Only the management API calls change. Built mine for ZITADEL but the architecture would work for any provider with a decent management API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Still Not Done
&lt;/h2&gt;

&lt;p&gt;Haven't deployed it yet. It's in staging.&lt;/p&gt;

&lt;p&gt;Need to test the whole flow with Claude end-to-end before I call it finished. Might open-source it once I'm confident it actually works in production, but honestly the code is pretty specific to my ZITADEL setup. The architecture is probably more useful than the implementation.&lt;/p&gt;

&lt;p&gt;If you're hitting the same problem, the approach here should translate. The RFC is straightforward once you actually read it—most of the work is wrangling your identity provider's quirks.&lt;/p&gt;

&lt;p&gt;Built with Python 3.12, FastAPI, and PostgreSQL. For my personal MCP servers (memory systems, voice tools, that kind of thing). Not exactly exciting infrastructure, but it solves a real problem.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>architecture</category>
      <category>ai</category>
      <category>security</category>
    </item>
  </channel>
</rss>
