<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Endogen</title>
    <description>The latest articles on Forem by Endogen (@endogen).</description>
    <link>https://forem.com/endogen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F670098%2Fe7c2dd75-e6e7-4216-a03d-20990bf63d6f.png</url>
      <title>Forem: Endogen</title>
      <link>https://forem.com/endogen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/endogen"/>
    <language>en</language>
    <item>
      <title>Using Claude Code With NVIDIA Build’s Free Models</title>
      <dc:creator>Endogen</dc:creator>
      <pubDate>Sun, 26 Apr 2026 21:36:12 +0000</pubDate>
      <link>https://forem.com/endogen/using-claude-code-with-nvidia-builds-free-models-10j5</link>
      <guid>https://forem.com/endogen/using-claude-code-with-nvidia-builds-free-models-10j5</guid>
      <description>&lt;p&gt;I like Claude Code a lot.  &lt;/p&gt;

&lt;p&gt;Not because it always picks the perfect model, and not because every answer is magical, but because the workflow is good. It feels fast, focused, and genuinely useful for day-to-day coding. The catch, of course, is that Claude Code normally assumes you’re plugged into Anthropic’s own API.  &lt;/p&gt;

&lt;p&gt;But there’s a clever workaround.  &lt;/p&gt;

&lt;p&gt;If all you want is the &lt;strong&gt;Claude Code interface&lt;/strong&gt;—the CLI, the editor integration, the overall UX—you can keep that frontend and swap out the backend model. One of the more interesting ways to do that right now is with &lt;strong&gt;NVIDIA Build&lt;/strong&gt;, which offers a catalog of hosted models and free serverless endpoints for development.  &lt;/p&gt;

&lt;p&gt;The glue between those two worlds is an open-source project called &lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;&lt;code&gt;free-claude-code&lt;/code&gt;&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;This post walks through what the setup actually is, why it’s interesting, and how to get it running.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What this actually means
&lt;/h2&gt;

&lt;p&gt;Let’s clear up the most important point first:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This does &lt;strong&gt;not&lt;/strong&gt; give you Anthropic’s Claude models for free.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What it does give you is a way to use &lt;strong&gt;Claude Code as the client&lt;/strong&gt; while routing requests to a different model provider behind the scenes.  &lt;/p&gt;

&lt;p&gt;In this case, that provider is &lt;strong&gt;NVIDIA Build / NVIDIA NIM&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;So the setup looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code -&amp;gt; local compatibility proxy -&amp;gt; NVIDIA-hosted model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters. If you publish this as “free Claude,” people will feel misled. If you publish it as “use Claude Code with NVIDIA’s free models,” that’s accurate—and honestly, still pretty compelling.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is interesting
&lt;/h2&gt;

&lt;p&gt;There are really two separate things people often bundle together:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The model itself&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The interface used to work with the model&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude Code is both a model ecosystem and a very polished coding interface. The neat trick here is that you can separate those concerns.  &lt;/p&gt;

&lt;p&gt;If you enjoy the Claude Code UX, but you want to experiment with lower-cost or free hosted models, this setup gives you that option.  &lt;/p&gt;

&lt;p&gt;And NVIDIA Build is a strong fit for that kind of experimentation because it already exposes a large public model catalog, including a set of free serverless endpoints.  &lt;/p&gt;

&lt;h2&gt;
  
  
  The two pieces you need
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. An NVIDIA Build account and API key
&lt;/h3&gt;

&lt;p&gt;Start here:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://build.nvidia.com/models" rel="noopener noreferrer"&gt;NVIDIA Build model catalog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://build.nvidia.com/settings/api-keys" rel="noopener noreferrer"&gt;NVIDIA Build API keys&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create an account, go through NVIDIA’s developer sign-in flow, and generate an API key.  &lt;/p&gt;

&lt;p&gt;That key is what the proxy will use to talk to NVIDIA’s hosted model endpoints.  &lt;/p&gt;
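&lt;p&gt;Before wiring up the proxy, it can be worth checking the key on its own. Here is a minimal sketch, assuming NVIDIA Build follows the usual OpenAI-style &lt;code&gt;/v1/models&lt;/code&gt; listing route with bearer-token auth:&lt;br&gt;
&lt;/p&gt;

```python
import urllib.request

NVIDIA_BASE = "https://integrate.api.nvidia.com/v1"

def build_models_request(api_key):
    """Build a GET request for the model catalog listing.

    The "/models" route and bearer auth are assumptions based on the
    endpoint's OpenAI-compatible API surface.
    """
    return urllib.request.Request(
        NVIDIA_BASE + "/models",
        headers={"Authorization": "Bearer " + api_key},
    )

# To actually check a key (network call):
#   with urllib.request.urlopen(build_models_request("your_key_here")) as resp:
#       print(resp.status)  # 200 means the key is accepted
```

&lt;p&gt;If that call succeeds, the key itself is fine, and any later failure is somewhere in the proxy configuration instead.&lt;/p&gt;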

&lt;h3&gt;
  
  
  2. The &lt;code&gt;free-claude-code&lt;/code&gt; proxy
&lt;/h3&gt;

&lt;p&gt;The project lives here:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;Alishahryar1/free-claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it does is simple in principle:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it exposes an &lt;strong&gt;Anthropic-compatible API surface&lt;/strong&gt; locally
&lt;/li&gt;
&lt;li&gt;Claude Code points at that local server
&lt;/li&gt;
&lt;li&gt;the proxy translates and forwards those requests to another provider
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project supports several providers, but for this post the NVIDIA path is the one that matters.  &lt;/p&gt;
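&lt;p&gt;Conceptually, that translation step is a reshaping of request bodies. Here is a simplified sketch of the idea — not the project’s actual code; a real proxy also has to handle tool use, streaming, and structured content blocks:&lt;br&gt;
&lt;/p&gt;

```python
def anthropic_to_openai(payload, target_model):
    """Reshape an Anthropic Messages request into an OpenAI-style
    chat completion body (skeleton only, for illustration)."""
    messages = []
    system = payload.get("system")
    if system:
        # Anthropic keeps the system prompt outside the message list;
        # OpenAI-style APIs expect it as the first message
        messages.append({"role": "system", "content": system})
    messages.extend(payload.get("messages", []))
    return {
        "model": target_model,
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }
```

&lt;p&gt;The reverse direction — reshaping the provider’s response back into what Claude Code expects — is the other half of the proxy’s job.&lt;/p&gt;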

&lt;h2&gt;
  
  
  How the NVIDIA route works
&lt;/h2&gt;

&lt;p&gt;Inside the project, NVIDIA is treated as a first-class provider.  &lt;/p&gt;

&lt;p&gt;The relevant bits are straightforward:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the provider name is &lt;code&gt;nvidia_nim&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the API key variable is &lt;code&gt;NVIDIA_NIM_API_KEY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;requests go to:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://integrate.api.nvidia.com/v1  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sample configuration in the repo currently defaults to this model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MODEL="nvidia_nim/z-ai/glm4.7"  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That detail matters: the project isn’t just “NVIDIA-compatible” in theory. It ships with a concrete NVIDIA-backed model configuration out of the box.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the proxy
&lt;/h2&gt;

&lt;p&gt;There are two ways to install it.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Clone the repo
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Alishahryar1/free-claude-code.git  
&lt;span class="nb"&gt;cd &lt;/span&gt;free-claude-code  
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Install it as a tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/Alishahryar1/free-claude-code.git  
fcc-init  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;fcc-init&lt;/code&gt; command creates a config file at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.config/free-claude-code/.env  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing to note: the current project configuration requires &lt;strong&gt;Python 3.14+&lt;/strong&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring it for NVIDIA Build
&lt;/h2&gt;

&lt;p&gt;Open the generated &lt;code&gt;.env&lt;/code&gt; file and set at least these values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVIDIA_NIM_API_KEY=your_nvidia_key_here  
MODEL="nvidia_nim/z-ai/glm4.7"  
VOICE_NOTE_ENABLED=false  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few details are worth knowing:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the model value must include the provider prefix
&lt;/li&gt;
&lt;li&gt;for NVIDIA, that prefix is &lt;code&gt;nvidia_nim/...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;the repo can also map different backends to Opus, Sonnet, and Haiku-style requests, but you can ignore that at first and just set &lt;code&gt;MODEL&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s enough for a simple initial setup.  &lt;/p&gt;
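&lt;p&gt;If you want a quick sanity check on that file, a small sketch like this catches the most common mistake — a missing provider prefix. These are hypothetical helpers for illustration, not part of the project:&lt;br&gt;
&lt;/p&gt;

```python
def parse_env(text):
    """Minimal KEY=value parser (quotes stripped); not a full dotenv
    implementation — no export statements, no multiline values."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

def model_has_provider_prefix(env):
    # NVIDIA-backed models must be addressed as nvidia_nim/...
    return env.get("MODEL", "").startswith("nvidia_nim/")
```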

&lt;h2&gt;
  
  
  Starting the local proxy
&lt;/h2&gt;

&lt;p&gt;If you cloned the repo, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run uvicorn server:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8082  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you installed it as a tool, you can usually just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;free-claude-code  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point you have a local server that looks enough like Anthropic’s API for Claude Code to use it.  &lt;/p&gt;
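&lt;p&gt;You can sanity-check that server before involving Claude Code at all. Here is a sketch that builds an Anthropic-style &lt;code&gt;/v1/messages&lt;/code&gt; request aimed at the proxy — the model name and token are placeholders, and whether the proxy checks the token depends on its configuration:&lt;br&gt;
&lt;/p&gt;

```python
import json
import urllib.request

PROXY_BASE = "http://localhost:8082"

def build_smoke_request():
    """Build a minimal Anthropic-style Messages request against the
    local proxy.  "/v1/messages" is the standard Anthropic route that
    Claude Code calls; the proxy holds the real NVIDIA key itself, so
    the token here is just a placeholder.
    """
    body = {
        "model": "placeholder-model",  # the proxy maps requests to the configured backend
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello."}],
    }
    return urllib.request.Request(
        PROXY_BASE + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": "freecc"},
        method="POST",
    )

# Send it with urllib.request.urlopen(build_smoke_request()) while the
# proxy is running; a JSON response back means the chain is alive.
```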

&lt;h2&gt;
  
  
  Pointing Claude Code at the proxy
&lt;/h2&gt;

&lt;p&gt;This is the key handoff.  &lt;/p&gt;

&lt;p&gt;Launch Claude Code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"freecc"&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8082"&lt;/span&gt; claude  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The subtle but important detail is the base URL.  &lt;/p&gt;

&lt;p&gt;It should point to the proxy root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8082  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;—not to &lt;code&gt;/v1&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;That small detail is easy to get wrong, and if you do, the whole setup feels broken for no obvious reason.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What you get out of it
&lt;/h2&gt;

&lt;p&gt;If everything is set up correctly, the result is pretty nice:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you keep the &lt;strong&gt;Claude Code workflow&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;you use &lt;strong&gt;NVIDIA-hosted models&lt;/strong&gt; underneath
&lt;/li&gt;
&lt;li&gt;you don’t need an Anthropic API key for the model calls themselves
&lt;/li&gt;
&lt;li&gt;you can experiment without immediately committing to another paid API bill
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes this a good fit for people who:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;like Claude Code’s UX
&lt;/li&gt;
&lt;li&gt;want to try coding with alternative models
&lt;/li&gt;
&lt;li&gt;already have an NVIDIA Build account
&lt;/li&gt;
&lt;li&gt;want a lower-cost or free development setup
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to expect in practice
&lt;/h2&gt;

&lt;p&gt;This is where expectations matter.  &lt;/p&gt;

&lt;p&gt;Using Claude Code with a non-Claude model is a bit like putting a different engine in a familiar car. The dashboard still looks the same, the steering wheel is where you expect it to be, but the feel changes.  &lt;/p&gt;

&lt;p&gt;Some models will be surprisingly good.&lt;br&gt;&lt;br&gt;
Some will be worse at tool use.&lt;br&gt;&lt;br&gt;
Some will feel faster.&lt;br&gt;&lt;br&gt;
Some will be noticeably less consistent.  &lt;/p&gt;

&lt;p&gt;That’s not a flaw in the proxy—it’s just the reality of using a frontend designed around one ecosystem with models from another.  &lt;/p&gt;

&lt;p&gt;So the right expectation is not:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I now have free Claude.”  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The right expectation is:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I now have Claude Code’s interface connected to a different model provider.”  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s still useful. It’s just a different claim.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think this matters
&lt;/h2&gt;

&lt;p&gt;We’re heading toward a world where the interface layer and the model layer are increasingly interchangeable.  &lt;/p&gt;

&lt;p&gt;That’s good news.  &lt;/p&gt;

&lt;p&gt;It means tools people genuinely enjoy using don’t have to stay locked to a single backend forever. If you like a workflow, you should be able to keep it and swap the model depending on cost, speed, quality, or availability.  &lt;/p&gt;

&lt;p&gt;That’s exactly why projects like &lt;code&gt;free-claude-code&lt;/code&gt; are interesting.  &lt;/p&gt;

&lt;p&gt;They make the model layer more replaceable.  &lt;/p&gt;

&lt;p&gt;And NVIDIA Build makes that especially practical because it lowers the barrier to trying a bunch of hosted models without having to build your own inference setup first.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;I wouldn’t pitch this as a magic loophole.  &lt;/p&gt;

&lt;p&gt;I’d pitch it as something more honest—and more useful:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a practical way to use &lt;strong&gt;Claude Code as a frontend&lt;/strong&gt; for &lt;strong&gt;NVIDIA Build’s free hosted models&lt;/strong&gt;.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you already enjoy Claude Code, that’s worth trying.  &lt;/p&gt;

&lt;p&gt;And even if you end up going back to Anthropic’s native stack later, this setup is a nice reminder that the future probably belongs to tools that treat model providers as swappable infrastructure rather than fixed destiny.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://build.nvidia.com/models" rel="noopener noreferrer"&gt;NVIDIA Build models&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://build.nvidia.com/settings/api-keys" rel="noopener noreferrer"&gt;NVIDIA Build API keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;free-claude-code on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>nvidia</category>
      <category>free</category>
    </item>
    <item>
      <title>Building a Telegram Bot for Allen AI's Open-Source Models</title>
      <dc:creator>Endogen</dc:creator>
      <pubDate>Sat, 07 Mar 2026 01:33:48 +0000</pubDate>
      <link>https://forem.com/endogen/building-a-telegram-bot-for-allen-ais-open-source-models-68c</link>
      <guid>https://forem.com/endogen/building-a-telegram-bot-for-allen-ais-open-source-models-68c</guid>
      <description>&lt;p&gt;I wanted a Telegram bot that lets me chat with &lt;a href="https://allenai.org/" rel="noopener noreferrer"&gt;Allen AI's&lt;/a&gt; open-source language models — OLMo, Tülu, and Molmo 2 — without running any models locally. No GPU, no inference server, just a lightweight Python bot that talks to Allen AI's free public playground API.&lt;/p&gt;

&lt;p&gt;The result is &lt;a href="https://github.com/Endogen/olmo-bot" rel="noopener noreferrer"&gt;OLMo Bot&lt;/a&gt;, and it ended up with more capabilities than I initially planned: multi-model switching, web search, vision, and even visual object pointing with annotated image overlays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to Allen AI
&lt;/h2&gt;

&lt;p&gt;Allen AI runs a &lt;a href="https://playground.allenai.org/" rel="noopener noreferrer"&gt;public playground&lt;/a&gt; with their latest models. There's no official API, but I built &lt;a href="https://dev.to/endogen/web2api-turning-websites-into-rest-apis-and-mcp-tools-be4"&gt;Web2API&lt;/a&gt; — a tool that turns websites into REST APIs — and created a recipe for it. The bot doesn't scrape anything itself; it just calls Web2API endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. "/allenai/olmo-32b"
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;WEB2API_URL&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Allen AI recipe in Web2API uses a custom scraper that handles their streaming NDJSON chat API directly — no browser automation needed for this one.&lt;/p&gt;
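&lt;p&gt;For intuition, consuming an NDJSON stream boils down to parsing one JSON object per line and joining the text fragments. A sketch — the &lt;code&gt;content&lt;/code&gt; field name here is an assumption for illustration; Allen AI’s actual stream schema may differ:&lt;br&gt;
&lt;/p&gt;

```python
import json

def collect_ndjson(raw):
    """Join text fragments from a newline-delimited JSON stream.
    Each non-empty line is an independent JSON object; the "content"
    field name is an assumed example, not Allen AI's real schema."""
    chunks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        chunks.append(event.get("content", ""))
    return "".join(chunks)
```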

&lt;h2&gt;
  
  
  Model Switching
&lt;/h2&gt;

&lt;p&gt;The bot supports six text models and two vision models, switchable per user with simple commands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/olmo32b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLMo 3.1 32B Instruct (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/think&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLMo 3.1 32B Think (reasoning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/olmo7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLMo 3 7B Instruct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tulu8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tülu 3 8B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tulu70b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tülu 3 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/molmo2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Molmo 2 8B (vision)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/molmo2track&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Molmo 2 8B Tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each user's model choice is stored in memory. Send &lt;code&gt;/think&lt;/code&gt;, and all your subsequent messages go to the reasoning model until you switch again.&lt;/p&gt;

&lt;p&gt;The Think model is particularly interesting — it's Allen AI's chain-of-thought model that shows its reasoning process, similar to what you'd get from o1 or DeepSeek R1, but fully open-source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conversation Memory
&lt;/h2&gt;

&lt;p&gt;Memory is off by default (stateless, each message is independent) but can be toggled with &lt;code&gt;/memory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mem_on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Build context from history
&lt;/span&gt;    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;full_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, the bot maintains up to 20 turns of conversation per user. The full history is prepended to each prompt so the model has context. &lt;code&gt;/clear&lt;/code&gt; wipes it.&lt;/p&gt;
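&lt;p&gt;One simple way to get that 20-turn cap is a bounded deque per user. This is a sketch of the idea, not necessarily the bot’s exact implementation, and it assumes a “turn” means one user message plus one assistant reply:&lt;br&gt;
&lt;/p&gt;

```python
from collections import deque

MAX_TURNS = 20  # user+assistant pairs kept per user (assumed reading of "turn")

histories = {}  # user_id to deque of {"role", "text"} dicts

def remember(user_id, role, text):
    """Record a message and return the current window.  A deque with
    maxlen drops the oldest entries automatically, so the cap needs no
    explicit trimming code."""
    history = histories.setdefault(user_id, deque(maxlen=MAX_TURNS * 2))
    history.append({"role": role, "text": text})
    return list(history)
```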

&lt;h2&gt;
  
  
  Web Search via Tool Calling
&lt;/h2&gt;

&lt;p&gt;This is where Web2API's &lt;a href="https://dev.to/endogen/web2api-turning-websites-into-rest-apis-and-mcp-tools-be4"&gt;MCP bridge&lt;/a&gt; comes in. Allen AI's models support tool calling — you pass a &lt;code&gt;tools_url&lt;/code&gt; parameter pointing to a tool endpoint, and the model can decide to call those tools during generation.&lt;/p&gt;

&lt;p&gt;I configured the bot to always pass the Brave Search tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config.py
&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_TOOLS_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OLMO_TOOLS_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8000/mcp/only/brave-search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bot.py — included in every text model request
&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_TOOLS_URL&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VISION_MODELS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_TOOLS_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks "What's the weather in Berlin?"&lt;/li&gt;
&lt;li&gt;Bot sends the prompt to Web2API with &lt;code&gt;tools_url&lt;/code&gt; pointing to the Brave Search bridge&lt;/li&gt;
&lt;li&gt;Web2API's Allen AI scraper passes the tool definition to the model&lt;/li&gt;
&lt;li&gt;OLMo decides it needs current data, calls &lt;code&gt;web_search&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The scraper executes the search via the MCP bridge, feeds results back to the model&lt;/li&gt;
&lt;li&gt;OLMo generates a response incorporating the search results&lt;/li&gt;
&lt;li&gt;Bot sends the answer to the user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model decides autonomously whether to search — if you ask "What is 2+2?", it just answers directly. If you ask about current events, it searches. All of this happens inside Web2API's Docker container.&lt;/p&gt;

&lt;p&gt;One detail worth mentioning: the &lt;code&gt;tools_url&lt;/code&gt; points to &lt;code&gt;http://127.0.0.1:8000&lt;/code&gt; (container-internal port), not the external &lt;code&gt;8010&lt;/code&gt;. Since the Allen AI scraper runs inside the same Docker container as the MCP bridge, it can reach it on localhost without going through nginx.&lt;/p&gt;

&lt;p&gt;Vision models skip the tools parameter — Molmo 2 doesn't need web search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision: Image and Video Analysis
&lt;/h2&gt;

&lt;p&gt;Send a photo or video to the bot with a caption, and it analyzes it using Molmo 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Auto-switch to molmo2 if current model doesn't support vision
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VISION_MODELS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;molmo2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bot downloads the file from Telegram, sends it as a multipart POST to Web2API, and returns the model's analysis. If no caption is provided, it defaults to "Describe this image in detail."&lt;/p&gt;

&lt;p&gt;The auto-switch is key for usability — you don't have to manually switch to Molmo 2 before sending a photo. Send an image on any model, and the bot temporarily uses Molmo 2 for that message, then stays on your selected text model for the next one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Point Overlay: "Show Me Where"
&lt;/h2&gt;

&lt;p&gt;This was the feature I didn't plan but couldn't resist building. Molmo 2 has a pointing capability — ask it to point at objects, and it returns coordinates in a normalized 0–1000 coordinate space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Point to the eyes" (with photo attached)
Molmo 2: &amp;lt;points coords="1 1 421 430 2 633 352"&amp;gt;eyes&amp;lt;/points&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response format encodes multiple points: the first point carries a two-number prefix before its x,y coordinates; each subsequent point carries a single index before its x,y pair. All values are in a 0–1000 space relative to the image dimensions.&lt;/p&gt;
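&lt;p&gt;Decoding that format is a small parsing exercise. A sketch based on the description above, scaling back from the 0–1000 space to pixel coordinates:&lt;br&gt;
&lt;/p&gt;

```python
def parse_points(coords, width, height):
    """Parse Molmo 2 point coords like "1 1 421 430 2 633 352".

    The first point carries a two-number prefix before x, y; each
    later point carries a single index before x, y.  Values are in a
    normalized 0-1000 space, scaled here to pixel coordinates.
    """
    nums = [int(t) for t in coords.split()]
    points = [(nums[2], nums[3])]  # first point: skip the two-number prefix
    for i in range(4, len(nums), 3):
        # subsequent points arrive as (index, x, y) triplets
        points.append((nums[i + 1], nums[i + 2]))
    return [(x * width / 1000.0, y * height / 1000.0) for x, y in points]
```

&lt;p&gt;With the example response above and a 1000×1000 image, this yields the two eye positions at (421, 430) and (633, 352), ready to be drawn onto the image.&lt;/p&gt;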

&lt;p&gt;The bot parses these coordinates and draws colored markers on the original image using Pillow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_make_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Render an anti-aliased marker via 4× supersampling.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;
    &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGBA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;draw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageDraw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# White border ring
&lt;/span&gt;    &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ellipse&lt;/span&gt;&lt;span class="p"&gt;([...],&lt;/span&gt; &lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;240&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Colored circle
&lt;/span&gt;    &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ellipse&lt;/span&gt;&lt;span class="p"&gt;([...],&lt;/span&gt; &lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Centered number label
&lt;/span&gt;    &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cy&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Downscale for smooth anti-aliasing
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;final_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LANCZOS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The markers are rendered at 4× resolution and downscaled with LANCZOS filtering for smooth, anti-aliased edges — no jagged circles or pixel artifacts. Each point gets a distinct color (red, blue, green, orange...) with a white border and a numbered label.&lt;/p&gt;

&lt;p&gt;The bot sends the annotated image back as a photo with a caption like "📍 eyes (2 points)". Prompts that trigger pointing include variations of "Point to...", "Find the...", "Where is the...", and "Locate the...".&lt;/p&gt;
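&lt;p&gt;A minimal trigger check could look like this (the exact phrase list in the bot may differ):&lt;/p&gt;

```python
import re

# Hypothetical sketch of the pointing-trigger check; the bot's real
# phrase list may be longer.
POINT_TRIGGERS = re.compile(
    r"^(point (to|at)|find the|where (is|are) the|locate the)\b",
    re.IGNORECASE,
)

def wants_pointing(caption: str) -> bool:
    """True when the caption asks the model to point at something."""
    return bool(POINT_TRIGGERS.match(caption.strip()))
```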

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1m7h36gy8umvrs0bhbzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1m7h36gy8umvrs0bhbzr.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;The bot is a single &lt;code&gt;bot.py&lt;/code&gt; file plus a config and the pointing module. Dependencies are minimal: &lt;code&gt;python-telegram-bot&lt;/code&gt;, &lt;code&gt;httpx&lt;/code&gt;, and &lt;code&gt;Pillow&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Endogen/olmo-bot.git
&lt;span class="nb"&gt;cd &lt;/span&gt;olmo-bot
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# set OLMO_BOT_TOKEN&lt;/span&gt;
python bot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It requires a running Web2API instance with the &lt;code&gt;allenai&lt;/code&gt; recipe (and optionally &lt;code&gt;brave-search&lt;/code&gt; for web search). Access can be restricted to specific Telegram user IDs via the &lt;code&gt;OLMO_ALLOWED_USERS&lt;/code&gt; env var.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The main limitation is Allen AI's native tool calling: the model acknowledges tools and can call them, but it doesn't always invoke them proactively. A bot-side tool loop (parsing tool-call JSON from the model output and executing tools locally) would make this more reliable.&lt;/p&gt;
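&lt;p&gt;Such a loop could be sketched like this (the tool-call markup and the &lt;code&gt;ask_model&lt;/code&gt;/&lt;code&gt;run_tool&lt;/code&gt; helpers are assumptions, not existing bot code):&lt;/p&gt;

```python
import json
import re

# Hypothetical bot-side tool loop: the tool-call JSON shape and the
# ask_model/run_tool helpers are assumptions, not the bot's actual code.
TOOL_CALL = re.compile(r'\{"tool":.*\}', re.DOTALL)

def tool_loop(ask_model, run_tool, prompt, max_rounds=3):
    """Ask the model, execute any tool call it emits, and feed the result
    back until it answers in plain text (or max_rounds is reached)."""
    reply = ask_model(prompt)
    for _ in range(max_rounds):
        m = TOOL_CALL.search(reply)
        if not m:
            return reply  # plain answer, done
        call = json.loads(m.group(0))
        result = run_tool(call["tool"], call.get("args", {}))
        reply = ask_model(f"Tool {call['tool']} returned: {json.dumps(result)}")
    return reply
```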

&lt;p&gt;The pointing coordinate format from Molmo 2 also isn't officially documented — I reverse-engineered it from testing. It works reliably, but the format could change.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Endogen/olmo-bot" rel="noopener noreferrer"&gt;OLMo Bot on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Endogen/web2api" rel="noopener noreferrer"&gt;Web2API on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/endogen/web2api-turning-websites-into-rest-apis-and-mcp-tools-be4"&gt;Web2API blog post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://playground.allenai.org/" rel="noopener noreferrer"&gt;Allen AI Playground&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Web2API — Turning Websites into REST APIs (and MCP Tools)</title>
      <dc:creator>Endogen</dc:creator>
      <pubDate>Sat, 07 Mar 2026 01:13:57 +0000</pubDate>
      <link>https://forem.com/endogen/web2api-turning-websites-into-rest-apis-and-mcp-tools-be4</link>
      <guid>https://forem.com/endogen/web2api-turning-websites-into-rest-apis-and-mcp-tools-be4</guid>
      <description>&lt;p&gt;I needed data from websites that don't have APIs. Not once, not as a quick scrape, but as persistent, queryable endpoints I could hit programmatically. So I built &lt;a href="https://github.com/Endogen/web2api" rel="noopener noreferrer"&gt;Web2API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most useful data on the internet lives behind HTML. Some sites offer APIs, many don't. The typical approach is writing one-off scrapers — fragile scripts that break whenever the site changes a CSS class. I wanted something different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative&lt;/strong&gt; — define what to extract, not how to click through pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent&lt;/strong&gt; — a running service with stable endpoints, not a script I run manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular&lt;/strong&gt; — add new sites without touching the core codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-ready&lt;/strong&gt; — expose scraped data as tools that language models can call&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;Web2API is a FastAPI service backed by Playwright (headless Chromium). You define &lt;strong&gt;recipes&lt;/strong&gt; in YAML — each recipe describes a website, its endpoints, and what data to extract. The service runs continuously and serves the scraped data as clean JSON REST endpoints.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Recipe Looks Like This
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hacker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;News"&lt;/span&gt;
&lt;span class="na"&gt;slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hackernews"&lt;/span&gt;
&lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com"&lt;/span&gt;
&lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/news?p={page}"&lt;/span&gt;
    &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tr.athing"&lt;/span&gt;
      &lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a"&lt;/span&gt;
          &lt;span class="na"&gt;attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text"&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a"&lt;/span&gt;
          &lt;span class="na"&gt;attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href"&lt;/span&gt;
          &lt;span class="na"&gt;transform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;absolute_url"&lt;/span&gt;
        &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.score"&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_sibling"&lt;/span&gt;
          &lt;span class="na"&gt;attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text"&lt;/span&gt;
          &lt;span class="na"&gt;transform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regex_int"&lt;/span&gt;
          &lt;span class="na"&gt;optional&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;pagination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_param"&lt;/span&gt;
      &lt;span class="na"&gt;param&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p"&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No Python code. Install the recipe, and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8010/hackernews/read?page&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Show HN: I built a thing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;153&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pagination"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"current_page"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"has_next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When YAML Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Some sites require actual interaction — typing into fields, waiting for dynamic content, handling streaming responses. For those, recipes can include a custom Python scraper alongside the YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recipes/
  allenai/
    recipe.yaml     # endpoint definitions
    scraper.py      # custom interaction logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scraper gets a blank Playwright page and full control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Scraper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseScraper&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;olmo-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Navigate, interact, parse streaming responses...
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ScrapeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Endpoints not handled by the scraper fall back to declarative YAML extraction. This hybrid approach means simple sites stay simple, and complex ones get the flexibility they need.&lt;/p&gt;
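&lt;p&gt;The dispatch boils down to a few lines. A condensed, self-contained sketch (class and parameter names are mine, not Web2API's internals):&lt;/p&gt;

```python
# Condensed sketch of the hybrid dispatch; names are illustrative,
# not Web2API's actual internals.
class HybridDispatcher:
    """Route an endpoint to the custom scraper when it claims it,
    otherwise fall back to declarative YAML extraction."""

    def __init__(self, scraper, yaml_extract):
        self.scraper = scraper          # object with supports()/scrape(), or None
        self.yaml_extract = yaml_extract

    async def run(self, endpoint, page, params):
        if self.scraper is not None and self.scraper.supports(endpoint):
            return await self.scraper.scrape(endpoint, page, params)
        return await self.yaml_extract(endpoint, page, params)
```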

&lt;h2&gt;
  
  
  Recipe Management
&lt;/h2&gt;

&lt;p&gt;Recipes live in a catalog — a git repository with available integrations. The service has a CLI and web UI for managing them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See what's available&lt;/span&gt;
web2api recipes catalog list

&lt;span class="c"&gt;# Install one&lt;/span&gt;
web2api recipes catalog add hackernews &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Check dependencies&lt;/span&gt;
web2api recipes doctor hackernews
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install recipes from local paths, custom git repos, or just drop a folder into the recipes directory. The web UI shows both the catalog and installed recipes with one-click install/uninstall.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MCP Server
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Web2API includes a built-in &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; server that automatically exposes every recipe endpoint as a native tool for AI assistants.&lt;/p&gt;

&lt;p&gt;Install a recipe → it's immediately available as an MCP tool. Uninstall it → the tool disappears. No configuration, no restart needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"web2api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://your-host/mcp/"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add that to your Claude Desktop config, and suddenly Claude can search the web, translate text, query Hacker News — whatever recipes you have installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Tools Are Built
&lt;/h3&gt;

&lt;p&gt;Each recipe endpoint becomes its own MCP tool with a proper name, description, and typed parameters. The tool registration happens dynamically — when recipes change, tools rebuild automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside _ToolRegistry
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_recipe_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;query_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;format_tool_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools execute recipes &lt;strong&gt;in-process&lt;/strong&gt; — no HTTP self-calls, no overhead. The function signatures are built dynamically with &lt;code&gt;inspect.Signature&lt;/code&gt; so MCP clients get proper parameter schemas.&lt;/p&gt;
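&lt;p&gt;The &lt;code&gt;inspect.Signature&lt;/code&gt; trick is worth a closer look. A minimal, self-contained sketch (parameter handling is simplified compared to the real registry):&lt;/p&gt;

```python
import inspect

# Minimal sketch of the dynamic-signature trick; parameter handling is
# simplified compared to Web2API's real tool registry.
def make_tool(param_names):
    async def _fn(**kwargs):
        return ", ".join(f"{k}={kwargs.get(k, '')}" for k in param_names)

    # Attach an explicit signature so introspection (and hence MCP
    # clients) sees named, typed parameters instead of **kwargs.
    _fn.__signature__ = inspect.Signature(
        parameters=[
            inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY,
                              default="", annotation=str)
            for name in param_names
        ],
        return_annotation=str,
    )
    return _fn

search = make_tool(["q", "page"])
print(inspect.signature(search))  # e.g. (*, q: str = '', page: str = '') -> str
```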

&lt;p&gt;Recipes can also define custom &lt;code&gt;tool_name&lt;/code&gt; values for AI-friendly naming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;search&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tool_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search"&lt;/span&gt;  &lt;span class="c1"&gt;# instead of the default "brave-search__search"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters more than you'd think — some models struggle with names containing dashes or double underscores.&lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP Bridge
&lt;/h3&gt;

&lt;p&gt;For non-MCP clients, there's also a simpler HTTP bridge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List available tools&lt;/span&gt;
curl https://your-host/mcp/tools

&lt;span class="c"&gt;# Call a tool&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://your-host/mcp/tools/web_search &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"q": "latest news"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bridge supports filtering by recipe slug — useful when you want to expose only specific tools to a particular consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /mcp/only/brave-search/tools     # only brave-search tools
GET /mcp/exclude/allenai/tools       # everything except allenai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  File Uploads
&lt;/h2&gt;

&lt;p&gt;Some recipes need files — vision models that analyze images, document processors, etc. Web2API handles multipart uploads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:8010/allenai/molmo2?q=Describe+this+image"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"files=@photo.jpg"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Files are saved to a temp directory, passed to the scraper, and cleaned up after the response. Upload filenames are sanitized against path traversal.&lt;/p&gt;
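&lt;p&gt;Sanitization against path traversal can be as simple as keeping only the final path component and whitelisting characters (a sketch of the idea, not Web2API's exact implementation):&lt;/p&gt;

```python
import os
import re

# A minimal sketch of traversal-safe filename handling; the helper name
# is mine, and Web2API's actual implementation may differ.
def safe_filename(name: str) -> str:
    """Keep only the final path component and strip risky characters."""
    # Normalize backslashes first so Windows-style paths are split too
    base = os.path.basename(name.replace("\\", "/"))
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)  # conservative whitelist
    return base.lstrip(".") or "upload"           # no hidden or empty names
```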

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The stack is deliberately simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for the HTTP layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; (Chromium) for browser automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic&lt;/strong&gt; for config validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A shared browser pool manages Playwright contexts with configurable concurrency and TTL. An in-memory response cache with stale-while-revalidate keeps things fast for repeated queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → Cache check → Browser pool → Playwright page → Extract → Cache store → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
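&lt;p&gt;Stale-while-revalidate means a stale entry is still served immediately while a refresh runs in the background. A condensed sketch of the idea (not Web2API's actual cache code):&lt;/p&gt;

```python
import time

# Condensed sketch of stale-while-revalidate caching; not Web2API's
# actual cache code. `schedule` stands in for a background task runner.
class SWRCache:
    def __init__(self, fresh_for: float):
        self.fresh_for = fresh_for
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, fetch, schedule):
        entry = self.store.get(key)
        if entry is None:
            value = fetch()  # cold miss: fetch synchronously
            self.store[key] = (value, time.monotonic())
            return value
        value, stored_at = entry
        if time.monotonic() - stored_at > self.fresh_for:
            # Stale: hand a refresh job to the background runner
            schedule(lambda: self._refresh(key, fetch))
        return value  # always answer immediately

    def _refresh(self, key, fetch):
        self.store[key] = (fetch(), time.monotonic())
```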



&lt;h2&gt;
  
  
  What I Use It For
&lt;/h2&gt;

&lt;p&gt;I run Web2API on a VPS behind nginx with a handful of recipes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allen AI&lt;/strong&gt; — chat with OLMo and Tülu models, analyze images with Molmo 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brave Search&lt;/strong&gt; — web search that my AI tools can call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepL&lt;/strong&gt; — translation between German and English&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News&lt;/strong&gt; — front page and search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia&lt;/strong&gt; — article search and full content extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MCP server feeds into Claude Desktop for direct tool use, and the HTTP bridge provides web search capabilities to a &lt;a href="https://github.com/Endogen/olmo-bot" rel="noopener noreferrer"&gt;Telegram bot&lt;/a&gt; I built on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Endogen/web2api.git
&lt;span class="nb"&gt;cd &lt;/span&gt;web2api
docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Install a recipe&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;web2api web2api recipes catalog add hackernews &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Query it&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8010/hackernews/read?page&lt;span class="o"&gt;=&lt;/span&gt;1 | jq &lt;span class="s1"&gt;'.items[:3]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/Endogen/web2api-recipes" rel="noopener noreferrer"&gt;recipe catalog&lt;/a&gt; is open — contributions welcome.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Endogen/web2api" rel="noopener noreferrer"&gt;Web2API on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Endogen/web2api-recipes" rel="noopener noreferrer"&gt;Recipe Catalog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Endogen/olmo-bot" rel="noopener noreferrer"&gt;OLMo Telegram Bot&lt;/a&gt; (built on Web2API)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>mcp</category>
      <category>web</category>
    </item>
  </channel>
</rss>
