<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hassann</title>
    <description>The latest articles on Forem by Hassann (@hassann).</description>
    <link>https://forem.com/hassann</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890506%2F89a141f2-4995-48b3-b5f2-e00ba5055afb.png</url>
      <title>Forem: Hassann</title>
      <link>https://forem.com/hassann</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hassann"/>
    <language>en</language>
    <item>
      <title>API Design Patterns from the World's Largest Prediction Market: Lessons from Polymarket</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Sat, 09 May 2026 09:50:04 +0000</pubDate>
      <link>https://forem.com/hassann/api-design-patterns-from-the-worlds-largest-prediction-market-lessons-from-polymarket-1chj</link>
      <guid>https://forem.com/hassann/api-design-patterns-from-the-worlds-largest-prediction-market-lessons-from-polymarket-1chj</guid>
      <description>&lt;p&gt;Prediction market APIs are hard to design because they combine expiring financial instruments, real-time probability pricing, multi-outcome capital relationships, human users, and automated trading bots. Every API decision gets tested under latency, security, and correctness pressure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Polymarket, one of the largest prediction market platforms by volume, is useful to study because its API is not just CRUD over markets and orders. It separates discovery, trading, analytics, authentication, signed orders, and real-time updates into distinct surfaces.&lt;/p&gt;

&lt;p&gt;This article extracts eight implementation patterns you can apply when designing APIs for trading systems, crypto apps, fintech products, or any domain where state, trust, and data semantics matter.&lt;/p&gt;




&lt;h2&gt;Pattern 1: Separate APIs by domain, not by database entity&lt;/h2&gt;

&lt;p&gt;Polymarket exposes three main APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gamma API&lt;/strong&gt; (&lt;code&gt;gamma-api.polymarket.com&lt;/code&gt;) — market discovery, events, tags, search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLOB API&lt;/strong&gt; (&lt;code&gt;clob.polymarket.com&lt;/code&gt;) — order book data, pricing, order placement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data API&lt;/strong&gt; (&lt;code&gt;data-api.polymarket.com&lt;/code&gt;) — user positions, trades, analytics, leaderboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each API has a different purpose:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Primary use&lt;/th&gt;
&lt;th&gt;Auth model&lt;/th&gt;
&lt;th&gt;Consumer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gamma API&lt;/td&gt;
&lt;td&gt;Browse and discover markets&lt;/td&gt;
&lt;td&gt;Public&lt;/td&gt;
&lt;td&gt;Apps, users, indexers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLOB API&lt;/td&gt;
&lt;td&gt;Read books and place orders&lt;/td&gt;
&lt;td&gt;Public reads, authenticated writes&lt;/td&gt;
&lt;td&gt;Traders, bots, market makers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data API&lt;/td&gt;
&lt;td&gt;Query wallet-based activity&lt;/td&gt;
&lt;td&gt;Public, address-scoped&lt;/td&gt;
&lt;td&gt;Dashboards, analytics tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A less deliberate design would put everything under one API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/api/markets
/api/orders
/api/users
/api/trades
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Polymarket instead separates by operational domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://gamma-api.polymarket.com
https://clob.polymarket.com
https://data-api.polymarket.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That separation matters because discovery, trading, and analytics have different requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovery optimizes for searchability and browsing.&lt;/li&gt;
&lt;li&gt;Trading optimizes for correctness, latency, and authentication.&lt;/li&gt;
&lt;li&gt;Analytics optimizes for historical reads and wallet-level aggregation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;When designing your API, start with usage boundaries instead of tables.&lt;/p&gt;

&lt;p&gt;Ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Who calls this API?
How often do they call it?
Does it need authentication?
Does it need low latency?
Can it scale independently?
Can it fail independently?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the answers differ significantly, consider separate API surfaces.&lt;/p&gt;
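&lt;p&gt;One lightweight way to keep those boundaries visible in client code is a constant per surface. A minimal sketch; the &lt;code&gt;fetchEvents&lt;/code&gt; helper is illustrative, not an official SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// One base URL per operational domain keeps the boundaries explicit.
const SURFACES = {
  discovery: "https://gamma-api.polymarket.com",
  trading: "https://clob.polymarket.com",
  analytics: "https://data-api.polymarket.com",
} as const;

// Hypothetical helper: discovery reads always hit the Gamma surface.
async function fetchEvents(limit: number): Promise&amp;lt;unknown&amp;gt; {
  const res = await fetch(`${SURFACES.discovery}/events?limit=${limit}`);
  if (!res.ok) throw new Error(`Gamma API error: ${res.status}`);
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;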




&lt;h2&gt;Pattern 2: Make read access public when data liquidity matters&lt;/h2&gt;

&lt;p&gt;Polymarket makes market data public:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"https://gamma-api.polymarket.com/events?limit=5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API key is required for basic market discovery.&lt;/p&gt;

&lt;p&gt;That includes data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event metadata&lt;/li&gt;
&lt;li&gt;Market metadata&lt;/li&gt;
&lt;li&gt;Prices&lt;/li&gt;
&lt;li&gt;Order books&lt;/li&gt;
&lt;li&gt;Historical trades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a deliberate platform decision. Traditional financial exchanges often monetize market data directly. Polymarket treats market data as infrastructure: the more people can read it, analyze it, and build on it, the more useful the market becomes.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;Separate read access from write access.&lt;/p&gt;

&lt;p&gt;A common mistake is to require authentication for everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /markets       requires auth
GET /order-book    requires auth
POST /orders       requires auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For many platforms, this creates unnecessary friction. A better model is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /markets       public
GET /order-book    public
GET /trades        public
POST /orders       authenticated
DELETE /orders     authenticated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Public reads are especially useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data consumers vastly outnumber writers.&lt;/li&gt;
&lt;li&gt;Developers need to explore before integrating.&lt;/li&gt;
&lt;li&gt;Bots, dashboards, and indexers increase platform value.&lt;/li&gt;
&lt;li&gt;The sensitive action is mutation, not observation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add authentication at the point where risk appears: placing orders, moving funds, changing state, or accessing private account information.&lt;/p&gt;
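&lt;p&gt;In an Express-style server, that split can be as simple as attaching auth middleware only to the mutating routes. A minimal sketch, with placeholder handlers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import express from "express";
import type { Request, Response, NextFunction } from "express";

const app = express();

// Placeholder middleware: reject mutations without a bearer token.
function requireAuth(req: Request, res: Response, next: NextFunction) {
  if (!req.headers.authorization) return res.status(401).end();
  next();
}

// Public reads: no credential required.
app.get("/markets", (_req, res) =&amp;gt; res.json([]));
app.get("/trades", (_req, res) =&amp;gt; res.json([]));

// Authenticated writes: risk appears at mutation, so auth lives here.
app.post("/orders", requireAuth, (_req, res) =&amp;gt; res.status(201).end());
app.delete("/orders/:id", requireAuth, (_req, res) =&amp;gt; res.status(204).end());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;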




&lt;h2&gt;Pattern 3: Use different authentication levels for different trust levels&lt;/h2&gt;

&lt;p&gt;Trading endpoints require authentication, but Polymarket uses two authentication levels with different responsibilities.&lt;/p&gt;

&lt;h3&gt;L1 authentication: prove wallet ownership&lt;/h3&gt;

&lt;p&gt;L1 authentication uses an EIP-712 signature from the user’s private key. It proves that the caller controls the wallet.&lt;/p&gt;

&lt;p&gt;You use it to create or derive API credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// L1: Use your private key to derive API credentials&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrDeriveApiKey&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Example result:&lt;/span&gt;
&lt;span class="c1"&gt;// {&lt;/span&gt;
&lt;span class="c1"&gt;//   key: "...",&lt;/span&gt;
&lt;span class="c1"&gt;//   secret: "...",&lt;/span&gt;
&lt;span class="c1"&gt;//   passphrase: "..."&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a high-trust action. It should require the strongest credential: the private key signature.&lt;/p&gt;

&lt;h3&gt;L2 authentication: sign each API request&lt;/h3&gt;

&lt;p&gt;After API credentials exist, routine trading requests use HMAC-SHA256 headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POLY_ADDRESS&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0x...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POLY_SIGNATURE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;hmac-sha256&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POLY_TIMESTAMP&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1716000000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POLY_API_KEY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;550e8400-...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POLY_PASSPHRASE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;L2 authentication proves that a specific request came from the credential holder without requiring a private-key signature on every API call.&lt;/p&gt;
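&lt;p&gt;A sketch of building such headers with Node's crypto module. The exact canonical message Polymarket signs (field order, key encoding) lives in its official client, so treat the message construction below as an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHmac } from "node:crypto";

// Assumed canonical message: timestamp + method + path + body.
// Check the official client for the exact signing string and key encoding.
function signRequest(secret: string, method: string, path: string, body = "") {
  const timestamp = Math.floor(Date.now() / 1000).toString();
  const message = timestamp + method + path + body;
  const signature = createHmac("sha256", secret).update(message).digest("base64");
  return { timestamp, signature };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;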

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;Do not use the same authentication ceremony for every action.&lt;/p&gt;

&lt;p&gt;A practical model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Auth strength&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create API key&lt;/td&gt;
&lt;td&gt;Strong identity proof&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotate credentials&lt;/td&gt;
&lt;td&gt;Strong identity proof&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Place order&lt;/td&gt;
&lt;td&gt;Request signature/session credential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read public market data&lt;/td&gt;
&lt;td&gt;No auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read private account data&lt;/td&gt;
&lt;td&gt;Session credential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Withdraw funds&lt;/td&gt;
&lt;td&gt;Strong identity proof&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This maps beyond crypto. In a traditional app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L1 is “prove you are the account owner.”&lt;/li&gt;
&lt;li&gt;L2 is “prove this request came from an active authorized session.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction improves both security and usability.&lt;/p&gt;




&lt;h2&gt;Pattern 4: Treat high-stakes actions as signed payloads, not just API calls&lt;/h2&gt;

&lt;p&gt;On Polymarket, placing an order is not merely sending JSON to a server. The order is a cryptographically signed financial instruction.&lt;/p&gt;

&lt;p&gt;Example order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createAndPostOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tokenID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;71321045679...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;side&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Side&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BUY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tickSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;negRisk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;OrderType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GTC&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, the SDK creates an EIP-712 typed data structure, signs it with the user’s private key, and submits the signed order. The matching engine runs offchain, but matched trades settle on Polygon using those signatures.&lt;/p&gt;

&lt;p&gt;The important design point: the operator cannot fabricate trades or move funds without user authorization. The signed message is the authorization.&lt;/p&gt;
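&lt;p&gt;For illustration, this is roughly what EIP-712 typed-data signing looks like with ethers. The domain and order schema here are simplified placeholders, not Polymarket's actual typed-data definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Wallet } from "ethers";

// Placeholder domain and schema: not Polymarket's real definition.
const domain = { name: "ExampleExchange", version: "1", chainId: 137 };
const types = {
  Order: [
    { name: "tokenId", type: "string" },
    { name: "price", type: "string" },
    { name: "size", type: "string" },
    { name: "side", type: "string" },
  ],
};
const order = { tokenId: "71321045679...", price: "0.65", size: "100", side: "BUY" };

const wallet = new Wallet(process.env.PRIVATE_KEY!);

// The signed struct, not the transport credential, is the authorization.
const signature = await wallet.signTypedData(domain, types, order);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;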

&lt;h3&gt;Conventional API semantics&lt;/h3&gt;

&lt;p&gt;In a normal API, this means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please perform this action for me.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /orders
Authorization: Bearer &amp;lt;token&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server decides whether to execute the action.&lt;/p&gt;

&lt;h3&gt;Signed-message semantics&lt;/h3&gt;

&lt;p&gt;With signed orders, the payload means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here is a signed instruction authorizing this exact action.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API acts more like a relay than an authority.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;For high-stakes operations, consider making the payload itself verifiable.&lt;/p&gt;

&lt;p&gt;Useful domains include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial transactions&lt;/li&gt;
&lt;li&gt;Legal approvals&lt;/li&gt;
&lt;li&gt;Contract signatures&lt;/li&gt;
&lt;li&gt;Permission grants&lt;/li&gt;
&lt;li&gt;Sensitive workflow approvals&lt;/li&gt;
&lt;li&gt;Cross-system authorization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of relying only on transport-layer credentials, encode authorization into the payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"asset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USDC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recipient"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0x..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expiresAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-01T00:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0x..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you better auditability, non-repudiation, and replay protection when designed correctly.&lt;/p&gt;
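&lt;p&gt;A server-side verification sketch. The HMAC scheme, the &lt;code&gt;nonce&lt;/code&gt; field, and the naive JSON serialization are assumptions for illustration; the example payload above implies an asymmetric signature instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHmac, timingSafeEqual } from "node:crypto";

type SignedInstruction = {
  action: string;
  amount: string;
  expiresAt: string;  // reject stale instructions
  nonce: string;      // assumed field, used to reject replays
  signature: string;
};

const seenNonces = new Set&amp;lt;string&amp;gt;(); // use a bounded store in production

function verify(instruction: SignedInstruction, secret: string): boolean {
  // Expiry check defeats indefinite replay of old instructions.
  if (new Date(instruction.expiresAt).getTime() &amp;lt; Date.now()) return false;
  // Nonce check defeats replay inside the validity window.
  if (seenNonces.has(instruction.nonce)) return false;

  const { signature, ...payload } = instruction;
  // Production code needs a canonical serialization, not raw JSON.stringify.
  const expected = createHmac("sha256", secret)
    .update(JSON.stringify(payload))
    .digest("hex");
  if (signature.length !== expected.length) return false;
  if (!timingSafeEqual(Buffer.from(signature), Buffer.from(expected))) return false;

  seenNonces.add(instruction.nonce);
  return true;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;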




&lt;h2&gt;Pattern 5: Encode the domain ontology in the API model&lt;/h2&gt;

&lt;p&gt;Polymarket models prediction markets using two important objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An &lt;strong&gt;Event&lt;/strong&gt; is the broader question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Who will win the 2026 US Senate race in Pennsylvania?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A &lt;strong&gt;Market&lt;/strong&gt; is a specific tradable binary outcome inside that event:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Will Bob Casey win?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One event can contain many markets.&lt;/p&gt;

&lt;p&gt;Example structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"501"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026 Pennsylvania Senate Race"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"negRisk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"markets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2301"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Will Bob Casey win?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"outcomePrices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.42&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.58&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2302"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Will Dave McCormick win?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"outcomePrices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.35&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.65&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2303"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Will a third candidate win?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"outcomePrices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.23&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;0.77&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;]"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This distinction is not cosmetic. It tells API consumers how the domain works.&lt;/p&gt;

&lt;p&gt;The event groups related markets. The market represents a tradable outcome. The &lt;code&gt;negRisk&lt;/code&gt; flag signals that markets inside the event have capital relationships.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;Avoid flattening important domain concepts into generic resources.&lt;/p&gt;

&lt;p&gt;A weak model might expose only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /markets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A stronger model exposes relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /events
GET /events/:id/markets
GET /markets/:id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the distinction matters to business logic, it should exist in the API.&lt;/p&gt;

&lt;p&gt;Good domain modeling helps clients avoid incorrect assumptions. For example, if an automated trader ignores &lt;code&gt;negRisk: true&lt;/code&gt;, it may construct the wrong position model.&lt;/p&gt;

&lt;p&gt;Your API should make these relationships visible instead of hiding them in documentation.&lt;/p&gt;
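&lt;p&gt;In TypeScript terms, the ontology might be modeled like this. Field names follow the response shown above; treat it as a sketch rather than the full schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Market = {
  id: string;
  question: string;
  outcomePrices: string; // JSON-encoded array of price strings, as returned
};

type MarketEvent = {
  id: string;
  title: string;
  negRisk: boolean; // capital relationships exist between child markets
  markets: Market[];
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;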




&lt;h2&gt;Pattern 6: Represent domain invariants as API fields&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;negRisk&lt;/code&gt; flag is one of Polymarket’s most interesting design choices.&lt;/p&gt;

&lt;p&gt;In a standard multi-outcome event, each market can be treated independently. But in a NegRisk event, exactly one outcome can win. That creates mathematical relationships between positions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1 No token on outcome A ≡ 1 Yes token on every other outcome&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1× No (Other)&lt;/td&gt;
&lt;td&gt;1× Yes (Casey) + 1× Yes (McCormick)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not just theoretical. It affects trading and settlement behavior.&lt;/p&gt;

&lt;p&gt;Polymarket exposes this through API fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"negRisk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when placing orders, the client must pass the correct market options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createAndPostOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tokenID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;71321045679...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;side&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Side&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BUY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tickSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.01&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;negRisk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;OrderType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GTC&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the client gets this wrong, the order can be rejected or handled incorrectly.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;If your domain has hard rules, encode them as typed fields.&lt;/p&gt;

&lt;p&gt;Do not leave critical invariants only in prose documentation.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requiresKyc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"settlementMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"on_chain"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"isMutuallyExclusive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minCollateralRatio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.50"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"supportsPartialFill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expiresAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-01T00:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fields like these are valuable because clients can branch on them programmatically.&lt;/p&gt;

&lt;p&gt;Documentation explains the rule. The API should expose the rule.&lt;/p&gt;
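&lt;p&gt;Clients can then branch on the rule instead of re-reading the docs. A sketch over the hypothetical fields above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type MarketRules = {
  isMutuallyExclusive: boolean;
  supportsPartialFill: boolean;
  expiresAt: string;
};

function positionModel(rules: MarketRules): "netted-across-event" | "per-market" {
  // Mutually exclusive outcomes share capital, so positions must be
  // netted across the whole event rather than tracked market by market.
  return rules.isMutuallyExclusive ? "netted-across-event" : "per-market";
}

function acceptsOrders(rules: MarketRules): boolean {
  // Expired markets never accept new orders.
  return new Date(rules.expiresAt).getTime() &amp;gt; Date.now();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;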




&lt;h2&gt;Pattern 7: Treat changing market parameters as state, not configuration&lt;/h2&gt;

&lt;p&gt;Many financial APIs treat tick size as static. Polymarket exposes tick size as dynamic market state.&lt;/p&gt;

&lt;p&gt;When a market price approaches the extremes (above &lt;code&gt;0.96&lt;/code&gt; or below &lt;code&gt;0.04&lt;/code&gt;), the minimum tick size narrows from &lt;code&gt;0.01&lt;/code&gt; to &lt;code&gt;0.001&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Example WebSocket event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tick_size_change"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"asset_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"65818619657..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"old_tick_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"new_tick_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100000000"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is practical. Near extreme probabilities, a 1-cent tick is too coarse. Moving from &lt;code&gt;0.04&lt;/code&gt; to &lt;code&gt;0.03&lt;/code&gt; is a large relative move. A smaller tick allows prices like &lt;code&gt;0.973&lt;/code&gt; instead of forcing &lt;code&gt;0.97&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;Do not assume market parameters are static.&lt;/p&gt;

&lt;p&gt;For trading clients, tick size should be part of the current market state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;MarketState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;assetId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;bestBid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;bestAsk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tickSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a &lt;code&gt;tick_size_change&lt;/code&gt; event arrives, update local state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleTickSizeChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;asset_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;new_tick_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;marketState&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asset_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;tickSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;new_tick_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then validate orders against the current tick size before submitting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isValidPrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tickSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isInteger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;tickSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
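&lt;p&gt;With the article's own numbers near the extreme:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;isValidPrice(0.973, 0.001); // true: sits on the finer grid
isValidPrice(0.973, 0.01);  // false: too precise for a 1-cent tick
isValidPrice(0.97, 0.01);   // true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;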



&lt;p&gt;If your client hard-codes tick size, it will eventually submit invalid orders.&lt;/p&gt;

&lt;p&gt;The broader principle: changing domain state should be broadcast explicitly, not discovered only through failed requests.&lt;/p&gt;




&lt;h2&gt;Pattern 8: Use separate WebSocket layers for different real-time consumers&lt;/h2&gt;

&lt;p&gt;Polymarket runs two separate WebSocket systems.&lt;/p&gt;

&lt;h3&gt;Market Channel&lt;/h3&gt;

&lt;p&gt;The Market Channel is designed for trading consumers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://ws-subscriptions-clob.polymarket.com/ws/market
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It streams data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order book snapshots&lt;/li&gt;
&lt;li&gt;Price changes&lt;/li&gt;
&lt;li&gt;Trade executions&lt;/li&gt;
&lt;li&gt;Tick size changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Subscription example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assets_ids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"65818619657568813474341868652308942079804919287380422192892211131408793125422"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"market"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This channel is optimized around asset IDs and low-latency trading workflows.&lt;/p&gt;
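&lt;p&gt;A minimal browser-side consumer of that channel might look like this; the message handling is schematic, keyed off the &lt;code&gt;event_type&lt;/code&gt; field shown in Pattern 7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const ws = new WebSocket("wss://ws-subscriptions-clob.polymarket.com/ws/market");

ws.onopen = () =&amp;gt; {
  // Same subscription shape as the example above.
  ws.send(JSON.stringify({
    assets_ids: [
      "65818619657568813474341868652308942079804919287380422192892211131408793125422",
    ],
    type: "market",
  }));
};

ws.onmessage = (msg) =&amp;gt; {
  const event = JSON.parse(msg.data);
  if (event.event_type === "tick_size_change") {
    // Update local market state (see Pattern 7).
  }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;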

&lt;h3&gt;Real-Time Data Socket&lt;/h3&gt;

&lt;p&gt;The Real-Time Data Socket serves a different use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://ws-live-data.polymarket.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It streams broader platform activity, including comments, crypto prices, equity prices, and social interaction events.&lt;/p&gt;

&lt;p&gt;Subscription example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"subscribe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subscriptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crypto_prices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"update"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"btcusdt,ethusd"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These consumers have different needs.&lt;/p&gt;

&lt;p&gt;A market maker needs low-latency order book updates. A UI showing platform activity needs comments, prices, and social events. Combining both into one WebSocket system would force one infrastructure layer to serve conflicting requirements.&lt;/p&gt;

&lt;h3&gt;Implementation takeaway&lt;/h3&gt;

&lt;p&gt;Separate real-time infrastructure when consumers differ by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency requirements&lt;/li&gt;
&lt;li&gt;Message volume&lt;/li&gt;
&lt;li&gt;Failure tolerance&lt;/li&gt;
&lt;li&gt;Data shape&lt;/li&gt;
&lt;li&gt;Subscription model&lt;/li&gt;
&lt;li&gt;Operational priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical split might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ws/trading        low latency, order books, fills
/ws/activity       comments, notifications, social events
/ws/analytics      aggregates, leaderboards, dashboards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trying to make one WebSocket endpoint serve every use case usually creates unnecessary complexity and uneven performance.&lt;/p&gt;




&lt;h2&gt;What these patterns have in common&lt;/h2&gt;

&lt;p&gt;Polymarket’s API design makes the domain structure visible.&lt;/p&gt;

&lt;p&gt;The main patterns are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Separate APIs by operational domain.&lt;/li&gt;
&lt;li&gt;Make public read access easy when data liquidity matters.&lt;/li&gt;
&lt;li&gt;Use different authentication levels for different trust levels.&lt;/li&gt;
&lt;li&gt;Represent high-stakes actions as signed payloads.&lt;/li&gt;
&lt;li&gt;Encode the domain ontology in the API model.&lt;/li&gt;
&lt;li&gt;Surface domain invariants as explicit fields.&lt;/li&gt;
&lt;li&gt;Treat changing parameters as real-time state.&lt;/li&gt;
&lt;li&gt;Split WebSocket infrastructure by consumer profile.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The broader design lesson: do not abstract away distinctions that matter.&lt;/p&gt;

&lt;p&gt;If a concept affects client behavior, put it in the API. If a rule affects correctness, expose it as a field. If state changes over time, broadcast the change. If different consumers have different performance requirements, give them different interfaces.&lt;/p&gt;

&lt;p&gt;Good API design is not only about clean routes and consistent naming. It is about making the system’s real constraints understandable and programmable.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Get Free Unlimited Gemini API</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Sat, 09 May 2026 06:58:45 +0000</pubDate>
      <link>https://forem.com/hassann/get-free-unlimited-gemini-api-58f6</link>
      <guid>https://forem.com/hassann/get-free-unlimited-gemini-api-58f6</guid>
      <description>&lt;p&gt;Google’s Gemini family is a cost-effective frontier model line for high-volume workloads, but token costs can still grow quickly when a public app, side project, or hackathon demo gets real traffic. Puter.js changes the billing model: it exposes Gemini models such as 2.5 Pro, 2.5 Flash, 2.0 Flash, 3 Flash Preview, and Gemma models without requiring your Google API key. The end user signs in with Puter and covers usage from their account, while your app calls the model from the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Puter.js&lt;/strong&gt; gives browser apps access to Gemini and Gemma models without a Google API key, Google Cloud project, or backend.&lt;/li&gt;
&lt;li&gt;Supported Gemini models include &lt;strong&gt;2.5 Pro, 2.5 Flash, 2.5 Flash Lite, 2.0 Flash, 2.0 Flash Lite, 3 Flash Preview&lt;/strong&gt;, plus dated previews.&lt;/li&gt;
&lt;li&gt;Supported Gemma models include &lt;strong&gt;Gemma 2, 3, and 4&lt;/strong&gt; in multiple sizes.&lt;/li&gt;
&lt;li&gt;Setup can be as small as one &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag and one &lt;code&gt;puter.ai.chat()&lt;/code&gt; call.&lt;/li&gt;
&lt;li&gt;Streaming, image input, and temperature control work from the browser.&lt;/li&gt;
&lt;li&gt;Usage is charged to the signed-in Puter user, not your developer account.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to compare a Puter prototype with the official Gemini API before migrating.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How “free unlimited” works&lt;/h2&gt;

&lt;p&gt;Puter.js inverts the usual LLM billing flow.&lt;/p&gt;

&lt;p&gt;Instead of your app holding a Google AI Studio key and paying for every token, the user signs in to Puter. Calls are made on behalf of that signed-in user and usage is charged against their Puter balance. New Puter accounts receive starter credit, and users can top up if they need more.&lt;/p&gt;

&lt;p&gt;For developers, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No Google Cloud project&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No AI Studio API key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No server-side token proxy&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No key rotation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No billing exposure from public traffic&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off: Puter.js is browser-first. It assumes a user session, so it is not a clean fit for backend-only jobs such as cron tasks, batch processors, or webhooks.&lt;/p&gt;

&lt;h2&gt;Step 1: Install Puter.js&lt;/h2&gt;

&lt;p&gt;For a static page, add the CDN script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is enough to call Gemini from the browser.&lt;/p&gt;

&lt;p&gt;For a bundled app, install the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @heyputer/puter.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then import it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@heyputer/puter.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Step 2: Pick a Gemini or Gemma model&lt;/h2&gt;

&lt;p&gt;Choose the smallest model that handles your task well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.5-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard reasoning, complex analysis, long-context tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default choice for most app features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.5-flash-lite&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-volume classification, tagging, simple Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-2.0-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stable baseline with well-understood behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemini-3-flash-preview&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Latest preview model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemma-3-27b-it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open Gemma, instruction-tuned workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;google/gemma-4-31b-it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Larger open Gemma option&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most apps, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;google/gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;google/gemini-2.5-pro&lt;/code&gt; only when the prompt needs stronger reasoning. Use Lite variants for high-volume, low-complexity tasks like classification or tagging.&lt;/p&gt;
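&lt;p&gt;One way to keep that guidance in code is a small helper that maps task types to model IDs (purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative mapping from task type to model ID.
function pickModel(task) {
  switch (task) {
    case "reasoning": return "google/gemini-2.5-pro";        // hard, long-context work
    case "bulk":      return "google/gemini-2.5-flash-lite"; // classification, tagging
    default:          return "google/gemini-2.5-flash";      // sensible default
  }
}

const reply = await puter.ai.chat("Tag this support ticket: 'app crashes on login'", {
  model: pickModel("bulk"),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;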

&lt;h2&gt;Step 3: Make your first Gemini call&lt;/h2&gt;

&lt;p&gt;Create an HTML file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain machine learning in three sentences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the file in a browser.&lt;/p&gt;

&lt;p&gt;On first use, Puter handles authentication. The user signs in or creates a free Puter account, then the response is printed to the page.&lt;/p&gt;

&lt;p&gt;No API key. No &lt;code&gt;.env&lt;/code&gt; file. No backend route.&lt;/p&gt;

&lt;h2&gt;Step 4: Stream the response&lt;/h2&gt;

&lt;p&gt;For chat UIs, stream tokens as they arrive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain photosynthesis in detail&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;outputDiv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHTML&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A simple UI target could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"output"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputDiv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;output&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;part.text&lt;/code&gt; contains a response chunk. Append it to your UI so the user sees the answer appear progressively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Send image input to Gemini
&lt;/h2&gt;

&lt;p&gt;Gemini supports multimodal prompts. With Puter.js, pass the prompt first, then the image URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What do you see in this image? Describe colors, objects, and mood.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://assets.puter.site/doge.jpeg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Practical use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alt-text generation&lt;/li&gt;
&lt;li&gt;Visual question answering&lt;/li&gt;
&lt;li&gt;Screenshot analysis&lt;/li&gt;
&lt;li&gt;OCR-style extraction&lt;/li&gt;
&lt;li&gt;Accessibility tooling&lt;/li&gt;
&lt;li&gt;Product image tagging&lt;/li&gt;
&lt;li&gt;Diagram explanation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 6: Tune temperature
&lt;/h2&gt;

&lt;p&gt;Pass model parameters in the options object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write a creative short story about a robot chef&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use lower values for deterministic output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON generation&lt;/li&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;li&gt;Extraction&lt;/li&gt;
&lt;li&gt;Factual answers&lt;/li&gt;
&lt;li&gt;Structured summaries&lt;/li&gt;
&lt;/ul&gt;
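
&lt;p&gt;For example, a deterministic extraction call might look like this (the prompt and expected JSON keys are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// temperature 0.0 keeps repeated runs of the same prompt stable.
const extraction = await puter.ai.chat(
  "Return only JSON with keys name and email from: 'Reach Ana at ana@example.com'",
  {
    model: "google/gemini-2.5-flash",
    temperature: 0.0,
  }
);

console.log(extraction);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;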

&lt;p&gt;Use higher values for more variation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brainstorming&lt;/li&gt;
&lt;li&gt;Creative writing&lt;/li&gt;
&lt;li&gt;Marketing copy&lt;/li&gt;
&lt;li&gt;Ideation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 7: Build multi-turn conversations
&lt;/h2&gt;

&lt;p&gt;Pass an array of messages instead of a single string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I am building a Next.js app with Postgres.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Got it. What do you need help with?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;How should I structure migrations?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an actual chat UI, keep updating the message array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini receives the full conversation history on each call.&lt;/p&gt;
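
&lt;p&gt;Because the full history is resent on every turn, long chats grow the prompt without bound. A rough trimming sketch; the turn limit is an app-level choice, not a Puter setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Keep only the most recent turns before each call.
const MAX_TURNS = 20; // arbitrary budget; tune for your prompts

function trimHistory(history) {
  // slice(-n) returns the whole array when it is shorter than n
  return history.slice(-MAX_TURNS);
}

const reply = await puter.ai.chat(trimHistory(messages), {
  model: "google/gemini-2.5-flash",
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;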

&lt;h2&gt;
  
  
  Compare Gemini with other models
&lt;/h2&gt;

&lt;p&gt;Puter exposes multiple model providers through one interface. You can benchmark the same prompt across models by changing only the &lt;code&gt;model&lt;/code&gt; string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-ai/grok-4.3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Refactor this React component to use hooks: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;---&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this pattern to compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Output quality&lt;/li&gt;
&lt;li&gt;Formatting consistency&lt;/li&gt;
&lt;li&gt;Instruction following&lt;/li&gt;
&lt;li&gt;Coding accuracy&lt;/li&gt;
&lt;li&gt;Cost profile for the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many apps, Gemini Flash is a strong default when latency matters. For harder prompts, benchmark against other models before choosing a production default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Puter.js gives you
&lt;/h2&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 2.5, 2.0, and 3 Flash variants&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Pro&lt;/li&gt;
&lt;li&gt;Gemma 2, 3, and 4 models&lt;/li&gt;
&lt;li&gt;Multi-turn conversations&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;Image URL input&lt;/li&gt;
&lt;li&gt;Temperature control&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;System prompts (see the sketch below)&lt;/li&gt;
&lt;li&gt;Browser-based production usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the current Puter version, you may not get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native Gemini function calling&lt;/li&gt;
&lt;li&gt;Code execution tools&lt;/li&gt;
&lt;li&gt;Google Search grounding&lt;/li&gt;
&lt;li&gt;Gemini’s full 2M-token context ceiling on every model&lt;/li&gt;
&lt;li&gt;Server-side use without a browser session&lt;/li&gt;
&lt;li&gt;Direct Google rate-limit visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For agentic workflows that require code execution, grounding, or strict server-side control, the official Google Gemini API is usually the better fit. For browser-based chat, Q&amp;amp;A, content generation, and vision tasks, Puter.js is often enough.&lt;/p&gt;
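
&lt;p&gt;The feature list above includes system prompts. A minimal sketch, assuming Puter accepts the standard OpenAI-style &lt;code&gt;system&lt;/code&gt; role in the messages array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// A system message pins behavior for the whole conversation.
const response = await puter.ai.chat(
  [
    { role: "system", content: "You are a terse assistant. Answer in one sentence." },
    { role: "user", content: "What is a WebSocket?" },
  ],
  { model: "google/gemini-2.5-flash" }
);

console.log(response);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;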

&lt;h2&gt;
  
  
  When to use Puter.js vs the official Gemini API
&lt;/h2&gt;

&lt;p&gt;Use Puter.js when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building a free public app.&lt;/li&gt;
&lt;li&gt;You do not want token costs attached to your developer account.&lt;/li&gt;
&lt;li&gt;You are prototyping quickly.&lt;/li&gt;
&lt;li&gt;You do not want to configure Google Cloud.&lt;/li&gt;
&lt;li&gt;You are building a static site, hackathon app, or browser extension.&lt;/li&gt;
&lt;li&gt;Your users can sign in to Puter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the official Gemini API when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need backend calls.&lt;/li&gt;
&lt;li&gt;You need cron jobs, batch jobs, or webhooks.&lt;/li&gt;
&lt;li&gt;You need code execution.&lt;/li&gt;
&lt;li&gt;You need Search grounding.&lt;/li&gt;
&lt;li&gt;You need the full Gemini Pro long-context ceiling.&lt;/li&gt;
&lt;li&gt;You need a direct compliance or billing relationship with Google.&lt;/li&gt;
&lt;li&gt;You need fine-tuning on your own dataset.&lt;/li&gt;
&lt;li&gt;Your users will not accept a Puter sign-in step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a standalone Gemini 3 Flash walkthrough, see &lt;a href="http://apidog.com/blog/how-to-use-gemini-3-flash-preview-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Gemini 3 Flash Preview API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test the integration with Apidog
&lt;/h2&gt;

&lt;p&gt;Puter calls happen in the browser, so you cannot test them exactly like a backend REST API. A practical workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a small static Puter page.&lt;/li&gt;
&lt;li&gt;Accept a prompt through a query parameter (sketch after this list).&lt;/li&gt;
&lt;li&gt;Use that page for browser-based prototype testing.&lt;/li&gt;
&lt;li&gt;Use Apidog to validate the official Gemini API surface for a future migration.&lt;/li&gt;
&lt;li&gt;Keep Puter and Gemini API configs as separate environments.&lt;/li&gt;
&lt;/ol&gt;
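
&lt;p&gt;Step 2 of that workflow takes only a few lines. A sketch of a static test page that reads the prompt from the URL; the &lt;code&gt;prompt&lt;/code&gt; query parameter is a convention of this harness, not a Puter feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&amp;lt;script src="https://js.puter.com/v2/"&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;script&amp;gt;
  // e.g. open test.html?prompt=Explain%20CORS
  const prompt = new URLSearchParams(location.search).get("prompt") ?? "Say hello";

  puter.ai.chat(prompt, { model: "google/gemini-2.5-flash" })
    .then(function (response) { puter.print(response); });
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;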

&lt;p&gt;Example environment split:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;Base URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;puter-prototype&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your localhost/static page URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemini-prod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://generativelanguage.googleapis.com/v1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;download Apidog&lt;/a&gt;, create both environments, and keep the same prompt payloads documented in one collection.&lt;/p&gt;

&lt;p&gt;For more API testing patterns, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tool for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other free LLM paths through Puter
&lt;/h2&gt;

&lt;p&gt;The same user-pays model works across other providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://apidog.com/blog/get-free-unlimited-claude-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Get free unlimited Claude API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://apidog.com/blog/get-free-unlimited-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Get free unlimited GPT-5.5 API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use Grok 4.3 for free&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://apidog.com/blog/get-free-unlimited-deepseek-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Get free unlimited DeepSeek API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation pattern is the same: keep the Puter script and switch the &lt;code&gt;model&lt;/code&gt; value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Summarize this issue for a developer changelog.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is this truly unlimited?
&lt;/h3&gt;

&lt;p&gt;Unlimited from the developer’s side, yes. Your app does not pay per token from your own Google account. The signed-in Puter user has whatever balance is available in their Puter account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need a Google account or Google Cloud project?
&lt;/h3&gt;

&lt;p&gt;No. Puter handles the upstream relationship. Your app does not need a Google API key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use this in production?
&lt;/h3&gt;

&lt;p&gt;Yes, for browser-based apps. The main product decision is whether your users are willing to sign in with Puter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Gemini through Puter behave like the official API?
&lt;/h3&gt;

&lt;p&gt;Puter calls Google’s API on the user’s behalf, so responses come from the same underlying models as the official API. Latency may differ because Puter adds another hop between your browser app and the upstream model.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about Gemini’s 2M-token context window?
&lt;/h3&gt;

&lt;p&gt;Puter may not expose the full 2M-token ceiling for every model variant. If your app depends on extremely long context, use the official Google Gemini API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Puter Gemini in a Discord bot or backend service?
&lt;/h3&gt;

&lt;p&gt;Not cleanly. Puter.js is browser-first and assumes a logged-in user session. Backend services should use the official Gemini API directly.&lt;/p&gt;
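
&lt;p&gt;For the backend path, the call goes straight to Google. An illustrative Node sketch against the &lt;code&gt;generativelanguage.googleapis.com&lt;/code&gt; endpoint from the environment table above; the model name, request shape, and key handling follow Google’s documented REST API, but verify them against the current docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Server-side: the key lives in an environment variable, never in the browser.
const url =
  "https://generativelanguage.googleapis.com/v1/models/gemini-2.5-flash:generateContent?key=" +
  process.env.GEMINI_API_KEY;

const res = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    contents: [{ parts: [{ text: "Summarize this changelog entry." }] }],
  }),
});

const data = await res.json();
console.log(data.candidates?.[0]?.content?.parts?.[0]?.text);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;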

&lt;h3&gt;
  
  
  What model should I default to?
&lt;/h3&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;google/gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;google/gemini-2.5-pro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for difficult reasoning tasks.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;google/gemini-2.5-flash-lite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;for high-volume classification or tagging.&lt;/p&gt;
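
&lt;p&gt;If you support several task types, a small router keeps those defaults in one place. The mapping is an app-level convention built from the model IDs above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Route each task to the cheapest model that handles it well.
function geminiModelFor(task) {
  if (task === "reasoning") return "google/gemini-2.5-pro";
  if (task === "tagging") return "google/gemini-2.5-flash-lite";
  return "google/gemini-2.5-flash"; // general default
}

const response = await puter.ai.chat("Classify: 'refund not received'", {
  model: geminiModelFor("tagging"),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;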

&lt;h3&gt;
  
  
  Is Imagen image generation supported?
&lt;/h3&gt;

&lt;p&gt;Puter exposes image generation through OpenAI image models such as &lt;code&gt;gpt-image-2&lt;/code&gt; and DALL-E variants today, not Imagen. See &lt;a href="http://apidog.com/blog/get-free-unlimited-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Get free unlimited GPT-5.5 API&lt;/a&gt; for that path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Puter.js is a practical way to add Gemini to browser-based apps without managing Google Cloud, API keys, or developer-side token billing.&lt;/p&gt;

&lt;p&gt;The basic implementation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain this code snippet.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Puter.js for prototypes, hackathon builds, free public apps, static sites, and browser extensions. Use the official Gemini API when you need backend execution, fine-tuning, code tools, Search grounding, or maximum long-context support.&lt;/p&gt;

&lt;p&gt;Build the request once in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, compare Puter with the official API, and choose the path that matches your app.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Get Free Unlimited GPT-5.5 API and All OpenAI Models</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Sat, 09 May 2026 02:34:52 +0000</pubDate>
      <link>https://forem.com/hassann/get-free-unlimited-gpt-55-api-and-all-openai-models-1k4c</link>
      <guid>https://forem.com/hassann/get-free-unlimited-gpt-55-api-and-all-openai-models-1k4c</guid>
      <description>&lt;p&gt;OpenAI’s GPT-5.5 API pricing ($5 per million input tokens, $30 per million output tokens) can block side projects, hackathon apps, and free public tools before they ship. Puter.js offers a browser-first workaround: it exposes OpenAI models such as GPT-5.5, GPT-5.5 Pro, GPT-5.x variants, GPT-Image-2, DALL-E, and OpenAI TTS without requiring your OpenAI API key. Instead of billing you, usage is charged to the signed-in Puter end user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Puter.js&lt;/strong&gt; when you want OpenAI access in the browser without managing an OpenAI account, API key, backend, or billing.&lt;/li&gt;
&lt;li&gt;Text models include &lt;strong&gt;gpt-5.5, gpt-5.5-pro, gpt-5.4, gpt-5, gpt-5-mini, o1, o3, gpt-4.1, gpt-4o&lt;/strong&gt;, and chat/codex variants.&lt;/li&gt;
&lt;li&gt;Image models include &lt;strong&gt;gpt-image-2, gpt-image-1.5, dall-e-3&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;TTS models include &lt;strong&gt;gpt-4o-mini-tts, tts-1, tts-1-hd&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add one &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag, call &lt;code&gt;puter.ai.chat()&lt;/code&gt;, and you can run GPT-5.5 from a browser page.&lt;/li&gt;
&lt;li&gt;Streaming, function calling, vision input, image generation, and text-to-speech are available from the browser.&lt;/li&gt;
&lt;li&gt;The end user covers usage through their Puter account; your app does not receive OpenAI invoices.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to compare Puter-based prototypes with the official OpenAI API before migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Puter’s “free unlimited” model works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.puter.com/tutorials/free-unlimited-openai-api/" rel="noopener noreferrer"&gt;Puter.js&lt;/a&gt; changes who pays for LLM usage.&lt;/p&gt;

&lt;p&gt;In a standard OpenAI integration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You create an OpenAI account.&lt;/li&gt;
&lt;li&gt;You store an API key.&lt;/li&gt;
&lt;li&gt;Your app sends requests to OpenAI.&lt;/li&gt;
&lt;li&gt;You pay for all user usage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Puter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your app loads Puter.js in the browser.&lt;/li&gt;
&lt;li&gt;The user signs in to Puter.&lt;/li&gt;
&lt;li&gt;Your app calls OpenAI-compatible models through Puter.&lt;/li&gt;
&lt;li&gt;Usage is charged to the user’s Puter balance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For developers, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No OpenAI key in your repo&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No token bill attached to your account&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No server required for browser apps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No per-developer usage cap&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off: Puter is browser-first. If you need cron jobs, webhook handlers, background workers, or backend-only automation, use the official OpenAI API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install Puter.js
&lt;/h2&gt;

&lt;p&gt;For a plain HTML page, add the CDN script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is enough for static sites, prototypes, browser extensions, and hackathon demos.&lt;/p&gt;

&lt;p&gt;For a bundled JavaScript app, install the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @heyputer/puter.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then import it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@heyputer/puter.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the CDN when you want the fastest possible setup. Use the npm package when you want bundler support and TypeScript types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Choose a model
&lt;/h2&gt;

&lt;p&gt;Puter exposes GPT-5.x models and older OpenAI models. Pick the smallest model that meets your quality requirements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-5.5-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard reasoning, coding agents, complex analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-5.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default model for general chat and reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-5.4-nano&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast, low-cost classification or extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-5.4-mini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chat UIs and mid-complexity tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-5.3-codex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code-focused workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Complex reasoning chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;o1-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agentic multi-step planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;gpt-4.1&lt;/code&gt;, &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Stable baseline models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For image generation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gpt-image-2&lt;/code&gt;: latest image model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpt-image-1.5&lt;/code&gt;, &lt;code&gt;gpt-image-1&lt;/code&gt;, &lt;code&gt;dall-e-3&lt;/code&gt;, &lt;code&gt;dall-e-2&lt;/code&gt;: older stable options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For text-to-speech:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;: newer TTS model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tts-1&lt;/code&gt;, &lt;code&gt;tts-1-hd&lt;/code&gt;: classic TTS options&lt;/li&gt;
&lt;/ul&gt;
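
&lt;p&gt;A small helper can encode those defaults so call sites stay clean. The mapping below is illustrative, built from the model IDs above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Pick the smallest model that fits the task.
const MODEL_BY_TASK = {
  reasoning: "gpt-5.5-pro",
  chat: "gpt-5.5",
  classify: "gpt-5.4-nano",
  code: "gpt-5.3-codex",
};

function pickModel(task) {
  return MODEL_BY_TASK[task] ?? "gpt-5.5"; // gpt-5.5 as the general default
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;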

&lt;h2&gt;
  
  
  Step 3: Call GPT-5.5 from the browser
&lt;/h2&gt;

&lt;p&gt;Create an &lt;code&gt;index.html&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain WebSockets in three sentences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the file in a browser.&lt;/p&gt;

&lt;p&gt;Puter handles authentication and the model request. On first use, the user signs in or creates a Puter account. You do not need an OpenAI key, &lt;code&gt;.env&lt;/code&gt; file, proxy server, or backend route.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Stream responses for chat UIs
&lt;/h2&gt;

&lt;p&gt;For long answers, stream tokens instead of waiting for the full response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain the theory of relativity in detail&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real UI, append each chunk to the current assistant message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#output&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use streaming for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots&lt;/li&gt;
&lt;li&gt;Documentation assistants&lt;/li&gt;
&lt;li&gt;Long-form explanations&lt;/li&gt;
&lt;li&gt;Code generation&lt;/li&gt;
&lt;li&gt;Any UX where users should see progress immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 5: Send image input to a vision model
&lt;/h2&gt;

&lt;p&gt;Pass the prompt, image URL, and model options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What do you see in this image? Describe colors, objects, and mood.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://assets.puter.site/doge.jpeg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use vision input for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alt-text generation&lt;/li&gt;
&lt;li&gt;Screenshot analysis&lt;/li&gt;
&lt;li&gt;Visual QA&lt;/li&gt;
&lt;li&gt;OCR-like workflows&lt;/li&gt;
&lt;li&gt;Accessibility tooling&lt;/li&gt;
&lt;li&gt;Product image inspection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works with GPT-5.x models and GPT-4o variants.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Generate images
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;puter.ai.txt2img()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;txt2img&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;A futuristic cityscape at night, cinematic, neon, rain&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-image-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageElement&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageElement&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;txt2img()&lt;/code&gt; returns an &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; element that you can insert directly into the DOM.&lt;/p&gt;

&lt;p&gt;Example with a basic UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"prompt"&lt;/span&gt; &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"Describe an image..."&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"generate"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Generate&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"result"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
  &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#generate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generating...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;txt2img&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-image-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHTML&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user pays the image generation cost from their Puter account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Convert text to speech
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;puter.ai.txt2speech()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;txt2speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Welcome back. Your account balance is $1,247.50.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o-mini-tts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;controls&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function returns an &lt;code&gt;&amp;lt;audio&amp;gt;&lt;/code&gt; element.&lt;/p&gt;

&lt;p&gt;Use it for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice prompts&lt;/li&gt;
&lt;li&gt;Accessibility narration&lt;/li&gt;
&lt;li&gt;Product walkthroughs&lt;/li&gt;
&lt;li&gt;App voiceovers&lt;/li&gt;
&lt;li&gt;Podcast intros&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 8: Add function calling
&lt;/h2&gt;

&lt;p&gt;Puter supports the standard OpenAI-style tool definition shape.&lt;/p&gt;

&lt;p&gt;Define your tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get_weather&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Get the current weather for a city.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;city&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send the prompt with tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's the weather in Tokyo right now?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Function:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Arguments:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Execute your function here.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model emits the tool call. Your app is responsible for executing the function and sending the result back into the conversation.&lt;/p&gt;
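
&lt;p&gt;Here is a minimal round-trip sketch. It assumes Puter accepts OpenAI-style &lt;code&gt;tool&lt;/code&gt; messages on the follow-up call, and the &lt;code&gt;getWeather&lt;/code&gt; implementation is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical implementation of the tool declared earlier.
async function getWeather({ city }) {
  return `22°C and clear in ${city}`;
}

const args = JSON.parse(call.function.arguments); // arguments arrive as a JSON string
const result = await getWeather(args);

// Assumption: OpenAI-style tool messages are accepted on the follow-up call.
const followUp = await puter.ai.chat(
  [
    { role: "user", content: "What's the weather in Tokyo right now?" },
    response.message, // the assistant turn that contains the tool call
    { role: "tool", tool_call_id: call.id, content: result }
  ],
  { model: "gpt-5.5", tools }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;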

&lt;p&gt;For testing tool-driven flows in production-grade settings, see &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing in Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Tune &lt;code&gt;temperature&lt;/code&gt; and &lt;code&gt;max_tokens&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Pass OpenAI-style parameters in the options object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Tell me about Mars&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recommended defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;defaults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick the temperature based on how much variation you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="c1"&gt;// deterministic / factual&lt;/span&gt;
&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;// documentation, summaries, QA&lt;/span&gt;
&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="c1"&gt;// creative writing&lt;/span&gt;
&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="c1"&gt;// highly varied output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;max_tokens&lt;/code&gt; to keep responses bounded and avoid unnecessary user-side cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Puter gives you
&lt;/h2&gt;

&lt;p&gt;Puter’s browser-first OpenAI access is useful when you want to ship quickly without handling billing.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.x models, including GPT-5.5 and GPT-5.5 Pro&lt;/li&gt;
&lt;li&gt;Older OpenAI models such as GPT-4.1, GPT-4o, o1, and o3&lt;/li&gt;
&lt;li&gt;GPT-Image-2 and DALL-E image generation&lt;/li&gt;
&lt;li&gt;OpenAI TTS models, including &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Streaming&lt;/li&gt;
&lt;li&gt;Vision input&lt;/li&gt;
&lt;li&gt;Function calling&lt;/li&gt;
&lt;li&gt;Temperature control&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Puter does not replace
&lt;/h2&gt;

&lt;p&gt;Puter is not a full replacement for every official OpenAI API workflow.&lt;/p&gt;

&lt;p&gt;You may not get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responses API support&lt;/li&gt;
&lt;li&gt;Prompt caching cost controls&lt;/li&gt;
&lt;li&gt;Files API support&lt;/li&gt;
&lt;li&gt;Backend-only usage without a browser session&lt;/li&gt;
&lt;li&gt;Direct OpenAI rate-limit headers&lt;/li&gt;
&lt;li&gt;OpenAI structured output mode and JSON schema enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the official OpenAI API when you need backend execution, compliance controls, structured outputs, prompt caching, or direct OpenAI account management.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use Puter vs the official OpenAI API
&lt;/h2&gt;

&lt;p&gt;Use Puter when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building a browser-based app.&lt;/li&gt;
&lt;li&gt;You want to avoid OpenAI billing exposure.&lt;/li&gt;
&lt;li&gt;You are prototyping and do not want to set up an OpenAI account.&lt;/li&gt;
&lt;li&gt;You are building a static site, browser extension, or hackathon demo.&lt;/li&gt;
&lt;li&gt;Your users are willing to sign in to Puter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the official OpenAI API when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need server-side calls.&lt;/li&gt;
&lt;li&gt;You need cron jobs, webhook handlers, queues, or batch processing.&lt;/li&gt;
&lt;li&gt;You need prompt caching.&lt;/li&gt;
&lt;li&gt;You need the Responses API, Files API, or structured outputs.&lt;/li&gt;
&lt;li&gt;You need compliance terms such as BAAs, SOC 2, or residency guarantees.&lt;/li&gt;
&lt;li&gt;Your users will not accept a Puter sign-in step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many projects can start with Puter, validate the product, then migrate to the official API when backend or compliance requirements appear.&lt;/p&gt;
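
&lt;p&gt;One way to keep that migration cheap is to hide the provider behind a small adapter. A sketch, where &lt;code&gt;/api/chat&lt;/code&gt; is a hypothetical backend route and the response handling notes its own assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Prototype path: Puter in the browser.
async function chatViaPuter(prompt) {
  const response = await puter.ai.chat(prompt, { model: "gpt-5.5" });
  return String(response); // assumption: the response object stringifies to the assistant text
}

// Production path: a hypothetical backend route that proxies the official API.
async function chatViaBackend(prompt) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt })
  });
  const data = await res.json();
  return data.text;
}

// Swap one line when backend or compliance requirements appear.
const chat = chatViaPuter;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;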

&lt;p&gt;For a paid production setup, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the integration in Apidog
&lt;/h2&gt;

&lt;p&gt;Puter calls run in the browser, so you cannot test them like a normal backend API request. A practical setup is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a static HTML page that loads Puter.js.&lt;/li&gt;
&lt;li&gt;Accept the prompt from a query parameter.&lt;/li&gt;
&lt;li&gt;Use the page as your &lt;code&gt;puter-prototype&lt;/code&gt; test target.&lt;/li&gt;
&lt;li&gt;Create a separate &lt;code&gt;openai-prod&lt;/code&gt; environment for the official OpenAI API.&lt;/li&gt;
&lt;li&gt;Keep both environments in the same Apidog collection for migration planning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example local Puter test page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;pre&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"output"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Loading...&lt;span class="nt"&gt;&amp;lt;/pre&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Say hello&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#output&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx serve &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then call it in the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:3000?prompt=Explain%20JWT%20in%20one%20paragraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Apidog to model the official OpenAI request you may migrate to later.&lt;/p&gt;
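
&lt;p&gt;A sketch of the official request you would model in Apidog, using the standard chat completions endpoint. Run it server-side only, as an ES module so top-level &lt;code&gt;await&lt;/code&gt; works, because it carries your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Node.js sketch; assumes OPENAI_API_KEY is set in the environment.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "gpt-5.5",
    messages: [{ role: "user", content: "Explain JWT in one paragraph" }],
    temperature: 0.2,
    max_tokens: 200
  })
});
const data = await res.json();
console.log(data.choices[0].message.content);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;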

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-29.png" alt="" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; and create two environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;puter-prototype&lt;/code&gt;: your localhost page that runs Puter.js&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;openai-prod&lt;/code&gt;: &lt;code&gt;https://api.openai.com/v1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For broader API testing patterns, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tool for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is Puter unlimited for developers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. The developer does not pay for usage through their own OpenAI account. Usage is charged to the signed-in user’s Puter balance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an OpenAI account?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Puter handles the OpenAI relationship. Your app does not need an OpenAI API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use this in production?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, for browser-based apps. The key product question is whether your users are willing to sign in to Puter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does GPT-5.5 through Puter behave the same as the official API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model output should come from the same OpenAI model because Puter calls OpenAI on the user’s behalf. Latency may differ because there is an extra layer between your app and OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Puter support prompt caching?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Puter does not expose OpenAI prompt caching pricing controls today. If prompt caching is important for your workload, use the official OpenAI API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use Puter from a backend service?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not cleanly. Puter is browser-first and assumes a user session. Backend services should use the official OpenAI API.&lt;/p&gt;

&lt;p&gt;For free server-side options, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API for free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What model should I start with?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gpt-5.5&lt;/code&gt; for general chat and reasoning&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpt-5.4-nano&lt;/code&gt; for high-volume classification&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpt-5.5-pro&lt;/code&gt; for harder reasoning&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;o3&lt;/code&gt; for long reasoning chains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Will users be charged a lot?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most chat-style usage costs cents per session at OpenAI-style rates. Image generation is usually more expensive than text. Use &lt;code&gt;max_tokens&lt;/code&gt;, avoid unnecessary regeneration, and make cost-producing actions explicit in the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I generate images with Puter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Use &lt;code&gt;puter.ai.txt2img()&lt;/code&gt; with &lt;code&gt;gpt-image-2&lt;/code&gt; or DALL-E models. The user pays from their Puter balance.&lt;/p&gt;
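
&lt;p&gt;A minimal sketch; &lt;code&gt;txt2img&lt;/code&gt; resolves to an image element you can attach to the DOM. Selecting a specific model through an options object is an assumption based on the &lt;code&gt;chat()&lt;/code&gt; pattern, so check the current Puter docs before relying on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// txt2img resolves to a ready-to-use image element.
const img = await puter.ai.txt2img("A watercolor fox in the snow");
document.body.appendChild(img);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;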

&lt;p&gt;For the official paid API guide, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-image-2-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-Image-2 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Puter.js is a practical way to add GPT-5.5, image generation, vision, function calling, streaming, and TTS to browser-based apps without managing an OpenAI key or paying for user traffic yourself.&lt;/p&gt;

&lt;p&gt;Use Puter for prototypes, hackathon builds, static sites, browser extensions, and free public apps. Use the official OpenAI API for backend workloads, compliance requirements, prompt caching, the Responses API, Files API, or strict structured outputs.&lt;/p&gt;

&lt;p&gt;Build and compare your requests in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, test the migration path, and choose the integration model that fits your app.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Get Free Unlimited Claude Opus 4.7 API</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Sat, 09 May 2026 02:27:58 +0000</pubDate>
      <link>https://forem.com/hassann/get-free-unlimited-claude-opus-47-api-18jc</link>
      <guid>https://forem.com/hassann/get-free-unlimited-claude-opus-47-api-18jc</guid>
      <description>&lt;p&gt;Anthropic’s Claude models are strong choices for coding, agentic workflows, and long-context reasoning, but the official API cost can block small projects fast. Puter.js changes the billing model: you call Claude from the browser without an Anthropic API key, and usage is billed to the signed-in Puter user instead of your developer account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide shows how to wire Claude into a browser app with Puter.js, choose a model, stream responses, maintain chat state, and understand when you should switch to the official Anthropic API.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Puter.js&lt;/strong&gt; lets browser apps call Claude without an Anthropic API key, backend, or developer-side billing.&lt;/li&gt;
&lt;li&gt;The end user signs in to Puter and covers their own usage.&lt;/li&gt;
&lt;li&gt;Supported models include &lt;strong&gt;Opus 4.7, Opus 4.6, Opus 4.6 Fast, Opus 4.5, Opus 4.1, Opus 4, Sonnet 4.6, Sonnet 4.5, Sonnet 4, and Haiku 4.5&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add one &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag, then call &lt;code&gt;puter.ai.chat()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Streaming, system prompts, and multi-turn conversations are supported.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to benchmark prompts against the official Anthropic API when you plan a migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How the Puter billing model works
&lt;/h2&gt;

&lt;p&gt;With the official Anthropic API, you usually do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Anthropic account.&lt;/li&gt;
&lt;li&gt;Store an API key.&lt;/li&gt;
&lt;li&gt;Proxy requests through your backend.&lt;/li&gt;
&lt;li&gt;Pay for every user’s tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Puter.js, the flow changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your frontend loads Puter.js.&lt;/li&gt;
&lt;li&gt;The user signs in to Puter.&lt;/li&gt;
&lt;li&gt;Your app calls &lt;code&gt;puter.ai.chat()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Usage is charged to the user’s Puter account.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For you as the developer, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No API key in your repo&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No Anthropic billing account required&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No backend required for basic browser apps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No shared usage cap across your whole user base&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main constraint: Puter.js is browser-first. If you need cron jobs, backend workers, Discord bots, batch processing, or server-side API routes, use the official Anthropic API instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Add Puter.js
&lt;/h2&gt;

&lt;p&gt;For a static page or quick prototype, use the CDN script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal HTML file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are building with Vite, Webpack, or another bundler, install the package instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @heyputer/puter.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then import it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@heyputer/puter.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the CDN for the fastest setup. Use the npm package when you want bundling, TypeScript support, or a production frontend build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Choose a Claude model
&lt;/h2&gt;

&lt;p&gt;Puter exposes Claude models using Anthropic-style model IDs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Latest flagship; deepest reasoning and complex agentic work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prior flagship; strong coding and reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6-fast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lower-latency Opus variant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stable choice for production agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Legacy stable option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Original Opus 4 baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default daily driver for most apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prior Sonnet version; still useful for general tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-sonnet-4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sonnet 4 baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-haiku-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast option for classification and high-volume simple tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Practical defaults, with a small routing sketch after this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; for most app features.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;claude-haiku-4-5&lt;/code&gt; for fast classification, tagging, routing, or lightweight summaries.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;claude-opus-4-7&lt;/code&gt; for complex code review, multi-step planning, and long-form reasoning.&lt;/li&gt;
&lt;/ul&gt;
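
&lt;p&gt;A minimal routing sketch based on those defaults; the task labels are illustrative, not a Puter API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Map illustrative task labels onto the defaults above.
function pickModel(task) {
  switch (task) {
    case "classify":
    case "route":
      return "claude-haiku-4-5";
    case "review":
    case "plan":
      return "claude-opus-4-7";
    default:
      return "claude-sonnet-4-6";
  }
}

const response = await puter.ai.chat("Tag this ticket: login page 500s", {
  model: pickModel("classify")
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;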

&lt;h2&gt;
  
  
  Step 3: Make your first Claude call
&lt;/h2&gt;

&lt;p&gt;Here is the smallest working example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
      &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the file in a browser. Puter handles the call and prompts the user to sign in or create a Puter account if needed.&lt;/p&gt;

&lt;p&gt;The response shape mirrors Anthropic’s message format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For simple text responses, read the first content block. For more complex responses, iterate over all blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Stream long responses
&lt;/h2&gt;

&lt;p&gt;For essays, code generation, and chat UIs, stream the response instead of waiting for the full answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write a detailed essay on the impact of artificial intelligence on society&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real chat UI, append each streamed chunk to the current message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#assistant-message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generate a checklist for securing an Express.js API&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example HTML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"assistant-message"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Build a multi-turn conversation
&lt;/h2&gt;

&lt;p&gt;For chat, pass an array of messages instead of a single string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I am building a Next.js app with Postgres.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Got it. What do you need help with?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;How should I structure the migrations folder?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To keep the conversation going, store the transcript and append each new turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userText&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;assistantText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;assistantText&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;assistantText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude reads the full message array on each call, so keep the transcript trimmed if your app has very long conversations.&lt;/p&gt;
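
&lt;p&gt;A minimal trimming sketch; keeping the system message plus the last few turns is one simple policy, and the cap of 20 entries is arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Keep the system prompt plus the most recent turns.
function trimTranscript(messages, maxTurns = 20) {
  const system = messages.filter(m =&amp;gt; m.role === "system");
  const recent = messages.filter(m =&amp;gt; m.role !== "system").slice(-maxTurns);
  return [...system, ...recent];
}

const response = await puter.ai.chat(trimTranscript(messages), {
  model: "claude-sonnet-4-6"
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;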

&lt;h2&gt;
  
  
  Step 6: Add a system prompt
&lt;/h2&gt;

&lt;p&gt;Use a system message to define behavior, tone, constraints, and output format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a senior backend engineer. Reply in numbered bullets, never more than five.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;How do I prevent SQL injection in a Node app?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good system prompts are specific:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
You are a TypeScript code reviewer.
Focus on correctness, security, and maintainability.
Return:
1. Critical issues
2. Suggested improvements
3. A corrected code snippet when useful
Keep the answer concise.
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pass it at the top of the message list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Review this function: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Compare models with the same prompt
&lt;/h2&gt;

&lt;p&gt;The fastest way to pick a model is to run the same prompt across multiple Claude variants.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Refactor this React component to use hooks: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;---&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will usually see this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Haiku&lt;/strong&gt;: fastest; best for simple and high-volume tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet&lt;/strong&gt;: best default for most app features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus&lt;/strong&gt;: strongest for difficult prompts, deeper reasoning, and complex code tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To benchmark Puter’s browser path against the official Anthropic API in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, keep both providers in the same collection and switch environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you get with Puter.js
&lt;/h2&gt;

&lt;p&gt;Puter.js gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude model access from the browser&lt;/li&gt;
&lt;li&gt;Multi-turn conversations&lt;/li&gt;
&lt;li&gt;System prompts&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;No developer-side API key&lt;/li&gt;
&lt;li&gt;No developer-side Anthropic billing&lt;/li&gt;
&lt;li&gt;Browser-first production deployment path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the current Puter version, you may not get every official Anthropic API feature, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native tool use / function calling&lt;/li&gt;
&lt;li&gt;Vision input&lt;/li&gt;
&lt;li&gt;Anthropic prompt caching controls&lt;/li&gt;
&lt;li&gt;Server-side execution without a browser user session&lt;/li&gt;
&lt;li&gt;Direct Anthropic rate-limit headers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For deeper tool workflows, the official Anthropic API or &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing in Apidog&lt;/a&gt; gives you more control.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use Puter vs the official Anthropic API
&lt;/h2&gt;

&lt;p&gt;Use Puter when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building a browser-based app.&lt;/li&gt;
&lt;li&gt;You do not want to manage an Anthropic API key.&lt;/li&gt;
&lt;li&gt;You are shipping a free public tool and want to avoid developer-side billing exposure.&lt;/li&gt;
&lt;li&gt;You are prototyping before committing to official API usage.&lt;/li&gt;
&lt;li&gt;Your users can sign in to Puter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the official Anthropic API when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need backend calls.&lt;/li&gt;
&lt;li&gt;You need cron jobs, workers, or batch processing.&lt;/li&gt;
&lt;li&gt;You need prompt caching controls.&lt;/li&gt;
&lt;li&gt;You need advanced tool use, vision input, or Files API support.&lt;/li&gt;
&lt;li&gt;You need compliance, contracts, or regional requirements.&lt;/li&gt;
&lt;li&gt;Your users will not accept a Puter sign-in flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common path is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prototype in the browser with Puter.&lt;/li&gt;
&lt;li&gt;Validate prompts and UX.&lt;/li&gt;
&lt;li&gt;Benchmark model behavior.&lt;/li&gt;
&lt;li&gt;Migrate to the official Anthropic API when you need backend control.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The message shape is similar, so the migration is manageable.&lt;/p&gt;
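
&lt;p&gt;For comparison, a server-side sketch of the official Anthropic call. The main shape difference to plan for: the official API takes the system prompt as a top-level &lt;code&gt;system&lt;/code&gt; field, not a message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Node.js sketch; assumes ANTHROPIC_API_KEY is set in the environment.
const res = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.ANTHROPIC_API_KEY,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json"
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Say hello" }]
  })
});
const data = await res.json();
console.log(data.content[0].text); // same content-block shape you read via Puter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;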

&lt;p&gt;For the GPT equivalent, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the integration in Apidog
&lt;/h2&gt;

&lt;p&gt;Puter calls run in the browser, so you usually do not test them like backend API requests. A practical workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a small static page that loads Puter.js.&lt;/li&gt;
&lt;li&gt;Accept the prompt through a query parameter or form input.&lt;/li&gt;
&lt;li&gt;Use that page for browser-based Puter testing.&lt;/li&gt;
&lt;li&gt;Use Apidog to test the official Anthropic API surface.&lt;/li&gt;
&lt;li&gt;Keep both paths documented in the same project so migration is easier later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example static test page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;pre&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"output"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/pre&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prompt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Say hello from Claude.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#output&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;

      &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run it locally and test prompts like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:5173/?prompt=Explain%20JWT%20authentication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12lw6kx5lwlbpc2qapvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12lw6kx5lwlbpc2qapvu.png" alt="Apidog testing setup" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; and create two environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;puter-prototype&lt;/code&gt;: your local static page that uses Puter.js&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anthropic-prod&lt;/code&gt;: &lt;code&gt;https://api.anthropic.com/v1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets you keep prompt tests, request examples, and migration notes in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is this truly unlimited?
&lt;/h3&gt;

&lt;p&gt;Unlimited from the developer side, yes. The end user has whatever balance is available in their Puter account. New Puter accounts include starter credit, and users can top up if they need more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need an Anthropic account?
&lt;/h3&gt;

&lt;p&gt;No. Puter handles the Anthropic relationship. Your app does not need an Anthropic API key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use this in production?
&lt;/h3&gt;

&lt;p&gt;Yes, for browser-based apps. The key product decision is whether your users are willing to sign in to Puter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Claude through Puter behave the same as the official API?
&lt;/h3&gt;

&lt;p&gt;The model output is expected to be the same because Puter calls Anthropic on the user’s behalf. Latency may be slightly higher because Puter adds an extra hop between your app and Anthropic.&lt;/p&gt;
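
&lt;p&gt;If latency matters for your UX, measure it instead of guessing. A small sketch using the browser’s &lt;code&gt;performance.now()&lt;/code&gt; inside an async context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough round-trip timing for one Puter call (run inside a module script or async function).
const t0 = performance.now();
await puter.ai.chat("Reply with the single word: pong", {
  model: "claude-sonnet-4-6",
});
console.log(`Round trip: ${Math.round(performance.now() - t0)} ms`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
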

&lt;h3&gt;
  
  
  What about prompt caching?
&lt;/h3&gt;

&lt;p&gt;Puter does not expose Anthropic’s prompt caching pricing controls today. If you rely on prompt caching for large stable prompts, use the official Anthropic API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Puter for a Discord bot or backend service?
&lt;/h3&gt;

&lt;p&gt;Not cleanly. Puter is browser-first and assumes a logged-in user session. For backend services, use the official Anthropic API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which model should I default to?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; by default. Move to &lt;code&gt;claude-opus-4-7&lt;/code&gt; for harder reasoning tasks and &lt;code&gt;claude-haiku-4-5&lt;/code&gt; for fast, high-volume classification.&lt;/p&gt;
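
&lt;p&gt;One way to encode that default in an app is a small routing helper keyed on task profile. The model IDs are the ones above; the task labels are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative routing helper: pick a Claude model per task profile.
function pickModel(task) {
  switch (task) {
    case "deep-reasoning":
      return "claude-opus-4-7"; // harder reasoning tasks
    case "bulk-classify":
      return "claude-haiku-4-5"; // fast, high-volume classification
    default:
      return "claude-sonnet-4-6"; // sensible default
  }
}

// Usage (inside an async context):
const answer = await puter.ai.chat("Classify: refund request", {
  model: pickModel("bulk-classify"),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
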

&lt;h3&gt;
  
  
  Will users be charged a lot?
&lt;/h3&gt;

&lt;p&gt;Most chat-style usage costs cents per session at Anthropic-style rates. Casual users can run many conversations on starter credit before they need to top up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Puter.js is a practical way to add Claude to a browser app without managing Anthropic keys, billing, or backend infrastructure. Add the script, choose a model, call &lt;code&gt;puter.ai.chat()&lt;/code&gt;, and let the signed-in user cover their own usage.&lt;/p&gt;

&lt;p&gt;Use Puter for prototypes, hackathon projects, static sites, browser extensions, and free public apps. Use the official Anthropic API when you need backend execution, prompt caching, compliance controls, or advanced API features.&lt;/p&gt;

&lt;p&gt;Build and benchmark your requests in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, compare Puter with the official API, and choose the path that matches your deployment model.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Use Grok 4.3 for Free: 4 Working Paths in 2026</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Sat, 09 May 2026 02:24:07 +0000</pubDate>
      <link>https://forem.com/hassann/how-to-use-grok-43-for-free-4-working-paths-in-2026-3g79</link>
      <guid>https://forem.com/hassann/how-to-use-grok-43-for-free-4-working-paths-in-2026-3g79</guid>
      <description>&lt;p&gt;Grok 4.3 is xAI’s flagship model as of May 2026. It supports a 1M-token context window, native video input, and pricing of $1.25 / $2.50 per million tokens. If you are prototyping, learning, or building a side project, you can use Grok 4.3 without paying upfront through three practical routes: xAI Console promotional credits, Puter.js user-paid calls, and the free chat surfaces on grok.com and X.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide walks through each option with setup steps, code, and trade-offs. For the full paid API guide, see &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Grok 4.3 API&lt;/a&gt;. For the voice equivalent, see &lt;a href="http://apidog.com/blog/how-to-use-grok-voice-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use Grok Voice for free&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three free paths to Grok 4.3:&lt;/strong&gt; xAI Console promotional credits, Puter.js, and the chat UIs at &lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt; and X.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for developers shipping a public web app:&lt;/strong&gt; Puter.js. Your users cover their own usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for backend/API prototyping:&lt;/strong&gt; xAI Console credits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for no-code use:&lt;/strong&gt; &lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt; or the X app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model IDs:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;xAI direct API: &lt;code&gt;grok-4.3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Puter.js: &lt;code&gt;x-ai/grok-4.3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to test equivalent requests across providers and compare responses, latency, and token usage.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Path 1: Use xAI Console promotional credits
&lt;/h2&gt;

&lt;p&gt;Use this path when you want to test the real production API surface without paying upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create an xAI Console account
&lt;/h3&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;console.x.ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sign in with your X account. Account verification follows whatever X requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Check your promotional credits
&lt;/h3&gt;

&lt;p&gt;After signup, open the &lt;strong&gt;Billing&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;xAI has run promotional windows that give new accounts free credits. These credits are usually enough for several days of integration testing, but the amount and eligibility window can change.&lt;/p&gt;

&lt;p&gt;The key point: these credits are finite and do not auto-renew. Use them to validate your integration, then either move to paid usage or switch to another path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Call Grok 4.3 from the API
&lt;/h3&gt;

&lt;p&gt;The xAI endpoint follows an OpenAI-compatible Chat Completions shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"xai-..."&lt;/span&gt;

curl https://api.x.ai/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$XAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "grok-4.3",
    "messages": [
      {
        "role": "user",
        "content": "Explain prompt caching in three sentences."
      }
    ],
    "reasoning_effort": "low"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For early testing, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"reasoning_effort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;high&lt;/code&gt; only when you need stronger reasoning, because they consume credits faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros and cons
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real production API behavior&lt;/td&gt;
&lt;td&gt;Credit pool is finite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supports Grok 4.3 capabilities such as 1M context, video, and function calling&lt;/td&gt;
&lt;td&gt;Promotional terms can change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No migration cost when moving to paid usage&lt;/td&gt;
&lt;td&gt;Limited to what fits inside the credit bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use this path if:&lt;/strong&gt; you need backend access, want to test the real API, or plan to move to paid xAI usage later.&lt;/p&gt;

&lt;p&gt;For the full request schema, see &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Grok 4.3 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2: Use Puter.js
&lt;/h2&gt;

&lt;p&gt;Puter.js is the cleanest free path for developers building public browser-based apps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48utw9k2o2vpe8zkl3kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48utw9k2o2vpe8zkl3kx.png" alt="Puter.js Grok example" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Puter.js works
&lt;/h3&gt;

&lt;p&gt;Puter.js exposes a JavaScript client for calling LLMs such as Grok, GPT, Claude, Gemini, and DeepSeek.&lt;/p&gt;

&lt;p&gt;The important billing difference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The end user pays from their Puter account, not the developer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You add the script and call the model from the browser. When users run the app, Puter handles authentication and charges the user for the cloud and AI usage their session consumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Add Puter.js to your page
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API key is required in your app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Call Grok 4.3
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;puter.ai.chat()&lt;/code&gt; with the Puter model ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://js.puter.com/v2/"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
  &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Summarize the trade-offs between SQLite and Postgres in three bullets.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-ai/grok-4.3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time a user runs this, Puter prompts them to sign in or create a Puter account. After that, requests use the user’s Puter balance.&lt;/p&gt;
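
&lt;p&gt;If you would rather surface the sign-in step yourself instead of letting the first chat call trigger it, Puter ships auth helpers. A sketch assuming &lt;code&gt;puter.auth.isSignedIn()&lt;/code&gt; and &lt;code&gt;puter.auth.signIn()&lt;/code&gt;; verify the names against your Puter.js version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Optional: run the Puter sign-in flow explicitly before the first model call.
// Assumes puter.auth.isSignedIn() / puter.auth.signIn(); check current Puter.js docs.
if (!puter.auth.isSignedIn()) {
  await puter.auth.signIn(); // opens Puter's sign-in popup
}

const res = await puter.ai.chat("Hello!", { model: "x-ai/grok-4.3" });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
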

&lt;h3&gt;
  
  
  Step 3: Stream responses
&lt;/h3&gt;

&lt;p&gt;Puter.js also supports streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Walk me through migrating a React app to Next.js.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-ai/grok-4.3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;reasoning_effort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pros and cons
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Developer pays $0&lt;/td&gt;
&lt;td&gt;User must sign in to Puter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No API key in your repo&lt;/td&gt;
&lt;td&gt;Less suitable for backend-only systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supports multiple major LLM providers&lt;/td&gt;
&lt;td&gt;Requires a browser context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good fit for public tools and side projects&lt;/td&gt;
&lt;td&gt;May add slightly more latency than direct xAI calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use this path if:&lt;/strong&gt; you are building a public web app, demo, side project, or free tool where users can cover their own usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid this path if:&lt;/strong&gt; the user is not the person triggering the query, such as in internal automation, backend jobs, or bots.&lt;/p&gt;

&lt;p&gt;For similar free-access patterns, see &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the DeepSeek V4 API for free&lt;/a&gt; and &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API for free&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 3: Use grok.com or the X app
&lt;/h2&gt;

&lt;p&gt;Use this path when you only need to chat with Grok and do not need API access.&lt;/p&gt;

&lt;p&gt;Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt;:&lt;/strong&gt; web chat. Sign in with X.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X app:&lt;/strong&gt; Grok is available inside the X mobile and web apps under the Grok tab.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free users get a limited daily quota that resets every 24 hours.&lt;/p&gt;

&lt;p&gt;This path is useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-off research.&lt;/li&gt;
&lt;li&gt;Prompt exploration.&lt;/li&gt;
&lt;li&gt;Checking whether Grok’s output style fits your use case.&lt;/li&gt;
&lt;li&gt;Manual testing before implementing API calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot script or automate requests from these chat UIs.&lt;/p&gt;

&lt;p&gt;The free tier on &lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt; currently defaults to a smaller Grok variant. Premium subscriptions on X unlock Grok 4.3 in the chat UI with higher quotas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 4: Use OpenRouter for cheaper Grok-class testing
&lt;/h2&gt;

&lt;p&gt;OpenRouter is not a free Grok 4.3 path, but it is useful for testing Grok-class models behind one API key.&lt;/p&gt;

&lt;p&gt;Grok 4.3 on OpenRouter costs the same as direct xAI pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$1.25 / $2.50 per 1M tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, OpenRouter also carries free variants for some Grok models, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grok-4-fast:free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use this when you do not specifically need Grok 4.3 but want a free Grok-family model for experimentation.&lt;/p&gt;

&lt;p&gt;Example request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://openrouter.ai/api/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "x-ai/grok-4-fast:free",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use this path if:&lt;/strong&gt; you want free Grok-class output and do not require Grok 4.3 specifically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compare the options
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Cost to developer&lt;/th&gt;
&lt;th&gt;Cost to end user&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;xAI Console credits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0 within credit limit&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Backend prototyping and learning the production API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Puter.js&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;User pays usage&lt;/td&gt;
&lt;td&gt;Public web apps, side projects, free tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt; / X&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0 within quota&lt;/td&gt;
&lt;td&gt;Manual use and prompt testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter free model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Free Grok-class output, not Grok 4.3 specifically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Test provider requests in Apidog
&lt;/h2&gt;

&lt;p&gt;When you are comparing providers, keep the prompt and request body stable. Change only the base URL, auth key, and model name.&lt;/p&gt;

&lt;p&gt;A practical Apidog setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Apidog environment with these variables:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;XAI_API_KEY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BASE_URL&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create one Chat Completions request.&lt;/li&gt;
&lt;li&gt;Save provider-specific variants:

&lt;ul&gt;
&lt;li&gt;xAI direct&lt;/li&gt;
&lt;li&gt;OpenRouter&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Run both with the same prompt.&lt;/li&gt;
&lt;li&gt;Compare:

&lt;ul&gt;
&lt;li&gt;Response quality&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;When credits run out, switch environments instead of changing code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; and create a new collection.&lt;/p&gt;

&lt;p&gt;Use these base URLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://api.x.ai/v1
https://openrouter.ai/api/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both use an OpenAI-compatible Chat Completions schema, so the request body can stay mostly identical except for the &lt;code&gt;model&lt;/code&gt; value.&lt;/p&gt;
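
&lt;p&gt;In code, that discipline looks like one request function parameterized by provider. A sketch; the base URLs and model IDs come from this article, and the env var names are the ones suggested above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// One Chat Completions call, two providers: only baseUrl, apiKey, and model change.
async function chat(provider, prompt) {
  const cfg = {
    xai: {
      baseUrl: "https://api.x.ai/v1",
      apiKey: process.env.XAI_API_KEY,
      model: "grok-4.3",
    },
    openrouter: {
      baseUrl: "https://openrouter.ai/api/v1",
      apiKey: process.env.OPENROUTER_API_KEY,
      model: "x-ai/grok-4-fast:free",
    },
  }[provider];

  const res = await fetch(`${cfg.baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${cfg.apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: cfg.model,
      messages: [{ role: "user", content: prompt }],
    }),
  });

  const data = await res.json();
  return data.choices[0].message.content;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
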

&lt;p&gt;For more on cross-provider testing, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tool for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you give up by staying free
&lt;/h2&gt;

&lt;p&gt;Free paths are useful for prototyping, but they come with trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tighter rate limits
&lt;/h3&gt;

&lt;p&gt;Promotional credits do not remove rate limits. If you test at scale, expect &lt;code&gt;429&lt;/code&gt; responses before your credit pool is exhausted.&lt;/p&gt;

&lt;p&gt;Add basic throttling during tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sleep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

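&lt;span class="c1"&gt;// callGrok() and prompts are placeholders for your request helper and prompt list&lt;/span&gt;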
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callGrok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
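
&lt;p&gt;If you do hit &lt;code&gt;429&lt;/code&gt;s, back off and retry instead of hammering the endpoint. A sketch that reuses the &lt;code&gt;sleep&lt;/code&gt; helper above; &lt;code&gt;callGrokOnce&lt;/code&gt; is a hypothetical raw-request helper that returns the HTTP response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Exponential backoff on HTTP 429. callGrokOnce() is a placeholder returning a fetch Response.
async function callGrokWithBackoff(prompt, maxRetries = 4) {
  for (let attempt = 0; attempt &amp;lt;= maxRetries; attempt++) {
    const res = await callGrokOnce(prompt);
    if (res.status !== 429) return res.json();
    await sleep(1000 * 2 ** attempt); // 1s, 2s, 4s, 8s, ...
  }
  throw new Error("Rate limited after retries");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
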



&lt;h3&gt;
  
  
  2. Less benefit from prompt caching
&lt;/h3&gt;

&lt;p&gt;Prompt caching is most valuable when you send large repeated context, such as a stable 50k+ token system prompt.&lt;/p&gt;

&lt;p&gt;For a small prototype with a few dozen calls, caching savings are less important.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Best-effort support
&lt;/h3&gt;

&lt;p&gt;Free usage paths usually do not include production support. If you are debugging production traffic, move to a paid tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to move to paid usage
&lt;/h2&gt;

&lt;p&gt;Move off the free path when you see one of these signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You hit rate limits often.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If rate limits block testing or usage more than a few times per week, paid usage is easier to operate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You have large reusable prompts.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Stable long prompts can benefit from prompt caching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need compliance or support.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Free tiers are not the right place for SOC 2 audit trails, BAAs, regional data residency, or production support requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Migration is usually small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For xAI Console, keep the same API surface and use paid billing.&lt;/li&gt;
&lt;li&gt;For OpenRouter, change the model or base URL.&lt;/li&gt;
&lt;li&gt;For Puter.js, keep the browser flow if user-paid usage still fits your product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Grok 4.3 truly free?
&lt;/h3&gt;

&lt;p&gt;It depends on the path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;xAI Console: free while promotional credits last.&lt;/li&gt;
&lt;li&gt;Puter.js: free for the developer because the user pays.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt;: free within a daily message quota.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Can I use Grok 4.3 from a backend without paying?
&lt;/h3&gt;

&lt;p&gt;Yes, while xAI Console credits last. After that, you need paid usage or a browser-based user-pays flow such as Puter.js.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Puter.js work in Node.js?
&lt;/h3&gt;

&lt;p&gt;Puter.js is browser-first. The user-pays model is built around browser authentication and user handoff. For backend usage, use the xAI Console path while credits last.&lt;/p&gt;

&lt;h3&gt;
  
  
  What model ID should I use on Puter.js?
&lt;/h3&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x-ai/grok-4.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What model ID should I use with xAI directly?
&lt;/h3&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grok-4.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Do free credits cover function calling and video input?
&lt;/h3&gt;

&lt;p&gt;Yes. Console credits apply to Grok 4.3 usage, including function calling, long context, video input, and reasoning effort. Watch token usage closely because video and long context can consume credits quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does this compare to Grok Voice?
&lt;/h3&gt;

&lt;p&gt;Grok Voice has its own free-access pattern. For that walkthrough, see &lt;a href="http://apidog.com/blog/how-to-use-grok-voice-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use Grok Voice for free&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is there a free Grok 4.3 mini?
&lt;/h3&gt;

&lt;p&gt;Not currently. xAI has not released a separate mini SKU for the 4.3 line. The closest free substitute mentioned here is &lt;code&gt;grok-4-fast:free&lt;/code&gt; on OpenRouter, which is a smaller, faster Grok 4 variant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Use the path that matches your implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;xAI Console credits&lt;/strong&gt; if you need the real backend API.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Puter.js&lt;/strong&gt; if you are shipping a browser-based public app and want users to cover usage.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;&lt;a href="http://grok.com" rel="noopener noreferrer"&gt;grok.com&lt;/a&gt;&lt;/strong&gt; or &lt;strong&gt;X&lt;/strong&gt; if you only need manual chat.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;OpenRouter free Grok variants&lt;/strong&gt; if you want free Grok-class output but do not need Grok 4.3 specifically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If none of the free paths fit, Grok 4.3’s paid pricing is still low enough for many side projects.&lt;/p&gt;

&lt;p&gt;For the full paid API walkthrough, see &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Grok 4.3 API&lt;/a&gt;. For the head-to-head against OpenAI, see &lt;a href="http://apidog.com/blog/grok-voice-vs-gpt-realtime-best-voice-model?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok Voice vs GPT-Realtime&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Build the request once in Apidog, swap the base URL between providers, and ship on the option that fits your usage curve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1j3kymdmi5tzgtqoo07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1j3kymdmi5tzgtqoo07.png" alt="Apidog API testing workflow" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Use the Grok 4.3 API?</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 07:40:02 +0000</pubDate>
      <link>https://forem.com/hassann/how-to-use-the-grok-43-api--3c5c</link>
      <guid>https://forem.com/hassann/how-to-use-the-grok-43-api--3c5c</guid>
      <description>&lt;p&gt;xAI rolled out Grok 4.3 in stages: beta on April 17, 2026, API access on April 30, and full general availability on May 6. The release adds a 1,000,000-token context window, native video input, always-on reasoning, and roughly 40% lower pricing versus Grok 4.20. Eight legacy Grok models retire on May 15, so teams still using &lt;code&gt;grok-3&lt;/code&gt; or &lt;code&gt;grok-4&lt;/code&gt; models should migrate now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide shows how to call Grok 4.3 from code: endpoint format, authentication, OpenAI-compatible SDK setup, &lt;code&gt;reasoning_effort&lt;/code&gt;, video input, function calling, and a repeatable test workflow in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the voice side of the same release, see &lt;a href="http://apidog.com/blog/how-to-use-grok-voice-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use Grok Voice for free&lt;/a&gt;. For the head-to-head against OpenAI’s flagship voice model, see &lt;a href="http://apidog.com/blog/grok-voice-vs-gpt-realtime-best-voice-model?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok Voice vs GPT-Realtime&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Grok 4.3 went GA on &lt;strong&gt;May 6, 2026&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Eight legacy models retire on &lt;strong&gt;May 15, 2026&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Pricing:

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$1.25 per 1M input tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;$2.50 per 1M output tokens&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;$0.20 per 1M cached input tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Context window: &lt;strong&gt;1,000,000 tokens&lt;/strong&gt;.&lt;/li&gt;

&lt;li&gt;New input type: native &lt;strong&gt;video input&lt;/strong&gt;.&lt;/li&gt;

&lt;li&gt;Reasoning is always on.&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;reasoning_effort&lt;/code&gt; supports &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;Default reasoning effort is &lt;code&gt;medium&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;Endpoint: &lt;code&gt;&lt;a href="https://api.x.ai/v1/chat/completions" rel="noopener noreferrer"&gt;https://api.x.ai/v1/chat/completions&lt;/a&gt;&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;The API is OpenAI-compatible for Chat Completions.&lt;/li&gt;

&lt;li&gt;Standard-tier throughput is around 159 tokens/second.&lt;/li&gt;

&lt;li&gt;Intelligence Index: 53, according to Artificial Analysis.&lt;/li&gt;

&lt;li&gt;Use &lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to save request variants, compare reasoning settings, and replay the same test across providers.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What changed in Grok 4.3
&lt;/h2&gt;

&lt;p&gt;For most developer teams, the important changes are practical:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Lower token cost&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Input pricing is down 37.5% versus Grok 4.20. Output pricing is down 58.3%. Cached input is now $0.20 per 1M tokens, which matters if you reuse long system prompts or large static context.&lt;/p&gt;
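
&lt;p&gt;A quick worked example: a stable 100k-token system prompt reused across 1,000 requests is 100M input tokens. At $1.25 per 1M tokens that costs $125; at the $0.20 cached rate it costs roughly $20, ignoring the first uncached pass.&lt;/p&gt;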

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;1M-token context window&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Grok 4.3 increases the context window from 256k to 1M tokens. That makes it usable for large prompts such as codebases, transcripts, long contracts, and multi-document workflows.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Native video input&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Grok 4.3 is the first Grok model with native video input. You can pass a video URL in the message content and ask the model to reason over the clip.&lt;/p&gt;
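
&lt;p&gt;The exact content-part schema for video is worth double-checking in xAI’s docs. The sketch below assumes an OpenAI-style content array with a hypothetical &lt;code&gt;video_url&lt;/code&gt; part type, modeled on image parts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical video-input shape: the "video_url" part type is an assumption
// modeled on OpenAI-style image_url parts. Confirm the real field names in xAI's docs.
const body = {
  model: "grok-4.3",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Summarize what happens in this clip." },
        { type: "video_url", video_url: { url: "https://example.com/demo.mp4" } },
      ],
    },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
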

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Always-on reasoning&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every request includes reasoning. The &lt;code&gt;reasoning_effort&lt;/code&gt; parameter controls depth, but the model does not run below &lt;code&gt;low&lt;/code&gt;.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Better agent workflows&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;xAI reports a +300 Elo gain on GDPval-AA versus Grok 4.20. In practice, this matters most for tool selection, multi-step workflows, and function-calling agents.&lt;/p&gt;

&lt;p&gt;Artificial Analysis gives Grok 4.3 an Intelligence Index of 53, above the average of 35 for its price tier, and ranks it tenth out of 146 tracked models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before sending your first request, prepare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;xAI Console account&lt;/strong&gt; at &lt;code&gt;console.x.ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A billable tier with an API key&lt;/li&gt;
&lt;li&gt;A project-scoped API key for production use&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;OpenAI SDK&lt;/strong&gt; or the xAI SDK&lt;/li&gt;
&lt;li&gt;An API client for saving and replaying requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi98pls7kwujze1wffker.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi98pls7kwujze1wffker.png" alt="xAI Console screenshot" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Export your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"xai-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are testing locally, use an environment file or shell variable. For production, store the key in your secret manager.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint and authentication
&lt;/h2&gt;

&lt;p&gt;Grok 4.3 uses the OpenAI-compatible Chat Completions API with xAI’s base URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST https://api.x.ai/v1/chat/completions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Required headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Authorization: Bearer $XAI_API_KEY
Content-Type: application/json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the API is OpenAI-compatible, most existing OpenAI SDK code only needs two changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change the API key.&lt;/li&gt;
&lt;li&gt;Change the &lt;code&gt;base_url&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Python example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.x.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the trade-offs of GraphQL vs REST in three bullets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the xAI SDK instead, the request shape is similar. The main difference is the client import and initialization.&lt;/p&gt;
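
&lt;p&gt;For reference, the same two changes in the official &lt;code&gt;openai&lt;/code&gt; Node SDK look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Node OpenAI SDK pointed at xAI: same Chat Completions surface, different baseURL.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1",
});

const response = await client.chat.completions.create({
  model: "grok-4.3",
  messages: [
    { role: "user", content: "Summarize the trade-offs of GraphQL vs REST in three bullets." },
  ],
  reasoning_effort: "medium",
});

console.log(response.choices[0].message.content);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
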

&lt;h2&gt;
  
  
  Request parameters
&lt;/h2&gt;

&lt;p&gt;Use these parameters for most Grok 4.3 Chat Completions requests:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Values&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grok-4.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Required.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;array&lt;/td&gt;
&lt;td&gt;OpenAI message shape&lt;/td&gt;
&lt;td&gt;Required. Supports &lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, and &lt;code&gt;assistant&lt;/code&gt; roles.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reasoning_effort&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Optional. Default: &lt;code&gt;medium&lt;/code&gt;. Higher values can increase latency and output tokens.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1–32768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Caps output length.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0–2.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default: &lt;code&gt;1.0&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;top_p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0–1.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nucleus sampling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;true&lt;/code&gt;, &lt;code&gt;false&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Enables server-sent events when &lt;code&gt;true&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tools&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;array&lt;/td&gt;
&lt;td&gt;OpenAI tool shape&lt;/td&gt;
&lt;td&gt;Used for function calling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tool_choice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string / object&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;auto&lt;/code&gt;, &lt;code&gt;none&lt;/code&gt;, or specific tool&lt;/td&gt;
&lt;td&gt;Uses standard OpenAI semantics.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;response_format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ "type": "json_object" }&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enables structured JSON output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;seed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;any integer&lt;/td&gt;
&lt;td&gt;Useful for reproducibility with &lt;code&gt;temperature: 0&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Minimal curl request
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.x.ai/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$XAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "grok-4.3",
    "messages": [
      {
        "role": "system",
        "content": "You are a senior backend engineer."
      },
      {
        "role": "user",
        "content": "Review this query plan and flag the bottleneck."
      }
    ],
    "reasoning_effort": "high"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response uses the standard OpenAI-style shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;657&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the final text from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing a reasoning effort
&lt;/h2&gt;

&lt;p&gt;Grok 4.3 supports three reasoning levels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use &lt;code&gt;low&lt;/code&gt; for fast, simple tasks
&lt;/h3&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;li&gt;Summarization&lt;/li&gt;
&lt;li&gt;Rule extraction&lt;/li&gt;
&lt;li&gt;Simple Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Lightweight routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify this ticket as billing, bug, feature request, or account access: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use &lt;code&gt;medium&lt;/code&gt; for default production traffic
&lt;/h3&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support&lt;/li&gt;
&lt;li&gt;Single-step tool use&lt;/li&gt;
&lt;li&gt;Data analysis&lt;/li&gt;
&lt;li&gt;Normal code explanations&lt;/li&gt;
&lt;li&gt;Function calling
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this API error log and suggest the most likely root cause.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use &lt;code&gt;high&lt;/code&gt; for complex workflows
&lt;/h3&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step agents&lt;/li&gt;
&lt;li&gt;Long code review&lt;/li&gt;
&lt;li&gt;Complex math&lt;/li&gt;
&lt;li&gt;Planning-heavy tasks&lt;/li&gt;
&lt;li&gt;Debugging with many constraints
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this migration plan, identify risks, and produce a safer rollout sequence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reasoning is always enabled. Setting &lt;code&gt;reasoning_effort&lt;/code&gt; to &lt;code&gt;low&lt;/code&gt; reduces depth, but it does not disable reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Function calling
&lt;/h2&gt;

&lt;p&gt;Grok 4.3 supports the standard OpenAI function-calling shape.&lt;/p&gt;

&lt;p&gt;The flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define tools.&lt;/li&gt;
&lt;li&gt;Send the user message and tool schema.&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;tool_calls&lt;/code&gt; from the assistant message.&lt;/li&gt;
&lt;li&gt;Execute the tool in your application.&lt;/li&gt;
&lt;li&gt;Send the tool result back with role &lt;code&gt;tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Ask the model to produce the final answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Define a tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up a user by ID.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ask Grok 4.3 to call the tool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find user u_42 and tell me their last login.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute and return the tool result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find user u_42 and tell me their last login.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Replace this with your real database/API call.
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-06T14:22:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;final_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The GDPval-AA gain is especially relevant here: Grok 4.3 should be better at choosing tools, avoiding redundant calls, and recovering from tool errors.&lt;/p&gt;
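
&lt;p&gt;One practical consequence: when a tool fails, report the failure back in the tool message instead of aborting, so the model can retry or explain. A minimal sketch continuing the example above (&lt;code&gt;execute_tool&lt;/code&gt; is a hypothetical dispatcher, and the error payload shape is illustrative, not an xAI-defined format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def run_tool_safely(tool_call):
    # Turn a tool failure into a structured payload the model can
    # reason about, instead of raising and losing the turn.
    # execute_tool is a hypothetical dispatcher for your own tools.
    try:
        return json.dumps(execute_tool(tool_call))
    except Exception as error:
        return json.dumps({
            "error": type(error).__name__,
            "detail": str(error),
            "retryable": True,
        })

# In the loop from the previous example:
#     messages.append({
#         "role": "tool",
#         "tool_call_id": tool_call.id,
#         "content": run_tool_safely(tool_call),
#     })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;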

&lt;p&gt;If you are testing tool workflows, &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing in Apidog&lt;/a&gt; covers a replay-based setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video input
&lt;/h2&gt;

&lt;p&gt;Grok 4.3 is the first Grok model with native video input. Pass a video URL inside the message content array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe what happens in this clip and flag any anomalies.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/clip.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Video tokens count against input usage. If cost or latency matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trim the clip before sending.&lt;/li&gt;
&lt;li&gt;Downsample when full resolution is unnecessary.&lt;/li&gt;
&lt;li&gt;Avoid sending repeated static footage.&lt;/li&gt;
&lt;li&gt;Cache surrounding text context when possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model reasons over frames natively, so you do not need to manually extract keyframes first.&lt;/p&gt;
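
&lt;p&gt;If you do need to trim or downsample before upload, a minimal preprocessing sketch using ffmpeg (assumes ffmpeg is installed; the offsets, resolution, and paths are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Cut a 30-second window and downscale to 720p before upload.
# -ss/-t select the window; scale=-2:720 keeps the aspect ratio.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-ss", "00:01:00",      # start offset
        "-t", "30",             # duration in seconds
        "-i", "input.mp4",
        "-vf", "scale=-2:720",  # downsample to 720p
        "clip.mp4",
    ],
    check=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;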

&lt;h2&gt;
  
  
  Using the 1M-token context window
&lt;/h2&gt;

&lt;p&gt;The 1M-token context window is useful when retrieval or chunking would remove important context.&lt;/p&gt;

&lt;p&gt;Common patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Whole-codebase review
&lt;/h3&gt;

&lt;p&gt;Send:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The diff&lt;/li&gt;
&lt;li&gt;Touched files&lt;/li&gt;
&lt;li&gt;Related interfaces&lt;/li&gt;
&lt;li&gt;Test output&lt;/li&gt;
&lt;li&gt;Lint output&lt;/li&gt;
&lt;li&gt;Migration notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review this change as a senior backend engineer.

Focus on:
1. Data loss risks
2. Transaction boundaries
3. Backward compatibility
4. Test gaps
5. Rollback strategy

Context:
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
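
&lt;p&gt;One way to assemble that context is to concatenate the artifacts into a single labeled block. A minimal sketch (the section labels and file paths are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

def build_review_context(diff_text, file_paths, test_output):
    # Concatenate the diff, touched files, and test output into one
    # labeled block to paste under "Context:" in the prompt above.
    parts = ["## Diff", diff_text, "## Touched files"]
    for path in file_paths:
        parts.append(f"### {path}")
        parts.append(Path(path).read_text())
    parts.append("## Test output")
    parts.append(test_output)
    return "\n\n".join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;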



&lt;h3&gt;
  
  
  Long-document QA
&lt;/h3&gt;

&lt;p&gt;Use it for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal contracts&lt;/li&gt;
&lt;li&gt;Earnings calls&lt;/li&gt;
&lt;li&gt;Compliance policies&lt;/li&gt;
&lt;li&gt;Technical specifications&lt;/li&gt;
&lt;li&gt;Incident timelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Answer only from the provided document.

Question:
Which clauses describe termination rights, and what notice period applies to each party?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Agent memory
&lt;/h3&gt;

&lt;p&gt;For agent workflows, you can keep long conversation history in context instead of summarizing aggressively. This is useful when prior details affect personalization or task continuity.&lt;/p&gt;

&lt;p&gt;Cached input pricing makes stable long context cheaper. For example, a 400k-token stable system prompt costs $0.08 per cached call at $0.20 per 1M cached tokens, instead of $0.50 at the fresh input rate.&lt;/p&gt;
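
&lt;p&gt;A quick way to sanity-check that math for your own prompt sizes, using the rates quoted above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-call input cost for a stable system prompt, cached vs. fresh.
CACHED_RATE = 0.20 / 1_000_000  # $ per cached input token
FRESH_RATE = 1.25 / 1_000_000   # $ per fresh input token

prompt_tokens = 400_000
print(f"cached: ${prompt_tokens * CACHED_RATE:.2f}")  # $0.08
print(f"fresh:  ${prompt_tokens * FRESH_RATE:.2f}")   # $0.50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;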

&lt;h2&gt;
  
  
  Migrating from legacy Grok models
&lt;/h2&gt;

&lt;p&gt;Eight legacy Grok models retire on &lt;strong&gt;May 15, 2026, 12:00 PM PT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For most apps, migration is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- model="grok-4.20"
&lt;/span&gt;&lt;span class="gi"&gt;+ model="grok-4.3"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- model="grok-3"
&lt;/span&gt;&lt;span class="gi"&gt;+ model="grok-4.3"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the request shape is compatible, most Chat Completions calls should continue working.&lt;/p&gt;

&lt;p&gt;Watch for two differences.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reasoning behavior
&lt;/h3&gt;

&lt;p&gt;Some legacy models did not accept &lt;code&gt;reasoning_effort&lt;/code&gt;. Grok 4.3 always reasons.&lt;/p&gt;

&lt;p&gt;If your previous workflow depended on a very fast non-reasoning path, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_effort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then measure latency and quality before moving to &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Output formatting
&lt;/h3&gt;

&lt;p&gt;Grok 4.3 tends to produce more structured output than Grok 4.20. If your application uses regex-based parsing, retest before switching production traffic.&lt;/p&gt;
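
&lt;p&gt;If you rely on regex parsing today, this is also a good moment to switch to structured output. A hedged sketch, assuming the OpenAI-style &lt;code&gt;response_format&lt;/code&gt; is honored as the compatibility claim suggests (the schema is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

response = client.chat.completions.create(
    model="grok-4.3",
    messages=[
        {"role": "user", "content": "Extract the invoice number and total from: ..."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total": {"type": "number"},
                },
                "required": ["invoice_number", "total"],
            },
        },
    },
)

data = json.loads(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;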

&lt;p&gt;For broader model pricing context, see &lt;a href="http://apidog.com/blog/gpt-5-5-pricing?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;GPT-5.5 pricing&lt;/a&gt;. For reasoning-model usage patterns, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Grok 4.3 in Apidog
&lt;/h2&gt;

&lt;p&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to create repeatable API tests before migrating production traffic.&lt;/p&gt;

&lt;p&gt;Recommended setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Apidog environment.&lt;/li&gt;
&lt;li&gt;Add these variables:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XAI_API_KEY = xai-...
BASE_URL = https://api.x.ai/v1
MODEL = grok-4.3
REASONING_EFFORT = medium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Create a &lt;code&gt;POST&lt;/code&gt; request:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{BASE_URL}}/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Add headers:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Authorization: Bearer {{XAI_API_KEY}}
Content-Type: application/json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;Add the request body:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{MODEL}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a senior backend engineer."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review this API design and identify the top three implementation risks."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_effort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{REASONING_EFFORT}}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="6"&gt;
&lt;li&gt;
&lt;p&gt;Duplicate the request three times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Grok 4.3 - low&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Grok 4.3 - medium&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Grok 4.3 - high&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change only &lt;code&gt;REASONING_EFFORT&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response quality&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;&lt;code&gt;usage.prompt_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;usage.completion_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;usage.reasoning_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Total cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To compare with another provider, duplicate the environment and change &lt;code&gt;BASE_URL&lt;/code&gt;, &lt;code&gt;MODEL&lt;/code&gt;, and the API key. Keep the same prompt and request body.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; to run the comparison. For broader API testing strategy, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tool for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgg8ipzrioxyp35igwy9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgg8ipzrioxyp35igwy9z.png" alt="Apidog API testing screenshot" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate limits
&lt;/h2&gt;

&lt;p&gt;xAI Console tier limits range from a few thousand requests per minute on Tier 1 to several hundred thousand requests per minute on enterprise tiers. Exact numbers can change, so check your console dashboard.&lt;/p&gt;

&lt;p&gt;The advertised 159 tokens/second throughput is per-stream output speed, not total account throughput. Concurrent requests scale within your tier limits.&lt;/p&gt;

&lt;p&gt;If you exceed your limit, the API returns HTTP &lt;code&gt;429&lt;/code&gt; with a &lt;code&gt;retry-after&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;Basic retry pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this incident report.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;reasoning_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wait_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request failed after retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, also add jitter and respect the &lt;code&gt;retry-after&lt;/code&gt; header when present.&lt;/p&gt;
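
&lt;p&gt;A sketch of that fuller pattern, assuming the SDK exposes the HTTP response on the error as the OpenAI client does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
import time

from openai import RateLimitError

def backoff_seconds(error, attempt):
    # Prefer the server's retry-after hint; otherwise fall back to
    # exponential backoff with jitter, capped at 30 seconds.
    retry_after = error.response.headers.get("retry-after")
    if retry_after is not None:
        return float(retry_after)
    return min(2 ** attempt, 30) * random.uniform(0.5, 1.5)

for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="grok-4.3",
            messages=[{"role": "user", "content": "Summarize this incident report."}],
        )
        break
    except RateLimitError as error:
        time.sleep(backoff_seconds(error, attempt))
else:
    raise RuntimeError("Request failed after retries")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;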

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Grok 4.3 OpenAI-compatible end to end?
&lt;/h3&gt;

&lt;p&gt;For Chat Completions, yes. You can use the OpenAI SDK, change &lt;code&gt;base_url&lt;/code&gt;, change &lt;code&gt;model&lt;/code&gt;, and keep the same request shape. Function calling, structured output, and streaming use the same semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Grok 4.3 support the Responses API?
&lt;/h3&gt;

&lt;p&gt;The xAI surface is Chat Completions today. The Responses API is OpenAI-only.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the actual context limit?
&lt;/h3&gt;

&lt;p&gt;The context limit is 1,000,000 tokens. Long inputs still cost money, so use cached input when your prompt is stable.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does always-on reasoning affect latency?
&lt;/h3&gt;

&lt;p&gt;First-token latency is higher than non-reasoning models, but Grok 4.3 streams output at around 159 tokens/second. Use &lt;code&gt;low&lt;/code&gt; for simple paths and reserve &lt;code&gt;high&lt;/code&gt; for planning-heavy work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Grok 4.3 with Grok Voice?
&lt;/h3&gt;

&lt;p&gt;Yes. The voice agent, &lt;code&gt;grok-voice-think-fast-1.0&lt;/code&gt;, calls Grok 4.3 under the hood when it reasons. You can also call Grok 4.3 directly from a custom voice loop built with TTS and STT components.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens to old Grok 3 or Grok 4 calls after May 15?
&lt;/h3&gt;

&lt;p&gt;They fail with HTTP &lt;code&gt;410&lt;/code&gt; because the model is retired. Migrate before the cutoff.&lt;/p&gt;
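
&lt;p&gt;If you want a clear, actionable failure rather than a generic stack trace after the cutoff, catch the status explicitly. A minimal sketch, assuming the SDK surfaces the status code as the OpenAI client does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import APIStatusError

try:
    response = client.chat.completions.create(
        model="grok-3",  # retired after the cutoff
        messages=[{"role": "user", "content": "ping"}],
    )
except APIStatusError as error:
    if error.status_code == 410:
        raise RuntimeError("Model retired; migrate to grok-4.3") from error
    raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;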

&lt;h3&gt;
  
  
  Does Grok 4.3 support image input?
&lt;/h3&gt;

&lt;p&gt;Yes. It supports image input alongside video input. Pass an image URL in a content block using the OpenAI-style message format.&lt;/p&gt;
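
&lt;p&gt;A minimal sketch of that content block (the image URL is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;response = client.chat.completions.create(
    model="grok-4.3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this dashboard show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/dashboard.png"},
                },
            ],
        }
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;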

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Grok 4.3 is a practical migration target if you need lower token costs, larger context, always-on reasoning, native video input, and OpenAI-compatible Chat Completions. For existing OpenAI SDK users, the migration is mostly a base URL and model-name change.&lt;/p&gt;

&lt;p&gt;The fastest validation path is to create three request variants in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, test &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, and &lt;code&gt;high&lt;/code&gt; reasoning on your real prompts, then compare latency, quality, and token usage before moving production traffic.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Grok Voice vs GPT-Realtime: Which Is the Best Voice Model in 2026?</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 07:33:41 +0000</pubDate>
      <link>https://forem.com/hassann/grok-voice-vs-gpt-realtime-which-is-the-best-voice-model-in-2026-2mc5</link>
      <guid>https://forem.com/hassann/grok-voice-vs-gpt-realtime-which-is-the-best-voice-model-in-2026-2mc5</guid>
      <description>&lt;p&gt;xAI shipped Grok Voice the same week OpenAI rolled out GPT-Realtime-2. If you are choosing a voice model in 2026, both are credible flagship options: speech-to-speech, reasoning-capable, WebSocket-based, tool-capable, and natural-sounding. The practical decision comes down to five implementation trade-offs: latency, price, voice catalog, reasoning depth, and whether you need SIP, image input, or voice cloning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide compares the models from a developer perspective: API surface, integration shape, cost model, and which model to pick for common voice-agent architectures.&lt;/p&gt;

&lt;p&gt;For standalone implementation guides, see &lt;a href="http://apidog.com/blog/gpt-realtime-2-api/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use GPT-Realtime-2&lt;/a&gt; and &lt;a href="http://apidog.com/blog/how-to-use-grok-voice-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use Grok Voice for free&lt;/a&gt;. To stress-test either model under load, &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; supports WebSocket sessions natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Grok Voice (&lt;code&gt;grok-voice-think-fast-1.0&lt;/code&gt;)&lt;/strong&gt; when latency, low cost, voice variety, multilingual TTS, or voice cloning are the main requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use GPT-Realtime-2&lt;/strong&gt; when you need deeper reasoning, image input, native SIP, MCP tool execution, or a more mature production voice-agent stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grok Voice&lt;/strong&gt; reports under 1 second time-to-first-audio and ships 80+ preset voices across 28 TTS languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-2&lt;/strong&gt; provides GPT-5-class reasoning, five reasoning levels, 128k context, image input, SIP, and native MCP support.&lt;/li&gt;
&lt;li&gt;Paid GPT-Realtime-2 voice usage is metered at &lt;strong&gt;$32 / 1M audio input tokens&lt;/strong&gt; and &lt;strong&gt;$64 / 1M audio output tokens&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Grok Voice has no per-minute audio charge on the xAI Console; you pay for Grok 4.3 reasoning at &lt;strong&gt;$1.25 / 1M input tokens&lt;/strong&gt; and &lt;strong&gt;$2.50 / 1M output tokens&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Build a small test harness first, measure latency and cost with your own audio, then choose.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Capability comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Grok Voice (&lt;code&gt;grok-voice-think-fast-1.0&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;GPT-Realtime-2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to first audio&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&amp;lt; 1 second&lt;/strong&gt;; xAI claims ~5x faster than nearest competitor&lt;/td&gt;
&lt;td&gt;Sub-second on &lt;code&gt;low&lt;/code&gt; reasoning; slower on &lt;code&gt;high&lt;/code&gt; / &lt;code&gt;xhigh&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning levels&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;low&lt;/code&gt; / &lt;code&gt;medium&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimal&lt;/code&gt; / &lt;code&gt;low&lt;/code&gt; / &lt;code&gt;medium&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt; / &lt;code&gt;xhigh&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Underlying intelligence&lt;/td&gt;
&lt;td&gt;Grok 4.3, Intelligence Index 53&lt;/td&gt;
&lt;td&gt;GPT-5-class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1,000,000 tokens via Grok 4.3&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preset voices&lt;/td&gt;
&lt;td&gt;80+; five named voice-agent personas: Eve, Ara, Rex, Sal, Leo&lt;/td&gt;
&lt;td&gt;10; Cedar, Marin, plus eight retuned legacy voices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages, TTS&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;No official count published&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages, STT&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;Inherited from GPT-Realtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice cloning&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt;; Custom Voices, ~1-minute sample, &amp;lt;2-minute training&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input&lt;/td&gt;
&lt;td&gt;No; text + audio only&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt;; photo and screenshot input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote MCP servers&lt;/td&gt;
&lt;td&gt;Tool use supported; native MCP not advertised&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt;; MCP tools executed by API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native SIP / phone calling&lt;/td&gt;
&lt;td&gt;Bring your own SIP provider&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt;; &lt;code&gt;?call_id={call_id}&lt;/code&gt; endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio formats&lt;/td&gt;
&lt;td&gt;PCM16, MP3, μ-law&lt;/td&gt;
&lt;td&gt;PCM16, G.711 μ-law, A-law&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing model&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free on console&lt;/strong&gt; for voice; pay Grok 4.3 reasoning only&lt;/td&gt;
&lt;td&gt;$32 / 1M audio input tokens, $64 / 1M audio output tokens, $4 / $24 per 1M text tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance&lt;/td&gt;
&lt;td&gt;SOC 2 Type II, HIPAA-eligible with BAA, GDPR&lt;/td&gt;
&lt;td&gt;SOC 2, GDPR through OpenAI Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Latency: Grok Voice is the default for real-time UX
&lt;/h2&gt;

&lt;p&gt;xAI claims &lt;code&gt;grok-voice-think-fast-1.0&lt;/code&gt; is “nearly 5 times faster than the closest competitor.” Treat vendor multipliers carefully, but the practical direction is consistent: Grok Voice usually reaches time-to-first-audio comfortably under one second, while GPT-Realtime-2 often sits around the 800ms–1500ms range depending on reasoning level.&lt;/p&gt;

&lt;p&gt;For a voice agent, this matters more than most benchmark numbers. In a phone call or mobile assistant, 600ms can feel responsive; 1200ms can feel like the user is waiting on a bot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the user is speaking live and latency is the top UX metric,
start with Grok Voice.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use GPT-Realtime-2 when the extra latency buys you reasoning, image understanding, SIP, or MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing: compare the billing shape, not just the headline rate
&lt;/h2&gt;

&lt;p&gt;The two products price different parts of the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPT-Realtime-2 pricing shape
&lt;/h3&gt;

&lt;p&gt;GPT-Realtime-2 meters audio as tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio input: &lt;strong&gt;$32 / 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Audio output: &lt;strong&gt;$64 / 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Text input/output: &lt;strong&gt;$4 / $24 per 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One second of audio is roughly 50 tokens. A 5-minute conversation with balanced turn-taking can use around 30,000 audio tokens, or roughly &lt;strong&gt;$1.50 in audio I/O&lt;/strong&gt;. Cached input can reduce stable prompt costs significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok Voice pricing shape
&lt;/h3&gt;

&lt;p&gt;Grok Voice has no per-minute or per-token charge on the xAI Console for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTS&lt;/li&gt;
&lt;li&gt;STT&lt;/li&gt;
&lt;li&gt;Voice agent usage&lt;/li&gt;
&lt;li&gt;Custom Voices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You pay for Grok 4.3 reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: &lt;strong&gt;$1.25 / 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;strong&gt;$2.50 / 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because reasoning tokens are usually far fewer than audio tokens for the same call, a similar 5-minute interaction can come in under &lt;strong&gt;$0.10&lt;/strong&gt;, depending on usage.&lt;/p&gt;
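
&lt;p&gt;A rough back-of-the-envelope comparison using the rates above (the Grok reasoning token counts are illustrative assumptions, not measurements):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# 5-minute call, ~30k audio tokens split evenly between input and output.
gpt_audio = 15_000 * (32 / 1e6) + 15_000 * (64 / 1e6)

# Grok Voice: audio is unmetered on the console; pay only Grok 4.3 reasoning.
grok_reasoning = 20_000 * (1.25 / 1e6) + 10_000 * (2.50 / 1e6)

print(f"GPT-Realtime-2 audio I/O: ${gpt_audio:.2f}")       # $1.44
print(f"Grok Voice reasoning:     ${grok_reasoning:.2f}")  # $0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;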

&lt;p&gt;&lt;strong&gt;Implementation rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you expect thousands of voice minutes per day,
benchmark Grok Voice first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-stakes but lower-volume flows, such as regulated support or sales calls, the price gap may matter less than reasoning quality and integrations.&lt;/p&gt;

&lt;p&gt;For more pricing context, see &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Grok 4.3 API&lt;/a&gt; and &lt;a href="http://apidog.com/blog/gpt-5-5-pricing?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;GPT-5.5 pricing&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning depth: GPT-Realtime-2 is stronger for complex agents
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 is described by OpenAI as a GPT-5-class speech-to-speech model. It exposes five reasoning levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minimal
low
medium
high
xhigh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a useful production control: reduce latency for simple turns, increase reasoning for complex turns.&lt;/p&gt;

&lt;p&gt;Example routing logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;selectReasoningLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresToolChain&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hasAmbiguousIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;high&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresLongAnswer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grok Voice runs Grok 4.3 underneath. Grok 4.3 is strong, especially on agentic tasks, but based on the published benchmark framing, GPT-Realtime-2 is the safer choice for complex mid-conversation reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use GPT-Realtime-2 when the agent must:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disambiguate unclear user intent&lt;/li&gt;
&lt;li&gt;Select between many tools&lt;/li&gt;
&lt;li&gt;Reason over long state&lt;/li&gt;
&lt;li&gt;Recover from interruptions&lt;/li&gt;
&lt;li&gt;Handle multi-step workflows&lt;/li&gt;
&lt;li&gt;Explain decisions out loud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Grok Voice when the workflow is mostly scripted:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FAQ support&lt;/li&gt;
&lt;li&gt;Order status&lt;/li&gt;
&lt;li&gt;Appointment booking&lt;/li&gt;
&lt;li&gt;Simple sales qualification&lt;/li&gt;
&lt;li&gt;Consumer chat companions&lt;/li&gt;
&lt;li&gt;Low-latency mobile voice UX&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Voice catalog: Grok has more voices; OpenAI has tighter consistency
&lt;/h2&gt;

&lt;p&gt;Grok ships 80+ preset voices across 28 TTS languages. The voice-agent layer exposes five curated personas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eve&lt;/li&gt;
&lt;li&gt;Ara&lt;/li&gt;
&lt;li&gt;Rex&lt;/li&gt;
&lt;li&gt;Sal&lt;/li&gt;
&lt;li&gt;Leo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader TTS surface gives you more variety, especially if you need a particular tone, accent, or brand fit.&lt;/p&gt;

&lt;p&gt;GPT-Realtime-2 ships 10 voices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cedar&lt;/li&gt;
&lt;li&gt;Marin&lt;/li&gt;
&lt;li&gt;alloy&lt;/li&gt;
&lt;li&gt;ash&lt;/li&gt;
&lt;li&gt;ballad&lt;/li&gt;
&lt;li&gt;coral&lt;/li&gt;
&lt;li&gt;echo&lt;/li&gt;
&lt;li&gt;sage&lt;/li&gt;
&lt;li&gt;shimmer&lt;/li&gt;
&lt;li&gt;verse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The OpenAI catalog is smaller, but voice behavior is more consistent across the available options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Need a specific voice or custom brand voice? Use Grok.
Need one reliable production voice? GPT-Realtime-2 is enough.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Voice cloning: only Grok Voice supports it
&lt;/h2&gt;

&lt;p&gt;Grok’s Custom Voices can clone a voice from about one minute of clean speech and return a &lt;code&gt;voice_id&lt;/code&gt; in under two minutes. The same &lt;code&gt;voice_id&lt;/code&gt; can be used across TTS and the voice-agent surface.&lt;/p&gt;

&lt;p&gt;OpenAI does not currently expose voice cloning through the Realtime API.&lt;/p&gt;

&lt;p&gt;If your product requires a cloned brand voice, character voice, or consented custom voice, this category is not close: choose Grok Voice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image input: only GPT-Realtime-2 supports it
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 accepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text&lt;/li&gt;
&lt;li&gt;Audio&lt;/li&gt;
&lt;li&gt;Images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means a user can send a screenshot or photo, then continue speaking with the agent about what is visible.&lt;/p&gt;

&lt;p&gt;This matters for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Field support&lt;/li&gt;
&lt;li&gt;Accessibility narration&lt;/li&gt;
&lt;li&gt;QA workflows&lt;/li&gt;
&lt;li&gt;Visual troubleshooting&lt;/li&gt;
&lt;li&gt;Voice-driven app support&lt;/li&gt;
&lt;li&gt;“Look at my screen and help me” workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grok Voice does not currently match this. If the agent needs to see what the user sees, use GPT-Realtime-2.&lt;/p&gt;
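
&lt;p&gt;A minimal sketch of attaching an image mid-session, assuming &lt;code&gt;ws&lt;/code&gt; is an open Realtime socket and that &lt;code&gt;conversation.item.create&lt;/code&gt; accepts an &lt;code&gt;input_image&lt;/code&gt; content part; verify the exact shape against the current API reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: attach a screenshot to the live session, then prompt.
// The input_image content shape is an assumption; verify against the docs.
import { readFileSync } from "node:fs";

const imageBase64 = readFileSync("screenshot.png").toString("base64");

ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      { type: "input_image", image_url: `data:image/png;base64,${imageBase64}` },
      { type: "input_text", text: "What error is shown on this screen?" },
    ],
  },
}));

ws.send(JSON.stringify({ type: "response.create" }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;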

&lt;p&gt;For more on OpenAI’s image stack, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-image-2-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-Image-2 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  SIP and phone integration: GPT-Realtime-2 is simpler
&lt;/h2&gt;

&lt;p&gt;OpenAI’s Realtime API has native SIP support. A SIP trunk can connect directly to OpenAI’s gateway, and inbound calls open a WebSocket session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?call_id={call_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That removes a bridge layer from your architecture.&lt;/p&gt;
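
&lt;p&gt;One way to wire that up, assuming your webhook handler has already received a &lt;code&gt;call_id&lt;/code&gt; for the inbound call; the accept handshake and event names may differ, so treat this as a sketch against OpenAI's SIP docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import WebSocket from "ws";

// Hypothetical: callId arrives from OpenAI's inbound-call webhook.
function attachAgentToCall(callId) {
  const ws = new WebSocket(
    `wss://api.openai.com/v1/realtime?call_id=${callId}`,
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );

  ws.on("open", function () {
    ws.send(JSON.stringify({
      type: "session.update",
      session: { instructions: "You are a concise phone support agent." },
    }));
  });

  return ws;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;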

&lt;p&gt;Grok Voice supports μ-law output for telephony, but you need to bring your own SIP provider, such as Twilio, Telnyx, or Plivo, and run the bridge yourself.&lt;/p&gt;

&lt;p&gt;A typical Grok telephony architecture looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Caller
  -&amp;gt; SIP provider
  -&amp;gt; Your media bridge
  -&amp;gt; Grok Voice WebSocket
  -&amp;gt; Your media bridge
  -&amp;gt; SIP provider
  -&amp;gt; Caller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical GPT-Realtime-2 SIP architecture can be simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Caller
  -&amp;gt; SIP trunk
  -&amp;gt; OpenAI Realtime SIP endpoint
  -&amp;gt; GPT-Realtime-2 session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you are building call-center infrastructure and want fewer moving parts,
start with GPT-Realtime-2.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP and tool use
&lt;/h2&gt;

&lt;p&gt;Both models support tool/function calling, but the integration level differs.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPT-Realtime-2
&lt;/h3&gt;

&lt;p&gt;GPT-Realtime-2 supports remote MCP servers natively. You configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP server URL&lt;/li&gt;
&lt;li&gt;Allowed tools&lt;/li&gt;
&lt;li&gt;Tool execution policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the Realtime API can execute MCP tools directly.&lt;/p&gt;

&lt;p&gt;That matters when your voice agent has a large tool catalog and you do not want every tool call to round-trip through your own function-call event loop.&lt;/p&gt;
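
&lt;p&gt;A sketch of that configuration, assuming the session accepts an MCP tool entry with &lt;code&gt;server_url&lt;/code&gt; and &lt;code&gt;allowed_tools&lt;/code&gt; fields; treat the exact field names as assumptions to check against the Realtime reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical MCP tool entry on a Realtime session; field names are assumptions.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "mcp",
        server_label: "crm",
        server_url: "https://mcp.example.com", // your MCP server
        allowed_tools: ["lookup_customer", "create_ticket"],
        require_approval: "never", // tool execution policy
      },
    ],
  },
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;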

&lt;h3&gt;
  
  
  Grok Voice
&lt;/h3&gt;

&lt;p&gt;Grok Voice supports function calling and includes a built-in &lt;code&gt;web_search&lt;/code&gt; tool. Native MCP is not advertised as a first-class primitive yet.&lt;/p&gt;

&lt;p&gt;For small tool sets, plain function declarations work fine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lookup_order&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Look up an order by ID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;create_support_ticket&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Create a support ticket&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customer_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;issue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 or fewer tools: either model is fine.
50+ tools or MCP-first architecture: GPT-Realtime-2 is cleaner.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are testing MCP servers separately, see &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing in Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model selection by use case
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Recommended model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Consumer voice app, high volume, latency-critical&lt;/td&gt;
&lt;td&gt;Grok Voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice cloning required&lt;/td&gt;
&lt;td&gt;Grok Voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom brand voice&lt;/td&gt;
&lt;td&gt;Grok Voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Character voices&lt;/td&gt;
&lt;td&gt;Grok Voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multilingual TTS at scale, especially &amp;gt;10 languages&lt;/td&gt;
&lt;td&gt;Grok Voice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lowest-cost production voice agent&lt;/td&gt;
&lt;td&gt;Grok Voice on console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice agent that needs screenshots or photos&lt;/td&gt;
&lt;td&gt;GPT-Realtime-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Call-center deployment with SIP&lt;/td&gt;
&lt;td&gt;GPT-Realtime-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step reasoning agent&lt;/td&gt;
&lt;td&gt;GPT-Realtime-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent with 50+ tools&lt;/td&gt;
&lt;td&gt;GPT-Realtime-2 with MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark-heavy reasoning&lt;/td&gt;
&lt;td&gt;GPT-Realtime-2 with &lt;code&gt;xhigh&lt;/code&gt; reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context text reasoning&lt;/td&gt;
&lt;td&gt;Depends: GPT-Realtime-2 has 128k context; Grok 4.3 has 1M context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to test both before committing
&lt;/h2&gt;

&lt;p&gt;Do not choose from a spec sheet alone. Build a small benchmark harness and measure both models with your own prompts, tools, audio, and target languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a fixture conversation
&lt;/h3&gt;

&lt;p&gt;Use a 10-turn script that represents your real product.&lt;/p&gt;

&lt;p&gt;Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One simple answer&lt;/li&gt;
&lt;li&gt;One interruption&lt;/li&gt;
&lt;li&gt;One tool call&lt;/li&gt;
&lt;li&gt;One disambiguation&lt;/li&gt;
&lt;li&gt;One long-form answer&lt;/li&gt;
&lt;li&gt;One edge case&lt;/li&gt;
&lt;li&gt;Real user audio, not only synthetic text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"audio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"case"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"initial_request"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"asks_clarifying_question"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"audio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"case"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clarification"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"calls_lookup_tool"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Configure both API keys
&lt;/h3&gt;

&lt;p&gt;Use environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Apidog, define both as environment variables so the same WebSocket test can run against either provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use one WebSocket test shape
&lt;/h3&gt;

&lt;p&gt;Test Grok Voice with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test GPT-Realtime-2 with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep your test script as similar as possible across both runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Measure the metrics that affect production
&lt;/h3&gt;

&lt;p&gt;Capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to first audio&lt;/li&gt;
&lt;li&gt;Total response latency&lt;/li&gt;
&lt;li&gt;Tool-call latency&lt;/li&gt;
&lt;li&gt;Number of failed or malformed tool calls&lt;/li&gt;
&lt;li&gt;Total input tokens&lt;/li&gt;
&lt;li&gt;Total output tokens&lt;/li&gt;
&lt;li&gt;Estimated cost per call&lt;/li&gt;
&lt;li&gt;User-perceived interruption handling&lt;/li&gt;
&lt;li&gt;Language-specific voice quality&lt;/li&gt;
&lt;/ul&gt;
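
&lt;p&gt;To keep time-to-first-audio comparable across runs, timestamp the gap between sending &lt;code&gt;response.create&lt;/code&gt; and receiving the first &lt;code&gt;response.audio.delta&lt;/code&gt;. A minimal sketch, assuming &lt;code&gt;ws&lt;/code&gt; is an open session for either provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Measure time-to-first-audio for one turn. Assumes ws is an open session
// and both providers emit response.audio.delta for audio chunks.
let turnStart = 0;
let firstAudioLogged = false;

// Call this when the fixture turn begins.
function startTurn() {
  turnStart = performance.now();
  firstAudioLogged = false;
  ws.send(JSON.stringify({ type: "response.create" }));
}

ws.on("message", function (raw) {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    if (!firstAudioLogged) {
      firstAudioLogged = true;
      const ms = Math.round(performance.now() - turnStart);
      console.log(`time_to_first_audio_ms=${ms}`);
    }
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;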

&lt;p&gt;A simple result table is enough:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Grok Voice&lt;/th&gt;
&lt;th&gt;GPT-Realtime-2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to first audio&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total response latency&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool-call success rate&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per 5-minute call&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subjective voice score&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  5. Pick based on your measured bottleneck
&lt;/h3&gt;

&lt;p&gt;Use this decision logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chooseVoiceModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresImageInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GPT-Realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresNativeSIP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GPT-Realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresMCPAtScale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GPT-Realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresVoiceCloning&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Grok Voice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;latencyIsPrimaryMetric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Grok Voice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costIsPrimaryMetric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Grok Voice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reasoningFailuresAreCostly&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GPT-Realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;realWorldBenchmarkWinner&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; to run the side-by-side tests. The collection format is portable, so you can keep the benchmark artifact in version control.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I use both models in the same app and route at runtime?
&lt;/h3&gt;

&lt;p&gt;Yes. Both use similar conversation shapes. You can route by intent, latency requirement, language, or workflow complexity.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeTurn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;includesImage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresComplexToolUse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresVoiceClone&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grok-voice-think-fast-1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isCasualOrHighVolume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grok-voice-think-fast-1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Which model has better non-English voice quality?
&lt;/h3&gt;

&lt;p&gt;Grok wins on language coverage: 80+ voices and 28 TTS languages. For languages both models support well, quality is close enough that you should test your exact language, accent, and domain vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is GPT-Realtime-2 worth the higher price?
&lt;/h3&gt;

&lt;p&gt;For simple FAQ-style support, usually no. For agents that need to read from a CRM, call multiple tools, resolve ambiguity, handle interruptions, and reason through edge cases, the reasoning and integration advantages can justify the cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does either model support cloning public figures?
&lt;/h3&gt;

&lt;p&gt;No. xAI restricts Custom Voices to consented samples, and OpenAI does not expose voice cloning through the Realtime API at all. Cloning a public figure without permission violates platform terms.&lt;/p&gt;

&lt;h3&gt;
  
  
  How hard is migration later?
&lt;/h3&gt;

&lt;p&gt;The event names and session configuration differ, but the conversation architecture is similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connect
  -&amp;gt; configure session
  -&amp;gt; stream user audio
  -&amp;gt; receive assistant audio/events
  -&amp;gt; handle tool calls
  -&amp;gt; close session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plan for a small port, mostly in the areas below (an adapter sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session update payloads&lt;/li&gt;
&lt;li&gt;Event names&lt;/li&gt;
&lt;li&gt;Tool-call handlers&lt;/li&gt;
&lt;li&gt;Audio format handling&lt;/li&gt;
&lt;li&gt;Provider-specific authentication&lt;/li&gt;
&lt;/ul&gt;
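
&lt;p&gt;One way to keep that port small from day one is to isolate provider differences behind a single adapter. The event mappings below are placeholders to fill in from each provider's reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical adapter: normalize provider events into one internal shape.
// The mapped names are placeholders; fill them from each provider's reference.
const EVENT_MAP = {
  grok: { audioDelta: "response.audio.delta", audioDone: "response.audio.done" },
  openai: { audioDelta: "response.audio.delta", audioDone: "response.audio.done" },
};

function normalizeEvent(provider, event) {
  const map = EVENT_MAP[provider];
  if (event.type === map.audioDelta) return { kind: "audio_chunk", delta: event.delta };
  if (event.type === map.audioDone) return { kind: "turn_done" };
  return { kind: "other", raw: event };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;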

&lt;p&gt;If you build and test with &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, the request collection ports cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;There is no universal winner between Grok Voice and GPT-Realtime-2. There is a correct choice per architecture.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Grok Voice&lt;/strong&gt; when your priorities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest latency&lt;/li&gt;
&lt;li&gt;Lower cost at scale&lt;/li&gt;
&lt;li&gt;Larger voice catalog&lt;/li&gt;
&lt;li&gt;Multilingual TTS&lt;/li&gt;
&lt;li&gt;Voice cloning&lt;/li&gt;
&lt;li&gt;Consumer voice UX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use &lt;strong&gt;GPT-Realtime-2&lt;/strong&gt; when your priorities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deeper reasoning&lt;/li&gt;
&lt;li&gt;Image input&lt;/li&gt;
&lt;li&gt;Native SIP&lt;/li&gt;
&lt;li&gt;MCP tools&lt;/li&gt;
&lt;li&gt;Complex agent workflows&lt;/li&gt;
&lt;li&gt;Production call-center integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else, build one benchmark harness in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, run both models for a week, and choose based on your own latency, cost, and task-success data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Use Grok Voice for Free: Console Setup, Voice Cloning, and Real-Time Voice Agents</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 07:29:18 +0000</pubDate>
      <link>https://forem.com/hassann/how-to-use-grok-voice-for-free-console-setup-voice-cloning-and-real-time-voice-agents-3fhn</link>
      <guid>https://forem.com/hassann/how-to-use-grok-voice-for-free-console-setup-voice-cloning-and-real-time-voice-agents-3fhn</guid>
      <description>&lt;p&gt;xAI shipped Grok Voice with the Grok 4.3 release. For developers, the key point is simple: Grok Voice is free on the xAI Console. There is no per-minute charge and no per-token charge for the voice agent model, text-to-speech, speech-to-text, or Custom Voices clone tool. The only billable resource is the underlying Grok 4.3 token usage when the agent reasons, and that usage has its own free console allowance for testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide shows how to run Grok Voice at zero voice-feature cost: create a console key, clone a voice, open a WebSocket session, stream audio, add tool calls, and test the flow with &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; before wiring it into a product.&lt;/p&gt;

&lt;p&gt;If you also want the broader &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok 4.3 API guide&lt;/a&gt;, or a head-to-head against OpenAI’s stack in &lt;a href="http://apidog.com/blog/grok-voice-vs-gpt-realtime-best-voice-model?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok Voice vs GPT-Realtime&lt;/a&gt;, those companion posts cover the rest of the surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Grok Voice is free for users on the &lt;strong&gt;xAI Console&lt;/strong&gt; (&lt;code&gt;console.x.ai&lt;/code&gt;): no per-minute or per-token charge for TTS, STT, voice agent, or Custom Voices.&lt;/li&gt;
&lt;li&gt;Flagship model: &lt;code&gt;grok-voice-think-fast-1.0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;xAI reports time-to-first-audio under &lt;strong&gt;1 second&lt;/strong&gt; and claims it is roughly &lt;strong&gt;5x faster&lt;/strong&gt; than the closest competitor.&lt;/li&gt;
&lt;li&gt;80+ preset voices across &lt;strong&gt;28 languages&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;5 built-in voice agent personas: Eve, Ara, Rex, Sal, Leo.&lt;/li&gt;
&lt;li&gt;Custom voice cloning works from about &lt;strong&gt;1 minute of speech&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Production-ready voice generation completes in &lt;strong&gt;under 2 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;WebSocket endpoint:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;REST endpoints are available for TTS, STT, and Custom Voices.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to script WebSocket sessions and replay them without rerecording audio.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Grok Voice gives you for free
&lt;/h2&gt;

&lt;p&gt;The xAI Console is the path to free access. Sign in at &lt;code&gt;console.x.ai&lt;/code&gt;, generate an API key, and you can call four voice surfaces with no charge tied to the voice features themselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjkeqwamuyap7ndjjrwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjkeqwamuyap7ndjjrwg.png" alt="Grok Voice in xAI Console" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You get access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice Agent&lt;/strong&gt;: real-time speech-to-speech with tool use, server-side voice activity detection, and turn-taking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech&lt;/strong&gt;: 80+ preset voices across 28 languages, with MP3 or μ-law output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text&lt;/strong&gt;: streaming and batch transcription across 25 input languages, with word-level timestamps and speaker diarization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Voices&lt;/strong&gt;: clone your voice from a short sample and use the resulting &lt;code&gt;voice_id&lt;/code&gt; across TTS and voice agent APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only meter that ticks is Grok 4.3 token usage when the agent reasons over a request. The console also gives you free credit for testing that surface, which is enough to validate end-to-end flows before billing starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Get a console key
&lt;/h2&gt;

&lt;p&gt;Go to &lt;code&gt;console.x.ai&lt;/code&gt; and sign in with your X account.&lt;/p&gt;

&lt;p&gt;From the &lt;strong&gt;API Keys&lt;/strong&gt; page:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new API key.&lt;/li&gt;
&lt;li&gt;Enable the &lt;code&gt;voice&lt;/code&gt; and &lt;code&gt;chat&lt;/code&gt; scopes.&lt;/li&gt;
&lt;li&gt;Export the key once.&lt;/li&gt;
&lt;li&gt;Store it in your local environment.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"xai-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For client-side apps, do &lt;strong&gt;not&lt;/strong&gt; ship the parent API key to the browser. Instead, mint an &lt;strong&gt;ephemeral token&lt;/strong&gt; from the console settings or via the &lt;code&gt;/v1/realtime/sessions&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;Ephemeral tokens carry the same scope but expire in minutes, so they are suitable for browser-based WebSocket sessions.&lt;/p&gt;
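
&lt;p&gt;A server-side sketch for minting one; the request body and response fields are assumptions to verify against the console docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical server-side mint; response field names are assumptions.
async function mintEphemeralToken() {
  const res = await fetch("https://api.x.ai/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.XAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "grok-voice-think-fast-1.0" }),
  });
  const session = await res.json();
  return session.client_secret; // assumed field; verify in the API reference
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;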

&lt;h2&gt;
  
  
  Step 2: Pick a voice
&lt;/h2&gt;

&lt;p&gt;You can start with a preset voice or create a custom clone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Use a preset voice
&lt;/h3&gt;

&lt;p&gt;The voice agent includes five named personas:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Voice&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Good fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;eve&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Female, energetic&lt;/td&gt;
&lt;td&gt;Upbeat support flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ara&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Female, warm&lt;/td&gt;
&lt;td&gt;General assistance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Male, confident&lt;/td&gt;
&lt;td&gt;Sales scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Neutral, smooth&lt;/td&gt;
&lt;td&gt;Narration and longer reads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;leo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Male, authoritative&lt;/td&gt;
&lt;td&gt;Compliance and formal flows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the broader TTS API, the preset library is larger: more than 80 voices across 28 languages. You select them with the &lt;code&gt;voice&lt;/code&gt; parameter on the TTS endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: Clone a custom voice
&lt;/h3&gt;

&lt;p&gt;Upload a WAV file with about one minute of clean speech from a single speaker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.x.ai/v1/custom-voices &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$XAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"name=narrator-jane"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"language=en"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"audio=@sample.wav"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API returns a &lt;code&gt;voice_id&lt;/code&gt; in under two minutes. You can reuse that ID across both the TTS endpoint and the voice agent.&lt;/p&gt;
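
&lt;p&gt;Once you have the ID, you can pass it anywhere a voice parameter is accepted. A sketch against the TTS endpoint covered in Step 5, with a placeholder &lt;code&gt;voice_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Reuse the cloned voice on the synchronous TTS endpoint (see Step 5).
// "voice_abc123" is a placeholder for the returned voice_id.
const res = await fetch("https://api.x.ai/v1/tts", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "grok-tts-1",
    voice: "voice_abc123",
    input: "This line is read in the cloned voice.",
    format: "mp3",
  }),
});
const audio = Buffer.from(await res.arrayBuffer());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;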

&lt;p&gt;Keep the reference clip clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a quiet room.&lt;/li&gt;
&lt;li&gt;Record one speaker only.&lt;/li&gt;
&lt;li&gt;Avoid music, effects, or background noise.&lt;/li&gt;
&lt;li&gt;Prefer a consistent single take.&lt;/li&gt;
&lt;li&gt;Do not assume longer is better; the maximum reference clip length is 120 seconds, but clean audio matters more than duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: Make Grok talk over WebSocket
&lt;/h2&gt;

&lt;p&gt;The voice agent runs over a single WebSocket session:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the WebSocket.&lt;/li&gt;
&lt;li&gt;Send a &lt;code&gt;session.update&lt;/code&gt; event.&lt;/li&gt;
&lt;li&gt;Stream user audio into the socket.&lt;/li&gt;
&lt;li&gt;Receive audio deltas back from the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal Node.js client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;WebSocket&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;open&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ara&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a friendly support agent. Keep replies under two sentences.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input_audio_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pcm16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;output_audio_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pcm16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;turn_detection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;server_vad&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;response.audio.delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;response.audio.done&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Turn complete&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;User audio is sent with &lt;code&gt;input_audio_buffer.append&lt;/code&gt; events as base64-encoded PCM16 frames.&lt;/p&gt;
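
&lt;p&gt;A sketch of streaming a local PCM16 file into the buffer in roughly 100 ms frames; the frame size is a judgment call, not a protocol requirement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// Stream raw PCM16 (24 kHz, mono) into the input buffer in ~100 ms frames.
// 24,000 samples/s * 2 bytes * 0.1 s = 4,800 bytes per frame.
const pcm = readFileSync("user-question.pcm");
const frameBytes = 4800;

for (let offset = 0; offset &amp;lt; pcm.length; offset += frameBytes) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm.subarray(offset, offset + frameBytes).toString("base64"),
  }));
}
// With server_vad turn detection, the server decides when the turn ends.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;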

&lt;p&gt;The server responds with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;response.audio.delta&lt;/code&gt;: streamed audio chunks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response.audio.done&lt;/code&gt;: end of the current response turn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PCM16 at 24 kHz is the safe default for browser and desktop apps. Use μ-law when bridging to phone systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Add tool use
&lt;/h2&gt;

&lt;p&gt;The voice agent supports function calling, so the model can call your APIs during a conversation.&lt;/p&gt;

&lt;p&gt;Declare tools in the session config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lookup_order&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Look up the status of a customer order by order number.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;order_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the model wants to call your function, it emits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response.function_call_arguments.done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your app should then complete the loop below (a minimal handler is sketched after this list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the function name and arguments.&lt;/li&gt;
&lt;li&gt;Run the function on your side.&lt;/li&gt;
&lt;li&gt;Send the result back with a &lt;code&gt;conversation.item.create&lt;/code&gt; event of type &lt;code&gt;function_call_output&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Let the model continue and narrate the result.&lt;/li&gt;
&lt;/ol&gt;
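
&lt;p&gt;A minimal handler for that loop, assuming the done event carries the call ID and an arguments JSON string, and using a hypothetical &lt;code&gt;lookupOrder&lt;/code&gt; helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;ws.on("message", function (raw) {
  const event = JSON.parse(raw.toString());

  if (event.type === "response.function_call_arguments.done") {
    // Field names below are assumptions; confirm them in your event logs.
    const args = JSON.parse(event.arguments);
    const result = lookupOrder(args.order_id); // lookupOrder is your own code

    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));

    // Ask the model to continue and narrate the result.
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;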

&lt;p&gt;A built-in &lt;code&gt;web_search&lt;/code&gt; tool is also available, which is useful when you need fresh data without building a retrieval layer yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Use TTS without the voice agent
&lt;/h2&gt;

&lt;p&gt;If you only need text-to-speech for audio prompts, voiceovers, podcast intros, or static app audio, skip the WebSocket and call the REST endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.x.ai/v1/tts &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$XAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "grok-tts-1",
    "voice": "ara",
    "input": "Welcome back to your account. Your last login was Tuesday at 3pm.",
    "format": "mp3"
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; greeting.mp3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported output formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mp3&lt;/code&gt;: high-fidelity output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mulaw&lt;/code&gt;: 8 kHz telephony output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The TTS endpoint is synchronous. You send text and receive audio bytes back; no streaming session is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Test the whole flow in Apidog
&lt;/h2&gt;

&lt;p&gt;WebSocket APIs are harder to debug from the terminal because the conversation is stateful. A repeatable test setup helps you isolate changes in voice, instructions, tool calls, and audio frames.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi62iow0zbhkz10jerr2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi62iow0zbhkz10jerr2c.png" alt="Testing Grok Voice WebSocket in Apidog" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A practical workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new WebSocket request in Apidog.&lt;/li&gt;
&lt;li&gt;Save the WebSocket URL:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Store your bearer token in an Apidog environment variable.&lt;/li&gt;
&lt;li&gt;Stage a script of JSON messages:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;session.update&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_audio_buffer.append&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;response.create&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Replay the script against one connection.&lt;/li&gt;
&lt;li&gt;Capture every server event into a tree.&lt;/li&gt;
&lt;li&gt;Diff two runs side by side when you change the voice or instructions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is useful for catching drift in turn-taking behavior before you ship.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt;, create a WebSocket request, and paste your &lt;code&gt;XAI_API_KEY&lt;/code&gt; under environment variables.&lt;/p&gt;

&lt;p&gt;The same collection can also hold your TTS and STT REST requests, so you can keep all Grok Voice surfaces in one project. For more on stateful API testing patterns, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tool for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free tier limits
&lt;/h2&gt;

&lt;p&gt;The console gives you full access without a per-minute or per-token charge for the voice features themselves. The main limits are operational:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits&lt;/strong&gt;: the console enforces request-per-minute caps on each endpoint to prevent abuse. They are suitable for development and demos, not production traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom voice quota&lt;/strong&gt;: a single account can hold a finite number of custom voice clones at once. Delete unused clones to free slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning tokens&lt;/strong&gt;: when the voice agent uses Grok 4.3 reasoning under the hood, it bills against your console credit. Free credit is enough for prototyping; production requires a paid plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you hit rate-limit errors, batch your requests or move to a paid tier. The API behavior stays the same; only the cap changes.&lt;/p&gt;
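
&lt;p&gt;For development traffic, a simple retry wrapper that backs off on HTTP 429 responses is usually enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Retry REST calls with exponential backoff when a 429 rate limit hits.
async function fetchWithBackoff(url, options, maxRetries = 4) {
  for (let attempt = 0; attempt &amp;lt;= maxRetries; attempt++) {
    const res = await fetch(url, options);
    if (res.status !== 429) return res;
    const waitMs = 500 * 2 ** attempt;
    await new Promise(function (resolve) { setTimeout(resolve, waitMs); });
  }
  throw new Error("Rate limited after retries");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;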

&lt;h2&gt;
  
  
  Compare voices before shipping
&lt;/h2&gt;

&lt;p&gt;Run the same script through every candidate voice before going live. Voices handle tone differently, and short tests catch poor pairings quickly.&lt;/p&gt;

&lt;p&gt;Use a small test set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A two-sentence greeting.&lt;/li&gt;
&lt;li&gt;A confirmation phrase: “Got it, that’s all set.”&lt;/li&gt;
&lt;li&gt;A longer sentence with a number, a date, and a comma.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also test the same prompt at different tones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calm&lt;/li&gt;
&lt;li&gt;Normal&lt;/li&gt;
&lt;li&gt;Urgent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grok’s preset voices handle tone shifts better than many TTS engines we have benchmarked, but you should still audit the actual output for your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is the API actually free, or is there a hidden cap?
&lt;/h3&gt;

&lt;p&gt;The voice features — TTS, STT, voice agent, and Custom Voices — carry no per-minute or per-token charge on the console.&lt;/p&gt;

&lt;p&gt;The reasoning model under the hood bills against console credit. The console allowance is enough for prototyping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need an X account?
&lt;/h3&gt;

&lt;p&gt;Yes. Console sign-in uses an X account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Grok Voice from a browser?
&lt;/h3&gt;

&lt;p&gt;Yes, but use an ephemeral token.&lt;/p&gt;

&lt;p&gt;Mint the token server-side via &lt;code&gt;/v1/realtime/sessions&lt;/code&gt;, hand the short-lived token to the browser, and connect to the WebSocket directly. The parent API key should never leave your server.&lt;/p&gt;

&lt;h3&gt;
  
  
  What audio quality can I expect?
&lt;/h3&gt;

&lt;p&gt;TTS output is available as high-fidelity MP3 or 8 kHz μ-law. The voice agent runs PCM16 at 24 kHz internally.&lt;/p&gt;

&lt;p&gt;Quality is on par with major commercial TTS engines; latency is the differentiator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does it work with telephony?
&lt;/h3&gt;

&lt;p&gt;Yes. μ-law output is the standard format for SIP and PSTN bridges.&lt;/p&gt;

&lt;p&gt;You still need a SIP provider. xAI does not ship its own SIP gateway today.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does cloning quality compare to other tools?
&lt;/h3&gt;

&lt;p&gt;Cloning quality depends more on reference audio quality than length.&lt;/p&gt;

&lt;p&gt;A clean 60-second sample in a quiet room beats a noisy 120-second sample. The resulting &lt;code&gt;voice_id&lt;/code&gt; works across both the TTS endpoint and the voice agent without recloning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Grok Voice for AI characters in a game?
&lt;/h3&gt;

&lt;p&gt;Yes. The TTS endpoint is fast enough for runtime generation, and Custom Voices lets each character use its own clone.&lt;/p&gt;

&lt;p&gt;Watch latency on long lines. Chunked TTS is the recommended pattern.&lt;/p&gt;
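
&lt;p&gt;The chunking itself needs no API assumptions. A simple sentence-level splitter lets you request audio per chunk and start playback as soon as the first clip arrives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Split a long line into sentence-sized chunks for chunked TTS.
// A single sentence longer than maxLen is kept whole rather than cut mid-sentence.
function chunkLine(text, maxLen = 120) {
  const sentences = text.match(/[^.!?]+[.!?]*/g) ?? [text];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    if (current &amp;amp;&amp;amp; (current + s).length &amp;gt; maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// chunkLine("First sentence. Second one! A third, longer sentence follows?", 30)
// =&amp;gt; ["First sentence. Second one!", "A third, longer sentence follows?"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;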

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;Grok Voice is a direct path to building a real-time voice agent with no per-minute charge on the xAI Console. Start with a console key, pick a preset voice, test a WebSocket session, and only then add custom voice cloning or tool calls.&lt;/p&gt;

&lt;p&gt;The fastest validation loop is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Script a session in Apidog.&lt;/li&gt;
&lt;li&gt;Run it against three preset voices.&lt;/li&gt;
&lt;li&gt;Compare latency, tone, and turn-taking.&lt;/li&gt;
&lt;li&gt;Add tool calls once the base conversation works.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you are ready to plug it into Grok 4.3 reasoning, see the &lt;a href="http://apidog.com/blog/how-to-use-grok-4-3-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok 4.3 API guide&lt;/a&gt;. For a side-by-side against OpenAI’s stack, see &lt;a href="http://apidog.com/blog/grok-voice-vs-gpt-realtime-best-voice-model?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Grok Voice vs GPT-Realtime&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What Is GPT-Realtime-2 and How to Use the GPT-Realtime-2 API</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 07:23:10 +0000</pubDate>
      <link>https://forem.com/hassann/what-is-gpt-realtime-2-and-how-to-use-the-gpt-realtime-2-api-2kk</link>
      <guid>https://forem.com/hassann/what-is-gpt-realtime-2-and-how-to-use-the-gpt-realtime-2-api-2kk</guid>
      <description>&lt;p&gt;OpenAI shipped GPT-Realtime-2 on November 6, 2026. It is a speech-to-speech model with GPT-5-class reasoning, a 128,000-token context window, and configurable reasoning effort so you can trade latency for answer quality. If you already use &lt;code&gt;gpt-realtime&lt;/code&gt;, migration mostly means changing the model string and adding a few optional session/tool fields.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide shows what changed, how pricing works, and how to call GPT-Realtime-2 over WebSocket and SIP. It also includes a practical setup in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; so you can replay Realtime sessions without re-recording audio for every test.&lt;/p&gt;

&lt;p&gt;For context on OpenAI’s broader 2026 model line, see &lt;a href="http://apidog.com/blog/what-is-gpt-5-5?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;What is GPT-5.5&lt;/a&gt;. For the multimodal sibling, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-image-2-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-Image-2 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Model ID: &lt;code&gt;gpt-realtime-2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Context window: 128k tokens&lt;/li&gt;
&lt;li&gt;Max output: 32k tokens&lt;/li&gt;
&lt;li&gt;Input modalities: text, audio, image&lt;/li&gt;
&lt;li&gt;Output modalities: text, audio&lt;/li&gt;
&lt;li&gt;Audio pricing: &lt;strong&gt;$32 / 1M input tokens&lt;/strong&gt;, &lt;strong&gt;$64 / 1M output tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cached audio input: &lt;strong&gt;$0.40 / 1M tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;New Realtime-only voices: &lt;strong&gt;Cedar&lt;/strong&gt; and &lt;strong&gt;Marin&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reasoning levels: &lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Default reasoning level: &lt;code&gt;low&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;WebSocket endpoint:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;SIP sessions use:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?call_id={call_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Companion models:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Translate&lt;/strong&gt;: live translation, 70 input languages, $0.034/min&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Whisper&lt;/strong&gt;: streaming speech-to-text, $0.017/min&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Use &lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to script WebSocket sessions, capture frames, and compare event output between runs.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is GPT-Realtime-2?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/" rel="noopener noreferrer"&gt;GPT-Realtime-2&lt;/a&gt; is a single speech-to-speech model. You stream audio in, receive audio out, and the model handles transcription, reasoning, tool selection, and voice generation in one pass.&lt;/p&gt;

&lt;p&gt;That means you do not need to build a separate STT → LLM → TTS pipeline. The model runs on the existing Realtime API surface and improves the previous &lt;code&gt;gpt-realtime&lt;/code&gt; flow with stronger reasoning and larger context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-18.png" alt="" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model accepts text, audio, and images as input, then emits text and audio as output. Image input is new for this model. You can add a screenshot or photo to a live conversation, ask a question by voice, and get a spoken answer.&lt;/p&gt;

&lt;p&gt;That enables agents such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice support copilots that can inspect user screenshots&lt;/li&gt;
&lt;li&gt;Field-support agents that reason over photos&lt;/li&gt;
&lt;li&gt;Accessibility assistants that describe what is on screen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model ID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-realtime-2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max output&lt;/td&gt;
&lt;td&gt;32,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modalities in&lt;/td&gt;
&lt;td&gt;text, audio, image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modalities out&lt;/td&gt;
&lt;td&gt;text, audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge cutoff&lt;/td&gt;
&lt;td&gt;2024-09-30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning levels&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimal&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function calling&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote MCP servers&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SIP phone calling&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What changed from &lt;code&gt;gpt-realtime&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Compared with the previous &lt;code&gt;gpt-realtime&lt;/code&gt; model, GPT-Realtime-2 improves benchmark performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Big Bench Audio:&lt;/strong&gt; 81.4% → 96.6%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio MultiChallenge:&lt;/strong&gt; 34.7% → 48.5%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those scores used &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;xhigh&lt;/code&gt; reasoning. In production, the default is &lt;code&gt;low&lt;/code&gt; to reduce latency, so you should benchmark your own workload before increasing reasoning effort.&lt;/p&gt;

&lt;p&gt;Key behavior changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preambles:&lt;/strong&gt; The model can say short filler phrases like “let me check that” while it reasons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel tool calls with narration:&lt;/strong&gt; The model can call multiple tools and describe progress instead of going silent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better recovery:&lt;/strong&gt; Ambiguous or partially failed turns are handled more gracefully.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain tone control:&lt;/strong&gt; The model can keep specialized terminology consistent and adapt delivery style during a session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-19.png" alt="" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The context window also increased from 32k to 128k tokens. That matters for long-running voice sessions such as support calls, banking workflows, and tutoring sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 is billed per token, with separate rates for text, audio, and image input.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token type&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Cached input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;$4.00 / 1M&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$24.00 / 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio&lt;/td&gt;
&lt;td&gt;$32.00 / 1M&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$64.00 / 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;$5.00 / 1M&lt;/td&gt;
&lt;td&gt;$0.50 / 1M&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cached input reduces repeated-context cost significantly. If your agent uses a stable system prompt, policy document, or repeated instructions, keep that context cacheable.&lt;/p&gt;

&lt;p&gt;For comparison with the rest of the OpenAI line, see &lt;a href="http://apidog.com/blog/gpt-5-5-pricing?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;GPT-5.5 pricing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Companion model pricing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Translate:&lt;/strong&gt; $0.034/min. Supports 70 input languages and 13 output languages, with a Word Error Rate 12.5% lower than any other model tested on Hindi, Tamil, and Telugu.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Whisper:&lt;/strong&gt; $0.017/min. Streaming speech-to-text for live captions and continuous transcription.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-2&lt;/strong&gt; when you need reasoning and voice generation together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Translate&lt;/strong&gt; for live multilingual interpretation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-Realtime-Whisper&lt;/strong&gt; when you only need a transcript.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Endpoints and authentication
&lt;/h2&gt;

&lt;p&gt;Available endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST https://api.openai.com/v1/chat/completions
POST https://api.openai.com/v1/responses
WSS  wss://api.openai.com/v1/realtime?model=gpt-realtime-2
WSS  wss://api.openai.com/v1/realtime?call_id={call_id}
POST https://api.openai.com/v1/realtime/translations
POST https://api.openai.com/v1/realtime/transcription_sessions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For voice agents, use the WebSocket endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Required headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Bearer $OPENAI_API_KEY
OpenAI-Beta: realtime=v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-proj-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connect over WebSocket
&lt;/h2&gt;

&lt;p&gt;Install the WebSocket client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;ws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a minimal Node.js client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;WebSocket&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OpenAI-Beta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;realtime=v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;open&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cedar&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a friendly support agent for a fintech app.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input_audio_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pcm16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;output_audio_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pcm16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;turn_detection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;server_vad&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;effort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;response.audio.delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// base64 PCM16 audio chunk&lt;/span&gt;
    &lt;span class="c1"&gt;// Pipe this to a speaker, browser AudioWorklet, or media stream.&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session is event-driven:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send a &lt;code&gt;session.update&lt;/code&gt; event to configure the voice, audio format, VAD, tools, and reasoning effort.&lt;/li&gt;
&lt;li&gt;Send &lt;code&gt;input_audio_buffer.append&lt;/code&gt; events while the user speaks.&lt;/li&gt;
&lt;li&gt;Receive &lt;code&gt;response.audio.delta&lt;/code&gt; events as the model speaks.&lt;/li&gt;
&lt;li&gt;Handle tool-call events if the model requests external data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;PCM16 at 24 kHz is a safe default. G.711 mu-law and A-law are also supported, which is useful for phone-system integrations.&lt;/p&gt;
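
&lt;p&gt;A minimal sketch of the input side, reusing the &lt;code&gt;ws&lt;/code&gt; connection from the client above. The audio source is assumed to arrive as 24 kHz PCM16 &lt;code&gt;Buffer&lt;/code&gt; chunks from your mic capture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Stream captured audio into the session as base64 chunks.
function sendAudioChunk(ws, pcmChunk) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmChunk.toString("base64"),
  }));
}

// With server VAD, the turn ends automatically when the user stops speaking.
// Without VAD, commit the buffer and request a response yourself:
function endTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;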

&lt;p&gt;For Python, the &lt;code&gt;openai&lt;/code&gt; SDK &lt;code&gt;&amp;gt;= 2.1.0&lt;/code&gt; exposes a &lt;code&gt;realtime&lt;/code&gt; client with the same event names. To compare the Realtime API with the Responses API, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Voices
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 adds two Realtime-only voices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cedar:&lt;/strong&gt; warm, mid-range male voice. Suitable as a default general-agent voice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marin:&lt;/strong&gt; bright, clear female voice. Useful for translation and announcements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The previous eight voices are still available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alloy
ash
ballad
coral
echo
sage
shimmer
verse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All eight were also retuned for the new audio stack.&lt;/p&gt;

&lt;p&gt;To switch voices mid-session, send another &lt;code&gt;session.update&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;marin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Add image input to a voice turn
&lt;/h2&gt;

&lt;p&gt;You can attach an image to a user turn and then ask a question about it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;conversation.item.create&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input_image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/screenshot.png&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input_text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What does this error mean?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;response.create&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful implementation patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice-driven QA:&lt;/strong&gt; A tester points a camera at a broken UI and the agent dictates a bug report.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Field support:&lt;/strong&gt; A technician shares a wiring-panel photo and the agent walks through diagnostics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility:&lt;/strong&gt; The agent describes a user’s current screen during a support call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more on OpenAI’s image stack, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-image-2-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-Image-2 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Function calling and MCP
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 supports standard function tools and remote MCP servers in the same session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard function calling
&lt;/h3&gt;

&lt;p&gt;The flow is similar to Chat Completions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Declare tools in the session config.&lt;/li&gt;
&lt;li&gt;The model emits &lt;code&gt;response.function_call_arguments.delta&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Your app executes the function.&lt;/li&gt;
&lt;li&gt;Your app sends a &lt;code&gt;conversation.item.create&lt;/code&gt; event with &lt;code&gt;function_call_output&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important change is parallel calling. The model can trigger multiple calls at once and narrate progress while waiting for results.&lt;/p&gt;

&lt;p&gt;Example session update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lookup_account&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Look up a customer account by ID.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;account_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;list_transactions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;List recent transactions for an account.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;account_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
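
&lt;p&gt;On the return path, a sketch of executing a completed call and handing the result back. The &lt;code&gt;response.function_call_arguments.done&lt;/code&gt; event name and &lt;code&gt;call_id&lt;/code&gt; field follow the event naming used above; verify the exact shapes against your event stream. &lt;code&gt;runTool&lt;/code&gt; is your own dispatcher:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Execute a completed function call and return the result to the model.
ws.on("message", async (raw) =&amp;gt; {
  const event = JSON.parse(raw.toString());
  if (event.type !== "response.function_call_arguments.done") return;

  const args = JSON.parse(event.arguments);
  const result = await runTool(event.name, args); // your own dispatcher

  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: event.call_id,
      output: JSON.stringify(result),
    },
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;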



&lt;h3&gt;
  
  
  Remote MCP servers
&lt;/h3&gt;

&lt;p&gt;Remote MCP support lets the Realtime API call tools from an MCP server directly. Configure the MCP URL and allowed tools in the session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;server_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://mcp.example.com/sse&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lookup_account&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;list_transactions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when your voice agent needs access to a larger tool catalog without manually routing every function call through your WebSocket loop.&lt;/p&gt;

&lt;p&gt;If you are testing MCP servers before wiring them into a voice agent, see &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing in Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  SIP phone calling
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 can handle real phone calls through SIP.&lt;/p&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Point your SIP trunk at OpenAI’s SIP gateway.&lt;/li&gt;
&lt;li&gt;An inbound call opens a Realtime WebSocket session.&lt;/li&gt;
&lt;li&gt;Your app connects using the call ID:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?call_id={call_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model accepts G.711 mu-law and A-law directly, so your bridge does not need to transcode audio before sending it to the Realtime API.&lt;/p&gt;

&lt;p&gt;This makes GPT-Realtime-2 suitable for call-center-style agents where most turns involve listening, calling tools, and responding by voice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure reasoning effort
&lt;/h2&gt;

&lt;p&gt;Reasoning effort controls the latency/quality tradeoff.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Approx. latency cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;minimal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Single-turn yes/no answers&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;low&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default; everyday support and chat&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disambiguation, complex tool dispatch&lt;/td&gt;
&lt;td&gt;moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;high&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-step reasoning, code review by voice&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;xhigh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Benchmarks, hard analytical questions&lt;/td&gt;
&lt;td&gt;highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Default to &lt;code&gt;low&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;effort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move to &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, or &lt;code&gt;xhigh&lt;/code&gt; only when you can measure a quality gap. The latency cost is noticeable in live calls.&lt;/p&gt;
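
&lt;p&gt;To make that decision with data rather than feel, time the gap between &lt;code&gt;response.create&lt;/code&gt; and the first &lt;code&gt;response.audio.delta&lt;/code&gt; at each effort level. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Measure time-to-first-audio for the current reasoning effort.
let responseStartedAt = 0;

function requestResponse(ws) {
  responseStartedAt = Date.now();
  ws.send(JSON.stringify({ type: "response.create" }));
}

ws.on("message", (raw) =&amp;gt; {
  const event = JSON.parse(raw.toString());
  if (event.type !== "response.audio.delta") return;
  if (responseStartedAt) {
    console.log(`first audio after ${Date.now() - responseStartedAt} ms`);
    responseStartedAt = 0; // log only the first delta per response
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;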

&lt;h2&gt;
  
  
  Test the Realtime API in Apidog
&lt;/h2&gt;

&lt;p&gt;WebSocket APIs are difficult to debug from the terminal because every connection has state. Apidog gives you a repeatable way to test the same Realtime session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.apidog.com%2Fblog-next%2F2026%2F05%2Fimage-20.png" alt="" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A practical test workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new WebSocket request.&lt;/li&gt;
&lt;li&gt;Use this URL:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Add headers:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Bearer {{OPENAI_API_KEY}}
OpenAI-Beta: realtime=v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Save a &lt;code&gt;session.update&lt;/code&gt; message.&lt;/li&gt;
&lt;li&gt;Add scripted messages such as:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;input_audio_buffer.append&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_audio_buffer.commit&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;response.create&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Replay the script against one connection.&lt;/li&gt;
&lt;li&gt;Capture all server events.&lt;/li&gt;
&lt;li&gt;Diff runs when changing voice, reasoning effort, or tool configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt;, create a WebSocket request, and store your bearer token under &lt;strong&gt;Auth&lt;/strong&gt; or an environment variable.&lt;/p&gt;

&lt;p&gt;For comparison with another fast multimodal model, see &lt;a href="http://apidog.com/blog/how-to-use-gemini-3-flash-preview-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the Gemini 3 Flash Preview API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What model ID should I use?
&lt;/h3&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-realtime-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The earlier model is still available as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-realtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lite version is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-realtime-2-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Can I stream input audio while output audio is still playing?
&lt;/h3&gt;

&lt;p&gt;Yes. The Realtime API uses server-side voice activity detection by default, so the model can stop speaking when the user starts. You can also disable VAD and manage turn boundaries from the client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the 128k context include audio tokens?
&lt;/h3&gt;

&lt;p&gt;Yes. Audio is tokenized. One second of audio is roughly 50 tokens depending on format. Long calls can consume context faster than long text chats, so inspect usage before assuming the full 128k window is enough.&lt;/p&gt;
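
&lt;p&gt;The arithmetic is worth doing up front: at roughly 50 tokens per second, audio costs about 3,000 tokens per minute, so a 40-minute call consumes around 120,000 tokens before any text, tool, or image tokens are counted.&lt;/p&gt;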

&lt;h3&gt;
  
  
  Is fine-tuning supported?
&lt;/h3&gt;

&lt;p&gt;Not yet. Per the model card, GPT-Realtime-2 does not support fine-tuning, predicted outputs, or text streaming on Chat Completions. The Realtime endpoint streams audio inherently.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does GPT-Realtime-2 compare to GPT-5.5 plus TTS?
&lt;/h3&gt;

&lt;p&gt;GPT-Realtime-2 performs end-to-end speech reasoning. A voice-aware model can respond to tone, hesitation, and emphasis. A text model with TTS cannot use those audio cues in the same way.&lt;/p&gt;

&lt;p&gt;For pure text reasoning, see &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;How to use the GPT-5.5 API&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What rate limits apply?
&lt;/h3&gt;

&lt;p&gt;Tier 1 starts at 40,000 tokens per minute and scales to 15M TPM at Tier 5. Rate limits are per model, so existing GPT-5 quota does not carry over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 gives you a single API surface for voice input, reasoning, tool use, image input, and spoken output. The main implementation path is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the WebSocket endpoint.&lt;/li&gt;
&lt;li&gt;Configure &lt;code&gt;session.update&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;low&lt;/code&gt; reasoning by default.&lt;/li&gt;
&lt;li&gt;Add tools only after the basic audio loop works.&lt;/li&gt;
&lt;li&gt;Test repeated sessions in Apidog.&lt;/li&gt;
&lt;li&gt;Increase reasoning effort only when measured quality requires it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The combination of 128k context, GPT-5-class reasoning, image input, MCP, and SIP support makes it practical to build voice agents that can answer calls, inspect screenshots, dispatch tools, and recover from failed turns without leaving the Realtime session.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Best Local LLMs of 2026</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 06:38:04 +0000</pubDate>
      <link>https://forem.com/hassann/best-local-llms-of-2026-5gm0</link>
      <guid>https://forem.com/hassann/best-local-llms-of-2026-5gm0</guid>
      <description>&lt;p&gt;This guide helps you choose a local LLM for 2026 based on VRAM, latency, and workload, then serve and test it through an OpenAI-compatible API using Ollama, vLLM, LM Studio, and &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The “best” local LLM in 2026 depends on your VRAM budget, latency target, and use case: coding, reasoning, multilingual, or vision.&lt;/li&gt;
&lt;li&gt;For 24 GB GPUs, &lt;strong&gt;Qwen 3.6 32B&lt;/strong&gt; and &lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt; are the strongest all-rounders.&lt;/li&gt;
&lt;li&gt;For 8 GB and below, &lt;strong&gt;Gemma 4 9B&lt;/strong&gt; and &lt;strong&gt;Llama 5.1 8B&lt;/strong&gt; are the practical picks.&lt;/li&gt;
&lt;li&gt;For reasoning or coding-heavy workloads, use &lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt; quantized or &lt;strong&gt;GLM 5.1&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Ollama&lt;/strong&gt; or &lt;strong&gt;LM Studio&lt;/strong&gt; to expose an OpenAI-compatible HTTP endpoint.&lt;/li&gt;
&lt;li&gt;Test local models with &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; the same way you test hosted models.&lt;/li&gt;
&lt;li&gt;Use Apidog to mock, replay, and benchmark local model traffic without spending hosted LLM tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are already focused on DeepSeek, see the &lt;a href="http://apidog.com/blog/how-to-run-deepseek-v4-locally?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 local install guide&lt;/a&gt; and the &lt;a href="http://apidog.com/blog/what-is-deepseek-v4?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 overview&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why local LLMs matter again in 2026
&lt;/h2&gt;

&lt;p&gt;A few years ago, running a local LLM usually meant accepting lower quality. That is less true now.&lt;/p&gt;

&lt;p&gt;Open-weight models have narrowed the quality gap with hosted GPT-4-class systems, especially for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extraction&lt;/li&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Coding assistance&lt;/li&gt;
&lt;li&gt;Reasoning workflows&lt;/li&gt;
&lt;li&gt;Structured output generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bigger change is hardware. A 24 GB consumer GPU can run a 32B-parameter model at production-quality 4-bit quantization. A Mac Studio with 64 GB unified memory can run DeepSeek V4 Flash at usable speeds.&lt;/p&gt;

&lt;p&gt;Local models now make sense when you care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data residency&lt;/li&gt;
&lt;li&gt;Vendor lock-in&lt;/li&gt;
&lt;li&gt;Predictable inference cost&lt;/li&gt;
&lt;li&gt;Offline or private workloads&lt;/li&gt;
&lt;li&gt;Internal tools and CI workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part is no longer only “is the model good enough?” It is also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can your app call the local model the same way it calls a hosted API?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is why OpenAI-compatible serving and API testing tools matter.&lt;/p&gt;
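
&lt;p&gt;Concretely, that compatibility means the request shape your app already sends to a hosted API works against a local server. A sketch against Ollama’s OpenAI-compatible endpoint on its default port, using the article’s example model tag:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Same client code, different base URL: Ollama serves an
// OpenAI-compatible endpoint on localhost:11434 by default.
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-v4-flash", // the article's example model tag
    messages: [{ role: "user", content: "Say hello in one word." }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;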

&lt;h2&gt;
  
  
  Selection criteria
&lt;/h2&gt;

&lt;p&gt;This shortlist is not just a leaderboard scrape. The criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open weights with a permissive license such as MIT, Apache 2.0, or a production-friendly community license&lt;/li&gt;
&lt;li&gt;Active maintenance in 2026&lt;/li&gt;
&lt;li&gt;OpenAI-compatible serving through Ollama, vLLM, or LM Studio&lt;/li&gt;
&lt;li&gt;Strong real-world performance in at least one area:

&lt;ul&gt;
&lt;li&gt;General reasoning&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Multilingual output&lt;/li&gt;
&lt;li&gt;Vision&lt;/li&gt;
&lt;li&gt;Long context&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Reasonable hardware requirements&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The models were tested with the same prompt set on a 4090 and a Mac Studio M3 Ultra, then cross-checked against &lt;a href="https://chat.lmsys.org/" rel="noopener noreferrer"&gt;LMSYS Chatbot Arena&lt;/a&gt; and the &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" rel="noopener noreferrer"&gt;Hugging Face Open LLM Leaderboard&lt;/a&gt; where applicable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local LLM picks for 2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Practical hardware target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;Reasoning-heavy agents&lt;/td&gt;
&lt;td&gt;192 GB unified memory or 2x 80 GB GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;General local agent, coding, RAG&lt;/td&gt;
&lt;td&gt;24 GB VRAM at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.6 32B&lt;/td&gt;
&lt;td&gt;Multilingual, structured output, tool calling&lt;/td&gt;
&lt;td&gt;24 GB VRAM at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM 5.1&lt;/td&gt;
&lt;td&gt;Tool-calling agents, extraction, JSON workflows&lt;/td&gt;
&lt;td&gt;Local serving through Ollama or vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 5.1 8B&lt;/td&gt;
&lt;td&gt;Smaller local setups&lt;/td&gt;
&lt;td&gt;8 GB-class hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 9B&lt;/td&gt;
&lt;td&gt;Lightweight local assistants&lt;/td&gt;
&lt;td&gt;8 GB-class hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. DeepSeek V4 Pro
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 Pro is the flagship model in the DeepSeek V4 release. It is available as 4-bit GGUF and AWQ on Hugging Face.&lt;/p&gt;

&lt;p&gt;The full model has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1.6T total parameters&lt;/li&gt;
&lt;li&gt;49B active parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That puts it in datacenter-class territory. Quantized to Q4, it fits on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pair of 80 GB H100s&lt;/li&gt;
&lt;li&gt;A Mac Studio M3 Ultra with 192 GB unified memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most developers, V4 Pro is not the first model to run locally. It is more useful as a reference point for high-end reasoning quality.&lt;/p&gt;

&lt;p&gt;If you would rather use the same family through a hosted API, see &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;how to use the DeepSeek V4 API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; reasoning-heavy agents and high-end local inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; 192 GB unified memory or 2x 80 GB GPUs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to get it:&lt;/strong&gt; &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;DeepSeek V4 Pro GGUF on Hugging Face&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. DeepSeek V4 Flash
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 Flash is the smaller V4 variant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;284B total parameters&lt;/li&gt;
&lt;li&gt;13B active parameters&lt;/li&gt;
&lt;li&gt;Fits in 24 GB VRAM at 4-bit quantization&lt;/li&gt;
&lt;li&gt;Leaves room for a 64K context window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a 4090, throughput averages about 28 tokens per second on long-form generation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi06uvwrpb8br1ylseq9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi06uvwrpb8br1ylseq9h.png" alt="DeepSeek V4 Flash" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the DeepSeek model most teams are likely to run locally. In testing, reasoning quality stayed close to V4 Pro, while coding was slightly behind.&lt;/p&gt;

&lt;p&gt;For an end-to-end setup, use the &lt;a href="http://apidog.com/blog/how-to-run-deepseek-v4-locally?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 local install guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; general-purpose local agents, coding assistants, and RAG generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; 24 GB VRAM at Q4, or 16 GB at Q3 with quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to get it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull deepseek-v4-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash" rel="noopener noreferrer"&gt;Hugging Face GGUF&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Qwen 3.6 32B
&lt;/h2&gt;

&lt;p&gt;Alibaba’s Qwen line has been one of the most consistent open-weight model families.&lt;/p&gt;

&lt;p&gt;Qwen 3.6 32B at Q4 fits in 24 GB VRAM and performs well on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General reasoning&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Structured outputs&lt;/li&gt;
&lt;li&gt;Multilingual tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its multilingual support is the main reason to choose it over many Western open models. It handles Chinese, Japanese, Korean, and Arabic at a high level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddpqlfd78fqtq9387yx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddpqlfd78fqtq9387yx2.png" alt="Qwen 3.6" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your product needs one local model for reasoning plus multilingual output, Qwen 3.6 32B is the most practical pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; multilingual products, structured output, tool calling, and balanced cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; 24 GB VRAM at Q4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to get it:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.6:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;a href="https://huggingface.co/Qwen/Qwen3.6-32B" rel="noopener noreferrer"&gt;Qwen 3.6 on Hugging Face&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. GLM 5.1
&lt;/h2&gt;

&lt;p&gt;Zhipu AI’s GLM line has become a strong option for tool-calling and structured workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://z.ai/blog/glm-5.1" rel="noopener noreferrer"&gt;GLM 5.1&lt;/a&gt; scores near the top among open models on tool-calling benchmarks. Its strongest areas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning&lt;/li&gt;
&lt;li&gt;Classification&lt;/li&gt;
&lt;li&gt;Structured extraction&lt;/li&gt;
&lt;li&gt;JSON-mode workflows&lt;/li&gt;
&lt;li&gt;Instruction following&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coding is weaker than its reasoning and extraction performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllhaiswc3qyj6zu9nhmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllhaiswc3qyj6zu9nhmf.png" alt="GLM 5.1" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choose GLM 5.1 when your workload is mostly tool calls, agentic workflows, or JSON schema extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; tool-calling agents, structured extraction, and JSON-mode pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serve a local LLM like a hosted API
&lt;/h2&gt;

&lt;p&gt;Once the model is running, your application still needs an HTTP endpoint.&lt;/p&gt;

&lt;p&gt;Three serving paths matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Ollama
&lt;/h3&gt;

&lt;p&gt;Ollama is the easiest path for local development.&lt;/p&gt;

&lt;p&gt;Start the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.6:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama exposes an OpenAI-compatible endpoint at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means most OpenAI SDK-based apps only need two changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;base_url&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;model&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Option 2: vLLM
&lt;/h3&gt;

&lt;p&gt;vLLM is the production-oriented option.&lt;/p&gt;

&lt;p&gt;Use it when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better throughput&lt;/li&gt;
&lt;li&gt;Lower latency&lt;/li&gt;
&lt;li&gt;Continuous batching&lt;/li&gt;
&lt;li&gt;Higher concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It exposes an OpenAI-compatible API at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8000/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 3: LM Studio
&lt;/h3&gt;

&lt;p&gt;LM Studio is useful for individual developers who want a GUI.&lt;/p&gt;

&lt;p&gt;Enable the local server in settings, then point your app or API client at the exposed local endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal Python client example
&lt;/h2&gt;

&lt;p&gt;The OpenAI Python client can call Ollama, vLLM, or LM Studio if the server exposes an OpenAI-compatible API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# any string; Ollama ignores it
&lt;/span&gt;    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3.6:32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the differences between MoE and dense models in three bullets.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To switch models, change only the model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama5.1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request shape stays the same.&lt;/p&gt;

&lt;p&gt;For a related hosted/local workflow, see &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-for-free?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;how to use DeepSeek V4 for free&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test local models with Apidog
&lt;/h2&gt;

&lt;p&gt;Local inference gives you control, but it also gives you more things to debug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3bxrkm26ig08d3q0jx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fou3bxrkm26ig08d3q0jx.png" alt="Testing local models with Apidog" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a hosted provider breaks, you check the status page. When your local model breaks, you own the issue.&lt;/p&gt;

&lt;p&gt;You need to inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw requests&lt;/li&gt;
&lt;li&gt;Headers&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;Tool-call payloads&lt;/li&gt;
&lt;li&gt;Token latency&lt;/li&gt;
&lt;li&gt;Time to first token&lt;/li&gt;
&lt;li&gt;Output differences between model versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; treats your Ollama or vLLM endpoint like any other API.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Save canonical requests
&lt;/h3&gt;

&lt;p&gt;Create one request collection per model.&lt;/p&gt;

&lt;p&gt;Include realistic values for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;li&gt;System message&lt;/li&gt;
&lt;li&gt;Temperature&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Tool definitions&lt;/li&gt;
&lt;li&gt;JSON schema requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replay the same request whenever you change models or quantization levels.&lt;/p&gt;
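
&lt;p&gt;As a sketch, a canonical request you might save per model could look like this. The field values are illustrative, not a required format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Illustrative canonical request for one model; adjust values to your workload.
canonical_request = {
    "model": "qwen3.6:32b",
    "messages": [
        {"role": "system", "content": "You are a support agent. Answer in JSON."},
        {"role": "user", "content": "Summarize this ticket in two sentences."},
    ],
    "temperature": 0.2,
    "max_tokens": 512,
}

# Store it next to your Apidog collection so replays stay byte-identical.
with open("canonical_qwen3.6-32b.json", "w") as f:
    json.dump(canonical_request, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;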

&lt;h3&gt;
  
  
  2. Diff outputs across models
&lt;/h3&gt;

&lt;p&gt;Run the same prompt against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen 3.6&lt;/li&gt;
&lt;li&gt;DeepSeek V4 Flash&lt;/li&gt;
&lt;li&gt;GLM 5.1&lt;/li&gt;
&lt;li&gt;Llama 5.1&lt;/li&gt;
&lt;li&gt;Gemma 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then compare responses to spot regressions before shipping.&lt;/p&gt;
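
&lt;p&gt;If you prefer a script for the same diff, here is a minimal sketch against a local Ollama server. The model tags are assumptions; use whatever &lt;code&gt;ollama list&lt;/code&gt; shows on your machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Local Ollama server; the api_key is a placeholder that Ollama ignores.
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# Tags below are assumed; replace them with your locally pulled models.
models = ["qwen3.6:32b", "deepseek-v4-flash", "glm5.1", "llama5.1:8b", "gemma4:9b"]
prompt = "Extract the company and amount from: 'Acme owes $1,200.' Return JSON."

for model in models:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;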

&lt;h3&gt;
  
  
  3. Mock the endpoint for CI
&lt;/h3&gt;

&lt;p&gt;CI should not need a 24 GB GPU to pass.&lt;/p&gt;

&lt;p&gt;Use Apidog mocks to return realistic JSON or streaming responses during tests. That keeps unit and integration tests deterministic even when the local model is offline.&lt;/p&gt;
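
&lt;p&gt;A sketch of a CI test that targets the mock instead of a GPU box. The &lt;code&gt;MOCK_BASE_URL&lt;/code&gt; variable name is a placeholder, not an Apidog convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

from openai import OpenAI

def test_chat_endpoint_shape():
    # In CI, MOCK_BASE_URL points at the Apidog mock; locally it falls back to Ollama.
    base_url = os.environ.get("MOCK_BASE_URL", "http://localhost:11434/v1")
    client = OpenAI(api_key="test", base_url=base_url)

    resp = client.chat.completions.create(
        model="qwen3.6:32b",
        messages=[{"role": "user", "content": "ping"}],
    )

    # Assert on shape, not exact wording, so the test stays deterministic.
    assert resp.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;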

&lt;h3&gt;
  
  
  4. Benchmark throughput
&lt;/h3&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Time to first token&lt;/li&gt;
&lt;li&gt;Tokens per second&lt;/li&gt;
&lt;li&gt;Failure rate&lt;/li&gt;
&lt;li&gt;Response size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use those numbers to compare Q4 vs Q5 quantization or Ollama vs vLLM.&lt;/p&gt;
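
&lt;p&gt;A rough streaming benchmark, as a sketch: it measures wall-clock time to first token and uses whitespace-separated words as a crude token proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

start = time.perf_counter()
first_token_at = None
words = 0

stream = client.chat.completions.create(
    model="qwen3.6:32b",
    messages=[{"role": "user", "content": "Write 200 words about API mocking."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()
    words += len(delta.split())

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f} s")
print(f"throughput: {words / total:.1f} words/s over {total:.2f} s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;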

&lt;h3&gt;
  
  
  5. Document the local API
&lt;/h3&gt;

&lt;p&gt;Apidog projects can export OpenAPI 3.1, so teammates get a clear contract for calling your internal local model endpoint.&lt;/p&gt;

&lt;p&gt;For a broader API workflow, see &lt;a href="http://apidog.com/blog/best-self-hosted-postman-alternatives-2026-2?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog as a Postman alternative&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes when running local LLMs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Picking the biggest model that fits
&lt;/h3&gt;

&lt;p&gt;A 32B model at Q3 can be worse than a 14B model at Q5.&lt;/p&gt;

&lt;p&gt;Once you go below 4-bit quantization, quality can drop quickly. Do not compare parameter count without comparing quantization quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forgetting that context length uses VRAM
&lt;/h3&gt;

&lt;p&gt;A long context window is not free.&lt;/p&gt;

&lt;p&gt;A 32K-token context on a 32B model needs several GB of KV cache. Reserve memory for context before choosing the model.&lt;/p&gt;
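
&lt;p&gt;As a back-of-envelope check, an fp16 KV cache costs 2 (K and V) × layers × KV heads × head dim × 2 bytes per token. The dimensions below are illustrative for a 32B-class model with grouped-query attention, not published specs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative 32B-class dimensions; check the real model card before relying on this.
n_layers = 64
n_kv_heads = 8         # grouped-query attention keeps the KV head count small
head_dim = 128
bytes_per_elem = 2     # fp16 cache

# Both K and V are cached per layer per token, hence the leading factor of 2.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

context_tokens = 32_768
total_gib = kv_bytes_per_token * context_tokens / 1024**3
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token -&gt; {total_gib:.1f} GiB at 32K context")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;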

&lt;h3&gt;
  
  
  Trusting random fine-tunes
&lt;/h3&gt;

&lt;p&gt;Avoid random Hugging Face uploads for production workloads.&lt;/p&gt;

&lt;p&gt;Prefer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original model cards&lt;/li&gt;
&lt;li&gt;Known fine-tune authors&lt;/li&gt;
&lt;li&gt;Reproducible evaluation results&lt;/li&gt;
&lt;li&gt;Clear licenses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A poisoned or poorly trained fine-tune can create security and reliability issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skipping the mock layer
&lt;/h3&gt;

&lt;p&gt;Local models go down.&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Driver crashes&lt;/li&gt;
&lt;li&gt;OOM kills&lt;/li&gt;
&lt;li&gt;GPU throttling&lt;/li&gt;
&lt;li&gt;Process restarts&lt;/li&gt;
&lt;li&gt;Broken model downloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If CI calls the real local model directly, your tests become flaky. Mock the endpoint in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ignoring tool-call format differences
&lt;/h3&gt;

&lt;p&gt;Different models can support tool calls but emit slightly different JSON shapes.&lt;/p&gt;

&lt;p&gt;Test each model before swapping it into production.&lt;/p&gt;

&lt;p&gt;Pay attention to the following; a defensive parsing sketch follows the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Function name fields&lt;/li&gt;
&lt;li&gt;Argument serialization&lt;/li&gt;
&lt;li&gt;Streaming chunks&lt;/li&gt;
&lt;li&gt;Invalid JSON recovery&lt;/li&gt;
&lt;li&gt;Empty tool-call responses&lt;/li&gt;
&lt;/ul&gt;
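
&lt;p&gt;Here is that defensive parsing sketch. The recovery strategy (slicing out the outermost braces) is one common approach, not a standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def parse_tool_args(raw: str) -&gt; dict | None:
    """Parse tool-call arguments defensively; return None when unrecoverable."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        # Some models wrap arguments in code fences or add trailing text.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end &lt;= start:
            return None
        try:
            args = json.loads(raw[start : end + 1])
        except json.JSONDecodeError:
            return None
    return args if isinstance(args, dict) else None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;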

&lt;h2&gt;
  
  
  Real-world usage patterns
&lt;/h2&gt;

&lt;p&gt;A startup running a customer-support agent moved from GPT-5.5 to Qwen 3.6 32B on a single 4090. Latency stayed under 800 ms, monthly inference cost dropped, and the team uses Apidog mocks to keep CI deterministic.&lt;/p&gt;

&lt;p&gt;A solo developer building a voice assistant runs Gemma 4 9B on an M2 Pro with 16 GB of unified memory. Multi-token prediction drafting provides enough throughput for a native-feeling assistant.&lt;/p&gt;

&lt;p&gt;A fintech research team runs DeepSeek V4 Flash on two 4090s for nightly batch summarization of regulatory filings. Their cost per summary is mostly electricity and maintenance time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation checklist
&lt;/h2&gt;

&lt;p&gt;Use this flow to get from model choice to testable local API.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pick the model
&lt;/h3&gt;

&lt;p&gt;For 24 GB VRAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen 3.6 32B
DeepSeek V4 Flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For smaller machines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Llama 5.1 8B
Gemma 4 9B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For tool-heavy workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GLM 5.1
Qwen 3.6 32B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-end reasoning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DeepSeek V4 Pro
DeepSeek V4 Flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Pull the model
&lt;/h3&gt;

&lt;p&gt;Example with Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen3.6:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Start the local server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Test the endpoint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen3.6:32b",
    "messages": [
      {
        "role": "user",
        "content": "Return three API testing best practices as JSON."
      }
    ],
    "temperature": 0.2
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Add the endpoint to Apidog
&lt;/h3&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create saved requests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal chat&lt;/li&gt;
&lt;li&gt;Streaming chat&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;JSON output&lt;/li&gt;
&lt;li&gt;Long-context prompts&lt;/li&gt;
&lt;li&gt;Failure cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Replay before every model swap
&lt;/h3&gt;

&lt;p&gt;Before changing from one model to another, replay the same collection and compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output structure&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Tool-call behavior&lt;/li&gt;
&lt;li&gt;JSON validity&lt;/li&gt;
&lt;li&gt;Error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The best local LLM in 2026 is the one that fits your VRAM, latency budget, and quality bar.&lt;/p&gt;

&lt;p&gt;Most teams should start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen 3.6 32B&lt;/strong&gt; or &lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt; for 24 GB GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama 5.1 8B&lt;/strong&gt; or &lt;strong&gt;Gemma 4 9B&lt;/strong&gt; for smaller hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GLM 5.1&lt;/strong&gt; when tool calling and structured extraction are the main workload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt; only when you have high-end hardware and need maximum reasoning quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five practical takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local model quality is close enough for many production tasks.&lt;/li&gt;
&lt;li&gt;Ollama plus an OpenAI-compatible client is the fastest setup path.&lt;/li&gt;
&lt;li&gt;Quantization quality matters more than raw parameter count.&lt;/li&gt;
&lt;li&gt;Treat the local model as a production API.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to save requests, mock CI, benchmark runs, and document the endpoint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next step: pick a model, run &lt;code&gt;ollama pull &amp;lt;name&amp;gt;&lt;/code&gt;, and point &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can start replaying and benchmarking requests within an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best local LLM for a 24 GB GPU in 2026?
&lt;/h3&gt;

&lt;p&gt;For most workloads, use Qwen 3.6 32B at Q4 or DeepSeek V4 Flash at Q4.&lt;/p&gt;

&lt;p&gt;Pick Qwen for multilingual or tool-heavy tasks. Pick DeepSeek V4 Flash for reasoning and coding. See the &lt;a href="http://apidog.com/blog/how-to-run-deepseek-v4-locally?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 local guide&lt;/a&gt; for setup details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run a local LLM on a Mac?
&lt;/h3&gt;

&lt;p&gt;Yes. Apple silicon with 16 GB or more of unified memory can run Llama 5.1 8B and Gemma 4 9B comfortably.&lt;/p&gt;

&lt;p&gt;An M3 Ultra with 192 GB unified memory can run DeepSeek V4 Pro at Q4. Use Ollama or LM Studio.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I test a local LLM the same way I test OpenAI?
&lt;/h3&gt;

&lt;p&gt;Point your OpenAI-compatible client and your &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; project at the local serving URL.&lt;/p&gt;

&lt;p&gt;Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vLLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8000/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request shape stays the same. Only the base URL and model name change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is local LLM quality really at parity with hosted?
&lt;/h3&gt;

&lt;p&gt;For reasoning, coding, classification, extraction, and tool calling, top open models are often within single-digit percentage points of hosted models.&lt;/p&gt;

&lt;p&gt;Hosted models still tend to lead on vision, long-context document QA, and creative writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about cost?
&lt;/h3&gt;

&lt;p&gt;A 4090 can run DeepSeek V4 Flash for the price of electricity and hardware maintenance.&lt;/p&gt;

&lt;p&gt;At high volume, hosted inference can cost hundreds or thousands of dollars per month. The break-even point depends on utilization, but it is often around millions of tokens per month.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I switch a production app between hosted and local?
&lt;/h3&gt;

&lt;p&gt;Keep the OpenAI-compatible client.&lt;/p&gt;

&lt;p&gt;Change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_url
model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then replay saved API requests before shipping the swap. See &lt;a href="http://apidog.com/blog/api-testing-without-postman-2026?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing without Postman&lt;/a&gt; for the same testing pattern.&lt;/p&gt;
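
&lt;p&gt;A minimal sketch of that switch using environment variables. The &lt;code&gt;LLM_BASE_URL&lt;/code&gt; and &lt;code&gt;LLM_MODEL&lt;/code&gt; names are placeholders, not a standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

from openai import OpenAI

# Hosted by default; point LLM_BASE_URL at Ollama or vLLM to go local.
client = OpenAI(
    api_key=os.environ.get("LLM_API_KEY", "local"),
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
)
MODEL = os.environ.get("LLM_MODEL", "gpt-5.5")

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;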

&lt;h3&gt;
  
  
  Where can I track current model rankings?
&lt;/h3&gt;

&lt;p&gt;Use both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" rel="noopener noreferrer"&gt;Hugging Face Open LLM Leaderboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chat.lmsys.org/" rel="noopener noreferrer"&gt;LMSYS Chatbot Arena&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cross-reference them because they measure different things.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Computer Use vs Structured APIs: When Each Wins (2026)</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Fri, 08 May 2026 02:36:54 +0000</pubDate>
      <link>https://forem.com/hassann/computer-use-vs-structured-apis-when-each-wins-2026-2p31</link>
      <guid>https://forem.com/hassann/computer-use-vs-structured-apis-when-each-wins-2026-2p31</guid>
      <description>&lt;p&gt;Driving a browser with an LLM through computer-use models can cost roughly 45x more than calling the same vendor through a structured API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide explains where that 45x gap comes from, when computer use is still worth it, and how to design cheaper agent workflows with &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;. The same framework applies to OpenAI Operator, Anthropic computer use, browser-use, Skyvern, and any agent runtime built around a screenshot loop.&lt;/p&gt;

&lt;p&gt;If you write APIs for AI agents, also read the companion guide on &lt;a href="http://apidog.com/blog/how-to-write-agents-md-files?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;how to write agents.md files&lt;/a&gt;. Those conventions make the structured-API path easier for agents to discover and call.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Computer use means an LLM reads screenshots and emits clicks, keystrokes, and scrolls.&lt;/li&gt;
&lt;li&gt;Structured APIs mean the LLM emits JSON tool calls that your backend executes.&lt;/li&gt;
&lt;li&gt;For the same task, computer use often burns 30x to 50x more tokens because every step sends another screenshot.&lt;/li&gt;
&lt;li&gt;Use computer use only when no API exists, the API is blocked, or the workflow lives behind an interface you cannot automate cleanly.&lt;/li&gt;
&lt;li&gt;Use structured APIs for payments, search, CRM updates, internal tools, queue jobs, and anything you can document with OpenAPI.&lt;/li&gt;
&lt;li&gt;In production, hybrid is usually the right architecture: structured APIs handle the common path, computer use handles the legacy long tail.&lt;/li&gt;
&lt;li&gt;Use Apidog to design JSON tool schemas, mock endpoints while iterating, and replay requests without burning agent credits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why the cost gap is so big
&lt;/h2&gt;

&lt;p&gt;The 45x number is not magic. It comes from token usage.&lt;/p&gt;

&lt;p&gt;A structured API call usually looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send the user request.&lt;/li&gt;
&lt;li&gt;Send a tool schema.&lt;/li&gt;
&lt;li&gt;Receive a JSON object.&lt;/li&gt;
&lt;li&gt;Execute one backend request.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That round trip may use a few hundred input tokens and a small JSON response.&lt;/p&gt;

&lt;p&gt;A computer-use loop looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send the user request.&lt;/li&gt;
&lt;li&gt;Send a screenshot.&lt;/li&gt;
&lt;li&gt;Receive a click coordinate or keyboard action.&lt;/li&gt;
&lt;li&gt;Execute the action.&lt;/li&gt;
&lt;li&gt;Take another screenshot.&lt;/li&gt;
&lt;li&gt;Repeat until the task finishes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A typical browser task can take 12 to 30 rounds. Each screenshot can cost around 1,500 tokens at common resolutions. Add retries, cookie banners, login screens, scroll mistakes, and misclicks, and the cost multiplies quickly.&lt;/p&gt;
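
&lt;p&gt;A back-of-envelope version of that arithmetic, as a sketch. Every count below is illustrative; substitute your provider's real token prices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Structured path: one tool-call round trip.
api_tokens = 600 + 150             # schema plus request input, small JSON output

# Computer-use path: screenshot loop.
rounds = 20                        # a mid-range browser task
screenshot_tokens = 1_500          # per screenshot at common resolutions
text_overhead = 200                # instructions and action output per round
cu_tokens = rounds * (screenshot_tokens + text_overhead)

print(f"structured:   ~{api_tokens} tokens")
print(f"computer use: ~{cu_tokens} tokens")
print(f"multiplier:   ~{cu_tokens / api_tokens:.0f}x")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;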

&lt;p&gt;Anthropic documents screenshot token usage in its &lt;a href="https://docs.anthropic.com/claude/docs/computer-use" rel="noopener noreferrer"&gt;computer use documentation&lt;/a&gt;. The Hacker News discussion &lt;a href="https://news.ycombinator.com/item?id=48024859" rel="noopener noreferrer"&gt;Computer Use is 45x more expensive than structured APIs&lt;/a&gt; puts the common penalty around 30x to 50x, which matches the practical pattern you see when replaying the same workflow through both paths in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When structured APIs win
&lt;/h2&gt;

&lt;p&gt;Default to structured APIs when any of these are true.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The vendor exposes a schema
&lt;/h3&gt;

&lt;p&gt;Use the API if the vendor provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an OpenAPI spec&lt;/li&gt;
&lt;li&gt;a GraphQL schema&lt;/li&gt;
&lt;li&gt;REST docs&lt;/li&gt;
&lt;li&gt;a stable JSON endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a JSON shape exists, the model can usually fill it through a tool call.&lt;/p&gt;

&lt;p&gt;Example tool shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"update_deal_stage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Update a CRM deal to a new pipeline stage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"deal_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"qualified"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"proposal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed_won"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed_lost"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"deal_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stage"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is cheaper and easier to validate than asking an agent to open a CRM dashboard and click through a pipeline UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The task fits one or two endpoints
&lt;/h3&gt;

&lt;p&gt;These should be API calls, not browser tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Stripe customer.&lt;/li&gt;
&lt;li&gt;Update a HubSpot deal stage.&lt;/li&gt;
&lt;li&gt;Post a Slack message.&lt;/li&gt;
&lt;li&gt;Trigger a CI rerun.&lt;/li&gt;
&lt;li&gt;Search internal records.&lt;/li&gt;
&lt;li&gt;Generate an invoice.&lt;/li&gt;
&lt;li&gt;Add a user to a workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Routing these through a browser adds cost, latency, and failure modes without adding value.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The workflow runs unattended
&lt;/h3&gt;

&lt;p&gt;Cron jobs, webhooks, queue workers, and background agents need deterministic network calls.&lt;/p&gt;

&lt;p&gt;A screenshot loop can get stuck on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a changed button label&lt;/li&gt;
&lt;li&gt;an unexpected modal&lt;/li&gt;
&lt;li&gt;an expired session&lt;/li&gt;
&lt;li&gt;a slow-loading table&lt;/li&gt;
&lt;li&gt;a scroll position issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structured API calls are easier to retry, monitor, and alert on.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Latency matters
&lt;/h3&gt;

&lt;p&gt;A structured API call may return in hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;A computer-use loop with 15 browser rounds may take 30 to 90 seconds. If a user is waiting, that usually breaks the experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. You need test coverage
&lt;/h3&gt;

&lt;p&gt;Mocking JSON endpoints is straightforward in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;. Mocking a browser screenshot loop is much harder because every run depends on UI state.&lt;/p&gt;

&lt;h2&gt;
  
  
  When computer use is still useful
&lt;/h2&gt;

&lt;p&gt;Computer use is not useless. It is just expensive. Use it for workflows where a structured path is unavailable or not worth building.&lt;/p&gt;

&lt;h3&gt;
  
  
  Legacy vendor portals
&lt;/h3&gt;

&lt;p&gt;Some procurement, freight, benefits, and compliance portals have no public API. They may live behind ASP.NET sessions, old forms, or vendor-specific auth flows.&lt;/p&gt;

&lt;p&gt;If the alternative is maintaining brittle Selenium scripts that break every quarter, paying more per run can be acceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internal tools you cannot change
&lt;/h3&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a legacy ERP&lt;/li&gt;
&lt;li&gt;a client-owned CRM&lt;/li&gt;
&lt;li&gt;a SharePoint dashboard&lt;/li&gt;
&lt;li&gt;an admin portal maintained by another team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you cannot add endpoints and the workflow volume is low, computer use may be practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-off operator tasks
&lt;/h3&gt;

&lt;p&gt;A founder asking an agent to “research these 50 competitors and put the highlights in Notion” may not need a formal API contract.&lt;/p&gt;

&lt;p&gt;For one-off or rare work, computer use can be cheaper than building an integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflows blocked by terms of service
&lt;/h3&gt;

&lt;p&gt;Be careful here. Many “use a browser agent to scrape this website” requests violate vendor terms. The token bill may be the least important risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision framework
&lt;/h2&gt;

&lt;p&gt;Run every agent workflow through these checks before choosing computer use.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;If yes&lt;/th&gt;
&lt;th&gt;If no&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Does a documented API exist?&lt;/td&gt;
&lt;td&gt;Use the API.&lt;/td&gt;
&lt;td&gt;Continue.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can you ship a thin server-side adapter around a private endpoint?&lt;/td&gt;
&lt;td&gt;Build the adapter and expose JSON.&lt;/td&gt;
&lt;td&gt;Continue.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is the task one-off or low-volume, for example fewer than 100 runs/day?&lt;/td&gt;
&lt;td&gt;Computer use can be acceptable.&lt;/td&gt;
&lt;td&gt;Continue.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Are you comfortable paying 30x to 50x more token cost on every run?&lt;/td&gt;
&lt;td&gt;Use computer use.&lt;/td&gt;
&lt;td&gt;Stop and negotiate or build API access.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most workflows should land on the API path at check one or two. Computer use should survive only when both structured options are unavailable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What structured APIs look like in an agent
&lt;/h2&gt;

&lt;p&gt;Here is a simplified version of a “fetch yesterday’s failed payments” workflow using a structured tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_failed_payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List failed payments in a date range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Show yesterday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s failed payments.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;payments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stripe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PaymentIntent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent never opens the Stripe dashboard. It produces structured arguments, your runtime validates them, and your backend makes the request.&lt;/p&gt;

&lt;p&gt;The computer-use version would need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a browser.&lt;/li&gt;
&lt;li&gt;Log into Stripe.&lt;/li&gt;
&lt;li&gt;Screenshot the dashboard.&lt;/li&gt;
&lt;li&gt;Click the date picker.&lt;/li&gt;
&lt;li&gt;Screenshot again.&lt;/li&gt;
&lt;li&gt;Select the date range.&lt;/li&gt;
&lt;li&gt;Screenshot again.&lt;/li&gt;
&lt;li&gt;Find the failed status filter.&lt;/li&gt;
&lt;li&gt;Screenshot again.&lt;/li&gt;
&lt;li&gt;Extract table data from the UI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is slower, more expensive, and more fragile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the structured path with Apidog
&lt;/h2&gt;

&lt;p&gt;Teams often reach for computer use because nobody has designed a clean tool surface for the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; gives you a practical workflow for turning agent actions into documented API contracts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Model the operations as endpoints
&lt;/h3&gt;

&lt;p&gt;Start with the operations the agent actually needs.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /invoices/search
POST /deals/update-stage
POST /messages/send
POST /reports/failed-payments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each endpoint should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clear operation name&lt;/li&gt;
&lt;li&gt;a narrow request body&lt;/li&gt;
&lt;li&gt;explicit required fields&lt;/li&gt;
&lt;li&gt;predictable JSON responses&lt;/li&gt;
&lt;li&gt;validation rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A small set of focused endpoints can replace most browser-agent demos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Export the OpenAPI document
&lt;/h3&gt;

&lt;p&gt;Apidog can generate an OpenAPI 3.1 document from the design view.&lt;/p&gt;

&lt;p&gt;That document becomes the contract between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the model&lt;/li&gt;
&lt;li&gt;your agent runtime&lt;/li&gt;
&lt;li&gt;your backend&lt;/li&gt;
&lt;li&gt;your tests&lt;/li&gt;
&lt;li&gt;your docs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Feed the schema into your agent framework
&lt;/h3&gt;

&lt;p&gt;Common agent runtimes can consume structured tool schemas.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI &lt;code&gt;tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic tool use&lt;/li&gt;
&lt;li&gt;LangChain OpenAPI loaders&lt;/li&gt;
&lt;li&gt;DeepSeek tool-calling endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the model has the schema, it can call typed functions instead of navigating a UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Turn on the mock server
&lt;/h3&gt;

&lt;p&gt;Use Apidog’s mock server before connecting the agent to production.&lt;/p&gt;

&lt;p&gt;The mock server lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;test tool selection&lt;/li&gt;
&lt;li&gt;validate request bodies&lt;/li&gt;
&lt;li&gt;simulate success and error responses&lt;/li&gt;
&lt;li&gt;run the agent end-to-end&lt;/li&gt;
&lt;li&gt;avoid spending credits on live workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same pattern covered in &lt;a href="http://apidog.com/blog/api-tool-contract-first-development?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog’s contract-first development guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Replay and debug traffic
&lt;/h3&gt;

&lt;p&gt;When the agent runs, inspect the requests and responses.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing required fields&lt;/li&gt;
&lt;li&gt;invalid enum values&lt;/li&gt;
&lt;li&gt;wrong endpoint selection&lt;/li&gt;
&lt;li&gt;malformed dates&lt;/li&gt;
&lt;li&gt;unexpected retries&lt;/li&gt;
&lt;li&gt;fallback to browser use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replay a passing run next to a failing run to find where the tool call drifted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Ship the API contract
&lt;/h3&gt;

&lt;p&gt;The same Apidog project can support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;public API docs&lt;/li&gt;
&lt;li&gt;internal tool docs&lt;/li&gt;
&lt;li&gt;mocks&lt;/li&gt;
&lt;li&gt;QA&lt;/li&gt;
&lt;li&gt;request replay&lt;/li&gt;
&lt;li&gt;agent debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That turns the agent tool surface into a maintainable product surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid architecture: use both paths intentionally
&lt;/h2&gt;

&lt;p&gt;Most production agents end up hybrid.&lt;/p&gt;

&lt;p&gt;A reasonable default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;90% of operations use structured API tools.&lt;/li&gt;
&lt;li&gt;10% fall back to computer use for legacy portals.&lt;/li&gt;
&lt;li&gt;A router decides which path to use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal router rule can be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the requested operation exists in known_tools, call the structured tool.
If no matching tool exists, hand off to the browser agent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In code, that logic might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;KNOWN_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_failed_payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_deal_stage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_slack_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_invoice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_operation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;operation_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;KNOWN_TOOLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computer_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anthropic Claude 4.5, OpenAI GPT-5.5, and DeepSeek V4 can follow this routing pattern. For DeepSeek request examples, see &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;how to use DeepSeek V4 API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Track both paths separately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;request volume&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;token cost&lt;/li&gt;
&lt;li&gt;failure rate&lt;/li&gt;
&lt;li&gt;retry count&lt;/li&gt;
&lt;li&gt;fallback frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the browser path starts handling common operations, add the missing endpoint to your tool surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes to avoid
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skipping the schema
&lt;/h3&gt;

&lt;p&gt;Do not rely on prose-only system prompts for tool calls.&lt;/p&gt;

&lt;p&gt;Use JSON Schema with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;required fields&lt;/li&gt;
&lt;li&gt;enums&lt;/li&gt;
&lt;li&gt;formats&lt;/li&gt;
&lt;li&gt;descriptions&lt;/li&gt;
&lt;li&gt;examples where useful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strict schemas improve tool accuracy and make validation failures easy to catch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting the agent design schemas at runtime
&lt;/h3&gt;

&lt;p&gt;A schema is product surface. Do not let the model invent it dynamically.&lt;/p&gt;

&lt;p&gt;Author the schema in Apidog, version it, review it, and treat breaking changes like API changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging tokens but not actual cost
&lt;/h3&gt;

&lt;p&gt;Computer-use tokens often hide in image inputs. Many dashboards display text tokens clearly but price image tokens differently, so the number you see can understate what you actually pay.&lt;/p&gt;

&lt;p&gt;Use your provider’s billing console to validate real cost.&lt;/p&gt;
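
&lt;p&gt;As a rough cross-check, you can reconstruct blended cost yourself; the per-1K prices below are placeholders, not any vendor’s real rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Placeholder prices; substitute the rates from your provider's pricing page.
TEXT_PRICE_PER_1K = 0.003
IMAGE_PRICE_PER_1K = 0.004

def estimated_cost(text_tokens: int, image_tokens: int) -&gt; float:
    # Screenshot-heavy loops usually spend far more on image tokens than text.
    return (text_tokens / 1000) * TEXT_PRICE_PER_1K + (image_tokens / 1000) * IMAGE_PRICE_PER_1K
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the estimate and the billing console disagree, trust the console.&lt;/p&gt;
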

&lt;h3&gt;
  
  
  Confusing computer use with RPA
&lt;/h3&gt;

&lt;p&gt;RPA tools run scripted clicks against known selectors or DOM elements.&lt;/p&gt;

&lt;p&gt;Computer-use agents re-decide what to click from screenshots on every step.&lt;/p&gt;

&lt;p&gt;RPA is cheaper and more repeatable when the UI is stable. Computer use is more flexible but more expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ignoring latency
&lt;/h3&gt;

&lt;p&gt;A 45x token bill is only part of the problem.&lt;/p&gt;

&lt;p&gt;A 60-second browser loop can kick users out of flow. If a user is waiting, use an API whenever possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives before full computer use
&lt;/h2&gt;

&lt;p&gt;If a vendor has no public API, try these options before handing the workflow to a screenshot loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Headless browser scripts
&lt;/h3&gt;

&lt;p&gt;Playwright and Puppeteer cost nothing per run after development.&lt;/p&gt;

&lt;p&gt;Tradeoff: they break when the UI changes.&lt;/p&gt;

&lt;p&gt;Use them when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the workflow is high-volume&lt;/li&gt;
&lt;li&gt;the UI is stable&lt;/li&gt;
&lt;li&gt;selectors are reliable&lt;/li&gt;
&lt;li&gt;maintenance cost is acceptable&lt;/li&gt;
&lt;/ul&gt;
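
&lt;p&gt;A minimal Playwright sketch; the URL and selectors are placeholders for whatever the vendor UI actually exposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install playwright, then: playwright install chromium
import os
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://vendor.example.com/login")    # placeholder URL
    page.fill("#email", "ops@example.com")           # placeholder selectors
    page.fill("#password", os.environ["VENDOR_PASSWORD"])
    page.click("button[type=submit]")
    page.wait_for_selector("text=Dashboard")         # fails loudly when the UI changes
    report = page.inner_text(".report-table")
    browser.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
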

&lt;h3&gt;
  
  
  Vendor-published iPaaS connectors
&lt;/h3&gt;

&lt;p&gt;Zapier, Make, and similar platforms may already support the vendor.&lt;/p&gt;

&lt;p&gt;Use them when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speed matters&lt;/li&gt;
&lt;li&gt;the connector covers your workflow&lt;/li&gt;
&lt;li&gt;the seat cost is lower than custom integration work&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Private JSON endpoints
&lt;/h3&gt;

&lt;p&gt;Many dashboards call internal JSON APIs from the browser.&lt;/p&gt;

&lt;p&gt;You can inspect the network tab in DevTools, identify the private endpoint, and wrap it with your own server-side adapter.&lt;/p&gt;

&lt;p&gt;Document that adapter in Apidog and treat it as semi-stable. This pattern also appears in &lt;a href="http://apidog.com/blog/api-testing-without-postman-2026?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing without Postman&lt;/a&gt;.&lt;/p&gt;
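
&lt;p&gt;A sketch of such an adapter; the vendor URL, session header, and response shape are assumptions recovered from DevTools, so expect it to be semi-stable at best:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import httpx
from fastapi import FastAPI

app = FastAPI()
VENDOR_URL = "https://app.vendor.example.com/internal/api/payments"  # found via DevTools

@app.get("/payments/failed")
async def failed_payments():
    # Reuse an authenticated session cookie; rotate it when the vendor expires it.
    headers = {"Cookie": os.environ["VENDOR_SESSION_COOKIE"]}
    async with httpx.AsyncClient() as client:
        resp = await client.get(VENDOR_URL, params={"status": "failed"}, headers=headers)
        resp.raise_for_status()
        return resp.json()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
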

&lt;p&gt;Computer use should be the last resort, not the default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;p&gt;A fintech compliance team replaced a six-step computer-use Stripe report with three structured calls. Token cost dropped 92%, and runtime went from 41 seconds to 2 seconds.&lt;/p&gt;

&lt;p&gt;A B2B SaaS support agent kept computer use for one workflow: a vendor procurement portal with no API. Everything else routed through OpenAPI tool calls designed in Apidog. Monthly token spend dropped from $4,200 to $310.&lt;/p&gt;

&lt;p&gt;A solo founder used computer use once per week to refresh a Notion dashboard from a legacy ERP. The 45x cost on a weekly run was only a few cents. Building a full integration would have taken weeks. That is a good fit for computer use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The 45x cost gap is real enough to change your default architecture.&lt;/p&gt;

&lt;p&gt;Use structured APIs designed in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; for workflows with stable endpoints. Use computer use only when no API exists and the workflow runs rarely enough that the extra token cost is acceptable.&lt;/p&gt;

&lt;p&gt;Five practical takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer use often costs 30x to 50x more tokens than an equivalent structured API call.&lt;/li&gt;
&lt;li&gt;A documented endpoint plus JSON Schema beats a screenshot loop on cost, latency, and reliability.&lt;/li&gt;
&lt;li&gt;Hybrid stacks are normal: design the common path in Apidog and fall back to computer use for legacy workflows.&lt;/li&gt;
&lt;li&gt;Mock the structured tool surface before connecting it to production.&lt;/li&gt;
&lt;li&gt;Track structured calls and browser-agent calls separately so cost drift is visible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next step: open &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, create a project for your agent’s tool surface, and turn on the mock server. Within an hour, you should know whether your browser workflow can become two structured calls instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is computer use ever cheaper than a structured API?
&lt;/h3&gt;

&lt;p&gt;Not per run. Screenshot tokens dominate.&lt;/p&gt;

&lt;p&gt;Computer use can be cheaper in total only when integration cost would exceed years of run cost. For example, a weekly run that burns an extra $0.40 in tokens adds about $21 a year, so a $2,000 integration would take nearly a century to pay back. That usually means a very low-volume workflow against a system with no API.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I mock a JSON tool surface for an agent?
&lt;/h3&gt;

&lt;p&gt;Design the endpoints in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, turn on the built-in mock server, and point your agent at the mock URL.&lt;/p&gt;

&lt;p&gt;Every request returns realistic JSON without hitting production. For a related workflow, see &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tools for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use OpenAPI for tool calls in any model?
&lt;/h3&gt;

&lt;p&gt;Yes. OpenAI &lt;code&gt;tools&lt;/code&gt;, Anthropic &lt;code&gt;tool_use&lt;/code&gt;, and DeepSeek V4 tool-calling endpoints can consume OpenAPI 3.1-style schemas.&lt;/p&gt;

&lt;p&gt;Apidog exports the schema cleanly. See &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;how to use DeepSeek V4 API&lt;/a&gt; for the DeepSeek request shape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does GPT-5.5 still support computer use?
&lt;/h3&gt;

&lt;p&gt;OpenAI ships computer use through Operator and the Responses API. The cost profile is similar to Anthropic’s screenshot-based approach. The recommendation here applies regardless of vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about Skyvern, browser-use, and other open-source agents?
&lt;/h3&gt;

&lt;p&gt;The same math applies.&lt;/p&gt;

&lt;p&gt;Open-source browser agents may reduce per-call price by using cheaper models, but they still require multiple rounds and screenshots. Structured APIs still win where APIs exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know when an endpoint is missing for an agent task?
&lt;/h3&gt;

&lt;p&gt;Watch for repeated fallback to browser use.&lt;/p&gt;

&lt;p&gt;If the agent keeps trying to use a browser for the same operation, that is a missing endpoint in your tool surface. Add it in Apidog, regenerate the schema, and route the agent back through structured calls.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>TradingAgents: Open-Source LLM Trading Framework</title>
      <dc:creator>Hassann</dc:creator>
      <pubDate>Thu, 07 May 2026 03:59:19 +0000</pubDate>
      <link>https://forem.com/hassann/tradingagentsopen-source-llm-trading-framework-394e</link>
      <guid>https://forem.com/hassann/tradingagentsopen-source-llm-trading-framework-394e</guid>
      <description>&lt;p&gt;Most multi-agent LLM frameworks promise more than they deliver. &lt;a href="https://github.com/TauricResearch/TradingAgents" rel="noopener noreferrer"&gt;TradingAgents&lt;/a&gt; is one of the rare exceptions: open-sourced by Tauric Research alongside an &lt;a href="https://arxiv.org/abs/2412.20138" rel="noopener noreferrer"&gt;arXiv paper&lt;/a&gt;, now at version 0.2.4, and built around a clean role decomposition that mirrors a real research desk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apidog.com/?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation" class="crayons-btn crayons-btn--primary"&gt;Try Apidog today&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This guide focuses on what TradingAgents does, what changed in v0.2.4, how its agent architecture works, and how to test the LLM and market-data layers underneath with &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;. If you are already thinking about agent contracts, pair this with the &lt;a href="http://apidog.com/blog/how-to-write-agents-md-files?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;agents.md guide for API teams&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TradingAgents is a multi-agent LLM trading framework from Tauric Research, described in &lt;a href="https://arxiv.org/abs/2412.20138" rel="noopener noreferrer"&gt;arXiv 2412.20138&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It decomposes trading into specialist agents: Fundamentals Analyst, Sentiment Analyst, News Analyst, Technical Analyst, Bull/Bear Researchers, Trader, and Risk Management agents.&lt;/li&gt;
&lt;li&gt;v0.2.4 adds structured-output agents, LangGraph checkpoint resume, persistent decision logs, Docker support, and more LLM providers.&lt;/li&gt;
&lt;li&gt;It can run against OpenAI-compatible endpoints, which makes hosted, local, and self-hosted models easier to swap.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; to mock market-data APIs, replay LLM traffic, assert structured output, and compare provider behavior.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apidog.com/download?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Download Apidog&lt;/a&gt; if you want to wire these checks into CI before trusting agent output.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What TradingAgents is
&lt;/h2&gt;

&lt;p&gt;TradingAgents is a Python package and CLI for running a multi-agent trading research workflow.&lt;/p&gt;

&lt;p&gt;Instead of asking one model to “analyze this stock,” the framework splits the workflow into roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fundamentals Analyst&lt;/li&gt;
&lt;li&gt;Sentiment Analyst&lt;/li&gt;
&lt;li&gt;News Analyst&lt;/li&gt;
&lt;li&gt;Technical Analyst&lt;/li&gt;
&lt;li&gt;Bull Researcher&lt;/li&gt;
&lt;li&gt;Bear Researcher&lt;/li&gt;
&lt;li&gt;Research Manager&lt;/li&gt;
&lt;li&gt;Trader&lt;/li&gt;
&lt;li&gt;Risk Management agents&lt;/li&gt;
&lt;li&gt;Portfolio Manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A specific role prompt.&lt;/li&gt;
&lt;li&gt;A focused toolset.&lt;/li&gt;
&lt;li&gt;A place in the workflow graph.&lt;/li&gt;
&lt;li&gt;A defined output consumed by the next stage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project README frames it as research code, not investment advice. That distinction matters. The useful engineering lesson is not “let an LLM trade for you.” It is how to design a multi-agent system with specialist roles, debate, structured decisions, and an audit trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.2.4 shipped
&lt;/h2&gt;

&lt;p&gt;The v0.2.4 release is important because it improves reliability around long-running agent workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured-output agents
&lt;/h3&gt;

&lt;p&gt;The Research Manager, Trader, and Portfolio Manager now emit structured output through either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Responses API&lt;/li&gt;
&lt;li&gt;Anthropic tool-use channel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That replaces brittle free-text parsing with typed JSON-style outputs, which makes downstream automation safer.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangGraph checkpoint resume
&lt;/h3&gt;

&lt;p&gt;TradingAgents uses LangGraph for orchestration. v0.2.4 adds checkpoint resume support, so a run can recover from interruptions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM provider &lt;code&gt;429&lt;/code&gt; responses&lt;/li&gt;
&lt;li&gt;market-data API throttling&lt;/li&gt;
&lt;li&gt;local process failures&lt;/li&gt;
&lt;li&gt;network issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of restarting the full workflow, you can resume from a saved checkpoint.&lt;/p&gt;
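
&lt;p&gt;A minimal, self-contained sketch of the resume mechanism. This is generic LangGraph wiring, not TradingAgents’ actual graph; the node, state fields, and thread ID are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import TypedDict

from langgraph.graph import StateGraph
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

class RunState(TypedDict):
    ticker: str
    report: str

def analyst(state: RunState) -&gt; RunState:
    return {"ticker": state["ticker"], "report": "stub report"}

builder = StateGraph(RunState)
builder.add_node("analyst", analyst)
builder.set_entry_point("analyst")
builder.set_finish_point("analyst")

with SqliteSaver.from_conn_string("checkpoints.sqlite") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)
    # Re-invoking with the same thread_id resumes from the last saved checkpoint.
    config = {"configurable": {"thread_id": "AAPL-2026-04-30"}}
    graph.invoke({"ticker": "AAPL", "report": ""}, config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
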

&lt;h3&gt;
  
  
  Persistent decision log
&lt;/h3&gt;

&lt;p&gt;Trader decisions are written to a SQLite log with reasoning, inputs, and timestamps.&lt;/p&gt;

&lt;p&gt;That gives you an audit trail you can inspect later or use for evaluation.&lt;/p&gt;
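
&lt;p&gt;A quick way to inspect it; the database path, table, and column names below are assumptions, so check the real schema with &lt;code&gt;.schema&lt;/code&gt; in the sqlite3 shell first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Path, table, and columns are assumed for illustration.
conn = sqlite3.connect("tradingagents/results/decisions.sqlite")
rows = conn.execute(
    "SELECT ticker, action, confidence, created_at "
    "FROM decisions ORDER BY created_at DESC LIMIT 10"
)
for row in rows:
    print(row)
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
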

&lt;h3&gt;
  
  
  More LLM providers
&lt;/h3&gt;

&lt;p&gt;v0.2.4 added support for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek&lt;/li&gt;
&lt;li&gt;Qwen&lt;/li&gt;
&lt;li&gt;GLM&lt;/li&gt;
&lt;li&gt;Azure OpenAI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those join the existing provider matrix that includes OpenAI, Anthropic, Gemini, and Grok.&lt;/p&gt;

&lt;p&gt;If you want to compare cost and reasoning behavior, you can test DeepSeek through its OpenAI-compatible endpoint. The request pattern is covered in the &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 API guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker and Windows fixes
&lt;/h3&gt;

&lt;p&gt;The release also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dockerfile support&lt;/li&gt;
&lt;li&gt;a Windows UTF-8/path encoding fix from v0.2.3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not exciting, but useful if you want repeatable local or CI runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  TradingAgents architecture
&lt;/h2&gt;

&lt;p&gt;A complete TradingAgents run follows this flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CLI accepts a ticker and date.&lt;/li&gt;
&lt;li&gt;The Analyst Team fans out.&lt;/li&gt;
&lt;li&gt;Each analyst fetches data and writes a report.&lt;/li&gt;
&lt;li&gt;The Bull Researcher writes a bullish thesis.&lt;/li&gt;
&lt;li&gt;The Bear Researcher writes a bearish thesis.&lt;/li&gt;
&lt;li&gt;The researchers debate.&lt;/li&gt;
&lt;li&gt;The Research Manager synthesizes the debate into a recommendation.&lt;/li&gt;
&lt;li&gt;The Trader reads the recommendation and decision history.&lt;/li&gt;
&lt;li&gt;The Trader produces a trade plan.&lt;/li&gt;
&lt;li&gt;Risk Management agents review the plan from aggressive, conservative, and neutral perspectives.&lt;/li&gt;
&lt;li&gt;The Portfolio Manager approves or sends the plan back.&lt;/li&gt;
&lt;li&gt;The final decision is written to SQLite.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The highest LLM cost usually appears in the debate and risk-review stages because multiple agents reason over the same context.&lt;/p&gt;

&lt;p&gt;That is also where smaller models tend to fail. A weak local model may loop, repeat arguments, or produce shallow Bull/Bear debates. Stronger reasoning models generally produce more useful tradeoffs and cleaner structured conclusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares to LangGraph and CrewAI
&lt;/h2&gt;

&lt;p&gt;TradingAgents is not a general-purpose agent framework in the same way LangGraph or CrewAI is.&lt;/p&gt;

&lt;p&gt;Think of the layers like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: low-level graph orchestration for agent workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: general-purpose role-based multi-agent framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TradingAgents&lt;/strong&gt;: domain-specific implementation for trading research.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want maximum flexibility, start with LangGraph.&lt;/p&gt;

&lt;p&gt;If you want a general multi-agent abstraction, evaluate CrewAI.&lt;/p&gt;

&lt;p&gt;If you want to study a concrete, opinionated multi-agent workflow with debate, decision, risk review, and logging, read TradingAgents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you need to test the API layers
&lt;/h2&gt;

&lt;p&gt;TradingAgents depends on two unstable surfaces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Market-data APIs&lt;/li&gt;
&lt;li&gt;LLM provider APIs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both can break runs in ways that are hard to debug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Market-data APIs fail through drift
&lt;/h3&gt;

&lt;p&gt;Common issues include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inconsistent free-tier rate limits&lt;/li&gt;
&lt;li&gt;renamed fields&lt;/li&gt;
&lt;li&gt;missing fields&lt;/li&gt;
&lt;li&gt;different trading-day boundaries&lt;/li&gt;
&lt;li&gt;different historical-data formats between vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A run can work one day and fail the next because a vendor changed a field such as &lt;code&gt;regularMarketTime&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM provider APIs fail through shape and cost
&lt;/h3&gt;

&lt;p&gt;Common issues include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changed response formats&lt;/li&gt;
&lt;li&gt;tool-call parsing differences&lt;/li&gt;
&lt;li&gt;reasoning-mode cost spikes&lt;/li&gt;
&lt;li&gt;provider-specific structured-output behavior&lt;/li&gt;
&lt;li&gt;token usage that varies by role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is to keep saved, replayable request collections with assertions. That is where &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; fits. The same pattern is useful for protocol-level testing, as described in the &lt;a href="http://apidog.com/blog/mcp-server-testing-apidog?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;MCP server testing playbook&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mock market-data APIs with Apidog
&lt;/h2&gt;

&lt;p&gt;Use this workflow to make TradingAgents test runs deterministic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: define upstream endpoints
&lt;/h3&gt;

&lt;p&gt;Create an Apidog project and add the market-data endpoints TradingAgents calls, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yahoo Finance&lt;/li&gt;
&lt;li&gt;FinnHub&lt;/li&gt;
&lt;li&gt;Polygon&lt;/li&gt;
&lt;li&gt;OpenBB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each endpoint, save:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;method&lt;/li&gt;
&lt;li&gt;path&lt;/li&gt;
&lt;li&gt;query parameters&lt;/li&gt;
&lt;li&gt;headers&lt;/li&gt;
&lt;li&gt;example response body&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use real vendor responses as fixtures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: enable the mock server
&lt;/h3&gt;

&lt;p&gt;Turn on Apidog’s mock server and point TradingAgents’ tool configuration at the mock URL.&lt;/p&gt;

&lt;p&gt;The Fundamentals Analyst, Technical Analyst, and other data-consuming agents now receive deterministic data instead of live vendor responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: detect vendor drift
&lt;/h3&gt;

&lt;p&gt;On a schedule, replay the live vendor endpoints and compare their response shapes against your saved fixtures.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;renamed fields&lt;/li&gt;
&lt;li&gt;removed fields&lt;/li&gt;
&lt;li&gt;newly required fields&lt;/li&gt;
&lt;li&gt;type changes&lt;/li&gt;
&lt;li&gt;empty values where data previously existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same contract-first workflow described in &lt;a href="http://apidog.com/blog/api-tool-contract-first-development?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;contract-first API development&lt;/a&gt;.&lt;/p&gt;
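
&lt;p&gt;A minimal shape-diff sketch for that comparison; it checks only top-level keys and types, and the fixture path and vendor URL are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import httpx

def shape(payload: dict) -&gt; dict:
    return {key: type(value).__name__ for key, value in payload.items()}

with open("fixtures/quote_AAPL.json") as f:  # saved vendor response
    fixture_shape = shape(json.load(f))

live = httpx.get("https://vendor.example.com/v1/quote",  # placeholder URL
                 params={"symbol": "AAPL"}).json()
live_shape = shape(live)

missing = set(fixture_shape) - set(live_shape)
changed = {key for key in fixture_shape.keys() &amp; live_shape.keys()
           if fixture_shape[key] != live_shape[key]}
if missing or changed:
    print(f"drift detected. missing: {missing}, type changes: {changed}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
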

&lt;h2&gt;
  
  
  Test the LLM provider layer
&lt;/h2&gt;

&lt;p&gt;Before scaling TradingAgents runs, test three things.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cost per role
&lt;/h3&gt;

&lt;p&gt;Run a single ticker and capture token usage per agent.&lt;/p&gt;

&lt;p&gt;At minimum, track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fundamentals Analyst tokens&lt;/li&gt;
&lt;li&gt;Sentiment Analyst tokens&lt;/li&gt;
&lt;li&gt;News Analyst tokens&lt;/li&gt;
&lt;li&gt;Technical Analyst tokens&lt;/li&gt;
&lt;li&gt;Bull/Bear debate tokens&lt;/li&gt;
&lt;li&gt;Risk Management tokens&lt;/li&gt;
&lt;li&gt;final decision tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Bull/Bear debate should usually be more expensive than a single analyst pass. If it is not, the model may be short-circuiting the debate.&lt;/p&gt;

&lt;p&gt;Use &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; request logs to capture provider traffic and compare token usage across runs.&lt;/p&gt;
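
&lt;p&gt;A small aggregation sketch, assuming you export request logs as JSON Lines with a role label and an OpenAI-style usage object (the file name and fields are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from collections import Counter

tokens_by_role = Counter()
with open("provider_log.jsonl") as f:  # exported request log, one JSON object per line
    for line in f:
        entry = json.loads(line)
        tokens_by_role[entry["role"]] += entry["usage"]["total_tokens"]

for role, tokens in tokens_by_role.most_common():
    print(f"{role:25s} {tokens:10d} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
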

&lt;h3&gt;
  
  
  2. Structured output shape
&lt;/h3&gt;

&lt;p&gt;For v0.2.4 structured-output agents, add assertions that verify required fields exist.&lt;/p&gt;

&lt;p&gt;For example, assert that the Trader output contains fields like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"buy | sell | hold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add JSONPath checks such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$.action
$.confidence
$.reasoning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A structured-output regression is dangerous because downstream code may fail only after the model response is already accepted.&lt;/p&gt;
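
&lt;p&gt;If you also guard in application code, a minimal check mirroring the example fields might look like this, using the &lt;code&gt;jsonschema&lt;/code&gt; package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from jsonschema import ValidationError, validate  # pip install jsonschema

TRADER_SCHEMA = {
    "type": "object",
    "required": ["action", "confidence", "reasoning"],
    "properties": {
        "action": {"enum": ["buy", "sell", "hold"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasoning": {"type": "string"},
    },
}

def accept(output: dict) -&gt; bool:
    # Reject the response before any downstream stage consumes it.
    try:
        validate(instance=output, schema=TRADER_SCHEMA)
        return True
    except ValidationError:
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
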

&lt;h3&gt;
  
  
  3. Provider parity
&lt;/h3&gt;

&lt;p&gt;When swapping providers, do not compare one run against one run.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select a fixed ticker basket.&lt;/li&gt;
&lt;li&gt;Run the same dates through provider A.&lt;/li&gt;
&lt;li&gt;Run the same dates through provider B.&lt;/li&gt;
&lt;li&gt;Compare the SQLite decision logs.&lt;/li&gt;
&lt;li&gt;Measure how often conclusions diverge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenAI vs DeepSeek
30 tickers
2 debate rounds
same market-data fixtures
same date range
compare final action + confidence + reasoning summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 API guide&lt;/a&gt; and the &lt;a href="http://apidog.com/blog/how-to-use-gpt-5-5-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;GPT-5.5 API guide&lt;/a&gt; for provider request patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal TradingAgents run
&lt;/h2&gt;

&lt;p&gt;A basic run looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/TauricResearch/TradingAgents
&lt;span class="nb"&gt;cd &lt;/span&gt;TradingAgents
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FINNHUB_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;

python &lt;span class="nt"&gt;-m&lt;/span&gt; tradingagents.cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ticker&lt;/span&gt; AAPL &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--date&lt;/span&gt; 2026-04-30 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; gpt-5.5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rounds&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two debate rounds are a practical minimum for testing the Bull/Bear workflow.&lt;/p&gt;

&lt;p&gt;The output is written under:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tradingagents/results/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expect JSON artifacts plus a Markdown decision summary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swap to DeepSeek
&lt;/h2&gt;

&lt;p&gt;To test a different reasoning provider, configure the provider and model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;

python &lt;span class="nt"&gt;-m&lt;/span&gt; tradingagents.cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ticker&lt;/span&gt; AAPL &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--date&lt;/span&gt; 2026-04-30 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; deepseek-v4-pro &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; deepseek &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rounds&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern applies to Qwen, GLM, or local OpenAI-compatible servers such as Ollama or vLLM.&lt;/p&gt;

&lt;p&gt;For local model options, see the &lt;a href="http://apidog.com/blog/best-local-llms-2026?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;best local LLMs of 2026 post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Running with a model that is too small
&lt;/h3&gt;

&lt;p&gt;Small local models can produce repetitive Bull/Bear debates that never converge.&lt;/p&gt;

&lt;p&gt;For serious evaluation, use at least a mid-tier reasoning model. Realistic options include DeepSeek V4 Flash, Qwen 3.6 32B, GPT-5.5, and Claude 4.5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skipping market-data caching
&lt;/h3&gt;

&lt;p&gt;Each analyst can call the data layer separately. Without caching, one ticker run can fan out into multiple vendor requests.&lt;/p&gt;

&lt;p&gt;Enable caching before running batches.&lt;/p&gt;
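
&lt;p&gt;A minimal memoization sketch so every analyst shares one vendor call per ticker and date; &lt;code&gt;fetch_daily_bars&lt;/code&gt; and the URL are placeholders for the real data layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from functools import lru_cache

import httpx

@lru_cache(maxsize=512)
def fetch_daily_bars(ticker: str, date: str) -&gt; str:
    # One network call per unique (ticker, date); later callers hit the cache.
    resp = httpx.get("https://vendor.example.com/v1/bars",  # placeholder URL
                     params={"symbol": ticker, "date": date})
    resp.raise_for_status()
    return resp.text  # immutable return value keeps cached results safe to share
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
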

&lt;h3&gt;
  
  
  Treating research code as a trading bot
&lt;/h3&gt;

&lt;p&gt;TradingAgents is research code. Backtest results are sensitive to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model choice&lt;/li&gt;
&lt;li&gt;prompt seed&lt;/li&gt;
&lt;li&gt;debate length&lt;/li&gt;
&lt;li&gt;data quality&lt;/li&gt;
&lt;li&gt;provider behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat outputs as hypotheses, not executable trading strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not logging token spend
&lt;/h3&gt;

&lt;p&gt;A single ticker run can cost anywhere from cents to several dollars depending on model and debate rounds.&lt;/p&gt;

&lt;p&gt;Track per-run cost in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog’s&lt;/a&gt; replay history so debate loops do not silently burn budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardcoding one provider
&lt;/h3&gt;

&lt;p&gt;The framework supports multiple providers. Use that to your advantage.&lt;/p&gt;

&lt;p&gt;Before committing to one provider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the same ticker set through several models.&lt;/li&gt;
&lt;li&gt;Compare decision logs.&lt;/li&gt;
&lt;li&gt;Compare token cost.&lt;/li&gt;
&lt;li&gt;Review failure modes.&lt;/li&gt;
&lt;li&gt;Pick based on both cost and behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where Apidog fits in the development loop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design the API surface
&lt;/h3&gt;

&lt;p&gt;Before wiring TradingAgents to live vendors, model each market-data endpoint in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That forces you to identify which response fields the agents actually need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run local CI against mocks
&lt;/h3&gt;

&lt;p&gt;Use Apidog’s mock server for unit and integration tests.&lt;/p&gt;

&lt;p&gt;That keeps tests independent of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vendor uptime&lt;/li&gt;
&lt;li&gt;market hours&lt;/li&gt;
&lt;li&gt;rate limits&lt;/li&gt;
&lt;li&gt;network failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same workflow is covered in &lt;a href="http://apidog.com/blog/api-testing-without-postman-2026?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing without Postman&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diff live responses against fixtures
&lt;/h3&gt;

&lt;p&gt;Schedule a weekly replay of live vendor endpoints.&lt;/p&gt;

&lt;p&gt;Compare the live response shape against saved fixtures and alert on schema drift. This gives you an early warning when the data layer changes underneath the agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this pattern matters beyond trading
&lt;/h2&gt;

&lt;p&gt;TradingAgents is useful even if you never build trading software.&lt;/p&gt;

&lt;p&gt;The architecture transfers to other multi-step agent workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customer support triage&lt;/li&gt;
&lt;li&gt;code review&lt;/li&gt;
&lt;li&gt;compliance review&lt;/li&gt;
&lt;li&gt;research summarization&lt;/li&gt;
&lt;li&gt;incident analysis&lt;/li&gt;
&lt;li&gt;security review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reusable pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;specialist agents -&amp;gt; debate/review -&amp;gt; synthesis -&amp;gt; decision -&amp;gt; audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That structure is easier to test than a single large prompt because each stage has a defined responsibility and output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world examples
&lt;/h2&gt;

&lt;p&gt;A quant research student can run the same 30-ticker basket through DeepSeek V4, GPT-5.5, and Claude 4.5, then use Apidog logs to compare request/response behavior.&lt;/p&gt;

&lt;p&gt;A fintech engineer can reuse the multi-agent pattern for code reviews: security agent, performance agent, style agent, then a synthesizer that writes the final PR comment.&lt;/p&gt;

&lt;p&gt;A solo developer can run TradingAgents nightly on a 10-ticker watchlist and log every decision into a database while using Apidog mocks for weekend test runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;TradingAgents is a practical reference implementation for multi-agent LLM workflows. It uses specialist roles, debate, risk review, structured decisions, and persistent logs instead of a single monolithic prompt.&lt;/p&gt;

&lt;p&gt;v0.2.4 makes the project more useful for production-style experimentation with structured outputs, checkpoint resume, SQLite decision logs, Docker support, and broader provider coverage.&lt;/p&gt;

&lt;p&gt;The key implementation lesson: test the layers underneath the agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock market-data vendors in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Assert structured LLM outputs.&lt;/li&gt;
&lt;li&gt;Log token cost by role.&lt;/li&gt;
&lt;li&gt;Compare providers with repeatable fixtures.&lt;/li&gt;
&lt;li&gt;Treat final decisions as research artifacts, not trading instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next step: clone the repo, run one ticker, and route the upstream calls through an &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt; mock server. You should know within an hour whether the architecture fits your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is TradingAgents safe to use with real money?
&lt;/h3&gt;

&lt;p&gt;The repo describes TradingAgents as research code, not financial advice. Treat its output as a hypothesis. Running it against a live brokerage is at your own risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which LLM provider gives the best cost-quality tradeoff?
&lt;/h3&gt;

&lt;p&gt;For early 2026 workloads, DeepSeek V4 Flash with thinking mode is a strong cost-quality option. See the &lt;a href="http://apidog.com/blog/how-to-use-deepseek-v4-api?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;DeepSeek V4 API guide&lt;/a&gt; for request details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run TradingAgents on local models?
&lt;/h3&gt;

&lt;p&gt;Yes. Multi-provider support allows OpenAI-compatible local endpoints from tools such as Ollama, vLLM, and LM Studio. See the &lt;a href="http://apidog.com/blog/best-local-llms-2026?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;best local LLMs of 2026 post&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I mock market-data APIs?
&lt;/h3&gt;

&lt;p&gt;Define each vendor endpoint in &lt;a href="https://apidog.com?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;Apidog&lt;/a&gt;, enable the mock server, and point TradingAgents’ tool config at the mock URL. The same pattern is covered in &lt;a href="http://apidog.com/blog/api-testing-tool-qa-engineers?utm_source=dev.to&amp;amp;utm_medium=wanda&amp;amp;utm_content=n8n-post-automation"&gt;API testing tools for QA engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What hardware do I need?
&lt;/h3&gt;

&lt;p&gt;If you call hosted LLMs such as OpenAI, Anthropic, or DeepSeek, any laptop with Python 3.10+ should be enough.&lt;/p&gt;

&lt;p&gt;If you serve local models, hardware depends on model size. Larger reasoning models need substantially more GPU memory than small local models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does it support after-hours and weekend simulation?
&lt;/h3&gt;

&lt;p&gt;TradingAgents can run against historical data for a selected date. Live trading is a separate problem that the framework does not claim to solve.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it compare to other multi-agent frameworks?
&lt;/h3&gt;

&lt;p&gt;TradingAgents is domain-specific. CrewAI, AutoGen, and LangGraph are general-purpose. Use TradingAgents to study a concrete multi-agent implementation; use LangGraph or another general framework when you need to build your own agent graph from scratch.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
