<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Manas Mudbari</title>
    <description>The latest articles on Forem by Manas Mudbari (@manasmudbari).</description>
    <link>https://forem.com/manasmudbari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3482465%2F83858ffd-59ec-4d05-885e-87ba3587c3a3.JPG</url>
      <title>Forem: Manas Mudbari</title>
      <link>https://forem.com/manasmudbari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/manasmudbari"/>
    <language>en</language>
    <item>
      <title>Can AI Remember the Market? Teaching LLMs to Detect When the Rules Change</title>
      <dc:creator>Manas Mudbari</dc:creator>
      <pubDate>Sun, 08 Mar 2026 17:06:02 +0000</pubDate>
      <link>https://forem.com/manasmudbari/can-ai-remember-the-market-teaching-llms-to-detect-when-the-rules-change-3g5f</link>
      <guid>https://forem.com/manasmudbari/can-ai-remember-the-market-teaching-llms-to-detect-when-the-rules-change-3g5f</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We built a memory system for LLMs to track Bitcoin market regimes. The LLM can't predict tomorrow's price any better than a coin flip (nobody can, honestly). But it &lt;em&gt;can&lt;/em&gt; detect major market regime changes with zero false alarms, and unlike every statistical method, it tells you &lt;em&gt;why&lt;/em&gt; the regime changed in plain English. That explainability is the real contribution.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem: AI Models Have Amnesia
&lt;/h2&gt;

&lt;p&gt;Imagine you trained an AI model to predict Bitcoin prices during the 2020-2021 bull run, a period when institutional investors were piling in, central banks were printing money, and everything was going up. The model learns the rules of that world pretty well.&lt;/p&gt;

&lt;p&gt;Then 2022 arrives. The Fed starts aggressively hiking interest rates. Crypto exchange FTX collapses. Luna implodes. The entire market enters a prolonged bear phase.&lt;/p&gt;

&lt;p&gt;Your model, still operating on the old rules, has no idea what hit it.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;concept drift&lt;/strong&gt;: when the underlying patterns that a model learned no longer reflect reality. It's one of the most underappreciated problems in applied machine learning, especially in financial markets where the "rules" can change overnight.&lt;/p&gt;

&lt;p&gt;Traditional fixes are crude: either retrain the model on new data (expensive and reactive), or use statistical alarms that fire when something looks statistically unusual (they tell you &lt;em&gt;that&lt;/em&gt; something changed, but never &lt;em&gt;why&lt;/em&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Idea: Give the AI a Memory
&lt;/h2&gt;

&lt;p&gt;Large language models (LLMs) like GPT-4 have been trained on enormous amounts of text, including financial news, market commentary, earnings reports, and macroeconomic analysis. They already "know" things like "when the Fed raises rates, risk assets tend to fall" or "Bitcoin historically rallies in the months before a halving."&lt;/p&gt;

&lt;p&gt;What if we could structure that knowledge into a formal &lt;strong&gt;memory system&lt;/strong&gt; that the LLM consults before making predictions? Instead of treating every 24-hour window as if it exists in isolation, the model would have context: &lt;em&gt;what regime is the market in right now, what has happened before in similar conditions, and what did the model itself predict recently?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the core idea of this paper. We built four types of adaptive memory and tested them on seven years of Bitcoin data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Memory Types
&lt;/h2&gt;

&lt;p&gt;Think of each memory type as a different "cheat sheet" the AI gets to read before making its prediction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Regime Memory&lt;/strong&gt;&lt;br&gt;
The AI is told what "mode" the market is currently in (e.g., "Macro Bear Market") and what characteristics define that mode. Like giving a student a study guide that says: "Right now we're in a period defined by Fed tightening, exchange failures, and risk-off sentiment."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. News Memory&lt;/strong&gt;&lt;br&gt;
Recent headlines are ranked by importance and fed to the model, along with any major events that happened during the same calendar window in previous years. Think of it as saying: "Here are the most important things happening right now, and here's what happened at this time of year historically."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Similarity Memory&lt;/strong&gt;&lt;br&gt;
The current market conditions (price momentum, volatility, volume) are compared against every similar-looking period in the past seven years. The top five most similar historical windows are retrieved, along with what actually happened next. Essentially: "The last five times the market looked like this, here's what followed."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Relative Memory&lt;/strong&gt;&lt;br&gt;
The AI is shown a log of its own recent predictions: how accurate it's been, whether it's been systematically biased toward UP or DOWN, and what its last seven predictions were. This lets it self-correct: "I've been wrong six times in a row predicting UP, maybe I should reconsider."&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Tests
&lt;/h2&gt;

&lt;p&gt;We ran the system on two tasks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Task 1: Predict Tomorrow's Price Direction
&lt;/h3&gt;

&lt;p&gt;Given the last 7 days of Bitcoin price data, predict whether the price will be higher or lower 24 hours from now.&lt;/p&gt;
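&lt;p&gt;Concretely, the ground-truth labels come straight from consecutive daily closes (a sketch; the real pipeline presumably works on the full OHLCV series):&lt;/p&gt;

```python
def direction_labels(closes):
    """Binary next-day direction labels from a list of daily closing prices.

    The label for day t is "UP" if the close 24 hours later is higher,
    else "DOWN". The final day has no next close, so it gets no label.
    """
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        labels.append("UP" if tomorrow > today else "DOWN")
    return labels
```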

&lt;h3&gt;
  
  
  Task 2: Detect Regime Changes
&lt;/h3&gt;

&lt;p&gt;Given the current market conditions and recent news, determine whether Bitcoin has transitioned into a fundamentally new market regime.&lt;/p&gt;

&lt;p&gt;We tested against 6 real historical regime transitions that occurred between 2017 and 2024.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Found
&lt;/h2&gt;

&lt;h3&gt;
  
  
  On Price Prediction: Nobody Wins
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LSTM (traditional neural net)&lt;/td&gt;
&lt;td&gt;50.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM with no memory&lt;/td&gt;
&lt;td&gt;50.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM + Similarity Memory&lt;/td&gt;
&lt;td&gt;51.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM + News Memory&lt;/td&gt;
&lt;td&gt;48.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM + Regime Memory&lt;/td&gt;
&lt;td&gt;47.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM + Relative Memory&lt;/td&gt;
&lt;td&gt;49.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every method lands within a whisker of the 50% coin-flip baseline. This is an important and honest result: short-term Bitcoin price prediction is genuinely hard, no matter how sophisticated the model. Nobody has cracked it, and we didn't pretend to either.&lt;/p&gt;

&lt;p&gt;The statistical analysis confirmed that none of the differences between methods are statistically significant. In plain terms: the margin of error swallows all the differences.&lt;/p&gt;
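&lt;p&gt;A back-of-envelope check makes the point (the 500-day sample size below is a hypothetical stand-in, not the paper's actual test-set length): the worst-case margin of error on an accuracy estimate is wider than the whole spread in the table.&lt;/p&gt;

```python
import math

def coinflip_margin(n, z=1.96):
    """95% margin of error for an accuracy estimate near 50%.

    The worst-case standard error of a proportion is sqrt(0.25 / n).
    """
    return z * math.sqrt(0.25 / n)

# With a hypothetical 500 test days, the margin is about +/- 4.4
# percentage points, which swallows the 47.1%-51.3% spread entirely.
margin = coinflip_margin(500)
```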

&lt;h3&gt;
  
  
  On Regime Detection: The LLM Has a Unique Edge
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Detected&lt;/th&gt;
&lt;th&gt;False Alarm Rate&lt;/th&gt;
&lt;th&gt;Can Explain Why?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CUSUM (statistical)&lt;/td&gt;
&lt;td&gt;5/6 (83%)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;3/6 (50%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BinSeg (statistical)&lt;/td&gt;
&lt;td&gt;2/6 (33%)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bollinger Bands&lt;/td&gt;
&lt;td&gt;1/6 (17%)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM doesn't win on raw detection rate; CUSUM beats it handily. But two things stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero false alarms.&lt;/strong&gt; The LLM never incorrectly flagged a regime change when there wasn't one. It only raised its hand when it was genuinely confident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It can explain itself.&lt;/strong&gt; When CUSUM fires, it just says "something changed." When the LLM fires, it says things like: &lt;em&gt;"Fed tightening beginning in Q1 2022, combined with the collapse of the Terra/Luna ecosystem in May, has fundamentally altered risk appetite. The current regime shows classic bear market characteristics: declining volume, high correlation with equities, and consistent negative news flow from exchange insolvencies."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That explanation has real practical value. A risk manager doesn't just want to know the alarm went off; they want to know why, so they can decide what to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the LLM Struggled
&lt;/h2&gt;

&lt;p&gt;The most interesting failure was the &lt;strong&gt;Institutional Accumulation&lt;/strong&gt; regime (April 2019 to February 2020). This was a quiet period of slow, steady accumulation by institutional players like Grayscale, with no dramatic headlines, no price explosions, and no obvious trigger.&lt;/p&gt;

&lt;p&gt;The LLM scored 0% on detecting this transition. It relies heavily on news hooks and dramatic price movements. Slow, structural, low-noise regime changes are essentially invisible to it.&lt;/p&gt;

&lt;p&gt;This reveals a genuine limitation: &lt;strong&gt;LLMs reason from narrative, and quiet regimes have no narrative.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The paper makes a case that LLMs and traditional statistical methods are &lt;strong&gt;complementary, not competing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use CUSUM as a cheap, fast first-stage detector (it's great at catching that &lt;em&gt;something&lt;/em&gt; changed)&lt;/li&gt;
&lt;li&gt;Use the LLM as a second stage to interpret &lt;em&gt;what&lt;/em&gt; changed and &lt;em&gt;why&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither alone is the full answer. Together, they cover each other's weaknesses.&lt;/p&gt;
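&lt;p&gt;The first stage really is cheap. A minimal CUSUM sketch on standardized returns (the slack and threshold values are illustrative defaults, not tuned to the paper's setup):&lt;/p&gt;

```python
def cusum_alarm(returns, k=0.5, h=5.0):
    """Two-sided CUSUM change detector on standardized returns.

    Accumulates drift above and below the mean; fires when either sum
    exceeds the threshold h. k is the slack (drift allowance) and h the
    decision threshold, both in standard-deviation units; tuning them
    trades detection speed against false alarms.
    """
    s_pos = s_neg = 0.0
    for t, x in enumerate(returns):
        s_pos = max(0.0, s_pos + x - k)
        s_neg = max(0.0, s_neg - x - k)
        if s_pos > h or s_neg > h:
            return t  # first alarm index; hand off to the LLM to explain why
    return None
```

&lt;p&gt;When the alarm fires, the second stage asks the LLM to interpret the window around the alarm index and produce the plain-English regime explanation.&lt;/p&gt;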




&lt;h2&gt;
  
  
  What We Released
&lt;/h2&gt;

&lt;p&gt;Everything is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The full Bitcoin OHLCV dataset (2017-2024) with labeled regimes&lt;/li&gt;
&lt;li&gt;50 annotated news events&lt;/li&gt;
&lt;li&gt;All model code, prompts, and raw LLM responses&lt;/li&gt;
&lt;li&gt;A reproducibility checklist so anyone can replicate every number in the paper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The total API cost to run every LLM experiment in the paper was about &lt;strong&gt;$4.40&lt;/strong&gt;. The entire pipeline is accessible to any individual researcher without institutional compute budgets.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full paper is available on engrXiv. Code and data are on &lt;a href="https://github.com/manasmudbari/bitcoin-llm-regime-analysis" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>bitcoin</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Building a Multi-Provider LLM Benchmark with Automated GitHub Actions</title>
      <dc:creator>Manas Mudbari</dc:creator>
      <pubDate>Sun, 25 Jan 2026 17:10:29 +0000</pubDate>
      <link>https://forem.com/manasmudbari/building-a-multi-provider-llm-benchmark-with-automated-github-actions-hk6</link>
      <guid>https://forem.com/manasmudbari/building-a-multi-provider-llm-benchmark-with-automated-github-actions-hk6</guid>
      <description>&lt;p&gt;A core problem we tackled when building realtime LLM based signal analysis is LLM token efficiency: when you're feeding time-series data (stock prices, IoT sensors, blockchain events) into LLMs, the serialization format matters. A lot.&lt;/p&gt;

&lt;p&gt;We needed hard numbers to prove it. So we built an automated benchmark system that runs every two weeks, tests four data formats across four major LLM providers, and publishes live results on our website.&lt;/p&gt;

&lt;p&gt;Here's how we built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proving Token Efficiency at Scale
&lt;/h2&gt;

&lt;p&gt;Time-series data is structurally simple but verbose. JSON, the industry default, repeats keys on every row. CSV is better, but still repeats full timestamps and values. For LLMs, this repetition translates directly into tokens, and tokens cost money.&lt;/p&gt;

&lt;p&gt;We developed &lt;strong&gt;TSLN&lt;/strong&gt; (Time-Series Lean Notation), a format that exploits temporal regularity and delta encoding to reduce token count by up to 87%. But claiming efficiency isn't enough. We needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible benchmarks&lt;/strong&gt; across multiple LLM providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated execution&lt;/strong&gt; so results stay current&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public transparency&lt;/strong&gt; so developers can verify our claims&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result: an open-source benchmark suite that runs automatically via GitHub Actions and displays live results on &lt;a href="https://www.turboline.ai#benchmark" rel="noopener noreferrer"&gt;turboline.ai&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│  GitHub Actions (Bi-weekly cron + manual trigger)   │
│  ┌───────────────────────────────────────────────┐  │
│  │  1. Checkout repo                             │  │
│  │  2. Install Python deps (openai, anthropic...) │  │
│  │  3. Run benchmark script                      │  │
│  │  4. Commit results to public/data/*.json      │  │
│  │  5. Push to main branch                       │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                        │
                        │ Git push triggers Railway
                        ▼
┌─────────────────────────────────────────────────────┐
│  Railway (Automated CI/CD)                          │
│  ┌───────────────────────────────────────────────┐  │
│  │  1. Detect commit to main                     │  │
│  │  2. Build Next.js site                        │  │
│  │  3. Deploy to production                      │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
          Website auto-loads /data/benchmark-results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python benchmark runner&lt;/strong&gt; (&lt;code&gt;benchmark/run_full_benchmark.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions workflow&lt;/strong&gt; (&lt;code&gt;.github/workflows/run-benchmark.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next.js frontend&lt;/strong&gt; (React component with Recharts visualization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Railway deployment&lt;/strong&gt; (automatic on git push)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Benchmark Script
&lt;/h2&gt;

&lt;p&gt;The core script tests four serialization formats:&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Formats Tested
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt; - Baseline format with full object notation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV&lt;/strong&gt; - Header row with comma-separated values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TSLN&lt;/strong&gt; - Time-Series Lean Notation (our format)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON&lt;/strong&gt; - Token-Oriented Object Notation (pipe-delimited)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Sample Data Generation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_sample_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate 100 data points in different formats.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;format_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01T09:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;zfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;150.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;format_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp,value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01T09:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;zfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:00Z,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;150.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;format_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tsln&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Delta-encoded compact format
&lt;/span&gt;        &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;150.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t:2024-01-01T09:00:00Z|i:60|v:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;format_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Pipe-delimited format
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp|value&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01T09:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;zfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:00Z|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;150.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Token Counting &amp;amp; Cost Calculation
&lt;/h3&gt;

&lt;p&gt;Each benchmark:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generates sample data&lt;/strong&gt; (100 stock price data points)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Counts tokens&lt;/strong&gt; using a simple heuristic (~4 chars/token)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculates costs&lt;/strong&gt; using provider-specific pricing:

&lt;ul&gt;
&lt;li&gt;OpenAI GPT-4o-mini: $0.15/1M tokens&lt;/li&gt;
&lt;li&gt;Anthropic Claude Haiku: $0.80/1M tokens&lt;/li&gt;
&lt;li&gt;Google Gemini 1.5 Flash: $0.075/1M tokens&lt;/li&gt;
&lt;li&gt;Deepseek: $0.14/1M tokens
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_single_benchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="n"&gt;format_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this time-series data and summarize trends.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;full_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Provider-specific cost rates
&lt;/span&gt;    &lt;span class="n"&gt;cost_per_1m_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-haiku-4-5-20251001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-1.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.075&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.14&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cost_per_1m_tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cost_usd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;
    &lt;span class="n"&gt;cost_per_100k_datapoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;format_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_100k_datapoints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cost_per_100k_datapoints&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# ... more metadata
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary Statistics
&lt;/h3&gt;

&lt;p&gt;After running all combinations (4 formats × 4 providers = 16 tests), we aggregate results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate per-format averages and savings vs JSON.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;format_groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;format_groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;format_groups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;format_groups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;format_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;format_groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;avg_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;avg_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_100k_datapoints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;format_stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_cost_per_100k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;savings_vs_json_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate savings relative to JSON baseline
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;format_stats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;format_stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_cost_per_100k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;format_stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;savings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;json_cost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_cost_per_100k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;json_cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
                &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;savings_vs_json_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;savings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;format_stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  GitHub Actions Automation
&lt;/h2&gt;

&lt;p&gt;The workflow runs on a schedule every two weeks and can also be triggered manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Benchmark Bi-weekly&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Every 2 weeks on Sunday at 00:00 UTC&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*/14&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0'&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Manual trigger via GitHub UI&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;  &lt;span class="c1"&gt;# Required to commit results&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;run-benchmark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;persist-credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Python&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.11'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Python dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install openai anthropic google-generativeai&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run benchmark script&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GOOGLE_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DEEPSEEK_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;python benchmark/run_full_benchmark.py&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push results&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;git config --local user.email "github-actions[bot]@users.noreply.github.com"&lt;/span&gt;
          &lt;span class="s"&gt;git config --local user.name "GitHub Actions"&lt;/span&gt;
          &lt;span class="s"&gt;git add public/data/benchmark-results.json&lt;/span&gt;
          &lt;span class="s"&gt;git diff --staged --quiet || git commit -m "Update benchmark results [automated]"&lt;/span&gt;
          &lt;span class="s"&gt;git push&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Secrets Management&lt;/strong&gt;: API keys are stored as GitHub repository secrets and injected as environment variables during workflow execution.&lt;/p&gt;
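&lt;p&gt;On the Python side, a fail-fast check makes a missing secret obvious at the top of the run instead of halfway through 16 API calls. A minimal sketch — the key names match the workflow above, but &lt;code&gt;missing_keys&lt;/code&gt; itself is illustrative, not from the repo:&lt;/p&gt;

```python
import os

# The four provider keys the workflow injects as environment variables.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "DEEPSEEK_API_KEY",
]

def missing_keys(env=None):
    """Return the provider keys that are unset (or empty) in the environment."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

&lt;p&gt;Calling this at startup and aborting when the list is non-empty turns a cryptic mid-run 401 into a clear error in the first seconds of the job.&lt;/p&gt;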

&lt;p&gt;&lt;strong&gt;Conditional Commits&lt;/strong&gt;: The &lt;code&gt;git diff --staged --quiet ||&lt;/code&gt; pattern ensures we only commit when results actually change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated Deployment&lt;/strong&gt;: After pushing to main, Railway automatically detects the change and redeploys the Next.js site within ~2 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Results from Latest Run
&lt;/h2&gt;

&lt;p&gt;Here's what our latest benchmark (January 20, 2026) shows for &lt;strong&gt;100 stock price data points&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg Tokens&lt;/th&gt;
&lt;th&gt;Cost/100k Points&lt;/th&gt;
&lt;th&gt;Savings vs JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,397&lt;/td&gt;
&lt;td&gt;$0.0404&lt;/td&gt;
&lt;td&gt;— (baseline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;698&lt;/td&gt;
&lt;td&gt;$0.0202&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;698&lt;/td&gt;
&lt;td&gt;$0.0202&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TSLN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;177&lt;/td&gt;
&lt;td&gt;$0.0052&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;87.3%&lt;/strong&gt; ✨&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key findings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TSLN uses 87.3% fewer tokens&lt;/strong&gt; than JSON across all providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV and TOON are equivalent&lt;/strong&gt; at ~50% savings (both avoid JSON's key repetition)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings are consistent&lt;/strong&gt; across OpenAI, Anthropic, Google, and Deepseek&lt;/li&gt;
&lt;li&gt;For 100k data points, JSON costs &lt;strong&gt;~$4&lt;/strong&gt; while TSLN costs &lt;strong&gt;~$0.52&lt;/strong&gt; (average across providers)&lt;/li&gt;
&lt;/ul&gt;
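&lt;p&gt;The headline percentages are easy to verify from the token counts alone — savings is just the relative token reduction versus the JSON baseline:&lt;/p&gt;

```python
# Average input tokens from the January 2026 run (100 stock data points).
tokens = {"JSON": 1397, "CSV": 698, "TOON": 698, "TSLN": 177}
baseline = tokens["JSON"]

for fmt, t in tokens.items():
    savings = (baseline - t) / baseline * 100
    print(f"{fmt}: {savings:.1f}% fewer tokens than JSON")
```

&lt;p&gt;This reproduces the table: 50.0% for CSV and TOON, 87.3% for TSLN.&lt;/p&gt;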




&lt;h2&gt;
  
  
  Frontend Visualization
&lt;/h2&gt;

&lt;p&gt;The benchmark results are visualized on our homepage using React + Recharts:&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Provider Tabs&lt;/strong&gt;: Switch between aggregated view and provider-specific breakdowns (OpenAI, Anthropic, Google, Deepseek).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive Table&lt;/strong&gt;: Shows format comparison with highlighting for best performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Comparison Chart&lt;/strong&gt;: Bar chart using Recharts with color-coded formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔴 JSON (red) - baseline&lt;/li&gt;
&lt;li&gt;🟠 CSV (orange)&lt;/li&gt;
&lt;li&gt;🔵 TOON (blue)
&lt;/li&gt;
&lt;li&gt;🟢 TSLN (green) - most efficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stats Cards&lt;/strong&gt;: Display best format, max savings %, and test success rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Snippet
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;LLMBenchmark&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setData&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;BenchmarkData&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;activeTab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setActiveTab&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;average&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Load from static JSON generated by GitHub Actions&lt;/span&gt;
    &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/data/benchmark-results.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;setData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

  &lt;span class="c1"&gt;// Calculate provider-specific or averaged stats&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getProviderStats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;providerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;providerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;average&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format_stats&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Filter and aggregate by provider&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;providerResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;providerId&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ... aggregate by format&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Provider tabs */&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"flex gap-2"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;PROVIDERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="na"&gt;onClick&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setActiveTab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logo&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
            &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Results table and chart */&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ResponsiveContainer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;BarChart&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Bar&lt;/span&gt; &lt;span class="na"&gt;dataKey&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;chartData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Cell&lt;/span&gt; &lt;span class="na"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;formatColors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Bar&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;BarChart&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;ResponsiveContainer&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  CI/CD Pipeline: GitHub Actions → Railway
&lt;/h2&gt;

&lt;p&gt;Our deployment pipeline is fully automated:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. GitHub Actions Runs Benchmark
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: Cron schedule (bi-weekly) or manual dispatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt;: Run Python benchmark, commit JSON results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;public/data/benchmark-results.json&lt;/code&gt; committed to main&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Railway Detects Git Push
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connected to GitHub&lt;/strong&gt;: Railway monitors our main branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-build&lt;/strong&gt;: Detects commit, runs &lt;code&gt;npm run build&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-deploy&lt;/strong&gt;: Ships new build to production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Next.js Loads Static JSON
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static file&lt;/strong&gt;: Results JSON is in &lt;code&gt;/public&lt;/code&gt;, served directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side fetch&lt;/strong&gt;: React component loads on mount&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast &amp;amp; cacheable&lt;/strong&gt;: No backend needed for benchmark data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture is &lt;strong&gt;serverless-friendly&lt;/strong&gt;: the benchmark results are just static JSON, so we avoid database costs and API rate limits.&lt;/p&gt;




&lt;h2&gt;
  
  
  TypeScript Type Safety
&lt;/h2&gt;

&lt;p&gt;We maintain TypeScript type definitions that mirror the Python output schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib/benchmark-types.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;BenchmarkResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;cost_per_100k_datapoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;FormatSummaryStats&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;avg_input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;avg_cost_per_100k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;sample_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="nx"&gt;savings_vs_json_percent&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;BenchmarkData&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;benchmark_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;formats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="nl"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nl"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BenchmarkResult&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;format_stats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;FormatSummaryStats&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;best_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;
    &lt;span class="nx"&gt;max_savings_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the Python output schema matches what the React component expects.&lt;/p&gt;
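&lt;p&gt;Since TypeScript interfaces are erased at compile time, nothing enforces that contract when the JSON is actually fetched. A lightweight runtime guard is one way to close the gap; the &lt;code&gt;isBenchmarkData&lt;/code&gt; function below is an illustrative sketch (a schema library like zod would be more robust), not code from our repo:&lt;/p&gt;

```typescript
// Hypothetical runtime guard for the fetched benchmark JSON. Interfaces
// are erased at compile time, so a structural check like this (or a
// schema library such as zod) is what actually protects the component.
function isBenchmarkData(x: unknown): boolean {
  if (typeof x !== 'object' || x === null) return false
  const o = x as Record<string, unknown>
  const summary = o.summary as Record<string, unknown> | undefined
  return (
    typeof o.benchmark_date === 'string' &&
    typeof o.job_id === 'string' &&
    Array.isArray(o.results) &&
    typeof summary === 'object' && summary !== null &&
    typeof summary.best_format === 'string' &&
    typeof summary.max_savings_percent === 'number'
  )
}
```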




&lt;h2&gt;
  
  
  Live Visualization Deep Dive
&lt;/h2&gt;

&lt;p&gt;The frontend uses &lt;strong&gt;Framer Motion&lt;/strong&gt; for animations and &lt;strong&gt;Recharts&lt;/strong&gt; for data visualization:&lt;/p&gt;

&lt;h3&gt;
  
  
  Provider Switching
&lt;/h3&gt;

&lt;p&gt;Users can toggle between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average&lt;/strong&gt; - Aggregated stats across all providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; - GPT-4o-mini specific results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; - Claude Haiku specific results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt; - Gemini Flash specific results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepseek&lt;/strong&gt; - Deepseek Chat specific results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When switching tabs, we filter &lt;code&gt;data.results&lt;/code&gt; by provider and recalculate format averages dynamically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getProviderStats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;providerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;providerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;average&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format_stats&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;providerResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;providerId&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;// Group by format and calculate averages&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;formatGroups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;providerResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;formatGroups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="nx"&gt;formatGroups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="nx"&gt;formatGroups&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;formatGroups&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;avg_input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input_tokens&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;avg_cost_per_100k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cost_per_100k_datapoints&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;savings_vs_json_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="cm"&gt;/* calculated vs JSON */&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;stats&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
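&lt;p&gt;The &lt;code&gt;avg&lt;/code&gt; helper referenced above isn't shown; a minimal version (the app's exact signature may differ) is just the mean of one numeric field across result rows:&lt;/p&gt;

```typescript
// Minimal version of the avg() helper used in getProviderStats above:
// the mean of one numeric field across result rows.
function avg(rows: Array<Record<string, number>>, field: string): number {
  if (rows.length === 0) return 0
  const sum = rows.reduce((acc, row) => acc + (row[field] ?? 0), 0)
  return sum / rows.length
}
```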



&lt;h3&gt;
  
  
  Color Coding
&lt;/h3&gt;

&lt;p&gt;Each format has a semantic color:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔴 &lt;strong&gt;JSON (red)&lt;/strong&gt; - Most expensive baseline&lt;/li&gt;
&lt;li&gt;🟠 &lt;strong&gt;CSV (orange)&lt;/strong&gt; - Moderate efficiency&lt;/li&gt;
&lt;li&gt;🔵 &lt;strong&gt;TOON (blue)&lt;/strong&gt; - Moderate efficiency&lt;/li&gt;
&lt;li&gt;🟢 &lt;strong&gt;TSLN (green)&lt;/strong&gt; - Highest efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Responsive Design
&lt;/h3&gt;

&lt;p&gt;The chart uses &lt;code&gt;ResponsiveContainer&lt;/code&gt; from Recharts to adapt to mobile/tablet/desktop. Table columns stack on smaller screens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Static JSON &amp;gt; Database for Benchmark Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We initially considered storing results in a database, but realized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Results only update bi-weekly&lt;/li&gt;
&lt;li&gt;No user-specific data&lt;/li&gt;
&lt;li&gt;Static files are faster and free&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;GitHub Actions Auto-Commit is Powerful&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The pattern of running a script, committing output, and pushing back to the repo unlocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated data pipelines&lt;/li&gt;
&lt;li&gt;Version-controlled results&lt;/li&gt;
&lt;li&gt;GitOps-style transparency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Railway's GitHub Integration is Seamless&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We didn't write a single line of deploy config. Railway just watches our main branch and redeploys on every push. Perfect for small teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Token Efficiency Compounds Quickly&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At 100 data points, TSLN saves ~$0.035 per run. At 10,000 data points, that becomes $3.50 per run. For production systems ingesting millions of time-series events, the savings are &lt;strong&gt;material&lt;/strong&gt;.&lt;/p&gt;
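&lt;p&gt;A quick sanity check on that scaling (the linear projection is an assumption; real savings depend on the tokenizer and data shape):&lt;/p&gt;

```typescript
// Projecting the quoted ~$0.035 saving per 100 data points to larger
// runs. Linear scaling is an assumption, not a measured guarantee.
const SAVINGS_PER_100_POINTS_USD = 0.035

function projectedSavingsUsd(dataPoints: number): number {
  return (dataPoints / 100) * SAVINGS_PER_100_POINTS_USD
}
```

&lt;p&gt;At a million data points a day, that projection works out to roughly $350/day.&lt;/p&gt;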

</description>
      <category>llm</category>
      <category>githubactions</category>
      <category>python</category>
    </item>
    <item>
      <title>A Token-Efficient Way to Send Time-Series Data into LLMs</title>
      <dc:creator>Manas Mudbari</dc:creator>
      <pubDate>Wed, 31 Dec 2025 21:14:08 +0000</pubDate>
      <link>https://forem.com/manasmudbari/a-token-efficient-way-to-send-time-series-data-into-llms-2h80</link>
      <guid>https://forem.com/manasmudbari/a-token-efficient-way-to-send-time-series-data-into-llms-2h80</guid>
      <description>&lt;p&gt;If you’ve ever pushed time-series data (metrics, logs, network streams, sensor readings) into an LLM, you’ve probably noticed that even small datasets can get expensive and slow very quickly.&lt;/p&gt;

&lt;p&gt;Not because the data is huge, but because of how it gets tokenized.&lt;/p&gt;

&lt;p&gt;This post is about a representation experiment we’ve been running to reduce that overhead, what didn’t work, and what seems to help.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem We Ran Into
&lt;/h2&gt;

&lt;p&gt;Time-series data is repetitive by nature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timestamps move forward predictably&lt;/li&gt;
&lt;li&gt;values often change slowly&lt;/li&gt;
&lt;li&gt;schema repeats on every row&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Humans immediately see the pattern, but LLMs don’t.&lt;/p&gt;

&lt;p&gt;Most LLM tokenizers are optimized for natural language, not numerical streams. Two numbers that look almost identical to us can tokenize very differently. Repeating structure (timestamps, keys, braces) quietly eats context and cost.&lt;/p&gt;

&lt;p&gt;At small scale, it’s annoying, but at scale, it becomes an infrastructure problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Tried (and What Didn’t Help Much)
&lt;/h2&gt;

&lt;p&gt;Before building anything new, we tried existing formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON (baseline)&lt;/li&gt;
&lt;li&gt;CSV (more compact, but still verbose)&lt;/li&gt;
&lt;li&gt;TOON (interesting idea, but still text that gets re-tokenized)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, TOON didn’t materially reduce token usage, because everything was still passed as plain text into the LLM. The structure was different, but the tokenizer’s behavior didn’t improve much.&lt;/p&gt;

&lt;p&gt;That was the key realization: compression alone isn’t the problem — tokenization is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Math Was Our Intuition
&lt;/h2&gt;

&lt;p&gt;If you’ve taken calculus or physics, this will feel familiar.&lt;/p&gt;

&lt;p&gt;Think about motion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Position → where something is&lt;/li&gt;
&lt;li&gt;Velocity → how position changes&lt;/li&gt;
&lt;li&gt;Acceleration → how velocity changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now map that to time-series data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw values = position&lt;/li&gt;
&lt;li&gt;differences between values = velocity (delta)&lt;/li&gt;
&lt;li&gt;differences between deltas = acceleration (delta-of-delta)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most real-world time-series data has low acceleration. Values drift; timestamps tick forward regularly.&lt;/p&gt;

&lt;p&gt;So instead of repeating full values and timestamps, we started experimenting with representing changes.&lt;/p&gt;
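&lt;p&gt;In code, the idea looks something like this (illustrative only; this is not TSLN’s actual wire format):&lt;/p&gt;

```typescript
// Delta-encode values (velocity), then delta-of-delta for regularly
// spaced timestamps (acceleration). Illustrative sketch only.
function deltas(values: number[]): number[] {
  return values.slice(1).map((v, i) => v - values[i])
}

// Readings sampled every 60 s: the timestamp column collapses to zeros.
const timestamps = [1700000000, 1700000060, 1700000120, 1700000180]
const dod = deltas(deltas(timestamps)) // → [0, 0]
```

&lt;p&gt;Small, mostly-zero integers like these tokenize far more cheaply than repeated ten-digit timestamps.&lt;/p&gt;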

&lt;h2&gt;
  
  
  TSLN: A Token-Aware Representation
&lt;/h2&gt;

&lt;p&gt;That experiment turned into &lt;strong&gt;TSLN (Time-Series Lean Notation)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, it’s a text-based serialization that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stores deltas instead of repeating full values&lt;/li&gt;
&lt;li&gt;stores delta-of-delta for regular timestamps&lt;/li&gt;
&lt;li&gt;declares schema once instead of repeating it&lt;/li&gt;
&lt;li&gt;stays human-readable and streamable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference from “just compression” is that it’s designed to be tokenization-aware. Smaller, bounded numbers and less repeated syntax lead to far fewer tokens once the model sees the input.&lt;/p&gt;

&lt;p&gt;In early benchmarks, the same datasets used up to ~80% fewer tokens than JSON. That translated directly into lower cost and a larger effective context window when calling LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code and Next Steps
&lt;/h2&gt;

&lt;p&gt;We’ve open-sourced &lt;a href="https://github.com/turboline-ai/tsln-golang" rel="noopener noreferrer"&gt;Go&lt;/a&gt; and &lt;a href="https://github.com/turboline-ai/tsln-node" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; implementations under the MIT license so it’s easy to experiment or drop into existing pipelines.&lt;/p&gt;

&lt;p&gt;I’m currently expanding the benchmarks across more datasets, tokenizers, and workloads, and plan to publish a more formal preprint once that’s done.&lt;/p&gt;

&lt;p&gt;If you work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time-series data&lt;/li&gt;
&lt;li&gt;LLM pipelines&lt;/li&gt;
&lt;li&gt;serialization or streaming systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’d genuinely love feedback — especially edge cases, comparisons we should run, or prior art I may have missed.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>iot</category>
      <category>timeseries</category>
      <category>datascience</category>
    </item>
    <item>
      <title>I Benchmarked LLM APIs on Live BGP Streams. Here’s What Actually Matters.</title>
      <dc:creator>Manas Mudbari</dc:creator>
      <pubDate>Sun, 28 Dec 2025 17:01:36 +0000</pubDate>
      <link>https://forem.com/manasmudbari/i-benchmarked-llm-apis-on-live-bgp-streams-heres-what-actually-matters-2c3c</link>
      <guid>https://forem.com/manasmudbari/i-benchmarked-llm-apis-on-live-bgp-streams-heres-what-actually-matters-2c3c</guid>
      <description>&lt;p&gt;Most LLM benchmarks are polite.&lt;/p&gt;

&lt;p&gt;They run clean prompts on static text, measure token speed, and declare a winner. That’s fine if you’re building a chatbot. It’s almost useless if you’re building a real-time system.&lt;/p&gt;

&lt;p&gt;I wanted to see what happens when LLMs are exposed to something messier: live, high-velocity network telemetry.&lt;/p&gt;

&lt;p&gt;So I wired multiple LLM APIs directly into a live BGP stream and measured how they behaved when the data never stopped.&lt;/p&gt;

&lt;p&gt;This post is about what broke, what worked, and why “smartest model” is often the wrong question.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup (Simple, No Tricks)
&lt;/h2&gt;

&lt;p&gt;The data source was a live BGP feed from RIPE RIS:&lt;/p&gt;

&lt;p&gt;WebSocket endpoint:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;wss://ris-live.ripe.net/v1/ws/?client=turbomart-test&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Subscription message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "ris_subscribe",
  "data": { "host": "rrc21" }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you a continuous firehose of routing updates. No batching. No backpressure help.&lt;/p&gt;
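&lt;p&gt;For reference, the handshake is just one JSON message over the socket. Building the message is the only part testable offline; the send itself (sketched in the comment, using Node’s &lt;code&gt;ws&lt;/code&gt; package) is an assumption about your client setup:&lt;/p&gt;

```typescript
// Build the ris_subscribe payload shown above. Sending it requires a
// WebSocket client (e.g. the 'ws' package in Node), sketched below.
function buildRisSubscribe(host: string): string {
  return JSON.stringify({ type: 'ris_subscribe', data: { host } })
}

// const ws = new WebSocket('wss://ris-live.ripe.net/v1/ws/?client=turbomart-test')
// ws.on('open', () => ws.send(buildRisSubscribe('rrc21')))
```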

&lt;p&gt;Each update was sent to five LLM APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Anthropic&lt;/li&gt;
&lt;li&gt;Azure OpenAI&lt;/li&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;Grok&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same prompts. Same parameters. No model-specific tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are an expert network engineer who analyzes BGP feeds for a living…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;User prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Summarize the following BGP update in under 140 characters for a real-time network alert. Include ASN owner, prefix, and region if known.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a model failed, truncated its output, or rambled, that was counted as part of the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Measured (The Stuff That Actually Hurts in Production)
&lt;/h2&gt;

&lt;p&gt;I didn’t care about abstract “intelligence.” I cared about things that break pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to First Token (TTFT)&lt;/li&gt;
&lt;li&gt;Total completion latency&lt;/li&gt;
&lt;li&gt;Tokens in vs tokens out&lt;/li&gt;
&lt;li&gt;Compression ratio (output tokens divided by input tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics determine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how stale your alerts are&lt;/li&gt;
&lt;li&gt;whether your buffers explode&lt;/li&gt;
&lt;li&gt;whether you burn money on filler text&lt;/li&gt;
&lt;/ul&gt;
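&lt;p&gt;Each sample can be scored with simple arithmetic; the field names below are mine, not any provider SDK’s:&lt;/p&gt;

```typescript
// Scoring one streamed response. Timestamps are in milliseconds; these
// field names are illustrative, not from any provider's SDK.
interface Sample {
  sentAt: number
  firstTokenAt: number
  doneAt: number
  inputTokens: number
  outputTokens: number
}

function score(s: Sample) {
  return {
    ttftMs: s.firstTokenAt - s.sentAt,
    totalLatencyMs: s.doneAt - s.sentAt,
    compressionRatio: s.outputTokens / s.inputTokens,
  }
}
```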




&lt;h2&gt;
  
  
  The Averages (Across All Samples)
&lt;/h2&gt;

&lt;p&gt;Here’s what the numbers looked like when averaged per provider:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Total latency&lt;/th&gt;
&lt;th&gt;Tokens out&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;~830 ms&lt;/td&gt;
&lt;td&gt;~1.8 s&lt;/td&gt;
&lt;td&gt;~45&lt;/td&gt;
&lt;td&gt;~0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;~2.1 s&lt;/td&gt;
&lt;td&gt;~6.3 s&lt;/td&gt;
&lt;td&gt;~137&lt;/td&gt;
&lt;td&gt;~0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure OpenAI&lt;/td&gt;
&lt;td&gt;~2.8 s&lt;/td&gt;
&lt;td&gt;~2.8 s&lt;/td&gt;
&lt;td&gt;~9,400&lt;/td&gt;
&lt;td&gt;~1.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;~3.0 s&lt;/td&gt;
&lt;td&gt;~3.4 s&lt;/td&gt;
&lt;td&gt;~9,600&lt;/td&gt;
&lt;td&gt;~1.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok&lt;/td&gt;
&lt;td&gt;~19 s&lt;/td&gt;
&lt;td&gt;~19.7 s&lt;/td&gt;
&lt;td&gt;~33&lt;/td&gt;
&lt;td&gt;~0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Even without context, some things should already look alarming.&lt;/p&gt;


&lt;h2&gt;
  
  
  Model-by-Model: What Actually Happened
&lt;/h2&gt;
&lt;h3&gt;
  
  
  OpenAI
&lt;/h3&gt;

&lt;p&gt;OpenAI behaved exactly how you’d want in a streaming system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast first token&lt;/li&gt;
&lt;li&gt;short, clean summaries&lt;/li&gt;
&lt;li&gt;almost no wasted output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It followed the prompt closely and didn’t try to be clever. That’s a feature, not a bug.&lt;/p&gt;

&lt;p&gt;If you’re building dashboards, alerts, or anything user-facing in real time, OpenAI was the most predictable option.&lt;/p&gt;

&lt;p&gt;

&lt;iframe src="https://player.vimeo.com/video/1149869786" width="710" height="399"&gt;
&lt;/iframe&gt;


&lt;/p&gt;




&lt;h3&gt;
  
  
  Anthropic
&lt;/h3&gt;

&lt;p&gt;Anthropic did something different.&lt;/p&gt;

&lt;p&gt;It didn’t just summarize updates. It tried to interpret them. Sometimes it flagged anomalies. Sometimes it suggested what might be happening.&lt;/p&gt;

&lt;p&gt;That extra reasoning came at a cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slower responses&lt;/li&gt;
&lt;li&gt;significantly more tokens&lt;/li&gt;
&lt;li&gt;longer completions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not an alerting engine. It’s closer to a junior analyst reading the feed.&lt;/p&gt;

&lt;p&gt;Great for offline analysis. Dangerous for live alerts.&lt;/p&gt;

&lt;p&gt;

&lt;iframe src="https://player.vimeo.com/video/1149869998" width="710" height="399"&gt;
&lt;/iframe&gt;


&lt;/p&gt;




&lt;h3&gt;
  
  
  Azure OpenAI
&lt;/h3&gt;

&lt;p&gt;Azure OpenAI struggled in this setup.&lt;/p&gt;

&lt;p&gt;It often behaved as if it only partially understood the incoming data. Output was verbose, repetitive, and sometimes ignored the summarization constraint entirely.&lt;/p&gt;

&lt;p&gt;The compression ratio tells the story: output was often larger than input.&lt;/p&gt;

&lt;p&gt;That’s a red flag in any streaming system.&lt;/p&gt;

&lt;p&gt;I suspect this can be fixed with tighter controls, but out of the box it wasn’t stream-safe.&lt;/p&gt;

&lt;p&gt;

&lt;iframe src="https://player.vimeo.com/video/1149869956" width="710" height="399"&gt;
&lt;/iframe&gt;


&lt;/p&gt;




&lt;h3&gt;
  
  
  Gemini
&lt;/h3&gt;

&lt;p&gt;Gemini responses were usually fast enough, but often incomplete.&lt;/p&gt;

&lt;p&gt;Some outputs were truncated. Others were short but low-value. Many wasted tokens without adding useful signal.&lt;/p&gt;

&lt;p&gt;It felt optimized for short Q&amp;amp;A, not for interpreting structured telemetry.&lt;/p&gt;

&lt;p&gt;If you’re processing logs or metrics streams, Gemini isn’t there yet.&lt;/p&gt;

&lt;p&gt;

&lt;iframe src="https://player.vimeo.com/video/1149869909" width="710" height="399"&gt;
&lt;/iframe&gt;


&lt;/p&gt;




&lt;h3&gt;
  
  
  Grok
&lt;/h3&gt;

&lt;p&gt;Grok was the strangest.&lt;/p&gt;

&lt;p&gt;Responses were extremely slow to start, but very short once they arrived. Often it just signaled that something changed, without explaining what or why.&lt;/p&gt;

&lt;p&gt;Think of it as a “delta detector,” not a summarizer.&lt;/p&gt;

&lt;p&gt;If your use case is “ping me when anything changes,” maybe.&lt;br&gt;
If you need explanation, no.&lt;/p&gt;

&lt;p&gt;

&lt;iframe src="https://player.vimeo.com/video/1149869836" width="710" height="399"&gt;
&lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Lesson
&lt;/h2&gt;

&lt;p&gt;LLM APIs are not interchangeable components.&lt;/p&gt;

&lt;p&gt;They encode assumptions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how fast answers should arrive&lt;/li&gt;
&lt;li&gt;how verbose responses should be&lt;/li&gt;
&lt;li&gt;how much reasoning is appropriate&lt;/li&gt;
&lt;li&gt;how strictly prompts should be followed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real-time systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency beats intelligence&lt;/li&gt;
&lt;li&gt;consistency beats creativity&lt;/li&gt;
&lt;li&gt;token efficiency beats verbosity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An answer that arrives late is indistinguishable from noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You’re Building a Streaming System
&lt;/h2&gt;

&lt;p&gt;Based on this experiment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use OpenAI for real-time alerts and dashboards&lt;/li&gt;
&lt;li&gt;Use Anthropic for offline analysis or investigations&lt;/li&gt;
&lt;li&gt;Be cautious with Azure OpenAI unless you tightly constrain it&lt;/li&gt;
&lt;li&gt;Avoid Gemini for structured stream summarization&lt;/li&gt;
&lt;li&gt;Use Grok only if you care about “something changed,” not details&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Building a Terminal UI Broke My Brain</title>
      <dc:creator>Manas Mudbari</dc:creator>
      <pubDate>Mon, 15 Dec 2025 18:22:19 +0000</pubDate>
      <link>https://forem.com/manasmudbari/building-a-terminal-ui-broke-my-brain-hpc</link>
      <guid>https://forem.com/manasmudbari/building-a-terminal-ui-broke-my-brain-hpc</guid>
      <description>&lt;p&gt;I’ve spent most of my career building things for the browser.&lt;/p&gt;

&lt;p&gt;If something looks wrong, you open DevTools.&lt;br&gt;
If spacing is off, you inspect the DOM.&lt;br&gt;
If the layout is cursed, you tweak CSS until it stops yelling at you.&lt;/p&gt;

&lt;p&gt;So naturally, I thought building a Terminal UI (TUI) would be… similar.&lt;/p&gt;

&lt;p&gt;It was not!&lt;/p&gt;

&lt;p&gt;This post is part of me building in public while working on a project called &lt;a href="https://github.com/turboline-ai/turbostream" rel="noopener noreferrer"&gt;TurboStream&lt;/a&gt; — a developer tool that lets you connect high-velocity WebSocket streams (blockchains, BGP, finance feeds, etc.) to LLMs without draining tokens or crashing your system. &lt;/p&gt;

&lt;p&gt;Think:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WebSocket → Cache → Triggers → LLM → Short human-readable alerts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend was the easy part.&lt;br&gt;
The Terminal UI nearly ended me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdy63ikmn2nj1hthrvms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdy63ikmn2nj1hthrvms.png" alt="TUI Dashboard" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Coming From the Browser World
&lt;/h2&gt;

&lt;p&gt;In the web world:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Figma → HTML/CSS is mostly mechanical&lt;/li&gt;
&lt;li&gt;Layout is visual&lt;/li&gt;
&lt;li&gt;Debugging is interactive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Bubble Tea land:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layout is math&lt;/li&gt;
&lt;li&gt;Padding is vibes&lt;/li&gt;
&lt;li&gt;“Why is this box 2 columns wider?” is a philosophical question&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no DOM inspector.&lt;br&gt;
There’s no “hover to see bounding box.”&lt;br&gt;
You change one &lt;code&gt;lipgloss.Style()&lt;/code&gt; and suddenly everything shifts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardest Screen: AI Analysis
&lt;/h2&gt;

&lt;p&gt;The screen that caused the most pain was the AI Analysis panel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbzwfdq8rttsk3qxyh8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbzwfdq8rttsk3qxyh8t.png" alt="AI Analysis Window" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conceptually, it’s simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;show LLM context size&lt;/li&gt;
&lt;li&gt;show token usage&lt;/li&gt;
&lt;li&gt;show generation timing&lt;/li&gt;
&lt;li&gt;stream output text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;content height changes constantly&lt;/li&gt;
&lt;li&gt;widths need to stay aligned with sibling panels&lt;/li&gt;
&lt;li&gt;scrolling text + borders + padding fight each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I kept ending up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;truncated text&lt;/li&gt;
&lt;li&gt;panels overflowing by 1 character&lt;/li&gt;
&lt;li&gt;borders misaligned depending on terminal width&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It looked fine at one size, then completely broke when resized.&lt;/p&gt;

&lt;p&gt;If you’ve used Bubble Tea, you know the feeling:&lt;br&gt;
&lt;strong&gt;“This should work… why doesn’t it?”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned (So Far)
&lt;/h2&gt;

&lt;p&gt;A few lessons that finally started to click:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Stop thinking in pixels
&lt;/h3&gt;

&lt;p&gt;Terminal layout is about constraints, not visuals.&lt;br&gt;
Everything is rows × columns. Nothing is free.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Measure everything explicitly
&lt;/h3&gt;

&lt;p&gt;If you don’t calculate width/height yourself, Bubble Tea will happily surprise you.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Borders lie
&lt;/h3&gt;

&lt;p&gt;Borders and padding count.&lt;br&gt;
That “one extra column” is always your fault.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Debugging TUIs requires instrumentation
&lt;/h3&gt;

&lt;p&gt;I started adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;temporary background colors&lt;/li&gt;
&lt;li&gt;width/height labels inside boxes&lt;/li&gt;
&lt;li&gt;fake content to stress layouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It felt ugly — but it worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’m Doing Next
&lt;/h2&gt;

&lt;p&gt;To make this sane long-term, my next steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a layout debug mode&lt;/strong&gt;: Toggleable overlays that show component boundaries and sizes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write small layout test harnesses:&lt;/strong&gt; Instead of debugging inside the full app, isolate one view at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardize layout contracts&lt;/strong&gt;: Every panel declares what it needs instead of “figuring it out.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept that TUI UX ≠ Web UX&lt;/strong&gt;: Different medium, different rules.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cli</category>
      <category>tui</category>
    </item>
  </channel>
</rss>
