<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Frank Fiegel</title>
    <description>The latest articles on Forem by Frank Fiegel (@punkpeye).</description>
    <link>https://forem.com/punkpeye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F883707%2Fdaaf6c2e-12aa-402f-bf6d-24ebcd38db1f.jpg</url>
      <title>Forem: Frank Fiegel</title>
      <link>https://forem.com/punkpeye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/punkpeye"/>
    <language>en</language>
    <item>
      <title>The Hackers Who Tracked My Sleep Cycle</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 26 Mar 2026 16:17:47 +0000</pubDate>
      <link>https://forem.com/punkpeye/the-hackers-who-tracked-my-sleep-cycle-21a</link>
      <guid>https://forem.com/punkpeye/the-hackers-who-tracked-my-sleep-cycle-21a</guid>
      <description>&lt;p&gt;This is a short story about how I caught hackers timing their attacks around my daily routine.&lt;/p&gt;

&lt;p&gt;A couple of weeks ago, in the middle of the night, I got alerts on my phone that various metrics were going haywire.&lt;/p&gt;

&lt;p&gt;The only thing I noticed was a huge spike in sign-ups. This coincided with one of my articles being on the front page of Hacker News, so I didn't think much of it.&lt;/p&gt;

&lt;p&gt;The spike lasted a few hours and flattened out just before I started troubleshooting, so I went back to sleep.&lt;/p&gt;

&lt;p&gt;When I woke up the next morning, a few thousand more accounts had appeared.&lt;/p&gt;

&lt;p&gt;I thought this was a bit suspicious, but ... even if someone was being naughty, they seemed to have given up.&lt;/p&gt;

&lt;p&gt;Nothing else was out of the ordinary that day. I even decided to stay up later that night to see if the pattern would repeat. Nothing.&lt;/p&gt;

&lt;p&gt;As a small precaution, I activated CAPTCHA to slow down the sign-ups and went back to sleep.&lt;/p&gt;

&lt;p&gt;The next morning, you guessed it... same pattern.&lt;/p&gt;

&lt;p&gt;This time, I decided to do a deep dive into the data.&lt;/p&gt;

&lt;p&gt;What I found was that the hackers were...&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;creating thousands of accounts&lt;/li&gt;
&lt;li&gt;adding a valid payment method to each account&lt;/li&gt;
&lt;li&gt;running a single very expensive LLM call (2-3 USD)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This let the first request go through, then triggered a charge to their payment method. The charge would be rejected, but by then the request had already been processed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The reason the first request goes through is that we deposit a little bit of money into the account when it's created. A nominal amount that's enough to play with the API. However, if a payment method is added, we allow people to go into overdraft, which is why their expensive LLM call goes through.&lt;/p&gt;
&lt;/blockquote&gt;
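
&lt;p&gt;To make the trade-off concrete, here is a minimal sketch of the kind of check described above (names and amounts are hypothetical, not our production code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical sketch of the overdraft rule described above.
type Account = { balance: number; hasPaymentMethod: boolean };

function canRunRequest(account: Account, estimatedCost: number): boolean {
  // New accounts receive a small free balance to play with the API.
  if (account.balance &gt;= estimatedCost) {
    return true;
  }
  // With a payment method on file, the request is allowed to overdraft
  // and the card is charged afterwards -- the step that was abused.
  return account.hasPaymentMethod;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;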

&lt;p&gt;Anyway, using this method, they would get away with about a thousand dollars' worth of credits every night, which kept them interested in the service.&lt;/p&gt;

&lt;p&gt;But what caught my attention wasn't the money – it was the timing. The attacks coincided with my sleep cycle.&lt;/p&gt;

&lt;p&gt;At this point, I still thought of it as unlucky timing, but even then something just didn't sit right with me.&lt;/p&gt;

&lt;p&gt;Coincidentally, that day I decided to take a break and disconnect from my computer early. And lo and behold, just 30 minutes after I shut down my computer, I got the first notification.&lt;/p&gt;

&lt;p&gt;I logged in to check, and it stopped.&lt;/p&gt;

&lt;p&gt;Went to play some games and ... 30 minutes later, I got the second notification.&lt;/p&gt;

&lt;p&gt;That's when it clicked – the timing of the attacks wasn't random. They were checking my Discord status to see if I was online.&lt;/p&gt;

&lt;p&gt;Sure enough, I confirmed this by setting myself as offline on Discord, and the attacks popped right back up.&lt;/p&gt;

&lt;p&gt;Over the next few days, I used this insight to mess with the hackers and to use them as my personal pen testers.&lt;/p&gt;

&lt;p&gt;I didn't want to remove free credits for everyone, so I began experimenting with different ways to deter future attackers.&lt;/p&gt;

&lt;p&gt;I would make a change, then go 'offline'. I'd watch them troubleshoot their automations until they figured out a workaround. Then I'd go back 'online'. The attacks would resume, and the cycle would repeat.&lt;/p&gt;

&lt;p&gt;I mostly forgot about the entire incident until they came back to try their luck again the following week. But despite a few nights of alerts about tripwires getting triggered, they never managed to get more than the few cents we deposited into new accounts – not enough of an incentive to keep trying.&lt;/p&gt;

&lt;p&gt;In the end, it was the cat-and-mouse game that made the whole experience worth it. I got free pen testing; they got a few dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Card testing vulnerability
&lt;/h2&gt;

&lt;p&gt;I wasn't surprised about the overdraft feature being abused. This was something we were aware of and treated as a conscious trade-off between convenience and risk of abuse.&lt;/p&gt;

&lt;p&gt;The bigger issue was that this made me realize that a malicious actor could abuse our system for &lt;a href="https://docs.stripe.com/disputes/prevention/card-testing" rel="noopener noreferrer"&gt;card testing&lt;/a&gt;. That's a widespread problem and one that will get your Stripe account flagged. When researching this problem, I didn't find many effective solutions, so I wanted to dedicate part of this blog post to sharing what I learned.&lt;/p&gt;

&lt;p&gt;Here's what I tried and how it held up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Effectiveness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Device fingerprinting&lt;/td&gt;
&lt;td&gt;Ineffective. Fingerprints are great for detecting legitimate returning users (e.g. to bypass CAPTCHA), but because they are easy to fake, they are not effective at detecting malicious actors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP address blocking&lt;/td&gt;
&lt;td&gt;Ineffective. Residential proxies are cheap and easy to get.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAPTCHA&lt;/td&gt;
&lt;td&gt;Mild deterrent, but ultimately ineffective; many off-the-shelf services exist to bypass CAPTCHA.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTP&lt;/td&gt;
&lt;td&gt;Mild deterrent, but ultimately ineffective; many off-the-shelf services exist to bypass OTP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JA4&lt;/td&gt;
&lt;td&gt;Somewhat effective. &lt;a href="https://blog.cloudflare.com/ja4-signals/" rel="noopener noreferrer"&gt;JA4&lt;/a&gt; is a TLS fingerprinting method that identifies clients based on how they negotiate TLS connections. Of all data points that we collect, JA4 is the most stable identifier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALTCHA&lt;/td&gt;
&lt;td&gt;Somewhat effective. &lt;a href="https://altcha.org/" rel="noopener noreferrer"&gt;ALTCHA&lt;/a&gt; is a proof-of-work challenge that requires the client to solve a computational puzzle before submitting a request. When combined with prior methods, can slow down the attacks enough to deter the attacker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Somewhat effective. Slows down the attacks, but may hurt legitimate users.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the end of the day, each method is individually bypassable – the game is making the combination expensive enough that the attacker moves on.&lt;/p&gt;
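
&lt;p&gt;To illustrate, a layered defense might combine these weak signals into a single risk score instead of trusting any one of them (the weights and thresholds below are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative risk scoring over the signals from the table above.
type SignupSignals = {
  ja4OnBlocklist: boolean;       // TLS fingerprint seen in prior abuse
  altchaSolved: boolean;         // completed the proof-of-work challenge
  signupsFromIpLastHour: number;
};

function riskScore(s: SignupSignals): number {
  let score = 0;
  if (s.ja4OnBlocklist) score += 2;
  if (!s.altchaSolved) score += 1;
  if (s.signupsFromIpLastHour &gt; 3) score += 1; // bursty sign-ups from one IP
  return score;
}
// e.g. hold sign-ups with a score of 2 or more for extra verification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;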

&lt;p&gt;Oh, and set your Discord status to offline.&lt;/p&gt;

</description>
      <category>security</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>MCP Inspector is Now Stable: A Browser-Based Tool for Testing MCP Servers</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sat, 17 Jan 2026 22:34:20 +0000</pubDate>
      <link>https://forem.com/punkpeye/mcp-inspector-is-now-stable-a-browser-based-tool-for-testing-mcp-servers-4dim</link>
      <guid>https://forem.com/punkpeye/mcp-inspector-is-now-stable-a-browser-based-tool-for-testing-mcp-servers-4dim</guid>
      <description>&lt;p&gt;&lt;strong&gt;For the MCP ecosystem to grow, we need better developer tools. Today, we're publicly launching MCP Inspector.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Model Context Protocol (MCP) is rapidly becoming the standard for connecting AI assistants to external tools and data sources. But as developers build MCP servers, they face a common challenge: how do you test and debug them effectively?&lt;/p&gt;

&lt;p&gt;Until now, testing meant setting up local environments, managing dependencies, or logging into platforms that collect your data. We built MCP Inspector to change that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MCP Inspector?
&lt;/h2&gt;

&lt;p&gt;MCP Inspector is a free, browser-based tool that lets you connect to any MCP server URL and interact with its full capabilities—tools, resources, prompts, and tasks—directly from your browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnw3yppg0vobsxzroe3m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnw3yppg0vobsxzroe3m.jpeg" alt=" " width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it now: &lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;glama.ai/mcp/inspector&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Built This
&lt;/h2&gt;

&lt;p&gt;When we started building MCP integrations, we found ourselves constantly switching between terminals, debugging configurations, and writing throwaway scripts just to test a single tool call. The official MCP Inspector is great for local development, but we needed something we could use anywhere—to test remote servers, share debugging sessions with teammates, or quickly verify a deployment.&lt;/p&gt;

&lt;p&gt;So we built what we needed: a zero-friction inspector that runs entirely in your browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  No Login Required
&lt;/h3&gt;

&lt;p&gt;Open the URL, paste your server address, and start inspecting. No account creation, no signup flow, no barriers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy-First Design
&lt;/h3&gt;

&lt;p&gt;All requests go directly from your browser to the MCP server. We don't proxy, log, or store any of your requests or responses. Your API keys and server interactions stay between you and your server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Protocol Support
&lt;/h3&gt;

&lt;p&gt;We didn't cut corners. MCP Inspector supports the complete MCP specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — List available tools, configure parameters with a dynamic form, and execute them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources &amp;amp; Templates&lt;/strong&gt; — Browse and read server resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — Test prompt templates with arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; — Create, monitor, cancel, and retrieve results from long-running tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress Notifications&lt;/strong&gt; — See real-time progress updates during execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elicitations&lt;/strong&gt; — Respond to server-initiated form requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth 2.0&lt;/strong&gt; — Full OAuth flow support with dynamic client registration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bearer Tokens &amp;amp; Custom Headers&lt;/strong&gt; — Flexible authentication options&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Built-in Demo Server
&lt;/h3&gt;

&lt;p&gt;Not sure how it works? We've included a test server (&lt;code&gt;mcp-test.glama.ai/mcp&lt;/code&gt;) that demonstrates every feature—tasks, elicitations, progress notifications, audio responses, and images. It's the fastest way to understand what MCP can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shareable Sessions
&lt;/h3&gt;

&lt;p&gt;Your entire configuration—servers, selected tools, and arguments—is stored in the URL. Bookmark it to save your setup, or share the link with a colleague to give them the exact same view.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;glama.ai/mcp/inspector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click "Add Server" and enter your MCP server URL&lt;/li&gt;
&lt;li&gt;Select your authentication method (None, OAuth, Bearer Token, or Custom Headers)&lt;/li&gt;
&lt;li&gt;Connect and start exploring&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The interface shows your server's capabilities in tabs: Tools, Resources, Resource Templates, Prompts, and Tasks. Select any item to see its details, configure parameters, and execute requests. All requests and responses are logged in the panel below, and additional debugging info is available in your browser's developer console.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Local Development
&lt;/h2&gt;

&lt;p&gt;MCP Inspector works great for testing remote servers. For local development with &lt;code&gt;stdio&lt;/code&gt; transports or other local-only features, we recommend the &lt;a href="https://modelcontextprotocol.io/docs/tools/inspector" rel="noopener noreferrer"&gt;official MCP Inspector&lt;/a&gt; which is designed specifically for that workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Today
&lt;/h2&gt;

&lt;p&gt;We've been using MCP Inspector internally for months, and it's become an essential part of our development workflow. We're excited to share it with the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;Open MCP Inspector →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have feedback or feature requests? &lt;a href="https://glama.ai/mcp/discord" rel="noopener noreferrer"&gt;Join our Discord&lt;/a&gt; and let us know what you'd like to see.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The MCP Inspector is part of Glama's suite of tools for the Model Context Protocol ecosystem. Explore more at &lt;a href="https://glama.ai/mcp" rel="noopener noreferrer"&gt;glama.ai/mcp&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>devtools</category>
      <category>ai</category>
    </item>
    <item>
      <title>MCP vs API</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sun, 08 Jun 2025 15:39:11 +0000</pubDate>
      <link>https://forem.com/punkpeye/mcp-vs-api-abb</link>
      <guid>https://forem.com/punkpeye/mcp-vs-api-abb</guid>
      <description>&lt;p&gt;Every week a new thread emerges on Reddit asking about the difference between MCP and API. I've tried summarizing everything that's been said about MCP vs API in a single post (and a single table).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional APIs (REST/GraphQL)&lt;/th&gt;
&lt;th&gt;Model Context Protocol (MCP)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interface styles (REST, GraphQL) with optional spec formats (OpenAPI, GraphQL SDL)&lt;/td&gt;
&lt;td&gt;Standardized protocol with enforced message structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Designed for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human developers writing code&lt;/td&gt;
&lt;td&gt;AI agents making decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST: Path, headers, query params, body (multiple formats)&lt;/td&gt;
&lt;td&gt;Single JSON input/output per tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static docs, regenerate SDKs for changes&lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/td&gt;
&lt;td&gt;Runtime introspection (&lt;code&gt;tools/list&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM generates HTTP requests (error-prone)&lt;/td&gt;
&lt;td&gt;LLM picks tool, deterministic code runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Typically client-initiated; server-push exists but not standardized&lt;/td&gt;
&lt;td&gt;Bidirectional as first-class feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires port, auth, CORS setup&lt;/td&gt;
&lt;td&gt;Native stdio support for desktop tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impractical at scale due to heterogeneity&lt;/td&gt;
&lt;td&gt;Single protocol enables model fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The HTTP API Problem
&lt;/h2&gt;

&lt;p&gt;HTTP APIs suffer from combinatorial chaos. To send data to an endpoint, you might encode it in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URL path (&lt;code&gt;/users/123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Request headers (&lt;code&gt;X-User-Id: 123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Query parameters (&lt;code&gt;?userId=123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Request body (JSON, XML, form-encoded, CSV)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAPI/Swagger documents these variations, but as a specification format, it describes existing patterns rather than enforcing consistency. Building automated tools to reliably use arbitrary APIs remains hard because HTTP wasn't designed for this—it was the only cross-platform, firewall-friendly transport universally available from browsers.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: A Wire Protocol, Not Documentation
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) isn't another API standard—it's a wire protocol that enforces consistency. While OpenAPI documents existing interfaces with their variations, MCP mandates specific patterns: JSON-RPC 2.0 transport, single input schema per tool, deterministic execution.&lt;/p&gt;

&lt;p&gt;Key architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transport&lt;/strong&gt;: stdio (local) or &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http" rel="noopener noreferrer"&gt;streamable HTTP&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: &lt;code&gt;tools/list&lt;/code&gt;, &lt;code&gt;resources/list&lt;/code&gt; expose capabilities at runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primitives&lt;/strong&gt;: Tools (actions), Resources (read-only data), Prompts (templates)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Not Just Use OpenAPI?
&lt;/h2&gt;

&lt;p&gt;The most common question: "Why not extend OpenAPI with AI-specific features?"&lt;/p&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenAPI describes; MCP prescribes&lt;/strong&gt;. You can't fix inconsistency by documenting it better—you need enforcement at the protocol level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrofitting fails at scale&lt;/strong&gt;. OpenAPI would need to standardize transport, mandate single-location inputs, require specific schemas, add bidirectional primitives—essentially becoming a different protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ecosystem problem&lt;/strong&gt;. Even if OpenAPI added these features tomorrow, millions of existing APIs wouldn't adopt them. MCP starts fresh with AI-first principles.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Five Fundamental Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Runtime Discovery vs Static Specs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Ship new client code when endpoints change&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Agents query capabilities dynamically and adapt automatically&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// MCP discovery - works with any server
client.request('tools/list')
// Returns all available tools with schemas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Deterministic Execution vs LLM-Generated Calls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: LLM writes the HTTP request → hallucinated paths, wrong parameters&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: LLM picks which tool → wrapped code executes deterministically&lt;/p&gt;

&lt;p&gt;This distinction is critical for production safety. With MCP, you can test, sanitize inputs, and handle errors in actual code, not hope the LLM formats requests correctly.&lt;/p&gt;
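
&lt;p&gt;A sketch of what that wrapped code can look like (the tool name and handler shape are illustrative, not any specific SDK's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// The LLM only chooses the tool and supplies arguments;
// everything below runs as ordinary, testable code.
async function handleToolCall(name: string, args: Record&lt;string, unknown&gt;) {
  if (name !== "github.search_prs") {
    throw new Error("Unknown tool: " + name);
  }
  const query = String(args.query ?? "");
  if (query.length === 0) {
    // Input validation happens in code, not in the model's output.
    throw new Error("query is required");
  }
  // ...call the underlying API with sanitized, typed inputs...
  return { query };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;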
&lt;h3&gt;
  
  
  3. Bidirectional Communication
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Server-push exists (WebSockets, SSE, GraphQL subscriptions) but lacks standardization&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Bidirectional communication as first-class feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request LLM completions from server&lt;/li&gt;
&lt;li&gt;Ask users for input (&lt;a href="https://modelcontextprotocol.io/specification/draft/client/elicitation" rel="noopener noreferrer"&gt;elicitation&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Push progress notifications&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Single-Request Human Tasks
&lt;/h3&gt;

&lt;p&gt;REST APIs fragment human tasks across endpoints. Creating a calendar event might require:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;POST /events&lt;/code&gt; (create)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /conflicts&lt;/code&gt; (check)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /invitations&lt;/code&gt; (notify)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MCP tools map to complete workflows. One tool, one human task.&lt;/p&gt;
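
&lt;p&gt;Sketched as a single tool body (the endpoints and the &lt;code&gt;http&lt;/code&gt; helper are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Http = {
  post(path: string, body: unknown): Promise&lt;any&gt;;
  get(path: string): Promise&lt;any&gt;;
};

// One MCP tool wrapping the three REST steps above.
async function scheduleEvent(http: Http, args: { title: string; attendees: string[] }) {
  const event = await http.post("/events", { title: args.title });      // 1. create
  const conflicts = await http.get("/conflicts?eventId=" + event.id);   // 2. check
  if (conflicts.length === 0) {
    await http.post("/invitations", { eventId: event.id, to: args.attendees }); // 3. notify
  }
  return { event, conflicts };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;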
&lt;h3&gt;
  
  
  5. Local-First by Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Requires HTTP server (port binding, CORS, auth headers)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Can run as local process via stdio—no network layer needed&lt;/p&gt;

&lt;p&gt;Why this matters: When MCP servers run locally via stdio, they inherit the host process's permissions.&lt;/p&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct filesystem access (read/write files)&lt;/li&gt;
&lt;li&gt;Terminal command execution&lt;/li&gt;
&lt;li&gt;System-level operations&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
A local HTTP server could provide the same capabilities. However, I think the fact that MCP led with &lt;code&gt;stdio&lt;/code&gt; transport planted the idea that MCP servers are meant to run as local services, which is not how we typically think of APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The Training Advantage
&lt;/h2&gt;

&lt;p&gt;MCP's standardization creates a future opportunity: models could be trained on a single, consistent protocol rather than thousands of API variations. While models today use MCP through existing function-calling capabilities, the protocol's uniformity offers immediate practical benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent patterns across all servers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovery: &lt;code&gt;tools/list&lt;/code&gt;, &lt;code&gt;resources/list&lt;/code&gt;, &lt;code&gt;prompts/list&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execution: &lt;code&gt;tools/call&lt;/code&gt; with single JSON argument object&lt;/li&gt;
&lt;li&gt;Errors: Standard JSON-RPC format with numeric codes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reduced cognitive load for models:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Every&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;follows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;same&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pattern:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github.search_prs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"security"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Versus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;REST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;APIs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;endless&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variations:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/api/v&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;/search?q=security&amp;amp;type=pr&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/graphql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{ search(query: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;security&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;) { ... } }"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/repos/owner/repo/pulls?state=open&amp;amp;search=security&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This standardization means models need to learn one calling convention instead of inferring patterns from documentation. As MCP adoption grows, future models could be specifically optimized for the protocol, similar to how models today are trained on function-calling formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  They're Layers, Not Competitors
&lt;/h2&gt;

&lt;p&gt;Most MCP servers wrap existing APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[AI Agent] ⟷ MCP Client ⟷ MCP Server ⟷ REST API ⟷ Service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mcp-github&lt;/code&gt; server translates &lt;code&gt;repository/list&lt;/code&gt; into GitHub REST calls. You keep battle-tested infrastructure while adding AI-friendly ergonomics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider a task: "Find all pull requests mentioning security issues and create a summary report."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With OpenAPI/REST&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM reads API docs, generates: &lt;code&gt;GET /repos/{owner}/{repo}/pulls?state=all&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Hopes it formatted the request correctly&lt;/li&gt;
&lt;li&gt;Parses response, generates: &lt;code&gt;GET /repos/{owner}/{repo}/pulls/{number}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Repeats for each PR (rate limiting issues)&lt;/li&gt;
&lt;li&gt;Generates search queries for comments&lt;/li&gt;
&lt;li&gt;Assembles report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;With MCP&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM calls: &lt;code&gt;github.search_issues_and_prs({query: "security", type: "pr"})&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deterministic code handles pagination, rate limits, error retry&lt;/li&gt;
&lt;li&gt;Returns structured data&lt;/li&gt;
&lt;li&gt;LLM focuses on analysis, not API mechanics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;HTTP APIs evolved to serve human developers and browser-based applications, not AI agents. MCP addresses AI-specific requirements from the ground up: runtime discovery, deterministic execution, and bidirectional communication.&lt;/p&gt;

&lt;p&gt;For AI-first applications, MCP provides structural advantages—local execution, server-initiated flows, and guaranteed tool reliability—that would require significant workarounds in traditional API architectures. The practical path forward involves using both: maintaining APIs for human developers while adding MCP for AI agent integration.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;GraphQL offers schema introspection, but it lacks task-level descriptions or JSON-schema-style validation, so SDKs still regenerate for new fields. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;OpenAPI 3.1+ supports runtime discovery through the OpenAPI document endpoint. The key difference is that MCP mandates runtime discovery while OpenAPI makes it optional. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>NLWeb: Microsoft's Protocol for AI-Powered Website Search</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Wed, 04 Jun 2025 17:57:44 +0000</pubDate>
      <link>https://forem.com/punkpeye/nlweb-microsofts-protocol-for-ai-powered-website-search-ng8</link>
      <guid>https://forem.com/punkpeye/nlweb-microsofts-protocol-for-ai-powered-website-search-ng8</guid>
      <description>&lt;p&gt;Microsoft recently open-sourced &lt;a href="https://github.com/microsoft/NLWeb" rel="noopener noreferrer"&gt;NLWeb&lt;/a&gt;, a protocol for adding conversational interfaces to websites.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; It leverages &lt;a href="https://schema.org/" rel="noopener noreferrer"&gt;Schema.org&lt;/a&gt; structured data that many sites already have and includes built-in support for MCP (Model Context Protocol), enabling both human conversations and agent-to-agent communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key idea:&lt;/strong&gt; NLWeb creates a standard protocol that turns any website into a conversational interface that both humans and AI agents can query naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem Does NLWeb Solve?
&lt;/h2&gt;

&lt;p&gt;Currently, websites have structured data (Schema.org) but no standard way for AI agents or conversational interfaces to access it. Every implementation is bespoke. Traditional search interfaces struggle with context-aware, multi-turn queries.&lt;/p&gt;

&lt;p&gt;NLWeb creates a standard protocol for conversational access to web content. What RSS did for syndication, NLWeb aims to do for AI interactions: one implementation serves both human chat interfaces and programmatic agent access.&lt;/p&gt;

&lt;p&gt;The key insight: Instead of building custom NLP for every site, NLWeb leverages LLMs' existing understanding of Schema.org to create instant conversational interfaces.&lt;/p&gt;

&lt;p&gt;The real power comes from multi-turn conversations that preserve context:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Find recipes for dinner parties"&lt;/li&gt;
&lt;li&gt;"Only vegetarian options"
&lt;/li&gt;
&lt;li&gt;"That can be prepared in under an hour"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each query builds on the previous context - something traditional search interfaces struggle with.&lt;/p&gt;

&lt;h2&gt;
  
  
  How NLWeb Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two-Component System
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protocol Layer&lt;/strong&gt;: REST API (&lt;code&gt;/ask&lt;/code&gt; endpoint) and MCP server (&lt;code&gt;/mcp&lt;/code&gt; endpoint) that accept natural language queries and return Schema.org JSON responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation Layer&lt;/strong&gt;: Reference implementation that orchestrates multiple LLM calls for query processing&lt;/li&gt;
&lt;/ol&gt;
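&lt;p&gt;To make the protocol layer concrete, here is a minimal Python sketch of what an &lt;code&gt;/ask&lt;/code&gt; handler does: accept a natural-language query and return Schema.org-shaped JSON. This is a toy, not NLWeb's reference code; the &lt;code&gt;retrieve&lt;/code&gt; callback is a hypothetical stand-in for the implementation layer (vector retrieval plus LLM ranking):&lt;/p&gt;

```python
# Minimal sketch of the protocol layer: an /ask handler that accepts a natural
# language query and returns Schema.org-shaped results. Retrieval and ranking
# are stubbed; a real implementation delegates to the implementation layer.
import json, uuid

def handle_ask(params, retrieve):
    query = params["query"]
    results = retrieve(query)  # stand-in for vector retrieval + LLM ranking
    return {
        "query_id": params.get("query_id", uuid.uuid4().hex),
        "results": [
            {
                "url": r["url"],
                "name": r["name"],
                "score": r["score"],
                "schema_object": {"@type": r["type"]},
            }
            for r in results
        ],
    }

# Toy retriever returning one canned podcast episode.
response = handle_ask(
    {"query": "find podcasts about AI"},
    lambda q: [{"url": "https://example.com/ep1", "name": "AI Safety",
                "score": 85, "type": "PodcastEpisode"}],
)
print(json.dumps(response["results"][0]["schema_object"]))
```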

&lt;h3&gt;
  
  
  Query Processing Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Parallel Pre-processing → Vector Retrieval → LLM Ranking → Response
             ├─ Relevancy Check
             ├─ Decontextualization  
             ├─ Memory Detection
             └─ Fast Track Path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this flow, a single query may trigger 50+ targeted LLM calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query decontextualization based on conversation history&lt;/li&gt;
&lt;li&gt;Relevancy scoring against site content&lt;/li&gt;
&lt;li&gt;Result ranking with custom prompts per content type&lt;/li&gt;
&lt;li&gt;Optional post-processing (summarization/generation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "fast track" optimization launches a parallel path to retrieval (step 3) while pre-processing occurs, but results are blocked until relevancy checks complete&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 50+ LLM Calls?
&lt;/h3&gt;

&lt;p&gt;Instead of using one large prompt to handle everything, NLWeb breaks each query into dozens of small, specific questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Is this query about recipes?"&lt;/li&gt;
&lt;li&gt;"Does it reference something mentioned earlier?"&lt;/li&gt;
&lt;li&gt;"Is the user asking to remember dietary preferences?"&lt;/li&gt;
&lt;li&gt;"How relevant is this specific result?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach has two major benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No hallucination&lt;/strong&gt; - Results only come from your actual database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better accuracy&lt;/strong&gt; - Each LLM call has one clear job it can do well&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of it like having a team of specialists instead of one generalist.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even if you don't use NLWeb, this pattern—using many focused LLM calls instead of one complex prompt—is worth borrowing.&lt;/p&gt;
&lt;/blockquote&gt;
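&lt;p&gt;In code, the pattern looks like a set of narrow classifiers rather than one mega-prompt. In this sketch, &lt;code&gt;ask_llm&lt;/code&gt; is a hypothetical yes/no model call, faked with keyword rules so it runs offline:&lt;/p&gt;

```python
# Sketch of the many-focused-calls pattern: each classifier gets one narrow
# question. `ask_llm` stands in for a yes/no LLM call.
def ask_llm(question, text):
    rules = {
        "about_recipes": ("recipe", "vegetarian", "dinner"),
        "references_context": ("that", "those", "the first one"),
        "memory_request": ("remember", "always", "never"),
    }
    return any(k in text.lower() for k in rules[question])

def analyze(query):
    # One small, independent call per question instead of one giant prompt.
    return {q: ask_llm(q, query) for q in
            ("about_recipes", "references_context", "memory_request")}

print(analyze("That can be prepared in under an hour"))
# {'about_recipes': False, 'references_context': True, 'memory_request': False}
```

&lt;p&gt;Each answer is independently verifiable, which is exactly what makes the approach resistant to hallucination.&lt;/p&gt;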

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;The best way to wrap your head around NLWeb is to try it out.&lt;/p&gt;

&lt;p&gt;Microsoft provides a &lt;a href="https://github.com/microsoft/NLWeb/blob/main/docs/nlweb-hello-world.md" rel="noopener noreferrer"&gt;quick start guide&lt;/a&gt; for setting up an example NLWeb server with the &lt;a href="https://www.microsoft.com/en-us/behind-the-tech" rel="noopener noreferrer"&gt;Behind The Tech&lt;/a&gt; RSS feed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Setup&lt;/span&gt;
git clone https://github.com/microsoft/NLWeb
&lt;span class="nb"&gt;cd &lt;/span&gt;NLWeb
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv myenv
&lt;span class="nb"&gt;source &lt;/span&gt;myenv/bin/activate
&lt;span class="nb"&gt;cd &lt;/span&gt;code
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Configure (copy .env.template → .env, update API keys)&lt;/span&gt;

&lt;span class="c"&gt;# Load data&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; tools.db_load https://feeds.libsyn.com/121695/rss Behind-the-Tech

&lt;span class="c"&gt;# Run&lt;/span&gt;
python app-file.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to &lt;a href="http://localhost:8000/" rel="noopener noreferrer"&gt;localhost:8000&lt;/a&gt; and you should have a working NLWeb server.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have also noticed that the repository contains a &lt;a href="https://github.com/microsoft/NLWeb/blob/main/docs/nlweb-cli.md" rel="noopener noreferrer"&gt;CLI&lt;/a&gt; to simplify configuration, testing, and execution of the application. However, I struggled to get it working.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you have the server running, you can ask it questions like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "query": "tell me more about the first one",
    "prev": "find podcasts about AI,what topics do they cover"
  }'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will return a JSON response like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI Safety with Stuart Russell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Discussion on alignment challenges..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema_object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PodcastEpisode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Glama NLWeb Server
&lt;/h3&gt;

&lt;p&gt;As part of writing this post, I've built a simple NLWeb server using Node.js. You can use it to query our &lt;a href="https://glama.ai/mcp/servers" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"query": "MCP servers for working with GitHub"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;As far as I can tell, this is the first ever public NLWeb endpoint!&lt;/p&gt;

&lt;p&gt;Due to the volume of LLM calls, it takes a few seconds to respond.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or, if you want to continue the conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "servers that can create PRs",
    "prev": "MCP servers for working with GitHub"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or, if you want to summarize the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "MCP servers for working with GitHub",
    "mode": "summarize"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful when you want an overview rather than just a list of results.&lt;/p&gt;

&lt;p&gt;or, if you want to generate a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "MCP servers for working with GitHub",
    "mode": "generate"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mode attempts to answer the question using the retrieved results (like traditional RAG).&lt;/p&gt;

&lt;p&gt;Things that made it easy to implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have existing embeddings for every MCP server and a vector store&lt;/li&gt;
&lt;li&gt;We already have a way to make LLM calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few questions came to mind as I was implementing this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It seems that NLWeb doesn't dictate where the &lt;code&gt;/ask&lt;/code&gt; endpoint needs to be hosted—does it have to be &lt;code&gt;https://glama.ai/ask&lt;/code&gt; or can it be &lt;code&gt;https://glama.ai/nlweb/ask&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;It wasn't super clear to me which Schema.org data is best suited to describe MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not surprisingly, the slowest part of the pipeline is the LLM calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  REST API
&lt;/h2&gt;

&lt;p&gt;Currently, NLWeb supports two APIs at the endpoints &lt;code&gt;/ask&lt;/code&gt; and &lt;code&gt;/mcp&lt;/code&gt;. The arguments are the same for both, as is most of the functionality. The &lt;code&gt;/mcp&lt;/code&gt; endpoint returns answers in a format that MCP clients can use, and it also supports the core MCP methods (&lt;code&gt;list_tools&lt;/code&gt;, &lt;code&gt;list_prompts&lt;/code&gt;, &lt;code&gt;call_tool&lt;/code&gt;, and &lt;code&gt;get_prompt&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/ask&lt;/code&gt; endpoint supports the following parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Natural language question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;site&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scope to specific data subset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Comma-separated previous queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decontextualized_query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip decontextualization if provided&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;streaming&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enable SSE streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Track conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;list&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt;, or &lt;code&gt;generate&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
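&lt;p&gt;Several of these parameters can be combined in a single request. A sketch in Python's standard library (using the public Glama endpoint above; the network call itself is commented out so the snippet runs offline):&lt;/p&gt;

```python
# Assembling an /ask request that exercises several parameters from the table.
import json
from urllib import request

payload = {
    "query": "servers that can create PRs",
    "prev": "MCP servers for working with GitHub",  # comma-separated history
    "mode": "summarize",                            # list | summarize | generate
    "streaming": False,                             # disable SSE, get plain JSON
}
req = request.Request(
    "https://glama.ai/nlweb/ask",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:  # uncomment to actually hit the endpoint
#     print(json.load(resp))
print(req.get_method())  # POST (urllib infers POST when data is set)
```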

&lt;h2&gt;
  
  
  Integrating with MCP
&lt;/h2&gt;

&lt;p&gt;Since NLWeb includes an MCP server by default, you can configure Claude for Desktop to talk to NLWeb.&lt;/p&gt;

&lt;p&gt;If you already have the NLWeb server running, this should be as simple as adding the following to your &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt; configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ask_nlw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/myenv/bin/python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/code/chatbot_interface.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--endpoint"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/mcp"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/code"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementation Reality
&lt;/h2&gt;

&lt;p&gt;The documentation suggests you can get a basic prototype running quickly if you have existing Schema.org markup or RSS feeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's actually straightforward:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading RSS feeds or Schema.org data&lt;/li&gt;
&lt;li&gt;Basic search functionality with provided prompts&lt;/li&gt;
&lt;li&gt;Local development with Qdrant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What requires more effort:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production deployment at scale
&lt;/li&gt;
&lt;li&gt;Optimizing 50+ LLM calls per query (mentioned in docs)&lt;/li&gt;
&lt;li&gt;Custom prompt engineering for your domain&lt;/li&gt;
&lt;li&gt;Maintaining data freshness between vector store and live data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I already had a lot of these components in place, so I was able to get a basic prototype running in an hour. However, to make this production-ready, I'd need to spend a lot more time thinking about the cost of the LLM calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Care?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have structured data (Schema.org, RSS) already&lt;/li&gt;
&lt;li&gt;You want to enable conversational search beyond keywords&lt;/li&gt;
&lt;li&gt;You need programmatic AI agent access via MCP&lt;/li&gt;
&lt;li&gt;You can experiment with early-stage tech&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need battle-tested production code&lt;/li&gt;
&lt;li&gt;You can't handle significant LLM API costs&lt;/li&gt;
&lt;li&gt;Your content isn't well-structured&lt;/li&gt;
&lt;li&gt;You expect plug-and-play simplicity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;NLWeb is more interesting as a strategic direction than as current technology. NLWeb was conceived and developed by R.V. Guha (creator of Schema.org, RSS, and RDF), now a CVP and Technical Fellow at Microsoft&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. That's serious pedigree.&lt;/p&gt;

&lt;p&gt;The O'Reilly prototype proves it's viable for content-heavy sites. The quick start shows it's approachable for developers. But "prototype in days" doesn't mean "production in weeks."&lt;/p&gt;

&lt;p&gt;Think of it as an investment in making your content natively conversational. The technical foundation is solid—REST API, standard formats, proven vector stores. The vision is compelling. The code needs work.&lt;/p&gt;

&lt;p&gt;Want to experiment? Clone the repo and try the quick start above.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://news.microsoft.com/source/features/company-news/introducing-nlweb-bringing-conversational-interfaces-directly-to-the-web/" rel="noopener noreferrer"&gt;https://news.microsoft.com/source/features/company-news/introducing-nlweb-bringing-conversational-interfaces-directly-to-the-web/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://github.com/microsoft/NLWeb" rel="noopener noreferrer"&gt;https://github.com/microsoft/NLWeb&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://techcommunity.microsoft.com/blog/azure-ai-services-blog/nlweb-pioneer-qa-oreilly/4415299" rel="noopener noreferrer"&gt;https://techcommunity.microsoft.com/blog/azure-ai-services-blog/nlweb-pioneer-qa-oreilly/4415299&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>microsoft</category>
      <category>seo</category>
      <category>ai</category>
    </item>
    <item>
      <title>Claude Sonnet and Opus 4 (Executive Summary)</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 22 May 2025 18:19:44 +0000</pubDate>
      <link>https://forem.com/punkpeye/claude-sonnet-and-opus-4-executive-summary-3h2j</link>
      <guid>https://forem.com/punkpeye/claude-sonnet-and-opus-4-executive-summary-3h2j</guid>
      <description>&lt;p&gt;Anthropic released Claude Opus 4 and Sonnet 4 today, claiming the #1 spot for coding performance. There are going to be a lot of articles floating around with exaggerations and marketing talk, but here is an executive summary of everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench: 72.5% (world's best)&lt;/li&gt;
&lt;li&gt;Terminal-bench: 43.2% &lt;/li&gt;
&lt;li&gt;Sustained performance for hours on complex tasks&lt;/li&gt;
&lt;li&gt;$15/$75 per million tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Claude Sonnet 4:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench: 72.7% (slightly ahead of Opus 4)&lt;/li&gt;
&lt;li&gt;3x faster than Opus 4 for most tasks&lt;/li&gt;
&lt;li&gt;$3/$15 per million tokens&lt;/li&gt;
&lt;/ul&gt;
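&lt;p&gt;The 5x price gap compounds quickly at agent-scale token volumes. A back-of-envelope check using the listed prices (the 2M/0.5M token workload is an arbitrary example, not a benchmark figure):&lt;/p&gt;

```python
# Cost comparison from the listed prices (input/output per million tokens):
# Opus 4 at $15/$75, Sonnet 4 at $3/$15.
def cost(tokens_in, tokens_out, price_in, price_out):
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# Example workload: 2M input tokens, 0.5M output tokens.
opus = cost(2_000_000, 500_000, 15, 75)
sonnet = cost(2_000_000, 500_000, 3, 15)
print(opus, sonnet, opus / sonnet)  # 67.5 13.5 5.0
```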

&lt;p&gt;Two key slides from the announcement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq7rwl0lb8czdd0270wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq7rwl0lb8czdd0270wv.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmd5y7vc73o7sb0cnicc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmd5y7vc73o7sb0cnicc.png" alt=" " width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Technical Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Architecture&lt;/strong&gt;: Instant responses + extended thinking mode (up to 64K tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended Thinking with Tools&lt;/strong&gt;: Can use web search, code execution during reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Tool Execution&lt;/strong&gt;: Multiple tools simultaneously
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Files&lt;/strong&gt;: Creates persistent memory when given file access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;65% Reduction&lt;/strong&gt;: Less shortcut/loophole behavior than Sonnet 3.7&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Industry Adoption
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Integrating Sonnet 4 into GitHub Copilot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor&lt;/strong&gt;: "State-of-the-art for coding"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rakuten&lt;/strong&gt;: Validated 7-hour autonomous refactor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourcegraph&lt;/strong&gt;: "Substantial leap in software development"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New API Capabilities
&lt;/h2&gt;

&lt;p&gt;Four &lt;a href="https://www.anthropic.com/news/agent-capabilities-api" rel="noopener noreferrer"&gt;new capabilities&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code execution tool&lt;/li&gt;
&lt;li&gt;MCP connector
&lt;/li&gt;
&lt;li&gt;Files API&lt;/li&gt;
&lt;li&gt;Prompt caching (1 hour)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Claude Code Generally Available
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;VS Code and JetBrains extensions (beta)&lt;/li&gt;
&lt;li&gt;GitHub Actions integration (&lt;a href="https://www.youtube.com/watch?v=L_WFEgry87M" rel="noopener noreferrer"&gt;demo&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Claude Code SDK for custom agents&lt;/li&gt;
&lt;li&gt;GitHub PR integration via &lt;code&gt;/install-github-app&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Access
&lt;/h2&gt;

&lt;p&gt;Already available via &lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to skip the new-model access restrictions, you can try them via &lt;a href="https://glama.ai/gateway" rel="noopener noreferrer"&gt;Glama Gateway&lt;/a&gt; and OpenRouter.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, is it hype?
&lt;/h2&gt;

&lt;p&gt;Claude 4 models lead coding benchmarks and offer sustained performance for complex agent workflows. Opus 4 delivers maximum capability; Sonnet 4 balances speed and cost. Both are already available to test.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.anthropic.com/news/claude-4" rel="noopener noreferrer"&gt;Official Announcement&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll update this article with interesting insights and facts as the day progresses.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Top 100 MCP searches Q1 2025</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Mon, 14 Apr 2025 08:31:53 +0000</pubDate>
      <link>https://forem.com/punkpeye/top-100-mcp-searches-q1-2025-56fm</link>
      <guid>https://forem.com/punkpeye/top-100-mcp-searches-q1-2025-56fm</guid>
      <description>&lt;p&gt;Data from &lt;a href="https://glama.ai/mcp/servers" rel="noopener noreferrer"&gt;https://glama.ai/mcp/servers&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;search&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=supabase" rel="noopener noreferrer"&gt;supabase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=github" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1570&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=playwright" rel="noopener noreferrer"&gt;playwright&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1398&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=docker" rel="noopener noreferrer"&gt;docker&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1186&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser" rel="noopener noreferrer"&gt;browser&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=obsidian" rel="noopener noreferrer"&gt;obsidian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;980&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=notion" rel="noopener noreferrer"&gt;notion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;944&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgres" rel="noopener noreferrer"&gt;postgres&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=search" rel="noopener noreferrer"&gt;search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=filesystem" rel="noopener noreferrer"&gt;filesystem&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;704&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sequential" rel="noopener noreferrer"&gt;sequential&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20tools" rel="noopener noreferrer"&gt;browser tools&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google" rel="noopener noreferrer"&gt;google&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=perplexity" rel="noopener noreferrer"&gt;perplexity&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=memory" rel="noopener noreferrer"&gt;memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;442&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=youtube" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;440&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=reddit" rel="noopener noreferrer"&gt;reddit&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;436&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=blender" rel="noopener noreferrer"&gt;blender&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;432&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=cursor" rel="noopener noreferrer"&gt;cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=firebase" rel="noopener noreferrer"&gt;firebase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=wordpress" rel="noopener noreferrer"&gt;wordpress&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;390&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=firecrawl" rel="noopener noreferrer"&gt;firecrawl&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;370&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sequential%20thinking" rel="noopener noreferrer"&gt;sequential thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;366&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=weather" rel="noopener noreferrer"&gt;weather&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;356&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=confluence" rel="noopener noreferrer"&gt;confluence&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;334&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=linear" rel="noopener noreferrer"&gt;linear&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;322&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=puppeteer" rel="noopener noreferrer"&gt;puppeteer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=salesforce" rel="noopener noreferrer"&gt;salesforce&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgresql" rel="noopener noreferrer"&gt;postgresql&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=linkedin" rel="noopener noreferrer"&gt;linkedin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;296&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=gitlab" rel="noopener noreferrer"&gt;gitlab&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;294&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=thinking" rel="noopener noreferrer"&gt;thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=spotify" rel="noopener noreferrer"&gt;spotify&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;284&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=kubernetes" rel="noopener noreferrer"&gt;kubernetes&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;264&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=shopify" rel="noopener noreferrer"&gt;shopify&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=flutter" rel="noopener noreferrer"&gt;flutter&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;258&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google%20drive" rel="noopener noreferrer"&gt;google drive&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=calendar" rel="noopener noreferrer"&gt;calendar&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=atlassian" rel="noopener noreferrer"&gt;atlassian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sqlite" rel="noopener noreferrer"&gt;sqlite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=whatsapp" rel="noopener noreferrer"&gt;whatsapp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=twitter" rel="noopener noreferrer"&gt;twitter&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Supabase" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;218&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google%20calendar" rel="noopener noreferrer"&gt;google calendar&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browsertools" rel="noopener noreferrer"&gt;browsertools&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=terminal" rel="noopener noreferrer"&gt;terminal&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=python" rel="noopener noreferrer"&gt;python&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Sequential%20Thinking" rel="noopener noreferrer"&gt;Sequential Thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=chrome" rel="noopener noreferrer"&gt;chrome&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=openapi" rel="noopener noreferrer"&gt;openapi&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=telegram" rel="noopener noreferrer"&gt;telegram&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=tavily" rel="noopener noreferrer"&gt;tavily&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=file%20system" rel="noopener noreferrer"&gt;file system&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;172&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=cloudflare" rel="noopener noreferrer"&gt;cloudflare&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=discord" rel="noopener noreferrer"&gt;discord&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20use" rel="noopener noreferrer"&gt;browser use&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=instagram" rel="noopener noreferrer"&gt;instagram&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;154&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=datadog" rel="noopener noreferrer"&gt;datadog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=crypto" rel="noopener noreferrer"&gt;crypto&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=fetch" rel="noopener noreferrer"&gt;fetch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=laravel" rel="noopener noreferrer"&gt;laravel&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=mongodb" rel="noopener noreferrer"&gt;mongodb&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=outlook" rel="noopener noreferrer"&gt;outlook&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=airtable" rel="noopener noreferrer"&gt;airtable&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;140&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=snowflake" rel="noopener noreferrer"&gt;snowflake&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgre" rel="noopener noreferrer"&gt;postgre&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=database" rel="noopener noreferrer"&gt;database&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=markdown" rel="noopener noreferrer"&gt;markdown&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=web%20search" rel="noopener noreferrer"&gt;web search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20tool" rel="noopener noreferrer"&gt;browser tool&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=stripe" rel="noopener noreferrer"&gt;stripe&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sql%20server" rel="noopener noreferrer"&gt;sql server&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=hubspot" rel="noopener noreferrer"&gt;hubspot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=powerpoint" rel="noopener noreferrer"&gt;powerpoint&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=ollama" rel="noopener noreferrer"&gt;ollama&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=bitbucket" rel="noopener noreferrer"&gt;bitbucket&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=microsoft" rel="noopener noreferrer"&gt;microsoft&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;122&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=clickhouse" rel="noopener noreferrer"&gt;clickhouse&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;122&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=clickup" rel="noopener noreferrer"&gt;clickup&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser" rel="noopener noreferrer"&gt;browser&lt;/a&gt; tools&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=claude" rel="noopener noreferrer"&gt;claude&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=deepseek" rel="noopener noreferrer"&gt;deepseek&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Notion" rel="noopener noreferrer"&gt;Notion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Obsidian" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;114&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=docker%20mcp" rel="noopener noreferrer"&gt;docker mcp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=mcp%20server%20fetch" rel="noopener noreferrer"&gt;mcp server fetch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postman" rel="noopener noreferrer"&gt;postman&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=terraform" rel="noopener noreferrer"&gt;terraform&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=vercel" rel="noopener noreferrer"&gt;vercel&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=replicate" rel="noopener noreferrer"&gt;replicate&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=research" rel="noopener noreferrer"&gt;research&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=brave%20search" rel="noopener noreferrer"&gt;brave search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=trello" rel="noopener noreferrer"&gt;trello&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=oracle" rel="noopener noreferrer"&gt;oracle&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=todoist" rel="noopener noreferrer"&gt;todoist&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=android" rel="noopener noreferrer"&gt;android&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=screenshot" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>GPT-4.5 Announced: How to Access the Latest OpenAI Model Without Rate Limits</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 27 Feb 2025 21:49:12 +0000</pubDate>
      <link>https://forem.com/punkpeye/gpt-45-announced-how-to-access-the-latest-openai-model-without-rate-limits-202l</link>
      <guid>https://forem.com/punkpeye/gpt-45-announced-how-to-access-the-latest-openai-model-without-rate-limits-202l</guid>
      <description>&lt;p&gt;OpenAI's latest research preview, GPT-4.5, marks a substantial leap forward in unsupervised learning, combining broader world knowledge, greater emotional intelligence ("EQ"), and vastly improved intuition to support natural and nuanced conversation. Designed with innovative scalable training techniques, GPT-4.5 excels in creative tasks, coding workflows, and meaningful interactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djepgxribv1pgnyk45i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djepgxribv1pgnyk45i.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Available Servers to Access GPT-4.5
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Official: &lt;a href="https://openai.com/api/" rel="noopener noreferrer"&gt;https://openai.com/api/&lt;/a&gt; — The official platform, but subject to rate limits.&lt;/li&gt;
&lt;li&gt;Glama AI: &lt;a href="https://glama.ai/models/gpt-4.5-preview-2025-02-27" rel="noopener noreferrer"&gt;GPT-4.5 Preview&lt;/a&gt; — Offers GPT-4.5 without rate limits, easy sign-up within 30 seconds.&lt;/li&gt;
&lt;li&gt;OpenRouter: &lt;a href="https://openrouter.ai/openai/gpt-4.5-preview" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — Another reliable platform without rate restrictions providing easy access.&lt;/li&gt;
&lt;/ul&gt;
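All three providers above expose an OpenAI-compatible chat-completions API, so the same request shape works against any of them. The sketch below builds such a request with only the standard library; the OpenRouter base URL and the `openai/gpt-4.5-preview` model slug are taken from the links above, but treat the exact values as assumptions and confirm them in each provider's docs.

```python
# Minimal sketch: build a chat-completions request for an
# OpenAI-compatible endpoint. Base URL and model slug are assumptions;
# substitute the values from whichever provider you pick above.
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Construct (but do not send) a chat-completions HTTP request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request(
    "https://openrouter.ai/api/v1",  # or another provider's base URL
    "YOUR_API_KEY",                  # placeholder
    "openai/gpt-4.5-preview",
    "Summarize the GPT-4.5 announcement in one sentence.",
)
# urllib.request.urlopen(req) would send it and return the JSON response.
```

Sending the request returns a JSON body whose `choices[0].message.content` holds the reply, assuming the provider follows the standard OpenAI response shape.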

&lt;h2&gt;
  
  
  Key Highlights from Official OpenAI Announcement
&lt;/h2&gt;

&lt;p&gt;OpenAI describes GPT-4.5 as a major step forward in scaling unsupervised learning. Surpassing previous generations in intuitive world knowledge, GPT-4.5 shows significantly lower hallucination rates and far higher accuracy on general knowledge queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Natural Interactions
&lt;/h3&gt;

&lt;p&gt;Early feedback indicates that GPT-4.5 interactions feel notably smoother and more organic, with stronger emotional intelligence (“EQ”) enabling more nuanced conversational exchanges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Capabilities at a Glance:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Higher factual accuracy across diverse topics (62.5% SimpleQA accuracy vs GPT-4o’s 38.2%).&lt;/li&gt;
&lt;li&gt;Reduced hallucination rate (37.1% vs GPT-4o’s 61.8%).&lt;/li&gt;
&lt;li&gt;Better collaborative intelligence, intuitive interactions, and creativity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Technical Advances:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Marked improvement in unsupervised learning, enhancing intuition and broadening world-model accuracy.&lt;/li&gt;
&lt;li&gt;New scalable training techniques enable GPT-4.5 to better understand and anticipate human intent, resulting in unprecedented conversational fluidity.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Immediate Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Creative and Writing Assistance: Exceptional at creative writing, content editing, and design tasks.&lt;/li&gt;
&lt;li&gt;Programming and Automation Workflows: Superb capabilities in executing complex coding and multi-step automation.&lt;/li&gt;
&lt;li&gt;Interpersonal Communication: Stronger emotional intelligence makes GPT-4.5 ideal for empathetic engagements, coaching scenarios, and even mental wellness interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Access GPT-4.5 Now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Official: &lt;a href="https://openai.com/api/" rel="noopener noreferrer"&gt;https://openai.com/api/&lt;/a&gt; — The official platform, but subject to rate limits.&lt;/li&gt;
&lt;li&gt;Glama AI: &lt;a href="https://glama.ai/models/gpt-4.5-preview-2025-02-27" rel="noopener noreferrer"&gt;GPT-4.5 Preview&lt;/a&gt; — Offers GPT-4.5 without rate limits, easy sign-up within 30 seconds.&lt;/li&gt;
&lt;li&gt;OpenRouter: &lt;a href="https://openrouter.ai/openai/gpt-4.5-preview" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — Another reliable platform without rate restrictions providing easy access.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>How to access DeepSeek r1?</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Wed, 29 Jan 2025 18:25:59 +0000</pubDate>
      <link>https://forem.com/punkpeye/how-to-access-deepseek-r1-44na</link>
      <guid>https://forem.com/punkpeye/how-to-access-deepseek-r1-44na</guid>
      <description>&lt;p&gt;Many people are struggling to get access to DeepSeek r1 at the moment because of the rate limits and restricted sign ups. However, there are alternative providers that you can use to access DeepSeek r1 and its distill models.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s start with &lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt; because it is the easiest model to get access to and probably the best balance of cost, performance and speed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt; is a distilled version of r1. This model is made by transferring the knowledge from the larger model to the smaller model through a process known as &lt;a href="https://en.wikipedia.org/wiki/Knowledge_distillation" rel="noopener noreferrer"&gt;knowledge distillation&lt;/a&gt;. The 32b qwen model in particular beats other models in several benchmarks, esp. &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#4-evaluation-results" rel="noopener noreferrer"&gt;coding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is only one provider that currently makes this model available for anyone to use: &lt;a href="https://glama.ai/gateway" rel="noopener noreferrer"&gt;Glama Gateway&lt;/a&gt;. Alternatively, you can self-host this model, but expect to need approx. 80 GB of VRAM.&lt;/p&gt;

&lt;p&gt;Providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://glama.ai/models/deepseek-r1-distill-qwen-32b" rel="noopener noreferrer"&gt;https://glama.ai/models/deepseek-r1-distill-qwen-32b&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The great thing about 32b is price and response times. It is currently cheaper than the official DeepSeek r1 and responds slightly faster than r1.&lt;/p&gt;

&lt;h2&gt;
  
  
  deepseek-r1-distill-llama-70b
&lt;/h2&gt;

&lt;p&gt;The 70b llama version is also a distilled version of DeepSeek r1. It is based on llama, meaning that there are more providers available.&lt;/p&gt;

&lt;p&gt;Groq is one of the noteworthy providers that offer this model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.groq.com/docs/models" rel="noopener noreferrer"&gt;https://console.groq.com/docs/models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benefit of using Groq is that it is extremely fast (upwards of 300 tokens per second for this model).&lt;/p&gt;

&lt;p&gt;The downside is that the model is severely rate limited. Depending on what you are planning to do, the current rate limits (30k tokens per minute) might not be enough.&lt;/p&gt;

&lt;p&gt;You can also access this model through Glama — &lt;a href="https://glama.ai/models/deepseek-r1-distill-llama-70b" rel="noopener noreferrer"&gt;deepseek-r1-distill-llama-70b&lt;/a&gt;. As a gateway provider, Glama has slightly elevated rate limits and can offer up to 60k tokens per minute.&lt;/p&gt;
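To make those tokens-per-minute caps concrete, you can compute the minimum spacing between requests that keeps sustained usage under a given limit. This is a generic sketch; the 30k and 60k figures come from the limits quoted above and may change.

```python
# Sketch: pace requests to stay under a tokens-per-minute (TPM) cap.
# The TPM figures below come from the limits quoted in the article
# and may change; check your provider's current quotas.
def min_seconds_between_requests(tokens_per_request, tpm_limit):
    """Minimum gap between requests so sustained usage stays under the cap."""
    return 60.0 * tokens_per_request / tpm_limit

# At roughly 15k tokens per request:
groq_gap = min_seconds_between_requests(15_000, 30_000)   # under a 30k TPM limit
glama_gap = min_seconds_between_requests(15_000, 60_000)  # under a 60k TPM limit
```

Doubling the TPM budget halves the required gap, which is the practical difference between the two limits discussed above.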

&lt;p&gt;Other providers to evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://deepinfra.com/deepseek-ai/DeepSeek-R1-Distill-Llama-70B" rel="noopener noreferrer"&gt;https://deepinfra.com/deepseek-ai/DeepSeek-R1-Distill-Llama-70B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://novita.ai/models/llm/deepseek-deepseek-r1-distill-llama-70b" rel="noopener noreferrer"&gt;https://novita.ai/models/llm/deepseek-deepseek-r1-distill-llama-70b&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will update this article as I discover other providers.&lt;/p&gt;

&lt;p&gt;If you were planning to host this model yourself, bear in mind that it requires a lot of VRAM (approx. 140 GB). While it is possible to host it on a lower-spec machine, the performance will be subpar.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek r1
&lt;/h2&gt;

&lt;p&gt;Finally, if you are trying to get &lt;code&gt;deepseek-r1&lt;/code&gt;, your best bet remains waiting for deepseek.com to clear out the backlog of demand. Allegedly, they are currently experiencing a DDoS attack, so new user sign-ups are currently restricted.&lt;/p&gt;

&lt;p&gt;A few other providers that claim to offer r1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://novita.ai/models/llm/deepseek-deepseek-r1" rel="noopener noreferrer"&gt;https://novita.ai/models/llm/deepseek-deepseek-r1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fireworks.ai/models/fireworks/deepseek-r1" rel="noopener noreferrer"&gt;https://fireworks.ai/models/fireworks/deepseek-r1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I explicitly say “claim to offer” because many of them are oversubscribed at the moment and unable to meet demand. Even if you sign up, you might still hit rate limits.&lt;/p&gt;

&lt;p&gt;Unfortunately, hosting r1 yourself is not a viable option for most of us. It is a 671B-parameter model, meaning that you would need at least 1,342 GB of VRAM to host it, which is beyond the reach of any home user.&lt;/p&gt;
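The VRAM figures in this article (approx. 140 GB for the 70B distill, 1,342 GB for r1) follow from a simple rule of thumb: at fp16, every parameter takes two bytes, so the weights alone need about 2 GB per billion parameters, before KV cache and activation overhead.

```python
# Rule of thumb: fp16 weights take 2 bytes per parameter,
# i.e. roughly 2 GB of VRAM per billion parameters.
# This covers weights only; KV cache and activations add more on top,
# and quantized formats (8-bit, 4-bit) proportionally reduce it.
def fp16_weight_gb(params_billions):
    """Approximate VRAM (GB) needed for fp16 weights alone."""
    return params_billions * 2

print(fp16_weight_gb(70))   # 140 GB for the 70B distill
print(fp16_weight_gb(671))  # 1342 GB for the full r1
```

This also explains why quantized builds are the only realistic path to running the larger distills on consumer hardware.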

&lt;p&gt;If you become aware of other providers, please leave a comment and I will add them to the list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Distill Models
&lt;/h2&gt;

&lt;p&gt;There are many other &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;distilled versions available&lt;/a&gt;. If your goal is to run the model locally, you should evaluate them based on the benchmarks in that GitHub repository. Some small models (like the 1.5B and 7B) can reasonably be run on your local machine and perform decently well.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
    </item>
    <item>
      <title>Building ask-PDF service using LLMs</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sun, 17 Nov 2024 17:58:32 +0000</pubDate>
      <link>https://forem.com/punkpeye/building-ask-pdf-service-using-llms-2km9</link>
      <guid>https://forem.com/punkpeye/building-ask-pdf-service-using-llms-2km9</guid>
      <description>&lt;p&gt;I wanted to add a feature to &lt;a href="https://glama.ai" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; that allows users to upload documents and ask questions about them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcviip1rbt34g4m1oxqrq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcviip1rbt34g4m1oxqrq.gif" alt=" " width="760" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've built similar features before, but they were always domain specific. For example, looking up recipes, searching for products, etc. A generalized solution posed a few unexpected challenges: converting documents to markdown, splitting them, indexing them, and retrieving them all turned out to be quite complex.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through the strategy of splitting documents into smaller chunks, since this took me a while to figure out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When you have a domain-specific RAG, it is typically easy to just create a dedicated record for every entity in the domain. For example, if you are building a recipe RAG, you might have a record for each recipe, ingredient, and step. You don't have to worry about splitting the document into chunks, since you already know the semantic structure of the document.&lt;/p&gt;

&lt;p&gt;However, when you have a generalized RAG, your input is just a document. Any document. Even when you convert the document to markdown (which has some structure), you still have to figure out how to split it into context-aware chunks.&lt;/p&gt;

&lt;p&gt;Suppose a user uploads a document like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Recipe Book&lt;/span&gt;

&lt;span class="gu"&gt;## Recipe 1&lt;/span&gt;

Name: Chocolate Chip Cookies

&lt;span class="gu"&gt;### Ingredients&lt;/span&gt;
&lt;span class="p"&gt;
*&lt;/span&gt; 2 cups all-purpose flour
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup granulated sugar
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup unsalted butter, at room temperature
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup light brown sugar, packed
&lt;span class="p"&gt;*&lt;/span&gt; 2 large eggs
&lt;span class="p"&gt;*&lt;/span&gt; 2 teaspoons vanilla extract
&lt;span class="p"&gt;*&lt;/span&gt; 2 cups semi-sweet chocolate chips

&lt;span class="gu"&gt;### Instructions&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.
&lt;span class="p"&gt;2.&lt;/span&gt; In a medium bowl, whisk together flour, sugar, and butter.
&lt;span class="p"&gt;3.&lt;/span&gt; In a large bowl, beat the egg yolks and the egg whites together.
&lt;span class="p"&gt;4.&lt;/span&gt; Stir in the vanilla.
&lt;span class="p"&gt;5.&lt;/span&gt; Gradually stir in the flour mixture until a dough forms.
&lt;span class="p"&gt;6.&lt;/span&gt; Fold in the chocolate chips.
&lt;span class="p"&gt;7.&lt;/span&gt; Drop the dough by rounded tablespoons onto the prepared baking sheet.
&lt;span class="p"&gt;8.&lt;/span&gt; Bake for 8-10 minutes, or until the edges are golden brown.
&lt;span class="p"&gt;9.&lt;/span&gt; Let cool for a few minutes before transferring to a wire rack to cool completely.

&lt;span class="gu"&gt;## Recipe 2 ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we knew it is a recipe book, we could just split the document into chunks based on the &lt;code&gt;## Recipe 1&lt;/code&gt; and &lt;code&gt;## Recipe 2&lt;/code&gt; headers. However, since we don't know the structure of the document, we can't just split it based on headers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If we split too high (&lt;code&gt;h2&lt;/code&gt;), we might end up with chunks that are too large&lt;/li&gt;
&lt;li&gt;If we split too low (&lt;code&gt;h3&lt;/code&gt;), we might end up with many small chunks that lack the context needed to answer the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So we need to split the document such that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each chunk has useful embeddings&lt;/li&gt;
&lt;li&gt;Each retrieved chunk carries sufficient context to answer the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sounds like an impossible task, right? Well, it is. But I found a solution that works pretty well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The solution is a combination of several techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Splitting
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Parsing the document into a tree structure&lt;/li&gt;
&lt;li&gt;Splitting each node in the tree into semantically meaningful chunks&lt;/li&gt;
&lt;/ol&gt;
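As a minimal sketch of step 1 (assuming plain ATX `#` headings and no fenced code blocks, whose `#` lines would confuse it; a real implementation should use a proper markdown parser), the tree can be built with a stack keyed on heading depth:

```python
# Sketch of step 1: parse markdown into a tree of sections.
# Assumes ATX "#" headings and no fenced code blocks; a production
# version should walk a real markdown AST instead of raw lines.
def parse_tree(markdown):
    root = {"children": [], "content": "", "heading": None}
    stack = [(0, root)]  # (heading depth, node)
    for line in markdown.splitlines(keepends=True):
        if line.startswith("#"):
            depth = len(line) - len(line.lstrip("#"))
            node = {
                "children": [],
                "content": line,
                "heading": {"depth": depth, "title": line.strip("# \n")},
            }
            # Unwind to the nearest shallower section, then attach.
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((depth, node))
        else:
            # Non-heading lines accumulate on the current section.
            stack[-1][1]["content"] += line
    return root
```

Run on the recipe document, this produces the nested shape shown below, with the Ingredients and Instructions sections hanging off their parent recipe.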

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Using our example document, the tree structure would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"### Ingredients&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 cups all-purpose flour&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup granulated sugar&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup unsalted butter, at room temperature&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup light brown sugar, packed&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 large eggs&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 teaspoons vanilla extract&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 cups semi-sweet chocolate chips&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ingredients"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"### Instructions&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;1. Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. In a medium bowl, whisk together flour, sugar, and butter.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. In a large bowl, beat the egg yolks and the egg whites together.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;4. Stir in the vanilla.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;5. Gradually stir in the flour mixture until a dough forms.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;6. Fold in the chocolate chips.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;7. Drop the dough by rounded tablespoons onto the prepared baking sheet.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;8. Bake for 8-10 minutes, or until the edges are golden brown.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;9. Let cool for a few minutes before transferring to a wire rack to cool completely.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Instructions"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Recipe 1&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Name: Chocolate Chip Cookies&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe 1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Recipe 2 ...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe 2"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe Book"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benefit of this structure is that we can now store these sections in a database while retaining their hierarchical structure. Here is the database schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;                              Table "public.document_section"
           Column           |  Type   | Collation | Nullable |           Default
----------------------------+---------+-----------+----------+------------------------------
 id                         | integer |           | not null | generated always as identity
 uploaded_document_id       | integer |           | not null |
 parent_document_section_id | integer |           |          |
 heading_title              | text    |           | not null |
 heading_depth              | integer |           | not null |
 content                    | text    |           |          |
 sequence_number            | integer |           | not null |
 path                       | ltree   |           | not null |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
The &lt;code&gt;path&lt;/code&gt; column is a PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/ltree.html" rel="noopener noreferrer"&gt;ltree&lt;/a&gt; column that allows us to store the hierarchical structure of the document. This is useful for querying later on.&lt;/p&gt;
&lt;/blockquote&gt;
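&lt;p&gt;With &lt;code&gt;ltree&lt;/code&gt; paths in place, fetching a section together with all of its descendants becomes a single query. A minimal sketch (the &lt;code&gt;id&lt;/code&gt; value is hypothetical):&lt;/p&gt;

```sql
-- Sketch: fetch a section and its entire subtree.
-- "<@" is ltree's "is a descendant of (or equal to)" operator.
SELECT id, heading_title, heading_depth, path
FROM document_section
WHERE path <@ (
  SELECT path
  FROM document_section
  WHERE id = 42 -- hypothetical section id
)
ORDER BY path;
```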

&lt;p&gt;However, this alone is not enough. Since each section can be arbitrarily long, we need to split sections into smaller chunks. This also allows us to create more granular embeddings for each chunk.&lt;/p&gt;

&lt;p&gt;I ended up using &lt;a href="https://github.com/syntax-tree/mdast" rel="noopener noreferrer"&gt;&lt;code&gt;mdast&lt;/code&gt;&lt;/a&gt; to split each section into chunks between 1000 and 2000 characters. I made exceptions for tables, code blocks, blockquotes, and lists.&lt;/p&gt;
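&lt;p&gt;The chunking step can be sketched as follows. This is a simplified illustration, not the actual &lt;code&gt;mdast&lt;/code&gt;-based implementation: it splits on blank lines instead of walking the syntax tree, and it ignores the exceptions for tables, code blocks, blockquotes, and lists:&lt;/p&gt;

```typescript
// Simplified sketch: greedily pack markdown blocks into chunks,
// flushing a chunk once adding the next block would exceed the
// upper bound (~2000 characters).
const MAX_CHUNK_LENGTH = 2000;

const chunkSection = (content: string): string[] => {
  const blocks = content.split(/\n\n+/);
  const chunks: string[] = [];
  let current = '';

  for (const block of blocks) {
    const candidate = current ? current + '\n\n' + block : block;

    if (current && candidate.length > MAX_CHUNK_LENGTH) {
      // Adding this block would overflow the chunk; start a new one.
      chunks.push(current);
      current = block;
    } else {
      current = candidate;
    }
  }

  if (current) {
    chunks.push(current);
  }

  return chunks;
};
```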

&lt;p&gt;Here is the resulting database schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;                          Table "public.document_section_chunk"
       Column        |     Type     | Collation | Nullable |           Default
---------------------+--------------+-----------+----------+------------------------------
 id                  | integer      |           | not null | generated always as identity
 document_section_id | integer      |           | not null |
 chunk_index         | integer      |           | not null |
 content             | text         |           | not null |
 embedding           | vector(1024) |           | not null |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;embedding&lt;/code&gt; column uses the PostgreSQL &lt;code&gt;vector&lt;/code&gt; type and stores the embedding of the chunk. I used &lt;code&gt;jina-embeddings-v3&lt;/code&gt; to create the embeddings. I picked a model that scores relatively well on the &lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB&lt;/a&gt; leaderboard while remaining relatively light on memory usage.&lt;/p&gt;

&lt;p&gt;Okay, so now we have a database that stores the document sections and their embeddings. The next step is to create a RAG pipeline that can retrieve the relevant sections/chunks for a given question.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;

&lt;p&gt;Retrieval is the process of finding the relevant chunks for a given question.&lt;/p&gt;

&lt;p&gt;My process was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use LLMs to generate several search queries &lt;em&gt;based&lt;/em&gt; on the user's input. For example, if the user asks "What is the recipe for chocolate chip cookies?", the LLMs generate queries that break the question into smaller parts, e.g. "chocolate chip cookies ingredients", "chocolate chip cookies instructions", etc.&lt;/li&gt;
&lt;li&gt;Query the database to find the top &lt;code&gt;N&lt;/code&gt; chunks that match the generated queries.&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;document_section_chunk&lt;/code&gt; and &lt;code&gt;document_section&lt;/code&gt; relationship to identify which sections the chunks belong to, and which sections are referenced most frequently by the matching chunks.&lt;/li&gt;
&lt;/ol&gt;
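&lt;p&gt;Step 2 can be expressed as one query per generated search query. A hypothetical sketch, assuming the &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; extension provides the vector type, where &lt;code&gt;$1&lt;/code&gt; is the embedding of the generated query:&lt;/p&gt;

```sql
-- Sketch: top 10 chunks closest to the query embedding
-- ("<=>" is pgvector's cosine-distance operator).
SELECT
  id,
  document_section_id,
  embedding <=> $1 AS cosine_distance
FROM document_section_chunk
ORDER BY embedding <=> $1
LIMIT 10;
```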

&lt;p&gt;At this point, we know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;which chunks are the most relevant to the question&lt;/li&gt;
&lt;li&gt;which sections are the most relevant to the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most relevant sections/chunks are determined by ordering the candidates by &lt;em&gt;cosine distance&lt;/em&gt; between their embeddings and the query embedding.&lt;/p&gt;

&lt;p&gt;However, we don't know which sections/chunks can actually be used to answer the question; just because a chunk has a low cosine distance to the question does not mean that the chunk answers it. For this step, I ended up using another LLM prompt: it is given the question and the candidate chunks, and asks the LLM to rank the chunks by how well they answer the question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
I later learned that Jina has a &lt;a href="https://jina.ai/reranker/" rel="noopener noreferrer"&gt;Reranker&lt;/a&gt; API that does essentially the same thing. I compared the two approaches and found that both solutions perform equally well. However, if you prefer a higher level of abstraction, Reranker is a good choice.&lt;/p&gt;
&lt;/blockquote&gt;
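&lt;p&gt;To give a flavor of the ranking step, here is a minimal sketch of such a prompt (not the exact wording I used):&lt;/p&gt;

```text
You are given a question and a list of candidate chunks.

Question:

[question]

Candidate chunks:

[chunks]

Score each chunk from 0 (irrelevant to the question) to 10 (fully
answers the question). Respond with a JSON array of objects with
"chunkId" and "score" properties, ordered from highest to lowest score.
```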

&lt;p&gt;Finally, I have a handful of sections/chunks that answer the question. The last step is to determine which sections/chunks to include in the final answer. I do this by assigning a finite budget to each question (e.g. 1000 tokens), and then prioritizing the most relevant sections/chunks until the budget is exhausted. The reason sections and chunks are kept separate is that sometimes a single section answers the whole question and fits in the budget, while other times we need to include the more granular chunks instead.&lt;/p&gt;
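&lt;p&gt;The budgeting step can be sketched as a greedy selection. The names and the &lt;code&gt;tokens&lt;/code&gt; field are hypothetical; in practice the token counts would come from a tokenizer:&lt;/p&gt;

```typescript
// Sketch: pick the highest-relevance candidates that still fit
// within the token budget.
type Candidate = {
  content: string;
  relevance: number; // higher is better
  tokens: number;
};

const selectWithinBudget = (
  candidates: Candidate[],
  budget: number,
): Candidate[] => {
  const ranked = [...candidates].sort((a, b) => b.relevance - a.relevance);
  const selected: Candidate[] = [];
  let used = 0;

  for (const candidate of ranked) {
    if (used + candidate.tokens <= budget) {
      selected.push(candidate);
      used += candidate.tokens;
    }
  }

  return selected;
};
```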

&lt;h2&gt;
  
  
  Further Improvements
&lt;/h2&gt;

&lt;p&gt;As I started typing this post, I realized that there are too many subtle details to cover; mentioning them all would make the post far too long.&lt;/p&gt;

&lt;p&gt;A few things I want to mention that helped me improve the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I use a simple LLM to generate a brief description of each section. I then create embeddings for those descriptions and use them as part of the logic used to determine which sections to include in the answer.&lt;/li&gt;
&lt;li&gt;I include meta information about each section in the generated answer. For example, the section title, depth, and the surrounding section names.&lt;/li&gt;
&lt;li&gt;I provide multiple tools to the LLMs to help answer the question, e.g. a tool to look up all mentions of a term in the document, a tool to look up the next section in the document, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, I think the biggest innovation of this approach is splitting markdown documents into a hierarchical structure, and then splitting each section into smaller chunks. This makes it possible to build a generalized RAG that can answer questions about any markdown document.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Implementing Tool Functionality in Conversational AI</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 17 Oct 2024 23:32:49 +0000</pubDate>
      <link>https://forem.com/punkpeye/implementing-tool-functionality-in-conversational-ai-3n2g</link>
      <guid>https://forem.com/punkpeye/implementing-tool-functionality-in-conversational-ai-3n2g</guid>
      <description>&lt;p&gt;As part of building &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I am trying to build a deeper understanding of the concepts behind existing services, such as OpenAI's &lt;a href="https://platform.openai.com/docs/assistants/tools" rel="noopener noreferrer"&gt;assistant tools&lt;/a&gt;. So I decided to write a small PoC that attempts to replicate the functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are assistant tools?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jrmyphlmn.com/posts/sequential-function-calls" rel="noopener noreferrer"&gt;This blog post&lt;/a&gt; captures well the concepts behind the tools. In short, the tools are a way to define a set of functions that can be called by the model in response to user queries. Furthermore, the model can call multiple functions in sequence to answer complex queries. It can deduce the correct order of function calls to complete a task, eliminating the need for complex routing logic.&lt;/p&gt;

&lt;p&gt;Practical examples of tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetching information from external sources (e.g. fetching the current weather in a given location)&lt;/li&gt;
&lt;li&gt;Calculating complex mathematical expressions (e.g. calculating the total cost of a shopping cart)&lt;/li&gt;
&lt;li&gt;Performing actions on the user's behalf (e.g. sending an email)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, not all models support tools. I wanted to write my own routing implementation so that I could enable access to the tools for all models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing tools
&lt;/h2&gt;

&lt;p&gt;I started with writing a simple test case that describes the happy path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./routeMessage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;uses tools if relevant tools are available&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;routeMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;What is 2+2?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Adds two numbers.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;addNumbers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toEqual&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;addNumbers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expectation is that the &lt;code&gt;routeMessage&lt;/code&gt; function will understand that the user is asking for the sum of two numbers, and will plan a call to the &lt;code&gt;addNumbers&lt;/code&gt; tool to get the result.&lt;/p&gt;

&lt;p&gt;In order to do this, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Describe the tools available to the model&lt;/li&gt;
&lt;li&gt;Provide the model with the conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Describing tools
&lt;/h3&gt;

&lt;p&gt;Tool descriptions need to be expressed in a way that the model can understand. I simply defaulted to using JSON.&lt;/p&gt;

&lt;p&gt;The only complexity here is that I've used &lt;a href="https://www.npmjs.com/package/zod" rel="noopener noreferrer"&gt;&lt;code&gt;zod&lt;/code&gt;&lt;/a&gt; to describe the expected parameters and response schema of the tools. So the first thing we need to do is convert the &lt;code&gt;zod&lt;/code&gt; schema to JSON schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zodToJsonSchema&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod-to-json-schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;describeTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;zodToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;describeTool&lt;/code&gt; is a helper function that we will use to serialize tools to JSON.&lt;/p&gt;

&lt;p&gt;Now that we have a way to describe tools, we need to describe the conversation history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Describing conversation history
&lt;/h3&gt;

&lt;p&gt;I am using the &lt;a href="https://www.npmjs.com/package/ai" rel="noopener noreferrer"&gt;&lt;code&gt;ai&lt;/code&gt;&lt;/a&gt; library's message format because most readers are familiar with it. However, this implementation does not depend on any specific framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ReadonlyArray&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I like the &lt;code&gt;ai&lt;/code&gt; library message format because it captures every important aspect of the conversation, including the role of each participant, the content of the message, and the tools used. We need the conversation history to include tool invocations so that the model has context about which tools were already used in the conversation.&lt;/p&gt;
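&lt;p&gt;For illustration, a conversation history that records a prior tool invocation might look like this. This is a hedged sketch loosely following the &lt;code&gt;ai&lt;/code&gt; library's message shapes; check the library's documentation for the exact types:&lt;/p&gt;

```typescript
// Sketch: a user question, the assistant's tool call, and the tool result.
// Property names follow the ai library's CoreMessage format as I understand
// it; treat them as illustrative.
const messages = [
  { content: 'What is 2+2?', role: 'user' as const },
  {
    content: [
      {
        args: { a: 2, b: 2 },
        toolCallId: 'call_1',
        toolName: 'addNumbers',
        type: 'tool-call' as const,
      },
    ],
    role: 'assistant' as const,
  },
  {
    content: [
      {
        result: { sum: 4 },
        toolCallId: 'call_1',
        toolName: 'addNumbers',
        type: 'tool-result' as const,
      },
    ],
    role: 'tool' as const,
  },
];
```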

&lt;h3&gt;
  
  
  Writing the prompt
&lt;/h3&gt;

&lt;p&gt;In the end, the entire routing logic is expressed as a prompt.&lt;/p&gt;

&lt;p&gt;I've experimented with different prompts, and this is what I landed on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an assistant with access to the following tools:

[tools]

Tools are described in the following format:

* "description" describes what the tool does.
* "name" is the name of the tool.
* "parameters" is the JSON schema of the tool parameters (or null if the tool does not have parameters).

You are also given the following conversation history:

[messages]

The conversation history is a list of messages exchanged between the user and the assistant. It may also describe previous actions taken by the assistant.

Based on the conversation history, and the tools you have access to, propose a plan for how to answer the user's question.

The response should be a JSON object with "actions" property, which is an array of tools to use. Each tool is represented as an object with the following properties:

* "name": the name of the tool to use.
* "parameters": the parameters to pass to the tool (or null if the tool does not have parameters).

The same tool can be used multiple times in the plan.

If the conversation does not necessitate the use of tools, respond with an empty action plan, e.g.,

{
  actions: [],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing the routing logic
&lt;/h2&gt;

&lt;p&gt;Now that we have all the pieces in place, we can put them together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
&lt;code&gt;quickPrompt&lt;/code&gt; is a simple utility function that I use to execute prompts with an expected response schema.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;quickPrompt&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./quickPrompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;multiline&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;multiline-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zodToJsonSchema&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod-to-json-schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SerializableZodSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;()]),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;describeTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;zodToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ReadonlyArray&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;quickPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai@gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;routeMessage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;multiline&lt;/span&gt;&lt;span class="s2"&gt;`
      You are an assistant with access to the following tools:

      &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;describeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

      Tools are described in the following format:

      * "description" describes what the tool does.
      * "name" is the name of the tool.
      * "parameters" is the JSON schema of the tool parameters (or null if the tool does not have parameters).

      You are also given the following conversation history:

      &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

      The conversation history is a list of messages exchanged between the user and the assistant. It may also describe previous actions taken by the assistant.

      Based on the conversation history, and the tools you have access to, propose a plan for how to answer the user's question.

      The response should be a JSON object with "actions" property, which is an array of tools to use. Each tool is represented as an object with the following properties:

      * "name": the name of the tool to use.
      * "parameters": the parameters to pass to the tool (or null if the tool does not have parameters).

      The same tool can be used multiple times in the plan.

      If the conversation does not necessitate the use of tools, respond with an empty action plan, e.g.,

      {
        actions: [],
      }
    `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;zodSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SerializableZodSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function captures the essence of the routing logic. It takes the conversation history and the tools available to the model, and returns a plan for how to answer the user's question.&lt;/p&gt;
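As an illustration of what ends up in the prompt, here is roughly what `describeTool` would produce for a hypothetical weather tool whose parameters schema is `z.object({ city: z.string() })`. The exact JSON schema shape is an approximation of what `zodToJsonSchema` emits:

```typescript
// Illustrative output of describeTool for a hypothetical tool.
// The parameters object approximates what zodToJsonSchema produces
// for z.object({ city: z.string() }); the exact output may differ.
const describedWeatherTool = {
  description: 'Fetches the current weather for a city.',
  name: 'getCurrentWeather',
  parameters: {
    type: 'object',
    properties: {
      city: { type: 'string' },
    },
    required: ['city'],
    additionalProperties: false,
  },
};

console.log(JSON.stringify(describedWeatherTool, null, 2));
```

This serialized form is what the model sees in place of `[tools]` in the prompt.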

&lt;h2&gt;
  
  
  Evaluating the routing logic
&lt;/h2&gt;

&lt;p&gt;The idea is that every time the user asks a question, we should use &lt;code&gt;routeMessage&lt;/code&gt; to determine if the question requires the use of tools, and if so, which tools to use.&lt;/p&gt;

&lt;p&gt;Inside &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I am using &lt;code&gt;routeMessage&lt;/code&gt; in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// We may have multiple cycles of invocations.&lt;/span&gt;
&lt;span class="c1"&gt;// See the explanation after this code example.&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolCycle&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_TOOL_CYCLES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;routeMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// If `routeMessage` returns an empty plan, it means that the conversation does not require the use of tools.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// For each tool we use, we need to record the invocation and the result in the conversation history.&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolCallId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-call&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;invokeUserTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-result&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;toolCycle&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Now pass the messages history to the LLM which will use the recorded tool calls to generate a response.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamAssistantResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;abortController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;visitor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visitor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above is mostly self-explanatory. The only tricky part is that we need to invoke &lt;code&gt;routeMessage&lt;/code&gt; in a loop because the model may need to use multiple tools to answer the user's question. For example, if the user asks 'What's the weather in New York?', the model may first use a tool to geocode the location and then use another tool to fetch the current weather at the resulting coordinates.&lt;/p&gt;
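To make the multi-cycle behavior concrete, here is roughly how the plans could evolve across cycles for that weather question. The tool names and parameters are made up for illustration:

```typescript
// Cycle 1: the router only has the user's question, so it plans the
// geocoding step first.
const firstPlan = {
  actions: [{ name: 'geocodeLocation', parameters: { query: 'New York' } }],
};

// Cycle 2: the geocoding result is now part of the conversation
// history, so the router can plan the weather lookup.
const secondPlan = {
  actions: [
    {
      name: 'getCurrentWeather',
      parameters: { latitude: 40.71, longitude: -74.01 },
    },
  ],
};

// Cycle 3: nothing is left to do, so the router returns an empty plan
// and the loop breaks.
const thirdPlan = { actions: [] };
```

Each cycle re-runs `routeMessage` with the updated history, so the second plan can use the coordinates produced by the first.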

&lt;p&gt;That is the entirety of the routing logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of self-implemented routing logic
&lt;/h2&gt;

&lt;p&gt;In the end, I prefer to implement the routing logic myself because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It allows me to use tools with models that do not natively support tools.&lt;/li&gt;
&lt;li&gt;I can expand the logic for how the tools are resolved. For example, I may want to load only a subset of tools based on the user's prompt, or prioritize tools based on their frequency of use.&lt;/li&gt;
&lt;li&gt;There is no ambiguity about the cost of using tools. To this day, I have no clue what OpenAI's pricing for tool use is.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>node</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Replacing GitHub Copilot with Local LLMs</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Fri, 11 Oct 2024 19:11:59 +0000</pubDate>
      <link>https://forem.com/punkpeye/replacing-github-copilot-with-local-llms-ce9</link>
      <guid>https://forem.com/punkpeye/replacing-github-copilot-with-local-llms-ce9</guid>
      <description>&lt;p&gt;As part of developing &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I try to stay at the cutting edge of everything AI, especially when it comes to LLM-enabled development. I've tried &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://supermaven.com/" rel="noopener noreferrer"&gt;Supermaven&lt;/a&gt;, and many other AI code completion tools. However, earlier this week I gave a try to locally hosted LLMs and &lt;em&gt;I am not coming back&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;These instructions assume that you are a macOS user.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The setup takes no more than a few minutes.&lt;/p&gt;

&lt;p&gt;Download and install &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What about &lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;? I saw a few posts debating one over the other. LM Studio has an intuitive UI; Ollama does not. However, my research led me to believe that &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1c18hgj/why_ollama_faster_than_lmstudio/" rel="noopener noreferrer"&gt;Ollama is faster than LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install the model that you want to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull starcoder2:3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've evaluated a few and landed on &lt;code&gt;starcoder2:3b&lt;/code&gt;. It provides a good balance of usefulness and inference speed.&lt;/p&gt;

&lt;p&gt;For context, the following table shows the speed of each model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;tokens/second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starcoder2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codestral:22b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
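If you want comparable numbers on your own machine, one way (assuming Ollama is installed and the model has been pulled) is to run a prompt with Ollama's `--verbose` flag, which prints timing statistics, including the generation speed in tokens per second:

```shell
# Prints timing stats after the response, including "eval rate"
# (generation speed in tokens/s). Requires a local Ollama install
# and a previously pulled model.
ollama run starcoder2:3b --verbose "Write a hello world program in Python."
```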

&lt;p&gt;Finally, install &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;continue.dev&lt;/a&gt;, a VSCode extension that enables tab completion (and chat) using local LLMs.&lt;/p&gt;

&lt;p&gt;Then update continue.dev settings to use the desired model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starcoder2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"starcoder2:3b"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tabAutocompleteModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starcoder2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"starcoder2:3b"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart VSCode and you should be good to go.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ensure that you've disabled GitHub Copilot and other overlapping VSCode extensions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Pros and Cons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offline Availability&lt;/strong&gt;: Work anywhere without relying on an internet connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Your code and prompts never leave your machine, ensuring maximum data privacy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization&lt;/strong&gt;: Ability to fine-tune models to your specific needs or codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Subscription Costs&lt;/strong&gt;: Once set up, there are no ongoing fees unlike many cloud-based services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Performance&lt;/strong&gt;: No latency issues due to poor internet connection or server load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Source&lt;/strong&gt;: Many local LLMs are open-source, allowing for community improvements and transparency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Setup Time&lt;/strong&gt;: Requires some time and technical knowledge to set up properly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware Requirements&lt;/strong&gt;: Local LLMs can be resource-intensive, requiring a reasonably powerful machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Model Size&lt;/strong&gt;: Typically, local models are smaller than their cloud-based counterparts, which might affect performance for some tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Updates&lt;/strong&gt;: You need to manually update models and tools to get the latest improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;I was hesitant to adopt local LLMs because services like GitHub Copilot "just work." However, as I've been traveling the world, I found myself often regretting having to depend on an Internet connection for my auto completions. In that sense, switching to a local model has been a huge win for me. If Internet connectivity were not an issue, I think services like Supermaven would still be very appealing and worth the cost.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are not familiar with &lt;a href="https://supermaven.com/" rel="noopener noreferrer"&gt;Supermaven&lt;/a&gt; and you are okay with depending on an Internet connection, it's worth checking out. Compared to GitHub Copilot, I found Supermaven's auto completion to be much more reliable and &lt;em&gt;much&lt;/em&gt; faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, if you are like me and want your code completion to work with or without an Internet connection, then this is definitely worth a try.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>development</category>
      <category>ai</category>
      <category>vscode</category>
    </item>
    <item>
      <title>No Single LLM Can Be Trusted in Isolation</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 10 Oct 2024 22:40:27 +0000</pubDate>
      <link>https://forem.com/punkpeye/no-single-llm-can-be-trusted-in-isolation-1nh2</link>
      <guid>https://forem.com/punkpeye/no-single-llm-can-be-trusted-in-isolation-1nh2</guid>
      <description>&lt;p&gt;I started building &lt;a href="https://glama.ai" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; after a simple observation: no single LLM can be trusted in isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Awe to Skepticism
&lt;/h2&gt;

&lt;p&gt;Like many others, my first exposure to LLMs was through &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;'s &lt;a href="https://en.wikipedia.org/wiki/GPT-2" rel="noopener noreferrer"&gt;GPT-2 model&lt;/a&gt;. At first, I would compose a prompt, share it with the model, and usually accept its response as "most likely correct." However, like many others at the time, I still viewed the technology as a promise of what would be possible in the future, rather than as a trustworthy peer to consult.&lt;/p&gt;

&lt;p&gt;Later, in June 2020, when &lt;a href="https://en.wikipedia.org/wiki/GPT-3" rel="noopener noreferrer"&gt;GPT-3&lt;/a&gt; came out, wowed by many incredible demos, I began exploring what it is like to rely on LLMs for help with everyday tasks in my domain of expertise. This is where my trust in LLMs began to diminish...&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust, but Verify
&lt;/h2&gt;

&lt;p&gt;There is a phenomenon known as the &lt;a href="https://www.epsilontheory.com/gell-mann-amnesia/" rel="noopener noreferrer"&gt;Gell-Mann Amnesia effect&lt;/a&gt;. The effect describes how an expert can spot numerous errors in an article about their field but then accept information on other subjects as accurate, forgetting the flaws they just identified. Being aware of the phenomenon, and observing the frequency of errors in the information I was receiving, I stopped trusting LLMs without validating their responses.&lt;/p&gt;

&lt;p&gt;Over time, more models started to appear, each one making more grandiose claims than the last. I started to experiment with all of them. No matter what the prompt was, I developed a habit of copy-pasting it across multiple models, such as those from OpenAI, Anthropic (Claude), and Google (Gemini). This change in behavior led me to a further insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A single LLM might be unreliable, but when multiple models independently reach the same conclusion, it boosts confidence in the accuracy of the information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a result, my trust in LLMs became proportional to the level of consensus achieved by consulting multiple models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of LLMs
&lt;/h2&gt;

&lt;p&gt;We've established that relying on any single LLM is dangerous. Based on my understanding of the technology, I believe this limitation is inherent to LLMs (rather than a question of model quality), for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dataset Bias&lt;/strong&gt;: Each LLM is trained on a specific dataset, inheriting its biases and limitations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Cutoff&lt;/strong&gt;: LLMs have a fixed knowledge cutoff date, lacking information on recent events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination&lt;/strong&gt;: LLMs can generate plausible-sounding but incorrect information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Specificity&lt;/strong&gt;: Models excel in certain areas but underperform in others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Inconsistency&lt;/strong&gt;: Alignment techniques vary, leading to inconsistent handling of ethical queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence&lt;/strong&gt;: LLMs may present incorrect information with high confidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By leveraging multiple LLMs, we can mitigate these limitations. Different models can complement each other's strengths, allow users to cross-verify information, and provide a more balanced perspective. This approach, while not perfect, significantly improves the trustworthiness of LLMs.&lt;/p&gt;
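&lt;p&gt;The cross-verification idea can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes you have already collected (and normalized) one answer per model, and it only shows the consensus check itself.&lt;/p&gt;

```python
from collections import Counter

def consensus(answers, threshold=0.5):
    """Return the majority answer if enough models agree, else None.

    answers: one (normalized) answer string per model consulted.
    threshold: fraction of models that must agree for the answer
    to be treated as trustworthy.
    """
    if not answers:
        return None
    # Find the most common answer and how many models gave it.
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > threshold:
        return answer
    return None

# Three models consulted; two independently agree.
print(consensus(["Paris", "Paris", "Lyon"]))  # Paris
# No majority: the result should be verified by a human.
print(consensus(["Paris", "Lyon", "Marseille"]))  # None
```

&lt;p&gt;Real answers rarely match verbatim, so in practice "agreement" would need semantic comparison (e.g. embedding similarity or a judge model) rather than string equality, but the voting logic stays the same.&lt;/p&gt;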

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
In addition to what's discussed in this article, I also want to draw attention to the emergence of 'AI services' (it's no longer accurate to call them just LLMs) that are capable of reasoning. These services combine techniques such as Dynamic Chain-of-Thought (CoT), Reflection, and Verbal Reinforcement Learning to provide responses that aim to offer a higher degree of trust. There is a &lt;a href="https://medium.com/@harishhacker3010/can-we-make-any-smaller-opensource-ai-models-smarter-than-human-1ea507e644a0" rel="noopener noreferrer"&gt;great article&lt;/a&gt; that goes into detail about what these techniques are and how they work. We are actively working on bringing these capabilities to Glama.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Glama: Streamlining Multi-Model Interactions
&lt;/h2&gt;

&lt;p&gt;Recognizing the limitations of single-model reliance, I developed &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; as a solution to streamline the process of gaining perspectives from multiple LLMs. Glama provides a unified platform where users can interact with various AI models simultaneously, effectively creating a panel of AI advisors.&lt;/p&gt;

&lt;p&gt;Key features of Glama include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Model Querying&lt;/strong&gt;: Simultaneously consult multiple LLMs, including the latest from Google, OpenAI, and Anthropic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enterprise-Grade Security&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data remains under your control, never used for model training.&lt;/li&gt;
&lt;li&gt;End-to-end encryption (AES 256, TLS 1.2+) for data in transit and at rest.&lt;/li&gt;
&lt;li&gt;SOC 2 compliance, meeting stringent security standards.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless Integration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admin console for easy team management, including SSO and domain verification.&lt;/li&gt;
&lt;li&gt;Collaborative features like shared chat templates for streamlined workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparative Analysis&lt;/strong&gt;: Easily compare responses side-by-side to identify consistencies and discrepancies across models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customizable Model Selection&lt;/strong&gt;: Choose which LLMs to consult based on your specific needs and security requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By facilitating secure, efficient access to diverse AI perspectives, Glama empowers users to make more informed decisions, leveraging the strengths of multiple models while mitigating individual weaknesses – all within a robust, enterprise-ready environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In today's AI landscape, relying on a single LLM is akin to seeking advice from just one expert – potentially valuable, but inherently limited. Glama embodies the principle that diversity in AI perspectives leads to more robust and reliable outcomes. By streamlining access to multiple LLMs, Glama not only saves time but also enhances the quality of AI-assisted decision-making.&lt;/p&gt;

&lt;p&gt;As we continue to navigate the evolving world of AI, tools like Glama will play a crucial role in helping users harness the collective intelligence of multiple models...&lt;/p&gt;

&lt;p&gt;There's no one AI to rule them all – but with Glama, you can leverage the power of many.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
