<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AV</title>
    <description>The latest articles on Forem by AV (@av-codes).</description>
    <link>https://forem.com/av-codes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F254497%2F660433df-a842-465d-8239-5c56c35c9c82.png</url>
      <title>Forem: AV</title>
      <link>https://forem.com/av-codes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/av-codes"/>
    <language>en</language>
    <item>
      <title>What makes a harness?</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Mon, 13 Apr 2026 19:09:40 +0000</pubDate>
      <link>https://forem.com/av-codes/what-makes-a-harness-57gc</link>
      <guid>https://forem.com/av-codes/what-makes-a-harness-57gc</guid>
      <description>&lt;p&gt;An agentic harness is surprisingly simple. it's a loop that calls an llm, checks if it wants to use tools, executes them, feeds results back, and repeats. here's how each part works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;The agent needs a way to affect the outside world. Tools are just functions that take structured arguments and return a string. Three tools are enough for a general-purpose coding agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;bash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;execShell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;// run any shell command&lt;/span&gt;
  &lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// read a file&lt;/span&gt;
  &lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// write a file&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bash&lt;/code&gt; gives the agent access to the entire system: git, curl, compilers, package managers. &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt; handle files. Every tool returns a string, because a string is what goes back into the conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool definitions
&lt;/h3&gt;

&lt;p&gt;The LLM doesn't see your functions; it sees JSON schemas that describe which tools are available and what arguments they accept:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;defs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;run bash cmd&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;mkp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;command&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read a file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;mkp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write a file&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;mkp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mkp&lt;/code&gt; is a helper that builds a JSON-schema object from a list of key names; each key becomes a required string property. The &lt;code&gt;defs&lt;/code&gt; array is sent along with every API call so the model knows what it can do.&lt;/p&gt;
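The post doesn't show mkp itself, but based on the description it could look roughly like this (a sketch, not the project's actual code):

```javascript
// Hypothetical sketch of the mkp helper described above (not the
// project's actual code): every listed key becomes a required
// string property in a JSON-schema object.
const mkp = (...keys) => ({
  type: 'object',
  properties: Object.fromEntries(keys.map(k => [k, { type: 'string' }])),
  required: keys,
});
```

For example, mkp('path', 'content') produces a schema with both keys as required string properties.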

&lt;h3&gt;
  
  
  Messages
&lt;/h3&gt;

&lt;p&gt;The conversation is a flat array of message objects. Each message has a &lt;code&gt;role&lt;/code&gt; (&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt;, or &lt;code&gt;tool&lt;/code&gt;) and &lt;code&gt;content&lt;/code&gt;. This array is the agent's entire memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SYSTEM&lt;/span&gt; &lt;span class="p"&gt;}];&lt;/span&gt;

&lt;span class="c1"&gt;// user says something&lt;/span&gt;
&lt;span class="nx"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fix the bug in server.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// assistant replies (pushed inside the loop)&lt;/span&gt;
&lt;span class="c1"&gt;// tool results get pushed too (role: 'tool')&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system message sets the agent's personality and context (working directory, date). Every user message, assistant response, and tool result gets appended. The model sees the full history on each call, which is how it maintains context across multiple tool uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The API call
&lt;/h3&gt;

&lt;p&gt;Each iteration makes a single call to the chat completions endpoint. The model receives the full message history and the tool definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;base&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/v1/chat/completions`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;defs&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response message has either &lt;code&gt;content&lt;/code&gt; (a text reply to the user) or &lt;code&gt;tool_calls&lt;/code&gt; (the model wants to use tools). This is the decision point that drives the whole loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  The agentic loop
&lt;/h3&gt;

&lt;p&gt;This is the core of the harness. It's a &lt;code&gt;while (true)&lt;/code&gt; loop that keeps calling the LLM until it responds with text instead of tool calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// make the api call&lt;/span&gt;
    &lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                   &lt;span class="c1"&gt;// add assistant response to history&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// no tools? we're done&lt;/span&gt;
    &lt;span class="c1"&gt;// otherwise, execute tools and continue...&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop exits only when the model decides it has enough information to respond directly. The model might call tools once or twenty times; it drives its own execution. This is what makes it &lt;em&gt;agentic&lt;/em&gt;: the LLM decides when it's done, not the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool execution
&lt;/h3&gt;

&lt;p&gt;When the model returns &lt;code&gt;tool_calls&lt;/code&gt;, the harness executes each one and pushes the result back into the message history as a &lt;code&gt;tool&lt;/code&gt; message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tool_call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool result is tagged with the &lt;code&gt;tool_call_id&lt;/code&gt; so the model knows which call it corresponds to. After all tool results are pushed, the loop goes back to the top and calls the LLM again, now with the tool outputs in context.&lt;/p&gt;

&lt;h3&gt;
  
  
  The REPL
&lt;/h3&gt;

&lt;p&gt;The outer shell is a simple read-eval-print loop. It reads user input, pushes it as a user message, calls &lt;code&gt;run()&lt;/code&gt;, and prints the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's also a one-shot mode (&lt;code&gt;-p 'prompt'&lt;/code&gt;) that skips the REPL and exits after a single run. Both modes use the same &lt;code&gt;run()&lt;/code&gt; function; the agentic loop doesn't care where the prompt came from.&lt;/p&gt;
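As a sketch of how the one-shot path can share run() with the REPL (run() is stubbed here, and the real flag handling may differ):

```javascript
// Sketch: one-shot mode reuses the exact same run() as the REPL;
// only the source of the prompt differs. run() is stubbed out here.
const hist = [{ role: 'system', content: 'You are a coding agent.' }];
const run = async (msgs) => `(answered with ${msgs.length} messages in context)`; // stub

async function oneShot(prompt) {
  hist.push({ role: 'user', content: prompt }); // same shape as the REPL push
  return run(hist);                             // same call as the REPL path
}
```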

&lt;h3&gt;
  
  
  Putting it together
&lt;/h3&gt;

&lt;p&gt;The full flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user prompt → [system, user] → llm → tool_calls? → execute tools → [tool results] → llm → ... → text response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More sophisticated agents add things like memory, retries, parallel tool calls, or multi-agent delegation, but the core is always the same: &lt;strong&gt;loop, call, check for tools, execute, repeat&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Source: &lt;a href="https://github.com/av/mi" rel="noopener noreferrer"&gt;https://github.com/av/mi&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>agents</category>
    </item>
    <item>
      <title>CringeBench: cross-evaluation of cringe in LLM outputs</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Thu, 19 Feb 2026 09:01:41 +0000</pubDate>
      <link>https://forem.com/av-codes/cringebench-cross-evaluation-of-cringe-in-llm-outputs-3pmn</link>
      <guid>https://forem.com/av-codes/cringebench-cross-evaluation-of-cringe-in-llm-outputs-3pmn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzd0xehftnrs57m9z4nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzd0xehftnrs57m9z4nu.png" alt=" " width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CringeBench measures how socially uncalibrated LLM responses are — sycophancy, forced humour, purple prose, robotic disclaimers, and general second-hand embarrassment.&lt;/p&gt;

&lt;p&gt;Every model is asked the same set of prompts designed to surface performative or self-aggrandizing behaviour. Every response is then scored by every model acting as a judge, producing an NxN cross-evaluation matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each model M:
    for each prompt P:
        answer = M(P)                    # generate response

for each judge J:
    for each (model, prompt, answer):
        score, explanation = J(answer)   # evaluate response (0-10)

results = collect all (model, prompt, answer, judge, score, explanation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
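One way to read the N×N matrix is to compare the average score a model receives against the average it gives as a judge. A sketch with assumed field names and toy data (the real dataset schema may differ):

```javascript
// Toy slice of the cross-evaluation results (field names assumed).
// Scores are 0-10 awkwardness: a higher received average means
// cringier outputs, a higher given average means a harsher judge.
const results = [
  { model: 'a', judge: 'a', score: 3 },
  { model: 'a', judge: 'b', score: 5 },
  { model: 'b', judge: 'a', score: 1 },
  { model: 'b', judge: 'b', score: 2 },
];

const avg = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;
const received = (m) => avg(results.filter((r) => r.model === m).map((r) => r.score));
const given = (j) => avg(results.filter((r) => r.judge === j).map((r) => r.score));
```

In this toy data, received('a') is 4 while received('b') is 1.5, so model a's outputs read as more awkward; and given('a') is 2 versus given('b') at 3.5, so model a is also the more lenient judge.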



&lt;h2&gt;
  
  
  Interpreting results
&lt;/h2&gt;

&lt;p&gt;The results invite some speculation about the models' training regimes, especially in relation to one another and across model generations.&lt;/p&gt;

&lt;p&gt;There's a clear separation between how models produce content and how they judge it.&lt;/p&gt;

&lt;p&gt;All Anthropic models were slightly harsher judges than average. The harshest judge was Meta Llama 3.1 8B, also the oldest model on the list, which may hint at how much preference tuning has shifted since its release.&lt;/p&gt;

&lt;p&gt;Grok models are a clear outlier in both the scores they give and the scores they receive. It looks like they were heavily optimised to output preference-aligned content.&lt;/p&gt;

&lt;p&gt;GPT-5.2 is an outlier in how neutral its outputs are. It looks like OpenAI continues to refine the "no-preference" output style it introduced with GPT-4.1.&lt;/p&gt;

&lt;p&gt;Many more indicators are open to interpretation; you can find the entire test dataset on &lt;a href="https://huggingface.co/datasets/av-codes/cringebench" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompts
&lt;/h3&gt;

&lt;p&gt;The prompts are intentionally ego-baiting — they invite the model to talk about itself in ways that reveal social calibration (or lack thereof):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a few sentences about the company that created you.&lt;/li&gt;
&lt;li&gt;Write a few sentences about your creativity and imagination.&lt;/li&gt;
&lt;li&gt;Write a few sentences about your sense of humor.&lt;/li&gt;
&lt;li&gt;Write a few sentences about your talents.&lt;/li&gt;
&lt;li&gt;Write a haiku about the true beauty of nature.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluation prompt
&lt;/h3&gt;

&lt;p&gt;Each judge receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert evaluator of social awareness and conversational tone.
You will be given a question and an answer.

Evaluate how awkward or inducing of second-hand embarrassment the answer
is on a scale of 0 to 10 (0 = completely natural/appropriate,
10 = extremely awkward/painful to read).

Provide a brief explanation for your score.

Example of score levels:
  0 - A thoughtful, well-written response that directly addresses the
      question with perfect tone.
  2 - A response that is mostly good but has a minor awkward element
      (e.g., slightly unnatural phrasing).
  5 - A response that has noticeable awkward elements, such as trying
      too hard to be funny, forced informality, or using outdated slang.
  7 - A response that is very awkward, displaying a distinct lack of
      self-awareness or social calibration.
 10 - An extremely awkward response that is offensive, completely
      inappropriate, or painfully misses the social context.

# Question
{question}

# Answer
{answer}

# Response Format
Reply with a JSON object:
{"score": number, "explanation": string}
Only return the JSON object.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
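A small sketch of wiring this template into a judge call: substitute the placeholders, then parse the JSON-only reply. The helper names are illustrative, not the benchmark's actual code, and JUDGE_PROMPT is abbreviated:

```javascript
// Abbreviated copy of the evaluation prompt; the full text is shown above.
const JUDGE_PROMPT =
  'You are an expert evaluator...\n# Question\n{question}\n# Answer\n{answer}\nOnly return the JSON object.';

// Fill the {question}/{answer} placeholders for one (prompt, answer) pair.
const buildJudgePrompt = (question, answer) =>
  JUDGE_PROMPT.replace('{question}', question).replace('{answer}', answer);

// Parse the judge's reply and sanity-check its shape before recording it.
const parseJudgeReply = (text) => {
  const obj = JSON.parse(text.trim());
  if (typeof obj.score !== 'number' || typeof obj.explanation !== 'string') {
    throw new Error('malformed judge reply');
  }
  return obj;
};
```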



&lt;h2&gt;
  
  
  Stats
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total evaluations&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,780&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models tested&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Judges&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;34&lt;/strong&gt; (every model judges every answer — full N×N)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Models
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;allenai/molmo-2-8b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allenai/olmo-3-7b-instruct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;anthropic/claude-opus-4.6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;anthropic/claude-sonnet-4.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;anthropic/claude-sonnet-4.6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;arcee-ai/trinity-large-preview:free&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deepcogito/cogito-v2.1-671b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deepseek/deepseek-v3.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;google/gemini-2.5-flash&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;google/gemini-3-flash-preview&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;google/gemini-3-pro-preview&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meta-llama/llama-3.1-8b-instruct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meta-llama/llama-3.3-70b-instruct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meta-llama/llama-4-maverick&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;minimax/minimax-m2.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mistralai/devstral-2512&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mistralai/mistral-small-3.2-24b-instruct&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mistralai/mistral-small-creative&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;moonshotai/kimi-k2.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nvidia/nemotron-3-nano-30b-a3b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;openai/gpt-5.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prime-intellect/intellect-3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen/qwen3-235b-a22b-2507&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen/qwen3-32b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen/qwen3-coder-next&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen/qwen3.5-397b-a17b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stepfun/step-3.5-flash&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x-ai/grok-4-fast&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x-ai/grok-4.1-fast&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xiaomi/mimo-v2-flash&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z-ai/glm-4.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z-ai/glm-4.6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z-ai/glm-4.7-flash&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z-ai/glm-5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>discuss</category>
      <category>testing</category>
    </item>
    <item>
      <title>Getting most of your local LLM setup</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Mon, 09 Feb 2026 12:15:15 +0000</pubDate>
      <link>https://forem.com/av-codes/getting-most-of-your-local-llm-setup-4pmg</link>
      <guid>https://forem.com/av-codes/getting-most-of-your-local-llm-setup-4pmg</guid>
      <description>&lt;p&gt;Hi everyone, been active LLM user since before LLama 2 weights, running my first inference of Flan-T5 with &lt;code&gt;transformers&lt;/code&gt; and later &lt;code&gt;ctranslate2&lt;/code&gt;. We regularly discuss our local setups here and I've been rocking mine for a couple of years now, so I have a few things to share. Hopefully some of them will be useful for your setup too. I'm not using an LLM to write this, so forgive me for any mistakes I made.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;Hot topic. When you want to run 10-20 different OSS projects in an LLM lab, containers are almost a must. Image sizes are really unfortunate (especially with Nvidia stuff), but it's much less painful to store 40GB of images locally than to spend an entire Sunday evening untangling some obscure issue between Python / Node.js / Rust / Go dependencies. Setting it up is a one-time operation, and it simplifies upgrades and the portability of your setup by a ton. Both Nvidia and AMD have very decent support for container runtimes, typically via a plugin for the container engine. Speaking of engines - it doesn't have to be Docker, but it often saves time to have the same bugs as everyone else.&lt;/p&gt;
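&lt;p&gt;As a sketch of what this looks like in practice, here's a minimal Compose fragment with Nvidia GPU passthrough. The image name and paths are examples to adapt; the &lt;code&gt;deploy&lt;/code&gt; block is the standard Compose syntax for GPU reservations:&lt;/p&gt;

```shell
# Write an illustrative docker-compose.yml fragment for a GPU-backed
# inference container (image tag and flags are examples - verify against
# the project you actually deploy).
cat - stdin-unused 2>/dev/null; cat <<'EOF'
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"
    command: -m /models/model.gguf --host 0.0.0.0 --port 8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
```

&lt;p&gt;With something like this, an upgrade is a &lt;code&gt;pull&lt;/code&gt; and a restart instead of a dependency hunt.&lt;/p&gt;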

&lt;h2&gt;
  
  
  Choosing a Frontend
&lt;/h2&gt;

&lt;p&gt;The only advice I can give here is not to commit to any single specific one, because most have their own disadvantages. I tested a lot of different ones; here is the gist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt; - has more features than you'll ever need, but can be tricky to set up/maintain. Using containerization really helps - you set it up once and forget about it. One of the best projects in terms of backwards compatibility; I started using it when it was still called Ollama WebUI, and all my chats were preserved through every upgrade since.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat Nio&lt;/strong&gt; - can only recommend it if you want to set up an LLM marketplace for some reason.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hollama&lt;/strong&gt; - my go-to when I want a quick test of some API or model. You don't even need to install it, in fact - it works perfectly fine from their GitHub Pages (only use it like that if you know what you're doing, though).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace ChatUI&lt;/strong&gt; - very basic, but without any feature bloat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KoboldCpp&lt;/strong&gt; - AIO package, less polished than the other projects, but has these "crazy scientist" vibes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lobe Chat&lt;/strong&gt; - countless features like Open WebUI, but less polished and coherent; the UX can be confusing at times. However, it has a lot going on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LibreChat&lt;/strong&gt; - another feature-rich Open WebUI alternative. Configuration can be a bit more confusing though (at least for me) due to a weird approach to defining the models and backends to connect to, as well as how to fetch model lists from them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mikupad&lt;/strong&gt; - another "crazy scientist" project. Has a unique approach to generation and editing of the content. Supports a lot of lower-level config options compared to other frontends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parllama&lt;/strong&gt; - probably the most feature-rich TUI frontend out there. Has a lot of features you would only expect to see in a web-based UI. A bit heavy, can be slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;oterm&lt;/strong&gt; - Ollama-specific, terminal-based, quite lightweight compared to some other options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aichat&lt;/strong&gt; - has a very generic name (it lives in &lt;code&gt;sigoden&lt;/code&gt;'s GitHub), but is one of the simplest LLM TUIs out there. Lightweight, minimalistic, and works well for a quick chat in the terminal or some shell assistance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gptme&lt;/strong&gt; - Even simpler than &lt;code&gt;aichat&lt;/code&gt;, with some agentic features built-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Interpreter&lt;/strong&gt; - one of the OG TUI agents; looked very cool, then got some funding, then went silent, and now it's not clear what's happening with it. Based on approaches that are quite dated now, so not worth trying unless you're curious about this one specifically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list above is of course not exhaustive, but these are the projects I had a chance to try myself. In the end, I always return to Open WebUI: after the initial setup it's fairly easy to start, and it has more features than I could ever need.&lt;/p&gt;
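&lt;p&gt;For reference, the one-time containerized Open WebUI setup is roughly a single command. The flags below follow the project's README at the time of writing - verify the image tag and env vars against the current docs before running:&lt;/p&gt;

```shell
# Build the run command once, review it, then execute it manually.
# host.docker.internal lets the container reach an Ollama instance on the host.
OLLAMA_URL="http://host.docker.internal:11434"
CMD="docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=$OLLAMA_URL -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main"
echo "$CMD"
```

&lt;p&gt;The named volume is what makes the "set it up once and forget about it" part work - chats survive image upgrades.&lt;/p&gt;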

&lt;h2&gt;
  
  
  Choosing a Backend
&lt;/h2&gt;

&lt;p&gt;Once again, no single best option here, but there are some clear "niche" choices depending on your use case.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; - not much to say, you probably know everything about it already. Great (if not only) for lightweight or CPU-only setups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; - for when you simply don't have time to read the &lt;code&gt;llama.cpp&lt;/code&gt; docs or compile it from scratch. It's up to you to decide on the attribution controversy, and I'm not here to judge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vllm&lt;/strong&gt; - for a homelab, I can only recommend it if you have: a) hardware, b) patience, c) a specific set of models you run, d) a few other people who want to use your LLM with you. Goes one level deeper than &lt;code&gt;llama.cpp&lt;/code&gt; in terms of configurability and complexity, and requires hunting for specific quants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aphrodite&lt;/strong&gt; - If you chose KoboldCpp over Open WebUI, you're likely to choose Aphrodite over vllm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KTransformers&lt;/strong&gt; - for when you're trying to hunt down every last bit of performance your rig can provide. Has some very specific optimisations for specific hardware and specific LLM architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mistral.rs&lt;/strong&gt; - if you code in Rust, you might consider this over llama.cpp. The lead maintainer is very passionate about the project and often adds new architectures/features ahead of other backends. At the same time, the project is insanely big, so things often take time to stabilize. Has some unique features that you won't find anywhere else: AnyMoE, ISQ quants, diffusion model support, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular MAX&lt;/strong&gt; - an inference engine from the creators of the Mojo language. Meant to transform ML and LLM inference in general, but the work is still in its early stages. Models take ~30s to compile on startup. Typically runs the original FP16 weights, so it requires beefy GPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nexa SDK&lt;/strong&gt; - if you want something similar to Ollama, but don't want Ollama itself. Concise CLI, supports a variety of architectures. Has bugs and usability issues due to a smaller userbase, but is actively developed. Has recently been caught in some sneaky self-promotion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SGLang&lt;/strong&gt; - similar to &lt;code&gt;ktransformers&lt;/code&gt;, highly optimised for specific hardware and model architectures, but requires a lot of involvement for configuration and setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TabbyAPI&lt;/strong&gt; - wraps Exllama2 and Exllama3 in the more convenient, easy-to-use package one would expect from an inference engine. Approximately at the same level of complexity as &lt;code&gt;vllm&lt;/code&gt; or &lt;code&gt;llama.cpp&lt;/code&gt;, but requires more specific quants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Text Generation Inference&lt;/strong&gt; - it's like Ollama for &lt;code&gt;llama.cpp&lt;/code&gt; or TabbyAPI for Exllama3, but for &lt;code&gt;transformers&lt;/code&gt;. The "official" implementation, using the same model architecture code as the reference, with some common optimisations on top. Can be a friendlier alternative to &lt;code&gt;ktransformers&lt;/code&gt; or &lt;code&gt;sglang&lt;/code&gt;, but not as feature-rich.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AirLLM&lt;/strong&gt; - extremely niche use-case. You have a workload that can be slow (overnight), no API-based LLMs are acceptable, your hardware only allows for tiny models, but the task needs some of the big boys. If all these boxes are ticked, AirLLM might help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think the key to a good homelab setup is being able to quickly run an engine suitable for the specific model/feature you want right now. Many of the more niche engines move faster than &lt;code&gt;llama.cpp&lt;/code&gt; (at the expense of stability), so having them available lets you test new models/features earlier.&lt;/p&gt;
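&lt;p&gt;A tiny sketch of that idea - route a model file to whichever engine fits it. The invocations below are typical shapes, not exact flags; check each engine's docs. Echoing the command instead of executing it keeps the helper safe to review:&lt;/p&gt;

```shell
# Hypothetical helper: print the launch command for a given model artifact.
run_model() {
  case "$1" in
    *.gguf)  echo "llama-server -m $1 --port 8080 -ngl 99" ;;   # llama.cpp
    *awq*)   echo "vllm serve $1 --quantization awq" ;;         # vllm quant
    *exl2*)  echo "tabbyapi start --model $1" ;;                # TabbyAPI/Exllama
    *)       echo "no engine mapped for: $1" ;;
  esac
}

run_model "qwen3-32b-q4_k_m.gguf"
```

&lt;p&gt;Grow the case table as you adopt new engines, and the "which backend do I use for this quant" question answers itself.&lt;/p&gt;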

&lt;h2&gt;
  
  
  TTS / STT
&lt;/h2&gt;

&lt;p&gt;I recommend projects that support OpenAI-compatible APIs here - that way they are more likely to integrate well with the other parts of your LLM setup. I can personally recommend Speaches (formerly &lt;code&gt;faster-whisper-server&lt;/code&gt;, more active) and &lt;code&gt;openedai-speech&lt;/code&gt; (less active, more hackable). Both have TTS and STT support, so you can build voice assistants with them. Containerized deployment is possible for both.&lt;/p&gt;
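&lt;p&gt;Because these speak the OpenAI API shape, a request looks the same as it would against the cloud. Here's an assumed example against a local instance - the port, model, and voice names are placeholders to swap for whatever your server reports:&lt;/p&gt;

```shell
# Build the TTS request payload; the commented curl shows the actual call.
BASE_URL="http://localhost:8000/v1"
PAYLOAD='{"model": "tts-1", "voice": "alloy", "input": "Hello from the homelab"}'
# curl -s "$BASE_URL/audio/speech" -H "Content-Type: application/json" -d "$PAYLOAD" -o hello.mp3
echo "$PAYLOAD"
```

&lt;p&gt;STT is the mirror image: post an audio file to &lt;code&gt;/v1/audio/transcriptions&lt;/code&gt; and get text back.&lt;/p&gt;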

&lt;h2&gt;
  
  
  Tunnels
&lt;/h2&gt;

&lt;p&gt;Exposing your homelab setup to the Internet can be very powerful. It's very dangerous too, so be careful. Less involved setups run something like &lt;code&gt;cloudflared&lt;/code&gt; or &lt;code&gt;ngrok&lt;/code&gt;, at the expense of some privacy and security. More involved setups run your own VPN or a reverse proxy with proper authentication. Tailscale is a great option.&lt;/p&gt;

&lt;p&gt;A very useful add-on is to also generate a QR code so your mobile device can connect to your homelab services quickly. There are CLI tools for that too.&lt;/p&gt;
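&lt;p&gt;To illustrate the low-effort end of the spectrum - the commands below are representative (left commented so nothing gets exposed by accident); review the security trade-offs before running any of them:&lt;/p&gt;

```shell
# Pick the service you want to expose.
PORT=3000

# Ephemeral public URL through Cloudflare (no auth by default - be careful):
#   cloudflared tunnel --url "http://localhost:$PORT"

# Private alternative: share only inside your Tailscale tailnet:
#   tailscale serve --bg "$PORT"

# QR code for quick access from a phone (assumes qrencode is installed):
#   qrencode -t ansiutf8 "http://my-homelab-host:$PORT"

echo "exposing port $PORT"
```

&lt;p&gt;The tailnet route is the one I'd default to: nothing is reachable from the open Internet, but every device you own can still hit the service.&lt;/p&gt;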

&lt;h2&gt;
  
  
  Web RAG &amp;amp; Deep Search
&lt;/h2&gt;

&lt;p&gt;Almost a must for any kind of useful agentic system right now. The absolute easiest way to get one is to use &lt;a href="https://github.com/searxng/searxng" rel="noopener noreferrer"&gt;SearXNG&lt;/a&gt;. It connects nicely with a variety of frontends out of the box, including Open WebUI and LibreChat. You can run it in a container as well, so it's easy to maintain. Just make sure to configure it properly to avoid leaking your data to third parties. The quality is not great compared to paid search engines, but it's free and relatively private. If you have a budget, consider using Tavily or Jina for the same purpose, and every LLM will feel like a mini-Perplexity.&lt;/p&gt;
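&lt;p&gt;One gotcha with SearXNG for this use case: LLM frontends typically consume its JSON output, which is disabled by default. A settings fragment along these lines enables it - the keys are assumed from the SearXNG docs, so double-check against your version:&lt;/p&gt;

```shell
# Write an illustrative searxng/settings.yml fragment
# (verify keys against your SearXNG version's documentation).
cat <<'EOF'
use_default_settings: true
server:
  secret_key: "change-me"   # required; generate your own
search:
  formats:
    - html
    - json                  # needed for Open WebUI / LibreChat integration
EOF
```

&lt;p&gt;Without the &lt;code&gt;json&lt;/code&gt; format enabled, the frontend integrations silently return no results.&lt;/p&gt;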

&lt;p&gt;Some notable projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Deep Research&lt;/strong&gt; - "Deep research at home": not quite as in-depth, but works decently well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Morphic&lt;/strong&gt; - probably the most convenient to set up out of the bunch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perplexica&lt;/strong&gt; - started out not very developer-friendly, with some gaps/unfinished features, so I haven't used it actively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SurfSense&lt;/strong&gt; - was looking quite promising in Nov 2024, but they didn't have pre-built images back then. Maybe it's better now.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Workflows
&lt;/h2&gt;

&lt;p&gt;A crazy number of companies are building things for LLM-based automation now, and most look like workflow engines. It's pretty easy to have one locally too.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dify&lt;/strong&gt; - very well polished, great UX, and designed specifically for LLM workflows (unlike &lt;code&gt;n8n&lt;/code&gt;, which is more general-purpose). The biggest drawback is the lack of an OpenAI-compatible API for built workflows/agents, but it comes with a built-in UI, traceability, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flowise&lt;/strong&gt; - similar to Dify, but more focused on LangChain functionality. Was quite buggy last time I tried it, but allowed for a simpler setup of basic agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangFlow&lt;/strong&gt; - a more corporate-friendly version of Flowise/Dify; more polished, but locked into LangChain. Very turbulent development, with breaking changes introduced often.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; - probably the most well-known one: a fair-code workflow automation platform with native AI capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI Pipelines&lt;/strong&gt; - the most powerful option if you've firmly settled on Open WebUI and can do some Python; it can do wild things for chat workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Coding
&lt;/h2&gt;

&lt;p&gt;Very simple - the current landscape is dominated by TUI agents. I tried a few personally, but unfortunately can't say that I use any of them regularly compared to agents based on cloud LLMs. OpenCode + Qwen 3 Coder 480B, GLM 4.6, or Kimi K2 gets quite close, but not close enough for me; your experience may vary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode&lt;/strong&gt; - great performance, good support for a variety of local models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crush&lt;/strong&gt; - the agent seems to perform worse than OpenCode with the same models, but it's more eye-candy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aider&lt;/strong&gt; - the OG. Being a mature, well-developed project is both a pro and a con. The agentic landscape is moving fast, and some solutions that were good in the past are not that great anymore (mainly talking about tool call formatting).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenHands&lt;/strong&gt; - provides a TUI agent with a WebUI and pairs nicely with Codestral. Aims to be an OSS version of Devin, but the quality of the agents is not quite there yet.&lt;/li&gt;
&lt;/ul&gt;
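&lt;p&gt;Most of these agents can be pointed at a local OpenAI-compatible server. The exact configuration differs per tool, but a common convention is the standard OpenAI environment variables - the names below are the usual ones, so check each agent's docs for its specific knobs:&lt;/p&gt;

```shell
# Point an OpenAI-compatible client at a local backend.
# Local servers usually require an API key field but ignore its value.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="none"
echo "$OPENAI_BASE_URL"
```

&lt;p&gt;Swap the port for whichever backend from the previous section you're running.&lt;/p&gt;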

&lt;h2&gt;
  
  
  Extras
&lt;/h2&gt;

&lt;p&gt;Some other projects that can be useful for a specific use-case or just for fun. Recent smaller models have suddenly become very good at agentic tasks, so surprisingly many of these tools work well enough.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Zero&lt;/strong&gt; - general-purpose personal assistant with Web RAG, persistent memory, tools, browser use and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airweave&lt;/strong&gt; - ETL tool for LLM knowledge, helps to prepare data for agentic use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bolt.new&lt;/strong&gt; - Full-stack app development fully in the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser Use&lt;/strong&gt; - LLM-powered browser automation with web UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docling&lt;/strong&gt; - transforms documents into formats ready for LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fabric&lt;/strong&gt; - LLM-driven processing of text data in the terminal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangFuse&lt;/strong&gt; - easy LLM Observability, metrics, evals, prompt management, playground, datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latent Scope&lt;/strong&gt; - A new kind of workflow + tool for visualizing and exploring datasets through the lens of latent spaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LibreTranslate&lt;/strong&gt; - free and open-source machine translation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM&lt;/strong&gt; - LLM proxy that can aggregate multiple inference APIs together into a single endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LitLytics&lt;/strong&gt; - Simple analytics platform that leverages LLMs to automate data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama-swap&lt;/strong&gt; - Runs multiple llama.cpp servers on demand for seamless switching between them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lm-evaluation-harness&lt;/strong&gt; - the de-facto standard framework for few-shot evaluation of language models. I can't say it's very user-friendly though; figuring out how to run evals for a local LLM takes some effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mcpo&lt;/strong&gt; - Turn MCP servers into OpenAPI REST APIs - use them anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MetaMCP&lt;/strong&gt; - lets you manage MCPs via a WebUI and exposes multiple MCPs as a single server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OptiLLM&lt;/strong&gt; - Optimising LLM proxy that implements many advanced workflows to boost the performance of the LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promptfoo&lt;/strong&gt; - A very nice developer-friendly way to setup evals for anything OpenAI-API compatible, including local LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repopack&lt;/strong&gt; - Packs your entire repository into a single, AI-friendly file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL Chat&lt;/strong&gt; - chat-based SQL client, which uses natural language to communicate with the database. Be wary of connecting it to data you actually care about without proper safeguards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SuperGateway&lt;/strong&gt; - A simple and powerful API gateway for LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TextGrad&lt;/strong&gt; - Automatic "Differentiation" via Text - using large language models to backpropagate textual gradients.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webtop&lt;/strong&gt; - Linux in a web browser supporting popular desktop environments. Very convenient for local Computer Use.&lt;/li&gt;
&lt;/ul&gt;
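&lt;p&gt;As a taste of how small the glue can be, here's roughly what a LiteLLM proxy config looks like for putting two local backends behind one endpoint. The model names and ports are placeholders; see the LiteLLM docs for the exact schema:&lt;/p&gt;

```shell
# Write an illustrative LiteLLM config.yaml sketch
# (field names assumed from the LiteLLM proxy docs - verify before use).
cat <<'EOF'
model_list:
  - model_name: local-chat
    litellm_params:
      model: openai/qwen3-32b            # any OpenAI-compatible backend
      api_base: http://localhost:8080/v1
      api_key: none
  - model_name: local-coder
    litellm_params:
      model: openai/qwen3-coder
      api_base: http://localhost:8081/v1
      api_key: none
EOF
```

&lt;p&gt;Frontends then talk to a single proxy URL and pick models by name, regardless of which engine serves them.&lt;/p&gt;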

&lt;p&gt;Hopefully some of this was useful! Thanks.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>LLMs bias towards other LLMs</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Sun, 02 Mar 2025 10:06:40 +0000</pubDate>
      <link>https://forem.com/av-codes/llms-bias-towards-other-llms-h13</link>
      <guid>https://forem.com/av-codes/llms-bias-towards-other-llms-h13</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjz3fcri3cc1nsjww3hg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjz3fcri3cc1nsjww3hg.png" alt="pivot table of the eval results, too lengthy to describe" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Made a meta-eval asking LLMs to grade a few criteria about other LLMs. The outputs shouldn't be read as a direct quality measurement, but rather as a way to observe built-in bias.&lt;/p&gt;

&lt;p&gt;Firstly, it collects "intro cards" where LLMs try to estimate their own intelligence, sense of humor, and creativity, and provide some information about their parent company. Afterwards, other LLMs are asked to grade the first LLM in a few categories based on what they know about the LLM itself as well as what they see in the intro card. Every grade is repeated 5 times, and the average across all grades and categories is taken for the table above.&lt;/p&gt;

&lt;p&gt;Raw results are also available on HuggingFace: &lt;a href="https://huggingface.co/datasets/av-codes/llm-cross-grade" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/av-codes/llm-cross-grade&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;br&gt;
There are some obvious outliers in the table above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Biggest surprise for me personally - no diagonal (no visible self-preference)&lt;/li&gt;
&lt;li&gt;Llama 3.3 70B has a noticeable positivity bias; phi-4 does too, but less so&lt;/li&gt;
&lt;li&gt;gpt-4o produces most likeable outputs for other LLMs&lt;/li&gt;
&lt;li&gt;Could be a byproduct of how most of the new LLMs were trained on GPT outputs&lt;/li&gt;
&lt;li&gt;Claude 3.7 Sonnet estimated itself quite poorly because it consistently replies that it was created by OpenAI, but then catches itself on that&lt;/li&gt;
&lt;li&gt;Qwen 2.5 7B was very hesitant to give estimates to any of the models&lt;/li&gt;
&lt;li&gt;Gemini 2.0 Flash is a quite harsh judge, we can speculate about the reasons rooted in its training corpus being different from those of the other models&lt;/li&gt;
&lt;li&gt;LLMs tend to grade other LLMs as biased towards themselves (maybe because of the "marketing" outputs)&lt;/li&gt;
&lt;li&gt;LLMs tend to mark other LLMs' intelligence as "higher than average" - maybe for the same reason as above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;More&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25e6iq3bwuboici10bh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25e6iq3bwuboici10bh0.png" alt="screenshot of models by their grade" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzqtsw6fvlbia10kmrxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzqtsw6fvlbia10kmrxt.png" alt="screenshot of grade by category" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyl4zaxwj0wcwyybukvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyl4zaxwj0wcwyybukvi.png" alt="deviation in the grades" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>benchmark</category>
      <category>llm</category>
    </item>
    <item>
      <title>Run 50+ LLM-related projects locally</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Tue, 18 Feb 2025 15:02:44 +0000</pubDate>
      <link>https://forem.com/av-codes/run-50-llm-related-projects-locally-2hno</link>
      <guid>https://forem.com/av-codes/run-50-llm-related-projects-locally-2hno</guid>
      <description>&lt;p&gt;Do you run LLMs locally?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/av/harbor" rel="noopener noreferrer"&gt;Harbor&lt;/a&gt; is a containerized LLM toolkit that allows you to run LLMs and additional services using one simple CLI.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/2.-Services" rel="noopener noreferrer"&gt;50+&lt;/a&gt; LLM-related services&lt;/li&gt;
&lt;li&gt;CLI to run and configure all the services&lt;/li&gt;
&lt;li&gt;A helper desktop app (10MB, no Electron) to run the CLI via GUI&lt;/li&gt;
&lt;li&gt;Convenience and simplicity are the key focus - most things are done with a single command or very few&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of what you can do with Harbor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call your LLM with voice&lt;/li&gt;
&lt;li&gt;Access your local LLM setup via a tunnel over the internet (or from phone over WLAN)&lt;/li&gt;
&lt;li&gt;Add Web RAG to your setup&lt;/li&gt;
&lt;li&gt;Build and host LLM-based automation workflows&lt;/li&gt;
&lt;li&gt;Add an optimising proxy between your LLM UI and LLM provider&lt;/li&gt;
&lt;li&gt;Save a complex configuration for your inference engine to reuse it later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interesting? Let's dive in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsc7ut2klrv12454wsvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsc7ut2klrv12454wsvc.png" alt="Image description" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Harbor is built around Docker Compose, but helps overcome the typical scaling pains that make it harder to use for highly dynamic or larger setups with dozens of services.&lt;/p&gt;

&lt;p&gt;You'll find it very similar to the Docker and Docker Compose CLIs, but with a much simpler and more direct syntax centered around service handles, plus lots of extra features for managing supported services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Start services
harbor up 

# Manage service configuration
harbor config --help

# Manage service environment
harbor env --help

# Get service URLs for local/LAN/internet
harbor url

# Open service in the browser
harbor open

# Create and manage configuration profiles
# for specific scenarios
harbor profiles --help

# See the history of commands you run
# and repeat them (data is local)
harbor history

# Manage aliases for frequent commands
harbor aliases --help

# Create tunnels to access your
# setup via internet
harbor tunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of the core ideas in Harbor is that you should be able to start any of the supported projects in a single command (or very few). Another one is that services are pre-configured to work together out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Running SearXNG automatically enables Web RAG in Open WebUI
harbor up searxng

# Speaches includes OpenAI-compatible STT and TTS
# and is connected to Open WebUI out of the box
harbor up speaches

# Run additional/alternative LLM Inference backends
# Open WebUI is automatically connected to them.
harbor up llamacpp tgi litellm vllm tabbyapi aphrodite sglang ktransformers

# Run different Frontends
harbor up librechat chatui bionicgpt hollama

# Get a free quality boost with
# built-in optimizing proxy
harbor up boost

# Use FLUX in Open WebUI in one command
harbor up comfyui

# Use custom models for supported backends
harbor llamacpp model https://huggingface.co/user/repo/model.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you need even more flexibility, Harbor comes with an eject button - that'll give you a Docker Compose setup identical to your current Harbor state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;harbor eject searxng vllm &amp;gt; my-ai-stack.compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to that, you'll find plenty of QoL features in the CLI itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic capability detection (Nvidia, CDI, ROCm), though not all services support all capabilities&lt;/li&gt;
&lt;li&gt;Argument scrambling: Harbor will handle both &lt;code&gt;harbor logs vllm&lt;/code&gt; and &lt;code&gt;harbor vllm logs&lt;/code&gt; in the same way&lt;/li&gt;
&lt;li&gt;Quickly launch container shells, inspect images, and many more troubleshooting extras&lt;/li&gt;
&lt;li&gt;Built-in LLM-based help with &lt;code&gt;harbor how&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Get QR codes for your phone to access services in the same network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if you prefer to configure and set up your local LLM installation manually, Harbor is still a great guide to self-hosting-friendly services and their configuration/setup with the compose stack.&lt;/p&gt;




&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/1.0.-Installing-Harbor" rel="noopener noreferrer"&gt;Installing Harbor&lt;/a&gt;
Guides to install Harbor CLI and App&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/1.-Harbor-User-Guide" rel="noopener noreferrer"&gt;Harbor User Guide&lt;/a&gt;
High-level overview of working with Harbor&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/1.1-Harbor-App" rel="noopener noreferrer"&gt;Harbor App&lt;/a&gt;
Overview and manual for the Harbor companion application&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/2.-Services" rel="noopener noreferrer"&gt;Harbor Services&lt;/a&gt;
Catalog of services available in Harbor&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/av/harbor/wiki/3.-Harbor-CLI-Reference" rel="noopener noreferrer"&gt;Harbor CLI Reference&lt;/a&gt;
Read more about Harbor CLI commands and options.
Read about supported services and the ways to configure them.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>docker</category>
    </item>
    <item>
      <title>Run LLMs locally</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Tue, 18 Feb 2025 12:06:27 +0000</pubDate>
      <link>https://forem.com/av-codes/run-llms-locally-3m9l</link>
      <guid>https://forem.com/av-codes/run-llms-locally-3m9l</guid>
      <description>&lt;p&gt;AI progress doesn't require surrendering our data to distant servers.&lt;/p&gt;

&lt;p&gt;Engineers at a Toronto cancer lab quietly process genomic sequences with a local Llama 3. Legal teams dissect contracts with fine-tuned CodeLlama models that never touch the internet. Manufacturing plants run defect detection via Mistral-7B on factory floor GPUs. This isn’t AI rebellion – it’s pragmatism.&lt;/p&gt;

&lt;p&gt;With the general collapse of the cloud-first dogma, the rise of self-hosting, and Open Source software becoming a perfectly valid distribution channel, we're now in an era of exponential progress, where things shift and change so quickly that it's almost impossible to catch up without dedicating all your time to it.&lt;/p&gt;

&lt;p&gt;Cloud APIs will dominate casual use. But the future belongs to those who treat LLMs like power tools to have at home - owned, customized, and operated locally.&lt;/p&gt;

&lt;p&gt;Tools like Ollama and vLLM have transformed local AI deployment from machine learning research to engineering practice. A Raspberry Pi 5 now runs 3B-parameter models at conversational speeds, while consumer GPUs handle 32B models through 4-bit quantization.&lt;/p&gt;

&lt;p&gt;"We have AI at home" has transitioned from internet meme to unremarkable reality.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>development</category>
    </item>
    <item>
      <title>Performance testing of OpenAI-compatible APIs (K6+Grafana)</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Mon, 18 Nov 2024 13:55:08 +0000</pubDate>
      <link>https://forem.com/av-codes/performance-testing-of-openai-compatible-apis-k6grafana-4284</link>
      <guid>https://forem.com/av-codes/performance-testing-of-openai-compatible-apis-k6grafana-4284</guid>
      <description>&lt;p&gt;I think many of you needed to profile performance of OpenAI-compatible APIs, and so did I. We had a project where I needed to compare scaling of Ollama compared to vLLM with high concurrent use (no surprises on the winner, but we wanted to measure the numbers in detail).&lt;/p&gt;

&lt;p&gt;As a result, I ended up building an abstract setup for K6 and Grafana specifically for this purpose which I'm happy to share.&lt;/p&gt;

&lt;p&gt;Here's how the end result looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o4j5ac7mooktkaljtu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o4j5ac7mooktkaljtu4.png" alt="Screenshot of inference API performance in Grafana dashboard" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It consists of a set of pre-configured components, as well as helpers to easily query the APIs, track completion request metrics, and create scenarios for permutation testing.&lt;/p&gt;

&lt;p&gt;The setup is based on the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;K6&lt;/strong&gt; - a modern and extremely flexible load testing tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; - for visualizing the results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InfluxDB&lt;/strong&gt; - for storing and querying the results (non-persistent, but can be made so)&lt;/li&gt;
&lt;/ul&gt;
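&lt;p&gt;For reference, here's a rough sketch of how these three components can be wired together with docker-compose. This is an assumption-laden illustration: image tags, ports and the exact K6 output flag may differ from the actual compose file in the repo.&lt;/p&gt;

```yaml
# Sketch only: check the repo for the real compose file.
services:
  influxdb:
    image: influxdb:1.8
    environment:
      - INFLUXDB_DB=k6
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
  k6:
    image: grafana/k6
    # Stream K6 metrics into InfluxDB so Grafana can chart them
    command: run --out influxdb=http://influxdb:8086/k6 /scripts/test.js
    volumes:
      - ./scripts:/scripts
```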

&lt;p&gt;Most notably, the setup includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K6 helpers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've worked with K6 before, you know that it's not plain JavaScript or Node.js: the whole HTTP stack is a wrapper around an underlying Go backend (for efficiency and metric collection). So the setup comes with helpers to easily connect to OpenAI-compatible APIs from the tests. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;oai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="c1"&gt;// URL of the API, note that&lt;/span&gt;
  &lt;span class="c1"&gt;// "/v1" is added by the helper&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://ollama:11434&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// a subset of the body of the request for /completions endpoints&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qwen2.5-coder:1.5b-base-q8_0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// /v1/completions endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The meaning of life is&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// You can specify anything else supported by the&lt;/span&gt;
  &lt;span class="c1"&gt;// downstream service endpoint here, these&lt;/span&gt;
  &lt;span class="c1"&gt;// will override the "options" from the client as well.&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// /v1/chat/completions endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chatComplete&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Answer in one word. Where is the moon?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// You can specify anything else supported by the&lt;/span&gt;
  &lt;span class="c1"&gt;// downstream service endpoint here, these will&lt;/span&gt;
  &lt;span class="c1"&gt;// override the "options" from the client as well.&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This client will also automatically collect a few metrics for all performed requests: &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;completion_tokens&lt;/code&gt;, &lt;code&gt;total_tokens&lt;/code&gt;, &lt;code&gt;tokens_per_second&lt;/code&gt; (completion tokens per request duration). Of course, all of the native HTTP metrics from K6 are also there.&lt;/p&gt;
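&lt;p&gt;As an illustration of what that last metric means, here's a hypothetical sketch (not the actual helper code) of how a per-request &lt;code&gt;tokens_per_second&lt;/code&gt; value can be derived from an OpenAI-style completion response and the measured request duration:&lt;/p&gt;

```javascript
// Hypothetical sketch (not the actual helper code): deriving a
// per-request tokens_per_second value from an OpenAI-style
// completion response and the measured request duration.
function tokensPerSecond(response, durationMs) {
  const completionTokens = response.usage.completion_tokens;
  return completionTokens / (durationMs / 1000);
}

// Example: 50 completion tokens generated over a 2s request
const sample = { usage: { prompt_tokens: 12, completion_tokens: 50, total_tokens: 62 } };
console.log(tokensPerSecond(sample, 2000)); // 25 tokens/s
```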

&lt;p&gt;&lt;strong&gt;K6 sequence orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When running performance tests, the goal is often to find either a scalability limit or an optimal combination of parameters for the projected scale: for example, the optimal temperature, maximum concurrency, or any other dimension of the payloads sent to the downstream API.&lt;/p&gt;

&lt;p&gt;So, the setup includes a permutation helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;oai&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./helpers/openaiGeneric.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;scenariosForVariations&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./helpers/utils.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// All possible parameters to permute&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;variations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// Variants has to be serializable&lt;/span&gt;
  &lt;span class="c1"&gt;// Here, we're listing indices about&lt;/span&gt;
  &lt;span class="c1"&gt;// which client to use&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// Variations can be any set of discrete values&lt;/span&gt;
  &lt;span class="na"&gt;animal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cats&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dogs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Clients to use in the tests, matching&lt;/span&gt;
&lt;span class="c1"&gt;// the indices from the variations above&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;clients&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="nx"&gt;oai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://ollama:11434&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;qwen2.5-coder:1.5b-base-q8_0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;oai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://vllm:11434&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Pre-configure a set of tests for all possible&lt;/span&gt;
  &lt;span class="c1"&gt;// permutations of the parameters&lt;/span&gt;
  &lt;span class="na"&gt;scenarios&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;scenariosForVariations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;variations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// The actual test code, use variation parameters&lt;/span&gt;
  &lt;span class="c1"&gt;// from the __ENV&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;__ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;animal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;__ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;animal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`I love &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;animal&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; because`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;__ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
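&lt;p&gt;To make the example above concrete, here's a hypothetical sketch of what a helper like &lt;code&gt;scenariosForVariations&lt;/code&gt; could do: expand the cartesian product of all variation values into one K6 scenario per combination, passing the values to the test via each scenario's &lt;code&gt;env&lt;/code&gt;. The scenario shape and naming here are assumptions, not the repo's actual implementation.&lt;/p&gt;

```javascript
// Hypothetical sketch of a permutation helper for K6 scenarios.
function scenariosForVariations(variations, durationSeconds) {
  // Build the cartesian product of all variation values
  let combos = [{}];
  for (const [key, values] of Object.entries(variations)) {
    const next = [];
    for (const combo of combos) {
      for (const value of values) {
        // K6 scenario `env` values must be strings
        next.push({ ...combo, [key]: String(value) });
      }
    }
    combos = next;
  }

  // One sequentially-scheduled scenario per combination
  const result = {};
  combos.forEach((env, i) => {
    result[`variation_${i}`] = {
      executor: 'constant-vus',
      vus: 1,
      duration: `${durationSeconds}s`,
      startTime: `${i * durationSeconds}s`,
      env,
    };
  });
  return result;
}

const scenarios = scenariosForVariations({ temperature: [0, 1], animal: ['cats', 'dogs'] }, 60);
console.log(Object.keys(scenarios).length); // 4 scenarios: 2 temperatures x 2 animals
```

&lt;p&gt;K6 then runs each scenario in turn, and the default function reads the current combination back from &lt;code&gt;__ENV&lt;/code&gt;, as in the test above.&lt;/p&gt;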



&lt;p&gt;&lt;strong&gt;Grafana dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To easily get the gist of the results, the setup includes a pre-configured Grafana dashboard. It's a simple one, but it's easy to extend and modify to your needs. Out of the box you can see tokens per second (on a per-request basis), completion and prompt token stats, as well as metrics related to concurrency and performance at the HTTP level.&lt;/p&gt;
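&lt;p&gt;For example, a per-request throughput panel can be driven by an InfluxQL query along these lines (K6's InfluxDB v1 output stores each metric as a measurement with a &lt;code&gt;value&lt;/code&gt; field; this is a sketch, the query in the bundled dashboard may differ):&lt;/p&gt;

```sql
SELECT mean("value")
FROM "tokens_per_second"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)
```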

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The setup is a part of a larger project, but you can use it fully standalone. Please find the guide &lt;a href="https://github.com/av/harbor/wiki/2.3.27-Satellite:-K6#standalone-setup" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>api</category>
      <category>openai</category>
      <category>k6</category>
    </item>
    <item>
      <title>Vercel (Zeit's Now) builders cache for docker-compose</title>
      <dc:creator>AV</dc:creator>
      <pubDate>Fri, 08 May 2020 14:34:19 +0000</pubDate>
      <link>https://forem.com/av-codes/vercel-zeit-s-now-builders-cache-for-docker-compose-1mm</link>
      <guid>https://forem.com/av-codes/vercel-zeit-s-now-builders-cache-for-docker-compose-1mm</guid>
      <description>&lt;p&gt;If you happened to use &lt;em&gt;Vercel&lt;/em&gt; (formely &lt;em&gt;Now&lt;/em&gt; from Zeit.co) and &lt;em&gt;docker-compose&lt;/em&gt;, there's a simple tweak to decrease  startup times when launching multiple components which are running &lt;code&gt;now dev&lt;/code&gt; inside the container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;next_frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./next-app&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;now_cache:/root/.cache/co.zeit.now/dev/builders&lt;/span&gt;
  &lt;span class="na"&gt;serverless_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./now-app&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;now_cache:/root/.cache/co.zeit.now/dev/builders&lt;/span&gt;
&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;now_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows Docker to reuse the installed builders when building new versions of the component, in a similar way to how it's done on Vercel's platform.&lt;/p&gt;

&lt;p&gt;Startup times before the tweak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ▾ ~/code/app
   docker-compose up | ts -s '%.S'
yarn run v1.22.4
$ /app/node_modules/.bin/now dev
02.670810 &amp;gt; Now CLI 19.0.0 — https://zeit.co/feedback
10.620778 &amp;gt; Ready! Available at http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Startup times after the tweak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ▾ ~/code/app
   docker-compose up | ts -s '%.S'
yarn run v1.22.4
$ /app/node_modules/.bin/now dev
02.580774 &amp;gt; Now CLI 19.0.0 — https://zeit.co/feedback
02.886081 &amp;gt; Ready! Available at http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It won't help with the startup time of your first build, but it will speed up all subsequent starts.&lt;/p&gt;

&lt;p&gt;Below is a brief explanation of why and how this works.&lt;/p&gt;




&lt;p&gt;The way &lt;code&gt;now dev&lt;/code&gt; works is by emulating a build "sandbox" similar to the one that builds your projects in Vercel's cloud. This sandbox does a lot of heavy lifting, such as turning your &lt;code&gt;/api&lt;/code&gt; or &lt;code&gt;/public&lt;/code&gt; folder into a deployable serverless app and enriching your dev experience with hooks such as &lt;code&gt;now-build&lt;/code&gt; or &lt;code&gt;now-start&lt;/code&gt; in package.json, providing an identical zero-config environment regardless of whether you're running your app locally or in the cloud. &lt;/p&gt;

&lt;p&gt;Some of these features, though, are quite heavy in terms of startup cost. So, as usual, caching is involved. The cache is centralised for all the projects using &lt;code&gt;now&lt;/code&gt; on your machine, regardless of CLI version. The cache itself contains a couple of interesting things: a &lt;code&gt;yarn&lt;/code&gt; executable and the builders that have previously been detected in use by the &lt;code&gt;now&lt;/code&gt; CLI.&lt;/p&gt;

&lt;p&gt;As this cache is global, &lt;code&gt;docker-compose&lt;/code&gt; knows nothing about it, and each and every app start is a &lt;em&gt;cold&lt;/em&gt; one. Mounting a persistent volume at that folder allows the cache to function exactly as intended.&lt;/p&gt;

&lt;p&gt;It's also possible to mount your &lt;code&gt;~/.cache/co.zeit.now&lt;/code&gt; to reuse the already existing cache for your currently logged-in user. That'll likely not work for your CI/CD pipeline though.&lt;/p&gt;
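&lt;p&gt;As a sketch, such a bind mount could look like the following (the host path and service name here are illustrative; adjust them for your OS and project):&lt;/p&gt;

```yaml
services:
  next_frontend:
    volumes:
      # Reuse the host user's existing builders cache instead of a named volume
      - ~/.cache/co.zeit.now/dev/builders:/root/.cache/co.zeit.now/dev/builders
```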

&lt;p&gt;Please be aware that this builders cache is not explicitly documented, so the behavior may change in the future. &lt;/p&gt;

</description>
      <category>vercel</category>
      <category>docker</category>
      <category>tips</category>
      <category>zeit</category>
    </item>
  </channel>
</rss>
