<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Wayne</title>
    <description>The latest articles on Forem by Wayne (@wheynelau).</description>
    <link>https://forem.com/wheynelau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898242%2F77ad6a26-606a-4f53-a83c-55494768faf9.jpeg</url>
      <title>Forem: Wayne</title>
      <link>https://forem.com/wheynelau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wheynelau"/>
    <language>en</language>
    <item>
      <title>Ansible at Home</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Mon, 27 Apr 2026 05:17:16 +0000</pubDate>
      <link>https://forem.com/wheynelau/ansible-at-home-1ig5</link>
      <guid>https://forem.com/wheynelau/ansible-at-home-1ig5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;While some engineering concepts should not be brought home, I find that Ansible is one of the few tools that is genuinely useful in a home environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Ansible?
&lt;/h2&gt;

&lt;p&gt;Ansible is a powerful automation tool that helps manage and configure systems efficiently. Another important aspect that is often overlooked is documentation. In the past, I would SSH into my home server and make changes directly; if I remembered to document them, I would save code snippets into a README.md or Obsidian note. This approach is prone to human error and leads to inconsistencies over time. Most IaC (Infrastructure as Code) tools are self-documenting, as the code itself serves as the documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup and Configuration
&lt;/h2&gt;

&lt;p&gt;Before diving into use cases, it's important to set up Ansible properly for a home environment. The configuration is straightforward and makes running playbooks much more convenient.&lt;/p&gt;

&lt;p&gt;I keep two files in the project directory: an &lt;code&gt;ansible.cfg&lt;/code&gt; pointing to my inventory file and enabling &lt;code&gt;become_ask_pass&lt;/code&gt; so it prompts for sudo passwords rather than storing credentials (security first, even at home), and an &lt;code&gt;inventory.ini&lt;/code&gt; with at least &lt;code&gt;localhost ansible_connection=local&lt;/code&gt; so playbooks run locally without SSH overhead. With those in place, &lt;code&gt;ansible-playbook playbook.yml&lt;/code&gt; just works.&lt;/p&gt;
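
&lt;p&gt;For reference, a minimal version of those two files looks something like this (the exact contents are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;# ansible.cfg
[defaults]
inventory = inventory.ini

[privilege_escalation]
become_ask_pass = True

# inventory.ini
localhost ansible_connection=local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
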

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  System configuration
&lt;/h3&gt;

&lt;p&gt;I am using a consumer Intel CPU with a stock cooler for my homelab. Since I don't expect it to run heavy workloads, I don't need it to run at full power. I set the PL1 and PL2 power limits through Ansible rather than using the &lt;code&gt;intel-undervolt&lt;/code&gt; tool. This way, if I ever need to reinstall the OS or set up a new server, I can apply the same configuration without having to remember the exact commands or settings.&lt;/p&gt;

&lt;p&gt;The playbook validates that PL1 and PL2 values fall within acceptable ranges (hard lower/upper limits) and ensures PL2 &amp;gt;= PL1 before applying them. It then writes to the sysfs powercap interfaces to set sustained and burst power limits, and creates a systemd service for persistence across reboots.&lt;/p&gt;

&lt;p&gt;It's easy to make mistakes when setting raw power values -- adding one extra zero can be disastrous. With Ansible, I can specify values like 65W and 90W instead of 65000000 and 90000000, and the validation layer catches out-of-range inputs before they get applied.&lt;/p&gt;
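
&lt;p&gt;As a rough sketch of what those tasks can look like (the variable names, limits, and RAPL zone index are illustrative, not my exact playbook):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Validate requested power limits
  ansible.builtin.assert:
    that:
      - pl1_watts | int &amp;gt;= 15
      - pl2_watts | int &amp;lt;= 125
      - pl2_watts | int &amp;gt;= pl1_watts | int
    fail_msg: "PL1/PL2 out of range, or PL2 below PL1"

- name: Apply PL1 (sustained) limit via the powercap interface
  ansible.builtin.shell:
    cmd: echo {{ (pl1_watts | int) * 1000000 }} &amp;gt; /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
  become: true

- name: Apply PL2 (burst) limit via the powercap interface
  ansible.builtin.shell:
    cmd: echo {{ (pl2_watts | int) * 1000000 }} &amp;gt; /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw
  become: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
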

&lt;h3&gt;
  
  
  Restic setup
&lt;/h3&gt;

&lt;p&gt;Restic is a great backup tool that can back up data to various locations. The backup scripts are written by hand, but the cron jobs and log rotation are managed by Ansible. Doing the process below manually would be prone to human error and inconsistency, since multiple files are involved: the cron jobs, the log rotation configuration, and the backup scripts themselves.&lt;/p&gt;

&lt;p&gt;The playbook ensures restic is installed, makes the backup scripts executable, sets up two daily cron jobs (one for immich data at 2 AM, one for documents at 3 AM), and configures logrotate to keep 21 days of compressed logs with daily rotation.&lt;/p&gt;
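
&lt;p&gt;A trimmed-down sketch of those tasks -- the script paths, schedules, and log locations are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Nightly Immich backup at 2 AM
  ansible.builtin.cron:
    name: restic-immich-backup
    minute: "0"
    hour: "2"
    job: /opt/backup/restic-immich.sh &amp;gt;&amp;gt; /var/log/restic/immich.log 2&amp;gt;&amp;amp;1

- name: Rotate restic logs daily, keeping 21 days compressed
  ansible.builtin.copy:
    dest: /etc/logrotate.d/restic
    content: |
      /var/log/restic/*.log {
          daily
          rotate 21
          compress
          missingok
          notifempty
      }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
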

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you add more services and configurations to your home environment, the benefits of using Ansible become even more apparent. It helps maintain consistency, reduces the risk of human error, and serves as documentation for your setup. Whether you're managing a single server or multiple devices, Ansible can streamline your home automation tasks effectively.&lt;/p&gt;

&lt;p&gt;The full version with the complete playbook examples is on &lt;a href="https://wheynelau.dev/posts/2025-08-12-ansible-at-home/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ansible</category>
      <category>homelab</category>
      <category>iac</category>
    </item>
    <item>
      <title>Making Compression a Habit with zstd</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:47:27 +0000</pubDate>
      <link>https://forem.com/wheynelau/making-compression-a-habit-with-zstd-2gie</link>
      <guid>https://forem.com/wheynelau/making-compression-a-habit-with-zstd-2gie</guid>
      <description>&lt;p&gt;With zstd being added to &lt;a href="https://docs.python.org/3/library/compression.zstd.html" rel="noopener noreferrer"&gt;Python 3.14&lt;/a&gt;, I've been using compressed files more often in my workflow. Here's what I've learned about making compression a habit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Data Processing with Compression
&lt;/h2&gt;

&lt;p&gt;Python 3.14 adds native &lt;code&gt;zstd.open()&lt;/code&gt; support, which is a big step forward. Here's the comparison:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before 3.14&lt;/strong&gt; (with &lt;code&gt;zstandard&lt;/code&gt; package):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zstandard&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;

&lt;span class="c1"&gt;# Writing compressed JSONL with Zstandard
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Write
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ZstdCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;cctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ZstdDecompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextIOWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python 3.14+&lt;/strong&gt; is much simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;compression&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Read and print first record
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;zstd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.jsonl.zst&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# Remove break to read all lines
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API mirrors regular &lt;code&gt;open()&lt;/code&gt; -- just use &lt;code&gt;zstd.open()&lt;/code&gt; instead.&lt;/p&gt;
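
&lt;p&gt;Writing works the same way. A minimal sketch, reusing the records from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from compression import zstd
import json

records = [{"id": 1, "name": "Alice", "score": 95}]

# 'wt' opens the compressed file in text mode; level 3 matches the earlier example
with zstd.open('data.jsonl.zst', 'wt', level=3) as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
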

&lt;p&gt;&lt;strong&gt;Key points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;'wt'&lt;/code&gt; mode for writing text, &lt;code&gt;'rt'&lt;/code&gt; for reading&lt;/li&gt;
&lt;li&gt;Typical compression ratio: 6-7x size reduction at &lt;code&gt;zstd-3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmarking Your Workload
&lt;/h2&gt;

&lt;p&gt;You should benchmark compression according to your workload to determine your trade-offs.&lt;/p&gt;

&lt;p&gt;For archival of logs or long-term storage, you can use higher compression levels of &lt;code&gt;zstd&lt;/code&gt;. Archives like Pushshift Reddit typically use level 22. For most use cases, &lt;code&gt;zstd-3&lt;/code&gt; is a good default.&lt;/p&gt;
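
&lt;p&gt;The &lt;code&gt;zstd&lt;/code&gt; CLI has a built-in benchmark mode, which makes it easy to compare levels on a representative sample of your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Benchmark compression levels 3 through 19 on a sample file
zstd -b3 -e19 sample.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
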

&lt;h2&gt;
  
  
  Working with Compressed Files
&lt;/h2&gt;

&lt;p&gt;Zstd includes tools for viewing, searching, and processing compressed files without manual decompression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick commands:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;zstdcat data.json.zst&lt;/code&gt; -- view the contents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdless data.json.zst&lt;/code&gt; -- page through like &lt;code&gt;less&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdgrep "error" events.json.zst&lt;/code&gt; -- search inside compressed files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;zstdgrep -c "timeout" events.json.zst&lt;/code&gt; -- count occurrences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also pipe to other tools: &lt;code&gt;zstdcat events.json.zst | grep ERROR | jq '.timestamp'&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transferring files
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rsync
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;-z&lt;/code&gt; flag compresses data during transfer. On highly compressible files, rsync may report &lt;code&gt;speedup &amp;gt; 1.0x&lt;/code&gt;. Here's a test with about 66GB of JSONL files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sent 50,322 bytes  received 12,451,737,167 bytes  19,290,143.28 bytes/sec
total size is 66,857,841,487  speedup is 5.37
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That speedup means the data sent over the network was much smaller than the original file size. This is highly beneficial if you're network bound or concerned about egress costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  S3 and Cloud Storage
&lt;/h3&gt;

&lt;p&gt;AWS charges for outbound data transfer (egress). Compressing data before storage can significantly reduce these costs. With a 7.0x compression ratio, a $14,000 egress bill drops to roughly $2,000.&lt;/p&gt;

&lt;p&gt;Here's an upload comparison on a gigabit connection with a 4GB JSONL file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compressed upload (zstd -k -c ... | s5cmd pipe)&lt;/span&gt;
real    0m8.139s
&lt;span class="c"&gt;# Result in S3: 363.7MB&lt;/span&gt;

&lt;span class="c"&gt;# Uncompressed upload (s5cmd cp)&lt;/span&gt;
real    0m57.547s
&lt;span class="c"&gt;# Result in S3: 4.0GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same principle works for downloading: &lt;code&gt;s5cmd cat s3://bucket/data.zst | zstd -d &amp;gt; data.jsonl&lt;/code&gt;. Compression takes longer than decompression, but the speedup is usually worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I use &lt;code&gt;zstdcat&lt;/code&gt; to read files and rarely need to edit them in an IDE. This habit cut my text storage by up to 80%. There's a balance between convenience, speed, and storage, and this works for me. More compact formats like Protobuf or Arrow exist, but most text processing still uses JSON.&lt;/p&gt;

&lt;p&gt;The full version with code examples and benchmarks is on &lt;a href="https://wheynelau.dev/posts/compression-with-ztsd/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>linux</category>
      <category>performance</category>
      <category>compression</category>
    </item>
    <item>
      <title>Using hf tokenizers in Rust</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:43:42 +0000</pubDate>
      <link>https://forem.com/wheynelau/using-hf-tokenizers-in-rust-1k5p</link>
      <guid>https://forem.com/wheynelau/using-hf-tokenizers-in-rust-1k5p</guid>
      <description>&lt;p&gt;The &lt;code&gt;tokenizers&lt;/code&gt; library from Hugging Face provides an efficient way to work with text tokenization in Rust. This guide shows you how to get started with pretrained tokenizers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;First, add the tokenizer library to your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo add tokenizers &lt;span class="nt"&gt;--features&lt;/span&gt; http,hf-hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Usage
&lt;/h2&gt;

&lt;p&gt;Here's a complete example that loads a pretrained tokenizer and processes text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Load a pretrained tokenizer&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"This is a sample string to tokenize"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Encode the text (false = no special tokens)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="nf"&gt;.encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Get token IDs&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="nf"&gt;.get_ids&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Token IDs: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Get token text&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="nf"&gt;.get_tokens&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Tokens: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Original: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of tokens: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="nf"&gt;.decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Original: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Decoded: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Working with Different Models
&lt;/h2&gt;

&lt;p&gt;You can use various pretrained models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// GPT-2 tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;gpt_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gpt2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// BERT tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;bert_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bert-base-uncased"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Llama tokenizer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llama_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;To change the cache directory for downloaded models, set the &lt;code&gt;HF_HOME&lt;/code&gt; environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting environment variables programmatically is not recommended as it requires an unsafe block. &lt;/p&gt;

&lt;h3&gt;
  
  
  Private Repositories
&lt;/h3&gt;

&lt;p&gt;If you encounter this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error: RequestError&lt;span class="o"&gt;(&lt;/span&gt;Status&lt;span class="o"&gt;(&lt;/span&gt;401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-12b-it/resolve/main/tokenizer.json]&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It means you are not authenticated and need to provide a token. There are two ways to do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write your token to &lt;code&gt;$HF_HOME/token&lt;/code&gt; (&lt;code&gt;$HF_HOME&lt;/code&gt; usually defaults to &lt;code&gt;$HOME/.cache/huggingface&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Within Rust code:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;your very secret token&amp;gt;"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"google/gemma-3-4b-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note that you may still need to get permission to access the repos.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Branches
&lt;/h3&gt;

&lt;p&gt;You can specify a specific branch or revision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;revision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"main"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;// or specific commit hash&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"google/gemma-3-4b-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User-Agent
&lt;/h3&gt;

&lt;p&gt;The parameters also include a &lt;code&gt;user_agent&lt;/code&gt; field for customizing the HTTP client's User-Agent string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokenizers&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FromPretrainedParameters&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-rust-app/1.0"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Tokenizer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gpt2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Hugging Face &lt;code&gt;tokenizers&lt;/code&gt; library provides a robust, production-ready solution for text processing in Rust applications. With support for pretrained models, authentication for private repositories, and flexible configuration options, it's an excellent choice for NLP workflows in Rust.&lt;/p&gt;

&lt;p&gt;You can find this post and more on &lt;a href="https://wheynelau.dev/posts/2025-11-21-tokenizer-rust/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>huggingface</category>
    </item>
    <item>
      <title>Setting Up Docker CI for Rust with cargo-dist</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:29:54 +0000</pubDate>
      <link>https://forem.com/wheynelau/setting-up-docker-ci-for-rust-with-cargo-dist-36nn</link>
      <guid>https://forem.com/wheynelau/setting-up-docker-ci-for-rust-with-cargo-dist-36nn</guid>
      <description>&lt;h1&gt;
  
  
  Rust CI
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Building Rust inside Docker is slow. A typical multi-stage Dockerfile compiles the binary in one stage and copies it into a minimal image in another. That works fine for local builds, but in CI it takes a long time, especially when you're emulating arm64 through QEMU.&lt;/p&gt;

&lt;p&gt;The better approach: let cargo-dist handle the compilation as part of the release workflow. By the time the Docker job runs, the binaries are already built and available as GitHub Actions artifacts. Docker just copies them in. QEMU is still needed for the final multi-arch manifest, but it's only moving files around rather than running a compiler through emulation, so arm64 builds don't take nearly as long.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The starting point is the &lt;a href="https://axodotdev.github.io/cargo-dist/book/quickstart/rust.html" rel="noopener noreferrer"&gt;cargo-dist quickstart guide&lt;/a&gt;. Once that's in place, you need a few configuration pieces to trigger the Docker build after the release.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;release.yml&lt;/code&gt;, add a &lt;code&gt;custom-docker-publish&lt;/code&gt; job that calls your docker-publish workflow and passes the plan output and binary name as inputs.&lt;/p&gt;
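
&lt;p&gt;Roughly, the job looks like this -- a sketch only, since the &lt;code&gt;needs&lt;/code&gt; list, the plan output name, and the binary name depend on your generated workflow and project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  custom-docker-publish:
    needs:
      - plan
      - host
    uses: ./.github/workflows/docker-publish.yml
    with:
      plan: ${{ needs.plan.outputs.val }}
      binary_name: my-app
    secrets: inherit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
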

&lt;p&gt;In &lt;code&gt;dist-workspace.toml&lt;/code&gt;, set &lt;code&gt;post-announce-jobs&lt;/code&gt; to point at your docker workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;post-announce-jobs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"./docker-publish"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;github-custom-job-permissions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;"docker-publish"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;contents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"read"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;allow-dirty&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ci"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The permissions block was needed because my docker workflow didn't have enough access by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Docker Workflow
&lt;/h2&gt;

&lt;p&gt;The workflow runs as a &lt;code&gt;workflow_call&lt;/code&gt; and takes the dist plan JSON, binary name, and target triple suffix as inputs. Here's the overall structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;binary_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;target_triple_suffix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown-linux-musl"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The job itself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up QEMU and Docker Buildx&lt;/li&gt;
&lt;li&gt;Log in to GHCR&lt;/li&gt;
&lt;li&gt;Extract the version from the dist plan's &lt;code&gt;announcement_tag&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Generate Docker metadata (semver tags, major.minor, major, and latest for non-prereleases)&lt;/li&gt;
&lt;li&gt;Download the amd64 and arm64 artifacts produced by cargo-dist&lt;/li&gt;
&lt;li&gt;Extract and normalize the artifacts, moving binaries into the folders the Dockerfile expects (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Build and push with &lt;code&gt;docker/build-push-action@v6&lt;/code&gt; targeting both &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
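
&lt;p&gt;Steps 5 and 6 boil down to something like this -- a sketch, since the archive names depend on how cargo-dist packages your targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Unpack each per-target archive and place the binary where the
# Dockerfile expects it (artifacts/&amp;lt;arch&amp;gt;/&amp;lt;binary&amp;gt;)
mkdir -p artifacts/amd64 artifacts/arm64
tar -xf "${BINARY_NAME}-x86_64-${SUFFIX}.tar.xz"
mv "${BINARY_NAME}-x86_64-${SUFFIX}/${BINARY_NAME}" artifacts/amd64/
tar -xf "${BINARY_NAME}-aarch64-${SUFFIX}.tar.xz"
mv "${BINARY_NAME}-aarch64-${SUFFIX}/${BINARY_NAME}" artifacts/arm64/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
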

&lt;p&gt;The version tags are pulled from the dist plan, so they stay in sync with cargo-dist's release process. The &lt;code&gt;latest&lt;/code&gt; tag is skipped for prereleases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v6&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;platforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linux/amd64,linux/arm64&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.meta.outputs.tags }}&lt;/span&gt;
    &lt;span class="na"&gt;build-args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BINARY_NAME=${{ inputs.binary_name }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;The Dockerfile depends on what your binary needs. I used distroless images and determined the right base image by running &lt;code&gt;ldd&lt;/code&gt; on the compiled binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;linux-vdso.so.1 &lt;span class="o"&gt;(&lt;/span&gt;0x00007ffdfb764000&lt;span class="o"&gt;)&lt;/span&gt;
libgcc_s.so.1 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libgcc_s.so.1
libm.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libm.so.6
libc.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libc.so.6
/lib64/ld-linux-x86-64.so.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this binary needed libc, libm, and libgcc, I went with &lt;code&gt;gcr.io/distroless/cc-debian13:nonroot&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/cc-debian13:nonroot&lt;/span&gt;

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; TARGETARCH&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; BINARY_NAME&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chmod=755 artifacts/${TARGETARCH}/${BINARY_NAME} /usr/local/bin/app&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; nonroot:nonroot&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/usr/local/bin/app"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full version with the complete workflow YAML and more context is on &lt;a href="https://wheynelau.dev/posts/2026-02-06-rust-ci/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>docker</category>
      <category>ci</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Learnings of the Poor</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 06:44:18 +0000</pubDate>
      <link>https://forem.com/wheynelau/learnings-of-the-poor-2086</link>
      <guid>https://forem.com/wheynelau/learnings-of-the-poor-2086</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Necessity is the mother of invention&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was already GPU poor, but a recent job change combined with rising component prices has also made me RAM and NVMe poor.&lt;/p&gt;

&lt;p&gt;While I am nowhere close to the optimisation experts of the '90s and early 2000s, I took this time to brush up on some fundamentals and key concepts in Python. As the saying goes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Premature optimisation is the root of all evil"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are not looking for very deep optimisations; these changes aim to follow the Pareto principle, where 80% of the outcome comes from 20% of the effort. The changes below may or may not be 20% effort, but I would consider them low-effort.&lt;/p&gt;

&lt;p&gt;As such, there won't be any discussion of performance profiling, such as finding hot loops, cache misses, or memory reallocations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iterators
&lt;/h2&gt;

&lt;p&gt;Frankly, I think this is an important concept that carries over regardless of language. Understanding iterators also helps when you need to think about channels, which are very important in Go.&lt;/p&gt;

&lt;p&gt;The typical approach collects results at every stage into lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;second_processing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;write_processed_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue: if &lt;code&gt;data.jsonl&lt;/code&gt; is bigger than your RAM, you run out of memory (OOM) very quickly. Using &lt;code&gt;yield&lt;/code&gt; instead keeps memory usage low:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;first_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_good&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function in the pipeline takes an &lt;code&gt;Iterator[dict]&lt;/code&gt; and yields records one at a time. Memory usage drops significantly.&lt;/p&gt;

&lt;p&gt;Caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files are held open throughout the pipeline, so unintentional edits or moves will break it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;json.dumps&lt;/code&gt; does not add a trailing newline, so &lt;code&gt;f.write(json.dumps(record) + '\n')&lt;/code&gt; is intentional when writing JSONL (see the writer sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
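
&lt;p&gt;To close the loop, here is a minimal writer that consumes the iterator (error handling omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def write_processed_data(records: Iterator[dict], file: str) -&amp;gt; None:
    with open(file, "w") as f:
        for record in records:
            # json.dumps does not add a newline, so append one for JSONL
            f.write(json.dumps(record) + "\n")

# Records flow through one at a time; nothing happens until
# write_processed_data starts iterating
data = read_file("data.jsonl")
data = first_filter(data)
write_processed_data(data, "output.jsonl")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
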

&lt;h3&gt;
  
  
  Learning points
&lt;/h3&gt;

&lt;p&gt;I find that iterators are a stepping stone to understanding pipelines, channels, or pub/sub patterns. When you understand iterators, you understand the bottlenecks of your code. These patterns are all, fundamentally, iterators that consume and yield.&lt;/p&gt;

&lt;p&gt;If the processing stage is slow (1 line per second) while reading and filtering are fast (4 lines per second), the whole pipeline is bounded at 1 line per second. The solution is more processing workers bridged through queues or channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read-worker-1 -&amp;gt; Filter-worker-1 -&amp;gt; Process-worker-{1..4} -&amp;gt; Write-worker-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compression
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://wheynelau.dev/posts/compression-with-ztsd/" rel="noopener noreferrer"&gt;Compression&lt;/a&gt; post, I mentioned that you should benchmark to know whether your use case benefits from compression. For write-once, read-many scenarios, higher compression levels may help.&lt;/p&gt;

&lt;p&gt;Here is a measurement for an IO-constrained scenario (reading a JSONL file from NAS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ZST: 100000it [00:05, 17220.01it/s]  (9.47 MB/s)
Raw: 100000it [00:40, 2492.39it/s]  (11.15 MB/s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because data is compressed, you can read more data per buffer. More lines are stored per MB of compressed JSONL compared to its raw form.&lt;/p&gt;

&lt;h2&gt;
  
  
  Less is more
&lt;/h2&gt;

&lt;p&gt;Less work means more efficient processing. It's about eliminating wasted work, not always adding a cache everywhere.&lt;/p&gt;

&lt;p&gt;If filtering takes 1s per line and processing takes 5s per line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process then filter on 10000 lines: &lt;code&gt;10000 * 6s = 60000s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filter then process on 10000 lines (50% bad): &lt;code&gt;10000 * 1s + 5000 * 5s = 35000s&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No complex code, no need for compiled languages. Algorithmic complexity matters too. Choosing the right data structure — a set for membership checks instead of a list, a deque instead of a list for queue operations — can eliminate entire classes of wasted work regardless of language.&lt;/p&gt;
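
&lt;p&gt;As a tiny illustration of the data structure point -- the timings are machine-dependent, so measure on your own workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

allowed_list = list(range(1_000_000))
allowed_set = set(allowed_list)

start = time.perf_counter()
_ = 999_999 in allowed_list   # O(n): scans the whole list
list_time = time.perf_counter() - start

start = time.perf_counter()
_ = 999_999 in allowed_set    # O(1): a single hash lookup
set_time = time.perf_counter() - start

print(f"list: {list_time:.6f}s, set: {set_time:.6f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
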

&lt;p&gt;The full version with code examples and benchmarks is on &lt;a href="https://wheynelau.dev/posts/2026-03-27-learnings-of-the-poor/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>optimization</category>
      <category>iterators</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics</title>
      <dc:creator>Wayne</dc:creator>
      <pubDate>Sun, 26 Apr 2026 05:05:46 +0000</pubDate>
      <link>https://forem.com/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</link>
      <guid>https://forem.com/wheynelau/how-to-benchmark-llm-inference-performance-ttft-itl-and-throughput-metrics-416p</guid>
      <description>&lt;p&gt;When deploying large language models to production, measuring performance accurately is critical. Whether you're using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: How many requests per second can your system handle?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency metrics&lt;/strong&gt;: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token generation speed&lt;/strong&gt;: Tokens per second under different concurrency levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tail latency&lt;/strong&gt;: P95 and P99 values that affect user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I'll walk through the key metrics for benchmarking language models and share why I built &lt;a href="https://github.com/wheynelau/llmperf-rs" rel="noopener noreferrer"&gt;llmperf-rs&lt;/a&gt;, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Existing Tools
&lt;/h2&gt;

&lt;p&gt;While working with &lt;a href="https://github.com/ray-project/llmperf" rel="noopener noreferrer"&gt;ray-project/llmperf&lt;/a&gt; (now archived), I noticed that Inter-Token Latency (ITL) was calculated by averaging per-request first, then aggregating those averages. This approach works well for many use cases, but I needed to preserve individual latency spikes during testing.&lt;/p&gt;

&lt;p&gt;There's also &lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html" rel="noopener noreferrer"&gt;genai-perf&lt;/a&gt;, which is very comprehensive. My only issue was getting it to run on Ubuntu 22.04 without Docker. As of this update, they've sunsetted &lt;code&gt;genai-perf&lt;/code&gt; in favor of &lt;a href="https://github.com/ai-dynamo/aiperf" rel="noopener noreferrer"&gt;aiperf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.vllm.ai/en/latest/benchmarking/cli/#dataset-overview" rel="noopener noreferrer"&gt;vllm-bench&lt;/a&gt; is solid too, but requires installing &lt;code&gt;vllm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The goal was to build a simple binary that runs almost anywhere with minimal dependencies. It was also a learning project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;This is a summary of the full &lt;a href="https://github.com/wheynelau/llmperf-rs/blob/master/docs/metrics.md" rel="noopener noreferrer"&gt;metrics documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time To First Token (TTFT)
&lt;/h3&gt;

&lt;p&gt;TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It's also important for RAG-based applications where a large chunk of processing happens at the prefill stage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TTFT = first_token_timestamp - request_start_timestamp&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Lower is better.&lt;/p&gt;
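
&lt;p&gt;As a minimal illustration (not llmperf-rs itself), TTFT can be measured by hand against any streaming OpenAI-compatible endpoint; the base URL and model name below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="my-model",                                    # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start               # first token arrives here
print(f"TTFT: {ttft:.3f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;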

&lt;h3&gt;
  
  
  Inter-Token Latency (ITL)
&lt;/h3&gt;

&lt;p&gt;ITL is the time between consecutive tokens during generation. Spikes can reveal multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.&lt;/p&gt;

&lt;p&gt;When testing against vLLM, I noticed that high ITL spikes happen when you benchmark close to the context limit. I suspect this is due to vLLM's eviction of requests if they exceed the KV cache size.&lt;/p&gt;

&lt;p&gt;For example, if 3 requests come in with &lt;code&gt;0.8x&lt;/code&gt; context length and &lt;code&gt;0.2x&lt;/code&gt; for generation, but the GPU has space for only &lt;code&gt;2.8x&lt;/code&gt; context length, one of the requests will be preempted.&lt;/p&gt;

&lt;p&gt;Aggregation: concatenate ALL ITL values across all responses, then compute statistics. Each response produces &lt;code&gt;(N-1)&lt;/code&gt; ITL values (where &lt;code&gt;N&lt;/code&gt; is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.&lt;/p&gt;
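
&lt;p&gt;A sketch of the difference, given per-token arrival timestamps collected from a streaming client (the variable names are assumptions, not llmperf-rs internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def itl_stats(token_times):
    # token_times[i] holds the arrival timestamps of the tokens in response i.
    # Aggregate the raw inter-token gaps across every response, then take
    # percentiles, so a single slow gap still shows up in p99 and max.
    all_itls = np.concatenate([np.diff(ts) for ts in token_times if len(ts) &gt; 1])
    return {"mean": all_itls.mean(),
            "p99": np.percentile(all_itls, 99),
            "max": all_itls.max()}

def itl_mean_of_means(token_times):
    # Per-request averaging (the original llmperf approach) smooths spikes away.
    return np.mean([np.diff(ts).mean() for ts in token_times if len(ts) &gt; 1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;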

&lt;h3&gt;
  
  
  Throughput Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prefill TPS&lt;/strong&gt; — tokens processed per second during the prefill phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Prefill TPS = input_tokens / TTFT&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;However, prefill TPS doesn't accurately reflect system performance because TTFT includes queue wait time, not just actual processing time. When a server is under load, your request might sit in a queue waiting for resources. The lower prefill TPS in that case reflects queue contention, not the system's processing capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode TPS&lt;/strong&gt; — tokens generated per second during the decode phase:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Decode TPS = output_tokens / (final_time - decode_start_time)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is the generation speed: how fast the model produces output.&lt;/p&gt;
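
&lt;p&gt;Both formulas in code, using hypothetical timestamps and assuming the decode phase starts at the first token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def throughput(input_tokens, output_tokens, start, first_token_time, end):
    ttft = first_token_time - start
    prefill_tps = input_tokens / ttft                       # queue time drags this down
    decode_tps = output_tokens / (end - first_token_time)   # generation speed
    return prefill_tps, decode_tps

# Example: 1000 input tokens, 200 output tokens, 0.5s TTFT, 4.5s of decoding
print(throughput(1000, 200, 0.0, 0.5, 5.0))  # (2000.0, ~44.4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;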

&lt;h2&gt;
  
  
  What Matters Most
&lt;/h2&gt;

&lt;p&gt;For production serving, focus on &lt;strong&gt;TTFT&lt;/strong&gt;, &lt;strong&gt;ITL stats&lt;/strong&gt;, and maybe &lt;strong&gt;RPM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT&lt;/strong&gt; measures how quickly users see their first token — this is the perceived responsiveness of your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ITL statistics&lt;/strong&gt; reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose preemption events from KV cache limits and network issues between components.&lt;/p&gt;

&lt;p&gt;ITL matters less for batch jobs or non-streaming APIs where users don't watch tokens arrive in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Counting
&lt;/h2&gt;

&lt;p&gt;Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;API response&lt;/strong&gt; — Most OpenAI-compatible endpoints return token counts in the &lt;code&gt;usage&lt;/code&gt; field. By default, llmperf-rs prefers this value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer&lt;/strong&gt; — For exact input counts, pass a HuggingFace tokenizer. Note that chat templates may cause &amp;lt;10 token variance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The original llmperf uses a single tokenizer for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct one or rely on API-reported counts.&lt;/p&gt;

&lt;p&gt;For example, Llama-2 has a vocab size of 32000, while Qwen3-4B has 151936. In my own testing, setting input tokens to 8192 against a Qwen endpoint while using the default llama tokenizer returned values around 7363-7376 tokens.&lt;/p&gt;
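
&lt;p&gt;A quick way to see the mismatch for yourself, assuming the &lt;code&gt;transformers&lt;/code&gt; package; the model names are examples and the Llama-2 repository is gated on the Hub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer

text = "some benchmark prompt " * 1000   # placeholder prompt
for name in ["meta-llama/Llama-2-7b-hf", "Qwen/Qwen3-4B"]:
    tok = AutoTokenizer.from_pretrained(name)
    # Different vocabularies produce different counts for the same text.
    print(name, len(tok(text)["input_ids"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;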

&lt;h2&gt;
  
  
  Validating Your Results
&lt;/h2&gt;

&lt;p&gt;All benchmark runs should end with &lt;code&gt;finish_reason = length&lt;/code&gt; (meaning the model hit the &lt;code&gt;max_tokens&lt;/code&gt; limit). If you see &lt;code&gt;finish_reason = stop&lt;/code&gt;, the model stopped early. This skews metrics like RPM and E2E latency: a higher rate of early stops produces higher RPM and lower latency because responses are shorter.&lt;/p&gt;
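
&lt;p&gt;A simple sanity check after a run could look like this; &lt;code&gt;responses&lt;/code&gt; is assumed to be a list of parsed chat-completion results, not something llmperf-rs exposes under that name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def check_finish_reasons(responses):
    # responses: parsed chat-completion dicts collected during the benchmark run
    stopped_early = [r for r in responses
                     if r["choices"][0]["finish_reason"] != "length"]
    if stopped_early:
        print(f"{len(stopped_early)}/{len(responses)} requests stopped before "
              f"max_tokens; RPM and latency will look better than they really are")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;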

&lt;h2&gt;
  
  
  When to Use llmperf-rs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use llmperf-rs when:&lt;/strong&gt; running benchmarks with minimal dependencies, testing OpenAI-compatible endpoints, wanting low overhead (Rust, no Ray/ZMQ), or needing a quick way to test endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider alternatives when:&lt;/strong&gt; you need GPU-level metrics (use trtllm-bench or aiperf), you're testing vLLM-specific features, you require extensive reporting dashboards, or you need distributed testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ITL Matters Even When Throughput Looks Good
&lt;/h2&gt;

&lt;p&gt;High throughput with bad ITL means tokens arrive in bursts, and chat users notice the choppy streaming. ITL spikes (p99 &amp;gt;100ms) often indicate preemption, network issues, or other problems. For non-user-facing use cases like agentic coding, throughput may matter more than ITL specifics.&lt;/p&gt;

&lt;p&gt;The full version with code examples, benchmarks, and installation instructions is on &lt;a href="https://wheynelau.dev/posts/2025-12-15-benchmarking-performance/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>benchmarking</category>
      <category>rust</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
