Forem: Yahya Saleh

Using an LLM to automate a task that used to take hours by hand

Yahya Saleh — Sat, 23 May 2026 15:15:00 +0000

I want to share a concrete example of using an LLM to automate a manual process in my workflow. Not chatbot stuff. An actual pipeline step that used to require a human sitting with two audio tracks for hours.

The problem

I build live speech-to-speech translation. To measure latency, I need to know which phrase in the source audio corresponds to which phrase in the translated audio, so I can measure the time gap between them. This alignment used to be done by hand. A person listens to both tracks, matches up the phrases, and logs timestamps. For a 6-minute session that's easily an afternoon of work.

The hard part isn't the math. It's the alignment. Languages reorder things. German puts verbs at the end. Arabic restructures sentences. A Spanish phrase at position 3 might map to an English phrase at position 7.

Where the LLM fits in

This is exactly the kind of thing LLMs are good at. They understand semantic equivalence across languages and handle reordering naturally. So I replaced the manual step with an LLM call:

1. Force-align both audio tracks to get per-word timestamps (automated, no LLM needed)
2. Number every word in both transcripts and send them to the LLM
3. LLM returns matched phrase pairs with word indices
4. Compute the time gap for each pair using the timestamps from step 1
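A minimal sketch of the numbering and gap computation in these steps, assuming the forced aligner yields a start and end time per word and the LLM returns each matched pair as first/last word indices. The names `Word`, `number_words`, `ear_voice_spans`, and the pair format are hypothetical and not taken from the harness, and measuring from phrase start to phrase start is one possible convention.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment
    end: float


def number_words(words):
    """Render a transcript as '0:hola 1:mundo' so the LLM can cite word indices."""
    return " ".join(f"{i}:{w.text}" for i, w in enumerate(words))


def ear_voice_spans(source, target, pairs):
    """Time gap per matched phrase pair.

    Each pair is a dict {"src": (first, last), "tgt": (first, last)} of word
    indices (assumed format). The gap is measured from the start of the source
    phrase to the start of the translated phrase (assumed convention).
    """
    gaps = []
    for pair in pairs:
        src_first, _ = pair["src"]
        tgt_first, _ = pair["tgt"]
        gaps.append(target[tgt_first].start - source[src_first].start)
    return gaps


# Toy data, not real measurements.
source = [Word("hola", 0.0, 0.5), Word("mundo", 0.5, 1.0), Word("adios", 1.0, 1.5)]
target = [Word("hello", 2.0, 2.5), Word("world", 2.5, 3.0), Word("goodbye", 3.5, 4.0)]

# What the LLM step might return after reading both numbered transcripts.
pairs = [
    {"src": (0, 1), "tgt": (0, 1)},
    {"src": (2, 2), "tgt": (2, 2)},
]

prompt_source = number_words(source)
gaps = ear_voice_spans(source, target, pairs)
```

Because the pairs carry indices rather than positions in order, a reordered translation (source phrase 3 matching target phrase 7) needs no special handling in the gap computation.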

What used to take hours now takes a couple of minutes. No human in the loop.

The general pattern

The reason I'm sharing this is that the pattern generalizes. If you have a workflow step where a human reads two things and figures out how they correspond, an LLM can probably do it. The key is that I'm not asking it for a judgment call or creative output. I'm asking it to do structured alignment, a well-constrained task where it's reliable.

The LLM only handles the one step that actually needs language understanding. Everything else (force alignment, timestamp extraction, aggregation) is regular code.

Full methodology: Automating ear-voice span

Code: VoiceFrom/live-s2st-eval

How I use an LLM as a translation judge

Yahya Saleh — Fri, 22 May 2026 14:53:00 +0000

I use GEMBA-MQM v2 to evaluate translation quality in my live speech-to-speech translation pipeline. MQM (Multidimensional Quality Metrics) is an open industry standard for grading translations. Instead of a single score, it classifies every error by type (mistranslation, omission, hallucination, grammar, etc.) and severity (critical, major, minor). It's what professional linguists use when they review translations manually.

GEMBA makes an LLM do this same annotation process. It prompts the model to read the source and translation side by side, find the errors, and tag each one with an MQM type and severity. So you get the same structured error breakdown you'd get from a human reviewer, but automated. It ranked #1 on WMT24 by correlation with human MQM annotations.
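To make the structured breakdown concrete, here is a minimal sketch of how a list of MQM-style annotations might be reduced to one segment score. The severity weights (25 / 5 / 1) are an assumption based on weights commonly used with MQM-style scoring, not values confirmed for this harness, and `MqmError` and `mqm_score` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed penalty per severity; the harness may use different values.
SEVERITY_PENALTY = {"critical": 25, "major": 5, "minor": 1}


@dataclass(frozen=True)
class MqmError:
    span: str      # the offending text in the translation
    category: str  # e.g. "mistranslation", "omission", "grammar"
    severity: str  # "critical", "major", or "minor"


def mqm_score(errors):
    """Sum the penalties and negate, so 0 is a clean segment and lower is worse."""
    return -sum(SEVERITY_PENALTY[e.severity] for e in errors)


# Toy annotations for one segment, as a judge pass might return them.
annotations = [
    MqmError("Bank", "mistranslation", "major"),
    MqmError("der", "grammar", "minor"),
    MqmError("", "omission", "critical"),
]
segment_score = mqm_score(annotations)
```

Under this scheme a single pass that flags one extra critical error moves the score by 25 points, which is one way large swings between passes can arise.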

The catch: LLM judges are noisy. On one English-to-German clip, 10 passes gave me scores from -29 to -109. Same translation, same model.

The fix is straightforward. Run 10 passes per segment, drop outliers more than 2 standard deviations from the mean, then aggregate with rank-reciprocal weighted averaging so the harshest outlier doesn't dominate. With that procedure, the same clip settles at -41.9 across 10 passes.
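A rough sketch of that aggregation, under stated assumptions: outliers are measured against the population standard deviation of the passes, and the surviving scores are ranked from mildest to harshest with weight 1/rank. The ranking order is my reading of "so the harshest outlier doesn't dominate" and may differ from what the harness does. The function names and the sample scores are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(scores, k=2.0):
    """Keep scores within k standard deviations of the mean."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return list(scores)
    return [s for s in scores if abs(s - mu) <= k * sd]


def rank_reciprocal_mean(scores):
    """Weighted mean with weight 1/rank, rank 1 being the mildest score."""
    ordered = sorted(scores, reverse=True)  # closest to 0 first (assumed order)
    weights = [1 / rank for rank in range(1, len(ordered) + 1)]
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)


def aggregate(scores, k=2.0):
    """Outlier removal followed by rank-reciprocal weighted averaging."""
    return rank_reciprocal_mean(drop_outliers(scores, k))


# Ten made-up passes over one segment; one pass is far harsher than the rest.
passes = [-30, -32, -35, -31, -33, -34, -30, -36, -32, -200]
kept = drop_outliers(passes)
final_score = aggregate(passes)
```

In this toy run the -200 pass falls outside two standard deviations and is discarded, and the weighting pulls the result slightly toward the milder passes compared with a plain mean of the rest.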

If you're using LLM-as-judge for anything, try running multiple passes. The variance will surprise you.

Full methodology: LLMs as translation judges: Inside GEMBA-MQM v2

Code: VoiceFrom/live-s2st-eval

I benchmarked OpenAI's new GPT-Realtime-Translate against four other live translation systems

Yahya Saleh — Wed, 20 May 2026 14:44:00 +0000

OpenAI shipped GPT-Realtime-Translate on May 8. It's their first model purpose-built for live speech translation, and it supports 70+ input languages.

I've been building a live translation pipeline at VoiceFrom, so I ran it through the same eval harness I use on our own system and three competitors: Google Meet, LiveVoice, and Palabra. Same source audio, same scoring, eight language pairs.

How I scored it:

Accuracy: GEMBA-MQM v2, an LLM judge that annotates specific translation errors (type + severity) rather than giving a single score. 10 scoring passes per segment, outlier removal, rank-reciprocal weighted aggregation. GEMBA ranked #1 on WMT24 by correlation with human MQM annotations.
Latency: Automated Ear-Voice Span, the time between when a source phrase is spoken and when the translation starts playing.

What I found:

VoiceFrom Pro was more accurate than OpenAI in 6 out of 8 language pairs
OpenAI had lower median latency than VoiceFrom (5.4s vs 7.3s)
Google Meet was fastest overall but had by far the worst accuracy
The accuracy gaps were much bigger than the latency gaps

The interesting tradeoff: OpenAI is fast but makes more errors. Google is fastest but the translations are often wrong. The platforms that take a bit longer tend to get the meaning right.

Full benchmark with charts and audio samples: Five platforms, one harness: a head-to-head live translation benchmark

The eval harness is open source if you want to run it on your own system: VoiceFrom/live-s2st-eval

Benchmarking five live translation systems with an open-source eval harness (including OpenAI's GPT-Realtime-Translate)

Yahya Saleh — Tue, 19 May 2026 16:42:00 +0000

We built an open-source evaluation harness for live speech-to-speech translation and used it to benchmark five platforms head-to-head. This post walks through the methodology (GEMBA-MQM v2 for accuracy, Ear-Voice Span for latency) and the results.

Eval harness: https://github.com/VoiceFrom/live-s2st-eval