<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Elispeak</title>
    <description>The latest articles on Forem by Elispeak (@elispeak111).</description>
    <link>https://forem.com/elispeak111</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896178%2F93425463-2411-445b-b8a6-09accb2352f0.png</url>
      <title>Forem: Elispeak</title>
      <link>https://forem.com/elispeak111</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/elispeak111"/>
    <language>en</language>
    <item>
      <title>I built an AI English speaking coach — what was technically hard</title>
      <dc:creator>Elispeak</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:24:57 +0000</pubDate>
      <link>https://forem.com/elispeak111/i-built-an-ai-english-speaking-coach-what-was-technically-hard-3m25</link>
      <guid>https://forem.com/elispeak111/i-built-an-ai-english-speaking-coach-what-was-technically-hard-3m25</guid>
      <description>&lt;p&gt;I spent the past year building &lt;a href="https://elispeak.com" rel="noopener noreferrer"&gt;Elispeak&lt;/a&gt;, an AI English speaking coach. The user-facing pitch is simple — talk to an AI tutor, get instant pronunciation and fluency feedback, practice TOEFL / IELTS / CELPIP speaking tasks on demand. Under the hood, a few things turned out to be much harder than I expected.&lt;/p&gt;

&lt;p&gt;This is a note to myself, and to anyone else building voice-first language tools.&lt;/p&gt;

&lt;h2&gt;1. Real-time ASR + scoring latency is the whole product&lt;/h2&gt;

&lt;p&gt;The promise of "instant feedback" falls apart at 4 seconds of latency. At 1.5 seconds the coach feels like a person listening; at 3.5 it feels like a slow API. The gap lands exactly where the user is deciding between "I spoke well" and "I messed up", and waiting there destroys their confidence.&lt;/p&gt;

&lt;p&gt;Getting from end-of-utterance to a scored result — not just transcription, but pronunciation and fluency features — meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;streaming ASR instead of batch, with interim hypotheses used to start downstream work before the final transcript arrives&lt;/li&gt;
&lt;li&gt;precomputing a phoneme-alignment path so pronunciation scoring can start as soon as the audio chunk lands, not after the full sentence&lt;/li&gt;
&lt;li&gt;scoring features (pace, filler-word density, stress timing) computed on the audio stream, not derived post-hoc from the transcript&lt;/li&gt;
&lt;/ul&gt;
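&lt;p&gt;The first bullet can be sketched in Python. The interim stream shape below (pairs of an is-final flag and a word list) and the dispatch rule are illustrative assumptions, not a real provider API: a word is handed to downstream scoring as soon as two consecutive interim hypotheses agree on it.&lt;/p&gt;

```python
# Sketch: dispatch pronunciation scoring on the stable prefix of interim
# ASR hypotheses instead of waiting for the final transcript.

def stable_prefix(prev, curr):
    """Words agreed on by two consecutive interim hypotheses."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return curr[:n]

def incremental_dispatch(hypotheses):
    """Yield each word exactly once, as soon as it is safe to score."""
    dispatched = 0
    prev = []
    for is_final, words in hypotheses:
        ready = words if is_final else stable_prefix(prev, words)
        for w in ready[dispatched:]:
            yield w  # downstream phoneme alignment + scoring starts here
        dispatched = max(dispatched, len(ready))
        prev = words

# Example interim stream: later hypotheses may revise the tail.
stream = [
    (False, ["i", "want"]),
    (False, ["i", "want", "to"]),        # "i want" is now stable
    (False, ["i", "want", "two", "go"]), # tail revised; prefix unaffected
    (True,  ["i", "want", "to", "go"]),  # final transcript flushes the rest
]
print(list(incremental_dispatch(stream)))
```

&lt;p&gt;The final hypothesis flushes whatever the interims never stabilized, so every word is scored exactly once and most of the scoring work is already done when the final transcript lands.&lt;/p&gt;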

&lt;p&gt;The second-order effect: every piece of UI has to be reactive too. If feedback is ready at 1.2s but the UI only repaints every 500ms, the user can perceive up to 1.7s. Shaving animation blocking time ended up mattering almost as much as the model pipeline.&lt;/p&gt;
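&lt;p&gt;A back-of-envelope model of that repaint effect: feedback only becomes visible at the next repaint tick at or after the moment it is ready. The tick interval and phase here are hypothetical numbers matching the example above:&lt;/p&gt;

```python
# Perceived latency when results are only painted on a fixed repaint tick.
import math

def perceived_latency_ms(ready_ms, repaint_interval_ms, phase_ms=0):
    """Time of the first repaint tick at or after ready_ms.
    Ticks fire at phase_ms, phase_ms + interval, phase_ms + 2*interval, ..."""
    ticks_needed = math.ceil((ready_ms - phase_ms) / repaint_interval_ms)
    return phase_ms + ticks_needed * repaint_interval_ms

print(perceived_latency_ms(1200, 500))       # tick lands at 1500
print(perceived_latency_ms(1200, 500, 199))  # just missed a tick: 1699
```

&lt;p&gt;In the worst case the pipeline's 1.2s becomes almost 1.7s on screen, which is why repaint cadence belongs in the latency budget.&lt;/p&gt;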

&lt;h2&gt;2. Exam rubrics are not a prompt — they are a protocol&lt;/h2&gt;

&lt;p&gt;TOEFL Independent Speaking, IELTS Part 2, CELPIP Task 4 each have published rubrics. It is tempting to drop the rubric into a system prompt and call it done. It is not done.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timing windows matter more than content.&lt;/strong&gt; TOEFL Independent Speaking gives 15 seconds to prepare and 45 to speak. A "perfect answer" that runs 38 seconds can score worse on the exam than a B+ answer at 44 seconds. The coach has to grade with that tension in mind, not just on transcript quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exam-safe framing is a legal surface.&lt;/strong&gt; You cannot say "this is your TOEFL score." You can say "a tutor applying the public band descriptors might score this around 23-25 out of 30." That framing has to appear in every response, not just at onboarding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample answers drift.&lt;/strong&gt; A stable system prompt with drifting base-model behavior produces drifting feedback. I had to pin model versions per exam mode and run weekly evals on held-out recordings to catch regression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat each exam mode as its own small product with its own eval set, not one mode with a different prompt.&lt;/p&gt;
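&lt;p&gt;One way to make the timing tension from the first bullet concrete: scale the content score by how much of the speaking window was used. The weighting below is an assumption for illustration, not the published rubric:&lt;/p&gt;

```python
# Sketch of timing-aware grading for a TOEFL-style 45-second task.
# Window, floor, and the linear penalty are illustrative choices.

def timed_score(content_score, spoken_seconds, window=45.0, floor=0.2):
    """Scale a 0-1 content score by fraction of the speaking window used.
    'floor' is the part of the score kept regardless of timing."""
    usage = min(spoken_seconds / window, 1.0)
    return content_score * (floor + (1.0 - floor) * usage)

# A "perfect" 38 s answer vs a weaker 44 s answer, as in the example above:
print(round(timed_score(1.00, 38), 3))  # stops early
print(round(timed_score(0.90, 44), 3))  # fills the window
```

&lt;p&gt;With this weighting, the weaker answer that fills the window edges out the "perfect" answer that stops early, which is exactly the behavior the rubric point above demands.&lt;/p&gt;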

&lt;h2&gt;3. TTS voices that don't sound robotic are half the battle&lt;/h2&gt;

&lt;p&gt;Students do not want to roleplay with a voice that sounds like a call-center IVR. The moment the voice feels synthetic, the emotional bar for opening their mouth goes up — and you just lost the session.&lt;/p&gt;

&lt;p&gt;What actually helped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural voices tuned for conversational English, not neutral narration&lt;/li&gt;
&lt;li&gt;varying pacing and pause patterns per scenario (airport interview is clipped and fast; therapy-style friend chat has longer pauses and more um / yeah / okay fillers)&lt;/li&gt;
&lt;li&gt;supporting accent diversity so the student practices comprehension, not just production&lt;/li&gt;
&lt;li&gt;lip-sync style micro-delays — the tutor reacts a beat late, like a human would, not instantly like a bot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the engineering side, this meant a voice persona config per AI tutor character (we ship multiple tutors) and keeping the latency budget from Section 1 intact while adding TTS synthesis.&lt;/p&gt;
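&lt;p&gt;The persona config can be as small as a dataclass per tutor. Field names, voice IDs, and values here are illustrative assumptions, not a specific TTS provider's API:&lt;/p&gt;

```python
# Sketch: one voice persona per AI tutor character.
from dataclasses import dataclass

@dataclass
class VoicePersona:
    voice_id: str            # provider voice to synthesize with
    words_per_minute: int    # target speaking pace for the scenario
    pause_ms: tuple          # (min, max) silence between sentences
    filler_rate: float       # chance of an "um"/"okay" per sentence
    accent: str = "en-US"

PERSONAS = {
    # Clipped and fast, per the airport-interview example above:
    "airport_officer": VoicePersona("nova", 175, (150, 300), 0.05, "en-US"),
    # Longer pauses, more fillers, per the friend-chat example:
    "friendly_chat":   VoicePersona("ember", 130, (400, 900), 0.30, "en-GB"),
}

print(PERSONAS["airport_officer"].words_per_minute)
print(PERSONAS["friendly_chat"].filler_rate)
```

&lt;p&gt;Keeping pacing and fillers as data rather than prompt text means the latency budget and the persona tuning can evolve independently.&lt;/p&gt;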

&lt;h2&gt;Stack and trade-offs&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ASR&lt;/strong&gt;: streaming provider with word-level timestamps and phoneme probabilities. Interim hypotheses + confidence scores shaped more of the architecture than raw accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring&lt;/strong&gt;: pronunciation on phoneme-level features and edit distance vs. expected; fluency on stream-level features (pace, filler rate, pause distribution); content via an LLM pass scoped to rubric criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: one pinned model per exam mode, with eval regression suite before upgrading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS&lt;/strong&gt;: neural conversational voices, persona config per tutor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: WebRTC for capture, progressive UI updates keyed to pipeline stages so partial results feel immediate.&lt;/li&gt;
&lt;/ul&gt;
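&lt;p&gt;The stream-level fluency features from the Scoring bullet can be sketched from word-level timestamps. The (token, start, end) tuple format, the filler list, and the 200 ms pause threshold are assumptions for illustration:&lt;/p&gt;

```python
# Sketch: pace, filler density, and pause count from ASR word timestamps.

def fluency_features(words):
    """words: list of (token, start_s, end_s) from a streaming ASR provider."""
    duration = words[-1][2] - words[0][1]
    pace_wpm = len(words) / duration * 60.0
    fillers = sum(1 for w, _, _ in words if w in {"um", "uh", "er"})
    # Pauses: silent gaps between consecutive words longer than 200 ms.
    pauses = [b[1] - a[2] for a, b in zip(words, words[1:]) if b[1] - a[2] > 0.2]
    return {"pace_wpm": round(pace_wpm, 1),
            "filler_density": round(fillers / len(words), 2),
            "long_pauses": len(pauses)}

sample = [("i", 0.0, 0.2), ("um", 0.3, 0.5), ("want", 1.1, 1.4),
          ("to", 1.45, 1.6), ("travel", 1.7, 2.1)]
print(fluency_features(sample))
```

&lt;p&gt;Because these features come straight off the timestamp stream, they are ready before the content-level LLM pass even starts.&lt;/p&gt;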

&lt;p&gt;Trade-offs that bit me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;optimizing for end-to-end latency means giving up some scoring quality per step; I keep having to rebalance the two&lt;/li&gt;
&lt;li&gt;picking "one best voice" per tutor is a false economy — students attach to specific voices and churn when you change them&lt;/li&gt;
&lt;li&gt;rubrics are a moving target; budget time to rerun evals after any provider upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What I would build differently&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Invest in the eval loop before the product surface. Most debugging pain in months 4-8 traced back to missing eval coverage, not missing features.&lt;/li&gt;
&lt;li&gt;Do not ship more than two exam modes until the first two are clean. More modes means more eval sets means more drift surface.&lt;/li&gt;
&lt;li&gt;Pay for a proper observability stack earlier. Custom logging runs out of road faster than you expect on a voice pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;If you want to kick the tires, there is a free tier at &lt;a href="https://elispeak.com" rel="noopener noreferrer"&gt;elispeak.com&lt;/a&gt;. Paid plans are 50% off with code &lt;strong&gt;ELISPEAK50&lt;/strong&gt; — no minimum, works on any plan.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://app.elispeak.com/plan?promoId=promo_1TPkURDWds7e1FfDHk3YG84w&amp;amp;promoCode=ELISPEAK50&amp;amp;utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ELISPEAK50_launch_2026_04" rel="noopener noreferrer"&gt;Start practicing — 50% off any plan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments — especially on ASR pipeline design and rubric evals.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
