Forem: Ahmed Mahmoud

Speech Recognition for Language Learners: A Technical Guide

Ahmed Mahmoud — Wed, 15 Apr 2026 10:03:00 +0000

Speech Recognition for Language Learners: A Technical Guide

Building pronunciation feedback into a language learning app requires more than dropping in a speech-to-text API. The general-purpose APIs (Whisper, Google STT, Azure Speech) are optimised for transcription accuracy on fluent speech — they're not designed for the specific problems of language learner speech, which has different phonological patterns, different error distributions, and different evaluation needs.

Here's a technical guide to implementing speech recognition that actually serves language learners.

The Core Problem: Learner Speech vs. Native Speech

Language learner speech has distinct characteristics that trip up models trained on native speaker corpora:

Non-native phonemes: A Spanish speaker learning English will produce /v/ as /b/ or /β/. A Chinese speaker may not distinguish /r/ and /l/. These aren't random errors — they're systematic substitutions from the L1 phonological system.
Prosodic errors: Wrong stress placement ("reCORD" vs "REcord"), flattened intonation, incorrect syllable timing.
Hesitation phenomena: More frequent false starts, filled pauses (uh, um), and reformulations.
Accent interference: Learner speech at B1–B2 level is recognisable but may use intonation patterns from the native language.

A standard transcription API that returns 95% word accuracy on native speech may drop to 75–80% on learner speech — which is bad for a transcription use case and catastrophic for pronunciation feedback, where you need to detect the subtle differences.

Architecture Options

Option 1: Standard STT → Text Comparison

The simplest approach: transcribe the learner's speech, compare the transcript to the target phrase.

import openai

def check_pronunciation(audio_bytes: bytes, target_text: str) -> dict:
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", audio_bytes, "audio/wav"),
        language="en",  # force language to prevent misdetection
    ).text

    # Simple word-level comparison
    target_words = target_text.lower().split()
    spoken_words = transcript.lower().split()

    matches = sum(1 for t, s in zip(target_words, spoken_words) if t == s)
    accuracy = matches / len(target_words) if target_words else 0

    return {
        "transcript": transcript,
        "accuracy": accuracy,
        "matched": matches,
        "total": len(target_words),
    }

This approach works for basic go/no-go feedback ("you said it correctly / incorrectly") but can't identify which phoneme was mispronounced, which is what a learner actually needs.

Option 2: Forced Alignment → Phoneme-Level Analysis

For granular pronunciation feedback, you need forced alignment: aligning the audio signal to the expected phoneme sequence to identify where divergences occur.

Montreal Forced Aligner (MFA) is the standard open-source tool for this. It takes audio + transcript and returns time-aligned phoneme segments:

# Install MFA
pip install montreal-forced-aligner

# Align a single utterance
mfa align audio_dir/ english_mfa english_mfa aligned_dir/

The output is a TextGrid file with word and phoneme tiers, each with start/end timestamps. You can then compare the actual phoneme segments to the expected phonemes from a G2P (grapheme-to-phoneme) model.

For a Python integration:

from montreal_forced_aligner.alignment import PretrainedAligner

aligner = PretrainedAligner(
    acoustic_model_path="english_mfa",
    dictionary_path="english_mfa",
)

def get_phoneme_alignment(audio_path: str, transcript: str) -> list[dict]:
    # Returns list of {phoneme, start, end, confidence} dicts.
    aligner.align_utterance(audio_path, transcript)
    # Parse TextGrid output
    return parse_textgrid(audio_path.replace(".wav", ".TextGrid"))

This lets you identify exactly which phoneme was mispronounced, at what timestamp, and by how much — enabling precise feedback ("you pronounced the /θ/ in 'think' as /s/").

Option 3: Whisper + Custom Pronunciation Scoring

A practical middle ground: use Whisper for transcription and a custom model for pronunciation scoring.

Whisper's word_timestamps feature (available in the Python library) returns per-word timestamps that you can use as an approximation of forced alignment:

import whisper

model = whisper.load_model("base.en")

def transcribe_with_timestamps(audio_path: str) -> dict:
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en",
    )

    words = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            words.append({
                "word": word["word"].strip(),
                "start": word["start"],
                "end": word["end"],
                "probability": word.get("probability", 1.0),
            })

    return {
        "text": result["text"],
        "words": words,
        "language": result["language"],
    }

Low probability scores at the word level indicate the model was uncertain, which correlates with mispronunciation. This isn't as precise as forced alignment but is much simpler to implement and runs on consumer hardware.

Whisper for Multilingual Learner Speech

Whisper is trained on 680,000 hours of multilingual speech and handles learner speech better than most commercial APIs because its training data includes accented and non-native speech. Key parameters for learner use cases:

result = model.transcribe(
    audio_path,
    language="es",              # force target language — prevents misidentifying
                                #   accented speech as a different language
    task="transcribe",          # "transcribe" (same language) vs "translate" (→English)
    temperature=0.0,            # deterministic output for consistent scoring
    no_speech_threshold=0.6,    # threshold for detecting silence/no speech
    logprob_threshold=-1.0,     # lower = accept lower-confidence transcriptions
    compression_ratio_threshold=2.4,
)

Setting language explicitly is critical — without it, a Spanish learner speaking heavily accented English may have their audio classified as Spanish, producing a transcription in the wrong language.

Pronunciation Scoring Approaches

Goodness of Pronunciation (GOP)

GOP is the standard quantitative measure of pronunciation quality at the phoneme level. It measures how well the learner's acoustic signal matches the expected phoneme model:

# Simplified GOP calculation using acoustic model likelihoods
def calculate_gop(phoneme_posterior: list[float], expected_phoneme_idx: int) -> float:
    # phoneme_posterior: posterior probability distribution over all phonemes
    # expected_phoneme_idx: index of the expected phoneme
    # Returns: GOP score in range [-1, 0], higher is better
    expected_prob = phoneme_posterior[expected_phoneme_idx]
    max_prob = max(phoneme_posterior)
    gop = math.log(expected_prob / max_prob + 1e-10)
    return gop

A GOP score near 0 means the expected phoneme had the highest posterior probability (correct pronunciation). A very negative score means a different phoneme dominated (mispronunciation).

Extended GOP (eGOP) for Language Learners

Standard GOP has high false positive rates for learner speech because it's calibrated on native speech distributions. Extended GOP variants (using acoustic models trained on learner speech corpora) significantly improve precision.

The ISLE corpus (Italian and German learners of English) and the L2-ARCTIC corpus (Indian, Korean, Mandarin, Spanish speakers of English) are the main training resources for learner-specific acoustic models.

React Native Integration

For a mobile language app, on-device inference is preferable to reduce latency:

// Using @xenova/transformers (Transformers.js) for on-device Whisper
import { pipeline } from '@xenova/transformers';

const asr = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-base.en',  // ~150MB, runs on-device
  { device: 'auto' }
);

async function transcribeAudio(audioUri: string): Promise<string> {
  const result = await asr(audioUri, {
    language: 'english',
    task: 'transcribe',
    return_timestamps: 'word',
  });
  return result.text;
}

For devices where on-device inference is too slow (older iPhones, low-end Android), a hybrid approach — stream audio to a server endpoint — adds ~200–400ms of network latency but maintains quality.

Feedback Design: What to Tell the Learner

Technical accuracy in pronunciation detection is only useful if the feedback is pedagogically sound. The design principles from SLA research:

Prioritise actionable errors: Tell the learner which phoneme to work on, not just that pronunciation was incorrect.
Recast before explicit correction: Play back the correct pronunciation immediately after the learner's attempt. Hearing the correct form in context is more effective than a score.
Avoid overcorrection: Flag at most 1–2 errors per utterance. More than that is demoralising and cognitively overloading.
Use visual feedback for prosody: A pitch track (waveform with pitch overlay) helps learners see sentence-level intonation patterns they can't easily hear in their own speech.
Track improvement over time: Learners need to see progress. A per-phoneme accuracy trend chart (this week vs. last week) is more motivating than an absolute score.

The speech recognition layer provides the data; the UX layer determines whether it produces learning. Getting both right is the product challenge.

I'm building Pocket Linguist, an AI-powered language tutor for iOS. It uses spaced repetition, camera translation, and conversational AI to help you reach conversational fluency faster. Try it free.

Building a React Native App for 20+ Languages: Lessons in i18n

Ahmed Mahmoud — Wed, 08 Apr 2026 10:00:21 +0000

Building a React Native App for 20+ Languages: Lessons in i18n

Supporting 20+ languages in a mobile app is not a checklist item. It's a continuous engineering commitment that touches every layer of the stack: UI layout, typography, data storage, API design, and release workflows.

Here's what I learned building a language learning app with extensive multilingual support.

The i18n Library Decision

For React Native, the main options are:

i18next + react-i18next: Most full-featured. Supports namespaces, pluralisation, interpolation, language detection. ~20KB gzipped.
react-native-localize + custom solution: Lower-level, more control. Works well if your needs are simple.
expo-localization: Good for Expo-managed workflow apps, limited for bare React Native.

I chose i18next with react-i18next. The namespace support is critical when your translation file grows beyond 200 keys — splitting by feature (onboarding, settings, lesson, error) keeps files manageable and allows lazy loading.

// i18n.ts
import i18n from 'i18next';
import { initReactI18next } from 'react-i18next';
import { getLocales } from 'expo-localization';

i18n.use(initReactI18next).init({
  resources: {
    en: { translation: require('./locales/en.json') },
    es: { translation: require('./locales/es.json') },
    // ...
  },
  lng: getLocales()[0].languageCode ?? 'en',
  fallbackLng: 'en',
  interpolation: { escapeValue: false },
});

Text Expansion: The Layout Killer

English is one of the most compact written languages. When you translate UI strings to German, Finnish, or Portuguese, prepare for your buttons to overflow.

Expansion factors by target language (relative to English):

Language	Avg text expansion
German	+25–35%
Finnish	+25–30%
Portuguese	+20–30%
Spanish	+15–25%
French	+15–20%
Japanese	-15–25% (usually shorter)
Chinese	-30–40%

The failure modes are predictable: buttons that wrap to two lines, truncated navigation labels, overflow in table cells, clipped input placeholder text.

The fix requires designing with worst-case text from the start. I enforce a rule: every string that appears in UI must be tested with the German translation before the component is considered complete. German reliably produces the longest strings in Latin-script languages.

Practical solutions:

// Bad: fixed width button
<TouchableOpacity style={{ width: 120 }}>
  <Text>{t('save_button')}</Text>
</TouchableOpacity>

// Good: minimum width with flexible growth
<TouchableOpacity style={{ minWidth: 120, paddingHorizontal: 16 }}>
  <Text numberOfLines={1} adjustsFontSizeToFit minimumFontScale={0.8}>
    {t('save_button')}
  </Text>
</TouchableOpacity>

adjustsFontSizeToFit with a reasonable minimumFontScale handles most overflow cases without layout breakage. For buttons, prefer paddingHorizontal over fixed widths.

Right-to-Left Layout: Not Optional for Arabic and Hebrew

Arabic (standard in North Africa, the Middle East) and Hebrew are right-to-left scripts. If you're supporting them, you need RTL layout support — this is a global I18nManager.forceRTL(true) call that flips the entire layout, not individual components.

import { I18nManager } from 'react-native';
import * as Updates from 'expo-updates';

async function activateRTL() {
  if (!I18nManager.isRTL) {
    I18nManager.forceRTL(true);
    await Updates.reloadAsync(); // app restart required
  }
}

Caveats:

The RTL switch requires an app reload. Don't try to do it in-session.
Not all third-party components respect the RTL flag. Custom icons (chevrons, back arrows) need manual mirroring.
Numbers in Arabic text are still left-to-right. Mixed directionality in a single text run requires explicit bidi control characters.
Test on a physical device. RTL rendering in the simulator has historically had edge cases.

Font Support: The CJK Problem

React Native's default font stack handles Latin, Cyrillic, and Greek well. For CJK (Chinese, Japanese, Korean), you're relying on the system font — which is fine on iOS (PingFang SC/TC, Hiragino Sans) but inconsistent on Android where the system CJK font depends on the device manufacturer and Android version.

For a language learning app where rendering quality directly impacts the user's ability to read characters correctly, inconsistent CJK rendering is a real problem. The solution:

Bundle a guaranteed CJK font (Noto Sans CJK, Source Han Sans). Accept the 2–5MB APK size increase.
Use @expo-google-fonts/noto-sans-sc (and equivalents) for managed Expo apps.
Apply the font to text components via a custom Text wrapper that automatically selects the correct font family based on the current language.

// components/LText.tsx — language-aware Text component
const FONT_MAP: Record<string, string> = {
  zh: 'NotoSansSC',
  ja: 'NotoSansJP',
  ko: 'NotoSansKR',
  ar: 'NotoSansArabic',
  default: 'System',
};

export function LText({ style, ...props }: TextProps) {
  const { language } = useLanguage();
  const fontFamily = FONT_MAP[language] ?? FONT_MAP.default;
  return <Text style={[{ fontFamily }, style]} {...props} />;
}

Pluralisation: Not Just "0, 1, many"

English has two plural forms: one (singular) and everything else (plural). Many languages have more. Russian has four forms (one, a few, many, other). Arabic has six. Polish has four with different rules than Russian.

i18next handles pluralisation through the _one, _other convention for simple cases, and _zero, _one, _two, _few, _many, _other for languages that require them. The CLDR (Common Locale Data Repository) defines the exact rules per language.

// locales/ru.json
{
  "items_count": "{{count}} предмет",
  "items_count_few": "{{count}} предмета",
  "items_count_many": "{{count}} предметов",
  "items_count_other": "{{count}} предмета"
}

The traps:

JavaScript's Intl.PluralRules is your friend for runtime pluralisation outside i18next.
Don't embed numbers in translated strings if you can avoid it. Let the UI compose the number and the pluralised noun separately.
Date and number formats are locale-specific. Use Intl.DateTimeFormat and Intl.NumberFormat — never hardcode separators.

Translation Workflow: The Human Problem

Technical i18n is the easy part. Managing translations for 20+ languages is an operational challenge.

The workflow that works at small scale:

Source strings only in English. Never translate from a translation. The telephone-game error accumulates.
Automated key extraction. i18next-parser scans your codebase and generates a keys-only JSON for translators.
Translation memory. Tools like Weblate, Crowdin, or even a shared Google Sheet with a translation memory script save significant cost and improve consistency.
Machine translation first-pass + human review. DeepL for European languages, Google Cloud Translation for Asian languages. Human review catches idiom errors and context mismatches that MT misses.
Screenshot context for translators. A string like "back" is ambiguous without seeing the UI. Tools like Crowdin's in-context editor or automated screenshot generation remove ambiguity.

The worst outcome is inconsistent terminology — using three different words for the same concept across screens because three different translators worked on three different screens without a glossary. Build a glossary early and enforce it.

Performance: Lazy-Loading Locales

Bundling 20+ locale files adds up. At even 50KB per language, 20 languages is 1MB of translation JSON loaded at startup — most of which the user never needs.

Lazy-loading solution:

i18n.use(initReactI18next).init({
  partialBundledLanguages: true,
  resources: {
    en: { translation: require('./locales/en.json') }, // bundle default
  },
  backend: {
    loadPath: `${FileSystem.documentDirectory}locales/{{lng}}/{{ns}}.json`,
  },
});

// On language change:
async function switchLanguage(lng: string) {
  await downloadLocaleIfNeeded(lng); // fetch from CDN, write to FileSystem
  await i18n.changeLanguage(lng);
}

The tradeoff: first-launch latency on a language change. Accept a loading state the first time a non-bundled language is selected; subsequent loads are instant from the filesystem cache.

The Testing Problem

Automated testing for i18n is under-invested in most projects. A minimum viable approach:

Snapshot tests with each locale to catch layout regressions.
String length tests — assert no translated string exceeds a maximum length for UI-critical strings.
RTL smoke test — a single E2E test that switches to Arabic and verifies the primary navigation flows don't break.
Missing translation linting — a CI step that fails if any key present in the English locale is absent from other locales.

# CI step using i18next-parser output
for lang in es fr de ja zh ar ko; do
  node scripts/check-missing-keys.js --base en --target $lang
done

i18n debt compounds faster than most technical debt. Catching missing translations in CI rather than in production is worth the setup cost.

I also run Agnes AI — AI-powered services for businesses including security scans, content packs, translations, and custom AI solutions.

The Untranslatable: 20 Words That Don't Exist in English

Ahmed Mahmoud — Wed, 08 Apr 2026 10:00:02 +0000

The Untranslatable: 20 Words That Don't Exist in English

Every language draws borders around experiences, giving them a name and therefore a kind of existence. Some experiences that other cultures have named, English has never found the need to. These gaps reveal as much about English speakers as the words themselves reveal about other cultures.

Here are 20 words worth knowing — and worth stealing.

Japanese

木漏れ日 (Komorebi)
The interplay of sunlight and shadow created by leaves filtering light. English has "dappled light" but it's clinical, a description rather than an experience. Komorebi names the specific quality of that light and the emotional register it carries.

木枯らし (Kogarashi)
The first cold wind of winter — specifically, the one that makes you realise summer is finally over. A word for a meteorological threshold that also marks a psychological one.

侘び寂び (Wabi-sabi)
The beauty found in imperfection, transience, and incompleteness. The appreciation of a cracked tea bowl, a faded garden, an old face. English has "bittersweet" but wabi-sabi is not about two opposing qualities coexisting — it's about the incompleteness itself being the beautiful thing.

木の芽時 (Konome-doki)
The unsettled, slightly melancholy feeling that comes with spring. The emotional vertigo of seasonal change.

Portuguese

Saudade
A deep, melancholic longing for something or someone you love that is absent — and may never return. Crucially, saudade can be felt for something you've never had. It's not nostalgia (which requires a past) and it's not simply sadness. It's a specific flavour of longing that Portuguese speakers feel is fundamental to their character.

Desenrascanço (Desenrascanço)
The act of improvising a solution to a problem using whatever is at hand, usually moments before disaster. MacGyvering, but as a cultural virtue and a reliable strategy, not an emergency measure.

German

Verschlimmbessern
To make something worse while attempting to improve it. Every developer who has ever "quickly refactored" something before a deadline knows this word.

Weltschmerz
World-weariness. The anguish of the gap between how the world is and how you feel it should be. Cynicism is too shallow; depression is too clinical. Weltschmerz is the specific intellectual and emotional pain of caring about the world's failures.

Torschlusspanik
"Gate-closing panic" — the anxiety that time is running out to achieve something. Specifically the feeling that a door (a relationship, a career path, a life stage) is closing and you haven't walked through it yet.

Fingerspitzengefühl
Literally "fingertip feeling." The delicate intuitive sense for handling a difficult situation — knowing exactly how much pressure to apply, what to say, when to be silent. The word itself demonstrates the quality it names.

Danish / Norwegian

Hygge (Danish)
The feeling of coziness, warmth, and contentment that comes from being surrounded by good company, candlelight, and comfort. Not the objects themselves but the emotional state they produce. It became a lifestyle trend in English-speaking countries, but borrowing the word hasn't quite captured the cultural specificity of what it names.

Pålegg (Norwegian)
Any topping for bread — cheese, jam, meat, vegetables. The word's existence (English requires six different words to cover the same concept) suggests that the Norwegians think about open-faced sandwiches more seriously than most cultures.

Georgian

Shemomedjamo (შემომეჭამა)
"I ate the whole thing by accident." A word for the experience of eating more than you intended because the food was so good that stopping required conscious effort you weren't willing to apply.

Swedish

Lagom
Not too much, not too little. Precisely the right amount. The word encodes a cultural value: the right measure, the appropriate amount, the comfortable middle. Swedish culture treats lagom as a virtue; other cultures might call it mediocrity. The difference is instructive.

Mångata
The road-like reflection of moonlight on water. A word for a visual phenomenon specific enough to have warranted naming.

Indonesian

Jayus
A joke so bad, so unfunny, that you can't help laughing at it. The English approximation requires a phrase ("so bad it's funny"); Indonesian has a single word.

Czech

Prozvonit
To call a phone and let it ring once so the other person calls back — avoiding the charge for the call. The word exists because the practice is common enough to need naming. A linguistic artifact of a specific economic reality.

Filipino

Gigil
The overwhelming urge to squeeze or pinch something overwhelmingly cute — a puppy, a baby, a kitten. The physical, almost aggressive impulse that extreme cuteness produces. English has "I want to squeeze you" but gigil names the internal state before any action.

What these words share is that they name experiences that are universal enough to exist across cultures, but only specific enough to be named in some. The experiences English doesn't name aren't absent from English speakers' lives — they simply have to be described in phrases, which is less efficient and also less emotionally available.

When you learn a word for an experience you recognise, something shifts. The experience becomes more legible to you, more sharable, more available for reflection. This is one of the most underappreciated benefits of language learning: it doesn't just give you access to a culture, it gives you new categories for your own experience.

The Science of Language Learning: What Research Actually Says

Ahmed Mahmoud — Wed, 01 Apr 2026 10:00:14 +0000

The Science of Language Learning: What Research Actually Says

Language learning advice is everywhere. Most of it is based on anecdote, marketing, or the experience of unusually gifted polyglots. The scientific literature tells a more nuanced — and more actionable — story.

Here's what decades of second language acquisition (SLA) research actually establishes, with the practical implications for how you should build your study routine.

1. Input Hypothesis: Comprehensible Input Is the Core Driver

Stephen Krashen's Input Hypothesis (1982) remains the most influential and most contested theory in SLA. The central claim: language acquisition happens when you encounter input that is slightly above your current level of competence (i+1 in his notation — "comprehensible input").

What the evidence actually supports:

High-quality input (reading, listening to native material) is necessary for acquisition
Grammar instruction alone without input exposure produces test-takers, not speakers
Output (speaking, writing) accelerates acquisition beyond input-only exposure — this is where Krashen's original theory is underdeveloped

Practical implication: The majority of your study time should involve encountering natural language in context — books, podcasts, TV shows — not drilling grammar rules. But speaking practice matters too, especially for activating passive vocabulary.

2. Spaced Repetition: The Most Evidence-Backed Learning Technique

The spacing effect — first documented by Hermann Ebbinghaus in 1885 — is one of the most replicated findings in cognitive psychology. Distributing practice over time produces dramatically better long-term retention than massing the same amount of practice in a single session (cramming).

Spaced repetition systems (SRS) formalise this by scheduling reviews at expanding intervals based on your performance:

Initial learning: review after 1 day
Correct recall: push to 3 days, then 7 days, then 21 days, etc.
Incorrect recall: reset the interval

Effect sizes from meta-analyses are striking: spaced practice produces 1.5–2x better long-term retention versus massed practice for the same total study time.

Practical implication: Use an SRS (Anki, the algorithm built into language learning apps) for vocabulary. The discipline of daily short sessions beats weekend marathons.

3. Output Hypothesis: Speaking Accelerates Acquisition

Merrill Swain's Output Hypothesis (1985) challenged Krashen's input-only model. Swain observed that French immersion students in Canada had excellent comprehension but poor speaking accuracy after years of input-rich schooling. Her argument: speaking forces you to process language at a level of precision that listening doesn't require.

When you produce output, you notice gaps in your competence (you reach for a word and discover you don't know it), you test hypotheses about grammar, and you receive corrective feedback. These noticing events appear to drive acquisition.

Modern SLA research broadly supports a dual role: input for acquiring new forms, output for consolidating them.

Practical implication: If you study for 30 minutes a day, at least 10 of those minutes should involve speaking or writing — not just passive exposure.

4. Critical Period Hypothesis: Adults Can Learn But It's Harder

The Critical Period Hypothesis (Lenneberg, 1967) proposes that language acquisition is biologically constrained — there's a developmental window (roughly through puberty) during which native-like acquisition is achievable. After the critical period closes, adult learners face a steeper path.

What the evidence actually shows is more nuanced:

Phonology: Accent acquisition is genuinely harder after puberty. Neural plasticity in the auditory-motor integration system decreases. Most adults who start after their teens retain a detectable foreign accent — not all, but most.
Morphosyntax: Adults are slower to acquire complex grammatical features but are better at learning vocabulary and at deploying explicit rule knowledge.
Ultimate attainment: Adults can and do achieve very high proficiency. The claim that adults "can't become fluent" is false. The claim that it's harder and takes longer is true.

A 2018 study (Hartshorne et al.) analysed 670,000 online grammar test takers and found the optimal period for achieving native-like grammar ends at around age 17–18, with a softer decline continuing into the mid-twenties. This is a population-level trend, not a ceiling on any individual.

Practical implication: If you're an adult learner, don't accept the defeatist framing. Do invest extra time in pronunciation practice early — it becomes progressively harder to change phonological habits.

5. Interactional Feedback: Error Correction That Works

Not all error correction is equal. Research distinguishes several types:

Recasts: Repeating the learner's utterance with the error corrected (e.g., learner says "He go to school," teacher responds "Yes, he goes to school every day."). Natural, low-threat, but learners often don't notice the correction.
Explicit correction: Directly flagging the error ("You should say 'goes,' not 'go'"). More noticing, more disruptive to fluency.
Clarification requests: Pretending you didn't understand ("Sorry?"). Forces the learner to self-repair.
Metalinguistic feedback: Describing the rule without providing the form ("Remember the third-person singular present tense rule").

Meta-analyses (e.g., Li, 2010) find that recasts work best for phonological errors, explicit correction works best for morphosyntactic errors, and clarification requests are most effective for pragmatic errors. The context matters.

Practical implication: When using AI conversation tools, ask for recast-style correction in free conversation mode and explicit correction when drilling specific grammar points.

6. Motivation: Intrinsic Beats Extrinsic

Self-Determination Theory (Deci and Ryan) distinguishes:

Intrinsic motivation: Engaging with the language because it's genuinely interesting or enjoyable
Extrinsic motivation: Studying to pass a test, get a job, or keep a streak

Both drive behavior in the short term. Only intrinsic motivation sustains behavior long term. Learners who study primarily to maintain a streak or earn badges show dramatically higher dropout rates once external rewards are removed.

The most successful long-term language learners tend to share one characteristic: they find genuine enjoyment in the content of the target language — its music, films, literature, or the relationships it opens. The language becomes a vehicle for something they already care about.

Practical implication: Find content in your target language that you'd want to consume even if you already spoke it fluently. Pair your SRS sessions with content you actually enjoy.

7. The Comprehensible Input Threshold: ~95% Rule

Vocabulary research (Nation, 2001) established that readers need to know approximately 95% of the words in a text to read it with adequate comprehension and without heavy dictionary use. For audio, the threshold is slightly lower (~90%) because prosody and context fill in more gaps.

This has a direct implication for content selection: material that's at 80% comprehension is frustrating, not productive. The sweet spot is challenging but accessible.

For listening, this corresponds to roughly the i+1 level Krashen described — you catch most of what's said and use context to infer the rest. Netflix series aimed at teenagers or young adults, podcasts from language teaching networks (Dreaming Spanish, Coffee Break Languages), and graded readers are engineered to sit near this threshold.

Building a Study System From the Science

Synthesising the above:

Time allocation	Activity	Why
30–40%	Comprehensible input (reading/listening at 95% comprehension)	Core acquisition mechanism
20–25%	Spaced repetition vocabulary review	Highest ROI per minute for retention
20–25%	Speaking/writing output	Consolidates forms, reveals gaps
10–15%	Pronunciation practice (especially early)	Most time-sensitive skill
5–10%	Grammar study (targeted, not exhaustive)	Fills specific gaps, not a primary driver

Consistency matters more than any single session. Thirty minutes daily beats four hours on weekends, independent of method. This isn't motivational — it's what the spaced repetition and consolidation research directly predicts.

I also run Agnes AI — AI-powered services for businesses including security scans, content packs, translations, and custom AI solutions.

How I Used AI Agents to Automate My Marketing (With Code)

Ahmed Mahmoud — Wed, 25 Mar 2026 10:12:41 +0000

How I Used AI Agents to Automate My Marketing (With Code)

Running a solo app means wearing every hat: product, engineering, support, and marketing. Marketing is the one that most developers neglect, because it doesn't feel like "real work" and the results are invisible until they aren't.

I spent about three weeks building a multi-platform marketing automation system using AI agents, Python daemons, and the APIs for every major social platform. Here's the technical architecture and what actually worked.

The Problem: Marketing Is a Repeating Task Queue

Social media marketing for an indie app is roughly this loop:

Produce content (posts, threads, articles)
Post content on a schedule
Engage with responses (replies, DMs, comments)
Analyse what worked
Repeat

Every part of this is automatable at some level. The creative/strategy layer still needs a human, but execution is a perfect target for automation.

Architecture: Daemons + Cron + AI Generation

The system I built has four layers:

┌────────────────────────────────────────────┐
│  Content Generation (AI, CrewAI pipeline)  │
└────────────────────────────────────────────┘
                        │
┌────────────────────────────────────────────┐
│  Content Queue (JSON state files + lock)   │
└────────────────────────────────────────────┘
                        │
┌────────────────────────────────────────────┐
│  Platform Posters (Python daemons, one per │
│  platform: Twitter, Threads, Facebook,      │
│  Dev.to)                                   │
└────────────────────────────────────────────┘
                        │
┌────────────────────────────────────────────┐
│  Scheduler (macOS LaunchAgents / cron)     │
└────────────────────────────────────────────┘

Each platform poster is an independent Python script with its own state file. They share no runtime state — if one fails, the others continue. The state files are JSON, stored in ~/.susan-*.state.json per platform.

The State File Pattern

Every poster follows the same state file schema:

# Default state structure
DEFAULT_STATE = {
    "posted_indices": [],     # indices of already-posted items
    "last_post_ts": 0,        # Unix timestamp of last post
    "post_count": 0,          # total posts made
}

def load_state(state_file: str) -> dict:
    if not os.path.exists(state_file):
        return dict(DEFAULT_STATE)
    try:
        with open(state_file) as f:
            data = json.load(f)
        for key, val in DEFAULT_STATE.items():
            data.setdefault(key, val)
        return data
    except (json.JSONDecodeError, OSError):
        return dict(DEFAULT_STATE)

def save_state(state: dict, state_file: str) -> None:
    """Atomic write via temp file + rename."""
    tmp = state_file + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, state_file)  # atomic on POSIX

The atomic write pattern (write tmp → rename) prevents state corruption if the process is interrupted mid-write.

Twitter Thread Poster

Twitter threads (8–12 tweets chained together) consistently outperform single tweets for educational content. I built a rotation system for 8 pre-written threads, each covering a different topic. A cooldown of 30 days prevents repeating a thread too soon.

THREADS = [
    {
        "id": "spaced_repetition",
        "title": "How spaced repetition works",
        "cooldown_days": 30,
        "tweets": [
            "A thread on why spaced repetition is the most evidence-backed study technique...",
            "The core idea: your brain forgets in a predictable curve (Ebbinghaus, 1885)...",
            # ...
        ]
    },
    # ...
]

def cmd_next():
    state = load_state(STATE_FILE)
    now = time.time()

    available = [
        t for t in THREADS
        if now - state.get("last_posted", {}).get(t["id"], 0)
           > t["cooldown_days"] * 86400
    ]

    if not available:
        return {"success": False, "error": "No threads available (all in cooldown)"}

    thread = available[0]
    result = post_thread(thread["tweets"])

    if result["success"]:
        state.setdefault("last_posted", {})[thread["id"]] = now
        save_state(state, STATE_FILE)

    return result

CrewAI for Weekly Content Generation

The manual content is a fixed rotation (fine for Twitter threads and tips, not ideal for generating fresh ideas). For weekly content generation, I built a CrewAI pipeline with four agents:

from crewai import Agent, Task, Crew

strategist = Agent(
    role="Content Strategist",
    goal="Identify the highest-value content topics for this week",
    backstory="Expert in language learning content marketing...",
    llm="anthropic/claude-sonnet-4-5-20250929",
)

writer = Agent(
    role="Technical Writer",
    goal="Write genuinely valuable technical content",
    backstory="Developer and language learning enthusiast...",
    llm="anthropic/claude-sonnet-4-5-20250929",
)

editor = Agent(
    role="Editor",
    goal="Polish content for platform-specific best practices",
    backstory="Senior editor with deep knowledge of dev community...",
    llm="anthropic/claude-sonnet-4-5-20250929",
)

distribution = Agent(
    role="Distribution Manager",
    goal="Format content for each target platform",
    backstory="Social media specialist who knows platform nuances...",
    llm="anthropic/claude-sonnet-4-5-20250929",
)

crew = Crew(
    agents=[strategist, writer, editor, distribution],
    tasks=[strategy_task, writing_task, editing_task, distribution_task],
    verbose=True,
)

result = crew.kickoff(inputs={
    "week": datetime.now().strftime("%Y-W%W"),
    "app_focus": "Pocket Linguist language learning app",
    "platforms": ["twitter", "devto", "linkedin"],
})

The pipeline runs weekly (Monday 8 AM via LaunchAgent) and produces a content bundle that the platform-specific posters can pull from.

LaunchAgent Scheduling

On macOS, LaunchAgents are the cron replacement for user-level scheduled tasks:

<!-- ~/Library/LaunchAgents/com.pocketlinguist.twitter.threads.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.pocketlinguist.twitter.threads</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/python3</string>
    <string>/path/to/twitter_thread_poster.py</string>
    <string>next</string>
  </array>
  <key>StartCalendarInterval</key>
  <array>
    <dict>
      <key>Weekday</key><integer>2</integer>  <!-- Tuesday -->
      <key>Hour</key><integer>9</integer>
      <key>Minute</key><integer>0</integer>
    </dict>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>TWITTER_ACCESS_TOKEN</key>
    <string>YOUR_TOKEN</string>
  </dict>
  <key>StandardOutPath</key>
  <string>/Users/you/logs/twitter-threads.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/you/logs/twitter-threads-error.log</string>
</dict>
</plist>

Load it with launchctl load ~/Library/LaunchAgents/com.pocketlinguist.twitter.threads.plist. Unload to pause.

Engagement Bot: Automated Replies

The engagement bot is the part that requires the most care. Automated replies that look spammy get accounts flagged. The rules I follow:

Only reply to posts that match a keyword from a curated list (language learning topics)
Use 36 different reply templates, selected semi-randomly to avoid pattern detection
Rate limit to 20 replies per hour, with random delays between actions (jitter)
Never DM unsolicited
Log every action for review

REPLY_TEMPLATES = [
    "That's a great point about {keyword}. In my experience...",
    "This is exactly what motivated me to build {app}...",
    # 34 more variants
]

def engage(tweet_id: str, keyword: str) -> dict:
    template = random.choice(REPLY_TEMPLATES)
    text = template.format(keyword=keyword, app="Pocket Linguist")

    # Jitter: random delay 30–120 seconds
    delay = random.uniform(30, 120)
    time.sleep(delay)

    return post_reply(tweet_id, text)

What Actually Moved the Needle

Honest assessment after three months:

Twitter threads: Measurable engagement increase. Educational threads on language learning consistently reached 2–5x the impressions of single posts.
Dev.to articles: Slow build but compounding. Articles rank in search and bring organic traffic weeks after publication.
Threads (Meta): Highest organic reach of any platform, but the Threads API has reliability issues.
Engagement bot: Modest follower growth. The quality of followers from bot engagement is lower than organic.
CrewAI content generation: Saves 2–3 hours per week. Quality requires human review but the first draft is usually solid.

The biggest lesson: automation is best at distribution (consistent posting, scheduling), not at community building. The posts that drove real downloads were ones where I personally engaged with a large account's audience. Automation can prepare the content; the human moments convert.

Text-to-Speech in 2026: Comparing 5 TTS APIs for Language Apps

Ahmed Mahmoud — Wed, 18 Mar 2026 10:11:07 +0000

Text-to-Speech in 2026: Comparing 5 TTS APIs for Language Apps

For a language learning app, text-to-speech isn't a nice-to-have — it's how learners hear correct pronunciation. The quality gap between TTS systems is enormous, and the right choice depends on your target language set, budget, and latency requirements.

Here's a direct comparison of five TTS systems evaluated on criteria that matter specifically for language education.

Evaluation Criteria

For a language app, I care about:

Naturalness — Does it sound like a real person? Unnatural rhythm or intonation actively teaches bad pronunciation habits.
Prosodic accuracy — Does the stress pattern match native speaker norms? This is different from naturalness — a voice can sound smooth but stress the wrong syllables.
Language coverage — How many languages are supported at a usable quality level?
Phonetic control — Can you force specific pronunciations via SSML or IPA input?
Latency — First byte of audio to streaming playback start.
Cost — Per-character or per-second pricing at scale.
Offline capability — Can it run on-device without a network call?

The Five Systems

1. ElevenLabs

ElevenLabs produces the most natural-sounding voices of any current commercial API. The prosodic accuracy is exceptional — sentence-level intonation, emotional emphasis, and rhythm match native speaker norms better than any competitor.

Strengths:

Best overall naturalness for supported languages
Voice cloning for custom voices
Good SSML support

Weaknesses:

Language support gaps — excellent for English, Spanish, French, German; mediocre for CJK; limited for Arabic, Turkish, Polish
Highest latency (~400–800ms to first audio byte)
Most expensive at scale ($0.30/1000 characters on the standard plan)

Verdict for language apps: Excellent for European languages, not viable for apps targeting CJK or less-common languages.

2. Google Cloud Text-to-Speech (WaveNet / Studio)

Google's Neural2 and Studio voices cover the broadest language range of any commercial API: 40+ languages with multiple voice options per language. Quality is consistently good, if not exceptional.

Strengths:

Best language coverage by far
WaveNet voices are natural-sounding for most use cases
Reliable SSML support including <phoneme> tags for IPA-based pronunciation forcing
Predictable latency (~150–300ms)

Weaknesses:

Studio voices are significantly better than WaveNet but more expensive
Prosodic accuracy is lower than ElevenLabs for languages both cover
Neural2 voices (the mid-tier) are a clear step down in quality

Verdict for language apps: The default choice for apps covering many languages, especially Asian and less-common languages.

3. OpenAI TTS (tts-1, tts-1-hd)

OpenAI's TTS models (tts-1 for speed, tts-1-hd for quality) are optimised for English with secondary capability in common European languages. They're simple to use (no SSML needed for basic use cases) and the tts-1 model has excellent latency.

Strengths:

Fastest first-byte latency of commercial APIs (~80–150ms for tts-1)
Competitive quality for English
Simple API — single endpoint, no voice configuration required for defaults
Solid streaming support

Weaknesses:

Limited language support outside English and common European languages
No SSML support — you can't force specific pronunciations
No phoneme-level control
Voice variety is limited (6 built-in voices as of early 2026)

Verdict for language apps: Best for English-only or English-primary apps where latency matters. Not viable for broad language coverage.

4. Microsoft Azure Cognitive Services TTS

Azure's Neural TTS system has improved substantially since the Neural Voice v3 update. It covers 140+ languages and locales — the broadest official coverage of any provider. Quality is solid and consistent.

Strengths:

Widest official language + locale coverage (140+)
Strong SSML support including <phoneme> with IPA and X-SAMPA
Viseme output (mouth shape data for lip-sync animations)
Competitive pricing ($16/1M characters for neural voices)
On-device SDK available (limited voice set)

Weaknesses:

Quality varies significantly across languages — flagship English and Mandarin voices are excellent, but less-common language voices are noticeably robotic
API complexity is higher than Google or OpenAI
Latency is slightly higher than Google (~200–400ms)

Verdict for language apps: Best choice for apps that need obscure language support (e.g., Welsh, Swahili, Catalan). Also excellent if you need lip-sync data.

5. Kokoro (Open Source / Self-Hosted)

Kokoro is a lightweight open-source TTS model that ranks competitively with commercial APIs for English. It's model-weight-only (Apache 2.0 license), runs on CPU, and can be self-hosted or deployed to serverless infrastructure.

Strengths:

Free at any scale (host it yourself)
High quality for English — competitive with tts-1-hd at no cost
Fast on modern hardware (~100ms on M2 chip)
Voice control via style embeddings
OpenAI-compatible API format — drop-in replacement for many integrations

Weaknesses:

English-primary: Spanish and French work reasonably, most other languages don't
Self-hosting adds operational overhead
No official support or SLA
Language coverage grows with community contributions, but slowly

Verdict for language apps: Outstanding for English-heavy apps willing to self-host. Best cost profile by far for high-volume English TTS.

Head-to-Head Comparison

Criteria	ElevenLabs	Google TTS	OpenAI TTS	Azure TTS	Kokoro
English quality	Excellent	Very Good	Very Good	Very Good	Excellent
CJK quality	Poor	Very Good	Poor	Good	Poor
Language count	~30	40+	~30	140+	3–5
First-byte latency	400–800ms	150–300ms	80–150ms	200–400ms	50–150ms
SSML/Phoneme control	Limited	Full	None	Full	None
Price per 1M chars	$300	$16–160	$15–30	$16	Free
Offline/On-device	No	No	No	Limited	Yes

Architecture Recommendation for Language Apps

For a language learning app supporting 20+ languages:

Primary: Google Cloud TTS (Neural2)
  - Use for: all language coverage
  - SSML for pronunciation drilling

Secondary: Kokoro (self-hosted)
  - Use for: English content at high volume
  - Reduces Google TTS cost significantly

Fallback: Azure TTS
  - Use for: obscure languages not covered well by Google
  - Use for: lip-sync features if needed

This hybrid approach uses Kokoro for English (where it's competitive and free), Google for broad language coverage, and Azure as a fallback for edge cases. At 10 million characters/month, this reduces TTS API costs by approximately 70% compared to using Google for everything.

SSML for Pronunciation Drilling

For a language app specifically, phoneme-level control is critical for drilling correct pronunciation. Both Google and Azure support the <phoneme> SSML tag with IPA input:

<speak>
  In Spanish, 'll' is pronounced like 'y':
  <phoneme alphabet="ipa" ph="kaˈβaʎo">caballo</phoneme>
  means horse.
</speak>

This lets you demonstrate exactly how a word is pronounced, overriding the model's default interpretation for cases where it differs from standard pronunciation. OpenAI TTS has no equivalent — you're entirely at the mercy of the model's training data.

For a language learning app where pronunciation accuracy is the product, SSML phoneme support is a non-negotiable feature.

How Camera Translation Actually Works (And Why It's Hard)

Ahmed Mahmoud — Wed, 18 Mar 2026 10:00:23 +0000

How Camera Translation Actually Works (And Why It's Hard)

Point your phone at a sign in a foreign language, and text floats back in your native tongue. It looks like magic. It's actually a five-stage engineering pipeline with a failure mode at every step.

This is a technical walkthrough of how camera translation works and where real-world implementations break down.

The Pipeline: Five Stages

Camera frame
    │
    ▼
1. Text Detection (find where text exists in the image)
    │
    ▼
2. Text Recognition / OCR (read the characters)
    │
    ▼
3. Language Detection (what language is this?)
    │
    ▼
4. Translation (convert to target language)
    │
    ▼
5. Augmented Reality Overlay (render translated text back on image)

Each stage has distinct technical challenges. Let's go through them.

Stage 1: Text Detection

Before you can read text, you have to find it. Text detection is a segmentation problem: given an image, produce bounding boxes (or polygons) around regions that contain text.

Modern approaches use deep learning — specifically, variants of the CRAFT (Character Region Awareness for Text Detection) architecture, or the newer DBNet (Differentiable Binarization Network). These produce probability maps over the image that highlight character regions, then apply post-processing to extract polygons.

The hard cases:

Curved text (logos, signs with stylised lettering): Rectangular bounding boxes fail here. You need polygon output.
Text on complex backgrounds: A menu with watermark patterns, or graffiti on a textured wall.
Very small text: Sub-20px text is essentially lost to downsampling.
Overlapping text: Subtitles on videos, ads with layered typography.
Handwriting: A completely different detection regime — the character spacing and stroke characteristics differ enough that handwriting-trained models often fail on printed text and vice versa.

For a mobile app, you also face a hard constraint: the model must run at 10–15 frames per second on a CPU-only inference stack (battery and thermal limits make continuous GPU inference on mobile impractical). CRAFT at full resolution is too slow. The production solution is a two-pass system: run a fast, lightweight detector at 15fps to track text regions, and a higher-accuracy detector only when the user taps or holds steady.

Stage 2: OCR — Reading the Characters

Once you have a text region, you need to convert it to a string. This is Optical Character Recognition.

The dominant architecture for scene text OCR is the CRNN (Convolutional Recurrent Neural Network): a CNN backbone extracts visual features, a BiLSTM captures sequence context, and a CTC (Connectionist Temporal Classification) decoder produces the character sequence.

More recently, transformer-based approaches like TrOCR (Microsoft) show better accuracy on degraded or unusual fonts but are significantly larger and slower.

Language-specific challenges:

Latin scripts: Relatively well-solved. CRNN achieves >98% character accuracy on clean printed text.
CJK (Chinese/Japanese/Korean): 5,000–50,000 possible output classes instead of ~100. Model size and latency scale accordingly. Stroke-based methods help.
Arabic/Hebrew: Right-to-left scripts with connected characters. Sequence models handle directionality poorly without explicit RTL encoding.
Devanagari (Hindi): Ligatures and matras (vowel diacritics) require character grouping before decoding.

A common mobile architecture uses on-device ML (Core ML for iOS, ML Kit for Android) to run OCR. Google's ML Kit Text Recognition API handles Latin, Chinese, Japanese, Korean, and Devanagari on-device with reasonable accuracy. For less common scripts, you typically fall back to a server-side API.

Stage 3: Language Detection

You have a string of characters. Now you need to know what language it is so you can route it to the right translation model.

For alphabetic scripts, the character set alone gives you a strong prior:

Arabic characters → Arabic, Urdu, Persian, Pashto
Cyrillic → Russian, Ukrainian, Bulgarian, Serbian, Mongolian
Hangul → Korean exclusively
Kana (ひ, カ) → Japanese

But within a script family, language detection is a genuine classification problem. Spanish, French, Italian, and Portuguese all use the same Latin character set. Distinguishing them requires word-level or n-gram models.

FastText's language identification model (176 languages, 917KB compressed) is the production standard for most apps. It achieves >99% accuracy on clean text of 10+ words. The failure modes are:

Very short strings (1–3 words): Classification confidence collapses
Code-switching: A sign that mixes English brand names with Japanese script
Transliterated text: Romanized Japanese (romaji) looks like garbage Latin to a language detector

For camera translation, the combination of character set detection + FastText with a minimum confidence threshold (typically 0.6–0.7) handles most cases. Below the threshold, you show the user a language selector.

Stage 4: Translation

This is the stage most people think of first, and it's the most computationally expensive.

Neural machine translation (NMT) based on the Transformer architecture is the current standard. The major options for mobile apps:

Approach	Latency	Accuracy	Cost
Cloud API (Google Translate, DeepL)	200-600ms	Excellent	Per-character billing
On-device model (OPUS-MT, M2M-100)	50-200ms	Good	Free after download
Hybrid (on-device first, cloud fallback)	Variable	Excellent	Low

For a language learning app, translation quality matters more than for a pure utility tool — you're teaching the user, so mistranslations have pedagogical consequences. DeepL consistently outperforms Google Translate on European language pairs. For Asian languages, Google has the better coverage.

On-device translation using OPUS-MT (Helsinki-NLP) is compelling for offline support and privacy, but the models are 70–300MB each and accuracy lags cloud models by a noticeable margin on complex sentences.

The hybrid approach — attempt on-device, fall back to cloud for low-confidence outputs — balances cost and quality well.

Stage 5: AR Overlay

Rendering translated text back over the original image sounds like a solved problem. It isn't.

Challenges:

Font matching: The translated text needs to match the visual style of the original. A neon sign in a Gothic font shouldn't be replaced by Arial. Apps typically use a heuristic: detect font weight (bold/regular) and approximate size from the bounding box, then use a matching system font.

Text expansion/contraction: German words are often 30–50% longer than their English equivalents. Japanese translations of English signs are often shorter. The overlay must reflow or scale text to fit the original bounding box without overflowing into other elements.

Background reconstruction: To overlay translated text, you need to erase the original text first. This requires inpainting — filling the erased region with a plausible background. State-of-the-art inpainting (LaMa, SDXL inpainting) works well on simple backgrounds but struggles with complex textures. Most production apps use a simpler approach: render translated text on a semi-transparent box that occludes the original.

Frame consistency: In live camera mode (as opposed to single-image mode), you need detections and translations to be stable across frames. Bounding boxes that jitter per-frame are extremely distracting. A Kalman filter or simple exponential smoothing on bounding box coordinates reduces jitter significantly.

Putting It Together: The Real Performance Budget

On an iPhone 14 with on-device OCR (ML Kit) and cloud translation (Google):

Stage	Latency
Text detection	40–80ms
OCR	30–60ms
Language detection	<5ms
Translation (cloud)	200–500ms
AR overlay render	10–20ms
Total	280–665ms

The translation API call dominates. Caching translations (same text → same result, keyed by source text + language pair) with a 24-hour TTL eliminates the round trip for repeated text — useful for signs you pass daily.

Where Current Systems Still Struggle

Even the best camera translation apps fail reliably on:

Highly stylised fonts — decorative logos, calligraphy, graffiti
Very long documents — a full page of A4 text captured with a camera
Low-contrast text — light grey text on white background
Idiomatic expressions — machine translation handles idioms poorly
Context-dependent ambiguity — "銀行" in Japanese means "bank" (financial institution), but the translation model doesn't know if you're at a riverbank or a savings bank

The pipeline I've described reflects roughly where production systems stood as of late 2024. Vision-language models (GPT-4o, Gemini 1.5 Pro, Claude) can now handle end-to-end image-to-translation in a single call with impressive accuracy on the failure cases above — but at higher latency and cost. The pipeline approach still wins on speed; the single-model approach wins on robustness. Most production apps will converge on hybrid architectures that use vision-language models as a high-accuracy fallback.

I also run Agnes AI — AI-powered services for businesses including security scans, content packs, translations, and custom AI solutions.

I Built an AI Language Tutor — Here's What I Learned About NLP

Ahmed Mahmoud — Wed, 11 Mar 2026 10:00:11 +0000

I Built an AI Language Tutor — Here's What I Learned About NLP

Building a conversational language tutor sounds straightforward until you actually do it. You imagine a sleek interface, a model that listens and responds, and users happily chatting their way to fluency. What you get instead is a humbling education in the gap between demo and production NLP.

Here's an honest technical breakdown of what I built, what broke, and what I'd do differently.

The Core Problem: Language Learning Needs More Than a Chatbot

A raw large language model is fluent. That's the problem. You're trying to teach someone Italian, and your AI responds with flawless, complex sentences that immediately overwhelm an A2 learner. The first engineering challenge isn't getting the model to speak — it's getting it to speak badly on purpose.

Vocabulary Grading

CEFR (Common European Framework of Reference) defines language proficiency in six levels: A1, A2, B1, B2, C1, C2. Each level has a corresponding vocabulary band. A1 covers roughly 500–700 words; C2 expands to 16,000+.

To grade output, I built a vocabulary filter that:

Tokenises the model's response using a language-specific tokenizer (spaCy for European languages, MeCab for Japanese).
Lemmatises each token to its base form.
Checks each lemma against a CEFR word list (freely available from EVP — English Vocabulary Profile for English, ELP for other languages).
Flags any word above the user's target CEFR band.
Rewrites the prompt to the model, instructing it to replace flagged vocabulary with simpler alternatives.

This two-pass approach — generate then simplify — adds latency (roughly 300–600 ms on GPT-4o-mini) but produces dramatically more appropriate output.

def grade_vocabulary(text: str, target_level: str, lang: str) -> dict:
    doc = nlp_models[lang](text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    above_level = [
        w for w in lemmas
        if cefr_level(w, lang) > target_level
    ]
    return {
        "flagged": above_level,
        "needs_rewrite": len(above_level) > 0,
    }

Intent Classification

A language tutor needs to handle multiple conversation modes:

Free conversation — user just chats, AI responds naturally
Correction mode — AI corrects grammar errors
Vocabulary drill — spaced repetition flashcard loop
Pronunciation practice — AI evaluates user speech (more on this below)
Translation check — user submits a translation, AI grades it

I initially tried to detect intent from the user's message alone. This worked about 70% of the time and failed spectacularly the other 30%. A user saying "how do you say 'dog'?" looks like a translation question, but in context might be a free conversation turn where they forgot a word.

The fix was maintaining a session state machine — a small enum that tracks which mode the session is currently in, and only transitions based on explicit user signals (tapping a mode button) or unambiguous intent patterns (a message that's 90%+ a known vocabulary query pattern).

from enum import Enum

class SessionMode(Enum):
    FREE_CONVERSATION = "free"
    CORRECTION = "correction"
    VOCAB_DRILL = "vocab_drill"
    TRANSLATION = "translation"
    PRONUNCIATION = "pronunciation"

State transitions are logged per-session and stored with the conversation history, which lets the model use few-shot context to stay coherent across mode switches.

The Correction Problem: How Do You Correct Without Killing Motivation?

This is where NLP meets pedagogy. Immediate, constant correction is psychologically harmful to language learners — it creates anxiety and suppresses output. But zero correction means fossilisation (permanent bad habits).

Research from Krashen's input hypothesis and subsequent work suggests delayed, selective correction is most effective. Specifically:

Correct only errors that impede comprehension, not stylistic differences
Use recasts (repeating the correct form naturally) rather than explicit metalinguistic feedback
Correct no more than 2–3 errors per conversational turn

I implemented this with a two-model pipeline:

Error detection model: A fine-tuned classifier that labels errors by type (morphological, syntactic, lexical, pragmatic) and severity (comprehension-blocking vs. minor).
Correction strategy model: Given the detected errors and the learner's level, decides which to correct and how.

For the error detection step, I initially tried prompting GPT-4 with a structured output schema. It worked but was expensive at scale. I switched to a smaller fine-tuned model (DistilBERT fine-tuned on the NUCLE corpus for English, with similar datasets for Spanish and French) that runs locally and costs nothing per inference.

def select_corrections(errors: list[dict], level: str) -> list[dict]:
    # Only return comprehension-blocking errors for A1/A2
    if level in ("A1", "A2"):
        return [e for e in errors if e["severity"] == "blocking"][:2]
    # For B1+, include common morphological errors too
    return [e for e in errors if e["severity"] in ("blocking", "morphological")][:3]

Handling 20+ Languages: The Tokenisation Nightmare

Supporting multiple languages isn't just a UI translation problem. Every language has fundamentally different tokenisation requirements:

Language	Tokenisation challenge
English	Relatively easy — whitespace + punctuation
German	Compound words (Donaudampfschifffahrtsgesellschaft) need decompounding
Japanese	No word boundaries — requires morphological analysis (MeCab, SudachiPy)
Arabic	Right-to-left, root-based morphology, heavy inflection
Chinese	Word segmentation (jieba, pkuseg) required
Turkish	Agglutinative — one word can express a full English sentence

I ended up with a language-router pattern:

TOKENISERS = {
    "en": lambda text: en_nlp(text),
    "de": lambda text: de_nlp(text),      # spaCy de_core_news_sm
    "ja": lambda text: mecab_tokenise(text),
    "zh": lambda text: jieba_tokenise(text),
    "ar": lambda text: cameltools_tokenise(text),
}

def tokenise(text: str, lang: str) -> list[str]:
    tokeniser = TOKENISERS.get(lang, default_tokeniser)
    return tokeniser(text)

This adds a dependency per language, but there's no general solution. Trying to use a single tokeniser across language families will produce garbage results for CJK and Arabic.

Latency: The Real UX Killer

In a conversational app, users tolerate roughly 800–1200 ms of latency before it feels broken. My initial pipeline — tokenise, check vocabulary, call LLM, validate response — was running at 2.4s average. That's a broken app.

The optimisations that actually moved the needle:

Stream the LLM response: Use server-sent events to start rendering the AI response before it's complete. Perceived latency drops by 60%+ even with identical total generation time.
Cache vocabulary grade results: Vocabulary checks on the input (not the output) can be cached with a short TTL. Most users repeat similar vocabulary within a session.
Run CEFR grading async on a separate thread: Don't block the main response path. If the grade check returns before the response is done, you can still intercept; if not, let it through and grade the next turn.
Move error detection to a smaller local model: 8ms on DistilBERT vs 400ms on GPT-4. Not suitable for all tasks but fine for binary error flagging.

After these changes, p50 latency dropped to 680ms and p95 to 1.1s — comfortably within the threshold.

What I'd Do Differently

Start with a state machine from day one. I retrofitted it after the intent classification failures. Every conversational app needs one.
Invest in evaluation datasets early. Without labeled examples of good vs. bad corrections for each language level, you're flying blind. NUCLE, BEA-2019, and Lang-8 are good starting points.
Separate the LLM call from the grading logic. Mixing them makes both harder to test. A clean pipeline — generate → validate → rewrite if needed — is worth the extra roundtrip.
Don't underestimate language-specific engineering costs. Adding Japanese support took 3x longer than adding Spanish. Budget accordingly.

Building a language tutor is one of the most rewarding NLP projects you can take on — every part of the stack from tokenisation to pedagogy shows up in the product. The challenge is exactly what makes it worth doing.

I also run Agnes AI — AI-powered services for businesses including security scans, content packs, translations, and custom AI solutions.

Why Duolingo's Gamification Works (And When It Doesn't)

Ahmed Mahmoud — Wed, 04 Mar 2026 11:00:01 +0000

Why Duolingo's Gamification Works (And When It Doesn't)

Duolingo has 500 million registered users and a market cap that peaked at over $10 billion. It's also frequently described by linguists as a tool that teaches you how to use Duolingo, not how to speak a language. Both things can be true simultaneously, and understanding why explains a lot about the limits and possibilities of gamification in education.

The Mechanics That Actually Work

Duolingo's core gamification stack is not novel — it's a careful assembly of well-validated psychological mechanisms.

Streak Mechanics and Loss Aversion

The daily streak counter is Duolingo's most powerful retention tool, and it works through loss aversion rather than positive motivation. Kahneman and Tversky established that the psychological impact of losing something is roughly twice as powerful as gaining something of equivalent value. A 90-day streak feels like an asset worth protecting. Breaking it triggers real aversion.

This is psychologically effective at driving daily logins. Its relationship to learning outcomes is more complicated — users are motivated to maintain streaks at the cost of careful engagement. Speed-running an easy lesson to protect a streak activates the retention mechanism without the learning.

The "streak freeze" item (which preserves your streak if you miss a day) is a masterclass in understanding your own mechanic. It removes the catastrophic failure state that causes abandonment, without eliminating the daily pressure.

XP and Leaderboards: Social Comparison

The weekly XP leaderboard exploits social comparison theory (Festinger, 1954) — humans calibrate their performance by comparing to relevant others. Being near the top of a leaderboard triggers effort; being at the bottom triggers either catch-up effort or disengagement.

Duolingo mitigates the disengagement risk by segmenting leaderboards by engagement level. You're not competing with someone who does 400 XP/day when you do 20. The algorithm places you against similarly-active users, keeping the competition close enough to motivate without being hopeless.

The failure mode: XP is a measure of lessons completed, not learning quality. The leaderboard optimises for quantity, which drives the exact speed-running behaviour that reduces learning effectiveness.

Hearts and Variable Reward

The "heart" system (limited mistakes allowed per lesson) combines two psychological mechanisms: punishment for failure and variable reward. The variable reward aspect is subtle — on some questions you might lose a heart, on others you won't, and the uncertainty of which category each question falls into is mildly activating in the same way a slot machine is.

The heart system also creates urgency. Finite resources under threat produce engagement. This is why limited-time offers work in commerce and why the heart system drives more careful engagement than an unlimited-try system.

Where Gamification Breaks Down

It Optimises Engagement, Not Learning

This is the central tension of gamification in education. The metrics that drive engagement (daily active users, session length, streak continuation) are partially aligned with learning outcomes but diverge significantly at the edges.

A user who completes 5 easy lessons per day to maintain their streak is engaging. They are probably not learning at the rate a user doing 2 challenging lessons would be. Duolingo's lesson difficulty algorithm has historically not been aggressive enough at pushing users into genuinely challenging material — because harder material produces more errors, more heart loss, more frustration, and lower engagement metrics.

This is a classic proxy metric problem: you measure what's measurable (engagement), and optimise for it, without measuring what you actually care about (language proficiency gains). The two are correlated but not identical, and optimising for the proxy will always push you toward the divergent cases.

Intrinsic vs. Extrinsic Motivation Crowding Out

Self-Determination Theory predicts that external rewards (XP, badges, streaks) can crowd out intrinsic motivation over time. A user who starts learning Spanish because they're genuinely excited about the language, and then gets enrolled in the XP/streak system, may end up studying because of the streak — and once the streak breaks, there's nothing left.

Research on this (the "overjustification effect") is contested in the language learning context, but there's consistent evidence that learners who depend primarily on gamification for motivation show higher abandonment rates than learners motivated by genuine interest in the language or culture.

The Plateau Problem

Gamification drives engagement most powerfully in early-stage users. The combination of rapid progress (fast XP gain), novelty (new mechanics being revealed), and low difficulty creates a compelling feedback loop.

At intermediate levels (B1+), progress becomes slower and less visible, the gamification mechanics feel more like obligations than rewards, and the genuine difficulty of reaching conversational fluency becomes apparent. This is where Duolingo's retention falls off sharply — users reach a level where the app can no longer hide that becoming fluent requires more than tapping the correct word from a multiple-choice list.

The problem isn't gamification per se; it's that Duolingo's gamification was designed for retention, not for guiding users through the authentic difficulty of language acquisition at higher levels.

What Better Gamification Looks Like

The apps that use gamification most effectively in education share a few characteristics:

Metrics that track real progress: Vocabulary retention rate, grammar accuracy over time, comprehension test scores — not just lessons completed. Making this progress visible (learning dashboards, proficiency estimates) ties the game mechanics to the actual learning outcome.
Difficulty that adapts genuinely: Adaptive difficulty should push users into the 80–90% accuracy zone, not the 95%+ comfort zone. Duolingo's default lesson difficulty is calibrated too low for most users. Users who turn on "hard mode" learn faster and retain more.
Intrinsic hooks alongside extrinsic ones: Connecting users to content they genuinely care about (a TV show in the target language, a community of speakers) sustains motivation when the streak-protection drive fades.
Deliberate practice, not just practice: Gamification that rewards deliberate repetition of weak areas (rather than repetition of strengths) produces better outcomes. This requires per-item spaced repetition and a mistake-analysis loop, not just lesson completion.

Duolingo's success is real and its gamification deserves credit for making language study habitual for millions of people who never would have sustained it otherwise. Its limitations are also real — it's built an exceptionally engaging product that is moderately effective at language learning. The gap between those two things is where the most interesting design problems in edtech still live.

Pocket Linguist is Now on YouTube - Daily Language Tips and AI Tutoring

Ahmed Mahmoud — Fri, 27 Feb 2026 10:53:39 +0000

We are excited to announce that Pocket Linguist is now on YouTube!

What You Will Find

Daily language tips - Quick, practical advice for learners
Travel phrases - The words that actually matter abroad
AI tutor lessons - Meet Ang and Agnes, your personal language coaches
Cultural insights - Understanding context, not just vocabulary

Languages Covered

Spanish, French, Japanese, Korean, and 40+ more languages.

YouTube: https://www.youtube.com/@pocketlinguist
App Store: https://apps.apple.com/us/app/pocket-linguist/id6758757224

New Shorts and full videos dropping daily.

Pocket Linguist is Now on YouTube - Daily Language Tips & AI Tutoring

Ahmed Mahmoud — Fri, 27 Feb 2026 10:52:08 +0000

We are excited to announce that Pocket Linguist is now on YouTube!

What You Will Find

Daily language tips - Quick, practical advice for learners
Travel phrases - The words that actually matter abroad
AI tutor lessons - Meet Ang and Agnes, your personal language coaches
Cultural insights - Understanding context, not just vocabulary
Common mistakes - The errors everyone makes and how to fix them

Languages Covered

Spanish, French, Japanese, Korean, and 40+ more languages.

YouTube: https://www.youtube.com/@pocketlinguist
App Store: https://apps.apple.com/us/app/pocket-linguist/id6758757224

New Shorts and full videos dropping daily.

I Built an AI Language Tutor — Here's What I Learned About NLP

Ahmed Mahmoud — Wed, 25 Feb 2026 11:00:19 +0000

I Built an AI Language Tutor — Here's What I Learned About NLP

Here's an honest technical breakdown of what I built, what broke, and what I'd do differently.

The Core Problem: Language Learning Needs More Than a Chatbot

Vocabulary Grading

To grade output, I built a vocabulary filter that:

Tokenises the model's response using a language-specific tokenizer (spaCy for European languages, MeCab for Japanese).
Lemmatises each token to its base form.
Checks each lemma against a CEFR word list (freely available from EVP — English Vocabulary Profile for English, ELP for other languages).
Flags any word above the user's target CEFR band.
Rewrites the prompt to the model, instructing it to replace flagged vocabulary with simpler alternatives.

This two-pass approach — generate then simplify — adds latency (roughly 300–600 ms on GPT-4o-mini) but produces dramatically more appropriate output.

def grade_vocabulary(text: str, target_level: str, lang: str) -> dict:
    doc = nlp_models[lang](text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    above_level = [
        w for w in lemmas
        if cefr_level(w, lang) > target_level
    ]
    return {
        "flagged": above_level,
        "needs_rewrite": len(above_level) > 0,
    }

Intent Classification

A language tutor needs to handle multiple conversation modes:

Free conversation — user just chats, AI responds naturally
Correction mode — AI corrects grammar errors
Vocabulary drill — spaced repetition flashcard loop
Pronunciation practice — AI evaluates user speech (more on this below)
Translation check — user submits a translation, AI grades it

from enum import Enum

class SessionMode(Enum):
    FREE_CONVERSATION = "free"
    CORRECTION = "correction"
    VOCAB_DRILL = "vocab_drill"
    TRANSLATION = "translation"
    PRONUNCIATION = "pronunciation"

State transitions are logged per-session and stored with the conversation history, which lets the model use few-shot context to stay coherent across mode switches.

The Correction Problem: How Do You Correct Without Killing Motivation?

Research from Krashen's input hypothesis and subsequent work suggests delayed, selective correction is most effective. Specifically:

Correct only errors that impede comprehension, not stylistic differences
Use recasts (repeating the correct form naturally) rather than explicit metalinguistic feedback
Correct no more than 2–3 errors per conversational turn

I implemented this with a two-model pipeline:

Error detection model: A fine-tuned classifier that labels errors by type (morphological, syntactic, lexical, pragmatic) and severity (comprehension-blocking vs. minor).
Correction strategy model: Given the detected errors and the learner's level, decides which to correct and how.

def select_corrections(errors: list[dict], level: str) -> list[dict]:
    # Only return comprehension-blocking errors for A1/A2
    if level in ("A1", "A2"):
        return [e for e in errors if e["severity"] == "blocking"][:2]
    # For B1+, include common morphological errors too
    return [e for e in errors if e["severity"] in ("blocking", "morphological")][:3]

Handling 20+ Languages: The Tokenisation Nightmare

Supporting multiple languages isn't just a UI translation problem. Every language has fundamentally different tokenisation requirements:

Language	Tokenisation challenge
English	Relatively easy — whitespace + punctuation
German	Compound words (Donaudampfschifffahrtsgesellschaft) need decompounding
Japanese	No word boundaries — requires morphological analysis (MeCab, SudachiPy)
Arabic	Right-to-left, root-based morphology, heavy inflection
Chinese	Word segmentation (jieba, pkuseg) required
Turkish	Agglutinative — one word can express a full English sentence

I ended up with a language-router pattern:

TOKENISERS = {
    "en": lambda text: en_nlp(text),
    "de": lambda text: de_nlp(text),      # spaCy de_core_news_sm
    "ja": lambda text: mecab_tokenise(text),
    "zh": lambda text: jieba_tokenise(text),
    "ar": lambda text: cameltools_tokenise(text),
}

def tokenise(text: str, lang: str) -> list[str]:
    tokeniser = TOKENISERS.get(lang, default_tokeniser)
    return tokeniser(text)

This adds a dependency per language, but there's no general solution. Trying to use a single tokeniser across language families will produce garbage results for CJK and Arabic.

Latency: The Real UX Killer

The optimisations that actually moved the needle:

Stream the LLM response: Use server-sent events to start rendering the AI response before it's complete. Perceived latency drops by 60%+ even with identical total generation time.
Cache vocabulary grade results: Vocabulary checks on the input (not the output) can be cached with a short TTL. Most users repeat similar vocabulary within a session.
Run CEFR grading async on a separate thread: Don't block the main response path. If the grade check returns before the response is done, you can still intercept; if not, let it through and grade the next turn.
Move error detection to a smaller local model: 8ms on DistilBERT vs 400ms on GPT-4. Not suitable for all tasks but fine for binary error flagging.

After these changes, p50 latency dropped to 680ms and p95 to 1.1s — comfortably within the threshold.

What I'd Do Differently

Start with a state machine from day one. I retrofitted it after the intent classification failures. Every conversational app needs one.
Invest in evaluation datasets early. Without labeled examples of good vs. bad corrections for each language level, you're flying blind. NUCLE, BEA-2019, and Lang-8 are good starting points.
Separate the LLM call from the grading logic. Mixing them makes both harder to test. A clean pipeline — generate → validate → rewrite if needed — is worth the extra roundtrip.
Don't underestimate language-specific engineering costs. Adding Japanese support took 3x longer than adding Spanish. Budget accordingly.