<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adarsh Kant</title>
    <description>The latest articles on Forem by Adarsh Kant (@adarsh_kant_ebb2fde1d0c6b).</description>
    <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3317450%2F685ad4d0-3bbf-4356-8c66-14bac766e0a6.png</url>
      <title>Forem: Adarsh Kant</title>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adarsh_kant_ebb2fde1d0c6b"/>
    <language>en</language>
    <item>
      <title>Building Real-Time Voice Forms with Google Gemini API: Architecture &amp; Learnings</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Sun, 05 Apr 2026 21:43:54 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/building-real-time-voice-forms-with-google-gemini-api-architecture-learnings-4mn8</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/building-real-time-voice-forms-with-google-gemini-api-architecture-learnings-4mn8</guid>
      <description>&lt;p&gt;When you want to build voice-input forms that feel responsive and intuitive, the key challenge isn't transcription—modern APIs handle that well. It's &lt;strong&gt;latency&lt;/strong&gt;. Transcription that takes 2 seconds to return feels broken. Transcription that streams back in real-time (200-400ms for first token) feels magical.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture we built at Anve Voice Forms to make real-time voice transcription feel fast and seamless in the browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Why Basic Transcription APIs Feel Slow
&lt;/h2&gt;

&lt;p&gt;Most voice API approaches work like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User speaks for N seconds&lt;/li&gt;
&lt;li&gt;Collect all audio&lt;/li&gt;
&lt;li&gt;Send entire audio file to API&lt;/li&gt;
&lt;li&gt;Wait for transcription response&lt;/li&gt;
&lt;li&gt;Display result&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Round-trip latency: 2-5 seconds. That's dead time where the user is waiting and nothing is happening.&lt;/p&gt;

&lt;p&gt;The better approach is &lt;strong&gt;streaming&lt;/strong&gt;: send audio chunks as they arrive, start processing immediately, and stream back results in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the high-level flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (Frontend)
  Microphone API → WebAudio Processor → WebSocket Client
                                              │ Chunks
                                              ▼
Backend (Node.js/Python)
  WebSocket Server → Audio Processor → Gemini API (Streaming)
                          │
                          ▼
                    Transcript Builder → Browser updates UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. Browser-Side Audio Capture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Capture audio from microphone&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AudioContext&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;webkitAudioContext&lt;/span&gt;&lt;span class="p"&gt;)();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mediaStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mediaDevices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserMedia&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamAudioSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mediaStream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createScriptProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onaudioprocess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getChannelData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pcmData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;int16Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float32ToInt16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcmData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;audio_chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;int16Data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;float32ToInt16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;int16Array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;int16Array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mh"&gt;0x8000&lt;/span&gt;
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;float32Array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mh"&gt;0x7fff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;int16Array&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4096 sample chunk size: 93ms at 44.1kHz (good balance between latency and overhead)&lt;/li&gt;
&lt;li&gt;Int16 encoding: most APIs expect 16-bit PCM audio&lt;/li&gt;
&lt;li&gt;Send immediately: don't buffer, start streaming as chunks arrive&lt;/li&gt;
&lt;/ul&gt;
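&lt;p&gt;One note: the &lt;code&gt;socket&lt;/code&gt; used in the capture snippet is assumed, not shown. A minimal sketch of a Socket.IO-style wrapper around a plain WebSocket (the wrapper shape and endpoint are our assumptions, not the exact production code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: wrap a WebSocket-like object in an emit()-style interface so
// the capture code above works unchanged.
function createAudioSocket(ws) {
  return {
    emit(event, int16Data) {
      if (ws.readyState !== 1) return false; // 1 === OPEN; drop chunks until connected
      ws.send(int16Data.buffer);             // raw 16-bit PCM frames
      return true;
    },
  };
}

// Usage (endpoint is illustrative):
// const socket = createAudioSocket(new WebSocket('wss://example.com/transcribe'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;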

&lt;h2&gt;
  
  
  2. Streaming to Gemini API
&lt;/h2&gt;

&lt;p&gt;This is where real-time transcription happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;GoogleGenerativeAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@google/generative-ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;genAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenerativeAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeAudioStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;audioChunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;genAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getGenerativeModel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContentStream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;inlineData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio/mp3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioStream&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Transcribe this audio. Return ONLY the transcription.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;partial_transcript&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="p"&gt;}));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
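&lt;p&gt;On the browser side, these &lt;code&gt;partial_transcript&lt;/code&gt; messages need a handler. A minimal sketch (the accumulate-and-render shape is our assumption, not the exact production code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: accumulate streamed partials and hand the running transcript
// to whatever updates the form field.
function createTranscriptHandler(applyText) {
  let transcript = '';
  return (rawMessage) =&amp;gt; {
    const msg = JSON.parse(rawMessage);
    if (msg.type !== 'partial_transcript') return transcript;
    transcript += msg.text;
    applyText(transcript); // e.g. input.value = transcript
    return transcript;
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;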



&lt;h2&gt;
  
  
  3. Handling Codec Mismatches
&lt;/h2&gt;

&lt;p&gt;This was our biggest surprise issue. Browsers capture raw PCM as 32-bit floats (typically at 44.1 or 48 kHz), which the capture code above converts to 16-bit mono. But APIs have different requirements — some want WAV, some MP3, some raw PCM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ffmpeg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fluent-ffmpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;convertAudioCodec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;outputFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;ffmpeg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputBuffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;audioFrequency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;audioChannels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputBuffer&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Latency Optimization
&lt;/h2&gt;

&lt;p&gt;Real-time means the user perceives a response in under 500ms. Our latency breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser capture: 93ms (chunk size)&lt;/li&gt;
&lt;li&gt;Network round-trip: 50ms&lt;/li&gt;
&lt;li&gt;Gemini processing: 150ms&lt;/li&gt;
&lt;li&gt;Response streaming: 20ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~310ms&lt;/strong&gt; before transcription appears&lt;/li&gt;
&lt;/ul&gt;
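&lt;p&gt;Those numbers came from instrumenting the pipeline. A small sketch of measuring first-token latency per chunk (the chunk-id bookkeeping is illustrative, not our exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: record when a chunk was sent, and compute elapsed time when its
// first partial transcript arrives. performance.now() exists in browsers and Node.
const marks = new Map();

function markChunkSent(chunkId) {
  marks.set(chunkId, performance.now());
}

// Returns latency in ms, or null if this chunk was never marked.
function firstTokenLatency(chunkId) {
  const t0 = marks.get(chunkId);
  if (t0 === undefined) return null;
  marks.delete(chunkId);
  return performance.now() - t0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;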

&lt;h2&gt;
  
  
  5. Cost Optimization
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Don't send silence&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;shouldSendChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;audioData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;audioData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We estimate &lt;strong&gt;$0.0005 per form submission&lt;/strong&gt; at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming changes everything.&lt;/strong&gt; 500ms feels slow. 200ms feels responsive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with real audio.&lt;/strong&gt; Background noise, accents, quiet voices — test aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser audio APIs are still janky.&lt;/strong&gt; ScriptProcessorNode is deprecated in favor of AudioWorklet, but it remains the most broadly compatible option.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't ignore codec issues.&lt;/strong&gt; We lost 2 weeks to garbage transcription from wrong formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend UX matters.&lt;/strong&gt; Debounce updates, show partial results clearly.&lt;/li&gt;
&lt;/ol&gt;
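&lt;p&gt;On the last point, "debounce updates" can be as simple as a throttle that repaints at most once per interval. A sketch (the injectable clock is for testability, not a production requirement):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: coalesce rapid partial-transcript messages so the UI repaints
// at most once per minIntervalMs.
function throttleTranscriptUpdates(render, minIntervalMs = 100, now = () =&amp;gt; Date.now()) {
  let last = -Infinity;
  return (partialText) =&amp;gt; {
    const t = now();
    if (t - last &amp;lt; minIntervalMs) return false; // skipped; caller may flush at stream end
    last = t;
    render(partialText);
    return true;
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;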

&lt;h2&gt;
  
  
  Production Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: React + WebSocket client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Node.js with &lt;code&gt;ws&lt;/code&gt; library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API&lt;/strong&gt;: Google Gemini 2.0 Flash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codec&lt;/strong&gt;: ffmpeg-wasm (browser) + ffmpeg (backend)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosting&lt;/strong&gt;: Render + Cloudflare CDN&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Building something with voice?&lt;/strong&gt; We'd love to hear about it. Drop a comment or check out &lt;a href="https://voiceforms.anvevoice.app/lifetime/" rel="noopener noreferrer"&gt;Anve Voice Forms&lt;/a&gt; if you want to see this architecture in action.&lt;/p&gt;

&lt;p&gt;—Adarsh, Founder @ Anve Voice Forms&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Voice-Powered Form Builder and 87% of Users Complete It</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Sat, 04 Apr 2026 21:54:07 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/i-built-a-voice-powered-form-builder-and-87-of-users-complete-it-3hfm</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/i-built-a-voice-powered-form-builder-and-87-of-users-complete-it-3hfm</guid>
      <description>&lt;p&gt;Every developer has built a form. And every developer knows the pain: you spend hours perfecting the UX, adding validation, making it responsive... and then 85% of users abandon it halfway through.&lt;/p&gt;

&lt;p&gt;I got tired of this. So I built &lt;strong&gt;Anve Voice Forms&lt;/strong&gt; — a form builder where users can speak their answers instead of typing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Text Forms
&lt;/h2&gt;

&lt;p&gt;Here's what the data actually shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average form completion rate: &lt;strong&gt;15-20%&lt;/strong&gt; (Formstack, 2024)&lt;/li&gt;
&lt;li&gt;Average time to complete a 10-field form: &lt;strong&gt;4 minutes 23 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;#1 reason for abandonment: "Too many fields" / "Takes too long"&lt;/li&gt;
&lt;li&gt;Mobile form completion is &lt;strong&gt;30% lower&lt;/strong&gt; than desktop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've been building forms the same way since the 90s. Text input, validation, submit. The entire interaction model assumes users want to type. But by some estimates around 40% of users prefer voice input — whether due to accessibility needs, mobile context, or just convenience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Anve Voice Forms lets you create forms where users can &lt;strong&gt;speak their answers&lt;/strong&gt;. The voice engine (powered by Google Gemini's multimodal API) transcribes responses in real-time across 40+ languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tech stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React + TypeScript + Vite (frontend)&lt;/li&gt;
&lt;li&gt;Tailwind CSS (styling)&lt;/li&gt;
&lt;li&gt;Supabase (database + auth + edge functions)&lt;/li&gt;
&lt;li&gt;Clerk (authentication)&lt;/li&gt;
&lt;li&gt;Google Gemini API (voice processing via real-time WebSocket streaming)&lt;/li&gt;
&lt;li&gt;Razorpay (payments)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You build a form (drag-and-drop, just like Typeform)&lt;/li&gt;
&lt;li&gt;Each field can accept text OR voice input&lt;/li&gt;
&lt;li&gt;When a user clicks the mic, Gemini processes their speech in real-time&lt;/li&gt;
&lt;li&gt;The response is transcribed, validated, and stored&lt;/li&gt;
&lt;li&gt;You get analytics on completion rates, voice vs text usage, and more&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After testing with early users across education, HR, and customer feedback use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;87%+ completion rates&lt;/strong&gt; (vs ~15-20% industry average for text)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3x faster&lt;/strong&gt; form completion time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40+ languages&lt;/strong&gt; supported out of the box&lt;/li&gt;
&lt;li&gt;Users on mobile completed forms &lt;strong&gt;2.5x faster&lt;/strong&gt; with voice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest surprise? Users who had the &lt;em&gt;option&lt;/em&gt; of voice but chose text still completed at higher rates. Just having voice as a fallback reduced anxiety about long forms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Voice Changes Everything for Forms
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Accessibility is built-in, not bolted on&lt;/strong&gt;&lt;br&gt;
1.3 billion people globally have some form of disability. Voice input isn't a nice-to-have — it's how a huge chunk of the world interacts with technology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multilingual by default&lt;/strong&gt;&lt;br&gt;
If your form serves users in multiple languages, voice forms handle it natively. No translation layers, no per-language form variants. A user in Tamil Nadu speaks Tamil, a user in Berlin speaks German — same form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Mobile-first UX&lt;/strong&gt;&lt;br&gt;
Typing on a phone is slow and error-prone. Voice is the natural input method for mobile. Forms that support voice see significantly higher mobile completion rates.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The voice processing pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User speaks → WebSocket to Gemini API → Real-time transcription → Client-side validation → Supabase insert → Analytics event
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
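&lt;p&gt;The "client-side validation" step deserves a concrete example, because raw speech rarely matches field formats. A sketch for email fields (the field shape and the spoken-email heuristics are illustrative, not our production schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: normalize a spoken answer before storing it.
// Speech often renders emails as "john at example dot com".
function validateSpokenAnswer(field, transcript) {
  const text = transcript.trim();
  if (!text) return { ok: false, error: 'empty answer' };
  if (field.type === 'email') {
    const candidate = text.toLowerCase()
      .replace(/\s+at\s+/g, '@')
      .replace(/\s+dot\s+/g, '.')
      .replace(/\s+/g, '');
    return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(candidate)
      ? { ok: true, value: candidate }
      : { ok: false, error: 'not a valid email' };
  }
  return { ok: true, value: text };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;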



&lt;p&gt;Key technical decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket streaming&lt;/strong&gt; over REST for real-time feel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side audio processing&lt;/strong&gt; — only processed text is stored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supabase Edge Functions&lt;/strong&gt; for server-side logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive enhancement&lt;/strong&gt; — voice is additive, text always works&lt;/li&gt;
&lt;/ul&gt;
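&lt;p&gt;"Progressive enhancement" here is mostly a feature check before showing the mic button. A sketch (the globals are parameters so the check is testable outside a browser):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch: offer voice only when the browser can actually capture audio;
// the text input path always works regardless.
function supportsVoiceInput(nav = globalThis.navigator, win = globalThis.window) {
  if (!nav || !nav.mediaDevices) return false;
  if (typeof nav.mediaDevices.getUserMedia !== 'function') return false;
  if (!win) return false;
  return Boolean(win.AudioContext || win.webkitAudioContext);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;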

&lt;h2&gt;
  
  
  Try It / Get a Lifetime Deal
&lt;/h2&gt;

&lt;p&gt;I'm running a limited launch: &lt;strong&gt;500 lifetime licenses at $199&lt;/strong&gt; (one-time payment, lifetime access).&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlimited text form submissions (forever)&lt;/li&gt;
&lt;li&gt;50 voice responses/month&lt;/li&gt;
&lt;li&gt;Analytics dashboard&lt;/li&gt;
&lt;li&gt;API access + webhooks&lt;/li&gt;
&lt;li&gt;40+ languages&lt;/li&gt;
&lt;li&gt;Lifetime updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://voiceforms.anvevoice.app/lifetime/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=ltd500" rel="noopener noreferrer"&gt;voiceforms.anvevoice.app/lifetime/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main app:&lt;/strong&gt; &lt;a href="https://forms.anvevoice.app" rel="noopener noreferrer"&gt;forms.anvevoice.app&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Currently working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zapier + Make integrations&lt;/li&gt;
&lt;li&gt;Conditional logic for voice flows&lt;/li&gt;
&lt;li&gt;Team collaboration features&lt;/li&gt;
&lt;li&gt;White-label option for agencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Would love feedback from the dev community. What would you build with voice-powered forms? Drop a comment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://twitter.com/adarshknt1" rel="noopener noreferrer"&gt;Adarsh&lt;/a&gt; — indie founder from India.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>saas</category>
      <category>ai</category>
      <category>startup</category>
    </item>
    <item>
      <title>How I Built a Voice AI That Takes Real DOM Actions on Websites</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Sat, 21 Mar 2026 19:31:19 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/how-i-built-a-voice-ai-that-takes-real-dom-actions-on-websites-4gn4</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/how-i-built-a-voice-ai-that-takes-real-dom-actions-on-websites-4gn4</guid>
      <description>&lt;p&gt;Every voice AI tool I evaluated did the same thing: listen to speech, convert to text, send to an LLM, return audio. Essentially a chatbot with a microphone.&lt;/p&gt;

&lt;p&gt;But I wanted something different. I wanted voice AI that could actually &lt;strong&gt;do things&lt;/strong&gt; on a website — click buttons, fill forms, navigate pages. A voice agent, not a voice chatbot.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://anvevoice.app" rel="noopener noreferrer"&gt;AnveVoice&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Voice Chatbots
&lt;/h2&gt;

&lt;p&gt;Here's what most "voice AI" tools do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User speaks&lt;/li&gt;
&lt;li&gt;Speech-to-text converts it&lt;/li&gt;
&lt;li&gt;Text goes to an LLM&lt;/li&gt;
&lt;li&gt;LLM generates a response&lt;/li&gt;
&lt;li&gt;Text-to-speech reads it back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The AI talks back, but it doesn't &lt;em&gt;do&lt;/em&gt; anything. It can't click your "Book Appointment" button. It can't fill in your contact form. It can't navigate to your pricing page.&lt;/p&gt;

&lt;p&gt;For websites, this is a huge missed opportunity. 96.3% of websites fail basic accessibility standards (WebAIM 2025). Voice navigation isn't just a feature — it's an accessibility requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Voice → Intent → DOM Action
&lt;/h2&gt;

&lt;p&gt;Here's how AnveVoice works differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Speech → STT (sub-200ms) → Intent Parser → Action Router
                                                    ↓
                                    ┌───────────────┼───────────────┐
                                    ↓               ↓               ↓
                              DOM Actions      Navigation      Form Fill
                              (click, scroll)  (page redirect)  (input values)
                                    ↓               ↓               ↓
                              Visual Feedback → TTS Response → State Update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key innovation is the &lt;strong&gt;Action Router&lt;/strong&gt;. Instead of just generating text responses, the AI interprets user intent and maps it to real DOM actions using 46 MCP (Model Context Protocol) tools over JSON-RPC 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real DOM Actions
&lt;/h3&gt;

&lt;p&gt;When a user says "Book an appointment for Tuesday," AnveVoice doesn't just say "I'd be happy to help you book an appointment." It actually:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identifies the booking form on the page&lt;/li&gt;
&lt;li&gt;Fills in the date field with next Tuesday's date&lt;/li&gt;
&lt;li&gt;Clicks the submit button&lt;/li&gt;
&lt;li&gt;Confirms the booking with voice feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is possible because we maintain a real-time DOM map of the page and use semantic understanding to match user intents to actionable elements.&lt;/p&gt;
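&lt;p&gt;As a rough illustration of that intent step, here is a minimal sketch of resolving "Tuesday" to a concrete date and emitting a structured plan. &lt;code&gt;nextWeekday&lt;/code&gt;, &lt;code&gt;parseBookingIntent&lt;/code&gt;, and the plan shape are simplified assumptions for this post, not the production parser:&lt;/p&gt;

```javascript
// Hypothetical sketch of the intent step, not the production parser.
const DAYS = ['sunday', 'monday', 'tuesday', 'wednesday',
              'thursday', 'friday', 'saturday'];

// Next occurrence of the named weekday, strictly after `from`.
function nextWeekday(name, from = new Date()) {
  const target = DAYS.indexOf(name.toLowerCase());
  const delta = ((target - from.getDay()) % 7 + 7) % 7 || 7;
  const next = new Date(from);
  next.setDate(from.getDate() + delta);
  return next;
}

// "Book an appointment for Tuesday" becomes an executable plan
// that the Action Router can hand to the form-fill tools.
function parseBookingIntent(utterance, now = new Date()) {
  const match = utterance.toLowerCase().match(/for (\w+day)/);
  if (!match) return null;
  const d = nextWeekday(match[1], now);
  const pad = (n) => String(n).padStart(2, '0');
  return {
    action: 'fill-and-submit',
    form: 'booking', // matched against the indexed DOM map
    date: d.getFullYear() + '-' + pad(d.getMonth() + 1) + '-' + pad(d.getDate()),
  };
}
```

&lt;p&gt;With "today" set to Friday, March 20, 2026, the plan's date resolves to 2026-03-24, the following Tuesday.&lt;/p&gt;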

&lt;h2&gt;
  
  
  The Technical Challenge: Sub-700ms Latency
&lt;/h2&gt;

&lt;p&gt;End-to-end voice latency needs to be under 1 second to feel natural. Here's our pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;~180ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent Parse&lt;/td&gt;
&lt;td&gt;&amp;lt; 100ms&lt;/td&gt;
&lt;td&gt;~80ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action Execution&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;~150ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;~190ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 700ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~600ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We achieve this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming STT&lt;/strong&gt; — processing audio chunks as they arrive, not waiting for silence detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-computed DOM maps&lt;/strong&gt; — indexing actionable elements on page load so we don't need to traverse the DOM at query time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel TTS&lt;/strong&gt; — starting speech synthesis while the action is still executing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge inference&lt;/strong&gt; — running intent classification at the edge, not round-tripping to a central server&lt;/li&gt;
&lt;/ul&gt;
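&lt;p&gt;The "parallel TTS" point is just promise concurrency. A minimal sketch, with &lt;code&gt;executeAction&lt;/code&gt; and &lt;code&gt;synthesizeSpeech&lt;/code&gt; as hypothetical stand-ins for the real clients:&lt;/p&gt;

```javascript
// Hypothetical stand-ins for the real action executor and TTS client.
function executeAction(intent) {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ status: 'done', intent }), 150));
}
function synthesizeSpeech(text) {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ audio: 'stream', text }), 190));
}

// Sequentially this would cost roughly 150ms plus 190ms; running both
// concurrently puts only max(150, 190) = 190ms on the critical path.
async function respond(intent, confirmation) {
  const [result, speech] = await Promise.all([
    executeAction(intent),
    synthesizeSpeech(confirmation),
  ]);
  return { result, speech };
}
```

&lt;p&gt;The trade-off: if the action fails after synthesis has started, you may need to discard or replace the queued audio.&lt;/p&gt;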

&lt;h2&gt;
  
  
  The Embed: One Script Tag
&lt;/h2&gt;

&lt;p&gt;The entire integration is a single script tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script 
  &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://widget.anvevoice.app/embed.js"&lt;/span&gt; 
  &lt;span class="na"&gt;data-agent-id=&lt;/span&gt;&lt;span class="s"&gt;"YOUR_AGENT_ID"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No WebRTC server management. No complex API integration. Works with React, Vue, Angular, Next.js, Shopify, WordPress, or any HTML page.&lt;/p&gt;

&lt;p&gt;The widget handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microphone permission and audio capture&lt;/li&gt;
&lt;li&gt;Real-time speech recognition in 50+ languages&lt;/li&gt;
&lt;li&gt;Intent classification and action routing&lt;/li&gt;
&lt;li&gt;DOM manipulation and visual feedback&lt;/li&gt;
&lt;li&gt;Text-to-speech response in the detected language&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  50+ Languages (Including 22 Indian Languages)
&lt;/h2&gt;

&lt;p&gt;This was non-negotiable for us. India has 700M+ smartphone users, and 65% of mobile searches there happen in non-English languages.&lt;/p&gt;

&lt;p&gt;We support all 22 scheduled Indian languages plus Hinglish (Hindi-English code-switching), which is how most urban Indians actually communicate with technology.&lt;/p&gt;

&lt;p&gt;The language detection works automatically — if a user starts speaking Hindi, the system detects it, locks to Hindi for the session, and responds in Hindi. No configuration needed.&lt;/p&gt;
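&lt;p&gt;The session lock itself is simple state. A minimal sketch (&lt;code&gt;detectLanguage&lt;/code&gt; is a hypothetical stand-in for the real streaming detector):&lt;/p&gt;

```javascript
// Minimal sketch of the per-session language lock. `detectLanguage`
// is a hypothetical stand-in for the real streaming detector.
function createSession(detectLanguage) {
  let locked = null;
  return {
    handleUtterance(audioChunk) {
      if (locked === null) {
        locked = detectLanguage(audioChunk); // first detection wins
      }
      return { respondIn: locked };          // later turns stay locked
    },
  };
}
```

&lt;p&gt;A real implementation would gate the lock on detector confidence rather than the very first chunk; this only shows the session shape.&lt;/p&gt;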

&lt;h2&gt;
  
  
  Pricing: Flat-Rate vs. Per-Minute
&lt;/h2&gt;

&lt;p&gt;Most voice AI tools charge per minute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retell AI: ~$0.13-0.31/min&lt;/li&gt;
&lt;li&gt;Vapi: ~$0.15-0.33/min
&lt;/li&gt;
&lt;li&gt;ElevenLabs: ~$0.08-0.10/min&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At 1,000 minutes/month, that's $80-$330.&lt;/p&gt;

&lt;p&gt;AnveVoice uses flat-rate token pricing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free: $0/mo (50K tokens)&lt;/li&gt;
&lt;li&gt;Growth: $35/mo (500K tokens, 3 bots)&lt;/li&gt;
&lt;li&gt;Enterprise: Custom&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Predictable costs. No surprise bills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're currently focused on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt; — 94% appointment booking success rate in pilot clinics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce&lt;/strong&gt; — Voice-powered product discovery and checkout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government portals&lt;/strong&gt; — Citizen services in vernacular languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — Making WCAG 2.1 AA compliance achievable through voice&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;You can try AnveVoice at &lt;a href="https://anvevoice.app" rel="noopener noreferrer"&gt;anvevoice.app&lt;/a&gt; or see the experience hub at &lt;a href="https://experience.anvevoice.app" rel="noopener noreferrer"&gt;experience.anvevoice.app&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The embed is free to start. If you're building a website that needs voice interaction — especially if accessibility or multilingual support matters — give it a try.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Adarsh, founder of ANVE.AI. I'm a cybersecurity professional (CISA/CEH certified) who got obsessed with making the web more accessible through voice. If you have questions about the architecture or want to discuss voice AI, drop a comment below or find me on &lt;a href="https://linkedin.com/in/adarshknt/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>voiceai</category>
    </item>
    <item>
      <title>From 0 to 100+ Users: What Actually Worked After 20,000 SEO Pages Got Us Nothing</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Thu, 19 Mar 2026 08:49:49 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/from-0-to-100-users-what-actually-worked-after-20000-seo-pages-got-us-nothing-5d7o</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/from-0-to-100-users-what-actually-worked-after-20000-seo-pages-got-us-nothing-5d7o</guid>
      <description>&lt;p&gt;A few weeks ago, I published a post here about adding voice AI to any website with one script tag. Today I'm sharing the business side of that story — because the technical win meant nothing without users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Before: 6 Months of Beautiful Failure
&lt;/h2&gt;

&lt;p&gt;I built AnveVoice — a Voice OS for websites. One script tag. Agentic DOM actions (navigates, fills forms, clicks buttons). 53 languages. Sub-700ms latency.&lt;/p&gt;

&lt;p&gt;Then I did what every blog told me to do: I went all in on SEO.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20,253 pages&lt;/strong&gt; of content written&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,000+ monthly visitors&lt;/strong&gt; from Google&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$3,200/month&lt;/strong&gt; infrastructure costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 signups.&lt;/strong&gt; Zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost per signup: undefined (can't divide by zero).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pivot That Changed Everything
&lt;/h2&gt;

&lt;p&gt;The product didn't change. The positioning did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; "Voice OS for websites" — so broad that nobody saw themselves in it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; Three specific verticals with urgent deadlines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt; — WCAG 2.1 AA deadline April 24, 2026. Telemedicine platforms face legal exposure if patient intake forms aren't accessible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government&lt;/strong&gt; — Same deadline. $55,000/day penalties for non-compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;International e-commerce&lt;/strong&gt; — 53 languages as a competitive moat for global stores.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Actually Drove the First 100+ Users
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Multi-platform content blitz
&lt;/h3&gt;

&lt;p&gt;Published the raw, honest failure story simultaneously on Dev.to, Indie Hackers, Medium, Hacker News, and LinkedIn. The vulnerability resonated — founders DM'd saying they'd been through the same thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Vertical positioning
&lt;/h3&gt;

&lt;p&gt;Stopped saying "voice for everyone." Started saying "voice for healthcare sites facing the April 2026 WCAG deadline." Same product. Completely different conversion rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cold outreach to deadline-driven buyers
&lt;/h3&gt;

&lt;p&gt;When you email someone facing a compliance deadline with $55K/day penalties, the conversation is fundamentally different from cold outreach to someone who might find your product interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. "Powered by AnveVoice" badge
&lt;/h3&gt;

&lt;p&gt;Every free tier widget shows this. Each user becomes a distribution channel. It compounds silently.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Directory submissions
&lt;/h3&gt;

&lt;p&gt;Listed on 10+ directories (Product Hunt, BetaList, SaaSHub, AlternativeTo). Each listing is a permanent backlink and discovery channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Today
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Users:          100+ (up from 0)
Verticals:      3 (healthcare, government, e-commerce)
Languages:      53
Latency:        &amp;lt;700ms end-to-end
Free tier:      60 conversations/month
Growth plan:    $36/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Your product is probably fine. Your positioning might be the problem.&lt;/p&gt;

&lt;p&gt;Find people who need what you built &lt;strong&gt;urgently&lt;/strong&gt;. Not people who think it's cool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;enthusiasm&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're building something and struggling with traction, ask yourself: Who has a deadline? Who faces a penalty without a solution? Who is actively shopping right now?&lt;/p&gt;

&lt;p&gt;Those are your first 100 users.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AnveVoice is live at &lt;a href="https://anvevoice.app" rel="noopener noreferrer"&gt;anvevoice.app&lt;/a&gt;. Free tier, no credit card. If you're in healthcare, government, or e-commerce and face accessibility deadlines — happy to help.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building in public on &lt;a href="https://x.com/adarshknt1" rel="noopener noreferrer"&gt;X/Twitter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>webdev</category>
      <category>saas</category>
      <category>marketing</category>
    </item>
    <item>
      <title>I Added Voice AI to Any Website with One Script Tag</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:36:49 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/i-added-voice-ai-to-any-website-with-one-script-tag-3641</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/i-added-voice-ai-to-any-website-with-one-script-tag-3641</guid>
      <description>&lt;p&gt;What if you could add a voice AI assistant to any website with a single line of code?&lt;/p&gt;

&lt;p&gt;That's what I built. One &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag. The user talks. The AI listens, understands, and takes real actions on the page — clicking buttons, filling forms, navigating pages.&lt;/p&gt;

&lt;p&gt;Here's how it works under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Most websites are built for mouse-and-keyboard users. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15-20% of the global population&lt;/strong&gt; has some form of disability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice search&lt;/strong&gt; is growing 35% year over year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WCAG 2.1 AA compliance&lt;/strong&gt; is now legally required for government and healthcare sites (deadline: April 24, 2026)&lt;/li&gt;
&lt;li&gt;Mobile users on the go need hands-free interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional chatbots just answer questions. They don't &lt;em&gt;do&lt;/em&gt; anything on the page. I wanted to build something that actually takes action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;AnveVoice has three core layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Speech-to-Text (STT)
&lt;/h3&gt;

&lt;p&gt;We use a streaming STT pipeline that achieves sub-200ms first-token latency. The audio is captured via the Web Audio API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified audio capture&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mediaDevices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserMedia&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;sampleRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Stream chunks to STT service via WebSocket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We support 53 languages with automatic language detection. The system identifies the language within the first 500ms of audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Intent Resolution + DOM Mapping
&lt;/h3&gt;

&lt;p&gt;This is the hard part. Once we have the transcribed text, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand intent&lt;/strong&gt;: "I want to buy the blue shoes in size 10" maps to &lt;code&gt;{action: "click", target: "product-variant-blue", then: "select-size-10", then: "add-to-cart"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map to DOM elements&lt;/strong&gt;: We crawl the page's accessibility tree and semantic HTML to find matching elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute actions&lt;/strong&gt;: Click, scroll, fill form fields, navigate
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified DOM action executor&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeVoiceAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Find the target element using multiple strategies&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;findElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aria-label&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// ARIA attributes first&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-testid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Test IDs&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;innerText&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Visible text matching&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;semantic-role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// HTML5 semantic roles&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fill&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatchEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;bubbles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;navigate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
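&lt;p&gt;The &lt;code&gt;findElement&lt;/code&gt; helper above is not shown; conceptually it tries each matching strategy in priority order against the pre-built element index. A simplified, DOM-free sketch (the descriptor format here is illustrative, not our actual index):&lt;/p&gt;

```javascript
// Simplified sketch of findElement's strategy cascade, run against a
// pre-built index of element descriptors rather than the live DOM.
function findInIndex(query, index, strategies) {
  const q = query.toLowerCase();
  for (const key of strategies) {
    const hit = index.find((el) => {
      const value = el[key];
      return typeof value === 'string' ? value.toLowerCase().includes(q) : false;
    });
    if (hit) return hit; // first strategy with a match wins
  }
  return null;
}
```

&lt;p&gt;Earlier strategies (ARIA labels) outrank later ones (visible text), so an explicit accessible label always beats a fuzzy text match.&lt;/p&gt;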



&lt;h3&gt;
  
  
  3. Text-to-Speech (TTS) Response
&lt;/h3&gt;

&lt;p&gt;After executing the action, the system confirms what it did via natural speech. We use streaming TTS for sub-300ms response time.&lt;/p&gt;

&lt;p&gt;The total pipeline: STT (200ms) + Intent (100ms) + Action (50ms) + TTS (300ms) = &lt;strong&gt;under 700ms end-to-end&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One-Tag Integration
&lt;/h2&gt;

&lt;p&gt;Here's what the actual integration looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://app.anvevoice.app/widget.js"&lt;/span&gt;
        &lt;span class="na"&gt;data-key=&lt;/span&gt;&lt;span class="s"&gt;"your-api-key"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Injects a floating voice button into the page&lt;/li&gt;
&lt;li&gt;Handles microphone permissions&lt;/li&gt;
&lt;li&gt;Streams audio to our STT service&lt;/li&gt;
&lt;li&gt;Resolves intents against the current page's DOM&lt;/li&gt;
&lt;li&gt;Executes actions and provides voice feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No server-side changes. No framework dependencies. Works with React, Vue, Angular, vanilla HTML, Shopify, WordPress — anything with a DOM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Can Actually Do
&lt;/h2&gt;

&lt;p&gt;Real examples from production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce&lt;/strong&gt;: "Show me red dresses under fifty dollars" → filters products, scrolls to results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare forms&lt;/strong&gt;: "Fill in my date of birth, March 15, 1985" → finds the DOB field, enters the date&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government portals&lt;/strong&gt;: "Navigate to the benefits application page" → clicks through menu navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language&lt;/strong&gt;: A user says the same command in Hindi, Spanish, or Japanese — same result&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The WCAG Compliance Angle
&lt;/h2&gt;

&lt;p&gt;The April 24, 2026 WCAG 2.1 AA deadline affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Government sites serving 50,000+ people&lt;/li&gt;
&lt;li&gt;Healthcare organizations receiving federal funding&lt;/li&gt;
&lt;li&gt;Any site that wants to avoid accessibility lawsuits ($55K+/day penalties for government entities)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Voice interfaces aren't just nice to have anymore. They're becoming a legal requirement for accessible web experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;p&gt;After 6 months of optimization:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end latency&lt;/td&gt;
&lt;td&gt;&amp;lt;1000ms&lt;/td&gt;
&lt;td&gt;680ms avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language detection&lt;/td&gt;
&lt;td&gt;&amp;lt;500ms&lt;/td&gt;
&lt;td&gt;420ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOM action execution&lt;/td&gt;
&lt;td&gt;&amp;lt;100ms&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages supported&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration time&lt;/td&gt;
&lt;td&gt;&amp;lt;5 min&lt;/td&gt;
&lt;td&gt;~60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;AnveVoice is live at &lt;a href="https://anvevoice.app" rel="noopener noreferrer"&gt;anvevoice.app&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free tier&lt;/strong&gt;: 60 conversations/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth&lt;/strong&gt;: $36/month for 2,100 conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: $120/month for high-volume sites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're working on accessibility, multilingual support, or just want to make your site more interactive — I'd love to hear what you think.&lt;/p&gt;

&lt;p&gt;Drop a comment if you have questions about the architecture, the DOM mapping approach, or the STT/TTS pipeline. Happy to go deeper on any of these.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Adarsh, solo founder building AnveVoice. Currently pivoting from horizontal positioning to three urgent verticals: healthcare, government, and international e-commerce. Building in public on &lt;a href="https://x.com/adarshknt1" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>ai</category>
      <category>a11y</category>
    </item>
    <item>
      <title>We Built LinuxOS-AI: The First Step Toward an AI-Native Linux OS</title>
      <dc:creator>Adarsh Kant</dc:creator>
      <pubDate>Wed, 02 Jul 2025 19:20:02 +0000</pubDate>
      <link>https://forem.com/adarsh_kant_ebb2fde1d0c6b/we-built-linuxos-ai-the-first-step-toward-an-ai-native-linux-os-4f7j</link>
      <guid>https://forem.com/adarsh_kant_ebb2fde1d0c6b/we-built-linuxos-ai-the-first-step-toward-an-ai-native-linux-os-4f7j</guid>
      <description>&lt;p&gt;Hey folks 👋&lt;/p&gt;

&lt;p&gt;I'm excited to share something we’ve been quietly working on — LinuxOS-AI, an AI-powered Linux terminal built on top of Google’s Gemini CLI.&lt;/p&gt;

&lt;p&gt;It’s open-source. It’s safe by default. And it’s a glimpse of what a future AI-native operating system might feel like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y1lie0t844mpstjuua0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y1lie0t844mpstjuua0.gif" alt="Image description" width="1200" height="800"&gt;&lt;/a&gt;&lt;br&gt;
🧠 Why We Built This&lt;br&gt;
Traditional terminals are powerful but rigid. You have to remember flags, read man pages, and always worry about breaking things.&lt;/p&gt;

&lt;p&gt;We asked: what if you could just tell your Linux shell what you want in plain English — and it would do it safely and intelligently?&lt;/p&gt;

&lt;p&gt;So we built LinuxOS-AI. A terminal where you can say:&lt;/p&gt;

&lt;p&gt;🗣️ “Install Oracle DB”&lt;br&gt;
🛡️ “Configure firewall to allow SSH only”&lt;br&gt;
📁 “List all Python files over 1MB”&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 What Makes It Different
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Natural Language System Admin (powered by Gemini CLI)&lt;/li&gt;
&lt;li&gt;✅ Dry-run &amp;amp; sudo confirmation for safety&lt;/li&gt;
&lt;li&gt;✅ Built-in agents for Shell, Filesystem, and Firewall tasks&lt;/li&gt;
&lt;li&gt;✅ Reskinned UX for clarity + extensibility&lt;/li&gt;
&lt;li&gt;✅ Fully open source and customizable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is just v0.1.0 — but we believe it’s the starting point for something big.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌐 Try It / Support It
&lt;/h2&gt;

&lt;p&gt;🔗 GitHub: github.com/ANVEAI/linuxos-ai&lt;/p&gt;

&lt;p&gt;🚀 Product Hunt launch: producthunt.com/products/linuxos-ai&lt;/p&gt;

&lt;p&gt;We’d love your feedback, feature ideas, or even just a GitHub ⭐️ if you like where this is going.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧩 What’s Next?
&lt;/h2&gt;

&lt;p&gt;We're exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in package manager hooks&lt;/li&gt;
&lt;li&gt;AI-powered cron/scheduling&lt;/li&gt;
&lt;li&gt;Plugin support (think: agents.d)&lt;/li&gt;
&lt;li&gt;Voice module (in alpha 👀)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever wished your terminal understood you better, we’d love to hear from you.&lt;/p&gt;

&lt;p&gt;💬 What’s one thing you’d want your terminal to do if it was truly intelligent?&lt;br&gt;
Drop a comment — let’s reimagine the shell together.&lt;/p&gt;

&lt;p&gt;– Adarsh Kant&lt;br&gt;
Founder, ANVE.AI&lt;br&gt;
LinuxOS-AI Maintainer&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>linux</category>
      <category>ai</category>
      <category>dev</category>
    </item>
  </channel>
</rss>
