<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Boopathi</title>
    <description>The latest articles on Forem by Boopathi (@programmerraja).</description>
    <link>https://forem.com/programmerraja</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F430913%2F46bcaec6-10d0-4395-8161-5b0b881ac731.jpg</url>
      <title>Forem: Boopathi</title>
      <link>https://forem.com/programmerraja</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/programmerraja"/>
    <language>en</language>
    <item>
      <title>Sherlock Holmes: The Case Of AI Brought Down Our Servers</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 01 Mar 2026 23:08:46 +0000</pubDate>
      <link>https://forem.com/programmerraja/sherlock-holmes-the-case-of-ai-brought-down-our-servers-5002</link>
      <guid>https://forem.com/programmerraja/sherlock-holmes-the-case-of-ai-brought-down-our-servers-5002</guid>
      <description>&lt;p&gt;There are two kinds of production incidents.&lt;/p&gt;

&lt;p&gt;The first kind gives you signals. Metrics drift slightly off baseline. Latency edges upward. Dashboards turn yellow long before anything turns red. You have time to reason about it.&lt;/p&gt;

&lt;p&gt;The second kind doesn’t negotiate. It lets you sleep peacefully and then informs you in the morning that your server died multiple times overnight.&lt;/p&gt;

&lt;p&gt;This was the second kind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;We’re building a voice agent platform.&lt;/p&gt;

&lt;p&gt;Calls come in from users. Audio streams over WebSocket. We integrate with Twilio for real-time media streams. AI agents process the conversation, decide what to say next, and occasionally invoke tools. Some of those tools query our database to fetch context or perform actions.&lt;/p&gt;

&lt;p&gt;Architecturally, nothing unusual. A fairly standard real-time pipeline: streaming input, AI orchestration, tool execution, database lookups.&lt;/p&gt;

&lt;p&gt;And everything had been working fine.&lt;/p&gt;

&lt;p&gt;Then one night, one of our Kubernetes pods, limited to 1GB of memory, started crashing repeatedly. There was no deployment. No configuration change. No obvious traffic spike. No infrastructure event. Just restarts.&lt;/p&gt;

&lt;p&gt;That’s always unsettling. When nothing changed, but something clearly broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Suspect: Streaming
&lt;/h2&gt;

&lt;p&gt;When memory spikes in a real-time system, your instinct immediately points to streaming.&lt;/p&gt;

&lt;p&gt;WebSockets can buffer unexpectedly. Audio chunks might accumulate if something downstream slows down. Garbage collection might not keep up under bursty traffic. Maybe some array was growing quietly in memory.&lt;/p&gt;

&lt;p&gt;All of those were reasonable hypotheses.&lt;/p&gt;

&lt;p&gt;We spun up a test environment and tried to simulate the issue. We created parallel calls. We streamed audio continuously. We monitored memory closely, expecting to see the same runaway pattern.&lt;/p&gt;

&lt;p&gt;Nothing happened.&lt;/p&gt;

&lt;p&gt;Memory usage remained stable. The heap grew and shrank normally. No vertical spikes. No crashes.&lt;/p&gt;

&lt;p&gt;That almost made it worse. Because in production, it was reproducible, just not consistently. During US night hours, when traffic was low, we triggered calls manually and could sometimes reproduce the crash. Other times, everything behaved perfectly.&lt;/p&gt;

&lt;p&gt;Intermittent, probabilistic failures are far harder to reason about than deterministic ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Snapshots and False Leads
&lt;/h2&gt;

&lt;p&gt;Next, we went for heap snapshots using V8 and Chrome DevTools. If something large was being retained, the snapshot would reveal it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SIGUSR1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Received SIGUSR1 event on proc. Executing heap snapshot.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;v8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;writeHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`/profiling/heap-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;.heapsnapshot`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;snapshotStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;v8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createWriteStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;snapshotStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileStream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Heap snapshot saved as &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;writeHeapSnapshot&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqvrblg3m1o9howfkrmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqvrblg3m1o9howfkrmk.png" alt="preview" width="685" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We added a signal handler to our Node.js process so we could trigger heap snapshots on demand. The plan was simple: wait until memory rose, send the signal, capture the snapshot, and analyze it offline.&lt;/p&gt;

&lt;p&gt;There’s a catch, though. Generating a heap snapshot requires additional memory. If your pod is already close to its limit, the snapshot process itself can push it over.&lt;/p&gt;

&lt;p&gt;That’s exactly what happened.&lt;/p&gt;

&lt;p&gt;Sometimes the pod crashed before the snapshot completed. Other times it succeeded, but the analysis didn’t reveal anything clearly catastrophic. We saw objects. We saw JSON structures. We saw logs. But nothing that screamed “this is it.”&lt;/p&gt;

&lt;p&gt;We compared multiple snapshots: normal state versus spike moments. The differences weren’t obvious enough to explain a near-1GB allocation.&lt;/p&gt;

&lt;p&gt;Meanwhile, the day was progressing. It was evening in India. Which meant it was morning in the US.&lt;/p&gt;

&lt;p&gt;Traffic was about to return.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching It Happen Live
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm9jdux95wovkhowjzbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm9jdux95wovkhowjzbg.png" alt="preview" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As calls started coming in, we stopped theorizing and simply watched production.&lt;/p&gt;

&lt;p&gt;There’s something tense about staring at live memory graphs when you know a crash is possible.&lt;/p&gt;

&lt;p&gt;At first, everything looked normal. Heap usage was steady. CPU was fine. Calls were connecting. Conversations were flowing.&lt;/p&gt;

&lt;p&gt;One call completed. No issue.&lt;/p&gt;

&lt;p&gt;Another started. Still stable.&lt;/p&gt;

&lt;p&gt;A few more came in. The graph moved slightly, but within normal range. For a moment, we thought maybe the issue had somehow resolved itself.&lt;/p&gt;

&lt;p&gt;Then it happened.&lt;/p&gt;

&lt;p&gt;The memory line didn’t drift upward gradually. It didn’t climb in a smooth curve. It jumped. A sharp vertical spike, as if a massive object had been allocated in a single operation.&lt;/p&gt;

&lt;p&gt;Within seconds, the pod was OOM-killed and terminated.&lt;/p&gt;

&lt;p&gt;Restarting.&lt;/p&gt;

&lt;p&gt;This wasn’t a leak accumulating over time. This was a sudden allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Didn’t Save Us
&lt;/h2&gt;

&lt;p&gt;Under pressure, and with leadership understandably concerned, we tried the obvious mitigation: horizontal scaling.&lt;/p&gt;

&lt;p&gt;If one pod was overloaded, maybe splitting the traffic would help. So we spun up an additional pod and routed traffic between them.&lt;/p&gt;

&lt;p&gt;The assumption was simple: less load per instance means less memory pressure.&lt;/p&gt;

&lt;p&gt;It didn’t help.&lt;/p&gt;

&lt;p&gt;Both pods eventually crashed.&lt;/p&gt;

&lt;p&gt;That clarified something important. Scaling helps when the issue is cumulative load. It does not help when a single request is catastrophic. If one request allocates hundreds of megabytes, any pod that processes that request will fail independently.&lt;/p&gt;

&lt;p&gt;The problem wasn’t load distribution. It was logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observing Memory in Real Time
&lt;/h2&gt;

&lt;p&gt;Instead of relying on snapshots, I added periodic memory logging directly in the application. Node.js exposes memory usage metrics like &lt;code&gt;rss&lt;/code&gt;, &lt;code&gt;heapTotal&lt;/code&gt;, &lt;code&gt;heapUsed&lt;/code&gt;, &lt;code&gt;external&lt;/code&gt;, and &lt;code&gt;arrayBuffers&lt;/code&gt;. We logged them every few seconds.&lt;/p&gt;

&lt;p&gt;If you’re a Node.js developer, you probably know what each of these means. If not, here’s a quick rundown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;rss&lt;/code&gt; (Resident Set Size): The total memory allocated for the process execution in main memory, including the heap, stack, and code segments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;heapTotal&lt;/code&gt;: The total size of the allocated memory heap, which is managed by the V8 engine and stores objects, strings, and closures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;heapUsed&lt;/code&gt;: The actual memory currently being used within the &lt;code&gt;heapTotal&lt;/code&gt;. This is often the most relevant parameter for identifying memory leaks in the JavaScript code itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;external&lt;/code&gt;: Memory used by C++ objects that are bound to JavaScript objects managed by V8.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;arrayBuffers&lt;/code&gt;: Memory allocated for &lt;code&gt;ArrayBuffer&lt;/code&gt;s and &lt;code&gt;SharedArrayBuffer&lt;/code&gt;s, which is also included in the &lt;code&gt;external&lt;/code&gt; value&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I added the following code to our codebase to log memory usage over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memoryUsage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`[MEMORY]&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; MB`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our thinking: if streaming were the issue, we would see &lt;code&gt;arrayBuffers&lt;/code&gt; or &lt;code&gt;external&lt;/code&gt; increase.&lt;/p&gt;

&lt;p&gt;But surprisingly, it was &lt;code&gt;heapUsed&lt;/code&gt; that grew, and the pattern was consistent.&lt;/p&gt;

&lt;p&gt;Then suddenly, &lt;code&gt;heapUsed&lt;/code&gt; would spike dramatically, hundreds of megabytes in a short window, and the pod would be killed.&lt;/p&gt;

&lt;p&gt;This ruled out slow leaks and Twilio audio streams. Garbage collection wasn’t failing. Something large was being allocated all at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Eventually, one of our developers noticed something interesting in the logs around the spike.&lt;/p&gt;

&lt;p&gt;A tool call.&lt;/p&gt;

&lt;p&gt;More specifically, a tool call with an empty object as parameters.&lt;/p&gt;

&lt;p&gt;Our AI agents can invoke tools. One of those tools performs a database search and expects a required parameter to filter the collection. What we saw in the logs was an empty object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, it didn’t look dangerous. It was syntactically valid. It didn’t throw an error. The function executed normally.&lt;/p&gt;

&lt;p&gt;But that empty object changed everything. After receiving the params, we used them directly to query the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our collection contained around one million documents.&lt;/p&gt;

&lt;p&gt;When you execute &lt;code&gt;Model.find({})&lt;/code&gt; in MongoDB, you are not asking for nothing. You are asking for everything.&lt;/p&gt;

&lt;p&gt;MongoDB did exactly what we requested. It returned all documents.&lt;/p&gt;

&lt;p&gt;The Node.js driver then deserialized those documents into JavaScript objects in memory before our code could process them. That meant potentially hundreds of megabytes being allocated almost instantly.&lt;/p&gt;

&lt;p&gt;Inside a pod limited to 1GB.&lt;/p&gt;

&lt;p&gt;The vertical memory spike finally made sense.&lt;/p&gt;

&lt;p&gt;This wasn’t a memory leak. It wasn’t streaming buffers accumulating. It wasn’t garbage collection lag. It was a full-collection query triggered by an empty filter.&lt;/p&gt;
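&lt;p&gt;A defense-in-depth guard makes this class of failure impossible at the query layer. Below is a minimal sketch of the idea; the &lt;code&gt;buildSafeQuery&lt;/code&gt; helper and the &lt;code&gt;MAX_DOCS&lt;/code&gt; cap are illustrative names, not our actual code.&lt;/p&gt;

```javascript
// Hedged sketch: refuse unbounded queries before they reach the driver.
// buildSafeQuery and MAX_DOCS are illustrative, not our real codebase.
const MAX_DOCS = 100;

function buildSafeQuery(filter) {
  // An empty filter means "the whole collection", so refuse it outright.
  if (!filter || typeof filter !== "object" || Object.keys(filter).length === 0) {
    throw new Error("Refusing unbounded query: empty filter");
  }
  // Even with a valid filter, cap the result set so one bad call
  // can never deserialize an entire collection into memory.
  return { filter, options: { limit: MAX_DOCS } };
}
```

&lt;p&gt;In the real handler, the returned filter and options would then be passed to &lt;code&gt;db.collection.find(filter, options)&lt;/code&gt;.&lt;/p&gt;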

&lt;h2&gt;
  
  
  Why It Was So Hard to Reproduce
&lt;/h2&gt;

&lt;p&gt;It didn’t happen on every call.&lt;/p&gt;

&lt;p&gt;Only one agent had access to that tool. Only certain conversation flows triggered it. Only when the AI decided the tool was relevant. And only when the model generated an empty object instead of a properly populated parameter set.&lt;/p&gt;

&lt;p&gt;Unless that exact probabilistic sequence occurred, the system behaved perfectly.&lt;/p&gt;

&lt;p&gt;Traditional bugs are deterministic. Given the same input, you get the same output.&lt;/p&gt;

&lt;p&gt;AI-integrated systems introduce probabilistic behavior. The model didn’t crash the server directly. It generated a syntactically valid tool call that was semantically unsafe. And we trusted it.&lt;/p&gt;

&lt;p&gt;That trust was the real bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Once understood, the fix was straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7u8b1rvgc5vbcv1arkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7u8b1rvgc5vbcv1arkz.png" alt="GIF" width="384" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We added strict schema validation before executing any tool call. If required parameters were missing, the call was rejected immediately. Empty filters were explicitly disallowed. We chose to fail fast instead of querying blindly.&lt;/p&gt;
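&lt;p&gt;As a rough sketch of what that fail-fast validation looks like (the &lt;code&gt;searchCustomers&lt;/code&gt; tool and its schema here are hypothetical, not our real definitions):&lt;/p&gt;

```javascript
// Hedged sketch of fail-fast tool-call validation.
// "searchCustomers" and its schema are hypothetical examples.
const toolSchemas = {
  searchCustomers: { required: ["query"] },
};

function validateToolCall(toolName, params) {
  const schema = toolSchemas[toolName];
  if (!schema) throw new Error(`Unknown tool: ${toolName}`);
  // Empty parameter objects are rejected before any query runs.
  if (!params || Object.keys(params).length === 0) {
    throw new Error(`Rejected ${toolName}: empty parameters`);
  }
  for (const key of schema.required) {
    if (params[key] === undefined || params[key] === "") {
      throw new Error(`Rejected ${toolName}: missing required "${key}"`);
    }
  }
  return true;
}
```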

&lt;p&gt;There was no infrastructure change. No scaling adjustment. No tuning of garbage collection.&lt;/p&gt;

&lt;p&gt;Just validation.&lt;/p&gt;

&lt;p&gt;After that, the crashes stopped. It took us almost 24 hours to find and fix.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>sre</category>
    </item>
    <item>
      <title>2025 Voice AI Guide How to Make Your Own Real-Time Voice Agent (Part-3)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 11 Jan 2026 12:25:50 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-3-3ocb</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-3-3ocb</guid>
<description>&lt;p&gt;Welcome back! The waiting is over. In Part 3, we’ll see how to run the components of our voice agent locally, even on a CPU. At the end, you’ll get homework: integrating all of these pieces into generic code that runs locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Reality: Setting Expectations with Latency Budgets
&lt;/h2&gt;

&lt;p&gt;Before we dive into running components, &lt;strong&gt;you need to understand what "fast" actually means in voice AI&lt;/strong&gt;. Industry benchmarks show that users perceive natural conversation when end-to-end latency (time from user finishing speaking to hearing the agent's response) is &lt;strong&gt;under 800ms&lt;/strong&gt;, with the gold standard being &lt;strong&gt;under 500ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's break down where those milliseconds go:&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Budget Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Target Latency&lt;/th&gt;
&lt;th&gt;Upper Limit&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200-350ms&lt;/td&gt;
&lt;td&gt;500ms&lt;/td&gt;
&lt;td&gt;Measured from silence detection to final transcript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Time-to-First-Token (TTFT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100-200ms&lt;/td&gt;
&lt;td&gt;400ms&lt;/td&gt;
&lt;td&gt;First token generation (not full response)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text-to-Speech TTFB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75-150ms&lt;/td&gt;
&lt;td&gt;250ms&lt;/td&gt;
&lt;td&gt;Time to first byte of audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network &amp;amp; Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50-100ms&lt;/td&gt;
&lt;td&gt;150ms&lt;/td&gt;
&lt;td&gt;WebSocket hops, service-to-service handoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Mouth-to-Ear Gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500-800ms&lt;/td&gt;
&lt;td&gt;1100ms&lt;/td&gt;
&lt;td&gt;Complete turn latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: If your STT alone takes 500ms, you've already exhausted most of your latency budget. This is why model choice and orchestration matter a lot.&lt;/p&gt;

&lt;p&gt;If you want more depth on latency and related topics, check out the Pipecat article &lt;a href="https://voiceaiandvoiceagents.com/" rel="noopener noreferrer"&gt;Conversational Voice AI in 2025&lt;/a&gt;, which covers them in depth.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;local inference on CPU/modest GPU&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expect 1.2-1.5s latency for the first response&lt;/li&gt;
&lt;li&gt;Subsequent turns may hit 800-1000ms as models warm up&lt;/li&gt;
&lt;li&gt;This is acceptable for local development; production requires better hardware or cloud providers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hardware Reality: CPU vs GPU
&lt;/h2&gt;

&lt;p&gt;Before we run anything, we need to address the elephant in the room: &lt;strong&gt;Computation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do models crave GPUs?
&lt;/h3&gt;

&lt;p&gt;AI models are essentially giant math problems involving billions of matrix multiplications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPUs&lt;/strong&gt; are like a Ferrari: insanely fast at doing one or two complex things at a time (Sequential Processing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPUs&lt;/strong&gt; are like a bus service: slower at individual tasks, but can transport thousands of people (numbers) at once (Parallel Processing).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since neural networks need to calculate billions of numbers simultaneously, GPUs are exponentially faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"But I only have a CPU!"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't worry. We can still run these models using a technique called &lt;strong&gt;Quantization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Standard models use 16-bit floating-point numbers (e.g., &lt;code&gt;3.14159...&lt;/code&gt;). Quantization rounds these down to 4-bit or 8-bit integers (e.g., &lt;code&gt;3&lt;/code&gt;). This drastically reduces the size of the model and makes the math simple enough for a CPU to handle reasonably well, though it will practically always be slower than a GPU.&lt;/p&gt;
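&lt;p&gt;The savings are easy to estimate with back-of-envelope arithmetic: a model’s footprint is roughly its parameter count times the bytes per parameter (ignoring activations and runtime overhead).&lt;/p&gt;

```javascript
// Back-of-envelope model footprint: parameters x bytes per parameter.
// Ignores activations, KV cache, and runtime overhead.
function modelSizeGB(paramsBillions, bitsPerParam) {
  return (paramsBillions * 1e9 * (bitsPerParam / 8)) / 1e9;
}

// A 7B model drops from 14 GB at 16-bit to 3.5 GB at 4-bit quantization
```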

&lt;h3&gt;
  
  
  Minimum System Requirements for Local Voice Agents
&lt;/h3&gt;

&lt;p&gt;Here's what you actually need to get started:&lt;/p&gt;

&lt;h4&gt;
  
  
  For Development (CPU-Only)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4-core modern processor (Intel i5/AMD Ryzen 5)&lt;/td&gt;
&lt;td&gt;8-core or better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50GB SSD&lt;/td&gt;
&lt;td&gt;100GB NVMe SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None required&lt;/td&gt;
&lt;td&gt;NVIDIA GTX 1070 or better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5-2.5s per turn&lt;/td&gt;
&lt;td&gt;800-1200ms per turn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  For Production (GPU-Accelerated)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Entry&lt;/th&gt;
&lt;th&gt;Mid-Range&lt;/th&gt;
&lt;th&gt;High-Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 3060 (12GB)&lt;/td&gt;
&lt;td&gt;RTX 3080 (10GB)&lt;/td&gt;
&lt;td&gt;RTX 4090 (24GB) or Tesla A100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-12GB&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;24GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;System RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;td&gt;64GB&lt;/td&gt;
&lt;td&gt;128GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-core (Intel i7/Ryzen 7)&lt;/td&gt;
&lt;td&gt;16-core&lt;/td&gt;
&lt;td&gt;32-core workstation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency Target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;800-1000ms&lt;/td&gt;
&lt;td&gt;500-700ms&lt;/td&gt;
&lt;td&gt;&amp;lt;500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 2x VRAM Rule&lt;/strong&gt;: Your system RAM should be &lt;strong&gt;at least double your total GPU VRAM&lt;/strong&gt;. If you have a single RTX 3080 (10GB), you need at least 20GB of system RAM; 32GB+ is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Text (STT)
&lt;/h2&gt;

&lt;p&gt;First, we are going to see how to run the STT component. As mentioned in Part 1, we are using Whisper from OpenAI. But before we blindly pick a model, we need to know what to look for.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Hearing: STT Selection Criteria
&lt;/h3&gt;

&lt;p&gt;When selecting a Speech-to-Text model for production, "it works" isn't enough. You need to verify specific metrics to ensure it won't break your conversational flow.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Word Error Rate (WER)
&lt;/h4&gt;

&lt;p&gt;This is the cornerstone accuracy metric. It calculates the percentage of incorrect words.&lt;br&gt;
&lt;strong&gt;Formula&lt;/strong&gt;: &lt;code&gt;WER = (Substitutions + Deletions + Insertions) / Total Words&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Pro systems aim for &lt;strong&gt;5-10% WER&lt;/strong&gt; (90-95% accuracy).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality Check&lt;/strong&gt;: For casual voice chats, anything under &lt;strong&gt;15-20%&lt;/strong&gt; is often acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Matters&lt;/strong&gt;: A "digit recognition" task might have 0.3% WER, while "broadcast news" might have 15%. Don’t blindly trust paper benchmarks; test on &lt;em&gt;your&lt;/em&gt; audio.&lt;/li&gt;
&lt;/ul&gt;
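&lt;p&gt;The formula is simple enough to sketch directly; the edit counts here are assumed to come from a word-level alignment step (e.g. Levenshtein), which is out of scope.&lt;/p&gt;

```javascript
// WER = (substitutions + deletions + insertions) / total reference words.
// Edit counts are assumed to come from a word-level alignment step.
function wordErrorRate(substitutions, deletions, insertions, totalWords) {
  return (substitutions + deletions + insertions) / totalWords;
}

// 3 substitutions + 1 deletion + 1 insertion over 50 words gives 0.1 (10% WER)
```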
&lt;h4&gt;
  
  
  2. Latency &amp;amp; Real-Time Factor (RTF)
&lt;/h4&gt;

&lt;p&gt;Speed is more than just feeling fast; it's about physics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to First Byte (TTFB)&lt;/strong&gt;: Time from "speech start" to "partial transcript". Target &lt;strong&gt;&amp;lt;300ms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Factor (RTF)&lt;/strong&gt;: &lt;code&gt;Processing Time / Audio Duration&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;If RTF &amp;gt; 1.0, the system is slower than real-time (impossible for live agents).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target&lt;/strong&gt;: You want an RTF of &lt;strong&gt;0.5 or lower&lt;/strong&gt; (processing 10s of audio in 5s) to handle overheads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Flush Trick"&lt;/strong&gt;: Advanced pipelines don't wait. When VAD detects silence, they "flush" the buffer immediately, cutting latency from ~500ms to ~125ms.&lt;/li&gt;
&lt;/ul&gt;
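&lt;p&gt;RTF is just a ratio, so it’s easy to compute from your own measurements:&lt;/p&gt;

```javascript
// RTF = processing time / audio duration. Below 1.0 is faster than real-time.
function realTimeFactor(processingSeconds, audioSeconds) {
  return processingSeconds / audioSeconds;
}

// Processing 10s of audio in 5s gives RTF 0.5, within the target
```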
&lt;h4&gt;
  
  
  3. Noise Robustness &amp;amp; SNR
&lt;/h4&gt;

&lt;p&gt;Lab audio is clean; user audio is messy. Performance drops sharply when &lt;strong&gt;Signal-to-Noise Ratio (SNR)&lt;/strong&gt; falls below 3dB.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Talking" Noise&lt;/strong&gt;: Background chatter usually doesn't break modern models like Whisper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Crowded" Noise&lt;/strong&gt;: Train stations or cafes are the hardest tests. If your users are mobile, prioritize noise-robust models (like &lt;code&gt;distil-whisper&lt;/code&gt;) over pure accuracy models.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  4. Critical Features for Agents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speaker Diarization&lt;/strong&gt;: "Who spoke when?" Essential if you want your agent to talk to multiple people, though it adds latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Punctuation &amp;amp; Capitalization&lt;/strong&gt;: Raw STT is lowercase streams (&lt;code&gt;hello world&lt;/code&gt;). Good models add punctuation (&lt;code&gt;Hello, world.&lt;/code&gt;) which is &lt;strong&gt;critical&lt;/strong&gt; for the LLM to understand semantics and mood.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Model Selection for Real-Time Performance
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;faster-whisper&lt;/code&gt;, we use the &lt;code&gt;Systran/faster-distil-whisper-medium.en&lt;/code&gt; checkpoint from Hugging Face, but feel free to explore others:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model name&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Real-Time Factor (RTF)*&lt;/th&gt;
&lt;th&gt;Typical use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tiny&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.05 (20x real-time)&lt;/td&gt;
&lt;td&gt;Very fast, rough drafts, low-end CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;tiny.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.08 (12x real-time)&lt;/td&gt;
&lt;td&gt;Fast English-only STT with small footprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.15 (6.5x real-time)&lt;/td&gt;
&lt;td&gt;Better than tiny, still lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;base.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.20 (5x real-time)&lt;/td&gt;
&lt;td&gt;Accurate English with low compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;244M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.35 (2.8x real-time)&lt;/td&gt;
&lt;td&gt;Good balance of speed and quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;small.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;244M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.40 (2.5x real-time)&lt;/td&gt;
&lt;td&gt;Higher-quality English on moderate hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;distil-medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;394M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.25 (4x real-time)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best local balance&lt;/strong&gt;: 49% smaller, within 1% WER of full medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;769M&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;0.80 (1.25x real-time)&lt;/td&gt;
&lt;td&gt;High accuracy, slower; needs stronger machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;medium.en&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;769M&lt;/td&gt;
&lt;td&gt;English-only&lt;/td&gt;
&lt;td&gt;0.85 (1.17x real-time)&lt;/td&gt;
&lt;td&gt;Very accurate English, heavier compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;large / v2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.55B&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;2.5 (0.4x real-time)&lt;/td&gt;
&lt;td&gt;Best quality older large models, GPU required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;large-v3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.55B&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;3.2 (0.3x real-time)&lt;/td&gt;
&lt;td&gt;Latest, improved multilingual, GPU strongly recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;RTF (Real-Time Factor) = Time to process audio / Length of audio. 0.05 = 20x faster than real-time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation for local voice agents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU-only&lt;/strong&gt;: &lt;code&gt;distil-medium&lt;/code&gt; or &lt;code&gt;small.en&lt;/code&gt; (aim for &amp;lt;300ms latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU with 8GB VRAM&lt;/strong&gt;: &lt;code&gt;medium.en&lt;/code&gt; (aim for 200-250ms latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU with 16GB+ VRAM&lt;/strong&gt;: &lt;code&gt;large-v3&lt;/code&gt; (aim for 150-200ms latency)&lt;/li&gt;
&lt;/ul&gt;
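&lt;p&gt;As a rough sketch, the recommendation tiers above can be encoded in a small helper (the function and tier cutoffs are my own, not an official API; validate against your own latency tests):&lt;/p&gt;

```python
def pick_whisper_model(device, vram_gb=0):
    """Map available hardware to a reasonable Whisper variant.

    The tiers mirror the recommendations above.
    """
    if device == "cpu":
        return "distil-medium"     # or "small.en" if you only need English
    if vram_gb >= 16:
        return "large-v3"          # best accuracy, needs a big GPU
    if vram_gb >= 8:
        return "medium.en"
    return "small.en"              # small GPUs: stay lightweight

print(pick_whisper_model("cpu"))        # distil-medium
print(pick_whisper_model("cuda", 8))    # medium.en
print(pick_whisper_model("cuda", 24))   # large-v3
```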
&lt;h3&gt;
  
  
  The Interruptibility Problem: Barge-In and VAD
&lt;/h3&gt;

&lt;p&gt;Here's something rarely discussed openly: &lt;strong&gt;VAD isn't just for silence detection; it's also a critical component of interruption handling (barge-in).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user speaks while your agent is talking, three things must happen instantly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Echo Cancellation (AEC)&lt;/strong&gt;: Remove your agent's voice from the audio stream so the STT doesn't get confused hearing itself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;: Detect the user speaking (probability-based, not just volume threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate TTS Cancellation&lt;/strong&gt;: Stop the agent from continuing mid-sentence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Typical barge-in detection requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD Latency&lt;/strong&gt;: 85-100ms (using models like Silero VAD, which is a neural, probability-based detector rather than a simple energy threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in Stop Latency&lt;/strong&gt;: &amp;lt;200ms (system must stop speaking within 200ms of user interruption for natural feel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: 95%+ (must not false-trigger on background noise)&lt;/li&gt;
&lt;/ul&gt;
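&lt;p&gt;The barge-in logic above can be sketched as a tiny state machine (pure Python, hypothetical names; a real pipeline wires this into the VAD callback):&lt;/p&gt;

```python
class BargeInController:
    """Cancel TTS playback as soon as VAD reports user speech."""

    def __init__(self, speech_threshold=0.5):
        self.speech_threshold = speech_threshold  # VAD speech-probability cutoff
        self.tts_playing = False
        self.cancelled = False

    def start_tts(self):
        self.tts_playing = True
        self.cancelled = False

    def on_vad_frame(self, speech_probability):
        """Called every 10-20ms with the VAD's speech probability for that frame."""
        if self.tts_playing and speech_probability >= self.speech_threshold:
            # User barged in: stop speaking immediately (target under 200ms).
            self.tts_playing = False
            self.cancelled = True

ctrl = BargeInController()
ctrl.start_tts()
ctrl.on_vad_frame(0.1)   # background noise: keep talking
ctrl.on_vad_frame(0.9)   # user speech detected: cancel TTS
print(ctrl.cancelled)    # True
```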

&lt;p&gt;Without proper barge-in handling, your voice agent sounds robotic because users can't interrupt; they must wait for the full response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's better: simple energy-based VAD that misses some speech, or Silero VAD that uses neural networks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Silero VAD&lt;/strong&gt;, which has built-in support in Pipecat, so there's little to configure; it handles both CPU and GPU automatically. It's trained to estimate "speech probability" rather than just volume, so it handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whispers and soft speech&lt;/li&gt;
&lt;li&gt;Background noise (doesn't trigger on dog barks)&lt;/li&gt;
&lt;li&gt;Different accents and speech patterns&lt;/li&gt;
&lt;li&gt;Real-time streaming (10-20ms window processing)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to run STT
&lt;/h3&gt;

&lt;p&gt;To serve this, we need a server or inference engine. While &lt;code&gt;faster-whisper&lt;/code&gt; ships as a library, we need a server-style architecture (similar to Deepgram) where we connect to a WebSocket server, send audio, and receive text. I have written a simple WebSocket server that runs the model on either CPU or GPU.&lt;/p&gt;

&lt;p&gt;I have dockerized everything to make our lives easier.&lt;/p&gt;

&lt;p&gt;All the code for this component is located in &lt;code&gt;code/Models/STT&lt;/code&gt;. Let's look at what's inside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;server.py&lt;/code&gt;: The heart of the STT. It starts a &lt;strong&gt;WebSocket server&lt;/strong&gt; that receives audio chunks, runs them through the Whisper model, and streams back text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;download_model.py&lt;/code&gt;: A helper script to download the specific &lt;code&gt;faster-whisper&lt;/code&gt; model weights from HuggingFace.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-gpu.dockerfile&lt;/code&gt;: The environment setup for NVIDIA GPU users (installs CUDA drivers).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-cpu.dockerfile&lt;/code&gt;: The environment for CPU users (lighter setup).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;WebSocket Connection&lt;/strong&gt;: We use WebSockets instead of REST API because we need a persistent connection to stream audio continuously.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Audio Chunking&lt;/strong&gt;: The client (your browser/mic) records audio and chops it into small "chunks" (bytes).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming&lt;/strong&gt;: These chunks are sent over the WebSocket instantly.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Processing&lt;/strong&gt;: The server receives these raw bytes (usually Int16 format), converts them to floating-point numbers (Float32), and feeds them into the Whisper model.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;: The server listens to your audio stream. When it detects silence (you stopped speaking), it commits the transcription and sends it out.&lt;/li&gt;
&lt;/ol&gt;
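&lt;p&gt;Step 4 (converting Int16 bytes to Float32) is a one-liner in practice. Here is a stdlib-only sketch (production code would typically use numpy; this assumes native little-endian byte order):&lt;/p&gt;

```python
import array

def int16_bytes_to_float32(chunk):
    """Convert PCM Int16 samples to floats in [-1.0, 1.0] for the model."""
    samples = array.array("h")          # "h" = signed 16-bit integers
    samples.frombytes(chunk)
    return [s / 32768.0 for s in samples]

# A chunk containing the samples 0, 16384, -32768:
chunk = array.array("h", [0, 16384, -32768]).tobytes()
print(int16_bytes_to_float32(chunk))    # [0.0, 0.5, -1.0]
```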

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Imagine you say &lt;strong&gt;"Hello Agent"&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Your microphone captures 1 second of audio.&lt;/li&gt;
&lt;li&gt; The browser slices this into 20 tiny audio packets and shoots them to the server one by one.&lt;/li&gt;
&lt;li&gt; The Server processes them in real-time. It hears "He...", then "Hello...", then "Hello A...".&lt;/li&gt;
&lt;li&gt; You stop talking. The VAD logic sees 500ms of silence.&lt;/li&gt;
&lt;li&gt; It shouts "STOP!" and sends the final text &lt;code&gt;"Hello Agent"&lt;/code&gt; to the next step.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;On GPU (Recommended):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; docker-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; stt-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 stt-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;On CPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; docker-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; stt-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 stt-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Large Language Model (LLM)
&lt;/h2&gt;

&lt;p&gt;Next, we need a brain. But before we just pick "Llama 3", we need to understand the physics of running a brain on your computer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Thinking: LLM Selection Criteria
&lt;/h3&gt;

&lt;p&gt;Choosing an LLM for voice isn't about choosing the smartest one; it's about choosing the one that fits.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The VRAM Formula
&lt;/h4&gt;

&lt;p&gt;Will it fit? Don't guess. Use the math.&lt;br&gt;
&lt;strong&gt;Formula&lt;/strong&gt;: &lt;code&gt;VRAM (GB) ≈ Params (Billions) × Precision (Bytes) × 1.2 (Overhead)&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision Refresher&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FP16 (16-bit)&lt;/strong&gt;: 2 Bytes/param. (The standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT8 (8-bit)&lt;/strong&gt;: 1 Byte/param. (50% smaller than standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT4 (4-bit)&lt;/strong&gt;: 0.5 Bytes/param. (The sweet spot for local setups).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Calculation (Llama 3 8B)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;@ FP16: &lt;code&gt;8 × 2 × 1.2&lt;/code&gt; = &lt;strong&gt;19.2 GB&lt;/strong&gt; (Needs A100/3090/4090)&lt;/li&gt;
&lt;li&gt;@ INT4: &lt;code&gt;8 × 0.5 × 1.2&lt;/code&gt; = &lt;strong&gt;4.8 GB&lt;/strong&gt; (Runs on almost any modern GPU/Laptop!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: Context window (KV Cache) adds variable memory. 8K context is usually +1GB.&lt;/em&gt;&lt;/p&gt;
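&lt;p&gt;The formula translates directly to code; a quick sketch:&lt;/p&gt;

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Estimate weight VRAM: params x precision x overhead. KV cache is extra."""
    return params_billions * bytes_per_param * overhead

# Llama 3 8B at different precisions:
print(vram_gb(8, 2.0))   # FP16: 19.2 GB
print(vram_gb(8, 1.0))   # INT8: 9.6 GB
print(vram_gb(8, 0.5))   # INT4: 4.8 GB
```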
&lt;h4&gt;
  
  
  2. Throughput vs. Latency
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens Per Second (TPS)&lt;/strong&gt;: How fast it reads/generates.

&lt;ul&gt;
&lt;li&gt;Humans read/listen at ~4 TPS.&lt;/li&gt;
&lt;li&gt;&amp;gt; 8 TPS is diminishing returns for voice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time To First Token (TTFT)&lt;/strong&gt;: &lt;strong&gt;This is the King metric&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;Sub-200ms = Instant.&lt;/li&gt;
&lt;li&gt;&amp;gt; 2s = "Is it broken?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Optimize for TTFT, not max throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  3. Benchmarks That Actually Matter
&lt;/h4&gt;

&lt;p&gt;Don't just look at the leaderboard. Look at the right columns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MMLU&lt;/strong&gt;: General knowledge. Good baseline, but vague.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IFEval (Instruction Following)&lt;/strong&gt;: &lt;strong&gt;Crucial for Agents&lt;/strong&gt;. Can it follow your system prompt instructions? Current small models (~2B) are getting good at this (80%+).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSM8K&lt;/strong&gt;: Logic/Math. Good proxy for "reasoning" capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a local voice agent, a &lt;strong&gt;high IFEval&lt;/strong&gt; score is often more valuable than a high MMLU score because if the agent ignores your "Keep responses short" instruction, the user experience fails.&lt;/p&gt;
&lt;h3&gt;
  
  
  Inference Engines
&lt;/h3&gt;

&lt;p&gt;To run a model locally, we need an &lt;strong&gt;Inference Engine&lt;/strong&gt;. If you search Google, you will find many options. Here are a few popular ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Primary Use&lt;/th&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Quantization Support&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local single-machine LLM serving&lt;/td&gt;
&lt;td&gt;CPU, GPU (NVIDIA, Apple Metal)&lt;/td&gt;
&lt;td&gt;GGUF (Q4, Q5, Q8)&lt;/td&gt;
&lt;td&gt;Local dev, prototypes, low traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU-optimized inference&lt;/td&gt;
&lt;td&gt;CPU (x86, ARM), GPU&lt;/td&gt;
&lt;td&gt;GGUF (Q2-Q8, AWQ, IQ2-IQ4)&lt;/td&gt;
&lt;td&gt;Resource-constrained, edge devices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-throughput production LLM serving&lt;/td&gt;
&lt;td&gt;NVIDIA GPU, AMD, Intel&lt;/td&gt;
&lt;td&gt;INT8, FP8, FP16, AWQ, GPTQ&lt;/td&gt;
&lt;td&gt;Production APIs, high concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum NVIDIA performance&lt;/td&gt;
&lt;td&gt;NVIDIA GPU only (CC &amp;gt;= 7.0)&lt;/td&gt;
&lt;td&gt;INT8, FP16, FP8 (H100+)&lt;/td&gt;
&lt;td&gt;Ultra-low latency, NVIDIA-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SGLang&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-throughput production LLM serving&lt;/td&gt;
&lt;td&gt;NVIDIA GPU, AMD, Intel&lt;/td&gt;
&lt;td&gt;FP16, INT8&lt;/td&gt;
&lt;td&gt;Research, RadixAttention, multi-turn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this list, we are going to use &lt;strong&gt;SGLang&lt;/strong&gt; to run our model on GPU; for CPU, we can go with &lt;strong&gt;Ollama&lt;/strong&gt;, which is very simple and easy to set up.&lt;/p&gt;

&lt;p&gt;We are using &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt;, one of the strongest small open-source models currently available.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why TTFT (Time-to-First-Token) Is What Matters
&lt;/h3&gt;

&lt;p&gt;When users wait for a response, what they perceive is &lt;strong&gt;how long until they hear the first word&lt;/strong&gt;. Here's why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefill Phase&lt;/strong&gt;: Model processes your entire prompt (100-500ms for 8B models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoding Phase&lt;/strong&gt;: Model generates one token at a time, streams it immediately to TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight&lt;/strong&gt;: TTS can start speaking as soon as token #1 arrives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if your TTFT is 150ms, users hear the first word in 150ms + TTS latency (75-150ms) = &lt;strong&gt;225-300ms total&lt;/strong&gt;. The full response might take 5 seconds to complete, but the user hears audio within 300ms.&lt;/p&gt;

&lt;p&gt;This is why raw token generation speed (throughput) matters less than TTFT in conversational AI.&lt;/p&gt;
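&lt;p&gt;The arithmetic above, as a sketch (the function name is my own):&lt;/p&gt;

```python
def time_to_first_audio_ms(ttft_ms, tts_ttfb_ms):
    """What the user perceives: LLM first-token latency plus TTS time-to-first-byte."""
    return ttft_ms + tts_ttfb_ms

print(time_to_first_audio_ms(150, 75))    # 225  -- feels instant
print(time_to_first_audio_ms(150, 150))   # 300  -- still conversational
print(time_to_first_audio_ms(2000, 100))  # 2100 -- "is it broken?"
```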
&lt;h3&gt;
  
  
  Folder Structure
&lt;/h3&gt;

&lt;p&gt;Code location: &lt;code&gt;code/Models/LLM&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama-gpu.dockerfile&lt;/code&gt;: Setup for vLLM or SGLang (GPU).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-cpu.dockerfile&lt;/code&gt;: Setup for Ollama (CPU).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;p&gt;The LLM server isn't just a text-in/text-out box. It handles queuing and batching to keep up.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Request Queue&lt;/strong&gt;: Your prompt enters a waiting line.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Batching&lt;/strong&gt;: The server groups your request with others (if any).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prefill&lt;/strong&gt;: It processes your input text (Prompt) to understand the context.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Decoding (Token by Token)&lt;/strong&gt;: It generates one word-part (token) at a time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming&lt;/strong&gt;: As soon as a token is generated, it is sent back. It doesn't wait for the full sentence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Input: &lt;strong&gt;"What is 2+2?"&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Tokenizer&lt;/strong&gt;: Converts text to numbers &lt;code&gt;[123, 84, 99]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inference&lt;/strong&gt;: The model calculates the most likely next number.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 1&lt;/strong&gt;: Generates &lt;code&gt;"It"&lt;/code&gt;. Sends it immediately.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 2&lt;/strong&gt;: Generates &lt;code&gt;"is"&lt;/code&gt;. Sends it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Token 3&lt;/strong&gt;: Generates &lt;code&gt;"4"&lt;/code&gt;. Sends it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;End&lt;/strong&gt;: Sends &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; (End of Sequence).&lt;/li&gt;
&lt;/ol&gt;
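&lt;p&gt;The decode-and-stream loop above can be simulated in a few lines (pure Python; a real server would yield tokens from the model rather than a hard-coded list):&lt;/p&gt;

```python
def fake_decode():
    """Stand-in for the decoding phase: yields one token at a time."""
    for token in ["It", " is", " 4", "EOS"]:
        yield token

def stream_response():
    """Forward each token downstream the moment it is generated."""
    for token in fake_decode():
        if token == "EOS":
            break                 # end of sequence: stop streaming
        yield token               # TTS can start on the very first token

print("".join(stream_response()))   # It is 4
```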
&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. On GPU (using SGLang/vLLM):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; llama-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; llm-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 30000:30000 llm-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: This exposes an OpenAI-compatible endpoint at port 30000.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. On CPU (using Ollama):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Easy method: Just install Ollama from ollama.com&lt;/span&gt;
ollama run llama3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Or using our dockerfile:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; llama-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; llm-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 llm-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
Text-to-Speech (TTS)
&lt;/h2&gt;

&lt;p&gt;Finally, for the &lt;strong&gt;Mouth&lt;/strong&gt;, we use &lt;strong&gt;&lt;a href="https://huggingface.co/hexgrad/Kokoro-82M" rel="noopener noreferrer"&gt;Kokoro&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Blueprints of Speaking: TTS Selection Criteria
&lt;/h3&gt;

&lt;p&gt;Evaluating a "Mouth" is tricky because it's both objective (speed) and subjective (beauty).&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Latency &amp;amp; Real-Time Factor
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFB (Time To First Byte)&lt;/strong&gt;: How fast does the first sound play?

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;100ms&lt;/strong&gt;: The Gold Standard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;300ms&lt;/strong&gt;: Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;500ms&lt;/strong&gt;: Breaks immersion.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Real-Time Factor (RTF)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Anything &lt;strong&gt;&amp;lt; 0.1&lt;/strong&gt; (generating 10s audio in 1s) is amazing.&lt;/li&gt;
&lt;li&gt;Production systems target &lt;strong&gt;&amp;lt; 0.5&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Human Quality Metrics (MOS)
&lt;/h4&gt;

&lt;p&gt;There isn't a "perfect" score, but we use &lt;strong&gt;Mean Opinion Score (MOS)&lt;/strong&gt; (rated 1-5 by humans).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4.0 - 5.0&lt;/strong&gt;: Near Human. (Modern models like Kokoro/ElevenLabs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.5&lt;/strong&gt;: "Robot Voice". (Old school accessibility TTS).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Naturalness &amp;amp; Prosody
&lt;/h4&gt;

&lt;p&gt;"Prosody" is the rhythm and intonation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Awareness&lt;/strong&gt;: Does it raise its pitch at a question mark? Does it pause for a period?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSML Support&lt;/strong&gt;: Can you control it? (e.g. &lt;code&gt;&amp;lt;break time="500ms"/&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;emphasis&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Cloning&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Shot&lt;/strong&gt;: 3s audio clip -&amp;gt; new voice. (Good for dynamic users).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Tuned&lt;/strong&gt;: 3-5 hours of audio training. (Necessary for branded, professional voices).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Critical: TTS Context Window &amp;amp; Streaming
&lt;/h3&gt;

&lt;p&gt;Here's a nuance many developers miss: &lt;strong&gt;TTS models like Kokoro need context windows to avoid sounding robotic when receiving partial text&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Without Context Awareness:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM sends: "It"     → Kokoro generates audio for just "It" → sounds like grunt
LLM sends: "is"     → Kokoro generates audio for just "is" → new voice, disconnected
LLM sends: "4"      → Kokoro generates audio for just "4" → jumpy prosody
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Solution: Context Window in Streaming TTS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM sends: "It"     → Kokoro waits (buffering)
LLM sends: "is"     → Kokoro now has "It is" → generates better prosody
LLM sends: "4"      → Kokoro has "It is 4" → natural cadence
OR, Kokoro predicts: "wait for punctuation before speaking"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kokoro uses a &lt;strong&gt;250-word context window&lt;/strong&gt; internally. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It buffers incoming tokens until it reaches punctuation (&lt;code&gt;.&lt;/code&gt;, &lt;code&gt;!&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, or a configurable threshold)&lt;/li&gt;
&lt;li&gt;Once it has enough context, it generates audio with proper intonation&lt;/li&gt;
&lt;li&gt;As more text arrives, it streams the audio bytes back without waiting for the full response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why &lt;strong&gt;Kokoro excels at streaming&lt;/strong&gt;: it doesn't try to speak partial fragments; it waits just enough to sound natural.&lt;/p&gt;
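&lt;p&gt;The buffering behaviour is easy to sketch (pure Python; Kokoro's internals differ, this just models flush-on-punctuation with a size cap):&lt;/p&gt;

```python
class SentenceBuffer:
    """Accumulate LLM tokens; flush on punctuation or when the buffer gets long."""

    PUNCTUATION = (".", "!", "?")

    def __init__(self, max_tokens=64):
        self.max_tokens = max_tokens
        self.tokens = []

    def push(self, token):
        """Returns text ready to synthesize, or None if we should keep buffering."""
        self.tokens.append(token)
        if token.strip().endswith(self.PUNCTUATION) or len(self.tokens) >= self.max_tokens:
            text, self.tokens = "".join(self.tokens), []
            return text
        return None

buf = SentenceBuffer()
for t in ["It", " is", " 4"]:
    assert buf.push(t) is None     # no punctuation yet: keep buffering
print(buf.push("."))               # It is 4.
```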

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM stream: "Let me think... " (no punctuation yet)
  └─ Kokoro buffers silently

LLM stream: "Let me think... 2+2 equals 4." (full sentence)
  └─ Kokoro now has context → generates natural speech with correct stress
  └─ Streams audio back in chunks (50-100ms windows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll also use the &lt;strong&gt;Kokoro library&lt;/strong&gt; and build a &lt;strong&gt;server to expose it as a service&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Folder Structure
&lt;/h3&gt;

&lt;p&gt;Code location: &lt;code&gt;code/Models/TTS/Kokoro&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;server.py&lt;/code&gt;: Takes text input and streams out audio bytes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;download_model.py&lt;/code&gt;: Fetches the model weights (&lt;code&gt;v0_19&lt;/code&gt; weights).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kokoro-gpu.dockerfile&lt;/code&gt;: GPU setup (Requires NVIDIA container toolkit).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kokoro-cpu.dockerfile&lt;/code&gt;: CPU setup (Works on standard laptops).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like a minimal Kokoro-FastAPI server implementation, you can check it out &lt;a href="https://github.com/programmerraja/Kokoro-FastAPI" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Flow
&lt;/h3&gt;

&lt;p&gt;The TTS server receives a stream of text tokens from the LLM. It immediately starts converting them to Phonemes (sound units) and generating audio. It streams this audio back to the user &lt;em&gt;before&lt;/em&gt; the LLM has even finished the sentence. This &lt;strong&gt;Streaming Pipeline&lt;/strong&gt; is crucial for low latency and natural feel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Buffering&lt;/strong&gt;: TTS receives token #1 from LLM. Checks if it's punctuation.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If no punctuation: buffer and wait for more tokens.&lt;/li&gt;
&lt;li&gt;If punctuation or buffer size &amp;gt; 64 tokens: proceed.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phonemization&lt;/strong&gt;: Convert buffered text to phonetic units (e.g., "Hello" → &lt;code&gt;/həˈloʊ/&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Inference&lt;/strong&gt;: Kokoro generates audio features (mel-spectrogram) from phonemes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Waveform Generation&lt;/strong&gt;: iSTFTNet vocoder converts mel-spec to raw audio bytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;: Audio chunks (50-100ms windows) stream back immediately over WebSocket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;: As LLM sends token #2, buffer grows, phonemization updates, new audio generates.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;:&lt;br&gt;
Input Stream: &lt;strong&gt;"It"&lt;/strong&gt; → &lt;strong&gt;"is"&lt;/strong&gt; → &lt;strong&gt;"4"&lt;/strong&gt; → &lt;strong&gt;"."&lt;/strong&gt; (with timestamps)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0ms:   LLM sends "It"
         Kokoro: "No punctuation, buffering..."

T=150ms: LLM sends " is"
         Kokoro: "Still buffering: 'It is'"

T=300ms: LLM sends " 4"
         Kokoro: "Still buffering: 'It is 4'"

T=400ms: LLM sends "."
         Kokoro: "Got punctuation! Phonemize: 'ɪt ɪz fɔr'"
         → Infer mel-spec (100ms)
         → Vocoder (50ms)
         → Stream chunk #1 (40ms audio) at T=550ms ✓ User hears "It"

T=550ms: More tokens arrive, regenerate from updated context "It is 4."
         → Refined mel-spec (includes proper prosody now)
         → Stream chunk #2 at T=650ms ✓ User hears "is"
         → Stream chunk #3 at T=750ms ✓ User hears "4"

Total latency: ~550ms to first audio, streaming continues until EOS token.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Benchmarks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Model Size&lt;/th&gt;
&lt;th&gt;TTFB&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU (Intel i7, 32GB RAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;500-800ms&lt;/td&gt;
&lt;td&gt;3-11x RT&lt;/td&gt;
&lt;td&gt;Suitable for dev&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU (RTX 3060, 12GB VRAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;97-150ms&lt;/td&gt;
&lt;td&gt;100x RT&lt;/td&gt;
&lt;td&gt;Production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU (RTX 4090, 24GB VRAM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro 82M&lt;/td&gt;
&lt;td&gt;40-60ms&lt;/td&gt;
&lt;td&gt;210x RT&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantized (4-bit)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kokoro INT4&lt;/td&gt;
&lt;td&gt;200-300ms&lt;/td&gt;
&lt;td&gt;8-15x RT&lt;/td&gt;
&lt;td&gt;Good balance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. On GPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; kokoro-gpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; tts-gpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 8880:8880 tts-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. On CPU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-f&lt;/span&gt; kokoro-cpu.dockerfile &lt;span class="nt"&gt;-t&lt;/span&gt; tts-cpu &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8880:8880 tts-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting It Together: End-to-End Latency
&lt;/h2&gt;

&lt;p&gt;Now that we understand each component, here's what your full local pipeline looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic Local Performance (8B LLM + Kokoro + Whisper on RTX 3060)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User speaks: "What is 2+2?"
    ↓
STT (faster-distil-whisper-medium)     : 200ms ✓
LLM (Llama 3.1 8B, TTFT)               : 120ms ✓
    └─ Token 1 "It" available at 120ms
    ↓
TTS (Kokoro buffering for punctuation) : 400ms ✓
    └─ Buffering tokens until "4." (takes ~300ms for full sentence)
    └─ Phonemization + inference: 100ms
    ↓
Streaming audio starts back to user    : 120 + 400 = 520ms ✓
User hears first word "It"

Subsequent tokens stream in background:
    Token 2 "is" available at 180ms    → Audio generated in parallel
    Token 3 "4" available at 250ms     → User hears full "It is 4" by 650ms
    Token EOS at 300ms                 → Stop TTS

TOTAL MOUTH-TO-EAR: ~650ms (acceptable for local, within production &amp;lt;800ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare to production APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram STT + GPT-4 + ElevenLabs TTS (cloud)&lt;/strong&gt;: 200-300ms (optimized, lower variance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your local setup&lt;/strong&gt;: 650-800ms (good for dev, acceptable for many use cases)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Homework: Integrate With Pipecat
&lt;/h2&gt;

&lt;p&gt;So now that all three components are up and running, it's your turn to think through how we can integrate them with &lt;strong&gt;Pipecat&lt;/strong&gt; and get a fully local &lt;strong&gt;"Hello World"&lt;/strong&gt; working end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run all three Docker containers (STT, LLM, TTS) locally&lt;/li&gt;
&lt;li&gt;Create a Pipecat pipeline that:

&lt;ul&gt;
&lt;li&gt;Accepts WebSocket audio from client&lt;/li&gt;
&lt;li&gt;Sends to STT server (port 8000)&lt;/li&gt;
&lt;li&gt;Streams STT output to LLM server (port 30000)&lt;/li&gt;
&lt;li&gt;Streams LLM tokens to TTS server (port 8880)&lt;/li&gt;
&lt;li&gt;Streams TTS audio back to client&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;barge-in handling&lt;/strong&gt;: If user speaks while TTS is playing, cancel TTS and process new input&lt;/li&gt;
&lt;li&gt;Measure latency at each step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tips&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;WebSocket&lt;/code&gt; for non-blocking streaming&lt;/li&gt;
&lt;li&gt;Implement a simple latency meter to log timestamps&lt;/li&gt;
&lt;li&gt;Test with quiet and noisy audio to validate VAD&lt;/li&gt;
&lt;li&gt;Start with synchronous (blocking) calls for simplicity, then optimize&lt;/li&gt;
&lt;/ul&gt;
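&lt;p&gt;The latency-meter tip can be sketched in a few lines. Everything below is a hypothetical helper of my own (the &lt;code&gt;LatencyMeter&lt;/code&gt; class and the stage names are placeholders, not Pipecat APIs):&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Hypothetical helper: a minimal latency meter for timing pipeline stages.
# Stage names like "stt" are placeholders, not Pipecat APIs.
class LatencyMeter:
    def __init__(self):
        self.timings = {}  # stage name -> elapsed milliseconds

    @contextmanager
    def measure(self, stage):
        start = time.perf_counter()
        yield
        self.timings[stage] = (time.perf_counter() - start) * 1000  # ms

    def report(self):
        return {stage: round(ms, 1) for stage, ms in self.timings.items()}

meter = LatencyMeter()
with meter.measure("stt"):
    time.sleep(0.01)  # stand-in for the real STT round trip
print(meter.report())
```

Wrap each stage (STT call, LLM first token, TTS first audio chunk) in its own `measure(...)` block and log the report per turn to see where your budget goes.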

&lt;p&gt;If you'd like to share your implementation, feel free to raise a PR on our GitHub repo here:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;https://github.com/programmerraja/VoiceAgentGuide&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>voiceagent</category>
      <category>llm</category>
      <category>openai</category>
      <category>pipecat</category>
    </item>
    <item>
      <title>Prompt Engineering From First Principles: The Mechanics They Don't Teach You part-1</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 28 Dec 2025 08:42:21 +0000</pubDate>
      <link>https://forem.com/programmerraja/prompt-engineering-from-first-principles-the-mechanics-they-dont-teach-you-part-1-12nb</link>
      <guid>https://forem.com/programmerraja/prompt-engineering-from-first-principles-the-mechanics-they-dont-teach-you-part-1-12nb</guid>
      <description>&lt;p&gt;This is my new series on Prompt Engineering and it's different from everything else out there.&lt;/p&gt;

&lt;p&gt;Most blogs give you templates: "Try this prompt!" or "Use these 10 techniques!" That's not what we're doing here.&lt;/p&gt;

&lt;p&gt;We're going deep: &lt;strong&gt;How do LLMs actually process your prompts? What makes a prompt effective at the mechanical level? Where do LLMs fail and why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This series will give you the mental models to engineer prompts yourself, not just copy someone else's examples. Let's dive in.&lt;/p&gt;

&lt;p&gt;We're going to have five parts (so far; we may add more in the future):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Foundation - How LLMs Really Work&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Art &amp;amp; Science of Prompting&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompting techniques and optimization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt Evaluation and Scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tips, Tricks, and Experience&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's jump into Part 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do LLMs Think Like Humans?
&lt;/h2&gt;

&lt;p&gt;Let me ask you something: &lt;strong&gt;Do you think LLMs are intelligent like humans?&lt;/strong&gt; Do they have a "brain" that understands your questions and thinks through answers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you answered yes, you're wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs don't think. They don't understand. They're just &lt;strong&gt;next-token predictors&lt;/strong&gt;—sophisticated autocomplete machines that guess what word (or rather, "token") should come next based on patterns they've seen before.&lt;/p&gt;

&lt;p&gt;Now, you might be thinking: &lt;em&gt;"Wait, how can simple next-word prediction answer complex questions, write code, or have conversations?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's a great question, and the answer involves some fascinating engineering. But we're not going to dive too deep into the theoretical computer science here—that would make this series endless. We're focusing on &lt;strong&gt;what you need to know to write better prompts&lt;/strong&gt;, nothing more, nothing less.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fned9tmt8d4clwjb2jhrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fned9tmt8d4clwjb2jhrb.png" alt="Preview" width="465" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're really interested in understanding How Machines Learn, check out &lt;a href="https://dev.to/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j"&gt;this detailed post&lt;/a&gt; I've written on the topic.&lt;/p&gt;

&lt;p&gt;Let's start with the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an LLM Process Your Prompt?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kn6apq5y0tgfaoaohm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14kn6apq5y0tgfaoaohm.png" alt="Prompt Preview" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you type a prompt and hit enter, here's the simplified workflow of what happens inside the model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: English Text → Tokens
&lt;/h3&gt;

&lt;p&gt;Your text doesn't go directly into the model. First, it gets broken down into &lt;strong&gt;tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A token is roughly a chunk of text—sometimes a whole word, sometimes part of a word, sometimes punctuation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Hello world"&lt;/code&gt; → &lt;code&gt;["Hello", " world"]&lt;/code&gt; (2 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"apple"&lt;/code&gt; → &lt;code&gt;["apple"]&lt;/code&gt; (1 token)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"12345"&lt;/code&gt; → &lt;code&gt;["123", "45"]&lt;/code&gt; (2 tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does this matter? Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Models have &lt;strong&gt;token limits&lt;/strong&gt; (context windows), not word limits&lt;/li&gt;
&lt;li&gt;The way text is tokenized affects how the model "sees" it&lt;/li&gt;
&lt;li&gt;The model handles some words better because they're single tokens, while others get split into pieces&lt;/li&gt;
&lt;/ol&gt;
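&lt;p&gt;To build intuition for the splitting step, here's a toy greedy longest-match tokenizer over a tiny made-up vocabulary. It's purely illustrative: real tokenizers (like the BPE tokenizers GPT models use) learn their vocabularies from data and behave differently.&lt;/p&gt;

```python
# Toy greedy longest-match tokenizer over a made-up vocabulary.
# Purely illustrative: real BPE tokenizers learn their merges from data.
VOCAB = {"Hello", " world", "apple", "123", "45", "1", "2", "3", "4", "5"}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("Hello world"))  # ['Hello', ' world']  -> 2 tokens
print(tokenize("12345"))        # ['123', '45']        -> 2 tokens
```

Note how "12345" never becomes one token here: the model "sees" two chunks, which is part of why LLMs can struggle with digit-level arithmetic.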

&lt;h3&gt;
  
  
  Step 2: Tokens → Numbers (Embeddings)
&lt;/h3&gt;

&lt;p&gt;The model can't work with text directly—it only understands numbers. Each token gets converted into a long list of numbers called an &lt;strong&gt;embedding&lt;/strong&gt; (basically a mathematical representation of that token).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Transformer Does Its Magic
&lt;/h3&gt;

&lt;p&gt;Your tokens (now numbers) pass through the &lt;strong&gt;Transformer architecture&lt;/strong&gt; layers of neural network computations. Here's where the attention mechanism kicks in, letting the model figure out which tokens relate to which.&lt;/p&gt;

&lt;p&gt;Example: In the sentence &lt;em&gt;"The bank of the river was muddy"&lt;/em&gt;, the model's attention mechanism connects &lt;code&gt;bank&lt;/code&gt; with &lt;code&gt;river&lt;/code&gt; and &lt;code&gt;muddy&lt;/code&gt; to understand we're talking about a riverbank, not a financial institution.&lt;/p&gt;

&lt;p&gt;Note: There are other emerging LLM architectures, such as &lt;a href="https://www.ibm.com/think/topics/diffusion-models" rel="noopener noreferrer"&gt;&lt;strong&gt;Diffusion Models&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://huggingface.co/blog/lbourdois/get-on-the-ssm-train" rel="noopener noreferrer"&gt;&lt;strong&gt;State Space Models&lt;/strong&gt;&lt;/a&gt;, but for simplicity I cover only &lt;strong&gt;Transformer&lt;/strong&gt;-based models here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Predict the Next Token (Probabilities)
&lt;/h3&gt;

&lt;p&gt;At the end of all this processing, the model outputs a &lt;strong&gt;probability distribution&lt;/strong&gt; over all possible next tokens in its vocabulary (which can be 50,000+ tokens).&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paris:       0.85  (85% probability)
the:         0.05  (5% probability)
beautiful:   0.03  (3% probability)
London:      0.02  (2% probability)
[thousands of other tokens with tiny probabilities...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model doesn't "know" Paris is the capital of France. It just calculates that based on the patterns it learned during training, &lt;code&gt;Paris&lt;/code&gt; has the highest probability of being the next token after &lt;em&gt;"The capital of France is"&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Select a Token &amp;amp; Repeat
&lt;/h3&gt;

&lt;p&gt;The model picks a token based on these probabilities (we'll talk about &lt;em&gt;how&lt;/em&gt; it picks in a moment), adds it to the sequence, and repeats the whole process to generate the next token, then the next, until it decides to stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's it. That's all an LLM does: predict the next token, over and over, based on probability.&lt;/strong&gt;&lt;/p&gt;
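&lt;p&gt;The select-and-repeat step fits in a few lines of Python. The distribution below is made up for illustration; a real model recomputes probabilities over its full vocabulary at every step:&lt;/p&gt;

```python
import random

# Made-up next-token distribution for "The capital of France is".
next_token_probs = {"Paris": 0.85, "the": 0.05, "beautiful": 0.03, "London": 0.02}

def pick_token(probs, greedy=False):
    if greedy:
        # Greedy decoding: always take the highest-probability token.
        return max(probs, key=probs.get)
    # Sampling: tokens are chosen in proportion to their probability.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

print(pick_token(next_token_probs, greedy=True))  # Paris
```

Generation is just this pick appended to the sequence, then the whole forward pass again for the next token, until an end-of-sequence token is chosen.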

&lt;h2&gt;
  
  
  But Wait... How Does This Answer My Questions?
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. If LLMs are just probability machines playing "guess the next word," how do they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Answer questions correctly?&lt;/li&gt;
&lt;li&gt;Write code that actually works?&lt;/li&gt;
&lt;li&gt;Hold coherent conversations?&lt;/li&gt;
&lt;li&gt;Follow complex instructions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answer is &lt;strong&gt;training&lt;/strong&gt;: specifically, the two major phases that shape model behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Pre-Training (Learning Patterns from the Internet)
&lt;/h3&gt;

&lt;p&gt;In this phase, the model reads &lt;strong&gt;trillions of tokens&lt;/strong&gt; from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Websites (Wikipedia, forums, blogs)&lt;/li&gt;
&lt;li&gt;Books&lt;/li&gt;
&lt;li&gt;Code repositories (GitHub)&lt;/li&gt;
&lt;li&gt;Research papers&lt;/li&gt;
&lt;li&gt;Social media&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it learns:&lt;/strong&gt; Statistical patterns. If it sees "The capital of France is Paris" thousands of times, it learns that &lt;code&gt;Paris&lt;/code&gt; has a high probability of following "The capital of France is".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't learn:&lt;/strong&gt; How to answer questions like an assistant. A pre-trained "base model" has knowledge but no manners.&lt;/p&gt;

&lt;p&gt;Ask a base model: &lt;em&gt;"What is the capital of France?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It might respond: &lt;em&gt;"What is the capital of Germany? What is the capital of Spain?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why? Because it's just completing patterns it saw in training data, probably quiz lists from forums. It has information, but no concept of "answering questions."&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Post-Training (Teaching It to Be an Assistant)
&lt;/h3&gt;

&lt;p&gt;This is where base models become &lt;strong&gt;ChatGPT&lt;/strong&gt;, &lt;strong&gt;Claude&lt;/strong&gt;, or other chat assistants. Two key steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Supervised Fine-Tuning (SFT):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans write thousands of example conversations: questions and good answers&lt;/li&gt;
&lt;li&gt;The model learns: &lt;em&gt;"Oh, when I see a question, I should provide a helpful answer, not continue the question"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Reinforcement Learning from Human Feedback (RLHF):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans rate different model responses as "good" or "bad"&lt;/li&gt;
&lt;li&gt;The model learns to optimize for helpful, harmless, and honest responses&lt;/li&gt;
&lt;li&gt;This is why models refuse harmful requests or add disclaimers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A model that not only predicts the next token, but predicts tokens that &lt;em&gt;look like helpful assistant responses&lt;/em&gt; because that pattern now has the highest probability in its training.&lt;/p&gt;

&lt;p&gt;So when you ask &lt;em&gt;"What is the capital of France?"&lt;/em&gt;, the model isn't "thinking" about geography. It's predicting that tokens forming a helpful answer have higher probability than tokens that continue the question—because that's what its training reinforced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's all still just next-token prediction. The training just shaped &lt;em&gt;which&lt;/em&gt; predictions have high probability.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Configuration: Controlling the Output
&lt;/h2&gt;

&lt;p&gt;Remember that probability distribution we talked about? Here's where you get control. The model gives you probabilities, but &lt;strong&gt;configuration parameters&lt;/strong&gt; decide how tokens are actually selected from those probabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature: The Creativity Dial
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Temperature&lt;/strong&gt; controls how "random" the model's choices are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example scenario:&lt;/strong&gt; The model predicts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paris:     85%
beautiful: 3%
London:    2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Low Temperature (e.g., 0.2):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model becomes more "confident" and almost always picks the top choice&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Paris&lt;/code&gt; might effectively become 95%+ likely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Deterministic, focused, repetitive outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Code generation, data extraction, factual answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;High Temperature (e.g., 0.8):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model flattens the probability curve&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Paris&lt;/code&gt; might drop to 60%, &lt;code&gt;beautiful&lt;/code&gt; rises to 10%, &lt;code&gt;London&lt;/code&gt; to 8%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; More varied, creative, unpredictable outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Creative writing, brainstorming, multiple perspectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt: &lt;em&gt;"The sky is"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature 0.2:&lt;/strong&gt; "blue" (almost always)&lt;br&gt;
&lt;strong&gt;Temperature 0.8:&lt;/strong&gt; "blue" or "cloudy" or "vast" or "filled with stars" (varies)&lt;/p&gt;
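&lt;p&gt;Mechanically, temperature divides the model's raw scores (logits) by the temperature before the softmax turns them into probabilities. A sketch with made-up logits:&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits.values()]
    total = sum(math.exp(s) for s in scaled)
    return {tok: math.exp(s) / total for tok, s in zip(logits, scaled)}

# Made-up logits for the prompt "The capital of France is".
logits = {"Paris": 4.0, "beautiful": 0.6, "London": 0.2}

cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 0.8)
print(round(cold["Paris"], 3))  # near 1.0: almost always picked
print(round(hot["Paris"], 3))   # smaller: alternatives get a real chance
```

Dividing by 0.2 multiplies every logit by 5, so the gap between "Paris" and the rest explodes; dividing by a large temperature shrinks the gaps toward a uniform distribution.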

&lt;h3&gt;
  
  
  Top-P (Nucleus Sampling): Cutting Off the Nonsense
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Top-P&lt;/strong&gt; (also called nucleus sampling) sets a probability threshold.&lt;/p&gt;

&lt;p&gt;If you set &lt;strong&gt;Top-P = 0.9&lt;/strong&gt;, the model only considers tokens that together make up the top 90% of probability, ignoring everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without Top-P, even with reasonable temperature, the model might occasionally pick a token with 0.001% probability—resulting in complete nonsense.&lt;/p&gt;

&lt;p&gt;With Top-P = 0.9, those ultra-low-probability tokens are never even considered. The model stays coherent while still being creative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical combination:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0.7 + Top-P 0.9&lt;/strong&gt; = Creative but coherent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature 0.2 + Top-P 1.0&lt;/strong&gt; = Deterministic and focused&lt;/li&gt;
&lt;/ul&gt;
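&lt;p&gt;A sketch of the nucleus filter with a made-up distribution: sort tokens by probability, keep the smallest set whose cumulative probability reaches the threshold, and renormalize:&lt;/p&gt;

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest set of top tokens whose cumulative probability >= p.
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"Paris": 0.85, "the": 0.05, "beautiful": 0.03, "London": 0.02,
         "nonsense": 0.00001}
filtered = top_p_filter(probs, p=0.9)
print(sorted(filtered))  # ['Paris', 'the']
```

The 0.001%-probability "nonsense" token never survives the cut, which is exactly how Top-P keeps sampling coherent.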

&lt;h3&gt;
  
  
  Top-K: Limiting Choices
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Top-K&lt;/strong&gt; simply limits the model to considering only the K most likely tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Top-K = 50 means the model only looks at the 50 highest-probability tokens and ignores the rest.&lt;/p&gt;

&lt;p&gt;This is a simpler version of Top-P and less commonly used in modern systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;Let's trace through a complete example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your prompt:&lt;/strong&gt; &lt;em&gt;"Explain photosynthesis in simple terms"&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization:&lt;/strong&gt; &lt;code&gt;["Explain", " photosynthesis", " in", " simple", " terms"]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model processing:&lt;/strong&gt; Transformer calculates relationships between tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability distribution for next token:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Photosynthesis:  40%
   It:             15%
   The:            12%
   In:              8%
   [...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Configuration applied (Temperature 0.3, Top-P 0.9):&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Low temperature sharpens: &lt;code&gt;Photosynthesis&lt;/code&gt; → 65%&lt;/li&gt;
&lt;li&gt;Model picks &lt;code&gt;Photosynthesis&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Repeat:&lt;/strong&gt; Now the sequence is &lt;em&gt;"Explain photosynthesis in simple terms Photosynthesis"&lt;/em&gt;

&lt;ul&gt;
&lt;li&gt;Calculate probabilities for the next token&lt;/li&gt;
&lt;li&gt;Pick based on configuration&lt;/li&gt;
&lt;li&gt;Continue until complete answer is generated&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The model never "understood" photosynthesis. It predicted tokens that statistically form explanations based on patterns from its training data.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now You Have the Mental Model
&lt;/h2&gt;

&lt;p&gt;You now understand the fundamental truth: &lt;strong&gt;LLMs are probability engines, not reasoning machines.&lt;/strong&gt; Every response is just a statistical prediction of the next token, shaped by training data and controlled by configuration parameters.&lt;/p&gt;

&lt;p&gt;But here's where it gets powerful: &lt;strong&gt;If you understand the mechanism, you can engineer the probabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your prompt doesn't just ask a question; it shapes the entire probability landscape the model uses to generate its response. Change a few words, reorder your instructions, add an example, and suddenly different tokens become more likely. Different tokens mean different outputs.&lt;/p&gt;

&lt;p&gt;In the next part, we're going to explore &lt;strong&gt;The Art &amp;amp; Science of Prompting&lt;/strong&gt;: how to deliberately craft prompts that steer those probabilities in your favor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The foundation is set. Now let's learn to build on it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve also set up a &lt;a href="https://github.com/programmerraja/PromptEngineering" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository for this series, where I’ll be sharing the code and additional resources. Make sure to check it out and give it a star!&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts, comments, and insights below. Let’s learn and grow together!&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>llm</category>
      <category>openai</category>
      <category>gemin</category>
    </item>
    <item>
      <title>Spotify Wrapped, But for Developers: Introducing GitHub Wrapped 2025</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 20 Dec 2025 02:02:50 +0000</pubDate>
      <link>https://forem.com/programmerraja/spotify-wrapped-but-for-developers-introducing-github-wrapped-2025-21lm</link>
      <guid>https://forem.com/programmerraja/spotify-wrapped-but-for-developers-introducing-github-wrapped-2025-21lm</guid>
      <description>&lt;p&gt;&lt;strong&gt;GitHub Wrapped 2025 is LIVE!&lt;/strong&gt; 🎉&lt;/p&gt;

&lt;p&gt;Your year of code &lt;strong&gt;wrapped&lt;/strong&gt; and ready to explore!&lt;/p&gt;

&lt;p&gt;Ever wondered what your year looked like in code?&lt;br&gt;
&lt;em&gt;GitHub Wrapped&lt;/em&gt; turns your GitHub activity into a visual story that celebrates the work you &lt;em&gt;actually did&lt;/em&gt;, not just green dots on a graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is GitHub Wrapped?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pxhja746b8vdf99newu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pxhja746b8vdf99newu.png" alt="Product preview" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Wrapped&lt;/strong&gt; is your personal year-in-review for GitHub activity.&lt;br&gt;
Inspired by &lt;em&gt;Spotify Wrapped&lt;/em&gt; and other year review tools, it takes your contributions and turns them into insights you can &lt;em&gt;see, analyze, and share&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With it you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📆 See your contribution graph for the year&lt;/li&gt;
&lt;li&gt;🏆 Unlock fun &lt;strong&gt;badges&lt;/strong&gt; that reflect your coding persona&lt;/li&gt;
&lt;li&gt;📌 Discover your &lt;strong&gt;top repositories&lt;/strong&gt; and where you spent most of your energy&lt;/li&gt;
&lt;li&gt;🎨 Apply themes like &lt;strong&gt;Cyberpunk&lt;/strong&gt; and &lt;strong&gt;Sunset Vibes&lt;/strong&gt; for personalized style&lt;/li&gt;
&lt;li&gt;🔐 Optionally include &lt;strong&gt;private contributions&lt;/strong&gt; using a personal access token (the token is used locally only; no tracking or storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why You’ll Love It
&lt;/h2&gt;

&lt;p&gt;For us developers, GitHub has become our &lt;em&gt;public portfolio&lt;/em&gt;, showing commits, collaborations, issues, and code that tell a story of growth, consistency, and learning.&lt;/p&gt;

&lt;p&gt;But GitHub itself doesn’t wrap that story up for you at the end of the year. &lt;em&gt;GitHub Wrapped&lt;/em&gt; fills that gap by:&lt;br&gt;
✔ giving you &lt;strong&gt;a snapshot of the year&lt;/strong&gt;&lt;br&gt;
✔ surfacing patterns you might miss in the daily grind&lt;br&gt;
✔ making something shareable and fun&lt;/p&gt;

&lt;h2&gt;
  
  
  What You’ll Get
&lt;/h2&gt;

&lt;p&gt;Once you generate your wrapped summary, you’ll see:&lt;/p&gt;

&lt;p&gt;✨ Your &lt;strong&gt;year’s contribution graph&lt;/strong&gt;&lt;br&gt;
🎭 Your &lt;strong&gt;developer persona&lt;/strong&gt;  who &lt;em&gt;you&lt;/em&gt; were in code this year&lt;br&gt;
🏆 &lt;strong&gt;Badges&lt;/strong&gt; that celebrate your style and activity&lt;br&gt;
📊 &lt;strong&gt;Top projects&lt;/strong&gt; that defined your GitHub year&lt;br&gt;
🎨 Themes to make it &lt;em&gt;your own&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visit 👉 &lt;strong&gt;&lt;a href="https://programmerraja.is-a.dev/githubwrap/" rel="noopener noreferrer"&gt;Githubwrapup&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enter your &lt;strong&gt;GitHub username&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Optionally add a &lt;strong&gt;personal access token&lt;/strong&gt; to include private contributions&lt;/li&gt;
&lt;li&gt;Pick your &lt;strong&gt;year &amp;amp; theme&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Generate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your GitHub year in code appears, ready to explore and share!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s my GitHub 2025 Wrapped summary 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a36qhgjd30y2ogbb0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87a36qhgjd30y2ogbb0p.png" alt="Preview" width="800" height="1769"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>githubwrapped</category>
      <category>2025</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Your LLM Costs 5X More If You Don't Speak English</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 25 Oct 2025 07:22:29 +0000</pubDate>
      <link>https://forem.com/programmerraja/why-your-llm-costs-5x-more-if-you-dont-speak-english-2d0e</link>
      <guid>https://forem.com/programmerraja/why-your-llm-costs-5x-more-if-you-dont-speak-english-2d0e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Are you overpaying for AI because of your language?&lt;/strong&gt; If you're building LLM applications in Spanish, Hindi, or Greek, you could be spending up to &lt;strong&gt;6 times more&lt;/strong&gt; than English users for the exact same functionality. &lt;/p&gt;

&lt;p&gt;This blog was inspired by the research paper &lt;a href="https://arxiv.org/pdf/2305.13707" rel="noopener noreferrer"&gt;Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Tokenization Tax
&lt;/h2&gt;

&lt;p&gt;When you send text to GPT-4, Claude, or Gemini, your input gets broken into &lt;strong&gt;tokens&lt;/strong&gt;: chunks roughly 3-4 characters long in English. You pay per token for both input and output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shocking truth:&lt;/strong&gt; The same sentence costs &lt;strong&gt;wildly different amounts&lt;/strong&gt; depending on your language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Example: "Hello, my name is Sarah"
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Tokens Needed&lt;/th&gt;
&lt;th&gt;Cost vs English&lt;/th&gt;
&lt;th&gt;Annual Cost (10K msgs/day)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;English&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7 tokens&lt;/td&gt;
&lt;td&gt;1.0x baseline&lt;/td&gt;
&lt;td&gt;$16,425&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spanish&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$24,638 (+$8,213)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;35 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$82,125 (+$65,700)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Greek&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;42 tokens&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.0x more&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$98,550 (+$82,125)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's an &lt;strong&gt;$82,000 annual difference&lt;/strong&gt; for the exact same chatbot, purely because of language.&lt;/p&gt;
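&lt;p&gt;The table's arithmetic is easy to reproduce. The per-token price below is just calibrated so the English row works out (it is not a quoted provider price); the point is that cost scales linearly with token count:&lt;/p&gt;

```python
# Reproduce the table's scaling. PRICE_PER_TOKEN is a placeholder calibrated
# to the English row, not a real provider price; only the ratios matter.
PRICE_PER_TOKEN = 16425 / (7 * 10_000 * 365)

def annual_cost(tokens_per_msg, msgs_per_day=10_000):
    return tokens_per_msg * msgs_per_day * 365 * PRICE_PER_TOKEN

english = annual_cost(7)     # ~$16,425
hindi = annual_cost(35)      # 5x the tokens -> 5x the cost, ~$82,125
print(round(hindi - english))  # ~$65,700 extra per year
```

Swap in your own provider's per-token price and your real traffic numbers; the multiplier between languages stays the same because it comes entirely from token counts.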

&lt;h2&gt;
  
  
  The Complete Language Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;Research from ACL 2023 and recent LLM benchmarks reveals systematic bias in how models tokenize different languages. Here's what it costs to process 24 major languages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3zc8ct92f2z627ynl5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3zc8ct92f2z627ynl5a.png" alt="Tokenization cost comparison across 24 languages showing how many times more expensive each language is compared to English due to tokenization differences" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tokenization cost comparison across 24 languages showing how many times more expensive each language is compared to English due to tokenization differences&lt;/p&gt;

&lt;h3&gt;
  
  
  Most Efficient Languages (1.0-1.5x English)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; 1.0x (baseline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;French:&lt;/strong&gt; 1.2x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Italian:&lt;/strong&gt; 1.2x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portuguese:&lt;/strong&gt; 1.3x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; 1.5x&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Moderately Expensive (1.6-2.5x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Korean:&lt;/strong&gt; 1.6x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Japanese:&lt;/strong&gt; 1.8x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chinese (Simplified):&lt;/strong&gt; 2.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic:&lt;/strong&gt; 2.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Russian:&lt;/strong&gt; 2.5x&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Highly Expensive (3.0-6.0x)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ukrainian:&lt;/strong&gt; 3.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bengali:&lt;/strong&gt; 4.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thai:&lt;/strong&gt; 4.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tamil:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telugu:&lt;/strong&gt; 5.0x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greek:&lt;/strong&gt; 6.0x (most expensive)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Writing Systems Matter
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd3uf4lg27axvrz1ali9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd3uf4lg27axvrz1ali9.png" alt="Comparison of tokenization costs and efficiency across different writing systems, showing why Latin-based languages are most cost-effective for LLM applications" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparison of tokenization costs and efficiency across different writing systems, showing why Latin-based languages are most cost-effective for LLM applications&lt;/p&gt;

&lt;p&gt;The script your language uses creates dramatic efficiency gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latin script:&lt;/strong&gt; 1.4x average (73.5% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hangul (Korean):&lt;/strong&gt; 1.6x (63% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Han/Japanese:&lt;/strong&gt; 1.8-2.0x (50-56% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cyrillic:&lt;/strong&gt; 2.75x average (36.5% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indic scripts:&lt;/strong&gt; 4-5x average (20% efficient)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greek:&lt;/strong&gt; 6.0x (17% efficient—worst)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Inequality Exists
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Training Data Bias
&lt;/h3&gt;

&lt;p&gt;GPT-4, Claude, and Gemini are trained on English-dominant datasets. The Common Crawl corpus shows stark imbalance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~60% English&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;~10-15% combined for Spanish/French/German&lt;/li&gt;
&lt;li&gt;&amp;lt;5% for most other languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenizers learn to compress what they see most. English gets ultra-efficient encoding; everything else is treated as "foreign."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Morphological Complexity
&lt;/h3&gt;

&lt;p&gt;Languages with rich morphology generate far more word variations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; "run" → run, runs, running, ran (4 forms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turkish:&lt;/strong&gt; Single root → 50+ forms with suffixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic:&lt;/strong&gt; Root system → thousands of variations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; Complex verb conjugations with gender/number/tense&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenizers can't learn compact patterns for high-variation, low-data languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Unicode Encoding Overhead
&lt;/h3&gt;

&lt;p&gt;Different scripts need different byte counts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latin:&lt;/strong&gt; 1 byte per character&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cyrillic:&lt;/strong&gt; 2 bytes per character&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devanagari/Tamil:&lt;/strong&gt; 3+ bytes per character&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More bytes = more tokens = higher cost—even for the same semantic content.&lt;/p&gt;
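&lt;p&gt;The byte-count gap is easy to verify with plain Python. The greetings below are illustrative samples I chose, not from any benchmark:&lt;/p&gt;

```python
# UTF-8 byte overhead per script: similar words, very different byte counts.
samples = {
    "English (Latin)": "hello",       # 1 byte per character
    "Russian (Cyrillic)": "привет",   # 2 bytes per character
    "Tamil (Indic)": "வணக்கம்",        # 3 bytes per code point
}

for language, word in samples.items():
    chars = len(word)
    nbytes = len(word.encode("utf-8"))
    print(f"{language}: {chars} chars -> {nbytes} bytes")
```

&lt;p&gt;Byte-level BPE tokenizers operate on exactly these bytes, so scripts with a higher bytes-per-character ratio start at a disadvantage before training-data imbalance even enters the picture.&lt;/p&gt;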

&lt;h2&gt;
  
  
  Real-World Cost Impact
&lt;/h2&gt;

&lt;p&gt;Here's what tokenization inequality means for actual business applications:&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer Support Chatbot (10,000 messages/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$16,425/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$24,638/year (+50%, +\$8,213)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$82,125/year (+400%, +\$65,700)&lt;/li&gt;
&lt;/ul&gt;
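&lt;p&gt;The chatbot figures above are consistent with a simple model: an English per-message cost scaled by each language's token multiplier. Here is a sketch; the $0.0045 per-message baseline is my assumption, chosen so the English figure matches the $16,425/year quoted above:&lt;/p&gt;

```python
# Annual LLM spend for a chatbot, scaled by the tokenization multiplier.
# BASELINE_COST_PER_MSG_USD is an assumed figure, reverse-engineered so
# that English at 1.0x reproduces the $16,425/year number above.
BASELINE_COST_PER_MSG_USD = 0.0045
MESSAGES_PER_DAY = 10_000

def annual_cost(multiplier: float) -> float:
    """Yearly cost in USD for a given language's token multiplier."""
    return MESSAGES_PER_DAY * BASELINE_COST_PER_MSG_USD * multiplier * 365

for language, mult in [("English", 1.0), ("Spanish", 1.5), ("Hindi", 5.0)]:
    print(f"{language}: ${annual_cost(mult):,.0f}/year")
```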

&lt;h3&gt;
  
  
  Content Generation Platform (1M words/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$14,400/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$21,600/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$72,000/year&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Document Translation Service (100K words/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$65,700/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$98,550/year (+\$32,850)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$328,500/year (+\$262,800)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Assistant (50K queries/day)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;English:&lt;/strong&gt; \$91,250/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish:&lt;/strong&gt; \$136,875/year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hindi:&lt;/strong&gt; \$456,250/year (+\$365,000)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; A company serving Hindi users pays &lt;strong&gt;\$262,800-\$365,000 more annually&lt;/strong&gt; than an identical English service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Socioeconomic Dimension
&lt;/h2&gt;

&lt;p&gt;Research reveals a disturbing &lt;strong&gt;-0.5 correlation&lt;/strong&gt; between a country's Human Development Index and LLM tokenization cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translation:&lt;/strong&gt; Less developed countries often speak languages that cost more to process.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users in developing nations pay premium rates&lt;/li&gt;
&lt;li&gt;Communities with fewer resources face higher AI barriers&lt;/li&gt;
&lt;li&gt;This creates "double unfairness" in AI democratization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A startup in India building a Hindi customer service bot pays 5x more than a US competitor despite likely having far less funding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Fair AI
&lt;/h2&gt;

&lt;p&gt;Language should never determine how much intelligence costs. Yet today, the world’s most spoken tongues pay a silent premium just to access the same models. Fixing this isn’t about optimization; it’s about fairness. Until every language is tokenized equally, AI remains fluent in inequality.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>2025 Voice AI Guide: How to Make Your Own Real-Time Voice Agent (Part-2)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 11 Oct 2025 12:50:32 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-2-1288</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-2-1288</guid>
      <description>&lt;p&gt;Welcome to &lt;strong&gt;Part 2 of the 2025 Voice AI Guide: How to Build Your Own Real-Time Voice Agent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this section, we’ll dive deep into &lt;strong&gt;Pipecat&lt;/strong&gt; and create a simple &lt;strong&gt;“Hello World” program&lt;/strong&gt; to understand how real-time voice AI works in practice. &lt;/p&gt;

&lt;p&gt;If you have not read Part 1 yet, read it first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipecat
&lt;/h2&gt;

&lt;p&gt;Pipecat is an &lt;strong&gt;open-source Python framework&lt;/strong&gt; developed by Daily.co for building &lt;strong&gt;real-time voice and multimodal conversational AI agents&lt;/strong&gt;. It provides a powerful yet intuitive way to orchestrate audio, video, AI services, and transport protocols to create sophisticated voice assistants, AI companions, and interactive conversational experiences.&lt;/p&gt;

&lt;p&gt;What makes Pipecat special is its &lt;strong&gt;voice-first approach&lt;/strong&gt; combined with a &lt;strong&gt;modular, composable architecture&lt;/strong&gt;. Instead of building everything from scratch, you can focus on what makes your agent unique while Pipecat handles the complex orchestration of real-time audio processing, speech recognition, language models, and speech synthesis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Can Build with Pipecat
&lt;/h3&gt;

&lt;p&gt;Pipecat enables a wide range of applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice Assistants&lt;/strong&gt; - Natural, streaming conversations with AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Companions&lt;/strong&gt; - Coaches, meeting assistants, and interactive characters
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone Agents&lt;/strong&gt; - Customer support, intake bots, and automated calling systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal Interfaces&lt;/strong&gt; - Applications combining voice, video, and images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Agents&lt;/strong&gt; - Customer service bots and guided workflow systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive Games&lt;/strong&gt; - Voice-controlled gaming experiences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative Tools&lt;/strong&gt; - Interactive storytelling with generative media&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pipecat Architecture: How It Works
&lt;/h2&gt;

&lt;p&gt;Understanding Pipecat's architecture is crucial for building effective voice agents. The framework is built around three core concepts:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Frames
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frames&lt;/strong&gt; are data packages that move through your application. Think of them as containers that hold specific types of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audio frames&lt;/strong&gt; - Raw audio data from microphones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text frames&lt;/strong&gt; - Transcribed speech or generated responses
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image frames&lt;/strong&gt; - Visual data for multimodal applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control frames&lt;/strong&gt; - System messages like start/stop signals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Frame Processors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frame processors&lt;/strong&gt; are specialized workers that handle specific tasks. Each processor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives specific frame types as input&lt;/li&gt;
&lt;li&gt;Performs a specialized transformation (transcription, language processing, etc.)&lt;/li&gt;
&lt;li&gt;Outputs new frames for the next processor&lt;/li&gt;
&lt;li&gt;Passes through frames it doesn't handle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common processor types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT (Speech-to-Text)&lt;/strong&gt; processors that convert audio frames to text frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM processors&lt;/strong&gt; that take text frames and generate response frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS (Text-to-Speech)&lt;/strong&gt; processors that convert text frames to audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context aggregators&lt;/strong&gt; that manage conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Pipelines
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pipelines&lt;/strong&gt; connect processors together, creating a structured path for data to flow through your application. They handle orchestration automatically and enable &lt;strong&gt;parallel processing&lt;/strong&gt; - while the LLM generates later parts of a response, earlier parts are already being converted to speech and played back to users.&lt;/p&gt;
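&lt;p&gt;To make the frame → processor → pipeline idea concrete, here is a framework-free toy in plain Python. This is &lt;em&gt;not&lt;/em&gt; Pipecat's actual API (Pipecat's processors are async and stream frames continuously); every name below is invented for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass

# Toy model of frames, processors, and a pipeline. All names are
# hypothetical; Pipecat's real classes differ and run asynchronously.

@dataclass
class ToyFrame:
    kind: str   # e.g. "audio" or "text"
    data: str

class ToyProcessor:
    def process(self, frame: ToyFrame) -> ToyFrame:
        # Default: pass through frames this processor doesn't handle.
        return frame

class ToySTT(ToyProcessor):
    def process(self, frame: ToyFrame) -> ToyFrame:
        if frame.kind == "audio":
            return ToyFrame("text", f"transcript of {frame.data}")
        return frame

class ToyLLM(ToyProcessor):
    def process(self, frame: ToyFrame) -> ToyFrame:
        if frame.kind == "text":
            return ToyFrame("text", f"reply to '{frame.data}'")
        return frame

class ToyPipeline:
    def __init__(self, processors):
        self.processors = processors

    def run(self, frame: ToyFrame) -> ToyFrame:
        # Each processor transforms the frame and hands it to the next.
        for processor in self.processors:
            frame = processor.process(frame)
        return frame

pipeline = ToyPipeline([ToySTT(), ToyLLM()])
result = pipeline.run(ToyFrame("audio", "mic chunk #1"))
print(result.kind, result.data)
```

&lt;p&gt;Pipecat layers async streaming and parallelism on top of this basic shape, which is what enables the low-latency behavior described below.&lt;/p&gt;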

&lt;h3&gt;
  
  
  Voice AI Processing Flow
&lt;/h3&gt;

&lt;p&gt;Here's how a typical voice conversation flows through a Pipecat pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input&lt;/strong&gt; - User speaks → Transport receives streaming audio → Creates audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Recognition&lt;/strong&gt; - STT processor receives audio → Transcribes in real-time → Outputs text frames
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Management&lt;/strong&gt; - Context processor aggregates text with conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language Processing&lt;/strong&gt; - LLM processor generates streaming response → Outputs text frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech Synthesis&lt;/strong&gt; - TTS processor converts text to speech → Outputs audio frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Output&lt;/strong&gt; - Transport streams audio to user's device → User hears response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight is that &lt;strong&gt;everything happens in parallel&lt;/strong&gt; - this parallel processing enables the ultra-low latency that makes conversations feel natural.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hello World Voice Agent: Complete Implementation
&lt;/h2&gt;

&lt;p&gt;Now let's build a complete "Hello World" voice agent that demonstrates all the core concepts. This example creates a friendly AI assistant you can have real-time voice conversations with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before we start, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.10 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uv package manager&lt;/strong&gt; (or pip)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API keys&lt;/strong&gt; from three services:

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://deepgram.com" rel="noopener noreferrer"&gt;Deepgram&lt;/a&gt; for Speech-to-Text&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; for the language model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cartesia.ai" rel="noopener noreferrer"&gt;Cartesia&lt;/a&gt; for Text-to-Speech&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Setup
&lt;/h3&gt;

&lt;p&gt;First, let's set up our project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Pipecat with required services&lt;/span&gt;
uv add &lt;span class="s2"&gt;"pipecat-ai[deepgram,openai,cartesia,webrtc]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Configuration
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_deepgram_api_key_here
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_openai_api_key_here  
&lt;span class="nv"&gt;CARTESIA_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_cartesia_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full code is a bit long to include here, so I have only shared the key parts; you can check out the complete code &lt;a href="https://github.com/programmerraja/VoiceAgentGuide/tree/main/code/HelloWorld" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main entry point for the Hello World bot.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HelloWorldVoiceBot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_bot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding the Code Structure
&lt;/h3&gt;

&lt;p&gt;Let's break down the key components of our Hello World implementation:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Service Initialization
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Speech-to-Text service
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DeepgramSTTService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Language Model service  
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILLMService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Text-to-Speech service
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CartesiaTTSService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CARTESIA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service is a &lt;strong&gt;frame processor&lt;/strong&gt; that handles a specific part of the voice AI pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Pipeline Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;              &lt;span class="c1"&gt;# Audio input from browser
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Speech → Text
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context_aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="c1"&gt;# Add to conversation history
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Text → Speech  
&lt;/span&gt;    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;            &lt;span class="c1"&gt;# Audio output to browser
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context_aggregator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assistant&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# Save response to history
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline defines the &lt;strong&gt;data flow&lt;/strong&gt;: each processor receives frames, transforms them, and passes them to the next processor.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Event-Driven Interactions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@transport.event_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_first_participant_joined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_participant_joined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;participant&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Trigger bot to greet the user
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LLMMessagesFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Event handlers manage the &lt;strong&gt;conversation lifecycle&lt;/strong&gt; - when users join/leave, when they start/stop speaking, etc.&lt;/p&gt;

&lt;p&gt;The diagram below shows a typical voice assistant pipeline, where each step happens in real-time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwug7hlfym5t96bqfnqy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwug7hlfym5t96bqfnqy7.png" alt="pipecat" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Your Hello World Bot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save the code&lt;/strong&gt; as &lt;code&gt;hello_world_bot.py&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run the bot&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python hello_world_bot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open your browser&lt;/strong&gt; to &lt;code&gt;http://localhost:7860&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Click "Connect"&lt;/strong&gt; and allow microphone access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start talking!&lt;/strong&gt; Say something like "Hello, how are you?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bot will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listen to your speech (STT)&lt;/li&gt;
&lt;li&gt;Process it with OpenAI (LLM) &lt;/li&gt;
&lt;li&gt;Respond with natural speech (TTS)&lt;/li&gt;
&lt;li&gt;Remember the conversation context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more examples and advanced features, check out the &lt;a href="https://docs.pipecat.ai" rel="noopener noreferrer"&gt;Pipecat documentation&lt;/a&gt; and &lt;a href="https://github.com/pipecat-ai/pipecat-examples" rel="noopener noreferrer"&gt;example repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Next?
&lt;/h2&gt;

&lt;p&gt;Now that you’re familiar with &lt;strong&gt;Pipecat&lt;/strong&gt; and can build your own real-time voice agent, it’s time to take the next step.&lt;/p&gt;

&lt;p&gt;In the upcoming part, we’ll explore &lt;strong&gt;how to run all models locally even on a CPU&lt;/strong&gt; and build a fully offline voice agent.&lt;/p&gt;

&lt;p&gt;I’ve created a GitHub repository &lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;VoiceAgentGuide&lt;/a&gt; for this series, where we can store our notes and related resources. Don’t forget to check it out and share your feedback. Feel free to contribute or add missing content by submitting a pull request (PR).&lt;/p&gt;

&lt;p&gt;Stay tuned for the next part of the &lt;strong&gt;2025 Voice AI Guide!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>voiceagent</category>
      <category>pipecat</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Machines Learn: Understanding the Core Concepts of Neural Networks</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 04 Oct 2025 08:22:57 +0000</pubDate>
      <link>https://forem.com/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j</link>
      <guid>https://forem.com/programmerraja/how-machines-learn-understanding-the-core-concepts-of-neural-networks-3a9j</guid>
      <description>&lt;p&gt;Imagine trying to teach a child who’s never seen the world  to recognize a face, feel that fire is hot, or sense when it might rain. How would you do it?&lt;/p&gt;

&lt;p&gt;For centuries, we thought intelligence required something mystical: a soul, consciousness, a divine spark. But what if it’s just &lt;strong&gt;pattern recognition&lt;/strong&gt; at an extraordinary scale? What if learning is simply tuning millions of tiny parameters until inputs map correctly to outputs?&lt;/p&gt;

&lt;p&gt;That’s the bold idea behind &lt;strong&gt;deep learning&lt;/strong&gt;: mathematical systems that can learn any pattern, approximate any function, and tackle problems once thought uniquely human.&lt;/p&gt;

&lt;p&gt;In 1989, mathematicians proved the &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" rel="noopener noreferrer"&gt;Universal Approximation Theorem&lt;/a&gt;&lt;/strong&gt;, showing that even a neural network with a single hidden layer can approximate any continuous function. In theory, such a network can learn to translate, recognize, play, or predict anything.&lt;/p&gt;

&lt;p&gt;But theory alone isn’t enough. The theorem says such a network &lt;em&gt;exists&lt;/em&gt;, not how to build or train it. That’s where the real craft of deep learning begins: finding the right weights, training efficiently, and learning patterns that generalize.&lt;/p&gt;

&lt;p&gt;Let’s unpack the six core ideas that make this possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is a deep dive, not a skim. Grab a coffee, settle in, and take your time. By the end, you’ll understand neural networks from the ground up, not just in words but in logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Neural Networks: Universal Function Approximators
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Are We Trying to Do?
&lt;/h3&gt;

&lt;p&gt;Before we understand neural networks, let's start with something simpler: what is a function?&lt;/p&gt;

&lt;p&gt;In mathematics, a function is a relationship that maps inputs to outputs. &lt;code&gt;f(x) = 2x + 1&lt;/code&gt; is a function. You give it &lt;code&gt;x = 3&lt;/code&gt;, it returns &lt;code&gt;7&lt;/code&gt;. Simple, deterministic, predictable.&lt;/p&gt;

&lt;p&gt;But real-world problems involve functions we can't write down. Consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;f(image) = "cat" or "dog"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f(email_text) = "spam" or "not spam"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f(patient_symptoms) = disease_probability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are still functions (they map inputs to outputs), but we don't know their mathematical form. Traditional programming can't help us here because we can't write explicit rules for every possible image or email.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Blocks: The Artificial Neuron
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pad2olztck0jmxyxdsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pad2olztck0jmxyxdsi.png" alt="alt text" width="679" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's build from the ground up. Start with a single neuron, the atomic unit of a neural network.&lt;/p&gt;

&lt;p&gt;A neuron does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receives multiple inputs&lt;/strong&gt; (x₁, x₂, x₃, ...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiplies each input by a weight&lt;/strong&gt; (w₁, w₂, w₃, ...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sums everything up and adds a bias&lt;/strong&gt;: &lt;code&gt;z = w₁x₁ + w₂x₂ + w₃x₃ + ... + b&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why this structure? Because it's the simplest way to combine multiple pieces of information into a single decision.&lt;/p&gt;
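&lt;p&gt;Those three steps fit in a few lines of plain Python. The inputs, weights, and bias below are arbitrary illustrative values:&lt;/p&gt;

```python
# A single artificial neuron: weighted sum of inputs plus a bias.
def neuron(inputs, weights, bias):
    """Compute z = w1*x1 + w2*x2 + ... + b for one neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Example with made-up values: z = 0.5*1.0 + (-1.0)*2.0 + 0.25*3.0 + 0.1
z = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 0.25], bias=0.1)
print(z)
```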

&lt;h3&gt;
  
  
  Geometry of a Neuron: Drawing a Line
&lt;/h3&gt;

&lt;p&gt;Let’s ground this in a real example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: You're a bank deciding whether to approve loans. You have two pieces of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;x₁ = Annual income (in thousands)&lt;/li&gt;
&lt;li&gt;x₂ = Credit score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Separate "approve" from "reject" applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  A Single Neuron Creates a Line (2D) or Hyperplane (Higher Dimensions)
&lt;/h4&gt;

&lt;p&gt;The equation &lt;code&gt;z = w₁x₁ + w₂x₂ + b&lt;/code&gt; is actually the equation of a line! Let's see how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example neuron with specific weights:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = 0.5·income + 2·credit_score - 150
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This neuron outputs positive values for "approve" and negative for "reject". The &lt;strong&gt;decision boundary&lt;/strong&gt; is where z = 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 = 0.5·income + 2·credit_score - 150
credit_score = 75 - 0.25·income
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a line! Let's plot it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukswvy01qsmym9a2a5ba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukswvy01qsmym9a2a5ba.png" alt="alt text" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What the weights mean geometrically:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;w₁ = 0.5&lt;/strong&gt;: For every $1000 increase in income, the decision shifts by 0.5 units toward approval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;w₂ = 2.0&lt;/strong&gt;: For every 1-point increase in credit score, the decision shifts by 2 units toward approval (4× more important than income!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;b = -150&lt;/strong&gt;: The bias shifts the entire line. Without it, the line would pass through origin (0,0)&lt;/li&gt;
&lt;/ul&gt;
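&lt;p&gt;You can run the loan neuron directly. The weights come from the example above; the two applicants are made-up numbers just to show the sign of z deciding the outcome:&lt;/p&gt;

```python
def loan_neuron(income, credit_score):
    # z = 0.5*income + 2*credit_score - 150 (weights from the example above)
    return 0.5 * income + 2.0 * credit_score - 150.0

# Two illustrative applicants (income in thousands):
print(loan_neuron(100, 60))  # 50 + 120 - 150 = 20.0  -> positive: approve
print(loan_neuron(40, 55))   # 20 + 110 - 150 = -20.0 -> negative: reject
```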

&lt;p&gt;&lt;strong&gt;The learning process is finding the right line:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a random line (random weights)&lt;/li&gt;
&lt;li&gt;See which points it classifies wrong&lt;/li&gt;
&lt;li&gt;Adjust the weights to rotate and shift the line&lt;/li&gt;
&lt;li&gt;Repeat until the line best separates the two groups&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  What One Neuron Can and Cannot Do
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjpd2zz3kbyqrtovbwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjpd2zz3kbyqrtovbwd.png" alt="alt text" width="690" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cannot separate (non-linearly separable):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;XOR is the classic example: you need (0,1) and (1,0) to be class 1, but (0,0) and (1,1) to be class 0. No single line can achieve this separation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why we need multiple layers.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Neurons, Multiple Lines: Building Complex Boundaries
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wkyz1thrtvbi5h0oopf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wkyz1thrtvbi5h0oopf.png" alt="alt text" width="800" height="635"&gt;&lt;/a&gt;&lt;br&gt;
If one neuron creates one line, what happens with multiple neurons in one layer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: 3 neurons in one layer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Neuron 1: z₁ = w₁₁x₁ + w₁₂x₂ + b₁ [Line 1]
Neuron 2: z₂ = w₂₁x₁ + w₂₂x₂ + b₂ [Line 2]
Neuron 3: z₃ = w₃₁x₁ + w₃₂x₂ + b₃ [Line 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each neuron draws a different line! But without additional layers, we still can't solve XOR. Why? Because a weighted sum of lines is still just a line; we need a non-linear way to combine them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: We need to combine these lines non-linearly. This is where activation functions and depth come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Layer Abstraction
&lt;/h3&gt;

&lt;p&gt;Now stack multiple neurons side by side: that's a layer. Each neuron in the layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives the same inputs&lt;/li&gt;
&lt;li&gt;Has its own unique weights and bias&lt;/li&gt;
&lt;li&gt;Produces its own output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A layer with 10 neurons transforms one input vector into 10 different outputs, each representing a different "feature" or "pattern" it has detected.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solving XOR: A Complete Example
&lt;/h4&gt;

&lt;p&gt;Let's solve XOR step-by-step to understand how layers work together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The XOR Problem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (x₁, x₂) → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two-Layer Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Create useful features (2 neurons with ReLU)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neuron 1&lt;/strong&gt;: Detects "at least one input is 1"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₁ = x₁ + x₂ - 0.5
a₁ = ReLU(z₁)

Testing:

(0,0): z₁ = -0.5, a₁ = 0
(0,1): z₁ = 0.5, a₁ = 0.5
(1,0): z₁ = 0.5, a₁ = 0.5
(1,1): z₁ = 1.5, a₁ = 1.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Neuron 2&lt;/strong&gt;: Detects "both inputs are 1"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₂ = x₁ + x₂ - 1.5
a₂ = ReLU(z₂)

Testing:

(0,0): z₂ = -1.5, a₂ = 0

(0,1): z₂ = -0.5, a₂ = 0

(1,0): z₂ = -0.5, a₂ = 0

(1,1): z₂ = 0.5, a₂ = 0.5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2: Combine features (1 neuron with Sigmoid)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z₃ = a₁ - 2·a₂ - 0.25

output = Sigmoid(z₃)

Testing:
(0,0): z₃ = 0 - 0 - 0.25 = -0.25 → ≈0 ✓

(0,1): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓

(1,0): z₃ = 0.5 - 0 - 0.25 = 0.25 → ≈1 ✓

(1,1): z₃ = 1.5 - 1 - 0.25 = 0.25 → ≈0 ✓

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happened geometrically?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkh97kfg4xvhipjm0o5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkh97kfg4xvhipjm0o5q.png" alt="alt text" width="600" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 transformed the space:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first layer created new features where the problem becomes linearly separable!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a₁ captures "OR-ness" (at least one is true)&lt;/li&gt;
&lt;li&gt;a₂ captures "AND-ness" (both are true)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 drew a simple line&lt;/strong&gt; in this new space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a₁ - 2·a₂ = 0.25 [decision boundary]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line easily separates XOR in the transformed space!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt;: Creates useful intermediate features by drawing multiple lines/planes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;: Combines these features with another line/plane&lt;/li&gt;
&lt;li&gt;Together: They can represent any decision boundary!&lt;/li&gt;
&lt;/ul&gt;
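&lt;p&gt;The whole two-layer XOR network fits in a few lines. A minimal sketch (the output-layer weight on a₂ is taken as -3 here so that all four inputs land on the correct side of the decision boundary):&lt;/p&gt;

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    # Layer 1: two feature detectors
    a1 = relu(x1 + x2 - 0.5)   # "at least one input is 1" (OR-ness)
    a2 = relu(x1 + x2 - 1.5)   # "both inputs are 1" (AND-ness)
    # Layer 2: one line in the transformed (a1, a2) space
    z3 = a1 - 3.0 * a2 - 0.25
    return sigmoid(z3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = xor_net(x1, x2)
    print((x1, x2), round(p, 3), "->", int(p > 0.5))
```

Thresholding the sigmoid output at 0.5 reproduces the XOR truth table exactly.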

&lt;h3&gt;
  
  
  The Complete Architecture
&lt;/h3&gt;

&lt;p&gt;A typical neural network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Layer (raw data) 
    → Hidden Layer 1 (low-level features)
    → Hidden Layer 2 (mid-level features)
    → Hidden Layer 3 (high-level features)
    → Output Layer (predictions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The power lies not in any single neuron, but in the billions of connections between them, each with its own weight, collectively forming a function approximator of extraordinary flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Universal Approximation Theorem
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Universal Approximation Theorem (1989)&lt;/strong&gt; proves:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A neural network with just one hidden layer can approximate any continuous function, given enough neurons.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But “enough” might mean billions of neurons, which is impractical.&lt;br&gt;&lt;br&gt;
Deep (multi-layer) networks achieve the same expressive power &lt;em&gt;more efficiently&lt;/em&gt; through &lt;strong&gt;hierarchical composition&lt;/strong&gt;, reusing simple features to build abstractions.&lt;/p&gt;

&lt;p&gt;So, in theory, neural networks can learn any mapping; in practice, depth makes it tractable.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Activation Functions: Breaking Linearity
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Linear Trap: A Fundamental Problem
&lt;/h3&gt;

&lt;p&gt;Imagine we build a neural network with three layers, but we don't use activation functions. Let's trace through what happens mathematically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1&lt;/strong&gt;: &lt;code&gt;z₁ = W₁x + b₁&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;: &lt;code&gt;z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂&lt;/code&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Layer 3&lt;/strong&gt;: &lt;code&gt;z₃ = W₃z₂ + b₃ = W₃(W₂W₁x + W₂b₁ + b₂) + b₃&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Simplifying: &lt;code&gt;z₃ = (W₃W₂W₁)x + (W₃W₂b₁ + W₃b₂ + b₃)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice what happened? No matter how many layers we add, we always end up with &lt;code&gt;Wx + b&lt;/code&gt;: a simple linear function. A product of matrices is still a matrix. We've built an expensive way to do linear regression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is catastrophic.&lt;/strong&gt; Linear functions can only model linear relationships. The real world is non-linear. The path of a thrown ball, the spread of a virus, the relationship between study time and test scores—all non-linear.&lt;/p&gt;
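&lt;p&gt;You can verify the collapse numerically. A quick sketch with random weights (the shapes are arbitrary): three activation-free layers equal one linear layer whose weight is the product W₃W₂W₁ and whose bias folds the three biases together.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2, W3 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b1, b2, b3 = rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)

# Three "layers" with no activation functions
z3 = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# One equivalent linear layer
W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3

print(np.allclose(z3, W @ x + b))  # True: the depth bought us nothing
```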
&lt;h3&gt;
  
  
  The Solution: Non-Linear Activation Functions
&lt;/h3&gt;

&lt;p&gt;After each neuron computes its weighted sum, we pass it through a non-linear activation function: &lt;code&gt;a = σ(z)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This single addition breaks the linear trap. Now our layers actually do different things, building increasingly complex representations.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Makes a Good Activation Function?
&lt;/h3&gt;

&lt;p&gt;Let's think about what properties we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-linearity&lt;/strong&gt; (obviously, or we're back where we started)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differentiability&lt;/strong&gt; (we'll need derivatives for learning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computational efficiency&lt;/strong&gt; (we'll apply it billions of times)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid saturation&lt;/strong&gt; (outputs shouldn't always be at extremes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-centered or positive&lt;/strong&gt; (depending on the problem)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Common Activation Functions
&lt;/h3&gt;
&lt;h4&gt;
  
  
  ReLU (Rectified Linear Unit): &lt;code&gt;f(x) = max(0, x)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33dr2n7h7gyni9o6c42r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33dr2n7h7gyni9o6c42r.png" alt="alt text" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead simple: if input is positive, output equals input; if negative, output is zero&lt;/li&gt;
&lt;li&gt;Non-linear despite looking linear (the "kink" at zero creates non-linearity)&lt;/li&gt;
&lt;li&gt;Computationally trivial: just one comparison, no multiplications or exponentials&lt;/li&gt;
&lt;li&gt;Doesn't saturate for positive values (unlike sigmoid)&lt;/li&gt;
&lt;li&gt;Induces sparsity: many neurons output exactly zero, creating efficient representations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Dying ReLU": if a neuron's weights push it permanently into negative territory, its gradient becomes zero and it stops learning forever&lt;/li&gt;
&lt;li&gt;Not zero-centered: all outputs are positive, which can slow convergence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Variants:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leaky ReLU&lt;/strong&gt;: &lt;code&gt;f(x) = max(0.01x, x)&lt;/code&gt; — allows small gradients when x &amp;lt; 0, preventing death&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELU (Exponential Linear Unit)&lt;/strong&gt;: Smooth curve for negative values, better learning dynamics&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Sigmoid: &lt;code&gt;f(x) = 1/(1 + e^(-x))&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b8sv3se51qlj9vepv2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b8sv3se51qlj9vepv2k.png" alt="alt text" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it exists:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Squashes any input into range (0, 1)&lt;/li&gt;
&lt;li&gt;Historically motivated by biological neurons (firing rates between 0 and 1)&lt;/li&gt;
&lt;li&gt;Output can be interpreted as probability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's happening?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For large positive x: &lt;code&gt;e^(-x)&lt;/code&gt; approaches 0, so output approaches 1&lt;/li&gt;
&lt;li&gt;For large negative x: &lt;code&gt;e^(-x)&lt;/code&gt; approaches infinity, so output approaches 0&lt;/li&gt;
&lt;li&gt;At x = 0: output is 0.5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it's problematic:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vanishing gradients&lt;/strong&gt;: For large positive or negative inputs, the sigmoid is nearly flat, so its derivative approaches zero. During backpropagation, gradients are multiplied across layers, and near-zero factors shrink the product toward zero. Deep networks can't learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not zero-centered&lt;/strong&gt;: Outputs always positive (0 to 1), causing zig-zagging during optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computationally expensive&lt;/strong&gt;: Exponential function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it's still used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Output layer for binary classification (want probability between 0 and 1)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Tanh: &lt;code&gt;f(x) = (e^x - e^(-x))/(e^x + e^(-x))&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqid12xsk2zwj7kh5jom4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqid12xsk2zwj7kh5jom4.png" alt="alt text" width="587" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Advantages over sigmoid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-centered: outputs range from -1 to 1&lt;/li&gt;
&lt;li&gt;Stronger gradients: derivative at zero is 1 (compared to 0.25 for sigmoid)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Still suffers from:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanishing gradients for extreme values&lt;/li&gt;
&lt;li&gt;Computational cost of exponentials&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Softmax: &lt;code&gt;f(x_i) = e^(x_i) / Σe^(x_j)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Completely different purpose:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not used between hidden layers&lt;/li&gt;
&lt;li&gt;Exclusively for multi-class classification output layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In simple terms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes a vector of arbitrary values (logits)&lt;/li&gt;
&lt;li&gt;Converts them into probabilities that sum to 1&lt;/li&gt;
&lt;li&gt;Exponentiation ensures all values are positive&lt;/li&gt;
&lt;li&gt;Division by sum ensures they sum to 1&lt;/li&gt;
&lt;li&gt;Higher inputs get exponentially higher probabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: [2.0, 1.0, 0.1]&lt;/li&gt;
&lt;li&gt;After softmax: [0.659, 0.242, 0.099]&lt;/li&gt;
&lt;li&gt;Notice: still ordered the same way, but now they're probabilities&lt;/li&gt;
&lt;/ul&gt;
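&lt;p&gt;All four activations fit in a few lines of NumPy. A minimal sketch that also reproduces the softmax example above (the max-subtraction trick is a standard stability detail, not part of the formula):&lt;/p&gt;

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```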
&lt;h3&gt;
  
  
  Why Different Layers Need Different Activations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hidden Layers&lt;/strong&gt;: ReLU family (efficiency, avoiding vanishing gradients)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary Classification Output&lt;/strong&gt;: Sigmoid (get probability for one class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-class Classification Output&lt;/strong&gt;: Softmax (get probability distribution over all classes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression Output&lt;/strong&gt;: Often no activation (or linear) — we want the raw value, not a bounded one&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  3. Forward Propagation: The Prediction Process
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is Propagation?
&lt;/h3&gt;

&lt;p&gt;"Propagation" is just a fancy word for "passing information through." Forward propagation is the process of taking input data and pushing it through every layer until we get a prediction.&lt;/p&gt;

&lt;p&gt;Let's build this concept from absolute scratch.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Single Neuron Case
&lt;/h3&gt;

&lt;p&gt;You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: &lt;code&gt;x = 3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Weight: &lt;code&gt;w = 2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bias: &lt;code&gt;b = 1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Linear combination&lt;/strong&gt; &lt;code&gt;z = wx + b = 2(3) + 1 = 7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Activation&lt;/strong&gt; &lt;code&gt;a = ReLU(z) = max(0, 7) = 7&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That's it. The neuron outputs 7. This output might be the final prediction (if it's the only neuron), or it might be input to the next layer.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Inputs, Single Neuron
&lt;/h3&gt;

&lt;p&gt;Now you have three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inputs: &lt;code&gt;x = [x₁=2, x₂=3, x₃=1]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Weights: &lt;code&gt;w = [w₁=0.5, w₂=-1, w₃=2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Bias: &lt;code&gt;b = 1&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Weighted sum&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = w₁x₁ + w₂x₂ + w₃x₃ + b
z = 0.5(2) + (-1)(3) + 2(1) + 1
z = 1 - 3 + 2 + 1 = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Activation&lt;/strong&gt; &lt;code&gt;a = ReLU(1) = 1&lt;/code&gt;&lt;/p&gt;
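&lt;p&gt;The same two steps in code, checking the arithmetic above:&lt;/p&gt;

```python
x = [2, 3, 1]
w = [0.5, -1, 2]
b = 1

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # 1 - 3 + 2 + 1 = 1
a = max(0, z)                                 # ReLU
print(z, a)  # 1.0 1.0
```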

&lt;h3&gt;
  
  
  Single Layer: Multiple Neurons
&lt;/h3&gt;

&lt;p&gt;Now suppose we have 3 neurons in one layer, all receiving the same 3 inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neuron 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₁₁, w₁₂, w₁₃]&lt;/code&gt;, Bias: &lt;code&gt;b₁&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₁ = ReLU(w₁₁x₁ + w₁₂x₂ + w₁₃x₃ + b₁)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Neuron 2:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₂₁, w₂₂, w₂₃]&lt;/code&gt;, Bias: &lt;code&gt;b₂&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₂ = ReLU(w₂₁x₁ + w₂₂x₂ + w₂₃x₃ + b₂)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Neuron 3:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights: &lt;code&gt;[w₃₁, w₃₂, w₃₃]&lt;/code&gt;, Bias: &lt;code&gt;b₃&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;a₃ = ReLU(w₃₁x₁ + w₃₂x₂ + w₃₃x₃ + b₃)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The layer transforms input vector &lt;code&gt;[x₁, x₂, x₃]&lt;/code&gt; into output vector &lt;code&gt;[a₁, a₂, a₃]&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Matrix Representation: Scaling to Thousands of Neurons
&lt;/h3&gt;

&lt;p&gt;Writing out every neuron individually is tedious. We use matrix notation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weight Matrix W:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W = [w₁₁  w₁₂  w₁₃]
    [w₂₁  w₂₂  w₂₃]
    [w₃₁  w₃₂  w₃₃]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row represents one neuron's weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input Vector x:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = [x₁]
    [x₂]
    [x₃]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Forward propagation for the layer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = Wx + b
a = ReLU(z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single matrix multiplication computes all neurons simultaneously. With modern GPUs optimized for matrix operations, we can process thousands of neurons in parallel.&lt;/p&gt;
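&lt;p&gt;Here's the whole layer as one matrix multiply in NumPy (the weights are made up; the first row reuses the single-neuron example above):&lt;/p&gt;

```python
import numpy as np

# One layer: 3 neurons, 3 inputs. Each ROW of W holds one neuron's weights.
W = np.array([[0.5, -1.0,  2.0],
              [1.0,  0.5, -0.5],
              [-2.0, 1.0,  1.0]])
b = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 3.0, 1.0])

z = W @ x + b            # all three weighted sums in one matrix multiply
a = np.maximum(0.0, z)   # ReLU applied elementwise

print(z)  # weighted sums: 1, 3, -1
print(a)  # after ReLU:    1, 3,  0
```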

&lt;h3&gt;
  
  
  Deep Networks: Chaining Layers
&lt;/h3&gt;

&lt;p&gt;Now stack multiple layers. The output of layer 1 becomes the input to layer 2:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z¹ = W¹x + b¹
a¹ = ReLU(z¹)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z² = W²a¹ + b²
a² = ReLU(z²)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 3 (output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z³ = W³a² + b³
ŷ = softmax(z³)  [if classification]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final output &lt;code&gt;ŷ&lt;/code&gt; is our prediction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concrete Example: Digit Recognition
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tuoc4vs2qmzeoe470b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tuoc4vs2qmzeoe470b.png" alt="alt text" width="372" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: 28×28 pixel image of a handwritten digit (flattened to 784 values)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input layer: 784 neurons&lt;/li&gt;
&lt;li&gt;Hidden layer 1: 128 neurons (with ReLU)&lt;/li&gt;
&lt;li&gt;Hidden layer 2: 64 neurons (with ReLU)&lt;/li&gt;
&lt;li&gt;Output layer: 10 neurons (with softmax for digits 0-9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Forward propagation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z¹ = W¹x + b¹           [128 values]
a¹ = ReLU(z¹)           [128 values]

z² = W²a¹ + b²          [64 values]
a² = ReLU(z²)           [64 values]

z³ = W³a² + b³          [10 values]
ŷ = softmax(z³)         [10 probabilities summing to 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output might be: &lt;code&gt;[0.01, 0.02, 0.05, 0.7, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The network predicts "3" with 70% confidence (index 3 has highest probability).&lt;/p&gt;
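&lt;p&gt;The full forward pass for this architecture is short. A sketch with randomly initialized (untrained) weights, just to show the shapes flowing through 784 → 128 → 64 → 10:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

# Randomly initialized 784 -> 128 -> 64 -> 10 network (shapes only, not trained)
W1, b1 = rng.normal(0, 0.05, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (64, 128)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.05, (10, 64)),   np.zeros(10)

x = rng.random(784)            # stand-in for a flattened 28x28 image

a1 = relu(W1 @ x + b1)         # 128 values
a2 = relu(W2 @ a1 + b2)        # 64 values
y_hat = softmax(W3 @ a2 + b3)  # 10 probabilities summing to 1

print(y_hat.shape, round(y_hat.sum(), 6))  # (10,) 1.0
print("predicted digit:", int(y_hat.argmax()))
```

With random weights the prediction is meaningless; training (covered next) is what makes argmax line up with the true digit.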

&lt;h3&gt;
  
  
  Why "Forward"?
&lt;/h3&gt;

&lt;p&gt;Because information flows in one direction: from input → through hidden layers → to output. No loops, no feedback (in standard feedforward networks). Each layer only looks forward, never backward.&lt;/p&gt;

&lt;p&gt;Later, during learning, we'll propagate in the opposite direction (backward) to adjust weights. But prediction is always forward.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Loss Functions: Quantifying Error
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxrn81o50b9cin3alg19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxrn81o50b9cin3alg19.png" alt="alt text" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Do We Need Loss?
&lt;/h3&gt;

&lt;p&gt;Imagine you're teaching a child to draw circles. They draw something. How do you tell them how "wrong" it is? You need a measurement: some way to quantify the difference between what they drew and a perfect circle.&lt;/p&gt;

&lt;p&gt;Neural networks face the same problem. After forward propagation, we have a prediction &lt;code&gt;ŷ&lt;/code&gt;. We also have the true answer &lt;code&gt;y&lt;/code&gt;. The &lt;strong&gt;loss function&lt;/strong&gt; &lt;code&gt;L(ŷ, y)&lt;/code&gt; measures how wrong the prediction is.&lt;/p&gt;

&lt;p&gt;This single number is crucial because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It tells us how well the model is performing&lt;/li&gt;
&lt;li&gt;It guides the learning process (we'll adjust weights to minimize this number)&lt;/li&gt;
&lt;li&gt;Different problems need different ways of measuring "wrongness"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Property Requirements for Loss Functions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-negative&lt;/strong&gt;: &lt;code&gt;L ≥ 0&lt;/code&gt; always (can't be "negative wrong")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero when perfect&lt;/strong&gt;: &lt;code&gt;L = 0&lt;/code&gt; when &lt;code&gt;ŷ = y&lt;/code&gt; exactly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increases with error&lt;/strong&gt;: Worse predictions → higher loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differentiable&lt;/strong&gt;: We need gradients for learning (calculus requirement)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appropriate for the task&lt;/strong&gt;: Regression vs classification need different measures&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Mean Squared Error (MSE): For Regression
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Predict a continuous value (house price, temperature, stock price)&lt;/p&gt;

&lt;p&gt;The most intuitive approach: absolute difference &lt;code&gt;|ŷ - y|&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If true value is 100 and we predict 90, error = 10&lt;/li&gt;
&lt;li&gt;Simple, interpretable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there's a problem: absolute value isn't differentiable at zero (the derivative has a discontinuity). This complicates learning algorithms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better approach: Square the difference&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (ŷ - y)²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why squaring?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always positive (negative errors don't cancel positive ones)&lt;/li&gt;
&lt;li&gt;Differentiable everywhere: &lt;code&gt;dL/dŷ = 2(ŷ - y)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Penalizes large errors more (error of 10 contributes 100, but error of 1 contributes only 1)&lt;/li&gt;
&lt;li&gt;Mathematically convenient (leads to elegant solutions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For multiple predictions (a batch):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MSE = (1/n) Σᵢ(ŷᵢ - yᵢ)²
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We average across all samples to get a single loss value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicting house prices&lt;/li&gt;
&lt;li&gt;True prices: [200k, 300k, 250k]&lt;/li&gt;
&lt;li&gt;Predicted: [210k, 280k, 255k]&lt;/li&gt;
&lt;li&gt;Errors: [10k, -20k, 5k]&lt;/li&gt;
&lt;li&gt;Squared errors: [100M, 400M, 25M]&lt;/li&gt;
&lt;li&gt;MSE = (100M + 400M + 25M) / 3 = 175M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The large middle error dominates the loss, signaling that's where improvement is needed most.&lt;/p&gt;
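&lt;p&gt;The house-price MSE above, computed directly (prices in thousands, so the squared errors are in thousands², matching the "M" figures):&lt;/p&gt;

```python
true_prices = [200, 300, 250]   # in thousands of dollars
predicted   = [210, 280, 255]

errors  = [p - t for p, t in zip(predicted, true_prices)]  # [10, -20, 5]
squared = [e ** 2 for e in errors]                         # [100, 400, 25]
mse     = sum(squared) / len(squared)

print(mse)  # 175.0 (thousands squared, i.e. the 175M above)
```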

&lt;p&gt;&lt;strong&gt;Variant: MAE (Mean Absolute Error)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAE = (1/n) Σᵢ|ŷᵢ - yᵢ|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;More robust to outliers (doesn't square them)&lt;/li&gt;
&lt;li&gt;Less sensitive to large errors&lt;/li&gt;
&lt;li&gt;Harder to optimize (non-smooth at zero)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cross-Entropy Loss: For Classification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Predict discrete categories (cat vs dog, spam vs ham, digit 0-9)&lt;/p&gt;

&lt;p&gt;MSE doesn't work well here. Why? Because classification outputs are probabilities, and we need to measure "how wrong" a probability distribution is.&lt;/p&gt;

&lt;h4&gt;
  
  
  Binary Cross-Entropy (Two Classes)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: &lt;code&gt;y ∈ {0, 1}&lt;/code&gt; (e.g., 0 = not spam, 1 = spam)&lt;/li&gt;
&lt;li&gt;Predicted probability: &lt;code&gt;ŷ ∈ [0, 1]&lt;/code&gt; (from sigmoid activation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If true label is 1 (positive class):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If we predict ŷ = 1.0 (certain it's positive): perfect, loss should be 0&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.9 (very confident): small loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.5 (uncertain): moderate loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.1 (confident it's negative): large loss&lt;/li&gt;
&lt;li&gt;If we predict ŷ = 0.0 (certain it's negative): infinite loss (catastrophically wrong)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The formula that captures this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case 1: y = 1 (true class is positive)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(ŷ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If ŷ = 1: L = -log(1) = 0 ✓&lt;/li&gt;
&lt;li&gt;If ŷ = 0.5: L = -log(0.5) ≈ 0.69&lt;/li&gt;
&lt;li&gt;If ŷ = 0.1: L = -log(0.1) ≈ 2.30&lt;/li&gt;
&lt;li&gt;If ŷ → 0: L → ∞ (massive penalty for confident wrong answer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case 2: y = 0 (true class is negative)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(1-ŷ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If ŷ = 0: L = -log(1) = 0 ✓&lt;/li&gt;
&lt;li&gt;If ŷ = 0.5: L = -log(0.5) ≈ 0.69&lt;/li&gt;
&lt;li&gt;If ŷ = 0.9: L = -log(0.1) ≈ 2.30&lt;/li&gt;
&lt;li&gt;If ŷ → 1: L → ∞&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The logarithm creates the right penalty structure: small errors have small losses, but confident mistakes are punished severely.&lt;/p&gt;
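&lt;p&gt;The penalty structure is easy to verify numerically (a minimal sketch; the &lt;code&gt;eps&lt;/code&gt; clamp is a standard guard against &lt;code&gt;log(0)&lt;/code&gt;):&lt;/p&gt;

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for a single example; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label is 1: loss grows as the predicted probability drops
for y_hat in [0.9, 0.5, 0.1]:
    print(f"yhat={y_hat}: loss={binary_cross_entropy(1, y_hat):.2f}")
# 0.9 -> 0.11, 0.5 -> 0.69, 0.1 -> 2.30, matching the table above
```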

&lt;p&gt;&lt;strong&gt;Why "cross-entropy"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It comes from information theory. Cross-entropy measures the average number of bits needed to encode data from one distribution using another distribution. Here, we're measuring the "distance" between the true distribution (y) and predicted distribution (ŷ).&lt;/p&gt;

&lt;h4&gt;
  
  
  Categorical Cross-Entropy (Multiple Classes)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: one-hot encoded vector (e.g., [0, 0, 1, 0, 0] for class 3)&lt;/li&gt;
&lt;li&gt;Predicted: probability distribution from softmax (e.g., [0.1, 0.2, 0.5, 0.15, 0.05])&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -Σᵢ yᵢ·log(ŷᵢ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since y is one-hot (only one element is 1, rest are 0), this simplifies to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = -log(ŷ_true_class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example: Digit classification (0-9)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True label: 7 → one-hot: [0,0,0,0,0,0,0,1,0,0]&lt;/li&gt;
&lt;li&gt;Predicted: [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loss = -log(0.4) ≈ 0.916&lt;/p&gt;

&lt;p&gt;If the model had predicted 7 with 0.9 probability: Loss = -log(0.9) ≈ 0.105 (much better)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;: We only care about the probability assigned to the correct class. The loss increases as this probability decreases.&lt;/p&gt;
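&lt;p&gt;Since only the correct class matters, the whole loss reduces to one line of code (a sketch using the digit example above):&lt;/p&gt;

```python
import math

def categorical_cross_entropy(true_class, probs):
    """Loss is just -log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

# Digit-classification example from above: true digit is 7
probs = [0.05, 0.05, 0.1, 0.05, 0.05, 0.05, 0.1, 0.4, 0.1, 0.05]
print(categorical_cross_entropy(7, probs))  # ~0.916
```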

&lt;h3&gt;
  
  
  Choosing the Right Loss Function
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Regression (predicting continuous values):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MSE: Standard choice, penalizes large errors heavily&lt;/li&gt;
&lt;li&gt;MAE: More robust to outliers&lt;/li&gt;
&lt;li&gt;Huber Loss: Combines benefits of both (MSE for small errors, MAE for large)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Binary Classification:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary Cross-Entropy: Standard choice when using sigmoid output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-class Classification:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categorical Cross-Entropy: When labels are one-hot encoded&lt;/li&gt;
&lt;li&gt;Sparse Categorical Cross-Entropy: When labels are integers (more memory efficient)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Loss Functions:&lt;/strong&gt; Sometimes you need domain-specific losses. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medical diagnosis: False negatives might be more costly than false positives&lt;/li&gt;
&lt;li&gt;Image generation: Perceptual losses that compare high-level features, not pixels&lt;/li&gt;
&lt;li&gt;Reinforcement learning: Reward-based losses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loss function is the objective we're optimizing. Choose it carefully—your model will become excellent at minimizing it, for better or worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Backpropagation: The Learning Algorithm
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq6nclr8x9jdso1qt75e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq6nclr8x9jdso1qt75e.png" alt="alt text" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This step is crucial: it’s where the real &lt;em&gt;learning&lt;/em&gt; happens.&lt;/p&gt;

&lt;p&gt;Our neural network has millions of tiny adjustable numbers called &lt;strong&gt;weights&lt;/strong&gt;. We make a prediction, compare it with the correct answer, and realize we’re off. The big question is: &lt;strong&gt;how do we tweak those millions of weights to make the next prediction better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not as simple as it sounds. Each weight affects many others, and changing even one can ripple through the entire network. Should we increase it or decrease it? And by how much?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;backpropagation&lt;/strong&gt; comes in: a beautifully systematic way to figure out exactly how every single weight should change to reduce the overall error.&lt;/p&gt;

&lt;p&gt;To really &lt;em&gt;grasp&lt;/em&gt; what’s happening here, you’ll need a bit of comfort with &lt;strong&gt;calculus&lt;/strong&gt;, especially with &lt;strong&gt;derivatives&lt;/strong&gt; and how small changes in one variable affect another. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Insight: The Chain Rule of Calculus
&lt;/h3&gt;

&lt;p&gt;Everything in backpropagation stems from one calculus concept: &lt;a href="https://en.wikipedia.org/wiki/Chain_rule" rel="noopener noreferrer"&gt;the chain rule&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple example:&lt;/strong&gt; If &lt;code&gt;z = f(y)&lt;/code&gt; and &lt;code&gt;y = g(x)&lt;/code&gt;, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dz/dx = (dz/dy) · (dy/dx)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In words&lt;/strong&gt;: The rate of change of z with respect to x equals the rate of change of z with respect to y, multiplied by the rate of change of y with respect to x.&lt;/p&gt;

&lt;p&gt;This might seem abstract, so let's make it concrete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concrete Example: A Tiny Network
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One input: x = 2&lt;/li&gt;
&lt;li&gt;One weight: w = 3&lt;/li&gt;
&lt;li&gt;One bias: b = 1&lt;/li&gt;
&lt;li&gt;Activation: ReLU&lt;/li&gt;
&lt;li&gt;True output: y = 15&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Forward pass:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = wx + b = 3(2) + 1 = 7
a = ReLU(z) = 7
L = (a - y)² = (7 - 15)² = 64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loss is 64. We want to reduce it. Should we increase or decrease w?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward pass (backpropagation):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need &lt;code&gt;dL/dw&lt;/code&gt; (how much does loss change when we change w?).&lt;/p&gt;

&lt;p&gt;Using the chain rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dw = (dL/da) · (da/dz) · (dz/dw)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's calculate each piece:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: dL/da&lt;/strong&gt; (how does loss change with activation?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (a - y)²
dL/da = 2(a - y) = 2(7 - 15) = -16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: da/dz&lt;/strong&gt; (how does activation change with pre-activation?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = ReLU(z) = max(0, z)
For z &amp;gt; 0: da/dz = 1
For z ≤ 0: da/dz = 0
Since z = 7 &amp;gt; 0: da/dz = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: dz/dw&lt;/strong&gt; (how does pre-activation change with weight?)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = wx + b
dz/dw = x = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Combine them:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dw = (dL/da) · (da/dz) · (dz/dw)
dL/dw = (-16) · (1) · (2) = -32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation&lt;/strong&gt;: The gradient is -32. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If we increase w by a tiny amount, the loss will decrease by approximately 32 times that amount&lt;/li&gt;
&lt;li&gt;The negative sign tells us to increase w (move opposite to the gradient)&lt;/li&gt;
&lt;li&gt;The magnitude (32) tells us how sensitive the loss is to changes in w&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Update the weight:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;w_new = w_old - learning_rate · (dL/dw)
w_new = 3 - 0.01 · (-32) = 3 + 0.32 = 3.32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've just learned! The network adjusted its weight to reduce the loss.&lt;/p&gt;
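&lt;p&gt;The whole worked example fits in a dozen lines; this is a direct transcription of the numbers above:&lt;/p&gt;

```python
# Tiny network from the example: one weight, one bias, ReLU, squared error.
x, w, b, y = 2.0, 3.0, 1.0, 15.0
lr = 0.01

# Forward pass
z = w * x + b                  # 7
a = max(0.0, z)                # ReLU -> 7
loss = (a - y) ** 2            # 64

# Backward pass (chain rule)
dL_da = 2 * (a - y)            # -16
da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative -> 1
dz_dw = x                      # 2
dL_dw = dL_da * da_dz * dz_dw  # -32

# Gradient descent step
w_new = w - lr * dL_dw
print(loss, dL_dw, w_new)      # 64.0 -32.0 3.32
```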

&lt;h3&gt;
  
  
  Scaling to Deep Networks
&lt;/h3&gt;

&lt;p&gt;In real networks with many layers, we calculate gradients layer by layer, moving backward from the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: 3-layer network&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward pass:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: z¹ = W¹x + b¹,  a¹ = ReLU(z¹)
Layer 2: z² = W²a¹ + b², a² = ReLU(z²)
Layer 3: z³ = W³a² + b³, ŷ = softmax(z³)
Loss: L = CrossEntropy(ŷ, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backward pass:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 (output layer):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz³ = ŷ - y  [derivative of softmax + cross-entropy]
dL/dW³ = (dL/dz³) · a²ᵀ
dL/db³ = dL/dz³
dL/da² = W³ᵀ · (dL/dz³)  [pass gradient to previous layer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz² = (dL/da²) ⊙ ReLU'(z²)  [⊙ is element-wise multiplication]
dL/dW² = (dL/dz²) · a¹ᵀ
dL/db² = dL/dz²
dL/da¹ = W²ᵀ · (dL/dz²)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Layer 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz¹ = (dL/da¹) ⊙ ReLU'(z¹)
dL/dW¹ = (dL/dz¹) · xᵀ
dL/db¹ = dL/dz¹
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate gradient with respect to pre-activation (z)&lt;/li&gt;
&lt;li&gt;Calculate gradient for weights: &lt;code&gt;dL/dW = (dL/dz) · inputᵀ&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Calculate gradient for bias: &lt;code&gt;dL/db = dL/dz&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Pass gradient backward: &lt;code&gt;dL/d(previous_activation) = Wᵀ · (dL/dz)&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
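&lt;p&gt;The four-step pattern can be packed into a short NumPy sketch. Layer sizes and the random data here are illustrative, columns are individual samples (matching the column-vector equations above), and &lt;code&gt;ReLU'&lt;/code&gt; is implemented as a 0/1 mask:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # stabilized
    return e / e.sum(axis=0, keepdims=True)

# Illustrative shapes: 4 inputs -> 5 hidden -> 5 hidden -> 3 classes, batch of 2
x = rng.standard_normal((4, 2))
y = np.eye(3)[:, [0, 2]]  # one-hot targets, one column per sample
W1, b1 = rng.standard_normal((5, 4)), np.zeros((5, 1))
W2, b2 = rng.standard_normal((5, 5)), np.zeros((5, 1))
W3, b3 = rng.standard_normal((3, 5)), np.zeros((3, 1))

# Forward pass
z1 = W1 @ x + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; a2 = relu(z2)
z3 = W3 @ a2 + b3; y_hat = softmax(z3)

# Backward pass, mirroring the equations above
dz3 = y_hat - y                     # softmax + cross-entropy
dW3 = dz3 @ a2.T; db3 = dz3.sum(axis=1, keepdims=True)
da2 = W3.T @ dz3                    # pass gradient to previous layer
dz2 = da2 * (z2 > 0)                # elementwise ReLU'(z2)
dW2 = dz2 @ a1.T; db2 = dz2.sum(axis=1, keepdims=True)
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ x.T;  db1 = dz1.sum(axis=1, keepdims=True)

print(dW1.shape, dW2.shape, dW3.shape)  # (5, 4) (5, 5) (3, 5)
```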

&lt;h3&gt;
  
  
  Why "Backpropagation"?
&lt;/h3&gt;

&lt;p&gt;Because we propagate gradients backward through the network, from output to input. Each layer receives the gradient from the layer ahead, computes its own gradients, and passes gradients to the layer behind.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Vanishing Gradient Problem
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfd5eukeb9l212u5396a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfd5eukeb9l212u5396a.png" alt="alt text" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental issue in deep networks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we multiply many small numbers (gradients) together through many layers, the product can become vanishingly small—approaching zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If each layer has gradient 0.1, after 10 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.1¹⁰ = 0.0000000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The early layers receive essentially zero gradient and stop learning. The network is deep but only the last few layers are actually training.&lt;/p&gt;
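&lt;p&gt;You can see the shrinkage directly: the gradient reaching an early layer is a product of per-layer factors.&lt;/p&gt;

```python
# A gradient flowing backward through 10 layers, each contributing a factor of 0.1
per_layer = 0.1
grad = 1.0
for _ in range(10):
    grad *= per_layer

print(grad)  # ~1e-10: the signal reaching the first layer is effectively zero
```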

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ReLU activation&lt;/strong&gt;: Gradient is 1 for positive inputs (doesn't shrink)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual connections&lt;/strong&gt;: Skip connections that allow gradients to bypass layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch normalization&lt;/strong&gt;: Keeps activations in a healthy range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Careful initialization&lt;/strong&gt;: Start with weights that don't lead to extreme activations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Exploding Gradient Problem
&lt;/h3&gt;

&lt;p&gt;The opposite issue: gradients grow exponentially.&lt;/p&gt;

&lt;p&gt;If each layer has gradient 2, after 10 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2¹⁰ = 1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weights update by huge amounts, causing wild oscillations and instability. The model never converges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradient clipping&lt;/strong&gt;: Cap gradients at a maximum value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Careful initialization&lt;/strong&gt;: Start with smaller weights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch normalization&lt;/strong&gt;: Stabilizes the scale of activations and gradients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower learning rates&lt;/strong&gt;: Smaller update steps&lt;/li&gt;
&lt;/ul&gt;
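&lt;p&gt;Gradient clipping is simple to sketch. One common variant caps the gradient's L2 norm while preserving its direction (the &lt;code&gt;max_norm&lt;/code&gt; value here is illustrative):&lt;/p&gt;

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded = np.array([30.0, 40.0])        # norm 50
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped, np.linalg.norm(clipped))  # direction preserved, norm capped at 5
```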

&lt;h3&gt;
  
  
  Computational Efficiency: Why Backpropagation is Brilliant
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Naive approach to finding gradients:&lt;/strong&gt; For each weight, we could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a tiny change: &lt;code&gt;w → w + ε&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Recalculate the entire loss&lt;/li&gt;
&lt;li&gt;Compute: &lt;code&gt;(L_new - L_old) / ε&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a network with 1 million weights, this requires 1 million forward passes. Computationally prohibitive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation insight:&lt;/strong&gt; Calculate all gradients in a single backward pass by reusing intermediate calculations. For N weights, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 forward pass&lt;/li&gt;
&lt;li&gt;1 backward pass&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Backpropagation computes all million gradients with just two passes through the network. This is why deep learning became practical.&lt;/p&gt;
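&lt;p&gt;The naive approach isn't useless, though: it doubles as a sanity check for backpropagation. Here it is on the tiny network from earlier, agreeing with the chain-rule answer of -32:&lt;/p&gt;

```python
# Finite differences: perturb w, recompute the loss, divide by the perturbation.
def loss_fn(w, x=2.0, b=1.0, y=15.0):
    a = max(0.0, w * x + b)  # linear layer + ReLU
    return (a - y) ** 2

eps = 1e-6
w = 3.0
numerical = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
analytic = 2 * (max(0.0, w * 2.0 + 1.0) - 15.0) * 1.0 * 2.0  # chain rule

print(numerical, analytic)  # both ~ -32
```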

&lt;h3&gt;
  
  
  The Mathematics: Derivatives of Common Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ReLU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(x) = max(0, x)
f'(x) = 1 if x &amp;gt; 0, else 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sigmoid:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;σ(x) = 1/(1 + e^(-x))
σ'(x) = σ(x)(1 - σ(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tanh:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
tanh'(x) = 1 - tanh²(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Softmax + Cross-Entropy (combined):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dL/dz = ŷ - y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This remarkably simple gradient is why we use softmax with cross-entropy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MSE:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L = (ŷ - y)²
dL/dŷ = 2(ŷ - y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Memory Requirements
&lt;/h3&gt;

&lt;p&gt;Backpropagation requires storing all activations from the forward pass to compute gradients in the backward pass. For a network with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch size: 32&lt;/li&gt;
&lt;li&gt;4 layers with 1000 neurons each&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We must store: 32 × 4 × 1000 = 128,000 activation values in memory.&lt;/p&gt;

&lt;p&gt;This is why training large models requires substantial GPU memory, and why techniques like &lt;strong&gt;gradient checkpointing&lt;/strong&gt; (recomputing some activations rather than storing them) become necessary.&lt;/p&gt;
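&lt;p&gt;The arithmetic above, with a byte count added (assuming 4-byte float32 activations):&lt;/p&gt;

```python
# Activation count for the example network above
batch, layers, neurons = 32, 4, 1000
values = batch * layers * neurons
print(values)               # 128000 stored activation values
print(values * 4 / 1024)    # 500.0 KiB at 4 bytes per float32 value
```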




&lt;h2&gt;
  
  
  6. Gradient Descent: The Optimization Algorithm
&lt;/h2&gt;

&lt;p&gt;Imagine you're standing on a mountain in thick fog. You can't see the bottom of the valley, but you can feel the slope beneath your feet. Your goal: reach the lowest point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: Take a step in the direction of steepest descent.&lt;/p&gt;

&lt;p&gt;This is gradient descent. The "mountain" is the &lt;strong&gt;loss landscape&lt;/strong&gt;—a high-dimensional surface where each dimension represents one weight, and the height represents the loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Mathematical Foundation
&lt;/h3&gt;

&lt;p&gt;After backpropagation, we have gradients: &lt;code&gt;∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each gradient tells us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direction&lt;/strong&gt;: Positive gradient means loss increases when weight increases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Magnitude&lt;/strong&gt;: Large gradient means weight strongly affects loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gradient descent update rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;w_new = w_old - α · (∂L/∂w)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;α&lt;/code&gt; (alpha) is the &lt;strong&gt;learning rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why subtract?&lt;/strong&gt; The gradient points in the direction of &lt;em&gt;increasing&lt;/em&gt; loss. We want to &lt;em&gt;decrease&lt;/em&gt; loss, so we move in the opposite direction (negative gradient).&lt;/p&gt;
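&lt;p&gt;The update rule in action on a simple bowl-shaped loss, &lt;code&gt;L(w) = (w - 5)²&lt;/code&gt;, whose gradient is &lt;code&gt;2(w - 5)&lt;/code&gt; (a one-dimensional sketch; the minimum is at w = 5):&lt;/p&gt;

```python
# Repeatedly step against the gradient of L(w) = (w - 5)^2
w = 0.0
alpha = 0.1  # learning rate

for step in range(50):
    grad = 2 * (w - 5)
    w = w - alpha * grad  # w_new = w_old - alpha * dL/dw

print(w)  # ~5.0: converged to the minimum
```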

&lt;h3&gt;
  
  
  The Learning Rate: The Most Critical Hyperparameter
&lt;/h3&gt;

&lt;p&gt;The learning rate controls the step size. Choosing it is an art and science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too large (α = 1.0):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100
Iteration 2: Loss = 250  [overshot the minimum]
Iteration 3: Loss = 80
Iteration 4: Loss = 300  [wild oscillations]
...never converges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Too small (α = 0.000001):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100.00
Iteration 2: Loss = 99.99
Iteration 3: Loss = 99.98
...painfully slow, might get stuck in local minimum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Just right (α = 0.01):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Iteration 1: Loss = 100
Iteration 2: Loss = 85
Iteration 3: Loss = 73
...steady progress toward minimum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical ranges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small networks&lt;/strong&gt;: 0.001 - 0.01&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large networks&lt;/strong&gt;: 0.0001 - 0.001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With Adam optimizer&lt;/strong&gt;: 0.001 (default)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Variants of Gradient Descent
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Batch Gradient Descent
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use the entire dataset to compute one gradient update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Compute gradient using ALL training samples
&lt;/span&gt;    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smooth convergence&lt;/li&gt;
&lt;li&gt;Guaranteed to find the minimum (for convex functions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow: One update per epoch&lt;/li&gt;
&lt;li&gt;Memory intensive: Must load entire dataset&lt;/li&gt;
&lt;li&gt;Gets stuck in local minima (for non-convex functions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Stochastic Gradient Descent (SGD)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use one random sample at a time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Compute gradient using ONE sample
&lt;/span&gt;        &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast updates: One update per sample&lt;/li&gt;
&lt;li&gt;Can escape local minima (due to noise)&lt;/li&gt;
&lt;li&gt;Memory efficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noisy updates: path to minimum is erratic&lt;/li&gt;
&lt;li&gt;Doesn't fully utilize parallel computing (GPUs)&lt;/li&gt;
&lt;li&gt;May oscillate around minimum without settling&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Mini-Batch Gradient Descent (Most Common)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: Use a small batch of samples (typically 32, 64, 128, or 256).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Compute gradient using BATCH of samples
&lt;/span&gt;        &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balanced: More stable than SGD, faster than batch GD&lt;/li&gt;
&lt;li&gt;Efficient: Perfect for GPU parallelization&lt;/li&gt;
&lt;li&gt;Moderate memory usage&lt;/li&gt;
&lt;li&gt;Noise helps escape local minima, but not too much&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another hyperparameter to tune (batch size)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is the standard in modern deep learning.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Optimizers: Beyond Basic Gradient Descent
&lt;/h3&gt;

&lt;p&gt;Basic gradient descent treats all parameters equally and uses a fixed learning rate. Modern optimizers are more sophisticated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Momentum
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem with basic GD:&lt;/strong&gt; Imagine a narrow valley: steep sides, gentle slope toward minimum. Basic GD oscillates between sides while slowly progressing forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Momentum&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;: Remember previous gradients. If we keep going in the same direction, accelerate. If we oscillate, dampen the movement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster convergence in consistent directions&lt;/li&gt;
&lt;li&gt;Reduced oscillations&lt;/li&gt;
&lt;li&gt;Can roll through small local minima&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical β&lt;/strong&gt;: 0.9 (use 90% of previous velocity)&lt;/p&gt;

&lt;h4&gt;
  
  
  RMSprop (Root Mean Square Propagation)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Some parameters need large updates, others need small ones. A single learning rate is suboptimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Adapt the learning rate for each parameter based on recent gradient magnitudes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;
    &lt;span class="n"&gt;adjusted_gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;squared_gradient_avg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ε&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;adjusted_gradient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Intuition&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameters with consistently large gradients get smaller effective learning rates (divided by large number)&lt;/li&gt;
&lt;li&gt;Parameters with small gradients get larger effective learning rates (divided by small number)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Each parameter gets its own adaptive learning rate.&lt;/p&gt;
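&lt;p&gt;The pseudocode above, made runnable on the same toy one-parameter objective f(w) = w² (the objective is an illustrative choice; the hyperparameters are typical defaults):&lt;/p&gt;

```python
import math

# The RMSprop update, runnable on the toy objective f(w) = w**2.
def rmsprop(w0, lr=0.01, beta=0.9, eps=1e-8, steps=1000):
    w, squared_gradient_avg = w0, 0.0
    for _ in range(steps):
        gradient = 2 * w
        squared_gradient_avg = beta * squared_gradient_avg + (1 - beta) * gradient ** 2
        adjusted = gradient / (math.sqrt(squared_gradient_avg) + eps)  # per-parameter scaling
        w = w - lr * adjusted
    return w
```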

&lt;h4&gt;
  
  
  Adam (Adaptive Moment Estimation)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;The gold standard&lt;/strong&gt;: Combines momentum and RMSprop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# first moment (momentum)
&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# second moment (RMSprop)
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gradient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_gradient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Update moments
&lt;/span&gt;    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;

    &lt;span class="c1"&gt;# Bias correction (important in early iterations)
&lt;/span&gt;    &lt;span class="n"&gt;m_corrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;v_corrected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update weights
&lt;/span&gt;    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;m_corrected&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_corrected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ε&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Adam dominates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines best of both worlds: momentum + adaptive learning rates&lt;/li&gt;
&lt;li&gt;Robust to hyperparameter choices (default values work well)&lt;/li&gt;
&lt;li&gt;Efficient and converges quickly&lt;/li&gt;
&lt;li&gt;Works across diverse problem types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Default hyperparameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;learning_rate = 0.001&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;β₁ = 0.9&lt;/code&gt; (momentum)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;β₂ = 0.999&lt;/code&gt; (RMSprop)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ε = 1e-8&lt;/code&gt; (numerical stability)&lt;/li&gt;
&lt;/ul&gt;
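&lt;p&gt;Here is a runnable version of the Adam pseudocode with the default hyperparameters listed above, again on a toy one-parameter objective f(w) = w² (an illustrative choice, not a real training problem):&lt;/p&gt;

```python
import math

# Adam with the default hyperparameters, on the toy objective f(w) = w**2.
def adam(w0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        gradient = 2 * w
        m = beta1 * m + (1 - beta1) * gradient         # first moment (momentum)
        v = beta2 * v + (1 - beta2) * gradient ** 2    # second moment (RMSprop)
        m_hat = m / (1 - beta1 ** t)                   # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```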

&lt;h3&gt;
  
  
  Learning Rate Schedules
&lt;/h3&gt;

&lt;p&gt;Even with Adam, learning rates can be adjusted during training.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Step Decay
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epochs 1-30:   lr = 0.001
Epochs 31-60:  lr = 0.0001
Epochs 61+:    lr = 0.00001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Start with larger steps to quickly find the general region, then smaller steps to fine-tune.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Exponential Decay
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lr(t) = lr₀ * e^(-kt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smoothly decreases learning rate over time.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Cosine Annealing
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(πt/T))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gradually reduces learning rate following a cosine curve.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Warm Restarts
&lt;/h4&gt;

&lt;p&gt;Periodically reset learning rate to initial value. Helps escape local minima by occasionally taking large steps again.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Learning Rate Warmup
&lt;/h4&gt;

&lt;p&gt;Start with very small learning rate, gradually increase to target value over first few epochs. Prevents instability in early training.&lt;/p&gt;
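&lt;p&gt;The schedules above can be sketched as small functions. The boundaries and constants follow the examples in this section; the function names and the warmup length are illustrative choices:&lt;/p&gt;

```python
import math

# Learning rate schedules as small pure functions.
def step_decay(epoch, lr0=0.001, drop=0.1, every=30):
    return lr0 * drop ** ((epoch - 1) // every)   # 10x drop every 30 epochs (1-indexed)

def exponential_decay(t, lr0=0.001, k=0.05):
    return lr0 * math.exp(-k * t)                 # lr(t) = lr0 * e^(-kt)

def cosine_annealing(t, T=100, lr_min=0.0, lr_max=0.001):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup(step, warmup_steps=5, target=0.001):
    return target * min(1.0, (step + 1) / warmup_steps)  # linear ramp to target
```

&lt;p&gt;Warm restarts simply re-run one of these schedules from the top each cycle, e.g. by resetting t to 0 in the cosine schedule.&lt;/p&gt;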

&lt;h3&gt;
  
  
  The Convergence Question: When to Stop?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Training loss keeps decreasing&lt;/strong&gt;, but should we keep training?&lt;/p&gt;

&lt;h4&gt;
  
  
  Early Stopping
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: Monitor performance on a validation set (data the model hasn't trained on).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1:  Train Loss = 2.5, Val Loss = 2.6
Epoch 5:  Train Loss = 1.2, Val Loss = 1.3
Epoch 10: Train Loss = 0.8, Val Loss = 0.9
Epoch 15: Train Loss = 0.4, Val Loss = 0.85  [val loss barely improving]
Epoch 20: Train Loss = 0.2, Val Loss = 0.9   [val loss increasing!]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stop around epoch 15&lt;/strong&gt;: Validation loss has bottomed out while training loss keeps falling. The model is starting to overfit (memorizing training data rather than learning generalizable patterns).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;best_val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;infinity&lt;/span&gt;
&lt;span class="n"&gt;patience&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# epochs to wait for improvement
&lt;/span&gt;&lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;best_val_loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;best_val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_loss&lt;/span&gt;
        &lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;patience_counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Early stopping!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges in the Optimization Landscape
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Local Minima
&lt;/h4&gt;

&lt;p&gt;The loss surface has multiple valleys. Gradient descent might settle into a shallow local minimum instead of the deep global minimum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Momentum (can roll over small bumps)&lt;/li&gt;
&lt;li&gt;Multiple random initializations&lt;/li&gt;
&lt;li&gt;Stochastic updates (noise helps escape)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Saddle Points
&lt;/h4&gt;

&lt;p&gt;Points where gradient is zero but it's neither a minimum nor maximum—a "saddle" shape. More common than local minima in high dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Momentum helps push through&lt;/li&gt;
&lt;li&gt;Second-order methods (Newton's method)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Plateaus
&lt;/h4&gt;

&lt;p&gt;Flat regions where gradients are nearly zero. Progress stalls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive learning rates (Adam)&lt;/li&gt;
&lt;li&gt;Patience (eventually gradients increase again)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Batching and Parallelization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why batches matter for GPUs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern GPUs have thousands of cores. Computing gradients for 32 samples independently is slow. Computing them in parallel is fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matrix operations on batches:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input batch:  [32 × 784] (32 images, 784 pixels each)
Weights:      [784 × 128]
Output:       [32 × 128] (32 outputs, 128 neurons)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single matrix multiplication computes all 32 samples simultaneously. This is why GPUs are essential for deep learning.&lt;/p&gt;
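&lt;p&gt;The shape arithmetic above is easy to check with NumPy; random numbers stand in for a batch of 32 flattened 28×28 images and a 128-neuron layer's weights:&lt;/p&gt;

```python
import numpy as np

# One matrix multiplication processes the whole batch at once.
batch = np.random.randn(32, 784)      # 32 images, 784 pixels each
weights = np.random.randn(784, 128)   # one weight column per neuron
output = batch @ weights              # all 32 samples in a single matmul

print(output.shape)  # (32, 128)
```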

&lt;p&gt;&lt;strong&gt;Batch size trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small batches (e.g., 8-32):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More frequent updates&lt;/li&gt;
&lt;li&gt;More noise (helps generalization)&lt;/li&gt;
&lt;li&gt;Less memory&lt;/li&gt;
&lt;li&gt;Slower per epoch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Large batches (e.g., 256-1024):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer updates per epoch&lt;/li&gt;
&lt;li&gt;Smoother gradients&lt;/li&gt;
&lt;li&gt;More memory required&lt;/li&gt;
&lt;li&gt;Faster per epoch&lt;/li&gt;
&lt;li&gt;Risk of poor generalization (too smooth)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sweet spot&lt;/strong&gt;: Usually 32-128 for most applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Training Loop: Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Now we understand all the pieces. Here's how they work together:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize weights (Xavier/He initialization)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n_inputs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;biases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize optimizer
&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why careful initialization matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too large: Exploding activations and gradients&lt;/li&gt;
&lt;li&gt;Too small: Vanishing gradients&lt;/li&gt;
&lt;li&gt;Xavier/He initialization: Scaled to maintain activation variance across layers&lt;/li&gt;
&lt;/ul&gt;
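&lt;p&gt;A quick NumPy check of the He initialization idea: with ReLU activations, drawing weights from N(0, 2/n_inputs) keeps the second moment of activations near 1 as the signal passes through a layer. The layer sizes here are arbitrary:&lt;/p&gt;

```python
import numpy as np

# He initialization keeps the activation scale stable through a ReLU layer.
rng = np.random.default_rng(0)

def he_init(n_inputs, n_outputs):
    return rng.normal(0.0, np.sqrt(2.0 / n_inputs), size=(n_inputs, n_outputs))

x = rng.normal(size=(1000, 784))             # unit-variance inputs
a = np.maximum(0.0, x @ he_init(784, 256))   # one ReLU layer
second_moment = float((a ** 2).mean())       # stays close to 1.0
```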

&lt;h3&gt;
  
  
  The Training Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Shuffle data for randomness
&lt;/span&gt;    &lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. FORWARD PROPAGATION
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;

        &lt;span class="n"&gt;z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;
        &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
        &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W3&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt;
        &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. COMPUTE LOSS
&lt;/span&gt;        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. BACKPROPAGATION
&lt;/span&gt;        &lt;span class="n"&gt;dL_dz3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz3&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_da2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;dL_dz3&lt;/span&gt;

        &lt;span class="n"&gt;dL_dz2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_da2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;relu_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz2&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_da1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;dL_dz2&lt;/span&gt;

        &lt;span class="n"&gt;dL_dz1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_da1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;relu_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dL_dW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dL_dz1&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
        &lt;span class="n"&gt;dL_db1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dL_dz1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. OPTIMIZATION (using Adam)
&lt;/span&gt;        &lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_dW1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dL_db1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. VALIDATION
&lt;/span&gt;    &lt;span class="n"&gt;val_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Train Loss = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Val Loss = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. EARLY STOPPING CHECK
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;should_stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_loss&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# 7. FINAL EVALUATION
&lt;/span&gt;&lt;span class="n"&gt;test_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final Test Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;test_accuracy&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Happens Over Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Epoch 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weights are random&lt;/li&gt;
&lt;li&gt;Predictions are terrible (10% accuracy on 10 classes = random guessing)&lt;/li&gt;
&lt;li&gt;Loss is high (maybe 2.3)&lt;/li&gt;
&lt;li&gt;Large gradients&lt;/li&gt;
&lt;li&gt;Big weight updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 10:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network learned basic patterns&lt;/li&gt;
&lt;li&gt;Accuracy improved to 60%&lt;/li&gt;
&lt;li&gt;Loss decreased to 1.2&lt;/li&gt;
&lt;li&gt;Moderate gradients&lt;/li&gt;
&lt;li&gt;Steady learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 50:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network refined understanding&lt;/li&gt;
&lt;li&gt;Accuracy at 92%&lt;/li&gt;
&lt;li&gt;Loss at 0.3&lt;/li&gt;
&lt;li&gt;Small gradients&lt;/li&gt;
&lt;li&gt;Fine-tuning details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Epoch 100:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Diminishing returns&lt;/li&gt;
&lt;li&gt;Accuracy 93% (validation starting to plateau)&lt;/li&gt;
&lt;li&gt;Risk of overfitting&lt;/li&gt;
&lt;li&gt;Time to stop&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monitoring Training: What to Watch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Training Loss&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should decrease steadily&lt;/li&gt;
&lt;li&gt;If fluctuating wildly: learning rate too high&lt;/li&gt;
&lt;li&gt;If barely moving: learning rate too low or stuck in minimum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Validation Loss&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should track training loss initially&lt;/li&gt;
&lt;li&gt;If diverging: overfitting&lt;/li&gt;
&lt;li&gt;If much higher from start: train/val data distribution mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Gradient Norms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should be moderate (0.001 - 1.0)&lt;/li&gt;
&lt;li&gt;If very small (&amp;lt; 0.0001): vanishing gradients&lt;/li&gt;
&lt;li&gt;If very large (&amp;gt; 10): exploding gradients&lt;/li&gt;
&lt;/ul&gt;
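&lt;p&gt;One simple way to monitor this is the global L2 norm over all parameter gradients. The helper below is a sketch rather than any framework's API; the thresholds mirror the rules of thumb above:&lt;/p&gt;

```python
import numpy as np

# Global gradient norm across a list of gradient tensors.
def global_grad_norm(grads):
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

grads = [np.full((3, 3), 0.01), np.full(3, 0.01)]  # stand-ins for dL_dW, dL_db
norm = global_grad_norm(grads)

if norm < 1e-4:
    print("warning: possible vanishing gradients")
elif norm > 10:
    print("warning: possible exploding gradients")
```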

&lt;p&gt;&lt;strong&gt;4. Activation Statistics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean should be near zero&lt;/li&gt;
&lt;li&gt;Std should be moderate (~1)&lt;/li&gt;
&lt;li&gt;If activations saturate (all 0 or all max): architectural problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Learning Rate&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be adjusted based on progress&lt;/li&gt;
&lt;li&gt;Too aggressive: divergence&lt;/li&gt;
&lt;li&gt;Too conservative: slow progress&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: The Symphony of Learning
&lt;/h2&gt;

&lt;p&gt;Machine learning is not one algorithm—it's a carefully orchestrated system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; provides the capacity to represent complex functions (Universal Approximation Theorem)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation functions&lt;/strong&gt; enable non-linear transformations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward propagation&lt;/strong&gt; generates predictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss functions&lt;/strong&gt; quantify error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpropagation&lt;/strong&gt; computes gradients efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient descent&lt;/strong&gt; iteratively improves weights&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each component is essential. Remove any one, and learning fails.&lt;/p&gt;

&lt;p&gt;The beauty lies in the simplicity of each piece and the power of their combination. From these building blocks—matrix multiplications, non-linear functions, derivatives, and iterative updates—emerges the capability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recognize faces in photos&lt;/li&gt;
&lt;li&gt;Translate between languages&lt;/li&gt;
&lt;li&gt;Generate realistic images&lt;/li&gt;
&lt;li&gt;Play games at superhuman levels&lt;/li&gt;
&lt;li&gt;Predict protein structures&lt;/li&gt;
&lt;li&gt;Drive cars autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from the same fundamental algorithm, repeated billions of times, gradually sculpting random weights into a representation of the world's patterns.&lt;/p&gt;

&lt;p&gt;This is how machines learn: not through magic, but through mathematics, iteration, and the elegant interplay of calculus and optimization across high-dimensional spaces.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>neurons</category>
      <category>deeplearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>2025 Voice AI Guide: How to Make Your Own Real-Time Voice Agent (Part-1)</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 21 Sep 2025 04:34:25 +0000</pubDate>
      <link>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-1-45hl</link>
      <guid>https://forem.com/programmerraja/2025-voice-ai-guide-how-to-make-your-own-real-time-voice-agent-part-1-45hl</guid>
      <description>&lt;p&gt;Over the past few months I’ve been building a fully open-source voice agent, exploring the stack end-to-end and learning a ton along the way. Now I’m ready to share everything I discovered.  &lt;/p&gt;

&lt;p&gt;The best part? In 2025 you actually &lt;strong&gt;can&lt;/strong&gt; build one yourself. With today’s open-source models and frameworks you can piece together a real-time voice agent that listens, reasons, and talks back almost like a human, without relying on closed platforms.  &lt;/p&gt;

&lt;p&gt;Let’s walk through the building blocks, step by step.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Pipeline
&lt;/h2&gt;

&lt;p&gt;At a high level, a modern voice agent looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883x74ub7svc8h60c14p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883x74ub7svc8h60c14p.png" alt="overview" width="744" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pretty simple on paper, but each step has its own challenges. Let’s dig deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Text (STT)
&lt;/h2&gt;

&lt;p&gt;Speech is a &lt;strong&gt;continuous audio wave&lt;/strong&gt;; it doesn’t naturally have clear sentence boundaries or pauses. That’s where &lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt; comes in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD (Voice Activity Detection):&lt;/strong&gt; Detects when the user starts and stops talking. Without it, your bot either cuts you off too soon or stares at you blankly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the boundaries are clear, the audio is passed into an &lt;strong&gt;STT model&lt;/strong&gt; for transcription.&lt;/p&gt;

&lt;h4&gt;
  
  
  Popular VAD Options
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Silero VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;WebRTC VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;TEN VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Yamnet VAD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cobra (Picovoice)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State-of-the-art, &amp;gt;95% in multi-noise&lt;a href="https://www.qed42.com/insights/voice-activity-detection-in-text-to-speech-how-real-time-vad-works" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good for silence/non-silence; lower speech/noise discrimination&lt;/td&gt;
&lt;td&gt;High, lower false positives than WebRTC/Silero&lt;a href="https://www.communeify.com/en/blog/ten-vad-webrtc-killer-opensource-ai-voice-detection/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good, multi-class capable&lt;/td&gt;
&lt;td&gt;Top-tier (see Picovoice benchmarks)&lt;a href="https://picovoice.ai/docs/benchmark/vad/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms per 30+ms chunk (CPU/GPU/ONNX)&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;10-30ms frame decision, ultra low-lag&lt;a href="https://github.com/wiseman/py-webrtcvad/issues/68" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;2-5ms (real-time capable)&lt;/td&gt;
&lt;td&gt;5–10ms/classify&lt;/td&gt;
&lt;td&gt;5–10ms/classify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30, 60, 100ms selector&lt;/td&gt;
&lt;td&gt;10–30ms&lt;/td&gt;
&lt;td&gt;20ms, 40ms custom&lt;/td&gt;
&lt;td&gt;30-50ms&lt;/td&gt;
&lt;td&gt;30-50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Noise Robustness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent, trained on 100+ noises&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Poor for some background noise/overlapping speech&lt;a href="https://github.com/wiseman/py-webrtcvad/issues/68" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6000+ languages/no domain restriction&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Language-agnostic, good for basic speech/silence&lt;/td&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;td&gt;Multi-language possible&lt;/td&gt;
&lt;td&gt;Language-agnostic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Footprint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2MB JIT, &amp;lt;1MB ONNX, minimal CPU/edge&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;~158KB binary, extremely light&lt;/td&gt;
&lt;td&gt;~400KB&lt;a href="https://www.reddit.com/r/selfhosted/comments/1lvdfaq/found_a_really_wellmade_opensource_vad_great/" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;~2MB (.tflite format)&lt;/td&gt;
&lt;td&gt;Small, edge-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, supports real-time pipelines&lt;/td&gt;
&lt;td&gt;Yes, designed for telecom/audio streams&lt;/td&gt;
&lt;td&gt;Yes, real-time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, ONNX, PyTorch, Pipecat, edge/IoT data&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;C/C++/Python, embedded/web/mobile&lt;/td&gt;
&lt;td&gt;Python, C++, web&lt;/td&gt;
&lt;td&gt;TensorFlow Lite APIs&lt;/td&gt;
&lt;td&gt;Python, C, web, WASM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Licensing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (commercial/edge/distribution OK)&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;BSD (very permissive)&lt;/td&gt;
&lt;td&gt;Apache 2.0, open&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;Silero VAD&lt;/a&gt; is the gold standard and pipecat has builtin support so I have choosen that :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-1ms per chunk on CPU&lt;/li&gt;
&lt;li&gt;Just 2MB in size&lt;/li&gt;
&lt;li&gt;Handles 6000+ languages&lt;/li&gt;
&lt;li&gt;Works with 8kHz &amp;amp; 16kHz audio&lt;/li&gt;
&lt;li&gt;MIT license (unrestricted use)&lt;/li&gt;
&lt;/ul&gt;
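&lt;p&gt;To make the chunked flow concrete, here’s a minimal sketch of frame-based endpointing. The VAD itself is a stand-in: a naive energy threshold replaces the real Silero model, which scores each 30ms chunk the same way but with a small neural network.&lt;/p&gt;

```python
# Minimal sketch of frame-based voice activity gating.
# Assumption: a naive energy threshold stands in for the real Silero VAD,
# which returns a per-chunk speech probability instead.

FRAME_MS = 30          # Silero accepts 30/60/100 ms chunks
SAMPLE_RATE = 16000    # 16 kHz audio
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per chunk

def is_speech(frame, threshold=0.01):
    """Stand-in for the VAD model: mean absolute amplitude vs. a threshold."""
    return sum(abs(s) for s in frame) / len(frame) > threshold

def segment_utterances(frames, hangover_frames=10):
    """Group consecutive speech frames; close an utterance after
    hangover_frames of silence (300 ms here), mimicking endpointing."""
    utterances, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= hangover_frames:
                utterances.append(current)
                current, silence = [], 0
    if current:
        utterances.append(current)
    return utterances

# Two bursts of "speech" separated by a long silence -> two utterances,
# each of which would be handed to the STT model.
loud = [[0.5] * FRAME_SAMPLES] * 5
quiet = [[0.0] * FRAME_SAMPLES] * 12
print(len(segment_utterances(loud + quiet + loud)))  # 2
```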

&lt;h4&gt;
  
  
  Popular STT Options
&lt;/h4&gt;

&lt;p&gt;What should we focus on when choosing an STT model for a voice agent?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Word Error Rate (WER):&lt;/strong&gt; Measures transcription mistakes (lower is better).

&lt;ul&gt;
&lt;li&gt;Example: WER 5% means 5 mistakes per 100 words.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Sentence-level correctness:&lt;/strong&gt; Some models may get individual words right but fail on sentence structure.&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Multilingual support:&lt;/strong&gt; If your users speak multiple languages, check language coverage.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Noise tolerance:&lt;/strong&gt; Can it handle background noise, music, or multiple speakers?&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Accent/voice variation handling:&lt;/strong&gt; Works across accents, genders, and speech speeds.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Voice Activity Detection (VAD) integration:&lt;/strong&gt; Detects when speech starts and ends.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Streaming:&lt;/strong&gt; Most STT models work in batch mode (great for YouTube captions, bad for live conversations). For real-time agents, we need streaming output: words should appear &lt;em&gt;while you’re still speaking&lt;/em&gt;.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Low Latency:&lt;/strong&gt; Even 300–500ms delays feel unnatural. Target &lt;strong&gt;sub-second responses&lt;/strong&gt;.&lt;/li&gt;

&lt;/ul&gt;
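&lt;p&gt;The WER metric above is just word-level edit distance divided by the length of the reference transcript. A minimal sketch:&lt;/p&gt;

```python
# Word Error Rate: (substitutions + insertions + deletions) divided by the
# number of reference words. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (free on match)
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in a five-word reference -> WER 0.2 (20%).
print(wer("turn on the kitchen light", "turn on the kitchen lights"))  # 0.2
```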

&lt;p&gt;For most people, Whisper is the first name that comes to mind for speech-to-text: it has a large community, numerous variants, and OpenAI’s backing. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Whisper Family&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/openai/whisper-large-v3" rel="noopener noreferrer"&gt;Whisper Large V3&lt;/a&gt; — State-of-the-art accuracy with multilingual support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;Faster-Whisper&lt;/a&gt;&lt;/strong&gt; — Optimized implementation using CTranslate2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/huggingface/distil-whisper" rel="noopener noreferrer"&gt;Distil-Whisper&lt;/a&gt;&lt;/strong&gt; — Lightweight for resource-constrained environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/m-bain/whisperX" rel="noopener noreferrer"&gt;WhisperX&lt;/a&gt;&lt;/strong&gt; — Enhanced timestamps and speaker diarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA&lt;/strong&gt; also offers some interesting STT models, though I haven’t tried them yet since Whisper works well for my use case. I’m just listing them here for you to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/nvidia/canary-1b" rel="noopener noreferrer"&gt;Canary Qwen 2.5B&lt;/a&gt; — Leading performance, 5.63% WER&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/nvidia/parakeet-tdt-0.6b" rel="noopener noreferrer"&gt;Parakeet TDT 0.6B V2&lt;/a&gt; — Ultra-fast inference (3,386 RTFx)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the comparison table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;WER (EN, Public Bench.)&lt;/th&gt;
&lt;th&gt;Multilingual&lt;/th&gt;
&lt;th&gt;Noise/Accent/Voice&lt;/th&gt;
&lt;th&gt;Sentence Accuracy&lt;/th&gt;
&lt;th&gt;VAD Integration&lt;/th&gt;
&lt;th&gt;Streaming&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Large V3&lt;/td&gt;
&lt;td&gt;2–5%&lt;/td&gt;
&lt;td&gt;99+&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes (Silero)&lt;/td&gt;
&lt;td&gt;Batch†&lt;/td&gt;
&lt;td&gt;~700ms†&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faster-Whisper&lt;/td&gt;
&lt;td&gt;2–5%&lt;/td&gt;
&lt;td&gt;99+&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes (Silero)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~300ms‡&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Canary 1B&lt;/td&gt;
&lt;td&gt;3.06% (MLS EN) &lt;a href="https://huggingface.co/nvidia/canary-1b" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;4 (EN, DE, ES, FR)&lt;/td&gt;
&lt;td&gt;Top-tier, fair on voice/gender/age&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;~500ms–&amp;lt;1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parakeet TDT 0.6B&lt;/td&gt;
&lt;td&gt;5–7%&lt;/td&gt;
&lt;td&gt;3 (EN, DE, FR)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very Good&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ultra low (3,386× real-time throughput)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Why I Chose &lt;strong&gt;FastWhisper&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;After testing, my pick is &lt;strong&gt;FastWhisper&lt;/strong&gt;, an optimized inference engine for Whisper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;12.5× faster&lt;/strong&gt; than original Whisper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3× faster&lt;/strong&gt; than Faster-Whisper with batching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-200ms latency&lt;/strong&gt; possible with proper tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same accuracy&lt;/strong&gt; as Whisper&lt;/li&gt;
&lt;li&gt;Runs on &lt;strong&gt;CPU &amp;amp; GPU&lt;/strong&gt; with automatic fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s built in &lt;strong&gt;C++ + CTranslate2&lt;/strong&gt;, supports batching, and integrates neatly with &lt;strong&gt;VAD&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For more, check the &lt;a href="https://artificialanalysis.ai/speech-to-text" rel="noopener noreferrer"&gt;Speech to Text AI Model &amp;amp; Provider Leaderboard&lt;/a&gt;. &lt;/p&gt;
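&lt;p&gt;One practical wrinkle with streaming STT: partial hypotheses can revise earlier words, so you shouldn’t forward them downstream the moment they arrive. A common trick (an assumption on the consumer side, not something the STT engine does for you) is to commit only the prefix that two consecutive partials agree on:&lt;/p&gt;

```python
# "Local agreement" stabilization for streaming STT partials: only the
# word prefix shared by two consecutive hypotheses is committed, so the
# LLM stage never sees words that later get revised.

def stable_prefix(prev_partial: str, new_partial: str) -> str:
    """Return the longest common word prefix of two partial transcripts."""
    common = []
    for a, b in zip(prev_partial.split(), new_partial.split()):
        if a != b:
            break
        common.append(a)
    return " ".join(common)

partials = [
    "turn",
    "turn on",
    "turn on the light",   # "light" may still be revised...
    "turn on the lights",  # ...and indeed becomes "lights"
]
committed = ""
for prev, new in zip(partials, partials[1:]):
    committed = stable_prefix(prev, new)
    # forward `committed` to the next stage as it grows
print(committed)  # "turn on the"
```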

&lt;h3&gt;
  
  
  Large Language Model (LLM)
&lt;/h3&gt;

&lt;p&gt;Once speech is transcribed, the text goes into an &lt;strong&gt;LLM, the “brain” of your agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What we want in an LLM for voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understands prompts, history, and context&lt;/li&gt;
&lt;li&gt;Generates responses quickly&lt;/li&gt;
&lt;li&gt;Supports &lt;strong&gt;tool calls&lt;/strong&gt; (search, RAG, memory, APIs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Leading Open-Source LLMs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Meta Llama Family&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct" rel="noopener noreferrer"&gt;Llama 3.3 70B&lt;/a&gt; — Open-source leader&lt;/li&gt;
&lt;li&gt;Llama 3.2 (1B, 3B, 11B) — Scaled for different deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;128K context window&lt;/strong&gt; — remembers long conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling support&lt;/strong&gt; — built-in function execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Others&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/mistralai/Mistral-7B-v0.1" rel="noopener noreferrer"&gt;Mistral 7B&lt;/a&gt; / &lt;strong&gt;Mixtral 8x7B&lt;/strong&gt; — Efficient and competitive&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Qwen/Qwen2.5-72B-Instruct" rel="noopener noreferrer"&gt;Qwen 2.5&lt;/a&gt; — Strong multilingual support&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-2-27b-it" rel="noopener noreferrer"&gt;Google Gemma&lt;/a&gt; — Lightweight but solid&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  My Choice: Llama 3.3 70B Versatile
&lt;/h4&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large context window&lt;/strong&gt; → keeps conversations coherent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool use&lt;/strong&gt; built-in&lt;/li&gt;
&lt;li&gt;Widely supported in the open-source community&lt;/li&gt;
&lt;/ul&gt;
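&lt;p&gt;Here’s a rough sketch of what a tool-call round trip looks like. The JSON shape mirrors the OpenAI-style tool-calling convention that Llama 3.3 follows; the &lt;code&gt;get_weather&lt;/code&gt; tool and its output are hypothetical stubs:&lt;/p&gt;

```python
import json

# Sketch of a tool-call round trip. The dict shape mirrors OpenAI-style
# tool calling; get_weather is a hypothetical stub for a real API.

def get_weather(city: str) -> str:
    return f"22C and clear in {city}"   # stub for a real weather API call

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    """Run the function the model asked for; return a tool-result message
    to append to the conversation history."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return {"role": "tool", "name": tool_call["name"], "content": fn(**args)}

# What the LLM might emit instead of a plain-text reply:
model_turn = {"name": "get_weather", "arguments": '{"city": "Chennai"}'}
result = dispatch(model_turn)
print(result["content"])  # 22C and clear in Chennai
# The result goes back into the message history, and the LLM's next
# completion turns it into the sentence the agent actually speaks.
```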

&lt;h2&gt;
  
  
  Text-to-Speech (TTS)
&lt;/h2&gt;

&lt;p&gt;Now the agent needs to &lt;strong&gt;speak back&lt;/strong&gt;, and this is where quality can make or break the experience.&lt;/p&gt;

&lt;p&gt;A poor TTS voice instantly ruins immersion. The key requirements are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low latency&lt;/strong&gt;: avoid awkward pauses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural speech&lt;/strong&gt;: no robotic tone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming output&lt;/strong&gt;: start speaking mid-sentence&lt;/li&gt;
&lt;/ul&gt;
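&lt;p&gt;Streaming output in practice means chunking the LLM’s token stream into speakable clauses so TTS can start before the full reply exists. A minimal sketch (the punctuation-boundary rule is an assumption; frameworks like Pipecat do something similar internally):&lt;/p&gt;

```python
# Chunk an incoming LLM token stream at punctuation boundaries so the TTS
# engine can start synthesizing the first clause while later tokens are
# still being generated.

BOUNDARIES = ".,!?;:"

def clause_chunks(token_stream):
    """Yield speakable text chunks as tokens arrive."""
    buffer = ""
    for token in token_stream:
        buffer += token
        last = buffer.rstrip()[-1:]
        if last and last in BOUNDARIES:
            yield buffer.strip()   # hand this clause to TTS immediately
            buffer = ""
    if buffer.strip():             # flush whatever remains at end of stream
        yield buffer.strip()

tokens = ["Sure", ",", " I can", " help", ".", " What", " city", "?"]
print(list(clause_chunks(tokens)))
# ['Sure,', 'I can help.', 'What city?']
```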

&lt;h4&gt;
  
  
  Open-Source TTS Models I’ve Tried
&lt;/h4&gt;

&lt;p&gt;There are plenty of open-source TTS models available. Here’s a snapshot of the ones I experimented with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/hexgrad/Kokoro-82M" rel="noopener noreferrer"&gt;&lt;strong&gt;Kokoro-82M&lt;/strong&gt;&lt;/a&gt; — Lightweight, #1 on HuggingFace TTS Arena, blazing fast&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/ResembleAI/chatterbox" rel="noopener noreferrer"&gt;&lt;strong&gt;Chatterbox&lt;/strong&gt;&lt;/a&gt; — Built on Llama, fast inference, rising adoption&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/coqui-ai/TTS" rel="noopener noreferrer"&gt;&lt;strong&gt;XTTS-v2&lt;/strong&gt;&lt;/a&gt; — Zero-shot voice cloning, 17 languages, streaming support&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/fishaudio/fish-speech" rel="noopener noreferrer"&gt;&lt;strong&gt;FishSpeech&lt;/strong&gt;&lt;/a&gt; — Natural dialogue flow&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/CanopyLabs/Orpheus-3B" rel="noopener noreferrer"&gt;&lt;strong&gt;Orpheus&lt;/strong&gt;&lt;/a&gt; — Scales from 150M–3B&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nari-labs/dia?tab=readme-ov-file" rel="noopener noreferrer"&gt;Dia&lt;/a&gt; — A TTS model capable of generating ultra-realistic dialogue in one pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Kokoro-82M&lt;/th&gt;
&lt;th&gt;Chatterbox&lt;/th&gt;
&lt;th&gt;XTTS-v2&lt;/th&gt;
&lt;th&gt;FishSpeech&lt;/th&gt;
&lt;th&gt;Orpheus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice Naturalness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-like, top-rated in community&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Very natural, quickly improving&lt;/td&gt;
&lt;td&gt;High, especially with good samples&lt;/td&gt;
&lt;td&gt;Natural, especially for dialogue&lt;/td&gt;
&lt;td&gt;Good, scales with model size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Expressiveness / Emotion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate, some emotional range&lt;/td&gt;
&lt;td&gt;Good, improving&lt;/td&gt;
&lt;td&gt;High, can mimic sample emotion&lt;/td&gt;
&lt;td&gt;Moderate, aims for conversational flow&lt;/td&gt;
&lt;td&gt;Moderate-High, model-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accent / Language Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8+ languages (EN, JP, ZH, FR, more)&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;EN-focused, expanding&lt;/td&gt;
&lt;td&gt;17+ languages, strong global support&lt;/td&gt;
&lt;td&gt;Several; focus varies&lt;/td&gt;
&lt;td&gt;Varies by checkpoint (3B supports many)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency / Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;300ms for any length, streaming-first&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Fast inference, suitable for real-time&lt;/td&gt;
&lt;td&gt;~500ms (depends on hardware), good streaming support&lt;/td&gt;
&lt;td&gt;~400ms, streaming variants&lt;/td&gt;
&lt;td&gt;3B: ~1s+ (large), 150M: fast (CPU/no-GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, natural dialogue with chunked streaming&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, early output&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (3B may be slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extremely light (&amp;lt;300MB), great for CPU/edge&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Moderate (500M params), GPU preferred&lt;/td&gt;
&lt;td&gt;Moderate-high, 500M+ params, GPU preferred&lt;/td&gt;
&lt;td&gt;Moderate, CPU/GPU&lt;/td&gt;
&lt;td&gt;150M–3B options (higher = more GPU/memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quantization / Optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8-bit available, runs on most hardware&lt;/td&gt;
&lt;td&gt;Some support&lt;/td&gt;
&lt;td&gt;Yes, 8-bit/4-bit&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voice Cloning / Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not by default, needs training&lt;/td&gt;
&lt;td&gt;Via fine-tuning&lt;/td&gt;
&lt;td&gt;Zero-shot (few seconds of target voice)&lt;/td&gt;
&lt;td&gt;Beta, improving cloning&lt;/td&gt;
&lt;td&gt;Fine-tuning supported for custom voices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation / Community&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active, rich demos, open source, growing&lt;a href="https://github.com/hexgrad/kokoro" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Good docs, quickly growing&lt;/td&gt;
&lt;td&gt;Very large (Coqui), strong docs&lt;/td&gt;
&lt;td&gt;Medium but positive community&lt;/td&gt;
&lt;td&gt;Medium, active research group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0 (commercial OK)&lt;/td&gt;
&lt;td&gt;Commercial/Proprietary use may require license&lt;/td&gt;
&lt;td&gt;LGPL-3.0, open (see repo)&lt;/td&gt;
&lt;td&gt;See repo, mostly permissive&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pretrained Voices / Demos&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (multiple voices, demos available)&lt;a href="https://www.digitalocean.com/community/tutorials/best-text-to-speech-models" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Yes, continually adding more&lt;/td&gt;
&lt;td&gt;Yes, huge library, instant demo&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (many public models on Hugging Face)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Why I Chose &lt;strong&gt;Kokoro-82M&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5–15× smaller&lt;/strong&gt; than competing models while maintaining high quality&lt;/li&gt;
&lt;li&gt;Runs under &lt;strong&gt;300MB&lt;/strong&gt; — edge-device friendly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-300ms latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;High-fidelity &lt;strong&gt;24kHz audio&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-first design&lt;/strong&gt; — natural conversation flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No zero-shot voice cloning (uses a fixed voice library)&lt;/li&gt;
&lt;li&gt;Less expressive than XTTS-v2&lt;/li&gt;
&lt;li&gt;Relatively new model with a smaller community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also check out my minimal &lt;strong&gt;&lt;a href="https://github.com/programmerraja/Kokoro-FastAPI" rel="noopener noreferrer"&gt;Kokoro-FastAPI server&lt;/a&gt;&lt;/strong&gt; to experiment with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speech-to-Speech Models
&lt;/h2&gt;

&lt;p&gt;Speech-to-Speech (S2S) models represent an exciting advancement in AI, combining &lt;strong&gt;speech recognition, language understanding, and text-to-speech synthesis&lt;/strong&gt; into a single, end-to-end pipeline. These models allow &lt;strong&gt;natural, real-time conversations&lt;/strong&gt; by converting speech input directly into speech output, reducing latency and minimizing intermediate processing steps.&lt;/p&gt;

&lt;p&gt;Some notable models in this space include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kyutai-labs/moshi" rel="noopener noreferrer"&gt;&lt;strong&gt;Moshi&lt;/strong&gt;&lt;/a&gt;: Developed by Kyutai-Labs, Moshi is a state-of-the-art speech-text foundation model designed for &lt;strong&gt;real-time full-duplex dialogue&lt;/strong&gt;. Unlike traditional voice agents that process ASR, LLM, and TTS separately, Moshi handles the entire flow end-to-end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/SesameAILabs/csm" rel="noopener noreferrer"&gt;CSM&lt;/a&gt; (Conversational Speech Model) is a speech generation model from &lt;a href="https://www.sesame.com/" rel="noopener noreferrer"&gt;Sesame&lt;/a&gt; that generates RVQ audio codes from text and audio inputs. The model architecture employs a &lt;a href="https://www.llama.com/" rel="noopener noreferrer"&gt;Llama&lt;/a&gt; backbone and a smaller audio decoder that produces &lt;a href="https://huggingface.co/kyutai/mimi" rel="noopener noreferrer"&gt;Mimi&lt;/a&gt; audio codes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-x/" rel="noopener noreferrer"&gt;VALL-E &amp;amp; VALL-E X (Microsoft)&lt;/a&gt;&lt;/strong&gt;: These models support &lt;strong&gt;zero-shot voice conversion&lt;/strong&gt; and speech-to-speech synthesis from limited voice samples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/" rel="noopener noreferrer"&gt;AudioLM (Google Research)&lt;/a&gt;&lt;/strong&gt;: Leverages &lt;strong&gt;language modeling on audio tokens&lt;/strong&gt; to generate high-quality speech continuation and synthesis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among these, I’ve primarily worked with &lt;strong&gt;Moshi&lt;/strong&gt;. I’ve implemented it on a &lt;strong&gt;FastAPI server with streaming support&lt;/strong&gt;, which allows you to test and interact with it in real-time. You can explore the FastAPI implementation here: &lt;a href="https://github.com/programmerraja/FastAPI_Moshi" rel="noopener noreferrer"&gt;FastAPI + Moshi GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework (The Glue)
&lt;/h2&gt;

&lt;p&gt;Finally, you need something to tie all the pieces together: &lt;strong&gt;streaming audio, message passing, and orchestration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open-Source Frameworks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pipecat-ai/pipecat" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built for &lt;strong&gt;voice-first agents&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-first&lt;/strong&gt; (ultra-low latency)&lt;/li&gt;
&lt;li&gt;Modular design — swap models easily&lt;/li&gt;
&lt;li&gt;Active community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/vocodedev/vocode-python" rel="noopener noreferrer"&gt;Vocode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer-friendly, good docs&lt;/li&gt;
&lt;li&gt;Direct telephony integration&lt;/li&gt;
&lt;li&gt;Smaller community, less active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/livekit/agents" rel="noopener noreferrer"&gt;LiveKit Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Based on WebRTC&lt;/li&gt;
&lt;li&gt;Supports voice, video, text&lt;/li&gt;
&lt;li&gt;Self-hosting options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; — great for docs, weak at streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LlamaIndex&lt;/strong&gt; — RAG-focused, not optimized for voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom builds&lt;/strong&gt; — total control, but high overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why I Recommend Pipecat
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Voice-Centric Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming-first, frame-based pipeline (TTS can start before text is done)&lt;/li&gt;
&lt;li&gt;Smart Turn Detection v2 (intonation-aware)&lt;/li&gt;
&lt;li&gt;Built-in interruption handling&lt;/li&gt;
&lt;/ul&gt;
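&lt;p&gt;Pipecat’s frame-based design is easiest to picture as typed frames flowing through a chain of processors. This toy version is a simplification, not the real Pipecat API, but it captures the idea that any stage can transform, pass through, or react to frames:&lt;/p&gt;

```python
# Toy frame-based pipeline (a simplification; the real Pipecat API differs,
# but the core idea is the same: typed frames flow through a chain of
# processors, and each stage handles only the frame kinds it cares about).

class Frame:
    def __init__(self, kind, data=None):
        self.kind, self.data = kind, data

class Processor:
    def process(self, frame):
        yield frame  # default behavior: pass the frame through unchanged

class Uppercase(Processor):
    """Stand-in for a real stage such as STT, LLM, or TTS."""
    def process(self, frame):
        if frame.kind == "text":
            yield Frame("text", frame.data.upper())
        else:
            yield frame  # non-text frames (e.g. audio) pass through

def run_pipeline(stages, frames):
    """Push every frame through every stage, in order."""
    for stage in stages:
        frames = [out for f in frames for out in stage.process(f)]
    return frames

out = run_pipeline([Uppercase()], [Frame("text", "hello"), Frame("audio", b"..")])
print([f.data for f in out if f.kind == "text"])  # ['HELLO']
```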

&lt;p&gt;&lt;strong&gt;Production Ready&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-500ms latency achievable&lt;/li&gt;
&lt;li&gt;Efficient for long-running agents&lt;/li&gt;
&lt;li&gt;Excellent docs + examples&lt;/li&gt;
&lt;li&gt;Strong, growing community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~500ms voice-to-voice latency in production&lt;/li&gt;
&lt;li&gt;Works with Twilio + phone systems&lt;/li&gt;
&lt;li&gt;Supports multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Scales to thousands of concurrent users&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pipecat&lt;/th&gt;
&lt;th&gt;Vocode&lt;/th&gt;
&lt;th&gt;LiveKit&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voice-First Design&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-Time Streaming&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor Neutral&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Turn Detection&lt;/td&gt;
&lt;td&gt;✅ Smart V2&lt;/td&gt;
&lt;td&gt;⚠️ Basic&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community Activity&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;td&gt;✅ High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning Curve&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;⚠️ Moderate&lt;/td&gt;
&lt;td&gt;❌ Steep&lt;/td&gt;
&lt;td&gt;✅ Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In this first part, we’ve covered the &lt;strong&gt;core tech stack and models&lt;/strong&gt; needed to build a real-time voice agent.&lt;/p&gt;

&lt;p&gt;In the next part of the series, we’ll dive into &lt;strong&gt;integration with Pipecat&lt;/strong&gt;, explore our &lt;strong&gt;voice architecture&lt;/strong&gt;, and walk through &lt;strong&gt;deployment strategies&lt;/strong&gt;. Later, we’ll show how to enhance your agent with &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;, &lt;strong&gt;memory features&lt;/strong&gt;, and other advanced capabilities to make your voice assistant truly intelligent.&lt;/p&gt;

&lt;p&gt;Stay tuned: the next guide will turn all these building blocks into a working, real-time voice agent you can actually deploy.&lt;/p&gt;

&lt;p&gt;I’ve created a &lt;strong&gt;GitHub repository&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://github.com/programmerraja/VoiceAgentGuide" rel="noopener noreferrer"&gt;VoiceAgentGuide&lt;/a&gt;&lt;/strong&gt; for this series, where we can store our notes and related resources. Don’t forget to &lt;strong&gt;check it out&lt;/strong&gt; and share your &lt;strong&gt;feedback&lt;/strong&gt;. Feel free to &lt;strong&gt;contribute or add missing content&lt;/strong&gt; by submitting a pull request (PR).&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://voiceaiandvoiceagents.com/" rel="noopener noreferrer"&gt;Voice AI &amp;amp; Voice Agents An Illustrated Primer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>generativeai</category>
      <category>voiceagent</category>
      <category>pipecat</category>
    </item>
    <item>
      <title>The Other Side of OpenAI: 12 Surprising Stories You Haven’t Heard</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Wed, 10 Sep 2025 01:13:35 +0000</pubDate>
      <link>https://forem.com/programmerraja/the-other-side-of-openai-12-surprising-stories-you-havent-heard-c9</link>
      <guid>https://forem.com/programmerraja/the-other-side-of-openai-12-surprising-stories-you-havent-heard-c9</guid>
      <description>&lt;p&gt;While browsing YouTube, I stumbled across a video titled &lt;a href="https://www.youtube.com/watch?v=lYWtMF_eurA" rel="noopener noreferrer"&gt;&lt;em&gt;This Book Changed How I Think About AI&lt;/em&gt;&lt;/a&gt;. Curious, I clicked, and it introduced me to &lt;a href="https://en.wikipedia.org/wiki/Empire_of_AI" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Empire of AI&lt;/em&gt; by Karen Hao&lt;/strong&gt;&lt;/a&gt;, a book that dives deep into the evolution of OpenAI.&lt;/p&gt;

&lt;p&gt;The book explores OpenAI’s history, its culture of secrecy, and its almost single-minded pursuit of artificial general intelligence (AGI). Drawing on interviews with more than 260 people, along with correspondence and internal documents, Hao paints a revealing picture of the company.&lt;/p&gt;

&lt;p&gt;After reading it, I uncovered 12 particularly fascinating facts about OpenAI that most people don’t know. Let’s dive in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. The “Open” in OpenAI Was More Branding Than Belief&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The name sounds noble; who doesn’t like the idea of “open” AI? But here’s the catch: from the very beginning, openness was more narrative than commitment. Founders Sam Altman, Greg Brockman, and Elon Musk leaned into it because it helped them stand out. Behind closed doors, though, cofounder Ilya Sutskever was already suggesting they could scale it back once the story had served its purpose. In other words: open, until it wasn’t convenient.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Elon Musk’s Billion-Dollar Promise? Mostly Smoke and Mirrors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Remember Musk’s flashy $1 billion funding pledge? Turns out, OpenAI only ever saw about $130 million of it. And less than $45 million came directly from Musk himself. His back-and-forth on funding almost pushed the organization into crisis, forcing Altman to hunt down new sources of money.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. The For-Profit Shift Was More About Survival Than Vision&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2019, OpenAI unveiled its “capped-profit” structure, pitching it as an innovative way to balance mission and money. But the truth is far less glamorous: the nonprofit model wasn’t bringing in the billions needed to compete with tech giants. At one point, Brockman and Sutskever even discussed merging with a chip startup. Creating OpenAI LP wasn’t a bold vision; it was a lifeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. The “Capped-Profit” Model Looked Unlimited to Critics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Investors were told their returns would be capped at 100x. Sounds responsible, right? But do the math: a $10 million check could still turn into a $1 billion payout. Critics quickly called it “basically unlimited,” arguing the cap only looked meaningful until you saw the actual numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. GPT-2’s “Too Dangerous” Storyline Was a PR Masterstroke&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2019, OpenAI said its GPT-2 model was so powerful it had to be withheld for safety reasons. Headlines exploded. But here’s the twist: many researchers thought the risk claims were overblown and saw the whole thing as a publicity stunt engineered by Jack Clark, OpenAI’s communications chief at the time. The stunt worked—the company was suddenly everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. OpenAI’s Culture Had Clashing “Tribes”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Inside OpenAI, things weren’t exactly harmonious. Sam Altman himself described the organization as divided into three factions: research explorers, safety advocates, and startup-minded builders. He even warned of “tribal warfare” if they couldn’t pull together. That’s not just workplace tension; it’s a sign of deep conflict over the company’s direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7. ChatGPT’s Global Debut Was Basically an Accident&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Think ChatGPT’s launch was carefully choreographed? Not at all. The product that made OpenAI a household name was released in just two weeks as a “research preview,” right after Thanksgiving 2022. The rush was partly to get ahead of a rumored chatbot from Anthropic. Even Microsoft, OpenAI’s biggest partner, was caught off guard and reportedly annoyed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8. Training Data Included Pirated Books and YouTube Videos&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Where do you get enough data to train something like GPT-3 or GPT-4? In OpenAI’s case, by scraping almost everything it could. GPT-3 used a secret dataset nicknamed “Books2,” which reportedly included pirated works from Library Genesis. GPT-4 went even further, with employees transcribing YouTube videos and scooping up anything online without explicit “do not scrape” warnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;9. “AI Safety” Initially Ignored Social Harms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI loves to talk about AI safety now. But early on, executives resisted calls to broaden the term to include real-world harms like discrimination and bias. When pressed, one leader bluntly said, “That’s not our role.” The message was clear: safety meant existential risks, not everyday impacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;10. Scaling Up Came with Hidden Environmental Costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Bigger models require more compute and more resources. Training GPT-4 in Microsoft’s Iowa data centers consumed roughly 11.5 million gallons of water in a single month, during a drought. Strikingly, Altman and other leaders reportedly never discussed these environmental costs in company-wide meetings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;11. “SummerSafe LP” Had a Dark Inspiration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before OpenAI LP had its public name, it was secretly incorporated as “SummerSafe LP.” The reference? An episode of &lt;em&gt;Rick and Morty&lt;/em&gt; where a car, tasked with keeping Summer safe, resorts to murder and torture. Internally, it was an ironic nod to how AI systems can twist well-meaning goals into dangerous outcomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;12. Departing Employees Faced Equity Pressure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Leaked documents revealed OpenAI used a hardball tactic with departing employees: sign a strict nondisparagement agreement or risk losing vested equity. This essentially forced people into lifelong silence. Altman later said he didn’t know this was happening and was embarrassed, but records show he had signed paperwork granting the company those rights a year earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;OpenAI’s story is anything but straightforward. From broken promises and internal clashes to controversial data practices, the company has often operated in ways that don’t match its public messaging. Whether you see that as savvy strategy, messy growing pains, or something more troubling depends on your perspective.&lt;/p&gt;

&lt;p&gt;But one thing’s clear: the “open” in OpenAI has always been complicated.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>books</category>
      <category>elonmusk</category>
    </item>
    <item>
      <title>Rust Tools That Made Our Dev Team Productive Again</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 09 Aug 2025 02:58:09 +0000</pubDate>
      <link>https://forem.com/programmerraja/rust-tools-that-made-our-dev-team-productive-again-479</link>
      <guid>https://forem.com/programmerraja/rust-tools-that-made-our-dev-team-productive-again-479</guid>
      <description>&lt;p&gt;As regular readers of my blog may know, our primary technology stack is the &lt;strong&gt;MERN stack&lt;/strong&gt;: MongoDB, Express, React, and Node.js. On the frontend, we use React with TypeScript; on the backend, Node.js with TypeScript; and MongoDB serves as our database.&lt;/p&gt;

&lt;p&gt;While this stack has served us well, we encountered significant challenges as our application scaled, particularly around build times, memory usage, and developer experience. In this post, I will outline two key areas where &lt;strong&gt;Rust-based tools&lt;/strong&gt; helped us resolve these issues and substantially improved our team’s development velocity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving Frontend Performance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Slow Builds and Poor Developer Experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As our frontend codebase grew, we began facing several recurring issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local development startup times became painfully slow.&lt;/li&gt;
&lt;li&gt;Build processes consumed large amounts of memory.&lt;/li&gt;
&lt;li&gt;On lower-end machines, builds caused systems to hang or crash.&lt;/li&gt;
&lt;li&gt;Developers regularly raised concerns about delays and performance bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues were primarily due to our use of &lt;strong&gt;Create React App (CRA)&lt;/strong&gt; with an &lt;strong&gt;ejected Webpack configuration&lt;/strong&gt;. While powerful, this setup became increasingly inefficient for our scale and complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Attempt: Migrating to Vite
&lt;/h3&gt;

&lt;p&gt;In search of a solution, I explored &lt;strong&gt;&lt;a href="https://vite.dev/" rel="noopener noreferrer"&gt;Vite&lt;/a&gt;&lt;/strong&gt;, a build tool known for its speed and modern architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster initial load times due to native ES module imports.&lt;/li&gt;
&lt;li&gt;Noticeable improvement in development server startup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migrating from an ejected CRA setup was complex due to custom Webpack configurations.&lt;/li&gt;
&lt;li&gt;Issues arose with lazy-loaded routes, SVG assets, and ESLint/type-checking delays.&lt;/li&gt;
&lt;li&gt;Certain runtime errors occurred during navigation, likely due to missing or incorrect Vite configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, while Vite offered some performance benefits, it did not fully resolve our problems and introduced new complications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Solution: Adopting Rspack
&lt;/h3&gt;

&lt;p&gt;After further research, we came across &lt;a href="https://rspack.rs/" rel="noopener noreferrer"&gt;&lt;strong&gt;Rspack&lt;/strong&gt;&lt;/a&gt;, a high-performance Webpack-compatible bundler written in &lt;strong&gt;Rust&lt;/strong&gt;. What caught my attention was its focus on performance and ease of migration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f8rkkudvzz2fyxu4bum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f8rkkudvzz2fyxu4bum.png" alt="rspack" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key advantages of Rspack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significantly faster build times, with up to a 70% improvement in our case.&lt;/li&gt;
&lt;li&gt;Reduced memory consumption during both build and development.&lt;/li&gt;
&lt;li&gt;Compatibility with existing Webpack plugins and configurations, which simplified migration.&lt;/li&gt;
&lt;li&gt;Designed as a drop-in replacement for Webpack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After resolving a few initial issues, we successfully integrated Rspack into our frontend build system. The migration resulted in substantial improvements in build speed and developer satisfaction. The system is now in production with no reported issues, and developers are once again comfortable working on the frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerating Backend Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Slow Kubernetes-Based Testing Cycle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our backend uses Kubernetes for deployment and testing. The typical development workflow looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A developer makes code changes.&lt;/li&gt;
&lt;li&gt;A Docker image is built and pushed to a registry using GitHub Actions.&lt;/li&gt;
&lt;li&gt;The updated image is deployed to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Testers verify the changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process, while standard, became inefficient. Even small changes (such as adding a log statement) required a full image build and redeployment, resulting in delays of 15 minutes or more per test cycle.&lt;/p&gt;
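
&lt;p&gt;Concretely, each cycle boiled down to something like this; the registry, image, and deployment names below are placeholders, not our actual setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Build and push a fresh image for every change
docker build -t registry.example.com/backend:$GIT_SHA .
docker push registry.example.com/backend:$GIT_SHA

# Roll the new image out and wait for the pods to cycle
kubectl set image deployment/backend backend=registry.example.com/backend:$GIT_SHA
kubectl rollout status deployment/backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every one of those steps runs even for a one-line change, which is where the 15 minutes went.&lt;/p&gt;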

&lt;h3&gt;
  
  
  Optimization: Runtime Code Sync
&lt;/h3&gt;

&lt;p&gt;To address this, we wrote a shell script that runs whenever a pod starts or restarts; it pulls the latest changes from GitHub and runs the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git reset &lt;span class="nt"&gt;--hard&lt;/span&gt; origin/&lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;
git pull origin &lt;span class="nv"&gt;$BRANCH_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This significantly reduced testing turnaround time for JavaScript-based services.&lt;/p&gt;
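
&lt;p&gt;A fuller version of that startup script might look like the following sketch; the working directory, install step, and start command are illustrative and would be adapted per service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
set -e

# Discard any local state and sync the pod to the latest commit on the branch
cd /app
git fetch origin "$BRANCH_NAME"
git reset --hard "origin/$BRANCH_NAME"

# Pick up any newly added dependencies, then start the service
npm install
node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;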

&lt;h3&gt;
  
  
  The TypeScript Bottleneck
&lt;/h3&gt;

&lt;p&gt;However, for services written in TypeScript, the situation was more complex. After pulling the latest code, we needed to transpile TypeScript to JavaScript using &lt;code&gt;tsc&lt;/code&gt; or &lt;code&gt;npm run build&lt;/code&gt;. Unfortunately, this process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumed excessive memory.&lt;/li&gt;
&lt;li&gt;Took too long to complete.&lt;/li&gt;
&lt;li&gt;Caused pods to crash, especially in test environments with limited resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Solution: Integrating SWC
&lt;/h3&gt;

&lt;p&gt;To solve this, we adopted &lt;strong&gt;SWC&lt;/strong&gt;, a Rust-based TypeScript compiler. Unlike &lt;code&gt;tsc&lt;/code&gt;, SWC focuses on speed and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results after integrating SWC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compilation time reduced to approximately 250 milliseconds.&lt;/li&gt;
&lt;li&gt;Memory usage dropped significantly.&lt;/li&gt;
&lt;li&gt;Allowed us to support live code updates without full builds or redeployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because SWC does not perform type checking, we use it only in test environments. This tradeoff allows testers to verify code changes rapidly, without impacting our production pipeline.&lt;/p&gt;
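
&lt;p&gt;In our test environments, the build step effectively reduces to a single SWC invocation; this sketch assumes &lt;code&gt;@swc/cli&lt;/code&gt; and &lt;code&gt;@swc/core&lt;/code&gt; are installed, and the paths are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Transpile TypeScript to JavaScript fast, with no type checking
npx swc src -d dist

# Type checking still happens, but in CI rather than on the pod
npx tsc --noEmit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;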

&lt;h3&gt;
  
  
  Conclusion: Rust’s Impact on Team Efficiency
&lt;/h3&gt;

&lt;p&gt;In both our frontend and backend workflows, &lt;strong&gt;Rust-based tools&lt;/strong&gt;, Rspack and SWC, delivered substantial improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend build times&lt;/strong&gt; were reduced by more than 70%, with better memory efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing cycles&lt;/strong&gt; became significantly faster, especially for TypeScript services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer experience&lt;/strong&gt; improved across the board, reducing frustration and increasing velocity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust’s performance characteristics, coupled with thoughtful tool design, played a critical role in resolving bottlenecks in our JavaScript-based systems. For teams facing similar challenges, especially around build performance and scalability, we strongly recommend exploring Rust-powered tools as a viable solution.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>rust</category>
      <category>javascript</category>
      <category>react</category>
    </item>
    <item>
      <title>This AI Interview Assistant Chrome Extension Was My Weekend Project</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sat, 24 May 2025 05:23:09 +0000</pubDate>
      <link>https://forem.com/programmerraja/i-built-a-chrome-extension-that-reads-resumes-and-generates-interview-questions-with-ai-2ppk</link>
      <guid>https://forem.com/programmerraja/i-built-a-chrome-extension-that-reads-resumes-and-generates-interview-questions-with-ai-2ppk</guid>
      <description>&lt;p&gt;Hiring has always been one of those tasks that &lt;em&gt;seems&lt;/em&gt; easy until you’re knee-deep in resumes, trying to remember who did what and, more importantly, what to ask them during the interview.&lt;/p&gt;

&lt;p&gt;A few weeks ago, I had this exact moment. I was preparing for an interview and had a resume open in one tab, a notepad in another, and ChatGPT somewhere in the background trying to help me brainstorm questions. That's when the idea hit me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why am I juggling between tabs? What if this entire process could live in a single Chrome extension?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built one. It’s called &lt;strong&gt;HireZen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I Kept Running Into
&lt;/h2&gt;

&lt;p&gt;Every time I had to take an interview, I’d start by opening the candidate’s resume. But even after reading it top to bottom, I wasn’t always sure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What’s the best way to dig deeper into their projects?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are they really comfortable with the tools they listed?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What kinds of behavioral or situational questions would be relevant?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’d often resort to generic questions or spend too much time prepping just one resume. It felt repetitive and inefficient, and I knew there had to be a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter HireZen
&lt;/h2&gt;

&lt;p&gt;HireZen is a Chrome extension that does one simple thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You upload a resume, and it generates personalized interview questions for you using AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s it. No over-engineering. No login required. Just upload, generate, copy or print. Done.&lt;/p&gt;

&lt;p&gt;Here’s what it currently supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🧠 Reads and parses PDF resumes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🤖 Uses LLMs (like GPT-4) to generate questions based on the candidate’s experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🖨️ Lets you print the generated questions or share them with HR&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is to take the mental load off interviewers and let AI handle the repetitive thinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;By default, when you visit Google Meet, HireZen will auto-open as a sidebar so you can prep questions while you’re on the call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Press Ctrl + M to hide or show the extension anytime (toggle view).&lt;/li&gt;
&lt;li&gt;Click the Settings icon to:
&lt;ul&gt;
&lt;li&gt;Choose your LLM provider (OpenAI, Claude, etc.)&lt;/li&gt;
&lt;li&gt;Enter your API key&lt;/li&gt;
&lt;li&gt;Select the model you prefer (e.g., GPT-4, GPT-3.5)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is stored securely inside your browser. Once configured, just upload a resume and start generating questions instantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcltwdnifnyao1u9kmt4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcltwdnifnyao1u9kmt4l.png" alt="Preview" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqd0kqtmxyl22kg9ikok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqd0kqtmxyl22kg9ikok.png" alt="preview2" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Behind It
&lt;/h2&gt;

&lt;p&gt;Initially, I was using a GitHub-hosted API to call OpenAI’s models. It worked well, but it obviously wasn’t scalable for others. So I added a &lt;strong&gt;Settings page&lt;/strong&gt; where anyone using the extension can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Choose their &lt;strong&gt;LLM provider&lt;/strong&gt; (e.g., OpenAI or others)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set their &lt;strong&gt;preferred model&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter their own &lt;strong&gt;API key&lt;/strong&gt;, which is stored securely in the browser (not sent to me or any server)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No backend. No database. Just local storage via Chrome's &lt;code&gt;storage.local&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;It’s simple and, more importantly, safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  On Security
&lt;/h2&gt;

&lt;p&gt;One thing I was very cautious about was handling the API key. I didn’t want to mess around with storing sensitive data anywhere outside the user’s browser. So everything (model, provider, key) is stored &lt;strong&gt;locally&lt;/strong&gt; and is only accessible to the extension.&lt;/p&gt;

&lt;p&gt;You control your own usage. You bring your own key. I never see it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;This is just the beginning. I’m planning to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add support for exporting question sets as PDF&lt;/li&gt;
&lt;li&gt;Build a small feedback form to help interviewers leave notes&lt;/li&gt;
&lt;li&gt;Eventually list it on the Chrome Web Store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, it’s all open and available to try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;Here’s the &lt;a href="https://programmerraja.is-a.dev/hirezen/" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It works on Chrome and Chromium-based browsers. Just open it, upload a resume, and let it do the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I’m Sharing This
&lt;/h2&gt;

&lt;p&gt;I’m a solo developer. I build things out of curiosity and real-world pain points I face at work. HireZen is one of those small tools I wish I had earlier, so I built it and put it out there.&lt;/p&gt;

&lt;p&gt;If it saves you time or makes your interviews a little smoother  that’s all I hoped for.&lt;/p&gt;

&lt;p&gt;And hey, if you found it helpful and want to support my work...&lt;/p&gt;

&lt;p&gt;☕️ You can &lt;strong&gt;&lt;a href="https://buymeacoffee.com/programmerraja" rel="noopener noreferrer"&gt;buy me a coffee&lt;/a&gt;&lt;/strong&gt; – it helps me keep building little tools like this and pushing updates.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>sideprojects</category>
      <category>llm</category>
    </item>
    <item>
      <title>Prototyping AI Agents with GitHub Models (for Free!) 🤖💸</title>
      <dc:creator>Boopathi</dc:creator>
      <pubDate>Sun, 20 Apr 2025 06:48:47 +0000</pubDate>
      <link>https://forem.com/programmerraja/prototyping-ai-agents-with-github-models-for-free-1go9</link>
      <guid>https://forem.com/programmerraja/prototyping-ai-agents-with-github-models-for-free-1go9</guid>
      <description>&lt;p&gt;I &lt;em&gt;love&lt;/em&gt; when companies roll out generous free tiers. It feels like they’re saying, “Hey, go build your thing — we’ve got your back.” And if you’re a student, between jobs, or just tired of racking up charges for every API call (yep, been there too), free tiers can be a total game-changer.&lt;/p&gt;

&lt;p&gt;That’s exactly why &lt;strong&gt;GitHub Models&lt;/strong&gt; stood out to me — it’s like an AI candy shop, completely free to explore, as long as you have a GitHub account.&lt;/p&gt;

&lt;p&gt;Here’s what’s on the shelf:&lt;/p&gt;

&lt;p&gt;🔮 OpenAI models like &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;o3-mini&lt;/code&gt;&lt;br&gt;&lt;br&gt;
🧠 Research favorites like &lt;code&gt;Phi&lt;/code&gt; and &lt;code&gt;LLaMA&lt;/code&gt;&lt;br&gt;&lt;br&gt;
🌍 Multimodal models like &lt;code&gt;llama-vision-instruct&lt;/code&gt;&lt;br&gt;&lt;br&gt;
📚 Embeddings from Cohere and OpenAI&lt;br&gt;&lt;br&gt;
⚡ Plus providers like &lt;code&gt;Mistral&lt;/code&gt;, &lt;code&gt;Jamba&lt;/code&gt;, &lt;code&gt;Codestral&lt;/code&gt; and more&lt;/p&gt;

&lt;p&gt;Oh, and the best part? Many of these models support &lt;strong&gt;function calling&lt;/strong&gt;, making them perfect for agent-style apps.&lt;/p&gt;

&lt;p&gt;Now here’s the real kicker:&lt;br&gt;&lt;br&gt;
GitHub Models speak &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; — which means any Python framework that already works with OpenAI's ChatCompletion API... just works out of the box.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example 1: Connecting &lt;code&gt;openai&lt;/code&gt; SDK to GitHub Models
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://models.inference.ai.azure.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now go ahead and use it like you would with OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, clean, no surprises.&lt;/p&gt;
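
&lt;p&gt;Because the endpoint is OpenAI-compatible, you can also sanity-check it with plain &lt;code&gt;curl&lt;/code&gt; before wiring up any framework, using the same endpoint and token as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A single chat completion against the GitHub Models endpoint
curl https://models.inference.ai.azure.com/chat/completions \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;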

&lt;h3&gt;
  
  
  Example 2: Running AutoGen with GitHub Models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;autogen_ext.models.openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat.agents&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autogen_ext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAIChatCompletionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://models.inference.ai.azure.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;math_teacher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math_teacher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You only teach maths.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like that, your agent is ready to go.&lt;/p&gt;

&lt;p&gt;You can plug GitHub Models into &lt;em&gt;tons&lt;/em&gt; of other Python libraries too — LangGraph, PydanticAI, LlamaIndex, you name it.&lt;/p&gt;

&lt;p&gt;Go build something fun. Happy tinkering!&lt;/p&gt;

</description>
      <category>github</category>
      <category>openai</category>
      <category>free</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
