<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Juan</title>
    <description>The latest articles on Forem by Juan (@juanok).</description>
    <link>https://forem.com/juanok</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864719%2F703a7b7a-a551-44c6-882f-86824afd6ef1.jpeg</url>
      <title>Forem: Juan</title>
      <link>https://forem.com/juanok</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/juanok"/>
    <language>en</language>
    <item>
      <title>I built an AI Gateway with no technical background. Here's where I'm stuck.</title>
      <dc:creator>Juan</dc:creator>
      <pubDate>Wed, 08 Apr 2026 11:51:19 +0000</pubDate>
      <link>https://forem.com/juanok/i-built-an-ai-gateway-with-no-technical-background-heres-where-im-stuck-558b</link>
      <guid>https://forem.com/juanok/i-built-an-ai-gateway-with-no-technical-background-heres-where-im-stuck-558b</guid>
      <description>&lt;p&gt;I'm a solo founder from Argentina. I'm not an engineer — I built the backend of &lt;a href="https://neuralrouting.io" rel="noopener noreferrer"&gt;NeuralRouting.io&lt;/a&gt; almost entirely with Claude.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Most teams send every LLM request to GPT-4 even when a smaller model would return an answer of the same quality. At scale, the gap between a model priced at $30 per million tokens and one priced at $0.50 is massive.&lt;/p&gt;

&lt;h2&gt;What NeuralRouting does&lt;/h2&gt;

&lt;p&gt;It sits between your app and LLM providers. Each request gets a complexity score, and the router picks the cheapest model that can handle it.&lt;/p&gt;

&lt;p&gt;It also has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dual-layer semantic cache&lt;/strong&gt; — similar queries get served from cache instead of hitting the API again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow Engine&lt;/strong&gt; — runs cheaper models in parallel to benchmark quality over time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PII filtering and rate limiting&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent loop detection&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
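&lt;p&gt;For the curious, the dual-layer cache idea looks roughly like this. I'm using a bag-of-words cosine as a stand-in embedding so the sketch runs on its own; a real semantic cache would use proper embeddings and a vector index.&lt;/p&gt;

```python
# Two layers: exact string match first, then nearest-neighbor over a
# similarity threshold. Bag-of-words cosine stands in for real embeddings.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class DualLayerCache:
    def __init__(self, threshold: float = 0.9):
        self.exact = {}       # layer 1: exact string match
        self.semantic = []    # layer 2: (embedding, response) pairs
        self.threshold = threshold

    def get(self, query: str):
        if query in self.exact:               # layer 1 hit
            return self.exact[query]
        q = embed(query)
        for vec, response in self.semantic:   # layer 2: similar enough?
            if cosine(q, vec) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str):
        self.exact[query] = response
        self.semantic.append((embed(query), response))

cache = DualLayerCache()
cache.put("what is the capital of France", "Paris")
cache.get("what is the capital of France please")  # near-duplicate, served from cache
```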

&lt;h2&gt;Where it's at&lt;/h2&gt;

&lt;p&gt;Early. I only support OpenAI and Groq right now. Zero users. I built too much before talking to anyone, and I'm fixing that now.&lt;/p&gt;

&lt;p&gt;If you work with LLMs and want to try it, I'm looking for honest feedback: &lt;a href="https://neuralrouting.io" rel="noopener noreferrer"&gt;neuralrouting.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>beginners</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I was mass-sending everything to GPT-4. Here's what I changed.</title>
      <dc:creator>Juan</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:19:39 +0000</pubDate>
      <link>https://forem.com/juanok/i-was-mass-sending-everything-to-gpt-4-heres-what-i-changed-20jh</link>
      <guid>https://forem.com/juanok/i-was-mass-sending-everything-to-gpt-4-heres-what-i-changed-20jh</guid>
      <description>&lt;p&gt;I'm a solo dev from Argentina building AI tools. For months I was doing what most of us do — every API call went straight to GPT-4 (now GPT-4o). Summarization? GPT-4. Formatting a JSON? GPT-4. Answering "what's 2+2"? GPT-4.&lt;/p&gt;

&lt;p&gt;Then I looked at my bill and did some math.&lt;/p&gt;

&lt;h2&gt;The numbers that made me stop&lt;/h2&gt;

&lt;p&gt;Here's what the main LLM providers charge per 1M tokens right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B (via Groq)&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at that gap between GPT-4o and Llama. That's a &lt;strong&gt;50x price difference&lt;/strong&gt; on input tokens.&lt;/p&gt;

&lt;p&gt;And here's the thing — for probably 70% of what I was sending to GPT-4o, Llama would've given me the same answer.&lt;/p&gt;
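&lt;p&gt;The arithmetic, if you want to check it yourself. The 100M input tokens/month is just an example volume, and the 70/30 split is my own estimate:&lt;/p&gt;

```python
# Back-of-the-envelope: route 70% of traffic to Llama 3.1 8B and 30% to
# GPT-4o, using the input prices from the table above.
MONTHLY_INPUT_TOKENS = 100_000_000  # example: 100M input tokens/month

GPT4O = 2.50 / 1_000_000   # $ per input token
LLAMA = 0.05 / 1_000_000

all_gpt4o = MONTHLY_INPUT_TOKENS * GPT4O
routed = MONTHLY_INPUT_TOKENS * (0.7 * LLAMA + 0.3 * GPT4O)

print(f"all GPT-4o: ${all_gpt4o:.2f}")   # $250.00
print(f"routed:     ${routed:.2f}")      # $78.50, roughly a 69% cut
```

&lt;p&gt;And that's before output tokens, where the gap is even wider.&lt;/p&gt;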

&lt;h2&gt;What I tried first&lt;/h2&gt;

&lt;p&gt;The obvious solution: just add some if/else logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_simple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.1-8b-instant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sounds easy. It's not.&lt;/p&gt;

&lt;p&gt;What counts as "simple"? Token count? Keywords? And then you need different API clients for OpenAI vs Groq. Different error handling. Fallback logic when one provider goes down. Rate limiting per provider.&lt;/p&gt;

&lt;p&gt;It turned into spaghetti real fast.&lt;/p&gt;

&lt;h2&gt;What I actually ended up building&lt;/h2&gt;

&lt;p&gt;I spent a few months building a proxy that handles all of this automatically. You point your OpenAI SDK at it, and it figures out which model to use per request.&lt;/p&gt;

&lt;p&gt;The routing logic is basically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify the prompt&lt;/strong&gt; — is it casual chat, coding, analysis, math, translation? Each has a different complexity baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check complexity&lt;/strong&gt; — token count, multi-step signals, risk level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — simple stuff goes to Llama (nearly free), complex stuff goes to GPT-4o.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; — a background process compares cheap vs premium responses to catch quality drops.&lt;/li&gt;
&lt;/ol&gt;
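&lt;p&gt;Steps 1 through 3 can be sketched in a few lines. The categories and regexes below are illustrative and the baseline numbers are made up; the point is that classification sets a floor and signals push the score up from there.&lt;/p&gt;

```python
# Illustrative classify-then-route pipeline. Category baselines and the
# routing threshold are invented values, not the proxy's real tuning.
import re

BASELINES = {"chat": 0.1, "translation": 0.2, "coding": 0.5,
             "math": 0.5, "analysis": 0.6}

PATTERNS = {
    "coding": re.compile(r"\b(code|function|bug|refactor|debug)\b", re.I),
    "math": re.compile(r"\b(solve|equation|integral|prove)\b", re.I),
    "analysis": re.compile(r"\b(analy[sz]e|compare|evaluate)\b", re.I),
    "translation": re.compile(r"\btranslate\b", re.I),
}

def classify(prompt: str) -> str:
    for category, pattern in PATTERNS.items():
        if pattern.search(prompt):
            return category
    return "chat"

def route(prompt: str) -> str:
    score = BASELINES[classify(prompt)]
    if len(prompt) > 1500:          # very long prompt: multi-step signal
        score += 0.2
    return "gpt-4o" if score >= 0.5 else "llama-3.1-8b-instant"

route("hey, how's it going?")   # cheap tier
route("debug this function")    # premium tier
```

&lt;p&gt;Step 4, the validation, runs out of band, so it never adds latency to the request path.&lt;/p&gt;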

&lt;p&gt;The integration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# just change the base_url, that's it
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-proxy-url/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-routing-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# use it exactly like before
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# the router overrides this
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line change. Everything else stays the same.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;A few things that surprised me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Most classification doesn't need AI.&lt;/strong&gt; I started with GPT-4o-mini classifying each prompt (ironic, paying for AI to decide if I should pay for AI). Switched to regex + heuristics. Zero cost, runs in &amp;lt;1ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fallbacks matter more than routing.&lt;/strong&gt; If Groq goes down and you don't have a fallback, your users don't care about your 80% savings. They care that the app is broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality validation is the hard part.&lt;/strong&gt; Routing is easy. Knowing if the cheap model gave a good enough answer — that's the real problem. I built a shadow engine that samples responses and compares them. Still not perfect.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
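&lt;p&gt;On the fallback point: the core pattern is small. The stubs below stand in for real SDK clients; a production chain also needs timeouts, backoff, and per-provider rate limits.&lt;/p&gt;

```python
# Minimal provider fallback chain (hypothetical names; not the gateway's
# actual retry policy). Try providers in order; first success wins.
def call_with_fallback(prompt, providers):
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:     # provider down, rate-limited, etc.
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# stub callables standing in for real clients
def groq_stub(prompt):
    raise TimeoutError("groq unreachable")

def openai_stub(prompt):
    return f"answer to: {prompt}"

result = call_with_fallback("hello", [("groq", groq_stub), ("openai", openai_stub)])
```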

&lt;h2&gt;Where I'm at now&lt;/h2&gt;

&lt;p&gt;$0 MRR. Zero paying customers. The product works — I use it myself every day. But I'm at the "talking to people" stage now, which is way harder than building.&lt;/p&gt;

&lt;p&gt;If you're spending $100+/mo on LLM APIs and want to try it, the project is called &lt;a href="https://neuralrouting.io" rel="noopener noreferrer"&gt;NeuralRouting&lt;/a&gt;. Free tier available, no credit card.&lt;/p&gt;

&lt;p&gt;But honestly I'm more interested in hearing: &lt;strong&gt;how are you handling multi-model routing?&lt;/strong&gt; Roll your own? Using a gateway? Just sending everything to one model and accepting the cost?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;First post here. Building in public from Corrientes, Argentina. 🇦🇷&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
