<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nishkarsh Sahu</title>
    <description>The latest articles on Forem by Nishkarsh Sahu (@nishkarsh_sahu_09732900c6).</description>
    <link>https://forem.com/nishkarsh_sahu_09732900c6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940882%2F6528267e-bd3a-458a-bca9-9fc441d92a0c.jpg</url>
      <title>Forem: Nishkarsh Sahu</title>
      <link>https://forem.com/nishkarsh_sahu_09732900c6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nishkarsh_sahu_09732900c6"/>
    <language>en</language>
    <item>
      <title>Building a Rails-Native AI Abstraction Layer for Local and Hosted LLMs</title>
      <dc:creator>Nishkarsh Sahu</dc:creator>
      <pubDate>Tue, 19 May 2026 18:33:34 +0000</pubDate>
      <link>https://forem.com/nishkarsh_sahu_09732900c6/building-a-rails-native-ai-abstraction-layer-for-local-and-hosted-llms-5968</link>
      <guid>https://forem.com/nishkarsh_sahu_09732900c6/building-a-rails-native-ai-abstraction-layer-for-local-and-hosted-llms-5968</guid>
      <description>&lt;p&gt;Recently I’ve been experimenting with integrating local AI runtimes into Rails applications using tools like Ollama and LM Studio.&lt;/p&gt;

&lt;p&gt;At first, the integration looked straightforward:&lt;br&gt;
make an HTTP request, stream the response, and return the generated text.&lt;/p&gt;

&lt;p&gt;But after experimenting with multiple providers, I realized the actual challenge wasn’t calling the APIs - it was normalizing the differences between providers cleanly.&lt;/p&gt;

&lt;p&gt;The Problem&lt;/p&gt;

&lt;p&gt;Every AI provider behaves slightly differently.&lt;/p&gt;

&lt;p&gt;Some providers:&lt;/p&gt;

&lt;p&gt;stream using SSE&lt;br&gt;
stream newline-delimited JSON&lt;br&gt;
return partial JSON chunks&lt;br&gt;
expose different finish signals&lt;br&gt;
structure responses differently&lt;br&gt;
implement retries/errors differently&lt;/p&gt;

&lt;p&gt;Even providers claiming OpenAI compatibility often differ subtly in:&lt;/p&gt;

&lt;p&gt;chunk formatting&lt;br&gt;
streaming behavior&lt;br&gt;
error payloads&lt;br&gt;
lifecycle handling&lt;/p&gt;

&lt;p&gt;This becomes painful when trying to build reusable Rails infrastructure.&lt;/p&gt;

&lt;p&gt;You quickly end up writing:&lt;/p&gt;

&lt;p&gt;provider-specific parsing&lt;br&gt;
provider-specific retry handling&lt;br&gt;
provider-specific response normalization&lt;br&gt;
provider-specific streaming logic&lt;/p&gt;

&lt;p&gt;inside application code.&lt;/p&gt;

&lt;p&gt;What I Wanted Instead&lt;/p&gt;

&lt;p&gt;I wanted a Rails-native abstraction layer where application code could stay provider-independent.&lt;/p&gt;

&lt;p&gt;Something conceptually similar to how ActiveRecord abstracts databases.&lt;/p&gt;

&lt;p&gt;The goal became:&lt;/p&gt;

&lt;p&gt;response = AiModels.chat(&lt;br&gt;
  provider: :ollama,&lt;br&gt;
  model: "llama3.2",&lt;br&gt;
  messages: [&lt;br&gt;
    {&lt;br&gt;
      role: "user",&lt;br&gt;
      content: "Explain ActiveRecord associations"&lt;br&gt;
    }&lt;br&gt;
  ]&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;puts response.content&lt;/p&gt;

&lt;p&gt;without the application caring about:&lt;/p&gt;

&lt;p&gt;SSE parsing&lt;br&gt;
chunk formats&lt;br&gt;
provider-specific APIs&lt;br&gt;
retry lifecycle behavior&lt;br&gt;
Streaming Was the Hardest Part&lt;/p&gt;

&lt;p&gt;The most interesting challenge turned out to be streaming.&lt;/p&gt;

&lt;p&gt;Different providers stream differently:&lt;/p&gt;

&lt;p&gt;SSE chunks&lt;br&gt;
JSON lines&lt;br&gt;
partial JSON payloads&lt;br&gt;
token-by-token deltas&lt;br&gt;
different completion signals&lt;/p&gt;

&lt;p&gt;Normalizing these cleanly required:&lt;/p&gt;

&lt;p&gt;provider adapters&lt;br&gt;
shared streaming parsers&lt;br&gt;
unified response objects&lt;br&gt;
lifecycle hooks&lt;br&gt;
retry boundaries&lt;/p&gt;

&lt;p&gt;I ended up implementing:&lt;/p&gt;

&lt;p&gt;callback-based streaming&lt;br&gt;
Enumerator-based streaming&lt;br&gt;
normalized chunk responses&lt;br&gt;
provider-independent lifecycle hooks&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;AiModels.chat_stream(&lt;br&gt;
  provider: :lm_studio,&lt;br&gt;
  model: "tinyllama-1.1b-chat-v1.0",&lt;br&gt;
  messages: [&lt;br&gt;
    {&lt;br&gt;
      role: "user",&lt;br&gt;
      content: "Explain belongs_to vs has_many"&lt;br&gt;
    }&lt;br&gt;
  ]&lt;br&gt;
) do |chunk|&lt;br&gt;
  print chunk.content&lt;br&gt;
end&lt;br&gt;
Why Local AI Matters&lt;/p&gt;

&lt;p&gt;One thing I found particularly interesting was how useful local AI becomes during development.&lt;/p&gt;

&lt;p&gt;Running models through:&lt;/p&gt;

&lt;p&gt;Ollama&lt;br&gt;
LM Studio&lt;br&gt;
LocalAI&lt;/p&gt;

&lt;p&gt;gives:&lt;/p&gt;

&lt;p&gt;faster experimentation&lt;br&gt;
offline development&lt;br&gt;
lower costs&lt;br&gt;
more privacy&lt;br&gt;
easier debugging&lt;/p&gt;

&lt;p&gt;without depending entirely on hosted APIs.&lt;/p&gt;

&lt;p&gt;Rails developers are already used to running infrastructure locally:&lt;/p&gt;

&lt;p&gt;PostgreSQL&lt;br&gt;
Redis&lt;br&gt;
Sidekiq&lt;br&gt;
Elasticsearch&lt;/p&gt;

&lt;p&gt;Local AI runtimes fit naturally into that workflow.&lt;/p&gt;

&lt;p&gt;Architecture Approach&lt;/p&gt;

&lt;p&gt;The structure I ended up with looks roughly like this:&lt;/p&gt;

&lt;p&gt;Rails App&lt;br&gt;
   ↓&lt;br&gt;
AiModels.chat&lt;br&gt;
   ↓&lt;br&gt;
Client&lt;br&gt;
   ↓&lt;br&gt;
Provider Registry&lt;br&gt;
   ↓&lt;br&gt;
Provider Adapter&lt;br&gt;
   ↓&lt;br&gt;
Ollama / LM Studio / DeepSeek / OpenAI-compatible APIs&lt;/p&gt;

&lt;p&gt;Key ideas:&lt;/p&gt;

&lt;p&gt;provider isolation&lt;br&gt;
normalized response objects&lt;br&gt;
reusable streaming lifecycle&lt;br&gt;
provider-independent retries/hooks&lt;br&gt;
Rails-native configuration&lt;br&gt;
Current State&lt;/p&gt;

&lt;p&gt;The project currently supports:&lt;/p&gt;

&lt;p&gt;Ollama&lt;br&gt;
LM Studio&lt;br&gt;
DeepSeek&lt;br&gt;
OpenAI-compatible providers&lt;br&gt;
streaming&lt;br&gt;
retries/hooks&lt;br&gt;
Rails integration&lt;/p&gt;

&lt;p&gt;The next area I’m exploring is embeddings support for:&lt;/p&gt;

&lt;p&gt;semantic search&lt;br&gt;
RAG pipelines&lt;br&gt;
vector databases&lt;br&gt;
AI memory systems&lt;br&gt;
Final Thoughts&lt;/p&gt;

&lt;p&gt;One thing I’ve learned while building AI integrations:&lt;br&gt;
the hard part usually isn’t the model call itself.&lt;/p&gt;

&lt;p&gt;The difficult part is building stable infrastructure around:&lt;/p&gt;

&lt;p&gt;streaming&lt;br&gt;
retries&lt;br&gt;
provider abstraction&lt;br&gt;
observability&lt;br&gt;
lifecycle management&lt;/p&gt;

&lt;p&gt;especially once multiple providers enter the picture.&lt;/p&gt;

&lt;p&gt;I’m curious how other Ruby/Rails developers are approaching:&lt;/p&gt;

&lt;p&gt;local AI runtimes&lt;br&gt;
provider abstractions&lt;br&gt;
streaming APIs&lt;br&gt;
embeddings/RAG infrastructure&lt;br&gt;
Rails AI architecture in general&lt;/p&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/nishkarshh013/ai_models" rel="noopener noreferrer"&gt;https://github.com/nishkarshh013/ai_models&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rails</category>
    </item>
  </channel>
</rss>
