Forem: Ferit

Why We're Giving Octopus Free to Open Source, Forever

Ferit — Tue, 05 May 2026 13:12:06 +0000

I'm the developer behind Octopus Review, an AI code review tool that runs on pull requests. We just flipped a switch: every public, OSI-licensed repository gets unlimited reviews. No credit card, no monthly quota, no "free tier with limits."

I want to explain why, because the reasoning is more interesting than the announcement.

The Bill My Laptop Doesn't Send Me

Open the package.json of any project I've shipped in the last decade. React, TypeScript, Prisma, Next.js, Tailwind, Postgres, Redis, Nginx. Every single one is open source. Every single one was built by someone who didn't ask me for money.

The bill never arrives, but the debt is real.

And the people producing all of that are, by and large, exhausted. If you've ever maintained a moderately popular open source project, you know the pattern: a drive-by PR shows up on a Saturday morning, the contributor means well, the diff is 400 lines, half of it is unrelated formatting, the tests don't run on their machine, and they want a response now. Multiply by ten PRs a week, and that's your weekend.

Maintainer burnout isn't caused by writing code. It's caused by reviewing other people's code.

That's the exact problem Octopus was built to solve. Giving it to the people who need it most felt less like generosity and more like the obvious move.

The Selfish Version

I want to be honest. This isn't pure altruism, and the self-interested case is actually the stronger argument.

Public repos are where Octopus gets battle-tested. A private SaaS codebase looks broadly similar across customers. Open source is wilder: embedded C, Rust kernel patches, Lua game engines, six-decorator-deep Python ML libraries. If we can review those well, we can review anything.

Word of mouth in OSS is the best growth channel software has. None of that happens if your free tier is artificially crippled.

And open source is honest. Bad suggestions get called out in PR threads in public. We can't hide behind closed-source NDAs. That keeps us better than any internal QA process would.

Why "OSI-Licensed" and Not Just "Public"

There's a real difference between a permissively-licensed library and a public repo with a "no commercial use" license slapped on it. The first is open source. The second is source-available with a paywall, and that's fine, but it's not what we're subsidizing.

We use the OSI's list as the line: MIT, Apache 2.0, GPL, BSD, MPL, the standard ones. If your project is genuinely open, you qualify.
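As a rough illustration of where that line falls, an eligibility check could compare a repository's SPDX license identifier against an allowlist. This is a sketch, not how Octopus implements it: the function name and the hand-picked subset of identifiers are assumptions.

```python
from typing import Optional

# Hand-picked SPDX identifiers for a few OSI-approved licenses.
# This subset is an assumption for illustration, not the full OSI list.
OSI_APPROVED_SUBSET = {
    "MIT",
    "Apache-2.0",
    "GPL-2.0-only",
    "GPL-3.0-only",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "MPL-2.0",
}


def qualifies_for_free_reviews(is_public: bool, spdx_id: Optional[str]) -> bool:
    """Return True for a public repo whose license is in the allowlist."""
    if not is_public or spdx_id is None:
        return False
    return spdx_id in OSI_APPROVED_SUBSET


print(qualifies_for_free_reviews(True, "MIT"))           # True
print(qualifies_for_free_reviews(True, "CC-BY-NC-4.0"))  # False
print(qualifies_for_free_reviews(False, "MIT"))          # False
```

A public repo under a non-commercial license fails the check, which matches the source-available case described above.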

The Setup

One file in your repo:

# .github/workflows/octopus.yml
name: Octopus Review
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: octopusreview/action@v1

That's it. No dashboard signup gymnastics. No credit card "just in case." No quota meter ticking down.

The Catch

There isn't one. We pay for the LLM tokens, the indexing storage, the runners on our side. The maintainer pays nothing. There's no "we train on your code" clause: public code is already public, and we don't need to train on it to review it.

If you maintain an open source project, the workflow file above is the entire onboarding. Octopus itself is MIT-licensed and lives on GitHub. You can self-host the whole thing if you'd rather not depend on us, and that option stays open too.

To every maintainer reading this: thank you for the code I shipped this year on top of yours. This is the smallest piece of it I know how to give back.

Tags: opensource, ai, codereview, devtools, github

AI Writes 80% of Your Code. Who Reviews It?

Ferit — Thu, 02 Apr 2026 18:01:43 +0000

AI writes your functions, scaffolds your services, and ships PRs before you finish your coffee. Welcome to 2026, where "vibe coding" isn't a meme anymore: developers describe what they want in plain English, and AI generates the code.

The output is staggering. Teams report 40-70% of their committed code now originates from AI. But here's the number nobody puts on their landing page: AI co-authored code contains 1.7x more major issues than human-written code, including 2.74x more security vulnerabilities.

We got really good at generating code. We didn't get better at reviewing it.

The Review Bottleneck Nobody Planned For

Traditional code review was designed for a world where a senior engineer writes 200 lines a day. Now a junior dev with an AI assistant pushes 2,000. The reviewer's workload didn't scale, it exploded.

Most teams respond in one of two ways. They rubber-stamp PRs to keep velocity up, or they create a review backlog so deep that it kills the speed AI was supposed to deliver. Both options end the same way: bugs in production.

The core problem isn't volume, though. It's context. When a human writes code, they carry the project's architecture in their head. They know the naming conventions, the edge cases from last quarter's outage, the service boundary that shouldn't be crossed. AI doesn't carry any of that. It generates plausible code that compiles, passes lint, and quietly violates three architectural decisions your team made six months ago.

Why Diff-Only Review Tools Fall Short

Here's where most AI review tools break down. They look at the diff: the lines added and removed in a pull request. That's it.

A diff-only reviewer sees a function that looks correct in isolation. It doesn't know that your project already has a utility doing the same thing in a different module. It can't tell that the new database query bypasses the caching layer every other service uses. It won't flag that the error handling pattern contradicts what your team agreed on in an ADR last month.

Studies from early 2026 show that full-codebase-aware review tools catch 40-60% more cross-file issues than diff-only approaches. When AI is writing code that doesn't understand your project, your reviewer needs to understand your project deeply enough for both of them.

Codebase-Aware Review: Closing the Gap

This is the approach we built Octopus Review around. Instead of reviewing diffs in a vacuum, Octopus indexes your entire codebase using RAG (Retrieval-Augmented Generation) with vector search. When a PR comes in, the review has full project context: your patterns, your abstractions, your existing code.

The difference is practical. When AI-generated code introduces a new HTTP client wrapper, a diff-only tool checks if the syntax is correct. Octopus checks if you already have one, whether the new one follows your error handling conventions, and whether it respects the service boundaries defined elsewhere in the repo.
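To make the retrieval step concrete, here is a toy sketch of vector search over an index of file embeddings. It is not Octopus's implementation: the three-dimensional vectors are made up, and a real index would hold model-generated embeddings in a vector database.

```python
import math

# Toy index: file path -> embedding. The numbers are invented for
# illustration; real embeddings have hundreds of dimensions.
INDEX = {
    "src/services/messaging.ts": [0.9, 0.1, 0.0],
    "src/utils/retry.ts": [0.1, 0.9, 0.1],
    "src/templates/welcome.html": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, k=2):
    """Return the k indexed paths most similar to the query vector."""
    ranked = sorted(
        INDEX.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [path for path, _ in ranked[:k]]


# Pretend this is the embedding of a newly added HTTP client wrapper
new_code_vec = [0.8, 0.3, 0.0]
context = retrieve(new_code_vec)
print(context)  # ['src/services/messaging.ts', 'src/utils/retry.ts']
```

The retrieved files are what the reviewer sees next to the diff, which is how an existing service in another module can surface in a comment.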

Every review comment comes with a severity level (Critical, Major, Minor, Suggestion, or Tip), so you're not drowning in noise. A mismatched bracket and a security vulnerability don't sit in the same bucket.
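A minimal sketch of how severity buckets keep a review readable, assuming comments arrive as simple records. The `Comment` type and `group_by_severity` helper are hypothetical names for illustration, not part of Octopus's API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means more urgent, so sorting puts CRITICAL first
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    SUGGESTION = 3
    TIP = 4


@dataclass
class Comment:
    severity: Severity
    location: str
    message: str


def group_by_severity(comments):
    """Bucket comments by severity, most urgent bucket first."""
    grouped = {}
    for comment in sorted(comments, key=lambda c: c.severity):
        grouped.setdefault(comment.severity, []).append(comment)
    return grouped


comments = [
    Comment(Severity.SUGGESTION, "notify.ts:5", "Move templates"),
    Comment(Severity.CRITICAL, "notify.ts:42", "Unvalidated input"),
    Comment(Severity.MAJOR, "notify.ts:15", "Duplicate client"),
    Comment(Severity.MAJOR, "notify.ts:38", "Missing retry"),
]
grouped = group_by_severity(comments)
for severity, items in grouped.items():
    print(f"{severity.name} ({len(items)})")
```

Because dictionaries preserve insertion order, the CRITICAL bucket prints first and the SUGGESTION bucket last.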

Here's what it looks like in practice with the CLI:

octopus pr review 247

 PR #247: Add user notification service

 CRITICAL (1)
 ├─ src/services/notify.ts:42
 │  Unvalidated user input passed directly to email template.
 │  This creates an injection vector. Sanitize with your existing
 │  `sanitizeHtml()` utility in src/utils/sanitize.ts.

 MAJOR (2)
 ├─ src/services/notify.ts:15
 │  New NotificationClient duplicates functionality of existing
 │  MessagingService in src/services/messaging.ts. Consider extending
 │  the existing service instead.
 ├─ src/services/notify.ts:38
 │  Missing retry logic. All other service calls in this project
 │  use the withRetry() wrapper from src/utils/retry.ts.

 SUGGESTION (1)
 ├─ src/services/notify.ts:5
 │  Consider moving notification templates to src/templates/
 │  to match the project's existing template organization.

 4 comments across 1 file | Review time: 12s

That first MAJOR comment is the one that matters most. A diff-only tool would never catch it because the duplication exists in a completely different file. Octopus catches it because it has indexed the entire codebase and knows MessagingService already exists.

Your Standards, Not Generic Rules

The other gap in vibe-coded projects is consistency. AI models are trained on the entire internet. They'll write perfectly valid Go in one function and subtly different Go in the next, mixing community conventions your team never adopted.

Octopus has a Knowledge Base where you feed your team's standards, architectural decision records, and style guides. Reviews then enforce YOUR rules, not generic best practices from Stack Overflow circa 2023.

This matters more in the vibe coding era than ever. When half your code comes from an AI that has no memory of your last sprint, someone needs to be the institutional memory. That someone can be automated.

Open Source, Self-Hostable, No Vendor Lock-in

One more thing that matters when AI is touching every line of your code: where does that code go during review?

Octopus is open source (Modified MIT) and fully self-hostable. Your code is processed in-memory only. Embeddings are persisted for search, but source code is never stored. Bring your own API key for Claude or OpenAI, and run it on your own infrastructure.

git clone https://github.com/octopusreview/octopus.git
docker-compose up -d

That's it. Full AI code review running on your hardware, your keys, your rules.

The Bottom Line

Vibe coding isn't going away. The teams that thrive won't be the ones generating the most code. They'll be the ones who close the gap between generation and understanding.

If your review process can't see beyond the diff, it can't catch what AI gets wrong. And AI gets a lot wrong, quietly, confidently, at scale.

Give Octopus Review a try. Star the GitHub repo. Join the Discord if you want to talk about what codebase-aware review actually looks like in practice.

Octopus Review is an open-source, RAG-powered AI code review tool. It works with GitHub and Bitbucket, reviews PRs with full codebase context, and can be self-hosted with zero vendor lock-in.

I Built an AI Code Review Tool and Tested It Against 6 Competitors. Here's an Honest Breakdown.

Ferit — Tue, 17 Mar 2026 18:28:51 +0000

Full disclosure: I built Octopus Review. So take everything I say with a grain of salt. But I also genuinely use other tools in this space, and I think being honest about where competitors do better is more useful than pretending my tool is perfect.

Why I Built Yet Another Code Review Tool

I was frustrated with existing AI review tools for one specific reason: they only look at the diff.

You open a PR, the AI reads the changed lines, and gives you feedback based on just those lines. It doesn't know your project uses a specific error handling pattern. It doesn't know you have a utility function that already does what the new code is doing. It doesn't know your team decided last month to stop using that deprecated API.

I wanted a tool that actually understands the full codebase before reviewing anything. That's the core idea behind Octopus Review: index the entire repository using vector embeddings (RAG), so every review has the full picture.

Does it work perfectly? Not yet. But it addresses a real gap that none of the other tools are solving the same way.

The Honest Comparison

I tested 6 other tools alongside Octopus Review on a mid-size TypeScript monorepo (~120k lines). Here's what I found.

Octopus Review (mine)

What it does well:

Indexes your codebase into a vector database (Qdrant), so reviews reference existing patterns and code
CLI tool for terminal-based reviews and codebase Q&A
RAG Chat lets you ask questions about your own code
Knowledge Base for team conventions
Open source (MIT), self-hostable, free

Where it falls short:

Indexing takes time on large repos. First setup is not instant for big codebases.
Security scanning is basic. If you need vulnerability detection, you need another tool alongside it.
Still young. The ecosystem and community are growing, but it's not battle-tested like Semgrep or Codacy.
RAG retrieval isn't always perfect. Sometimes it pulls irrelevant context, which can lead to noisy suggestions.
Analytics dashboard is functional but minimal compared to Codacy's trend tracking.

Pricing: Free, no limits.

I genuinely believe the RAG approach is the right direction for code reviews. Whether Octopus is the best implementation of that idea yet is a fair question, but it's getting there.

CodeRabbit

This is the tool I find myself comparing against the most.

What it does well:

Inline review comments are clean and well-integrated into the PR flow
Goes beyond style to comment on maintainability and readability
Multi-reviewer collaboration is nice for teams
Broad language support

Where it beats Octopus:

The inline comment UX is more polished. My tool leaves review comments too, but CodeRabbit's formatting and presentation is a step ahead.
For teams that just want "better PR comments" without caring about codebase-wide context, CodeRabbit is simpler to adopt.

Where it falls short:

Reviews are diff-scoped. It doesn't know about the rest of your codebase.
Limited security features
Paid tiers can get expensive for small teams

Panto AI

The most comprehensive tool on this list if security and compliance are your priority.

What it does well:

Combines static analysis, secrets detection, dependency scanning, and IaC security in one place
Compliance reporting for SOC 2, ISO, PCI-DSS
Context-aware prioritization reduces alert fatigue

Where it beats Octopus:

Security, and it's not even close. If you need vulnerability scanning, secrets detection, or compliance reports, Panto does what Octopus simply can't right now.
For regulated industries, Panto is a much more complete solution.

Where it falls short:

Better suited for medium-to-large teams with dedicated security workflows
Heavier setup and onboarding process
Not open source

Aikido Security

Security-first approach with a strong focus on reducing false positives.

What it does well:

Smart filtering that genuinely cuts down noise
Good GitHub integration
Customizable security policies
Actionable remediation guidance with fix suggestions

Where it beats Octopus:

Security coverage, obviously
The false positive reduction is impressive. My tool can sometimes be noisy with suggestions. Aikido is more disciplined about what it surfaces.

Where it falls short:

Needs tuning to get the best results for your specific stack
Focused on security, not code quality or architecture

Devlo.ai

Devlo does something interesting that neither I nor most competitors do well: deep logic analysis.

What it does well:

Catches architectural weaknesses and edge cases
Suggests tests for uncovered code paths
Identifies performance risks

Where it beats Octopus:

Logic flaw detection. Octopus understands context, but Devlo is better at catching subtle logic bugs and suggesting tests. I'd love to get there eventually.

Where it falls short:

Configuration-heavy for advanced features
Struggles with less common frameworks and languages

Semgrep

The veteran. Not AI-powered in the modern LLM sense, but incredibly effective.

What it does well:

Custom rule creation is unmatched
Fast scanning at scale
Huge ecosystem of community rules
Solid CI/CD integration

Where it beats Octopus:

Maturity and reliability. Semgrep has been around longer and is trusted by serious security teams.
Rule-based analysis is deterministic. You know exactly what it checks. LLM-based tools (including mine) can be unpredictable.
Open source with a proven track record

Where it falls short:

Not AI-powered. It won't catch novel patterns or give natural language explanations.
Custom rules require expertise to write and maintain

Codacy

The dashboard tool. Best for teams that want to track quality metrics over time.

What it does well:

Quality trend dashboards are genuinely useful for spotting regressions
Wide language support
Technical debt visibility with actionable breakdowns
Flexible coding standard enforcement

Where it beats Octopus:

Long-term quality tracking. Octopus reviews individual PRs. Codacy shows you the bigger picture of how your code quality evolves over weeks and months.
More mature analytics and reporting

Where it falls short:

Reviews aren't as deep as AI-native tools
Innovation has slowed compared to newer entrants in the space

Quick Comparison

Feature	Octopus	CodeRabbit	Panto	Aikido	Devlo	Semgrep	Codacy
Codebase context	Full (RAG)	Diff only	Partial	Partial	Deep logic	Rule-based	Partial
Security	Basic	Limited	Strong	Strong	Basic	Strong	Basic
Self-hostable	Yes	No	No	No	No	Yes	No
Open source	MIT	No	No	No	No	Partial	No
CLI	Yes	No	No	No	No	Yes	No
Free tier	Unlimited	Limited	Limited	Limited	Limited	OSS free	Limited
Maturity	Early	Established	Established	Established	Growing	Veteran	Established

So What Should You Actually Use?

Here's my honest take, even though it sometimes means recommending a competitor:

Use Octopus Review if you care about codebase-aware reviews, want something open source and self-hostable, and are okay with a tool that's still evolving. It's free, so the risk is low.

Use Panto or Aikido if security and compliance are your primary concern. Octopus is not a security tool and I'm not planning to make it one.

Use Semgrep if you want deterministic, rule-based analysis you can fully control. Sometimes you don't want AI creativity in your code reviews. You want rules.

Use CodeRabbit if you want the most polished inline review experience and codebase context isn't a priority for you.

Use Codacy if tracking quality metrics over time matters more than individual PR feedback.

Use Devlo if you're working on complex systems and want logic-level analysis that goes deeper than style and patterns.

Getting Started

If you want to try Octopus, head over to octopus-review.ai, connect your GitHub repos, and it starts reviewing PRs automatically. No credit card, no paywall.

If something breaks or the reviews are off, open an issue. I read all of them.

I don't think AI code review is a winner-take-all market. Different tools solve different problems. I built Octopus because I wanted full codebase context in my reviews, and I couldn't find it elsewhere. If that resonates with you, give it a shot. If your needs are different, one of the other tools on this list might be a better fit.

What's your experience with AI code review tools? I'd genuinely love to hear what's working (or not) for your team.

From scikit-learn to Production, Deploying ML Models That Actually Work

Ferit — Mon, 02 Mar 2026 14:31:42 +0000

There is a gap between training a model in a Jupyter Notebook and running it in production. Most tutorials stop at model.score() and call it done. This article covers the full pipeline: data preprocessing, model selection, evaluation, serialization, and serving a scikit-learn model behind a FastAPI endpoint.

The Problem

We needed a transaction risk scoring system for a crypto payment gateway. The model receives transaction features and returns a fraud probability between 0 and 1. Requirements:

Latency under 50ms per prediction
Handle 1,000 requests per second
Update the model weekly without downtime
Explainable predictions (regulators want to know why a transaction was flagged)

scikit-learn turned out to be the right tool. Not TensorFlow, not PyTorch. For tabular data with fewer than 100 features, gradient boosted trees in scikit-learn are hard to beat.

Data Preprocessing Pipeline

Raw transaction data is messy. Missing values, mixed types, different scales. scikit-learn pipelines handle this cleanly:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = [
    "amount_usd", "gas_price", "tx_count_24h",
    "address_age_days", "avg_tx_value"
]

categorical_features = [
    "chain", "token_type", "sender_country"
]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

The key insight here is that the preprocessor becomes part of the saved model. When you serialize the pipeline, all the preprocessing logic (imputation values, scaling parameters, encoding mappings) travels with the model. No separate preprocessing code needed at inference time.
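The idea that learned preprocessing state travels with the artifact can be shown with a toy imputer, independent of scikit-learn. This is a sketch under simplifying assumptions: `MedianImputer` is a hypothetical stand-in for a single-column `SimpleImputer`, and it persists its state as JSON, whereas joblib pickles the whole pipeline object.

```python
import json
import statistics


class MedianImputer:
    """Toy stand-in for SimpleImputer(strategy="median") on one column."""

    def __init__(self, median=None):
        self.median = median

    def fit(self, values):
        # Learn the fill value from the training data only
        self.median = statistics.median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # Replace missing entries with the value learned during fit
        return [self.median if v is None else v for v in values]


# Training time: fit, then persist the learned state
imputer = MedianImputer().fit([10.0, None, 30.0, 50.0])
artifact = json.dumps({"median": imputer.median})

# Inference time: rebuild from the artifact without the training data
restored = MedianImputer(**json.loads(artifact))
filled = restored.transform([None, 7.0])
print(filled)  # [30.0, 7.0]
```

The inference side never recomputes the median. It uses whatever value was learned at training time, which is the behavior a serialized pipeline provides.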

Model Selection

We evaluated four models on our dataset:

from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=200),
    "adaboost": AdaBoostClassifier(n_estimators=200)
}

results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model)
    ])

    scores = cross_val_score(
        pipeline, X_train, y_train,
        cv=5, scoring="roc_auc"
    )

    results[name] = {
        "mean_auc": scores.mean(),
        "std": scores.std()
    }
    print(f"{name}: AUC = {scores.mean():.4f} (+/- {scores.std():.4f})")

Results on our dataset:

Model	AUC	Latency (p95)
Logistic Regression	0.87	0.2ms
Random Forest	0.93	3.1ms
Gradient Boosting	0.96	1.8ms
AdaBoost	0.91	2.4ms

Gradient boosting won on accuracy. The 1.8ms latency is well within our 50ms budget.

Hyperparameter Tuning

Grid search with cross-validation finds the best parameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__min_samples_leaf": [5, 10, 20]
}

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier())
])

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1
)

search.fit(X_train, y_train)

print(f"Best AUC: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
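Before launching a search like this, it helps to count how many model fits the grid implies. The sketch below reproduces the grid above and does the arithmetic. Only the variable names are new.

```python
from itertools import product

param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 5, 7],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__min_samples_leaf": [5, 10, 20],
}
cv_folds = 5

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))

# Each combination is fitted once per cross-validation fold
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)  # 81 405
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best configuration on the full training set, so the search above trains 406 pipelines in total.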

Best configuration for our dataset:

best_model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.05,
        min_samples_leaf=10
    ))
])

Evaluation Beyond Accuracy

For fraud detection, accuracy is misleading because the dataset is imbalanced (99% legitimate, 1% fraud). We care about precision and recall:

from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    precision_recall_curve
)

best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")

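# Find the threshold that meets the recall target
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# We want at least 90% recall (catch 90% of fraud).
# Recall falls as the threshold rises, so scan from the highest threshold
# down and stop at the first one that still reaches 90% recall.
# (precisions and recalls have one more entry than thresholds.)
for i in range(len(thresholds) - 1, -1, -1):
    if recalls[i] >= 0.90:
        optimal_threshold = thresholds[i]
        print(f"Threshold for 90% recall: {optimal_threshold:.3f}")
        print(f"Precision at this threshold: {precisions[i]:.3f}")
        break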
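To see why accuracy misleads at a 99/1 class split, compare a classifier that never flags anything with one that catches most fraud. The confusion-matrix counts below are invented for illustration and are not results from this project.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 10,000 transactions, 100 of them fraudulent (1%)

# Model A never flags anything: every fraud case is a false negative
always_legit = metrics(tp=0, fp=0, fn=100, tn=9900)

# Model B (hypothetical) catches 90 of 100 frauds with 30 false alarms
useful = metrics(tp=90, fp=30, fn=10, tn=9870)

print("always legit:", always_legit)  # accuracy 0.99, precision 0.0, recall 0.0
print("useful model:", useful)        # accuracy 0.996, precision 0.75, recall 0.9
```

The two models differ by less than one percentage point in accuracy, while recall goes from 0% to 90%. That gap is why the threshold is chosen on the precision-recall curve.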

Feature Importance

Regulators need explainability. scikit-learn makes this straightforward:

import numpy as np

classifier = best_model.named_steps["classifier"]
feature_names = (
    numeric_features +
    list(best_model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .named_steps["encoder"]
         .get_feature_names_out(categorical_features))
)

importances = classifier.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top 10 features:")
for i in range(min(10, len(indices))):
    idx = indices[i]
    print(f"  {feature_names[idx]}: {importances[idx]:.4f}")

Output:

Top 10 features:
  amount_usd: 0.2341
  tx_count_24h: 0.1876
  address_age_days: 0.1523
  gas_price: 0.0987
  avg_tx_value: 0.0834
  chain_ETH: 0.0612
  chain_TRC20: 0.0445
  sender_country_unknown: 0.0398
  token_type_USDT: 0.0321
  chain_BTC: 0.0287

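This tells us that transaction amount, 24-hour transaction count, address age, and gas price carry most of the signal. Impurity-based importances rank features but do not show which direction each one pushes a prediction, so the reading "high amounts from new addresses with unusual gas prices" comes from inspecting the data alongside the ranking. That combination is explainable to a regulator.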

Model Serialization

Save the entire pipeline (preprocessor + model) as a single artifact:

import json
import os
from datetime import datetime

import joblib
import numpy as np

# Make sure the output directory exists before writing artifacts
os.makedirs("models", exist_ok=True)

model_version = datetime.now().strftime("%Y%m%d_%H%M%S")
model_path = f"models/fraud_detector_{model_version}.joblib"

joblib.dump(best_model, model_path)

# Verify the saved model works
loaded_model = joblib.load(model_path)
assert np.allclose(
    loaded_model.predict_proba(X_test),
    best_model.predict_proba(X_test)
)

# Save metadata (cast NumPy scalars to built-in types for JSON)
metadata = {
    "version": model_version,
    "auc": float(roc_auc_score(y_test, y_proba)),
    "threshold": float(optimal_threshold),
    "features": [str(name) for name in feature_names],
    "training_samples": len(X_train)
}

with open(f"models/fraud_detector_{model_version}.json", "w") as f:
    json.dump(metadata, f, indent=2)

Serving with FastAPI

The production API loads the model once at startup and serves predictions:

import json

import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load model and metadata once at startup
model = joblib.load("models/fraud_detector_latest.joblib")
with open("models/fraud_detector_latest.json") as f:
    metadata = json.load(f)
threshold = metadata["threshold"]

class TransactionInput(BaseModel):
    amount_usd: float
    gas_price: float
    tx_count_24h: int
    address_age_days: int
    avg_tx_value: float
    chain: str
    token_type: str
    sender_country: str

class PredictionOutput(BaseModel):
    fraud_probability: float
    is_flagged: bool
    model_version: str

@app.post("/predict", response_model=PredictionOutput)
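async def predict(tx: TransactionInput):
    # One-row DataFrame so the pipeline can select columns by name
    features = pd.DataFrame([tx.model_dump()])

    # predict_proba returns NumPy scalars; cast to a built-in float so the
    # response model receives plain Python types
    probability = float(model.predict_proba(features)[0][1])

    return PredictionOutput(
        fraud_probability=round(probability, 4),
        is_flagged=probability >= threshold,
        model_version=metadata["version"]
    )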

@app.get("/health")
async def health():
    return {"status": "ok", "model_version": metadata["version"]}

Hot-Swapping Models

To update the model without downtime, we use a simple versioning scheme:

import glob
import json
import threading

import joblib

class ModelManager:
    def __init__(self, model_dir: str):
        self.model_dir = model_dir
        self.current_model = None
        self.current_metadata = None
        self.lock = threading.Lock()
        self.load_latest()

    def load_latest(self):
        model_files = sorted(glob.glob(f"{self.model_dir}/fraud_detector_*.joblib"))
        if not model_files:
            raise FileNotFoundError("No model found")

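        latest = model_files[-1]
        # Load from disk before taking the lock so requests are not blocked on I/O
        new_model = joblib.load(latest)
        meta_path = latest.replace(".joblib", ".json")
        with open(meta_path) as f:
            new_metadata = json.load(f)

        # Swap both references together so model and metadata stay in sync
        with self.lock:
            self.current_model = new_model
            self.current_metadata = new_metadata

    def predict(self, features):
        # Hold the lock only to read the reference, then predict outside it
        # so concurrent requests do not queue behind one another
        with self.lock:
            model = self.current_model
        return model.predict_proba(features)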
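One detail worth checking in the versioning scheme: timestamped names sort correctly as plain strings, but the glob pattern also matches an alias such as fraud_detector_latest.joblib, and letters sort after digits. The sketch below shows the effect and one possible filter. The regex and the `newest_versioned` helper are assumptions for illustration.

```python
import re

# Matches only names carrying a YYYYMMDD_HHMMSS version stamp
VERSIONED = re.compile(r"fraud_detector_(\d{8}_\d{6})\.joblib$")


def newest_versioned(paths):
    """Return the path with the most recent timestamp, ignoring aliases."""
    versioned = [p for p in paths if VERSIONED.search(p)]
    if not versioned:
        raise FileNotFoundError("No versioned model found")
    # Zero-padded timestamps compare correctly as strings
    return max(versioned, key=lambda p: VERSIONED.search(p).group(1))


files = [
    "models/fraud_detector_20260215_090000.joblib",
    "models/fraud_detector_20260301_120000.joblib",
    "models/fraud_detector_latest.joblib",
]

naive = sorted(files)[-1]
print(naive)                    # models/fraud_detector_latest.joblib
print(newest_versioned(files))  # models/fraud_detector_20260301_120000.joblib
```

If the alias and the timestamped files live in the same directory, filtering by the timestamp pattern keeps the manager from reloading a stale alias.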

Performance Numbers

After deploying this setup:

Prediction latency: 2ms p50, 8ms p95
Throughput: 2,400 requests/second on a single core
Model size: 12MB (serialized pipeline)
Fraud detection rate: 94% recall at 87% precision
False positive rate: 0.3%
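For readers reproducing latency figures like these, p50 and p95 can be computed from raw timings with a nearest-rank percentile. The sketch uses synthetic samples (1 ms to 100 ms), not measurements from this deployment, and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds: 1.0, 2.0, ..., 100.0
latencies_ms = [float(i) for i in range(1, 101)]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)  # 50.0 95.0
```

Reporting p95 alongside p50 matters because the median hides the slow tail that a latency budget has to cover.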

The full implementation is on GitHub: python-machine-learning

If you are deploying ML models to production or working with scikit-learn at scale, I would like to hear about your experience. Find me on GitHub.

Building Multi-Model AI Agents with OpenAI, Ollama, Groq and Gemini

Ferit — Sun, 01 Mar 2026 22:23:42 +0000

Most AI applications today rely on a single LLM provider. That works fine until the API goes down, rate limits hit, or your costs spiral out of control. A better approach is to build agents that can orchestrate multiple models and switch between them based on the task at hand.

In this article, I will walk through how I built an AI agent framework that supports OpenAI GPT-4, Ollama local models, Groq ultra-fast inference, and Google Gemini as interchangeable backends.

Why Multi-Model?

Each provider has different strengths:

OpenAI GPT-4 has the best reasoning and function calling
Ollama runs locally with no network round trip and no API costs
Groq delivers sub-200ms inference for real-time applications
Gemini excels at multimodal tasks (vision, audio, code)

By abstracting the provider layer, your agent can pick the right model for each subtask, fall back gracefully when one provider fails, and optimize cost by routing simple tasks to cheaper models.

Architecture Overview

The framework has four main components:

Agent Core -> Planning -> Tool Execution -> Memory
     |            |            |              |
  LLM Router   Task Graph   Registry     Redis/PostgreSQL
     |
  OpenAI | Ollama | Groq | Gemini

The LLM Router is the key piece. It decides which provider handles each request based on configurable rules.
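The fallback behavior can be sketched with plain async functions standing in for providers. Everything here is illustrative: the stand-in providers, the `complete_with_fallback` name, and the ordering rule (primary first, then the remaining names in a fixed order) are assumptions, not the framework's actual code.

```python
import asyncio


async def flaky(prompt):
    # Stand-in for a provider that is down or rate limited
    raise ConnectionError("rate limited")


async def steady(prompt):
    # Stand-in for a healthy provider
    return f"echo: {prompt}"


async def complete_with_fallback(prompt, providers, primary, fallback_order):
    """Try the primary provider, then the others in fallback order."""
    chain = [primary] + [name for name in fallback_order if name != primary]
    errors = {}
    for name in chain:
        handler = providers.get(name)
        if handler is None:
            continue  # provider not configured in this deployment
        try:
            return name, await handler(prompt)
        except Exception as exc:
            errors[name] = exc
    raise RuntimeError(f"All providers failed: {errors}")


providers = {"openai": flaky, "groq": steady}
order = ["openai", "groq", "ollama", "gemini"]

used, text = asyncio.run(
    complete_with_fallback("hi", providers, "openai", order)
)
print(used, text)  # groq echo: hi
```

The caller gets back the name of the provider that answered, which is useful for logging how often the fallback path is taken.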

Setting Up the Provider Layer

First, define a common interface that all providers implement:

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float

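class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: list | None = None) -> LLMResponse:
        pass

    # Optional capability: not abstract, so chat-only providers
    # can still be instantiated
    async def embed(self, text: str) -> list[float]:
        raise NotImplementedError(f"{type(self).__name__} does not support embeddings")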
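One benefit of a shared interface is that tests can swap in a provider that never touches the network. The sketch below redeclares a minimal version of the interface so it runs on its own. `FakeProvider` and its word-count token estimate are hypothetical, for illustration only.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float


class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: Optional[list] = None) -> LLMResponse:
        ...


class FakeProvider(LLMProvider):
    """Test double: returns a canned reply and makes no network call."""

    def __init__(self, reply: str):
        self.reply = reply

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=self.reply,
            model="fake-1",
            provider="fake",
            # Crude stand-in for a token count: number of words in the reply
            tokens_used=len(self.reply.split()),
            latency_ms=latency,
        )


response = asyncio.run(
    FakeProvider("pong").complete([{"role": "user", "content": "ping"}])
)
print(response.provider, response.content)  # fake pong
```

Agent logic written against `LLMProvider` can be exercised with this double before any API key is configured.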

Then implement each provider:

import os
import time

import google.generativeai as genai
import ollama
import openai
from groq import AsyncGroq

class OpenAIProvider(LLMProvider):
    def __init__(self, model="gpt-4"):
        # Reads OPENAI_API_KEY from the environment
        self.client = openai.AsyncOpenAI()
        self.model = model

    async def complete(self, messages, tools=None):
        # Only send `tools` when there are any, rather than an explicit null
        kwargs = {"tools": tools} if tools else {}
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **kwargs
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            # content is None when the model replies with a tool call
            content=response.choices[0].message.content or "",
            model=self.model,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency
        )

class OllamaProvider(LLMProvider):
    def __init__(self, model="llama3"):
        self.model = model

    async def complete(self, messages, tools=None):
        # `tools` is accepted for interface compatibility but not forwarded
        start = time.monotonic()
        response = await ollama.AsyncClient().chat(
            model=self.model,
            messages=messages
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response["message"]["content"],
            model=self.model,
            provider="ollama",
            # eval_count covers generated tokens only
            tokens_used=response.get("eval_count", 0),
            latency_ms=latency
        )

class GroqProvider(LLMProvider):
    def __init__(self, model="llama3-70b-8192"):
        # Async client, so the call below does not block the event loop
        self.client = AsyncGroq()
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            model=self.model,
            provider="groq",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency
        )

class GeminiProvider(LLMProvider):
    def __init__(self, model="gemini-pro"):
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model_name = model
        self.model = genai.GenerativeModel(model)

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # Only the latest message is sent; earlier turns are dropped
        response = await self.model.generate_content_async(
            messages[-1]["content"]
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.text,
            model=self.model_name,
            provider="gemini",
            tokens_used=0,  # token usage is not tracked for this provider
            latency_ms=latency
        )

The LLM Router

The router picks a provider by task type, then falls back through the remaining providers in a fixed order if a call fails:

import logging

logger = logging.getLogger(__name__)

class LLMRouter:
    def __init__(self, providers: dict[str, LLMProvider]):
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type="general", tools=None):
        provider_name = self._select_provider(task_type)
        last_error = None

        for name in self._get_fallback_chain(provider_name):
            provider = self.providers.get(name)
            if provider is None:
                continue  # this provider was not configured

            try:
                return await provider.complete(messages, tools)
            except Exception as e:
                logger.warning(f"{name} failed: {e}, trying next provider")
                last_error = e

        # Chain the last failure so the root cause shows up in the traceback.
        raise RuntimeError("All providers failed") from last_error

    def _select_provider(self, task_type):
        routing_rules = {
            "reasoning": "openai",
            "realtime": "groq",
            "local": "ollama",
            "vision": "gemini",
            "general": "openai"
        }
        return routing_rules.get(task_type, "openai")

    def _get_fallback_chain(self, primary):
        # Primary provider first, then the others in fallback order.
        return [primary] + [name for name in self.fallback_order if name != primary]
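To see the fallback behaviour without calling any real API, here is a minimal sketch that runs the same routing logic against stub providers. StubProvider, StubResponse, and the module-level route() and fallback_chain() functions are hypothetical names made up for this illustration; they only mirror the shape of the router above.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger("router_demo")


@dataclass
class StubResponse:
    """Hypothetical stand-in for the LLMResponse type used in the article."""
    content: str
    provider: str


class StubProvider:
    """Hypothetical provider that either answers or raises, for offline testing."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    async def complete(self, messages, tools=None):
        if self.fail:
            raise ConnectionError(f"{self.name} is unavailable")
        return StubResponse(content=f"answer from {self.name}", provider=self.name)


ROUTING_RULES = {"reasoning": "openai", "realtime": "groq", "local": "ollama"}
FALLBACK_ORDER = ["openai", "groq", "ollama"]


def fallback_chain(primary: str) -> list[str]:
    """Primary provider first, then the rest in fallback order."""
    return [primary] + [name for name in FALLBACK_ORDER if name != primary]


async def route(providers: dict, messages: list, task_type: str = "general"):
    primary = ROUTING_RULES.get(task_type, "openai")
    last_error = None
    for name in fallback_chain(primary):
        provider = providers.get(name)
        if provider is None:
            continue  # provider not configured, skip it
        try:
            return await provider.complete(messages)
        except Exception as exc:
            logger.warning("%s failed: %s, trying next provider", name, exc)
            last_error = exc
    raise RuntimeError("All providers failed") from last_error


async def demo() -> str:
    providers = {
        "openai": StubProvider("openai"),
        "groq": StubProvider("groq", fail=True),  # simulate an outage
        "ollama": StubProvider("ollama"),
    }
    messages = [{"role": "user", "content": "hi"}]
    response = await route(providers, messages, task_type="realtime")
    return response.provider
```

Calling asyncio.run(demo()) asks for a "realtime" task, which maps to Groq. Since the Groq stub raises, the request lands on the next entry in the chain, OpenAI.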

Building the Agent

With the router in place, the agent itself is straightforward:

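class Agent:
    def __init__(self, router: LLMRouter, tools: ToolRegistry, memory: Memory, max_steps: int = 10):
        self.router = router
        self.tools = tools
        self.memory = memory
        self.max_steps = max_steps  # upper bound on tool-call rounds

    async def execute(self, task: str) -> str:
        context = await self.memory.get_relevant(task)
        task_type = self._classify_task(task)  # classify once, not on every round

        messages = [
            {"role": "system", "content": self._build_system_prompt(context)},
            {"role": "user", "content": task}
        ]

        for _ in range(self.max_steps):
            response = await self.router.route(
                messages,
                task_type=task_type,
                tools=self.tools.get_schemas()
            )

            if not response.has_tool_calls:
                break

            # OpenAI-style APIs also expect the assistant message that requested
            # the tools to precede these results in `messages`.
            tool_results = await self.tools.execute(response.tool_calls)
            messages.extend(tool_results)
        else:
            # The model kept asking for tools; stop instead of looping forever.
            raise RuntimeError(f"No final answer after {self.max_steps} tool rounds")

        await self.memory.store(task, response.content)
        return response.content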
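The tool-calling loop is easier to reason about when it can be exercised offline. The sketch below replays a scripted conversation through a bounded version of the loop. ScriptedRouter, Response, ToolCall, and run_loop are hypothetical names invented for this example, and the step limit is an assumption of the sketch rather than something the framework is known to enforce.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical shape of a tool call: id, tool name, keyword arguments."""
    id: str
    name: str
    arguments: dict


@dataclass
class Response:
    """Hypothetical response type exposing the fields the loop relies on."""
    content: str
    tool_calls: list = field(default_factory=list)

    @property
    def has_tool_calls(self) -> bool:
        return bool(self.tool_calls)


class ScriptedRouter:
    """Returns pre-written responses in order instead of calling a model."""

    def __init__(self, script):
        self._script = list(script)
        self.calls = 0

    async def route(self, messages, task_type="general", tools=None):
        self.calls += 1
        return self._script.pop(0)


async def add(a: int, b: int) -> int:
    return a + b


TOOLS = {"add": add}  # name -> coroutine function


async def run_loop(router, task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = await router.route(messages)
        if not response.has_tool_calls:
            return response.content  # the model produced a final answer
        for call in response.tool_calls:
            result = await TOOLS[call.name](**call.arguments)
            messages.append(
                {"role": "tool", "content": str(result), "tool_call_id": call.id}
            )
    raise RuntimeError(f"No final answer after {max_steps} steps")


async def demo():
    router = ScriptedRouter([
        Response(content="", tool_calls=[ToolCall("call_1", "add", {"a": 2, "b": 3})]),
        Response(content="2 + 3 = 5"),
    ])
    answer = await run_loop(router, "What is 2 + 3?")
    return answer, router.calls
```

Here the first scripted response requests the add tool and the second one is the final answer, so the loop finishes after two routed calls. A script that only ever returns tool calls would hit the step limit and raise instead of spinning.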

Tool Registry

Tools give the agent the ability to interact with external systems:

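class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, func, schema: dict):
        self._tools[name] = {"func": func, "schema": schema}

    def get_schemas(self) -> list[dict]:
        # The agent passes these to the model so it knows which tools exist.
        return [tool["schema"] for tool in self._tools.values()]

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            tool = self._tools.get(call.name)
            if tool is None:
                # Report unknown tools back to the model instead of crashing.
                content = f"Error: unknown tool '{call.name}'"
            else:
                # Tool functions are expected to be coroutines taking keyword arguments.
                content = str(await tool["func"](**call.arguments))
            results.append({
                "role": "tool",
                "content": content,
                "tool_call_id": call.id
            })
        return results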

    @classmethod
    def default(cls):
        registry = cls()
        registry.register("web_search", web_search, web_search_schema)
        registry.register("code_execute", code_execute, code_execute_schema)
        registry.register("file_read", file_read, file_read_schema)
        return registry
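The registry above references file_read and file_read_schema without showing them. As a rough idea of what such a pair could look like, here is a sketch that assumes OpenAI-style function schemas. The function body, the max_bytes parameter, and the parse_arguments helper are illustrative assumptions, not the framework's actual tools.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def file_read(path: str, max_bytes: int = 4096) -> str:
    """Hypothetical tool body: return up to max_bytes of a file as text."""
    # Read in a worker thread so disk access does not block the event loop.
    data = await asyncio.to_thread(Path(path).read_bytes)
    return data[:max_bytes].decode("utf-8", errors="replace")


# Schema in the OpenAI function-calling format: a JSON Schema for the arguments.
file_read_schema = {
    "type": "function",
    "function": {
        "name": "file_read",
        "description": "Read the beginning of a text file.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file."},
                "max_bytes": {"type": "integer", "description": "Read limit in bytes."},
            },
            "required": ["path"],
        },
    },
}


def parse_arguments(raw) -> dict:
    """Tool-call arguments often arrive as a JSON string; accept a dict too."""
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


async def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "notes.txt"
        target.write_text("hello tools", encoding="utf-8")
        # Simulate the JSON-encoded arguments a model would send back.
        args = parse_arguments(json.dumps({"path": str(target)}))
        return await file_read(**args)
```

Keeping the schema next to the function makes it harder for the two to drift apart. The parse_arguments step matters in practice because the registry calls the function with keyword arguments, which only works once the arguments are a dict.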

Putting It All Together

import asyncio

async def main():
    providers = {
        "openai": OpenAIProvider("gpt-4"),
        "ollama": OllamaProvider("llama3"),
        "groq": GroqProvider("llama3-70b-8192"),
        "gemini": GeminiProvider("gemini-pro")
    }

    router = LLMRouter(providers)
    tools = ToolRegistry.default()
    memory = RedisMemory(url="redis://localhost:6379")

    agent = Agent(router=router, tools=tools, memory=memory)

    result = await agent.execute(
        "Analyze the performance bottlenecks in our API and suggest fixes"
    )
    print(result)

# `await` only works inside a coroutine, so run everything through asyncio.run().
asyncio.run(main())

Cost Optimization

One of the biggest benefits of multi-model routing is cost control. Here is a practical routing strategy:

Task Type	Provider	Cost per 1M input tokens (USD)
Complex reasoning	OpenAI GPT-4	$30
Simple Q&A	Groq LLaMA 3	$0.59
Code generation	Ollama (local)	$0
Image analysis	Gemini Pro	$0.50

By routing 70% of requests to Groq/Ollama and only using GPT-4 for complex tasks, we reduced our monthly AI costs by 80%.
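The savings figure depends on how tokens, not just requests, are split across providers. The sketch below works through the arithmetic using the prices from the table. The 30/40/30 token split is an assumed example, not a measured distribution, and blended_cost and savings_vs_baseline are hypothetical helper names.

```python
# Prices per 1M tokens, taken from the table above (USD).
PRICE_PER_MILLION = {"openai": 30.00, "groq": 0.59, "ollama": 0.00}


def blended_cost(token_share: dict[str, float]) -> float:
    """Average cost per 1M tokens for a given share of tokens per provider."""
    total_share = sum(token_share.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(PRICE_PER_MILLION[name] * share for name, share in token_share.items())


def savings_vs_baseline(token_share: dict[str, float], baseline: str = "openai") -> float:
    """Fraction saved compared with sending every token to the baseline provider."""
    return 1.0 - blended_cost(token_share) / PRICE_PER_MILLION[baseline]


# Assumed split for illustration: 30% of tokens to GPT-4, 40% to Groq, 30% to Ollama.
mix = {"openai": 0.30, "groq": 0.40, "ollama": 0.30}

print(f"blended cost: ${blended_cost(mix):.3f} per 1M tokens")
print(f"savings: {savings_vs_baseline(mix):.1%}")
```

With this assumed split the blended cost comes to $9.236 per 1M tokens, a saving of about 69%. Reaching 80% requires GPT-4 to carry roughly 20% of the tokens or less, which can still be consistent with sending it 30% of requests if those requests are shorter than average.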

What I Learned

Building this framework taught me a few things:

Provider abstraction pays off fast. The moment one API has an outage, your system keeps running.
Latency varies wildly. Groq at 200ms vs OpenAI at 1-2s makes a real difference for interactive applications.
Local models are underrated. Ollama with LLaMA 3 handles 80% of tasks without any API calls.
Memory is the hard part. Deciding what to remember and what to forget matters more than which model you use.
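To make the remember-versus-forget trade-off concrete, here is a small in-process sketch that exposes the same two methods the agent calls, get_relevant() and store(). BoundedMemory, its keyword-overlap scoring, and its oldest-first eviction are assumptions chosen for brevity; they are not how RedisMemory is known to work.

```python
import asyncio
from collections import OrderedDict


class BoundedMemory:
    """Hypothetical memory: keeps the newest `capacity` entries, forgets the rest."""

    def __init__(self, capacity: int = 100, top_k: int = 3):
        self.capacity = capacity
        self.top_k = top_k
        self._entries = OrderedDict()  # task text -> stored result

    async def store(self, task: str, result: str) -> None:
        self._entries.pop(task, None)  # re-storing a task refreshes its recency
        self._entries[task] = result
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # forget the oldest entry

    async def get_relevant(self, task: str) -> list[str]:
        # Score past tasks by how many lowercase words they share with the query.
        query = set(task.lower().split())
        scored = []
        for past_task, result in self._entries.items():
            overlap = len(query & set(past_task.lower().split()))
            if overlap:
                scored.append((overlap, result))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [result for _, result in scored[: self.top_k]]


async def demo() -> list[str]:
    memory = BoundedMemory(capacity=2)
    await memory.store("profile the api", "found slow query")
    await memory.store("review the api docs", "docs are stale")
    await memory.store("fix login bug", "patched session check")
    return await memory.get_relevant("api latency")
```

With a capacity of two, the first entry has already been evicted by the time the query runs, so only the docs entry matches. A real system would likely replace the word overlap with embedding similarity and the eviction rule with something smarter than age.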

The full source code is available on GitHub: ai-agent-framework

If you are building AI agents or working with multiple LLM providers, I would love to hear about your approach. Drop a comment below or connect with me on GitHub.