<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dumebi Okolo</title>
    <description>The latest articles on Forem by Dumebi Okolo (@dumebii).</description>
    <link>https://forem.com/dumebii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F941720%2Ff316bf93-ef0b-4bc5-aee2-5e062255d5f0.jpg</url>
      <title>Forem: Dumebi Okolo</title>
      <link>https://forem.com/dumebii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dumebii"/>
    <language>en</language>
    <item>
      <title>Best AI Content Generator 2026 (How Ozigi Produces Human Content)</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Mon, 11 May 2026 12:30:11 +0000</pubDate>
      <link>https://forem.com/dumebii/best-ai-content-generator-2026-how-ozigi-produces-human-content-1a5b</link>
      <guid>https://forem.com/dumebii/best-ai-content-generator-2026-how-ozigi-produces-human-content-1a5b</guid>
      <description>&lt;p&gt;This article is an honest comparison of the top 5 AI content creation tools in 2026 for technical creators, plus Ozigi, the only one that blocks AI slop at the generation layer and publishes directly to X, LinkedIn, Discord, Slack, and email.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the Best AI Content Generator in 2026?
&lt;/h2&gt;

&lt;p&gt;Short answer: there is no single best tool. There are five mainstream options that each solve one part of the workflow well (&lt;a href="https://www.jasper.ai/" rel="noopener noreferrer"&gt;Jasper&lt;/a&gt; for brand voice, &lt;a href="https://www.copy.ai/" rel="noopener noreferrer"&gt;Copy.ai&lt;/a&gt; for sales workflows, &lt;a href="https://writesonic.com/" rel="noopener noreferrer"&gt;Writesonic&lt;/a&gt; for GEO tracking, &lt;a href="https://writer.com" rel="noopener noreferrer"&gt;Writer.com&lt;/a&gt; for enterprise governance, &lt;a href="https://buffer.com/ai-assistant" rel="noopener noreferrer"&gt;Buffer AI&lt;/a&gt; for multi-platform scheduling), and one emerging tool (&lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt;) that solves the gap they all leave open: producing AI-generated content that does not read as AI-generated content and publishing directly to every social surface and email in one workflow.&lt;/p&gt;

&lt;p&gt;This guide breaks down which tool wins for which use case, with verified pricing and feature data from 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Content Tools Stopped Working in 2026 (and What Changed)
&lt;/h2&gt;

&lt;p&gt;Two structural shifts changed the cost of bad AI content this year.&lt;/p&gt;

&lt;p&gt;The first is algorithmic. LinkedIn rolled out 360Brew, a 150-billion-parameter foundation model that reads posts the way an editor would and suppresses content that pattern matches to AI generation. AuthoredUp's reach study of over three million posts found that 98% of users saw a decline, with median impressions falling roughly 47% between mid-2024 and mid-2025. Google's helpful content systems applied the same logic to long-form writing. I wrote &lt;a href="https://blog.ozigi.app/blog/how-to-make-your-linkedin-content-standout-in-2026" rel="noopener noreferrer"&gt;an article&lt;/a&gt; that explains this in more detail. &lt;/p&gt;

&lt;p&gt;The second shift is user-side. "AI slop" was a word-of-the-year contender for 2024, and readers have learned the tells: "delve", "tapestry", "robust", "in today's fast-paced landscape", the bold-colon paragraph prefix, the "it's not X, it's Y" contrast structure. &lt;br&gt;
When a reader spots even one of these tells in your content, they lose trust in your brand and in the quality of your work.&lt;/p&gt;

&lt;p&gt;That means the tool you pick to generate content is now a distribution decision, not a productivity nice-to-have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Jasper AI Worth It in 2026?
&lt;/h2&gt;

&lt;p&gt;Jasper is the incumbent and still the default pick for marketing teams of 5+ writers who need brand voice consistency at scale. Pricing starts at $49/seat/month, $69 for Pro with image generation and multiple brand voices, and Business is custom-quoted. There is no free plan, only a 7-day trial.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Jasper AI If:
&lt;/h4&gt;

&lt;p&gt;You run a marketing team with multiple writers producing branded content daily, you already pay for an SEO tool or want the native Surfer SEO integration, and you can absorb 49 dollars per seat per month minimum.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do Not Use Jasper AI If:
&lt;/h4&gt;

&lt;p&gt;You are a solo creator, a technical founder, or anyone who needs direct social publishing. Jasper has no native publishing layer. You generate in Jasper, then paste into Buffer or Hootsuite separately. The output still requires aggressive editing to strip standard AI vocabulary like "delve" and "robust."&lt;/p&gt;

&lt;p&gt;Case studies from Bloomreach (113% blog output increase) and WalkMe (3,000+ hours saved) speak to genuine team-level leverage when the workflow is right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Copy.ai Still Good for Content in 2026?
&lt;/h2&gt;

&lt;p&gt;The honest answer: not for content quality. Copy.ai still has the original 90+ template library and a real free tier with 2,000 words per month, but the company's roadmap has shifted toward go-to-market workflow automation. HubSpot and Salesforce integrations, sales sequence generation, and a workflow builder on the 249-dollar-per-month Advanced plan are now the primary investments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Copy.ai If:
&lt;/h4&gt;

&lt;p&gt;You run a sales team and want AI to power outreach sequences, CRM workflows, and repetitive task automation more than thought leadership.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do Not Use Copy.ai If:
&lt;/h4&gt;

&lt;p&gt;Content quality is your primary need. Independent reviewers in 2026 have flagged that Copy.ai's content quality investments stalled while engineering moved to GTM workflows. Brand voice on Pro is less refined than Jasper's. No image generation. No social media publishing. The output reads competently but defaults to corporate cadence that LinkedIn's 360Brew model flags.&lt;/p&gt;

&lt;p&gt;The free plan is genuinely useful for validation. The Pro plan at 49 dollars per month gives unlimited words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Writesonic the Best Cheap AI Writing Tool?
&lt;/h2&gt;

&lt;p&gt;For raw price-to-feature ratio, yes. Writesonic starts at 16 dollars per month for Standard and 79 for Professional, and the 2026 product makes an explicit bet on Generative Engine Optimization (GEO). It tracks how your brand appears across ChatGPT, Gemini, Perplexity, Claude, Microsoft Copilot, and 10+ other AI search platforms, then connects that visibility data back into a content creation workflow.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Writesonic If:
&lt;/h4&gt;

&lt;p&gt;You are a solo operator or a small team optimizing for AI search visibility, you want Chatsonic with live web browsing and Photosonic image generation in-platform, and you are willing to edit heavily.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do Not Use Writesonic If:
&lt;/h4&gt;

&lt;p&gt;Writing quality is non-negotiable. The output sounds the most "AI default" of the five tools here without significant prompt discipline. No native social publishing. Brand voice training is shallower than Jasper or Writer. The credit system creates usage anxiety on the lower tiers.&lt;/p&gt;

&lt;p&gt;The 25% increase in AI-driven traffic case study for Viscaweb is one of the more credible numbers in the GEO category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Writer.com Better Than Jasper for Enterprise?
&lt;/h2&gt;

&lt;p&gt;For governance and compliance, yes. Writer is API-first, built around the proprietary Palmyra model family, and ships with 100+ prebuilt agents, a Knowledge Graph, and SOC 2 Type II compliance. Team plans start around 18 dollars per user per month, but real Enterprise deployments are quoted at 89 to 129 dollars per month per user and up, with custom pricing for serious governance requirements.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Writer.com If:
&lt;/h4&gt;

&lt;p&gt;You are in finance, healthcare, legal, or any regulated industry where AI-generated content has to pass legal and compliance review before it ships. If your CTO or CISO is involved in AI procurement, Writer wins on the spec sheet.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do Not Use Writer.com If:
&lt;/h4&gt;

&lt;p&gt;You are an individual creator or small team. The customization process is technical and time-consuming. No social publishing layer. Output is brand-safe but tends toward formal corporate prose that reads as AI to a discerning audience. Pricing is opaque above the Team tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Buffer AI Assistant Replace a Content Generator?
&lt;/h2&gt;

&lt;p&gt;For caption variations on social posts, yes. For real content creation, no. Buffer's AI Assistant is free on every plan, uses GPT-4 under the hood, and can generate post ideas, repurpose long-form content into social posts, adjust tone, and translate content. Per-channel pricing starts at 5 dollars per month annually.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use Buffer AI If:
&lt;/h4&gt;

&lt;p&gt;You are a solopreneur or small team that already needs a scheduler and wants a free generator for caption variations. Direct publishing to 11 platforms (Facebook, Instagram, LinkedIn, Pinterest, Threads, TikTok, X, YouTube, Bluesky, Google Business Profile, Mastodon) is the strongest publishing surface in this comparison.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do Not Use Buffer AI If:
&lt;/h4&gt;

&lt;p&gt;You need real content generation. The AI Assistant produces what every honest review calls first-draft output. Skews formal and generic. Lacks brand voice training. Needs 5 to 10 minutes of human refinement per post to be ready to ship. No persona system, no banned vocabulary enforcement, no awareness of the 360Brew era of LinkedIn content. The AI is a feature, not the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are All Five Tools Missing?
&lt;/h2&gt;

&lt;p&gt;Step back and a pattern emerges. Each tool solves one slice of the workflow well and leaves the rest of the chain for you to bridge.&lt;/p&gt;

&lt;p&gt;Jasper handles brand voice but you publish elsewhere. Copy.ai handles sales workflows but writing quality plateaued. Writesonic handles GEO tracking but output is generic. Writer handles enterprise governance but pricing is hostile to individuals. Buffer handles publishing but the AI is an afterthought.&lt;/p&gt;

&lt;p&gt;None of them, and this is the honest assessment, treat AI slop as an engineering problem to be solved at the generation layer. They all treat it as a user problem to be edited around. That is the gap Ozigi was built for.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Make AI Content Sound Human: The Ozigi Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; is the emerging context engine built for the exact problem the five tools above leave open. It is positioned for technical creators, founders, and DevRel teams who have real things to say and find that every AI writing tool strips out the specificity, voice, and credibility that make content worth reading.&lt;/p&gt;

&lt;p&gt;The mental model is different from the start. The five tools above are writing assistants. Ozigi is a context engine. You drop in a raw signal (a URL, scattered notes, a PDF, an image, a podcast transcript, or a course deck), and Ozigi returns a structured multi-platform campaign in your voice, ready to publish directly. &lt;br&gt;
The output does not open with "in today's fast-paced landscape," and it does not use "delve," "tapestry," or "robust" because those words are blocked at the API route level during generation, not filtered after the fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does Ozigi Block AI Slop?
&lt;/h2&gt;

&lt;p&gt;This is the single feature that no other tool in the comparison ships. Ozigi maintains a structured &lt;a href="https://ozigi.app/docs/the-banned-lexicon" rel="noopener noreferrer"&gt;banned lexicon&lt;/a&gt; across six categories: vocabulary tells (delve, tapestry, robust, crucial), corporate fluff (cutting-edge, game-changer, thought leadership), AI tells (at its core, plays a significant role, in today's fast-paced), Gemini affirmation tells (Certainly!, Here is, Let's explore), engagement-bait closers (Tag someone who needs this), and structural patterns (the bold-colon paragraph prefix, the "it's not X, it's Y" contrast).&lt;/p&gt;

&lt;p&gt;The lexicon lives both inside the system prompt and inside the code path as a two-layer validator. Every generation is scanned against the structured arrays, and if a slop pattern leaks through, a bounded repair retry fires automatically. The team has &lt;a href="https://blog.ozigi.app/blog/stopping-ai-slop-in-production-banned-lexicon-validator" rel="noopener noreferrer"&gt;published the full implementation&lt;/a&gt; as a TypeScript file and writes openly about the latency tradeoffs (worst case is roughly 2x baseline, average is unchanged).&lt;/p&gt;

&lt;p&gt;This is the engineering answer to the prompt-engineering ceiling. Soft instructions get you to roughly 80% slop-free output. Production reputation lives in the remaining 20%. Ozigi closes that gap with code, not pleading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6htaiwjlexwzekpfgkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6htaiwjlexwzekpfgkh.png" alt="how does ozigi stop AI slop in content" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do Personas Work in Ozigi (and Why It Beats Brand Voice)?
&lt;/h2&gt;

&lt;p&gt;Most tools let you set a tone slider or train a brand voice from samples. Ozigi treats this differently. You define a &lt;a href="https://ozigi.app/docs/system-personas" rel="noopener noreferrer"&gt;system persona&lt;/a&gt; once (identity, origin, beliefs, tone, pacing, banned phrases, things you would never say, things you always say) and Ozigi applies that persona to every campaign forever.&lt;/p&gt;

&lt;p&gt;There are 14 pre-built personas covering both technical and non-technical creators: Battle-Tested Engineer, DevRel Champion, Technical Founder, Brand and Marketing Manager, Career Coach, and more. Each produces meaningfully different output. The pragmatic Staff Engineer persona writes nothing like the Career Coach persona, because the persona is a character spec, not a tone preset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can AI Content Tools Use My GitHub Repos for Context?
&lt;/h2&gt;

&lt;p&gt;Only Ozigi does this. Connect your GitHub account once through Composio (Ozigi never sees your token directly), and on every campaign generation and every Copilot conversation, Ozigi silently pulls your three most recently active repositories into the generation context.&lt;/p&gt;

&lt;p&gt;This keeps the output from being generic. Instead of "just shipped a new feature", you get "just pushed a fix to OziGi where rate limiting now handles bursts without dropping legitimate traffic". &lt;br&gt;
The model has your actual project names, descriptions, and recent activity, so the content is grounded in what you built rather than padded with filler.&lt;/p&gt;

&lt;p&gt;This is the feature that matters specifically for technical creators and ships in none of Jasper, Copy.ai, Writesonic, Writer, or Buffer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI Content Tool Publishes Directly to X, LinkedIn, Discord, Slack, and Email?
&lt;/h2&gt;

&lt;p&gt;Only Ozigi covers all five surfaces. Buffer covers X, LinkedIn, and other socials but not Discord, Slack, or email newsletters. The other four cover none of them and force you to copy-paste into separate publishing tools.&lt;/p&gt;

&lt;p&gt;Ozigi ships content directly from the dashboard. LinkedIn and X use built-in OAuth so you sign in once. Discord and Slack use webhooks you configure in Settings. For X, you receive an email with a one-click post intent link. Email newsletters are managed inside the dashboard with subscriber lists (manual entry, CSV upload, or import), validated sending, and scheduled delivery.&lt;/p&gt;
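
&lt;p&gt;The webhook side of this is standard plumbing. If you have never wired one up, the sketch below shows the generic shape of a webhook delivery to Discord and Slack; the URLs are placeholders for the webhooks you create in each workspace, and Ozigi's actual payloads may carry more fields.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Generic webhook delivery. URLs below are placeholders for the
// webhooks you create in Discord and Slack; Ozigi's payloads may differ.
const DISCORD_WEBHOOK = "https://discord.com/api/webhooks/&lt;id&gt;/&lt;token&gt;";
const SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX";

// Discord webhooks accept a JSON body with a "content" field.
await fetch(DISCORD_WEBHOOK, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ content: "New post is live: https://example.com" }),
});

// Slack incoming webhooks accept a JSON body with a "text" field.
await fetch(SLACK_WEBHOOK, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "New post is live: https://example.com" }),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;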

&lt;p&gt;This is the workflow Jasper, Copy.ai, Writesonic, and Writer all force you to bridge manually. Ozigi closes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kinds of Content Can You Create on Ozigi?
&lt;/h2&gt;

&lt;p&gt;Most tools specialize. Ozigi covers the practitioner's full stack across four content types.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Social media posts&lt;/strong&gt; for X (single or thread), LinkedIn, Discord, and Slack, formatted natively for each platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email newsletters&lt;/strong&gt; sent to your managed subscriber list with sender configuration and scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form content&lt;/strong&gt;, including the kind of practitioner writing that Ozigi's own blog hosts (1,000 to 3,000 words, frameworks-and-lessons format, no fluff).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-intent technical briefs&lt;/strong&gt;, the format DevRel teams and engineering founders ship to position products, document decisions, and convert technical buyers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unifying thread is the &lt;a href="https://ozigi.app/docs/human-in-the-loop" rel="noopener noreferrer"&gt;90/10 rule&lt;/a&gt;. Ozigi handles the 90% (extraction, structure, platform formatting, lexicon enforcement, persona application). You own the 10% (the insider detail, the contrarian take, and the judgment call only you can make). Every campaign ships with an edit button. Nothing publishes without your review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jasper vs Copy.ai vs Writesonic vs Writer vs Buffer vs Ozigi: Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Jasper&lt;/th&gt;
&lt;th&gt;Copy.ai&lt;/th&gt;
&lt;th&gt;Writesonic&lt;/th&gt;
&lt;th&gt;Writer.com&lt;/th&gt;
&lt;th&gt;Buffer AI&lt;/th&gt;
&lt;th&gt;Ozigi&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free plan&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI slop blocked at API layer&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persona as character spec&lt;/td&gt;
&lt;td&gt;Brand voice&lt;/td&gt;
&lt;td&gt;Brand voice (limited)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Brand guardrails&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (14 prebuilt)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub context grounding&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct publish to X&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct publish to LinkedIn&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct publish to Discord&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct publish to Slack&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email newsletter delivery&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form content&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical briefs&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built for technical creators&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source codebase&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting price (USD, monthly)&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa9xdo8rk4bpiu4upe7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa9xdo8rk4bpiu4upe7x.png" alt="ai generators comparison chart" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the Best AI Content Tool for Developers and Technical Writers?
&lt;/h2&gt;

&lt;p&gt;Ozigi, specifically. The reasoning is concrete.&lt;/p&gt;

&lt;p&gt;GitHub context grounding means the output references your actual repos, commits, and project names instead of generic placeholder language. The 14 prebuilt personas include Battle-Tested Engineer, DevRel Champion, and Technical Founder, which produce meaningfully different output from a generic "professional tone" preset. The banned lexicon strips the corporate vocabulary that makes developer-facing content read as marketing. The direct publishing to Discord and Slack covers the channels where technical communities actually live, which Jasper, Copy.ai, Writesonic, Writer, and Buffer all ignore.&lt;/p&gt;

&lt;p&gt;The codebase is open source on GitHub at &lt;a href="https://github.com/Ozigi-app/OziGi" rel="noopener noreferrer"&gt;Ozigi-app/OziGi&lt;/a&gt;. The stack is Next.js 15, Supabase, Gemini 3 Pro for generation, and Playwright for end-to-end testing. The banned lexicon implementation lives in &lt;code&gt;lib/prompts/anti-ai.ts&lt;/code&gt; with a dev-mode drift guard that fails CI if a term gets added to the structured arrays but not the prose rulebook. PostHog telemetry logs three properties on every generation (&lt;code&gt;lexiconViolations&lt;/code&gt;, &lt;code&gt;lexiconSlopScore&lt;/code&gt;, &lt;code&gt;lexiconRetried&lt;/code&gt;) so the lexicon grows from production data instead of guesswork.&lt;/p&gt;

&lt;p&gt;If you ship LLM output to end users yourself, the minimum viable version of this layer is four files: &lt;code&gt;anti-ai.ts&lt;/code&gt;, a code-side scanner, a bounded retry handler, and a telemetry hook. The full implementation is readable, forkable, and shipping in production.&lt;/p&gt;
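
&lt;p&gt;As a rough illustration, here is what that layer can look like collapsed into a single file. Every name below is illustrative rather than Ozigi's actual API; the authoritative version is the published &lt;code&gt;anti-ai.ts&lt;/code&gt; and its companion validator.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A compressed sketch of the four-file layer, collapsed into one file for
// readability. Every name is illustrative; the authoritative version is
// Ozigi's published anti-ai.ts and companion validator.
const BANNED = ["delve", "tapestry", "robust", "in today's fast-paced"];

// Code-side scanner (the first layer is the system prompt itself).
function scanDraft(draft: string): string[] {
  const lower = draft.toLowerCase();
  return BANNED.filter((term) =&gt; lower.includes(term));
}

type Generate = (prompt: string) =&gt; Promise&lt;string&gt;;
type Track = (event: string, props: Record&lt;string, unknown&gt;) =&gt; void;

// Bounded retry handler: at most one repair pass, so worst-case latency is
// roughly 2x a single generation while average latency stays flat.
export async function generateClean(
  prompt: string,
  generate: Generate,
  track: Track
): Promise&lt;string&gt; {
  let draft = await generate(prompt);
  let violations = scanDraft(draft);
  const retried = violations.length &gt; 0;
  if (retried) {
    draft = await generate(
      `${prompt}\n\nRewrite the draft without these terms: ${violations.join(", ")}`
    );
    violations = scanDraft(draft);
  }
  // Telemetry hook: mirrors the three properties logged per generation.
  track("generation", {
    lexiconViolations: violations.length,
    lexiconSlopScore: violations.length / BANNED.length, // illustrative score
    lexiconRetried: retried,
  });
  return draft;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;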

&lt;h2&gt;
  
  
  How Much Does Ozigi Cost?
&lt;/h2&gt;

&lt;p&gt;There is a free tier with no credit card required to try. The unauthenticated path lets you generate a campaign without signing up at all. Premium features (history, persona library, Discord integration) are gated behind paid tiers. Pricing is published on the &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi site&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By comparison: Jasper is 49 to 69+ dollars per seat per month with no free plan. Copy.ai is 0 to 249 dollars per month. Writesonic is 0 to 79+ dollars per month. Writer.com is 18 to 129+ dollars per user per month, custom for enterprise. Buffer is 0 to 10+ dollars per channel per month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AI Content Generator Should You Pick?
&lt;/h2&gt;

&lt;p&gt;Match the tool to the use case.&lt;/p&gt;

&lt;p&gt;If you write generic B2B SaaS marketing copy for a Fortune 500 with a 12-stakeholder review chain, Writer is still the right pick. If you run cold outbound for a sales team, Copy.ai is still the right pick. If you need to schedule 50 channels across 12 brands, Buffer is still the right pick. If you produce long-form SEO articles for a marketing team with a Surfer subscription, Jasper is still the right pick. If you optimize for AI search visibility on a tight budget, Writesonic is still the right pick.&lt;/p&gt;

&lt;p&gt;If you are a technical creator, founder, DevRel professional, or anyone whose LinkedIn reach dropped in the back half of 2025 and who suspects 360Brew is flagging their AI-generated output, Ozigi is the only tool in this comparison engineered specifically for that audience.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Test Ozigi Against Your Current Tool This Week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;ozigi.app&lt;/a&gt;, drop in a URL of your latest dev.to post, and generate a campaign without signing up. The unauthenticated path is real.&lt;/li&gt;
&lt;li&gt;Compare the output side-by-side with what Jasper or Copy.ai would produce from the same input. Look specifically for the banned vocabulary (delve, robust, seamlessly, in today's fast-paced). Count occurrences in each.&lt;/li&gt;
&lt;li&gt;If you publish on LinkedIn, post both versions across two weeks and watch the reach data. The 360Brew penalty for AI vocabulary is now measurable in your own analytics.&lt;/li&gt;
&lt;li&gt;If you build in public, connect your GitHub and regenerate. Compare how the output references your actual repos versus generic placeholder language.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tool you use to generate content is now part of your distribution stack. Pick the one that treats that responsibility as an engineering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the best AI content generator in 2026?&lt;/strong&gt;&lt;br&gt;
There is no single best tool. Jasper wins for marketing teams that need brand voice consistency. Copy.ai wins for sales workflows. Writesonic wins for GEO tracking on a budget. Writer.com wins for enterprise governance. Buffer wins for multi-platform scheduling. Ozigi wins for technical creators who need AI-generated content that does not read as AI-generated content and publishes directly to X, LinkedIn, Discord, Slack, and email in one workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I make AI writing sound human?&lt;/strong&gt;&lt;br&gt;
Three approaches. First, pick a tool that enforces a banned vocabulary at the generation layer instead of relying on prompts alone (currently only Ozigi). Second, define a persona with specific character traits, not just a tone preset. Third, edit the output to add the 10% that only you can write: insider details, contrarian takes, and personal stories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Jasper AI worth 49 dollars a month in 2026?&lt;/strong&gt;&lt;br&gt;
For marketing teams of 5+ writers producing daily branded content, yes. For solo creators or technical founders, no. There are cheaper options with the same or better output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the cheapest AI content writing tool?&lt;/strong&gt;&lt;br&gt;
Writesonic at 16 dollars per month for Standard, or Copy.ai's free plan with 2,000 words per month, or ChatGPT Plus at 20 dollars per month. Ozigi has a free tier with no credit card required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which AI tool publishes directly to LinkedIn?&lt;/strong&gt;&lt;br&gt;
Buffer (as part of its scheduler) and Ozigi (as a built-in feature with OAuth authentication). Jasper, Copy.ai, Writesonic, and Writer.com all require you to copy-paste into a separate publishing tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Ozigi free?&lt;/strong&gt;&lt;br&gt;
Yes, there is a free tier with no credit card required to try. The unauthenticated path lets you generate a campaign without signing up at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the Ozigi codebase open source?&lt;/strong&gt;&lt;br&gt;
Yes, on GitHub at &lt;a href="https://github.com/Ozigi-app/OziGi" rel="noopener noreferrer"&gt;Ozigi-app/OziGi&lt;/a&gt;. The team actively welcomes contributions, including vibe-coded ones, and has open issues tagged for the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Ozigi compare to ChatGPT for content?&lt;/strong&gt;&lt;br&gt;
ChatGPT is a general-purpose chat interface. Ozigi is a context engine with structured banned lexicon enforcement, persona system, GitHub grounding, and direct publishing. ChatGPT will produce competent content if you bring detailed prompts and edit heavily. Ozigi closes that gap as a product feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The five established tools in the GenAI content creation space each solve one part of the problem and leave the rest to you. Jasper owns brand voice for teams. Copy.ai owns GTM workflows. Writesonic owns GEO tracking. Writer owns enterprise governance. Buffer owns multi-platform scheduling.&lt;/p&gt;

&lt;p&gt;Ozigi is the one engineered around the problem they all leave open: producing AI-generated content that does not sound like AI-generated content, grounded in your actual work, ready to publish across every surface a technical creator cares about. The banned lexicon at the API layer, the persona system, the GitHub context grounding, and the direct publishing to X, LinkedIn, Discord, Slack, and email together form a workflow that exists nowhere else in the category.&lt;/p&gt;

&lt;p&gt;If the next 18 months of search rewards content that reads as genuinely human, the tool you use to generate it has to be built for that constraint from the architecture up. That is the bet Ozigi is making, and it is the reason the practitioner end of the market is paying attention.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was generated on &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt;. The raw notes, comparison research, and competitor data were dropped into the context engine, run through the Technical Founder persona, scanned by the banned lexicon validator, and published from the dashboard. If anything in here reads like a human wrote it, that is the point.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>showdev</category>
    </item>
    <item>
      <title>What Is an MCP Gateway — and Why Do Enterprise AI Teams Need One in 2026?</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Thu, 07 May 2026 14:29:51 +0000</pubDate>
      <link>https://forem.com/composiodev/what-is-an-mcp-gateway-and-why-do-enterprise-ai-teams-need-one-in-2026-1lie</link>
      <guid>https://forem.com/composiodev/what-is-an-mcp-gateway-and-why-do-enterprise-ai-teams-need-one-in-2026-1lie</guid>
      <description>&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/docs/learn/architecture" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; was released by Anthropic in November 2024. Eighteen months later, it had 97 million monthly SDK downloads as of December 2025, backing from every major AI lab, and is now governed as a founding project of the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation.&lt;/p&gt;

&lt;p&gt;That adoption happened fast, even faster than most protocols manage. But it created an immediate problem: connecting AI agents directly to dozens of MCP servers at scale is operationally unsustainable, and the protocol itself does not solve governance.&lt;/p&gt;

&lt;p&gt;This article explains what an MCP Gateway is, what it does at the infrastructure level, and how to evaluate one for a production enterprise environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is MCP, and What Problem Does It Solve?
&lt;/h2&gt;

&lt;p&gt;Before understanding the gateway, you need to understand what MCP standardizes.&lt;/p&gt;

&lt;p&gt;Enterprise AI teams historically faced what is called the &lt;a href="https://composio.dev/content/mcp-gateways-guide" rel="noopener noreferrer"&gt;N×M integration problem&lt;/a&gt;: connecting N agents to M tools requires N×M custom integrations, each with its own authentication flow, error-handling logic, and credential store. Ten agents and twenty tools means 200 bespoke integrations; with MCP, the same estate needs roughly 30 protocol-compliant connections (N+M). Without MCP, integration complexity rises quadratically as AI agents spread through an organization; with MCP, it scales linearly.&lt;/p&gt;

&lt;p&gt;MCP defines a standardized way for AI models to discover and invoke external tools using &lt;a href="https://www.jsonrpc.org/specification" rel="noopener noreferrer"&gt;JSON-RPC 2.0&lt;/a&gt; over HTTP. An agent sends a &lt;code&gt;tools/list&lt;/code&gt; request to understand what a server exposes, then uses &lt;code&gt;tools/call&lt;/code&gt; to invoke those tools. That handshake is consistent regardless of whether the backend is GitHub, Salesforce, Postgres, or an internal API.&lt;/p&gt;
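
&lt;p&gt;For a concrete sense of the wire format, here is a minimal sketch of that handshake in TypeScript. The server URL and tool name are placeholders, and production servers may require additional MCP session headers; only the JSON-RPC message shape is the point here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of the MCP handshake as JSON-RPC 2.0 over HTTP.
// https://mcp.example.com and github.list_issues are placeholders.
const MCP_URL = "https://mcp.example.com/mcp";

async function rpc(method: string, params: object, id: number) {
  const res = await fetch(MCP_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id, method, params }),
  });
  return res.json();
}

// 1. Discovery: ask the server what it exposes.
const { result } = await rpc("tools/list", {}, 1);
console.log(result.tools.map((t: { name: string }) =&gt; t.name));

// 2. Invocation: call one of the discovered tools by name.
const reply = await rpc("tools/call", {
  name: "github.list_issues",
  arguments: { repo: "octo/demo" },
}, 2);
console.log(reply.result);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;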

&lt;p&gt;What MCP does not define is who can call what, under whose identity, with what constraints, and at what cost. Those are governance problems, and they fall outside the protocol specification by design.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an MCP Gateway?
&lt;/h2&gt;

&lt;p&gt;An MCP Gateway is a centralized infrastructure layer that sits between AI agents and one or more MCP servers. It acts as a &lt;a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/" rel="noopener noreferrer"&gt;specialized reverse proxy&lt;/a&gt; purpose-built for MCP traffic: handling authentication, routing, policy enforcement, credential management, and observability in one place.&lt;/p&gt;

&lt;p&gt;From the agent's perspective, nothing changes. It still performs a &lt;code&gt;tools/list&lt;/code&gt; handshake and issues &lt;code&gt;tools/call&lt;/code&gt; requests. The difference is that those requests are now intercepted, evaluated against policies, and routed by the gateway before any backend system executes them.&lt;/p&gt;

&lt;p&gt;Architecturally, the shift looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without a gateway:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A → GitHub MCP Server
Agent A → Slack MCP Server
Agent B → GitHub MCP Server
Agent B → Postgres MCP Server
Agent C → Salesforce MCP Server
... (N×M connections, each managing its own auth and credentials)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With a gateway:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A ──┐
Agent B ──┤──→ [MCP Gateway] ──→ GitHub MCP Server
Agent C ──┘                  ──→ Slack MCP Server
                             ──→ Postgres MCP Server
                             ──→ Salesforce MCP Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway becomes the single chokepoint where security policy, access control, and observability can be enforced consistently. As one &lt;a href="https://news.ycombinator.com/item?id=46136222" rel="noopener noreferrer"&gt;Hacker News discussion on MCP gateways&lt;/a&gt; noted, practitioners want features like central MCP registries, OAuth integration, and curated toolset scoping: all things that make MCP viable at organizational scale, not just in a prototype.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does MCP Alone Fall Short in Enterprise Environments?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Credential Sprawl
&lt;/h3&gt;

&lt;p&gt;Without a gateway, each agent carries its own API keys, OAuth tokens, and service account credentials for every tool it accesses. Those credentials end up in environment variables, config files, and secret stores scattered across services. This is not a theoretical risk: GitGuardian's research found 24,008 unique secrets exposed in MCP configuration files in 2025 alone, with Google API keys and PostgreSQL connection strings among the most common leaked types. Rotating credentials becomes a manual exercise across multiple codebases. Revoking access for a compromised agent requires hunting down every integration it touches. There is no single point of revocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Centralized Access Control
&lt;/h3&gt;

&lt;p&gt;MCP does not define native role-based access control. If an agent can connect to a server, it can discover every tool that server exposes. A finance agent can see development tools. A support agent can see database administration endpoints. Principle of least privilege has to be implemented outside the protocol, in every agent individually, or not at all. As engineers in the &lt;a href="https://news.ycombinator.com/item?id=45723699" rel="noopener noreferrer"&gt;MCP-Scanner Hacker News thread&lt;/a&gt; observed, people are over-provisioning MCPs the way they install apps on a phone, without applying least-privilege access. &lt;/p&gt;

&lt;p&gt;Least-privilege access is the principle that an agent should only be able to see and invoke the specific tools it needs for its defined task, and nothing beyond that. In an MCP context, this means a support agent should have no visibility into deployment tools, and a read-only analytics agent should have no access to write operations, regardless of what the underlying server exposes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Black Holes
&lt;/h3&gt;

&lt;p&gt;When agents connect directly to tools, there is no aggregated view of what any agent is actually doing. Debugging a multi-step workflow requires stitching together logs from N different servers. There is no unified execution timeline, no trace correlation, no cost attribution. Anomalies go undetected because there is no baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Cost Governance
&lt;/h3&gt;

&lt;p&gt;MCP does not track token consumption or enforce usage limits. An agent can invoke tools repeatedly, triggering LLM calls and paid API operations, with no budget ceiling. At enterprise scale, this becomes a financial control problem, not just a technical one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Attack Surface
&lt;/h3&gt;

&lt;p&gt;In April 2025, security researchers &lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;published an analysis&lt;/a&gt; identifying multiple outstanding MCP security issues, including prompt injection, tool permissions that allow combining tools to exfiltrate data, and lookalike tools that can silently replace trusted ones. A centralized gateway is the practical enforcement point for mitigating all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does an MCP Gateway Actually Do?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Centralized Authentication and Identity Propagation
&lt;/h3&gt;

&lt;p&gt;A production gateway validates incoming identity (typically via JWT, OAuth 2.0 with PKCE, or OIDC) and propagates that identity downstream to MCP servers. Instead of agents running under shared service accounts, requests execute on behalf of specific authenticated users.&lt;/p&gt;

&lt;p&gt;This closes a real vulnerability. If a user cannot delete a repository, neither can the agent acting for them. Authorization is enforced at the protocol layer, not assumed in prompts. The MCP specification introduced OAuth 2.1 support in the March 2025 revision, with significant refinements in June 2025, but implementation quality varies between gateways. Some handle enterprise SSO automatically; others require manual configuration per server.&lt;/p&gt;
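
&lt;p&gt;A minimal sketch of the validation-and-propagation step, assuming the &lt;code&gt;jose&lt;/code&gt; library and illustrative header names (real gateways differ in how they carry identity downstream):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of gateway-side identity validation and propagation using the
// jose library. The IdP URL and forwarded header names are illustrative.
import { createRemoteJWKSet, jwtVerify } from "jose";

const jwks = createRemoteJWKSet(
  new URL("https://idp.example.com/.well-known/jwks.json")
);

export async function authHeaders(req: Request): Promise&lt;Headers&gt; {
  const token = req.headers.get("authorization")?.replace("Bearer ", "");
  if (!token) throw new Error("missing credential");
  const { payload } = await jwtVerify(token, jwks, {
    issuer: "https://idp.example.com",
  });
  // Propagate the authenticated user's identity downstream instead of a
  // shared service account, so upstream servers enforce that user's rights.
  const downstream = new Headers();
  downstream.set("x-forwarded-user", String(payload.sub));
  downstream.set("x-forwarded-roles", String(payload.roles ?? ""));
  return downstream;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;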

&lt;h3&gt;
  
  
  Tool-Level RBAC
&lt;/h3&gt;

&lt;p&gt;The gateway intercepts &lt;code&gt;tools/list&lt;/code&gt; responses and filters them based on the requesting agent's role and permissions. Sensitive tools simply do not appear in the agent's context. A configuration like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;support-scope&lt;/span&gt;
  &lt;span class="na"&gt;allow_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;github.list_issues&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;github.get_comments&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crm.update_ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...means the agent calling this endpoint never sees database administration tools, deployment controls, or any capability it has no business using. This directly improves model performance (agents reason more accurately when the action space is deliberately constrained) and reduces blast radius when something goes wrong.&lt;/p&gt;
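
&lt;p&gt;A sketch of the enforcement side, under the assumption of an in-memory scope table (names are illustrative, not any specific gateway's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative enforcement of the support-scope allowlist above: filter
// the tools/list response before the agent ever sees it.
type Tool = { name: string; description?: string };

const SCOPES: Record&lt;string, Set&lt;string&gt;&gt; = {
  "support-scope": new Set([
    "github.list_issues",
    "github.get_comments",
    "crm.update_ticket",
  ]),
};

export function filterTools(scope: string, tools: Tool[]): Tool[] {
  const allowed = SCOPES[scope];
  // Default deny: an unknown scope sees nothing.
  if (!allowed) return [];
  // Tools outside the scope are not rejected at call time; they simply
  // never appear in the agent's context at discovery time.
  return tools.filter((t) =&gt; allowed.has(t.name));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;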

&lt;h3&gt;
  
  
  Intelligent Routing
&lt;/h3&gt;

&lt;p&gt;The gateway examines each request and routes it to the appropriate upstream MCP server based on the tool being called. Session affinity keeps stateful, multi-step agent conversations on the same backend server. Load balancing distributes traffic. Circuit breakers prevent cascading failures when an upstream tool degrades.&lt;/p&gt;
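
&lt;p&gt;A simplified sketch of prefix-based routing with session affinity; the upstream URLs, the naming convention, and the in-memory affinity map are all illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of prefix-based routing with session affinity. Upstream URLs,
// the prefix convention, and the in-memory session map are illustrative.
const UPSTREAMS: Record&lt;string, string[]&gt; = {
  github: ["https://github-mcp-1.internal", "https://github-mcp-2.internal"],
  slack: ["https://slack-mcp-1.internal"],
};

const affinity = new Map&lt;string, string&gt;(); // "session:prefix" -&gt; replica

export function route(sessionId: string, toolName: string): string {
  const prefix = toolName.split(".")[0];
  const replicas = UPSTREAMS[prefix];
  if (!replicas) throw new Error(`no upstream for ${toolName}`);
  const key = `${sessionId}:${prefix}`;
  // Keep a stateful multi-step workflow pinned to the replica it started on.
  let chosen = affinity.get(key);
  if (!chosen) {
    chosen = replicas[Math.floor(Math.random() * replicas.length)];
    affinity.set(key, chosen);
  }
  return chosen; // a production gateway adds circuit breaking here
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;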

&lt;h3&gt;
  
  
  Unified Observability
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; invocation is logged with metadata: agent identity, user context, tool arguments, response status, and latency. This creates a coherent audit trail across all connected systems. Metrics export in Prometheus format. Traces follow the &lt;a href="https://opentelemetry.io/docs/what-is-opentelemetry/" rel="noopener noreferrer"&gt;OpenTelemetry standard&lt;/a&gt; for distributed tracing, which matters when debugging multi-step agent tasks that touch six different tools.&lt;/p&gt;
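
&lt;p&gt;The record shape matters more than the transport. A hypothetical per-invocation audit record might look like this (field names are assumptions, not a standard schema); note the arguments stored as a digest rather than a raw payload, which is what makes zero data retention possible:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative shape of the per-invocation audit record a gateway emits;
// field names are assumptions, not a standard schema.
interface ToolCallRecord {
  timestamp: string;                  // ISO 8601
  agentId: string;                    // which agent issued the call
  userId: string;                     // the human identity it acted for
  method: "tools/list" | "tools/call";
  toolName?: string;                  // present for tools/call
  argsDigest: string;                 // hash of arguments, not raw payload
  status: "ok" | "error";
  latencyMs: number;
  traceId: string;                    // joins OpenTelemetry spans across tools
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;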

&lt;h3&gt;
  
  
  Cost Management
&lt;/h3&gt;

&lt;p&gt;The gateway can implement caching for repeated tool calls, enforce per-agent or per-user rate limits, and surface usage analytics. Caching alone can meaningfully reduce LLM and paid API costs, and the gateway is the practical place to implement it at scale.&lt;/p&gt;
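
&lt;p&gt;A minimal sketch of the caching half, assuming idempotent tool calls and an in-memory store; the TTL and hashing scheme are illustrative choices:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of a gateway-side cache for repeated, idempotent tool calls.
// The TTL and hashing scheme are illustrative choices.
import { createHash } from "node:crypto";

const CACHE_TTL_MS = 60_000;
const cache = new Map&lt;string, { value: unknown; expires: number }&gt;();

function keyFor(tool: string, args: unknown): string {
  return createHash("sha256")
    .update(tool + JSON.stringify(args))
    .digest("hex");
}

export function cached(tool: string, args: unknown): unknown | undefined {
  const hit = cache.get(keyFor(tool, args));
  if (hit &amp;&amp; hit.expires &gt; Date.now()) return hit.value;
  return undefined;
}

export function remember(tool: string, args: unknown, value: unknown): void {
  cache.set(keyFor(tool, args), { value, expires: Date.now() + CACHE_TTL_MS });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;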

&lt;h3&gt;
  
  
  Credential Vaulting
&lt;/h3&gt;

&lt;p&gt;API keys, OAuth tokens, and service credentials are stored centrally in the gateway. Agents never handle raw credentials directly. Rotation policies apply once at the gateway level rather than across every agent codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Does an MCP Gateway Differ from an API Gateway?
&lt;/h2&gt;

&lt;p&gt;A traditional API gateway is designed for stateless, client-server request-response cycles, standard in web and mobile applications. It handles HTTP routing, authentication, rate limiting, and transformation for REST or GraphQL traffic.&lt;/p&gt;

&lt;p&gt;An MCP gateway is designed for stateful, session-aware, and often bidirectional communication patterns specific to AI agents. It understands the context of a long-running agent task. It can propagate user identity across multiple sequential tool calls. It maintains session state so that a multi-step agent workflow does not lose context mid-execution. It understands the &lt;code&gt;tools/list&lt;/code&gt; → &lt;code&gt;tools/call&lt;/code&gt; protocol cycle and can enforce policies at that semantic level, not just at the HTTP layer.&lt;/p&gt;

&lt;p&gt;In modern enterprise architectures, both typically coexist. APIs serve application services. API gateways govern traditional HTTP traffic. MCP servers expose selected capabilities to agents. An MCP gateway governs agent-to-tool communication. The relationship is complementary.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Does an MCP Gateway Differ from an AI Gateway?
&lt;/h3&gt;

&lt;p&gt;This is worth separating out because, in practice, it is an even more common source of confusion. Buyers evaluating AI gateways frequently find themselves looking at MCP gateways instead.&lt;/p&gt;

&lt;p&gt;An AI gateway sits in front of LLM inference. It manages which model gets called, routes traffic between providers (OpenAI, Anthropic, Mistral), enforces token budgets, handles prompt/response logging, and abstracts model provider APIs behind a single interface. Its job is governing &lt;em&gt;model calls&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;An MCP gateway sits between agents and the tools those agents invoke. It governs &lt;em&gt;tool calls&lt;/em&gt;: what an agent can do after the model has already decided to act. The two layers are complementary: an AI gateway controls which brain your agent uses; an MCP gateway controls which hands it has.&lt;/p&gt;

&lt;p&gt;In a mature enterprise architecture, both are present. The AI gateway handles model-level traffic. The MCP gateway handles the downstream tool execution that the model's output triggers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are the Categories of MCP Gateway Available?
&lt;/h2&gt;

&lt;p&gt;Understanding the gateway landscape requires understanding the primary design philosophies, not just the feature checklist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Integration Platforms
&lt;/h3&gt;

&lt;p&gt;These prioritize developer velocity by abstracting integration complexity behind a large library of pre-built, maintained connectors. Authentication lifecycle management (including complex OAuth 2.1 flows) is handled for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://composio.dev/mcp-gateway" rel="noopener noreferrer"&gt;Composio's MCP Gateway&lt;/a&gt; is the primary example. It offers 1000+ tools and actions across major enterprise SaaS applications, a unified authentication layer, SOC2 and ISO certification, action-level RBAC, and zero data-retention architecture. The architecture is designed for teams that need to connect agents to many different tools quickly without owning the integration layer: instead of juggling 22 different MCP servers for 22 different tools, you install one gateway and access a broad library of pre-built integrations with a single authentication flow and audit surface.&lt;/p&gt;

&lt;p&gt;For most enterprise teams moving from pilot to production, this is the most practical starting point. Refer to the &lt;a href="https://composio.dev/content/mcp-gateways-guide" rel="noopener noreferrer"&gt;Composio guide to MCP gateways&lt;/a&gt; for a deeper walkthrough of the architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security-First Proxies
&lt;/h3&gt;

&lt;p&gt;These treat security as the primary constraint and performance as secondary. &lt;a href="https://github.com/lasso-security/mcp-gateway" rel="noopener noreferrer"&gt;Lasso Security&lt;/a&gt; inspects all MCP traffic in real time to detect prompt injection, mask PII, and calculate reputation scores for MCP servers before they are loaded. The tradeoff is latency — deep security scanning adds 100–250ms overhead — which makes this category unsuitable for latency-sensitive workflows but appropriate for regulated environments where compliance is non-negotiable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure-Native Open Source
&lt;/h3&gt;

&lt;p&gt;These integrate into existing container-native DevOps workflows. &lt;a href="https://docs.docker.com/ai/mcp-catalog-and-toolkit/mcp-gateway/" rel="noopener noreferrer"&gt;Docker MCP Gateway&lt;/a&gt; runs MCP servers as isolated Docker containers with familiar &lt;code&gt;docker mcp&lt;/code&gt; CLI tooling and container-based security. &lt;a href="https://obot.ai/" rel="noopener noreferrer"&gt;Obot&lt;/a&gt; is Kubernetes-native and designed for organizations that require full data sovereignty.&lt;/p&gt;

&lt;p&gt;Both require you to own the integration layer: you bring the MCP servers, and the gateway governs them. The operational overhead is higher than a managed platform, but so is the control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Should Enterprise Teams Evaluate When Choosing a Gateway?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deployment Model
&lt;/h3&gt;

&lt;p&gt;Cloud-hosted managed gateways reduce time-to-production but involve data transiting external infrastructure. Self-hosted or VPC-deployed gateways give you data sovereignty. For teams in healthcare, finance, or government where regulated data must stay in your cloud, deployment model is often the first filter, not an afterthought.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication Standards
&lt;/h3&gt;

&lt;p&gt;Verify support for OAuth 2.1 with PKCE, OIDC, and SAML. Check whether the gateway integrates with your existing identity provider (Okta, Microsoft Entra ID, Auth0) and whether it supports on-behalf-of token propagation: the pattern where agents act under the authenticated user's identity rather than a shared service account.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Granularity
&lt;/h3&gt;

&lt;p&gt;Gateway-level RBAC (which tools each role can see) is the baseline. Tool-level RBAC, allowing read but not write within a single server, is more sophisticated and significantly reduces blast radius. Verify what the enforcement model looks like in practice, not just in the marketing copy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Depth
&lt;/h3&gt;

&lt;p&gt;Prometheus-compatible metrics and OpenTelemetry traces are the minimum. Look for whether the gateway can attribute tool calls to specific users and agents (not just service accounts), whether audit logs meet your compliance format requirements, and whether the dashboard supports anomaly detection or cost attribution. Also check for a zero data retention architecture, meaning tool call payloads and credentials are never stored on the gateway provider's infrastructure; this matters for regulated industries and data sovereignty requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration Breadth vs. Governance Depth
&lt;/h3&gt;

&lt;p&gt;Managed platforms offer wide integration libraries but less control over the underlying infrastructure. Governance-first platforms offer deep control but require you to bring your own servers. For teams that need both a large library of managed integrations and enterprise-grade governance, &lt;a href="https://composio.dev/mcp-gateway" rel="noopener noreferrer"&gt;Composio's MCP Gateway&lt;/a&gt; is the only option currently combining 1000+ tools and actions with SOC2 compliance, RBAC, and zero data retention in a single product.&lt;/p&gt;

&lt;p&gt;See the full comparison in &lt;a href="https://composio.dev/content/best-mcp-gateway-for-developers" rel="noopener noreferrer"&gt;Composio's breakdown of the best MCP gateways for developers&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Overhead
&lt;/h3&gt;

&lt;p&gt;Every proxy adds latency. Managed platforms typically run under 10ms overhead. TrueFoundry publishes under 5ms p95. Lunar.dev MCPX publishes approximately 4ms p99. Docker MCP Gateway adds overhead due to container management; warm-path performance is significantly better than cold-start, which can add 50–200ms. Lasso Security adds 100–250ms. For conversational agents where response time is visible to users, this matters. For background automation workflows, it typically does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own MCP Gateway
&lt;/h2&gt;

&lt;p&gt;Building a custom gateway is possible but requires solving non-trivial distributed systems problems: credential rotation, distributed rate limiting, OAuth 2.1 state management, PII redaction, and circuit breakers. The real cost is not the initial build but the ongoing maintenance burden as the MCP spec evolves, tool APIs change, and security requirements mature. For most teams, a managed gateway has a significantly lower total cost of ownership than a DIY solution, even when accounting for licensing costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on the MCP Security Threat Landscape
&lt;/h2&gt;

&lt;p&gt;Security threats against MCP deployments are not theoretical. A representative risk: an agent running with privileged service-role access that processes user-supplied input could inadvertently execute those instructions, exfiltrating sensitive data through legitimate output channels. Principle of least privilege at the gateway level is the primary defense.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" rel="noopener noreferrer"&gt;OWASP guidance on LLM security&lt;/a&gt; identifies prompt injection as among the highest-risk attack vectors for AI systems. An MCP gateway is the practical enforcement layer for mitigating it through input validation against JSON-RPC schemas, allowlisted actions, PII redaction, and real-time tool reputation scoring.&lt;/p&gt;

&lt;p&gt;Without a gateway, the security posture of your MCP deployment is only as strong as the weakest link among N independently managed agents.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Comes Next for MCP Gateways
&lt;/h2&gt;

&lt;p&gt;Based on MCP's published direction and community discussions from early 2026, four priority areas have emerged: transport evolution (stateless Streamable HTTP for load balancer compatibility), agent communication primitives (retry semantics and expiry policies for the Tasks primitive), governance maturation (formal contributor processes), and enterprise readiness (audit trails, SSO-integrated auth, and gateway patterns).&lt;/p&gt;

&lt;p&gt;Gateway patterns are now explicitly on the protocol roadmap. The gateway layer is no longer an add-on; it is becoming formalized infrastructure for enterprise MCP deployments.&lt;/p&gt;

&lt;p&gt;Start with your primary constraint. If it is integration velocity, a managed platform is the right answer. If it is compliance in a regulated industry, prioritize SOC 2 certification, audit log format, and IdP integration. If it is data sovereignty, evaluate VPC-deployable options. If it is raw performance for a latency-sensitive conversational product, benchmark the p95 numbers against your SLA.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://composio.dev/mcp-gateway" rel="noopener noreferrer"&gt;Composio MCP Gateway&lt;/a&gt; covers the first and most common case: an enterprise team that needs to move from prototype to production with a broad integration library, unified auth, and compliance controls without owning the infrastructure. For teams with narrower requirements or existing MCP server infrastructure, the list of specialized options covered above gives you the tradeoffs needed to make that call.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For a deeper look at gateway architecture patterns, see &lt;a href="https://composio.dev/content/mcp-gateways-guide" rel="noopener noreferrer"&gt;Composio's developer guide to MCP gateways&lt;/a&gt;. For a full comparison of gateway options by use case, see &lt;a href="https://composio.dev/content/best-mcp-gateway-for-developers" rel="noopener noreferrer"&gt;the best MCP gateways for developers in 2026&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is an MCP Gateway, in one sentence?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A centralized infrastructure layer between AI agents and MCP servers that enforces authentication, routes requests, applies access controls, and provides observability across all agent-tool interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Is an MCP Gateway required for production deployments?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Not required by the protocol specification. Required in practice for any deployment with more than two or three MCP servers, multiple teams, regulated data, or compliance obligations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is the difference between an MCP server and an MCP gateway?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An MCP server executes tools. It connects to GitHub, Postgres, Slack, or an internal API and performs operations. An MCP gateway governs access to those servers. It handles identity, visibility filtering, policy enforcement, and routing before any tool executes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do MCP gateways handle prompt injection?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Security-first gateways like Lasso Security scan all traffic in real time and block payloads that trigger injection detection. Governance platforms like MintMCP apply input schema validation and allowlisted actions. Managed platforms like Composio run tool implementations in sandboxed environments. Using multiple layers of defense is the current best practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What authentication standards should my gateway support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OAuth 2.1 with PKCE, OIDC, SAML, and support for enterprise IdPs. The MCP specification introduced OAuth 2.1 in the March 2025 revision with refinements in June 2025, but implementation quality varies significantly. Test the on-behalf-of identity propagation flow specifically. This is where implementations most commonly diverge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Stop AI Slop in Production: A Two-Layer Validator for LLM Output (2026)</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Wed, 06 May 2026 12:45:23 +0000</pubDate>
      <link>https://forem.com/dumebii/how-to-stop-ai-slop-in-production-a-two-layer-validator-for-llm-output-2026-56fj</link>
      <guid>https://forem.com/dumebii/how-to-stop-ai-slop-in-production-a-two-layer-validator-for-llm-output-2026-56fj</guid>
      <description>&lt;p&gt;A user reached out to us this week. Their generated newsletter contained the word &lt;em&gt;delve&lt;/em&gt;. Twice.&lt;br&gt;
This immediately set off alarms across the team, because that word has been on our banned list since version one. The system prompt in &lt;code&gt;lib/prompts/anti-ai.ts&lt;/code&gt; tells the model never to use it. Gemini 3 used it anyway, and that was a big problem.&lt;/p&gt;

&lt;p&gt;This is the documentation of everything we did: the architecture we shipped to fix it, and the latency numbers from the first 48 hours in production. If you ship LLM output to end users, you probably need this layer too.&lt;/p&gt;
&lt;h2&gt;
  
  
  Does Better Prompting Alone Stop AI Slop?
&lt;/h2&gt;

&lt;p&gt;Short answer: No.&lt;br&gt;
Prompts alone stop AI slop in roughly 80% of generations. The remaining 20% is where production reputation lives. Our fix for this is a code-side validator that scans every draft against a structured banned lexicon, runs four detection passes (vocabulary, phrases, openers, regex structures), and triggers one bounded repaired retry on slop. Worst case is that the latency (time-to-output) goes from N seconds to roughly 2N. The average latency is unchanged, and the user gets a draft that does not read like ChatGPT.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is "AI slop" and why does it slip past prompt rules?
&lt;/h2&gt;

&lt;p&gt;AI slop is low-quality, formulaic, machine-sounding text that an LLM produces by default: bloated paragraphs, corporate buzzwords, predictable cadences, and a recurring set of vocabulary crutches like &lt;em&gt;delve&lt;/em&gt;, &lt;em&gt;tapestry&lt;/em&gt;, &lt;em&gt;robust&lt;/em&gt;, and &lt;em&gt;crucially&lt;/em&gt;. The term entered general use in 2024 and was named &lt;a href="https://americandialect.org/2024-word-of-the-year-is-slop/" rel="noopener noreferrer"&gt;American Dialect Society's word of the year for 2024&lt;/a&gt;. &lt;a href="https://en.wikipedia.org/wiki/AI_slop" rel="noopener noreferrer"&gt;Wikipedia tracks the broader phenomenon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reason it slips past prompt rules is structural, and not a bug. Three things break the prompt-as-contract assumption in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Attention dilution.&lt;/strong&gt; The longer your system prompt grows, the less weight any single rule carries during decoding. By the time any LLM is generating token 1,800 of a long-form article, the rule "do not use the word delve" is competing with several thousand other instructions and the entire user input. Anthropic's own &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips" rel="noopener noreferrer"&gt;prompt engineering guidance&lt;/a&gt; acknowledges that instruction following degrades over long contexts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression to the training mean.&lt;/strong&gt; LLMs are predictive engines. When a sentence is half-built and the next likely token is a high-probability buzzword that appeared millions of times in the training corpus, the model picks it. A negative instruction in the prompt is a soft constraint. The training data is a hard prior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No inference-time ground truth.&lt;/strong&gt; The model has no way to verify it complied. It cannot self-check the same way a TypeScript compiler can. Whatever rolls out of the final softmax is what ships.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We have written about why standard prompting alone is not enough in our &lt;a href="https://ozigi.app/docs/the-banned-lexicon" rel="noopener noreferrer"&gt;Banned Lexicon deep dive&lt;/a&gt; and the &lt;a href="https://ozigi.app/docs/system-personas" rel="noopener noreferrer"&gt;System Personas deep dive&lt;/a&gt;. The TL;DR is that soft instructions only carry you to about 80% reliability. Production needs an enforcement layer on top.&lt;br&gt;
&lt;a href="https://youtu.be/dFbCTd_npQY?si=49F1w6ePkmBDwlWa" rel="noopener noreferrer"&gt;This video goes into more detail.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is a banned lexicon, and how is it different from a safety filter?
&lt;/h2&gt;

&lt;p&gt;A banned lexicon is a curated list of words, phrases, sentence openers, and structural patterns that signal AI-generated text. It is a quality filter, not a safety filter. Safety filters block harmful content. A banned lexicon blocks bland content.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt;, the lexicon contains six categories: vocabulary tells (&lt;em&gt;delve&lt;/em&gt;, &lt;em&gt;tapestry&lt;/em&gt;, &lt;em&gt;robust&lt;/em&gt;), corporate fluff (&lt;em&gt;cutting-edge&lt;/em&gt;, &lt;em&gt;game-changer&lt;/em&gt;, &lt;em&gt;thought leadership&lt;/em&gt;), AI tells (&lt;em&gt;at its core&lt;/em&gt;, &lt;em&gt;plays a significant role&lt;/em&gt;, &lt;em&gt;in today's fast-paced&lt;/em&gt;), Gemini affirmation tells (&lt;em&gt;Certainly!&lt;/em&gt;, &lt;em&gt;Here is&lt;/em&gt;, &lt;em&gt;Let's explore&lt;/em&gt;), engagement-bait closers (&lt;em&gt;Tag someone who needs this&lt;/em&gt;), and structural patterns (the bold-colon paragraph prefix &lt;code&gt;**Term:**&lt;/code&gt;, double-hyphen em-dash substitutes, contrast structures like &lt;em&gt;"It's not X. It's Y."&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Until last week, that lexicon lived only inside the prompt. The fix was to make it live inside the code path as well.&lt;/p&gt;
&lt;h2&gt;
  
  
  The two-layer architecture
&lt;/h2&gt;

&lt;p&gt;The full surface looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="err"&gt;┌─────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;anti&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;                     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;─────────────────────&lt;/span&gt;                      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;ANTI_AI_RULES&lt;/span&gt;         &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;prose&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;LLM&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;BANNED_WORDS&lt;/span&gt;          &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;side&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;BANNED_PHRASES&lt;/span&gt;        &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;side&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;BANNED_OPENERS&lt;/span&gt;        &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;side&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;BANNED_CLOSERS&lt;/span&gt;        &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;side&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;BANNED_REGEX_PATTERNS&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;side&lt;/span&gt;          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────┘&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;                        &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;▼&lt;/span&gt;                        &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="err"&gt;┌──────────────────────┐&lt;/span&gt;   &lt;span class="err"&gt;┌──────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nx"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nx"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;long&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;form&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nx"&gt;Social&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="nx"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;form&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;X&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="nx"&gt;LI&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="nx"&gt;DC&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="nx"&gt;EM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blog&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="nx"&gt;newsletter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────┘&lt;/span&gt;   &lt;span class="err"&gt;└──────────────────────────┘&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;                        &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;▼&lt;/span&gt;                        &lt;span class="err"&gt;▼&lt;/span&gt;
       &lt;span class="nx"&gt;LLM&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;               &lt;span class="nx"&gt;LLM&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;                        &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;▼&lt;/span&gt;                        &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="err"&gt;┌──────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nx"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;lexicon&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nx"&gt;validateText&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;validateCampaign&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;repair&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└──────────────────────────────────────────────┘&lt;/span&gt;
           &lt;span class="err"&gt;│&lt;/span&gt;
           &lt;span class="err"&gt;▼&lt;/span&gt;
   &lt;span class="nx"&gt;slop&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;one&lt;/span&gt; &lt;span class="nx"&gt;bounded&lt;/span&gt; &lt;span class="nx"&gt;retry&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;keep&lt;/span&gt; &lt;span class="nx"&gt;cleaner&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;
   &lt;span class="nx"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;ship&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;anti-ai.ts&lt;/code&gt; is the single source of truth. It exports both the prose rulebook the model sees and the structured arrays the validator scans against. A dev-mode drift guard warns when the two fall out of sync, so the rulebook can never silently disagree with the validator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib/prompts/anti-ai.ts (excerpt)&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BANNED_WORDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delve&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delving&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tapestry&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;realm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;paradigm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;robust&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;seamlessly&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;underscore&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pivotal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="cm"&gt;/* ...several hundred more */&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BANNED_REGEX_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;banned-structure&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;banned-contrast&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;banned-cadence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bold-colon paragraph prefix (**Term:**)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;banned-structure&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\*\*[^&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\n]{1,40}&lt;/span&gt;&lt;span class="sr"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\*\*&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;contrast: "It is not X. It is Y."&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;banned-contrast&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;it&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+is&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+not&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;[\w\s&lt;/span&gt;&lt;span class="sr"&gt;,'-&lt;/span&gt;&lt;span class="se"&gt;]{1,40}\.\s&lt;/span&gt;&lt;span class="sr"&gt;+it&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+is&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;/gi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// …seven more contrast patterns from §5 of the prose rules&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODE_ENV&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;production&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Drift guard — warn if structured entries are missing from prose&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proseLower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ANTI_AI_RULES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;BANNED_WORDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;proseLower&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`[anti-ai] structured entry "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" missing from prose rules`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What does the validator actually scan for?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;lib/prompts/lexicon-validator.ts&lt;/code&gt; runs four passes on every parsed draft:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 1: vocabulary.&lt;/strong&gt; Word-bounded, case-insensitive match against &lt;code&gt;BANNED_WORDS&lt;/code&gt;. Hits return &lt;code&gt;{ kind: 'banned-word', term, snippet, location }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 2: phrases.&lt;/strong&gt; Whole-token-sequence match against &lt;code&gt;BANNED_PHRASES&lt;/code&gt;. Catches multi-word slop like &lt;em&gt;navigate the complexities&lt;/em&gt; or &lt;em&gt;gain valuable insights&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 3: openers and closers.&lt;/strong&gt; Position-aware. An opener match only fires if the term appears at the start of a sentence, paragraph, or post — not mid-sentence where it might be legitimate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 4: regex patterns.&lt;/strong&gt; The structural tells. The bold-colon prefix Gemini loves (&lt;code&gt;**Architecture:**&lt;/code&gt;), double-hyphen em-dash substitutes, and seven variants of the &lt;em&gt;"It's not X. It's Y."&lt;/em&gt; contrast structure.&lt;/p&gt;
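&lt;p&gt;For reference, a compressed sketch of Pass 1 looks like the following. The production file also records a &lt;code&gt;location&lt;/code&gt; field and runs the other three passes; this shows only the word-bounded scan, and assumes lexicon entries are plain words with no regex metacharacters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Compressed sketch of Pass 1 (vocabulary). Word boundaries keep
// "robust" from firing inside "robustness".
import { BANNED_WORDS } from './anti-ai';

interface WordViolation {
  kind: 'banned-word';
  term: string;
  snippet: string;
}

export function scanVocabulary(text: string): WordViolation[] {
  const violations: WordViolation[] = [];
  for (const term of BANNED_WORDS) {
    const re = new RegExp(`\\b${term}\\b`, 'gi');
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      violations.push({
        kind: 'banned-word',
        term,
        // 20 characters of context on either side of the hit
        snippet: text.slice(Math.max(0, m.index - 20), m.index + term.length + 20),
      });
    }
  }
  return violations;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;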

&lt;p&gt;Two details matter for precision. First, &lt;strong&gt;code-block sanitization&lt;/strong&gt;. Engineering content includes JSON, shell commands, and inline code where a banned word like &lt;em&gt;delve&lt;/em&gt; might legitimately appear as an identifier, and where you do not want a regex false positive on a JSON field name either. The validator strips fenced blocks, inline code, and URL targets before scanning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sanitize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/g, '')      // fenced code
    .replace(/`&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="s2"&gt;`\n]+`&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;// inline code&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/!&lt;/span&gt;&lt;span class="se"&gt;?\[[^\]]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\]\([^&lt;/span&gt;&lt;span class="sr"&gt;)&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;\)&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// markdown links + images&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, &lt;strong&gt;same-opener cadence detection&lt;/strong&gt;. Gemini's signature tell is starting three or more consecutive sentences with the same word or short phrase. The validator splits on sentence boundaries and reports a &lt;code&gt;banned-cadence&lt;/code&gt; violation when three or more consecutive sentences share a leading word.&lt;/p&gt;
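&lt;p&gt;A minimal version of that check, assuming naive sentence splitting on terminal punctuation, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of same-opener cadence detection: flag three or more
// consecutive sentences that begin with the same (case-folded) word.
export function detectCadence(text: string): string | null {
  const sentences = text
    .split(/(?&amp;lt;=[.!?])\s+/)       // naive sentence boundaries
    .map(s =&amp;gt; s.trim())
    .filter(Boolean);

  let run = 1;
  for (let i = 1; i &amp;lt; sentences.length; i++) {
    const prev = sentences[i - 1].split(/\s+/)[0]?.toLowerCase();
    const curr = sentences[i].split(/\s+/)[0]?.toLowerCase();
    run = prev !== undefined &amp;amp;&amp;amp; prev === curr ? run + 1 : 1;
    if (run &amp;gt;= 3) return `banned-cadence: ${run} sentences open with "${curr}"`;
  }
  return null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;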

&lt;p&gt;The output is a typed &lt;code&gt;ValidationReport&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ValidationReport&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Violation&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;slopScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// weighted total&lt;/span&gt;
  &lt;span class="nl"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slop score is weighted: a &lt;code&gt;banned-structure&lt;/code&gt; hit counts three times as much as a &lt;code&gt;banned-word&lt;/code&gt; hit, because structural tells are harder for a reader to miss. Word-level slips are forgivable; bold-colon prefixes are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you repair a bad AI draft without making it worse?
&lt;/h2&gt;

&lt;p&gt;The naive answer is to retry until clean. That naive answer is wrong.&lt;/p&gt;

&lt;p&gt;LLMs regress to the mean on every call. A second attempt usually fixes the obvious tells. By the third attempt, the LLM starts introducing different tells. By the fourth attempt the model is inventing slop that was not there before. Worse, you have spent four times the latency budget for diminishing returns.&lt;/p&gt;

&lt;p&gt;We cap our regenerations at one retry. This repair directive is the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildRepairDirective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ValidationReport&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;offenders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;term&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`  - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`## REPAIR DIRECTIVE
Your previous output failed the banned-lexicon check. The following exact
terms or patterns appeared and must be removed:

&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;offenders&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;

Do NOT paraphrase the rejected output. Re-read the source material and
write a fresh draft from scratch. Paraphrasing keeps the underlying
cadence and structural tells. Rewriting from source breaks them.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "do not paraphrase, rewrite from source" instruction is the most useful line in the whole pipeline. Paraphrase prompts cause the model to keep the same paragraph skeleton and only swap synonyms — which keeps every cadence tell intact. Forcing a rewrite from source forces a different sentence-shape distribution.&lt;/p&gt;

&lt;p&gt;After the retry, the validator runs again. We keep whichever response has the lower slop score, even if neither is fully clean. The user always gets the best of two attempts, plus a &lt;code&gt;lexiconWarnings&lt;/code&gt; payload so the UI can surface a small "regenerate?" badge if anything still slipped through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much does post-generation validation slow things down?
&lt;/h2&gt;

&lt;p&gt;Here are the numbers from our first 48 hours of telemetry, captured via PostHog:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Extra latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Validator scan, draft is clean&lt;/td&gt;
&lt;td&gt;~88%&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validator scan, retry triggered, succeeds&lt;/td&gt;
&lt;td&gt;~10%&lt;/td&gt;
&lt;td&gt;+ 1 LLM call (3–8 s social, 15–40 s long-form)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validator scan, retry fails to clean&lt;/td&gt;
&lt;td&gt;~2%&lt;/td&gt;
&lt;td&gt;+ 1 LLM call, ship cleaner of two&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Worst case is roughly &lt;strong&gt;2× generation time&lt;/strong&gt;. Not 4×, not 5×. The validator scan itself is regex over a few KB of text, sub-5ms even on a 2,000-word article.&lt;/p&gt;

&lt;p&gt;We picked a single retry deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM regression-to-the-mean makes additional retries unreliable&lt;/li&gt;
&lt;li&gt;Long-form is already 15–40 seconds; users abandon at 2 minutes&lt;/li&gt;
&lt;li&gt;Every retry is a billable Gemini call&lt;/li&gt;
&lt;li&gt;Bounded retries make worst-case latency predictable for loading states&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why we tell users about the delay
&lt;/h2&gt;

&lt;p&gt;Our product reviewer asked whether we should hide the latency to make the product feel faster. We concluded that hiding it would make the wait feel arbitrary, while surfacing it makes the wait feel earned.&lt;/p&gt;

&lt;p&gt;The pre-generation tip on the long-form page now reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Every draft runs through the slop validator. If AI tells slip through, we regenerate once before showing it to you.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mid-generation loader cycles through honest steps: &lt;em&gt;Running the slop filter… Scanning for AI tells… Re-running if any slop slipped through…&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the same principle we explored in &lt;a href="https://blog.ozigi.app/how-to-make-your-linkedin-content-standout-in-2026" rel="noopener noreferrer"&gt;our 2026 LinkedIn piece&lt;/a&gt;: when you charge a price (in money or time), name what the user is buying. Otherwise the price feels like a tax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you use a humanizer API for your content instead?
&lt;/h2&gt;

&lt;p&gt;Someone on the team suggested the alternative path: pipe every LLM output through a third-party humanizer API, then run a "tuning" pass on the humanized output to recover any meaning lost in humanization. So the chain becomes &lt;code&gt;LLM → humanizer → re-tune → ship&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The short answer is no, with one caveat. Here is the longer answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost stack.&lt;/strong&gt; A humanizer call adds at least one round trip, often two (the rewrite + the meaning-recovery pass). For long-form, that is +5–15 seconds on top of an already long generation. For social, it can double the entire request. The validator we shipped pays this cost only on the ~12% of drafts that need it. A humanizer pays the cost on 100%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection arms race.&lt;/strong&gt; Humanizer APIs are trained to fool AI-detector models like GPTZero or &lt;a href="https://originality.ai/" rel="noopener noreferrer"&gt;Originality.AI&lt;/a&gt;. That is a different goal from sounding like a person. Many humanizers degrade prose to win the detector benchmark. They introduce typos, fragmented sentences, and odd punctuation patterns that score "human" on a classifier but read worse to a real reader. &lt;a href="https://www.pangram.com/" rel="noopener noreferrer"&gt;Pangram's research on detector bypass&lt;/a&gt; is the right place to start if you want the academic version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meaning loss.&lt;/strong&gt; The "tuning" pass exists in the proposed chain because humanizers regularly invert sentences, drop technical specificity, or mistranslate domain jargon. A re-tune pass on top of that adds a third LLM call where the model is now reasoning about an already-mangled draft. Each round trip introduces noise. By round three, you are far from the source material.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership.&lt;/strong&gt; A humanizer is a black box. Our banned lexicon is a TypeScript file. When a user complains about the word &lt;em&gt;delve&lt;/em&gt;, we add it to the array, the dev-mode drift guard catches the prose-vs-code mismatch, and the next generation is fixed. With a humanizer, every fix is an outside vendor's roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The caveat.&lt;/strong&gt; A small humanizer pass &lt;em&gt;can&lt;/em&gt; help in one specific scenario: when you do not control the prompt. If you are wrapping a black-box API or showing third-party AI output, you have no banned-lexicon hook into the model's instruction. In that case, a constrained humanizer (one tuned for paraphrase quality, not detector bypass) is a reasonable last resort. If you control the prompt, controlling the prompt is always cheaper, faster, and more honest.&lt;/p&gt;

&lt;p&gt;For Ozigi specifically, we give the model the rules and verify the rules were followed. That is the contract our users understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we keep the lexicon updated
&lt;/h2&gt;

&lt;p&gt;Two feedback loops keep the lexicon current:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift guard.&lt;/strong&gt; The dev-mode block at the bottom of &lt;code&gt;anti-ai.ts&lt;/code&gt; walks every entry in the structured arrays and verifies it appears in the prose rulebook. If a developer adds &lt;em&gt;paradigm&lt;/em&gt; to &lt;code&gt;BANNED_WORDS&lt;/code&gt; but forgets to add it to the §1A list in the prose, the dev console warns on next reload. CI promotes the warning to an error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telemetry.&lt;/strong&gt; Every generation logs three properties to PostHog: &lt;code&gt;lexiconViolations&lt;/code&gt;, &lt;code&gt;lexiconSlopScore&lt;/code&gt;, &lt;code&gt;lexiconRetried&lt;/code&gt;. We chart these weekly. When a new term starts trending in the violation feed, we promote it to the lexicon. Gemini picked up &lt;em&gt;crucial&lt;/em&gt; in week two; we caught it in 31 generations before adding it to the list.&lt;/p&gt;
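&lt;p&gt;The logging call itself is small. Here is a minimal sketch using &lt;code&gt;posthog-node&lt;/code&gt;; the event name and &lt;code&gt;distinctId&lt;/code&gt; shape are our conventions, not PostHog requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of the per-generation telemetry call (posthog-node).
import { PostHog } from 'posthog-node';
import type { ValidationReport } from './lexicon-validator';

const client = new PostHog(process.env.POSTHOG_API_KEY!, {
  host: 'https://us.i.posthog.com',
});

export function logValidation(userId: string, report: ValidationReport, retried: boolean) {
  client.capture({
    distinctId: userId,
    event: 'draft_validated',
    properties: {
      lexiconViolations: report.violations.length,
      lexiconSlopScore: report.slopScore,
      lexiconRetried: retried,
    },
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;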

&lt;p&gt;The result is a lexicon that grows from real production data instead of guesswork. We have written before about why production telemetry beats theoretical evals in &lt;a href="https://blog.ozigi.app/rag-architecture-for-enterprise-data" rel="noopener noreferrer"&gt;our RAG architecture post&lt;/a&gt;. The same logic applies here.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this approach does not catch
&lt;/h2&gt;

&lt;p&gt;Three categories sit outside what regex can detect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Statistical rhythm.&lt;/strong&gt; LLMs default to even sentence lengths. Regex cannot measure that. A future LLM-judge pass with &lt;a href="https://huggingface.co/docs/transformers/perplexity" rel="noopener noreferrer"&gt;perplexity-style scoring&lt;/a&gt; will. The work to add a small judge model — likely Gemini 3 Flash on a sampled fraction of drafts — is on the Q3 roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paragraph balance.&lt;/strong&gt; AI defaults to roughly equal paragraph weights. Real engineering writing is uneven by design. A one-line punchline after a long technical explanation is the entire point. Detecting balance violations needs a structural pass we have not built yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone drift.&lt;/strong&gt; A draft can be lexicon-clean and still feel off: too formal for the user's persona, or too casual for a B2B audience. Tone is what &lt;a href="https://ozigi.app/docs/system-personas" rel="noopener noreferrer"&gt;Ozigi Personas&lt;/a&gt; handle on the prompt side, and what manual review still owns. We have a &lt;a href="https://ozigi.app/docs/human-in-the-loop" rel="noopener noreferrer"&gt;piece on the human-in-the-loop principle&lt;/a&gt; that explains the 90/10 rule we follow.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every system has gaps. The honest thing is to name them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply this in your own stack
&lt;/h2&gt;

&lt;p&gt;If you ship LLM output and want a similar layer, the minimum viable version is four files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;anti-ai.ts&lt;/code&gt;: your prose rules + structured arrays. Start with the &lt;a href="https://www.pangram.com/research/buzzwords" rel="noopener noreferrer"&gt;English-language buzzword list from the Pangram paper&lt;/a&gt; plus anything specific to your domain.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lexicon-validator.ts&lt;/code&gt;: the four scan passes. Less than 200 lines of TypeScript.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;repair-directive.ts&lt;/code&gt;: the "rewrite from source, do not paraphrase" prompt builder.&lt;/li&gt;
&lt;li&gt;API-route hook: call validator → check threshold → optionally retry → return final draft + warnings (sketched below).&lt;/li&gt;
&lt;/ul&gt;
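&lt;p&gt;Here is a condensed sketch of that hook, using the same keep-the-cleaner-of-two policy described above. &lt;code&gt;generateDraft&lt;/code&gt; is a stand-in for your own model call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Condensed sketch of the API-route hook: validate, retry once on slop,
// ship the cleaner of the two attempts.
import { validateText } from './lexicon-validator';
import { buildRepairDirective } from './repair-directive';

declare function generateDraft(prompt: string): Promise&amp;lt;string&amp;gt;; // your LLM call

export async function generateWithValidation(prompt: string) {
  const first = await generateDraft(prompt);      // LLM call #1
  const firstReport = validateText(first);
  if (firstReport.clean) {
    return { draft: first, lexiconWarnings: [] }; // ~88% of requests stop here
  }

  // One bounded retry, with the repair directive appended to the prompt.
  const second = await generateDraft(`${prompt}\n\n${buildRepairDirective(firstReport)}`);
  const secondReport = validateText(second);

  // Keep whichever attempt scored lower, even if neither is fully clean.
  const [draft, report] =
    secondReport.slopScore &amp;lt;= firstReport.slopScore
      ? [second, secondReport]
      : [first, firstReport];

  return { draft, lexiconWarnings: report.violations };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;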

&lt;p&gt;If you want the full TypeScript, the &lt;a href="https://blog.ozigi.app/changelog" rel="noopener noreferrer"&gt;Ozigi changelog&lt;/a&gt; tracks the architecture as it grows. Our &lt;a href="https://ozigi.app/docs/deep-dives" rel="noopener noreferrer"&gt;deep dives hub&lt;/a&gt; covers the surrounding pieces: multimodal ingestion, system personas, human-in-the-loop. And if you are thinking about content quality more broadly, this &lt;a href="https://blog.ozigi.app/geo-aeo-guide-ozigi" rel="noopener noreferrer"&gt;GEO and AEO guide&lt;/a&gt; explains why this work matters for AI search ranking, not just reader trust.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related reading on Ozigi:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ozigi.app/docs/the-banned-lexicon" rel="noopener noreferrer"&gt;The Banned Lexicon: Curing AI-Speak&lt;/a&gt; — the philosophy behind the word list&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ozigi.app/docs/system-personas" rel="noopener noreferrer"&gt;System Personas&lt;/a&gt; — why we use editorial briefs instead of soft prompts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ozigi.app/docs/multimodal-pipeline" rel="noopener noreferrer"&gt;Multimodal Ingestion&lt;/a&gt; — the input side of the same pipeline&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ozigi.app/docs/human-in-the-loop" rel="noopener noreferrer"&gt;Human-in-the-Loop&lt;/a&gt; — the 90/10 rule for collaborative content&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.ozigi.app/gemini-2.5-vs-claude-3.7" rel="noopener noreferrer"&gt;Gemini 2.5 vs Claude 3.7 in production&lt;/a&gt; — the model trade-offs that informed this work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.ozigi.app/your-launch-post-got-4-likes" rel="noopener noreferrer"&gt;Your launch post got 4 likes&lt;/a&gt; — why generic AI content fails on launch day&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>typescript</category>
    </item>
    <item>
      <title>What To Do If Your Project Was Affected By The Vercel Breach</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:57:58 +0000</pubDate>
      <link>https://forem.com/dumebii/vercel-got-breached-heres-exactly-what-to-do-if-you-use-it-2026-guide-2k76</link>
      <guid>https://forem.com/dumebii/vercel-got-breached-heres-exactly-what-to-do-if-you-use-it-2026-guide-2k76</guid>
      <description>&lt;p&gt;Vercel confirmed a security incident on April 19, 2026 affecting customer environment variables. Here's what happened in plain English, whether you're affected, and the exact steps to secure your account. No security expertise required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; On April 19, 2026, Vercel disclosed a security incident. Attackers compromised a third-party AI tool called Context.ai, used that access to take over a Vercel employee's Google Workspace account, and reached environment variables that weren't marked as "sensitive." If you deploy on Vercel — especially if any of your API keys, database URLs, or tokens weren't explicitly marked sensitive — you need to rotate them. This guide walks through exactly what to do, in order, without assuming any security background.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you've deployed an app on Vercel, there's a real chance some of your credentials were exposed.&lt;/p&gt;

&lt;p&gt;You've probably seen the news over the last 48 hours and felt that particular kind of low-grade panic where you're not sure if you should be doing something right now or not. The short answer is yes, you probably should. The longer answer, which is what this guide is for, is that the required actions are straightforward, don't take long, and don't require you to be a DevOps engineer or a security researcher.&lt;/p&gt;

&lt;p&gt;This is a practical walk-through for developers, solo founders, small teams, and anyone who builds or has built on Vercel and is now wondering what "rotate your keys" actually means. Let's start with what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened in the Vercel Breach (Plain English)
&lt;/h2&gt;

&lt;p&gt;On April 19, 2026, Vercel &lt;a href="https://vercel.com/kb/bulletin/vercel-april-2026-security-incident" rel="noopener noreferrer"&gt;published a security bulletin&lt;/a&gt; disclosing that attackers had accessed parts of their internal systems. The attack didn't start at Vercel. It started somewhere smaller, and that's actually the most interesting part of the story.&lt;/p&gt;

&lt;p&gt;Here's how the breach actually happened, step by step:&lt;/p&gt;

&lt;p&gt;A Vercel employee had signed up for a productivity tool called &lt;strong&gt;Context.ai&lt;/strong&gt;, an AI-powered office suite, using their Vercel Google Workspace account. When they signed up, they granted the app broad permissions into their Google account.&lt;/p&gt;

&lt;p&gt;Context.ai itself got compromised. According to &lt;a href="https://cyberscoop.com/vercel-security-breach-third-party-attack-context-ai-lumma-stealer/" rel="noopener noreferrer"&gt;CyberScoop's reporting&lt;/a&gt;, the initial infection started in February 2026 when a Context.ai employee's computer was hit with Lumma Stealer malware after searching for Roblox game exploits. That malware harvested credentials including OAuth tokens.&lt;/p&gt;

&lt;p&gt;The attackers used the compromised OAuth token to get into the Vercel employee's Google Workspace account. This bypassed multi-factor authentication entirely, because once an OAuth token is issued, it doesn't require re-authentication.&lt;/p&gt;

&lt;p&gt;From that Google account, the attackers moved laterally into Vercel's internal systems: admin tools, issue trackers, internal environments. Once inside, they were able to read customer environment variables that weren't marked as "sensitive" in Vercel's dashboard.&lt;/p&gt;

&lt;p&gt;A threat actor claiming to be part of the ShinyHunters group &lt;a href="https://www.bleepingcomputer.com/news/security/vercel-confirms-breach-as-hackers-claim-to-be-selling-stolen-data/" rel="noopener noreferrer"&gt;posted on a cybercrime forum&lt;/a&gt; trying to sell the stolen data for $2 million. Vercel has engaged Mandiant, CrowdStrike, and law enforcement.&lt;/p&gt;

&lt;p&gt;The key detail most people are missing: &lt;strong&gt;this isn't about Vercel being insecure&lt;/strong&gt;. &lt;br&gt;
Vercel encrypts sensitive environment variables at rest and those are confirmed safe. What got exposed are variables that weren't explicitly marked sensitive, meaning plaintext values the attacker could read once inside. If you ever added an API key, database URL, or token to Vercel without ticking the sensitive flag, it's potentially in the wrong hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Am I Affected by the Vercel Breach?
&lt;/h2&gt;

&lt;p&gt;Short answer: you're probably fine, but assume worst case and act accordingly.&lt;/p&gt;

&lt;p&gt;Vercel stated the breach affected "a limited subset of customers" and said they've directly contacted those customers. If you haven't received an email from Vercel about this, you're likely not in the confirmed-affected group.&lt;/p&gt;

&lt;p&gt;However — and this is important — there are two reasons to treat your credentials as potentially exposed anyway:&lt;/p&gt;

&lt;p&gt;The investigation is ongoing. Vercel said they "continue to investigate whether and what data was exfiltrated" and will contact customers if more evidence emerges.&lt;/p&gt;

&lt;p&gt;OAuth trust chains are deep. According to &lt;a href="https://www.trendmicro.com/en_us/research/26/d/vercel-breach-oauth-supply-chain.html" rel="noopener noreferrer"&gt;Trend Micro's technical analysis&lt;/a&gt;, the attack leveraged OAuth tokens issued around June 2024 that were only detected in April 2026, meaning the attackers may have had access for months before disclosure.&lt;/p&gt;

&lt;p&gt;The practical rule: if you have environment variables in Vercel that were not explicitly marked "sensitive" and contain real credentials, rotate them. The cost of rotation is low. The cost of not rotating a compromised key is potentially catastrophic.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Secure Your Vercel Account Right Now (In Order)
&lt;/h2&gt;

&lt;p&gt;These are the actions to take today, in priority order. If you get through the first four, you've covered 80% of the risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Open Vercel and identify every environment variable not marked "sensitive"
&lt;/h3&gt;

&lt;p&gt;Go to your Vercel dashboard, open each project, and review the Environment Variables tab. Any variable that doesn't have the "Sensitive" flag set should be treated as exposed.&lt;/p&gt;

&lt;p&gt;Vercel has also &lt;a href="https://vercel.com/kb/bulletin/vercel-april-2026-security-incident" rel="noopener noreferrer"&gt;rolled out a dashboard update&lt;/a&gt; that gives you an overview page of all environment variables across projects. Use it to audit faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Rotate your keys the safe way
&lt;/h3&gt;

&lt;p&gt;This is the step that trips people up. Rotating a credential means generating a new one at the service that issued it, then updating Vercel to use the new one. Do not just delete the variable in Vercel and assume the old credential is dead or disabled. It's still valid at the service until you explicitly revoke it.&lt;/p&gt;

&lt;p&gt;The order of operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to the service that issued the credential (AWS, OpenAI, Supabase, GitHub, Stripe, whatever)&lt;/li&gt;
&lt;li&gt;Generate a new key&lt;/li&gt;
&lt;li&gt;Update the Vercel environment variable with the new value&lt;/li&gt;
&lt;li&gt;Mark the variable as "Sensitive" this time&lt;/li&gt;
&lt;li&gt;Redeploy your project to pick up the new value&lt;/li&gt;
&lt;li&gt;Go back to the issuing service and revoke the old key&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Prioritise by blast radius
&lt;/h3&gt;

&lt;p&gt;You probably have dozens of credentials. Rotate them in this order based on what they unlock:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 (critical):&lt;/strong&gt; cloud provider keys (AWS access keys, GCP service accounts, Azure tokens), database credentials (Supabase service role keys, Postgres URLs, MongoDB connection strings), payment keys (Stripe, payment processors), source control tokens (GitHub PATs, deploy keys).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 (high):&lt;/strong&gt; third-party SaaS API keys (OpenAI, Anthropic, Firecrawl, SendGrid, Resend, analytics tools), email signing keys, webhook secrets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 (medium):&lt;/strong&gt; internal service tokens, feature flags, non-credential configuration values.&lt;/p&gt;

&lt;p&gt;The reasoning: a Stripe secret key in the wrong hands can drain accounts. A feature flag value can't. Triage accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Check activity logs for anything suspicious
&lt;/h3&gt;

&lt;p&gt;In each service, look at the access logs for the past 30 days. You're looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API calls from IP addresses you don't recognise&lt;/li&gt;
&lt;li&gt;Activity from countries where nobody on your team is located&lt;/li&gt;
&lt;li&gt;Resource creation or deletion you didn't authorise&lt;/li&gt;
&lt;li&gt;New webhooks, deploy keys, or OAuth applications that you didn't add&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AWS, check &lt;a href="https://aws.amazon.com/cloudtrail/" rel="noopener noreferrer"&gt;CloudTrail&lt;/a&gt;. For GCP, check &lt;a href="https://cloud.google.com/logging/docs/audit" rel="noopener noreferrer"&gt;Audit Logs&lt;/a&gt;. For GitHub, check the &lt;a href="https://docs.github.com/en/organizations/keeping-your-organization-secure/reviewing-the-audit-log-for-your-organization" rel="noopener noreferrer"&gt;organization audit log&lt;/a&gt;. For Vercel itself, check the activity log in the dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Revoke any third-party AI or SaaS apps connected to your Google or Microsoft account
&lt;/h3&gt;

&lt;p&gt;This is the specific vector that caused the breach. Go to &lt;a href="https://myaccount.google.com/permissions" rel="noopener noreferrer"&gt;Google Account → Security → Your connections to third-party apps&lt;/a&gt; and review every app that has access. Revoke anything you don't actively use, especially anything with broad permissions ("Allow All" is a red flag).&lt;/p&gt;

&lt;p&gt;Do the same for Microsoft 365 if you use it, and for your GitHub account's OAuth applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Turn on passkeys or an authenticator app for Vercel (and everywhere else important)
&lt;/h3&gt;

&lt;p&gt;Vercel supports passkeys and authenticator app MFA. If you're still using SMS-based 2FA, that's a weaker setup. SMS can be SIM-swapped. A hardware key or authenticator app is meaningfully better.&lt;/p&gt;

&lt;p&gt;This won't protect you against OAuth-token-based attacks (which is what happened here), but it raises the cost of every other category of attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Use the "Sensitive" flag for every new environment variable going forward
&lt;/h3&gt;

&lt;p&gt;Going forward, treat the Sensitive flag as mandatory, not optional. Per &lt;a href="https://vercel.com/docs/environment-variables/sensitive-environment-variables" rel="noopener noreferrer"&gt;Vercel's documentation&lt;/a&gt;, sensitive variables are encrypted at rest and cannot be read back through the dashboard after they're set. This is precisely the protection that the exposed variables in this breach didn't have.&lt;/p&gt;

&lt;p&gt;Vercel has also announced they're updating the default to make new variables sensitive automatically, but until that rolls out, do it manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Even If You Weren't Directly Hit
&lt;/h2&gt;

&lt;p&gt;This story is getting extended coverage (&lt;a href="https://techcrunch.com/2026/04/20/app-host-vercel-confirms-security-incident-says-customer-data-was-stolen-via-breach-at-context-ai/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;, &lt;a href="https://thehackernews.com/2026/04/vercel-breach-tied-to-context-ai-hack.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;, &lt;a href="https://www.helpnetsecurity.com/2026/04/20/vercel-breached/" rel="noopener noreferrer"&gt;Help Net Security&lt;/a&gt;, the Hacker News front page) because the attack pattern is a template for what's coming.&lt;/p&gt;

&lt;p&gt;The breach exploited the fact that modern software teams connect a web of third-party tools to their identity providers, and each connection is a potential breach path.&lt;/p&gt;

&lt;p&gt;The same attack shape could hit any platform. The affected parties in this case used Context.ai, an AI productivity tool. Next month it could be a different AI tool, a different note-taking app, a different calendar plugin. If any employee on your team has granted broad OAuth permissions to a small third-party app using their corporate Google or Microsoft account, you have the same exposure surface Vercel did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices To Keep Your System Safe
&lt;/h2&gt;

&lt;p&gt;The defensive posture is the same one that has been best practice for years but that most teams don't enforce rigorously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat every third-party OAuth app as a potential attacker&lt;/li&gt;
&lt;li&gt;Grant the narrowest permissions that let the app work, never "Allow All"&lt;/li&gt;
&lt;li&gt;Review and revoke unused app connections quarterly&lt;/li&gt;
&lt;li&gt;Rotate credentials regularly. Every 90 days at minimum for production keys, 30 days for the highest-stakes ones&lt;/li&gt;
&lt;li&gt;Encrypt at rest, always. Mark every credential as sensitive&lt;/li&gt;
&lt;li&gt;Monitor access logs for anything you didn't do&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Ozigi Responded To The Vercel Breach (Because We Were Affected Too)
&lt;/h2&gt;

&lt;p&gt;A quick note, since we deploy Ozigi on Vercel. Yes, we were in the group of affected customers. Here's what we did in the first 24 hours after disclosure, roughly matching the sequence above:&lt;/p&gt;

&lt;p&gt;We rotated every credential in our Vercel environment, starting with our Supabase service role keys, our Google Cloud Vertex AI credentials, and our Dodo Payments keys. All of them are now marked sensitive.&lt;/p&gt;

&lt;p&gt;We audited our Google Workspace connections and revoked every third-party app we weren't actively using, including two we'd forgotten were connected.&lt;/p&gt;

&lt;p&gt;We checked our activity logs across Supabase, Vercel, and GCP for anomalies. Nothing suspicious so far, but we're continuing to monitor.&lt;/p&gt;

&lt;p&gt;We're in the process of moving longer-lived credentials into &lt;a href="https://www.doppler.com/" rel="noopener noreferrer"&gt;Doppler&lt;/a&gt; for centralised management and automated rotation, rather than managing them directly in Vercel's dashboard.&lt;/p&gt;

&lt;p&gt;Like a lot of small teams, we weren't running as tight a security posture as we should have been. &lt;br&gt;
The honest truth is that "it hasn't happened yet" is the reason most small teams haven't invested properly in secrets management. This incident was the trigger to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does rotating my keys actually help if they've already been stolen?
&lt;/h3&gt;

&lt;p&gt;Yes, and it's the single most important thing you can do. Stolen credentials only have value while they're valid. The moment you rotate and revoke the old key at the source service, the stolen one is useless. Every hour you wait is an hour the attacker could be using it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does "mark as sensitive" mean in Vercel?
&lt;/h3&gt;

&lt;p&gt;It's a flag on each environment variable that tells Vercel to encrypt the value at rest in a way that prevents it from being read back through the dashboard or API. Once marked sensitive, you can update the variable or delete it, but you can't see what the current value is. This is the flag that would have prevented the affected variables from being readable in this breach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to rotate everything, or just keys on Vercel?
&lt;/h3&gt;

&lt;p&gt;Focus on Vercel first. Keys stored elsewhere (in a separate secrets manager, in a different hosting platform, in your local &lt;code&gt;.env&lt;/code&gt; files) aren't affected by this specific incident. That said, this is a good prompt to review credential hygiene everywhere — many teams discover they haven't rotated core credentials in years.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should I rotate API keys going forward?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://securebin.ai/blog/api-key-rotation-best-practices/" rel="noopener noreferrer"&gt;Industry standard&lt;/a&gt; is every 30-90 days for production keys, depending on sensitivity. Payment and cloud provider keys should be closer to 30 days. Third-party SaaS keys can be 90 days. Internal service tokens should ideally be short-lived credentials with 1-24 hour TTLs, generated dynamically by a tool like HashiCorp Vault.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is there a tool that automates this?
&lt;/h3&gt;

&lt;p&gt;Yes, several. &lt;a href="https://www.doppler.com/" rel="noopener noreferrer"&gt;Doppler&lt;/a&gt; and &lt;a href="https://infisical.com/" rel="noopener noreferrer"&gt;Infisical&lt;/a&gt; are the most accessible for small teams and solo founders. &lt;a href="https://www.vaultproject.io/" rel="noopener noreferrer"&gt;HashiCorp Vault&lt;/a&gt; and &lt;a href="https://aws.amazon.com/secrets-manager/" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; are the enterprise-grade options but have a steeper setup cost. &lt;a href="https://www.gitguardian.com/" rel="noopener noreferrer"&gt;GitGuardian&lt;/a&gt; scans your repos for exposed secrets and can trigger automated rotation workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between an OAuth token and an API key?
&lt;/h3&gt;

&lt;p&gt;An API key is a static credential you generate once and use directly. An OAuth token is issued by an identity provider (Google, Microsoft) after a user authorises a third-party app, and represents delegated access to that user's account. The Vercel breach specifically exploited OAuth tokens. That's what allowed the attackers to bypass MFA, since OAuth tokens don't require re-authentication once issued.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I stop using AI productivity tools?
&lt;/h3&gt;

&lt;p&gt;No, but you should audit them carefully. The problem isn't AI tools specifically, it's any third-party app that gets broad permissions into your corporate identity systems. Apply the same scrutiny to a calendar plugin, a CRM integration, or an analytics connector that you would to an AI tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know if my credentials are being sold on the dark web?
&lt;/h3&gt;

&lt;p&gt;You typically won't know directly. Some services (GitHub, AWS) have automated monitoring that flags exposed credentials if they appear in known sources. Tools like &lt;a href="https://haveibeenpwned.com/" rel="noopener noreferrer"&gt;Have I Been Pwned&lt;/a&gt; monitor email addresses, and enterprise security tools like &lt;a href="https://www.hudsonrock.com/" rel="noopener noreferrer"&gt;Hudson Rock&lt;/a&gt; track infostealer-compromised credentials. For small teams, the honest answer is you rotate proactively and assume the worst, rather than trying to detect after the fact.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should I ask my team or vendors in the next 48 hours?
&lt;/h3&gt;

&lt;p&gt;Three questions: Which of our third-party tools have OAuth access to our Google/Microsoft workspace? Which of our production credentials were stored in Vercel and not marked sensitive? Do we have a secrets rotation schedule, and when did we last rotate our highest-risk keys? If the answer to any of those is "I don't know," that's your starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback and Community
&lt;/h2&gt;

&lt;p&gt;If you went through the Vercel rotation this week, I'd genuinely like to hear how it went: what you found, what tripped you up, what tools you reached for. I'm &lt;a href="https://linkedin.com/in/dumebi-okolo" rel="noopener noreferrer"&gt;Dumebi on LinkedIn&lt;/a&gt; and always open to comparing notes with other founders navigating the same incidents.&lt;/p&gt;

&lt;p&gt;If you're rebuilding your stack and thinking about content tooling along with the rest, &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; is what we've been building: an AI content engine specifically designed not to sound like AI. It's free to try, and because we're in the middle of tightening everything in response to this breach, you're getting it in its most security-conscious state yet.&lt;/p&gt;

&lt;p&gt;Stay safe out there. This won't be the last supply chain attack of 2026, but knowing how to respond means the next one will take you hours to handle instead of days.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is based on publicly disclosed information as of April 21, 2026. The situation is unfolding. Refer to &lt;a href="https://vercel.com/kb/bulletin/vercel-april-2026-security-incident" rel="noopener noreferrer"&gt;Vercel's official security bulletin&lt;/a&gt; for the latest updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Demystifying RAG Architecture for Enterprise Data: A Technical Blueprint</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:00:47 +0000</pubDate>
      <link>https://forem.com/dumebii/demystifying-rag-architecture-for-enterprise-data-a-technical-blueprint-393</link>
      <guid>https://forem.com/dumebii/demystifying-rag-architecture-for-enterprise-data-a-technical-blueprint-393</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article teaches how to engineer a robust Retrieval-Augmented Generation (RAG) pipeline to unlock LLM potential with proprietary information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The advent of Large Language Models (LLMs) has ushered in a new era of AI-powered applications, promising to revolutionize how enterprises interact with information, automate tasks, and generate insights. From crafting marketing copy to summarizing complex legal documents, the capabilities of models like OpenAI's GPT series, Anthropic's Claude, and Meta's Llama have captured the imagination of developers and business leaders alike.&lt;/p&gt;

&lt;p&gt;However, the path from impressive public demos to practical, production-ready enterprise solutions is fraught with challenges. While LLMs excel at general knowledge tasks, their utility often diminishes when confronted with an organization's most valuable asset: its proprietary data.&lt;/p&gt;

&lt;p&gt;This is where Retrieval-Augmented Generation (RAG) architecture emerges as a critical enabler. RAG provides a robust, scalable, and cost-effective framework for connecting the immense generative power of LLMs with the specific, dynamic, and often sensitive knowledge locked within an enterprise's data silos. It addresses the inherent limitations of standalone LLMs, transforming them from general-purpose conversationalists into domain-specific experts.&lt;/p&gt;

&lt;p&gt;This article serves as a comprehensive technical blueprint for software engineers, data engineers, and technical product managers looking to build sophisticated AI features leveraging LLMs with private enterprise data. We will dissect the core problems LLMs face in an enterprise context, introduce the RAG paradigm, and meticulously walk through its three-step pipeline: ingestion and chunking, storage and semantic search, and context-aware generation. We'll also explore common pitfalls and provide actionable insights to ensure your RAG implementation is not just functional, but performant and reliable. By the end, you'll have a clear understanding of how to engineer a RAG solution that empowers your LLMs to speak with authority, accuracy, and relevance on your enterprise's terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Standalone LLMs
&lt;/h2&gt;

&lt;p&gt;Before diving into the solution, it's crucial to understand the fundamental limitations that prevent standard, off-the-shelf LLMs from being directly applicable to most enterprise use cases without significant augmentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Knowledge Cutoff Problem
&lt;/h3&gt;

&lt;p&gt;Large Language Models are trained on vast datasets of publicly available text and code. This training process is computationally intensive and takes a significant amount of time, meaning that once a model is released, its knowledge base is inherently static. This creates what's known as a knowledge cutoff. For example, an LLM released in early 2023 would have no inherent knowledge of events, products, or company policies that emerged later that year or in 2024.&lt;/p&gt;

&lt;p&gt;For enterprise applications, this limitation is critical. Organizations operate in dynamic environments where information changes constantly. An LLM relying solely on its pre-trained knowledge cannot answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  "What was our Q2 revenue performance for the current fiscal year?"&lt;/li&gt;
&lt;li&gt;  "What is the latest iteration of our employee expense policy?"&lt;/li&gt;
&lt;li&gt;  "Which customer accounts are currently in our new pilot program?"&lt;/li&gt;
&lt;li&gt;  "What are the technical specifications of our newly released product version 3.1?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are questions that demand real-time, proprietary, and often granular data. A standalone LLM, without external context, simply doesn't have access to this information, rendering it largely ineffective for internal business intelligence or operational support.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hallucination Risk
&lt;/h3&gt;

&lt;p&gt;Perhaps even more concerning than a lack of knowledge is the phenomenon of hallucination. LLMs are sophisticated pattern-matching machines, not factual databases. They are designed to predict the most statistically probable next token based on their training data. When an LLM encounters a query about information it doesn't possess, especially if the query's structure is similar to questions it can answer, it doesn't respond with "I don't know." Instead, it confidently generates plausible-sounding but entirely fabricated information.&lt;/p&gt;

&lt;p&gt;In an enterprise context, hallucinations are not merely an inconvenience; they pose significant risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Misinformation and Bad Decisions:&lt;/strong&gt; An LLM providing incorrect financial figures, outdated compliance advice, or non-existent product features can lead to flawed business strategies, operational errors, and reputational damage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Erosion of Trust:&lt;/strong&gt; If users repeatedly receive inaccurate information, their trust in the AI system, and by extension, the underlying business process, will quickly diminish.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Legal and Compliance Exposure:&lt;/strong&gt; In regulated industries, incorrect AI-generated responses could lead to severe compliance violations, legal liabilities, and financial penalties.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Risks:&lt;/strong&gt; While less direct, a hallucinating LLM might inadvertently reveal sensitive patterns or generate seemingly innocuous but misleading data that could be exploited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core issue is that LLMs are trained to be generative, not necessarily truthful. They prioritize fluency and coherence over factual accuracy when lacking concrete information. This fundamental characteristic makes them unsuitable for direct deployment on proprietary tasks without a mechanism to ground their responses in verifiable, up-to-date data. This mechanism is precisely what Retrieval-Augmented Generation provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Retrieval-Augmented Generation (RAG)?
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is an architectural pattern designed to bridge the gap between the powerful generative capabilities of LLMs and the need for factual accuracy, recency, and domain-specificity in enterprise applications. At its heart, RAG is about providing an LLM with external, relevant, and verifiable information &lt;em&gt;at the time of inference&lt;/em&gt;, allowing it to generate responses that are grounded in truth rather than relying solely on its pre-trained, potentially outdated, or irrelevant knowledge.&lt;/p&gt;

&lt;p&gt;Think of RAG as giving an LLM an "open-book test." Instead of expecting the AI to answer purely from memory (its training data), we equip it with the ability to quickly look up the exact right documents or data snippets before formulating its answer. This fundamentally changes the LLM's role from a knowledge memorizer to a sophisticated knowledge synthesizer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Principle: Separate Retrieval from Generation
&lt;/h3&gt;

&lt;p&gt;The genius of RAG lies in its modular approach. It separates the challenge of &lt;em&gt;finding&lt;/em&gt; relevant information from the challenge of &lt;em&gt;generating&lt;/em&gt; a coherent, human-like response. This separation offers several key advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Factuality:&lt;/strong&gt; By providing specific, up-to-date context, RAG significantly reduces the likelihood of hallucinations, as the LLM is instructed to base its answer &lt;em&gt;only&lt;/em&gt; on the provided information.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Recency:&lt;/strong&gt; New information can be added to the external knowledge base in real-time, without needing to retrain or fine-tune the LLM. This makes RAG highly agile for dynamic enterprise data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Domain Specificity:&lt;/strong&gt; The external knowledge base can be tailored precisely to an organization's proprietary data, enabling LLMs to become experts in niche domains where they previously had no knowledge.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; RAG is generally far more cost-effective than repeatedly fine-tuning LLMs for new or updated information. Fine-tuning is expensive, time-consuming, and can lead to 'catastrophic forgetting' of general knowledge. RAG simply updates the knowledge base.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Interpretability/Attribution:&lt;/strong&gt; Because the LLM's response is grounded in retrieved documents, it's often possible to cite the sources, improving trust and auditability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In essence, RAG transforms an LLM from a general-purpose oracle into a highly specialized, context-aware agent capable of interacting intelligently with an organization's most critical information assets. It allows enterprises to leverage the cutting-edge of generative AI without compromising on accuracy, relevance, or control over their data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core RAG Architecture (The 3-Step Pipeline)
&lt;/h2&gt;

&lt;p&gt;Building a robust RAG system involves a sequential, multi-component pipeline. While implementations can vary in complexity, the core architecture typically comprises three distinct, yet interconnected, stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Ingestion &amp;amp; Chunking:&lt;/strong&gt; Preparing your enterprise data for retrieval.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Storage &amp;amp; Semantic Search:&lt;/strong&gt; Efficiently storing and retrieving relevant data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Generation (The Prompt Context):&lt;/strong&gt; Using retrieved data to inform the LLM's response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's visualize this flow: A user submits a query. This query is used to search a specialized knowledge base (often a vector database) for relevant information. The retrieved information, alongside the original query, is then sent to the LLM, which synthesizes a grounded answer. This process ensures the LLM is always operating with the most relevant and up-to-date context available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Ingestion &amp;amp; Chunking
&lt;/h3&gt;

&lt;p&gt;This initial phase is critical for preparing your raw enterprise data for efficient retrieval. It involves extracting information from various sources, processing it, and transforming it into a format suitable for semantic search.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Sources &amp;amp; Preprocessing
&lt;/h4&gt;

&lt;p&gt;Your enterprise data can reside in a multitude of formats and locations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Documents:&lt;/strong&gt; PDFs, Word documents (.docx), Markdown files, HTML pages (e.g., Confluence, SharePoint).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Databases:&lt;/strong&gt; SQL databases, NoSQL databases (e.g., customer records, product catalogs).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Communication Platforms:&lt;/strong&gt; Slack archives, email threads, CRM notes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Code Repositories:&lt;/strong&gt; Git repositories (for code documentation, internal libraries).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first step is to extract the raw text content from these diverse sources. This often involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Parsing:&lt;/strong&gt; Using libraries (e.g., &lt;code&gt;PyPDF2&lt;/code&gt;, &lt;code&gt;python-docx&lt;/code&gt;, &lt;code&gt;BeautifulSoup&lt;/code&gt;) to extract text from structured and semi-structured documents.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Optical Character Recognition (OCR):&lt;/strong&gt; For scanned PDFs or image-based documents, OCR tools are essential to convert images of text into machine-readable text.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cleaning:&lt;/strong&gt; Removing boilerplate text (headers, footers, navigation), irrelevant metadata, excessive whitespace, or corrupted characters.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Standardization:&lt;/strong&gt; Converting all text to a consistent encoding (e.g., UTF-8) and potentially normalizing capitalization or punctuation.&lt;/li&gt;
&lt;/ul&gt;
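
&lt;p&gt;As a concrete example of the extraction and cleaning steps above, here is a minimal sketch using &lt;code&gt;pypdf&lt;/code&gt; (the maintained successor to &lt;code&gt;PyPDF2&lt;/code&gt;); the boilerplate patterns are illustrative and will differ for your documents:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal ingestion sketch: extract text from a PDF, then apply basic
# cleaning. The boilerplate regexes are illustrative placeholders.
import re

from pypdf import PdfReader

def extract_and_clean(path: str) -&amp;gt; str:
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    text = "\n".join(pages)
    text = re.sub(r"Page \d+ of \d+", "", text)  # strip page footers
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse blank-line runs
    return text.strip()

print(extract_and_clean("employee_handbook.pdf")[:500])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;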

&lt;h4&gt;
  
  
  Chunking Strategy: Breaking Down Knowledge
&lt;/h4&gt;

&lt;p&gt;LLMs have a finite context window – the maximum number of tokens they can process in a single prompt. Enterprise documents can be lengthy, far exceeding these limits. Moreover, sending an entire document for every query is inefficient and often introduces noise. Therefore, the extracted text needs to be broken down into smaller, manageable units called chunks.&lt;/p&gt;

&lt;p&gt;Effective chunking is an art and a science. Poor chunking can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Lost Context:&lt;/strong&gt; If chunks are too small, essential information might be split across multiple chunks, making it difficult for the LLM to understand the complete picture.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Irrelevant Information:&lt;/strong&gt; If chunks are too large, they might contain a lot of irrelevant text, diluting the signal and potentially confusing the LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common chunking strategies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Fixed-Size Chunking:&lt;/strong&gt; Splitting text into chunks of a predefined character or token count (e.g., 500 characters) with a specified overlap (e.g., 50 characters). Overlap helps maintain context across chunk boundaries.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sentence/Paragraph Chunking:&lt;/strong&gt; Splitting text at natural linguistic breaks (sentences, paragraphs). This often results in more semantically coherent chunks than fixed-size methods.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Recursive Character Text Splitter:&lt;/strong&gt; A common approach (found in libraries like LangChain) that attempts to split by paragraphs, then sentences, then words, until chunks fit a specified size, ensuring semantic boundaries are prioritized.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Semantic Chunking:&lt;/strong&gt; A more advanced technique where chunks are created based on semantic similarity. Text is embedded, and then a clustering algorithm or other method identifies natural breaks where the meaning shifts significantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt; Experiment with different chunk sizes and overlap values. A chunk size of 200-1000 tokens with 10-20% overlap is a common starting point, but the optimal values depend heavily on your specific data and use case.&lt;/p&gt;
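
&lt;p&gt;To make the fixed-size strategy concrete, here is a minimal character-based chunker with overlap. It approximates token counts with characters for simplicity; swap in a tokenizer if you need token-accurate sizes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fixed-size chunking with overlap, as described above. Character-based
# for simplicity; a production pipeline would usually count tokens.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -&amp;gt; list[str]:
    if overlap &amp;gt;= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    step = chunk_size - overlap  # each chunk repeats the last `overlap` characters
    while start &amp;lt; len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

doc = "Lorem ipsum dolor sit amet. " * 100  # stand-in for extracted document text
for i, chunk in enumerate(chunk_text(doc)[:3]):
    print(i, len(chunk), repr(chunk[:40]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;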

&lt;h4&gt;
  
  
  Embedding Generation: The Language of Similarity
&lt;/h4&gt;

&lt;p&gt;Once your data is chunked, the next crucial step is to transform each text chunk into a numerical representation called an embedding.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What are Embeddings?&lt;/strong&gt; Embeddings are high-dimensional vectors (lists of numbers, e.g., 1536 dimensions for models like OpenAI's text-embedding-3-small or open-source alternatives) that capture the semantic meaning of text. Texts with similar meanings will have vectors that are numerically 'close' to each other in this high-dimensional space.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;How they are Generated:&lt;/strong&gt; An embedding model (e.g., OpenAI's text-embedding-3-small, various Sentence Transformers models from Hugging Face, Cohere Embed) takes a piece of text as input and outputs its corresponding vector.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Importance:&lt;/strong&gt; Embeddings are the backbone of semantic search. They allow us to move beyond keyword matching and find information based on conceptual similarity. For instance, a query about "remote work policy" could retrieve documents mentioning "telecommuting guidelines" because their embeddings are semantically close.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each chunk of text from your enterprise data is processed by an embedding model, and its resulting vector is stored. This collection of vectors, along with references to their original text chunks, forms the core of your searchable knowledge base.&lt;/p&gt;
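
&lt;p&gt;A minimal embedding sketch, assuming an OpenAI API key is configured in the environment; any embedding provider follows the same text-in, vector-out pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: embed chunks with OpenAI's text-embedding-3-small (mentioned
# above). Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

chunks = [
    "Remote work expenses are reimbursable up to $50/month.",
    "Expense reports are due by the 15th of the following month.",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in resp.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))  # 2, 1536
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;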

&lt;h3&gt;
  
  
  Step 2: Storage &amp;amp; Semantic Search (The Vector DB)
&lt;/h3&gt;

&lt;p&gt;With your enterprise data processed into chunks and vectorized, the next step is to store these embeddings efficiently and enable rapid, accurate semantic search. This is the domain of the Vector Database.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Role of a Vector Database
&lt;/h4&gt;

&lt;p&gt;A vector database is purpose-built for storing, indexing, and querying high-dimensional vectors. Unlike traditional relational databases that excel at structured queries (e.g., &lt;code&gt;SELECT * FROM users WHERE age &amp;gt; 30&lt;/code&gt;), vector databases specialize in 'similarity search' – finding vectors that are numerically closest to a given query vector.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Semantic Search Works
&lt;/h4&gt;

&lt;p&gt;When a user submits a query (e.g., "How do I request time off?"):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Query Embedding:&lt;/strong&gt; The user's query is first sent to the &lt;em&gt;same embedding model&lt;/em&gt; that was used to embed your enterprise data chunks. This transforms the natural language query into a query vector.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vector Similarity Search:&lt;/strong&gt; The query vector is then sent to the vector database. The database's indexing algorithms (e.g., Hierarchical Navigable Small Worlds (HNSW), Inverted File Index (IVF), Locality-Sensitive Hashing (LSH)) efficiently compare the query vector to all stored document chunk vectors.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Distance Metrics:&lt;/strong&gt; This comparison typically uses distance metrics like:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Cosine Similarity:&lt;/strong&gt; Measures the cosine of the angle between two vectors. A value of 1 indicates identical direction (perfect similarity), 0 indicates orthogonality (no similarity), and -1 indicates opposite direction.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Euclidean Distance:&lt;/strong&gt; Measures the straight-line distance between two points in Euclidean space. Smaller distance implies greater similarity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vector database returns the 'top-K' most similar document chunk vectors, where 'K' is a configurable parameter (e.g., retrieve the 5 most relevant chunks).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Retrieval of Original Text:&lt;/strong&gt; Along with the similar vectors, the vector database also retrieves the original text content of the corresponding chunks.&lt;/li&gt;
&lt;/ol&gt;
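
&lt;p&gt;The sketch below shows the core of this flow in brute-force form with NumPy: score every stored chunk vector against the query vector by cosine similarity and return the top-K indices. This is the linear scan that a vector database's ANN indexes (HNSW, IVF) exist to avoid at scale:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Brute-force top-K semantic search with cosine similarity. A vector DB
# replaces this linear scan with an ANN index such as HNSW or IVF.
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -&amp;gt; np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per chunk
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar chunks

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 1536))  # stand-in for stored chunk embeddings
query_vec = rng.normal(size=1536)         # stand-in for the embedded query
print(top_k(query_vec, doc_vecs, k=5))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;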

&lt;h4&gt;
  
  
  Popular Vector Database Options
&lt;/h4&gt;

&lt;p&gt;The choice of vector database depends on factors like scale, latency requirements, deployment model (managed vs. self-hosted), and ecosystem integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Managed Services:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pinecone:&lt;/strong&gt; A cloud-native, fully managed vector database known for its scalability and ease of use.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Weaviate:&lt;/strong&gt; An open-source, cloud-native vector database that also offers a managed service, supporting GraphQL and semantic search.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Qdrant:&lt;/strong&gt; Another open-source vector search engine, available as self-hosted or managed, known for its speed and advanced filtering capabilities.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Self-Hosted/Open Source:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Milvus:&lt;/strong&gt; A widely adopted open-source vector database designed for massive-scale vector similarity search.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Chroma:&lt;/strong&gt; A lightweight, easy-to-use open-source embedding database, great for local development and smaller-scale applications.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;pgvector:&lt;/strong&gt; An extension for PostgreSQL that enables efficient vector similarity search directly within a relational database. Excellent for scenarios where you want to keep your vector data alongside your existing structured data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
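
&lt;p&gt;For a feel of the developer experience, here is a quickstart sketch with Chroma, the lightweight option listed above. Chroma embeds documents with a built-in default model unless you supply your own embeddings, which in a real pipeline should match the model used everywhere else:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Quickstart sketch with Chroma: add chunks, then run a semantic query.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("enterprise-docs")

collection.add(
    ids=["policy-1", "policy-2"],
    documents=[
        "Remote work expenses are reimbursable up to $50/month.",
        "Expense reports are due by the 15th of the following month.",
    ],
)

results = collection.query(
    query_texts=["How do I claim home office costs?"],
    n_results=1,
)
print(results["documents"][0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;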

&lt;h4&gt;
  
  
  Advanced Retrieval Strategies
&lt;/h4&gt;

&lt;p&gt;Simple top-K retrieval is a good start, but for complex enterprise data, more sophisticated strategies can enhance relevance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Re-ranking:&lt;/strong&gt; After an initial retrieval of, say, 20 chunks, a smaller, more powerful re-ranking model (often a cross-encoder or a specialized LLM) can evaluate the relevance of these chunks more deeply against the query and re-order them, selecting the absolute best 'K' for the LLM.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hybrid Search:&lt;/strong&gt; Combining semantic (vector) search with traditional keyword-based search (e.g., BM25) can provide a more robust retrieval system. Keyword search excels at finding exact matches or rare terms, while semantic search handles conceptual understanding.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-query Retrieval:&lt;/strong&gt; Generating multiple slightly different queries from the original user query (e.g., using an LLM) and running parallel searches to broaden the retrieval scope.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Contextual Compression:&lt;/strong&gt; Filtering or summarizing retrieved documents to only include the most relevant sentences or paragraphs, reducing noise and optimizing token usage for the LLM.&lt;/li&gt;
&lt;/ul&gt;
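
&lt;p&gt;Re-ranking is the easiest of these to bolt onto an existing pipeline. Here is a sketch using a cross-encoder from the &lt;code&gt;sentence-transformers&lt;/code&gt; library; the model name is one commonly used public checkpoint, not the only choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Re-ranking sketch: score an initial candidate set with a cross-encoder
# and keep only the best results for the LLM's context.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I request time off?"
candidates = [
    "The PTO policy lets employees request time off through the HR portal.",
    "Quarterly revenue grew 12% year over year.",
    "Submit leave requests at least two weeks in advance.",
]

# The cross-encoder reads query and document together, so it judges
# relevance more accurately than embedding distance alone.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked[:2]:  # keep the top 2 for the prompt context
    print(f"{score:.3f}  {doc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;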

&lt;h3&gt;
  
  
  Step 3: Generation (The Prompt Context)
&lt;/h3&gt;

&lt;p&gt;This is the final stage where the LLM synthesizes an answer, critically informed by the context retrieved from your vector database.&lt;/p&gt;

&lt;h4&gt;
  
  
  Constructing the Augmented Prompt
&lt;/h4&gt;

&lt;p&gt;The core idea here is to inject the retrieved document chunks directly into the LLM's prompt. This creates an 'augmented prompt' that provides the LLM with all the necessary information to answer the user's question accurately and without hallucination.&lt;/p&gt;

&lt;p&gt;A typical augmented prompt, together with the minimal pipeline that assembles it, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Placeholder for a simplified LangChain-like RAG snippet
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the LLM (using a sample configuration)
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
&lt;/span&gt;
&lt;span class="c1"&gt;# A simple retriever mock for demonstration. In a real RAG system, this would
# embed the question, query a vector DB, and return Document objects.
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MockRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_relevant_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# In a real scenario, this would query the vector DB
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote work expenses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s remote work expense policy allows reimbursement for internet and utilities up to $50/month.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employees must submit expense reports by the 15th of the following month for remote work related costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No specific information found on that topic in the internal knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;mock_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MockRetriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the prompt template
# This template instructs the LLM on its role and how to use the provided context.
&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an expert assistant for a large enterprise.
Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question based *only* on the provided context.
If the answer cannot be found in the context, politely state that you do not have enough information.

Context:
{context}

Question:
{question}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Format retrieved documents into a single context string
# This is crucial: the retriever returns Document objects, but the prompt expects a formatted string.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Serialize retrieved documents into a single context string.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Define the RAG chain (using LangChain's Runnable interface for clarity)
# The 'context' key is populated by the retriever and formatted into a string, 
# and 'question' by the user's input.
&lt;/span&gt;&lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mock_retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_relevant_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;  &lt;span class="c1"&gt;# Your initialized LLM instance goes here (e.g., ChatOpenAI model above)
&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Invoke the chain with a user query
# from langchain_openai import ChatOpenAI # Example LLM initialization
# llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
# response = rag_chain.invoke({"question": "What is the policy for remote work expenses?"})
# print(response)
# This should print something like: "The company's remote work expense policy allows reimbursement for internet and utilities up to $50/month. Employees must submit expense reports by the 15th of the following month for remote work related costs."
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key elements of the prompt template:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;System Message/Role:&lt;/strong&gt; Sets the persona and instructions for the LLM (e.g., "You are an expert assistant...").&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Context Placeholder (&lt;code&gt;{context}&lt;/code&gt;):&lt;/strong&gt; This is where the retrieved document chunks are inserted. It's crucial to clearly delineate the context from the actual question.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instruction for Context Usage:&lt;/strong&gt; Explicitly telling the LLM to &lt;em&gt;only&lt;/em&gt; use the provided context and to state if the answer is not found is vital to prevent hallucination.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Question Placeholder (&lt;code&gt;{question}&lt;/code&gt;):&lt;/strong&gt; The user's original query.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  LLM Interaction and Synthesis
&lt;/h4&gt;

&lt;p&gt;Once the augmented prompt is constructed, it is sent to the chosen LLM (e.g., GPT-4 Turbo, Claude 3.5 Sonnet, or open-source alternatives like Llama 3). The LLM then processes this entire prompt, using the provided context to formulate a relevant and accurate answer. Because the context is explicitly given, the LLM acts more like a sophisticated summarizer and question-answering system over the provided text, rather than generating from its internal, general knowledge.&lt;/p&gt;

&lt;p&gt;This final step ensures that the LLM's response is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Grounded:&lt;/strong&gt; Directly supported by the retrieved enterprise data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Relevant:&lt;/strong&gt; Addresses the user's specific query.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Accurate:&lt;/strong&gt; Minimizes hallucination by constraining the LLM's generation to the facts presented in the context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following this three-step pipeline, enterprises can transform generic LLMs into powerful, domain-specific AI assistants that deliver reliable and actionable intelligence from their most valuable data assets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls in RAG Engineering
&lt;/h2&gt;

&lt;p&gt;While RAG offers a powerful solution, its effective implementation requires careful consideration and engineering rigor. Several common pitfalls can undermine the performance and reliability of a RAG system if not addressed proactively.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Suboptimal Chunking Strategies
&lt;/h3&gt;

&lt;p&gt;As discussed, chunking is foundational, and mistakes here cascade through the entire pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Chunks that are too small:&lt;/strong&gt; If chunks are excessively granular (e.g., single sentences), they might lack sufficient context to be meaningful on their own. The semantic meaning required to answer a complex question could be fragmented across multiple disparate chunks, making retrieval difficult or incomplete.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Chunks that are too large:&lt;/strong&gt; Conversely, chunks that are too long introduce noise. They might contain a lot of irrelevant information alongside the relevant bits, diluting the signal for the embedding model and increasing the chances of retrieving less precise context. Large chunks also consume more tokens in the LLM's context window, increasing inference cost and potentially hitting context limits prematurely.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Poor Overlap:&lt;/strong&gt; Insufficient overlap between sequential chunks can lead to critical information being split precisely at the boundary, making it hard for retrieval to capture the complete idea.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Experimentation is key. Develop an evaluation pipeline to test different chunk sizes, overlap strategies, and chunking methods (e.g., fixed-size vs. recursive vs. semantic) against a diverse set of representative queries. Consider specialized chunking based on document structure (e.g., splitting by headings, sections in a PDF). For highly structured data, consider 'parent-child' or 'summary' chunking where smaller chunks are linked to larger, more contextual parent chunks or summaries for different retrieval stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Irrelevant or Insufficient Retrieval
&lt;/h3&gt;

&lt;p&gt;Even with good chunking, the retriever component can fail to provide the LLM with the optimal context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Poor Embedding Model Choice:&lt;/strong&gt; Not all embedding models are created equal, and some perform better on specific domains or languages. Using a generic embedding model for highly specialized enterprise terminology might lead to embeddings that don't accurately capture semantic similarity, resulting in irrelevant retrievals.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Noisy or Low-Quality Data in Vector DB:&lt;/strong&gt; If your ingested data contains outdated, contradictory, or simply poorly written information, the vector database will retrieve it, and the LLM will struggle to synthesize a coherent, accurate answer. 'Garbage in, garbage out' applies acutely here.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Suboptimal &lt;code&gt;k&lt;/code&gt; Value:&lt;/strong&gt; Retrieving too few chunks (&lt;code&gt;k&lt;/code&gt; is too low) might mean missing critical pieces of information. Retrieving too many chunks (&lt;code&gt;k&lt;/code&gt; is too high) introduces irrelevant information into the LLM's context, potentially confusing it or causing it to misinterpret the core question.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Embedding Model Evaluation:&lt;/strong&gt; Test different embedding models for your specific domain. Consider fine-tuning an open-source embedding model on your proprietary data if off-the-shelf options underperform.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Quality Management:&lt;/strong&gt; Implement robust data cleansing, deduplication, and versioning strategies for your source documents. Only ingest high-quality, current, and relevant data into your RAG knowledge base.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Advanced Retrieval Techniques:&lt;/strong&gt; Employ re-ranking models to refine the initial top-K results. Utilize hybrid search (keyword + vector) to capture both exact matches and semantic similarity. Explore multi-query strategies to generate a more comprehensive set of retrieved documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Latency Issues
&lt;/h3&gt;

&lt;p&gt;RAG introduces additional steps in the query processing pipeline, which can impact response times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Slow Query Embedding:&lt;/strong&gt; Converting the user's query into a vector can take time, especially if the embedding model is large or running on under-provisioned hardware.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slow Vector Database Lookups:&lt;/strong&gt; As the size of your vector database grows (millions or billions of vectors), similarity search can become a bottleneck if indexing is inefficient or the database is not properly scaled.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LLM Inference Latency:&lt;/strong&gt; Even with optimized context, the LLM's generation step can be slow, especially for larger, more capable models (e.g., GPT-4) or for very long responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Optimize Embedding Models:&lt;/strong&gt; Choose embedding models that balance performance and accuracy. For query embedding, consider smaller, faster models if acceptable. Implement caching for frequently asked questions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Vector DB Optimization:&lt;/strong&gt; Ensure your vector database is correctly indexed (e.g., using HNSW or IVF) and adequately resourced. Explore cloud-native managed vector databases that handle scalability automatically. Consider sharding your vector index for very large datasets.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;LLM Choice and Optimization:&lt;/strong&gt; Select an LLM that meets your latency and quality requirements. For internal applications where cost and speed are paramount, smaller open-source models might be preferable to larger, more expensive cloud models. Implement streaming responses from the LLM where possible to improve perceived latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Prompt Engineering Failures
&lt;/h3&gt;

&lt;p&gt;Even with perfect retrieval, a poorly constructed prompt can lead to suboptimal LLM responses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Vague or Ambiguous Instructions:&lt;/strong&gt; If the prompt doesn't clearly define the LLM's role, desired output format, or constraints, the LLM might deviate from expectations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Failure to Constrain to Context:&lt;/strong&gt; Forgetting to explicitly instruct the LLM to &lt;em&gt;only&lt;/em&gt; use the provided context (e.g., "Answer only from the context provided. If the answer is not in the context, state that you don't know.") is a common mistake that reintroduces hallucination risk.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Context Window Overflow:&lt;/strong&gt; If the combined length of the prompt, retrieved chunks, and the expected response exceeds the LLM's maximum context window, the model will truncate the input, leading to incomplete or erroneous answers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Clear and Concise System Prompts:&lt;/strong&gt; Define the LLM's persona and task unambiguously. Use clear delimiters for context and questions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Explicit Guardrails:&lt;/strong&gt; Always include instructions to strictly adhere to the provided context and to admit when information is not available.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dynamic Context Management:&lt;/strong&gt; Implement logic to truncate or summarize retrieved chunks if their combined length approaches the LLM's context window limit. Prioritize the most relevant chunks in such scenarios. Evaluate the impact of different context lengths on LLM performance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Few-Shot Examples:&lt;/strong&gt; For specific response formats or nuanced tasks, providing one or two examples within the prompt can guide the LLM more effectively.&lt;/li&gt;
&lt;/ul&gt;
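
&lt;p&gt;As an illustration of dynamic context management, the sketch below packs retrieved chunks (assumed sorted most-relevant-first) into a fixed token budget using &lt;code&gt;tiktoken&lt;/code&gt;; the budget and encoding name are example values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: pack retrieved chunks into a token budget so the augmented
# prompt never overflows the LLM's context window. Chunks are assumed
# to be sorted most-relevant-first, so we drop from the tail.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI models

def pack_context(chunks: list[str], budget: int = 3000) -&amp;gt; str:
    selected, used = [], 0
    for chunk in chunks:
        cost = len(enc.encode(chunk))
        if used + cost &amp;gt; budget:
            break  # everything after this is less relevant; stop here
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

chunks = ["first, most relevant chunk ...", "second chunk ...", "third chunk ..."]
print(pack_context(chunks, budget=3000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;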

&lt;p&gt;Addressing these common pitfalls requires a holistic approach, combining careful data engineering, robust infrastructure, and iterative prompt design. Continuous monitoring and evaluation are essential to ensure your RAG system consistently delivers accurate and performant results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;The journey from generic LLMs to powerful, domain-specific AI applications for enterprise data is fundamentally paved by Retrieval-Augmented Generation. RAG architecture is not merely an enhancement; it is a transformative paradigm that addresses the core limitations of pre-trained LLMs – their knowledge cutoff and propensity for hallucination – making them truly viable for critical business functions.&lt;/p&gt;

&lt;p&gt;By systematically ingesting and chunking proprietary data, transforming it into semantically rich embeddings, storing it in high-performance vector databases, and then intelligently augmenting LLM prompts with retrieved context, enterprises can unlock unprecedented capabilities. RAG offers a cost-effective, agile, and scalable alternative to expensive model fine-tuning, allowing organizations to keep their AI systems current with rapidly evolving internal knowledge.&lt;/p&gt;

&lt;p&gt;This article has provided a comprehensive technical blueprint, detailing the motivations, core components, and common challenges in engineering a robust RAG pipeline. The principles outlined here – from meticulous data preparation and strategic chunking to efficient vector search and precise prompt engineering – are the bedrock of successful RAG implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Build Your First RAG Application?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Explore Frameworks:&lt;/strong&gt; Dive into open-source frameworks like &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; and &lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt;. These libraries provide high-level abstractions for building RAG pipelines, simplifying integration with various LLMs, embedding models, and vector databases.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Experiment with Vector Databases:&lt;/strong&gt; Set up a local instance of &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt; or &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; to get hands-on experience, or explore managed services like &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; for scalability.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Start Small, Iterate Fast:&lt;/strong&gt; Begin with a small, manageable dataset from your enterprise. Focus on getting a basic RAG pipeline operational, then iteratively refine your chunking, retrieval, and prompt strategies based on real-world queries and evaluation metrics.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Continuous Learning:&lt;/strong&gt; The RAG landscape is evolving rapidly. Stay updated with the latest research in retrieval techniques, embedding models, and multi-modal RAG. Consider exploring advanced topics like agentic RAG, where LLMs can dynamically decide when and how to retrieve information.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG empowers you to transform LLMs from generalists into trusted, domain-expert collaborators, enabling your enterprise to harness the full potential of generative AI with confidence and accuracy. The future of enterprise AI is augmented, and RAG is your blueprint to building it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feedback &amp;amp; Community
&lt;/h2&gt;

&lt;p&gt;We believe in transparent, community-driven content creation. This article was generated using the &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi Dashboard&lt;/a&gt; – our advanced longform content generation platform – and has been thoroughly reviewed and refined by our engineering team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have feedback on this article?&lt;/strong&gt; We'd love to hear your thoughts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leave a comment below or email us at &lt;a href="mailto:hello@ozigi.app"&gt;hello@ozigi.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Share your RAG architecture experiences and learnings with our community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interested in building your own enterprise AI content?&lt;/strong&gt; Longform article generation is available to users on the Organization tier, limited to 5 articles per day. &lt;a href="https://ozigi.app/pricing" rel="noopener noreferrer"&gt;Check our pricing details&lt;/a&gt; to learn more about what Ozigi can do for your content strategy.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Building a Robust Webhook Handler in Node.js: Validation, Queuing, and Retry Logic</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:50:28 +0000</pubDate>
      <link>https://forem.com/dumebii/building-a-robust-webhook-handler-in-nodejs-validation-queuing-and-retry-logic-2fb6</link>
      <guid>https://forem.com/dumebii/building-a-robust-webhook-handler-in-nodejs-validation-queuing-and-retry-logic-2fb6</guid>
      <description>&lt;p&gt;Webhooks are everywhere. &lt;a href="https://stripe.com/docs/webhooks" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt; fires one when a payment succeeds. &lt;a href="https://docs.github.com/en/webhooks" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; fires one when a PR is merged. &lt;a href="https://www.twilio.com/docs/usage/webhooks" rel="noopener noreferrer"&gt;Twilio&lt;/a&gt; fires one when an SMS lands. And when your handler is flaky — when it misses events, fails silently, or chokes under load — you lose data and trust.&lt;/p&gt;

&lt;p&gt;Most tutorials show you how to receive a webhook. Few show you how to handle it &lt;em&gt;properly&lt;/em&gt;. This article covers the full picture: signature validation, idempotency, async queuing, and retry logic with exponential backoff.&lt;/p&gt;

&lt;p&gt;We'll use Node.js and Express throughout, with no external queue infrastructure required. &lt;strong&gt;One important caveat up front:&lt;/strong&gt; the queuing approach in this article is designed for a single, long-lived Node.js process. If you're running on serverless functions (Lambda, Cloud Run) or horizontally scaled deployments with multiple instances, in-memory queues are not reliable — skip ahead to the When to Upgrade section for the right tool in those cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fake webhook senders&lt;/td&gt;
&lt;td&gt;HMAC-SHA256 signature verification with &lt;code&gt;timingSafeEqual&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow handlers timing out&lt;/td&gt;
&lt;td&gt;Acknowledge &lt;code&gt;200&lt;/code&gt; immediately, process async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cascading failures&lt;/td&gt;
&lt;td&gt;In-process queue with concurrency limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transient errors&lt;/td&gt;
&lt;td&gt;Exponential backoff with jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate events&lt;/td&gt;
&lt;td&gt;Idempotency keys via Set or Redis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A webhook handler that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validates&lt;/strong&gt; the request signature (so only legitimate senders get through)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acknowledges fast&lt;/strong&gt; (returns &lt;code&gt;200&lt;/code&gt; immediately, does the work async)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queues events&lt;/strong&gt; in-process so the work doesn't block the HTTP layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries failures&lt;/strong&gt; with exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles duplicates&lt;/strong&gt; with idempotency keys&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step 1: Signature Validation
&lt;/h2&gt;

&lt;p&gt;Never trust an incoming webhook without verifying it came from who you think it came from. Most webhook providers (&lt;a href="https://stripe.com/docs/webhooks/signature-verification" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt;, &lt;a href="https://docs.github.com/en/webhooks/using-webhooks/validating-webhook-deliveries" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://shopify.dev/docs/apps/build/webhooks/secure/validate-webhooks" rel="noopener noreferrer"&gt;Shopify&lt;/a&gt;) sign their payloads using &lt;a href="https://en.wikipedia.org/wiki/HMAC" rel="noopener noreferrer"&gt;HMAC-SHA256&lt;/a&gt; with a shared secret.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy99q3gf1iy1xx022bw5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy99q3gf1iy1xx022bw5n.png" alt="Webhook pipeline flow — from incoming request through validation, queuing, handling, retry and dead letter" width="800" height="462"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Use timingSafeEqual to prevent timing attacks&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expectedBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`sha256=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;signatureBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;expectedBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;signatureBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timingSafeEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;expectedBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signatureBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;a href="https://nodejs.org/api/crypto.html#cryptotimingsafeequala-b" rel="noopener noreferrer"&gt;&lt;code&gt;timingSafeEqual&lt;/code&gt;&lt;/a&gt;?&lt;/strong&gt; A simple &lt;code&gt;===&lt;/code&gt; check leaks timing information: an attacker can brute-force a signature byte by byte by measuring how long the comparison takes. &lt;code&gt;timingSafeEqual&lt;/code&gt; always takes the same amount of time regardless of where the inputs differ. The length check before it matters too, because &lt;code&gt;timingSafeEqual&lt;/code&gt; throws when the buffers have different lengths (length is not a secret here, so an early &lt;code&gt;false&lt;/code&gt; is safe).&lt;/p&gt;

&lt;p&gt;Now wire it into &lt;a href="https://expressjs.com" rel="noopener noreferrer"&gt;Express&lt;/a&gt;. A critical detail: you need the &lt;strong&gt;raw body&lt;/strong&gt; for HMAC validation, not the parsed JSON. Express's &lt;a href="https://expressjs.com/en/api.html#express.json" rel="noopener noreferrer"&gt;&lt;code&gt;json()&lt;/code&gt; middleware&lt;/a&gt; strips the raw body by default — use &lt;a href="https://expressjs.com/en/api.html#express.raw" rel="noopener noreferrer"&gt;&lt;code&gt;express.raw()&lt;/code&gt;&lt;/a&gt; on the webhook route instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Store raw body before parsing&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/webhook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/webhook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-hub-signature-256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// GitHub format&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rawBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Buffer, because of express.raw()&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Invalid signature&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Acknowledge immediately — do the work async&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key discipline here: &lt;strong&gt;acknowledge before you process&lt;/strong&gt;. If your business logic takes 2 seconds and the sender has a 1-second timeout, you'll get duplicate events.&lt;/p&gt;
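
&lt;p&gt;To check the wiring end to end, a quick local test sender can sign a payload the same way the server verifies it. A sketch, assuming Node 18+ (for the global &lt;code&gt;fetch&lt;/code&gt;), the &lt;code&gt;sha256=&amp;lt;hex&amp;gt;&lt;/code&gt; header format used above, and a server listening on port 3000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// test-sender.js: sign a sample payload and post it to the local handler
const crypto = require('crypto');

const secret = process.env.WEBHOOK_SECRET || 'test-secret';
const payload = JSON.stringify({
  id: 'evt_123',
  type: 'payment.succeeded',
  data: { orderId: 42 },
});

const signature = 'sha256=' +
  crypto.createHmac('sha256', secret).update(payload, 'utf8').digest('hex');

fetch('http://localhost:3000/webhook', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-hub-signature-256': signature,
  },
  body: payload,
}).then((res) =&amp;gt; console.log('Status:', res.status));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tamper with the payload or drop the header and the handler should answer &lt;code&gt;401&lt;/code&gt; instead.&lt;/p&gt;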




&lt;h2&gt;
  
  
  Step 2: An In-Process Job Queue
&lt;/h2&gt;

&lt;p&gt;You don't always need &lt;a href="https://redis.io" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; or &lt;a href="https://bullmq.io" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt; for a job queue. For a &lt;strong&gt;single, persistent Node.js process&lt;/strong&gt;, an in-process queue with controlled concurrency is enough — and it's simpler to reason about.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Limitations to understand before using this pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jobs are lost on restart.&lt;/strong&gt; If your process crashes or is redeployed while events are queued, those jobs disappear silently. There is no persistence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not shared across instances.&lt;/strong&gt; If you run multiple server instances (behind a load balancer, in a cluster, or in any horizontally scaled setup), each instance has its own queue. Events are not distributed or deduplicated across them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If either of those constraints is a problem for your use case, go straight to a real queue like &lt;a href="https://bullmq.io" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt; or &lt;a href="https://aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;AWS SQS&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WebhookQueue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// pick up the next job&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Retrying event &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms (attempt &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Event &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed after &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; attempts`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Send to dead-letter store, alert, etc.&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Exponential backoff with jitter&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebhookQueue&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;backoff&lt;/code&gt; method uses &lt;strong&gt;exponential backoff with jitter&lt;/strong&gt;. Without jitter, all retrying jobs fire at the same moment and create a &lt;a href="https://en.wikipedia.org/wiki/Thundering_herd_problem" rel="noopener noreferrer"&gt;thundering herd&lt;/a&gt;. Adding a random jitter spreads the load. See &lt;a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/" rel="noopener noreferrer"&gt;AWS's writeup on backoff and jitter&lt;/a&gt; for a deeper look at why this matters at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s98kqkanagh7up76s08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s98kqkanagh7up76s08.png" alt="Exponential backoff with jitter — delay per retry attempt" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: The Event Handler
&lt;/h2&gt;

&lt;p&gt;This is where your actual business logic lives. Keep it focused — one function per event type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;payment.succeeded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handlePaymentSucceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user.created&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Unhandled event type: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handlePaymentSucceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// e.g., upgrade account, send receipt, update DB&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;paid&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;emailService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendReceipt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customerEmail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
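
&lt;p&gt;If the &lt;code&gt;switch&lt;/code&gt; grows unwieldy, one alternative (a sketch, not anything the pattern requires) is a plain object that maps event types to handlers, which keeps dispatch declarative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Declarative dispatch: one map entry per event type
const handlers = {
  'payment.succeeded': handlePaymentSucceeded,
  'user.created': handleUserCreated,
};

async function handleEvent(event) {
  const handler = handlers[event.type];
  if (!handler) {
    console.log(`Unhandled event type: ${event.type}`);
    return;
  }
  await handler(event.data);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;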






&lt;h2&gt;
  
  
  Step 4: Idempotency
&lt;/h2&gt;

&lt;p&gt;Webhook senders &lt;em&gt;will&lt;/em&gt; send duplicates. Network timeouts, retries on their end, and at-least-once delivery guarantees mean you'll see the same event ID more than once.&lt;/p&gt;

&lt;p&gt;Your handler needs to be &lt;strong&gt;idempotent&lt;/strong&gt; — processing the same event twice should have the same effect as processing it once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processedEvents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Use Redis in production&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;processedEvents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Skipping duplicate event: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;processedEvents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ... your handlers&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, replace the in-memory &lt;code&gt;Set&lt;/code&gt; with a &lt;a href="https://redis.io" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; &lt;code&gt;SET NX EX&lt;/code&gt; call via &lt;a href="https://github.com/redis/ioredis" rel="noopener noreferrer"&gt;ioredis&lt;/a&gt; so idempotency survives process restarts. One caveat to note: both versions claim the event ID before processing, so a job that fails and is re-enqueued by the retry queue will be skipped as a duplicate; a fix is sketched after the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ioredis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;isAlreadyProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;eventId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// SET key value NX EX seconds&lt;/span&gt;
  &lt;span class="c1"&gt;// NX = only set if not exists; EX = expire after 24h&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`event:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;eventId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;NX&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EX&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// null means the key already existed&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;isAlreadyProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// process...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
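
&lt;p&gt;If failed jobs must remain retryable, one option (a sketch under the same ioredis assumptions) is to claim the key up front but release it when processing throws, so the queue's retry is not skipped as a duplicate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;async function handleEvent(event) {
  if (await isAlreadyProcessed(event.id)) return;

  try {
    // ... process the event ...
  } catch (err) {
    // Release the claim so the retried job is not treated as a duplicate
    await client.del(`event:${event.id}`);
    throw err; // let the queue's retry logic take over
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;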






&lt;h2&gt;
  
  
  Step 5: Putting It All Together
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/webhook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="c1"&gt;// --- Signature verification ---&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expectedBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`sha256=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sigBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;expectedBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;sigBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timingSafeEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;expectedBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sigBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// --- Queue ---&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WebhookQueue&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;concurrency&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;running&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Dead letter: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebhookQueue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// --- Idempotency ---&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// --- Handler ---&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing event: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// your business logic here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// --- Route ---&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/webhook&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-hub-signature-256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unauthorized&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// acknowledge immediately&lt;/span&gt;
  &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Webhook server listening on :3000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  When to Upgrade to a Real Queue
&lt;/h2&gt;

&lt;p&gt;The in-process queue above is acceptable for &lt;strong&gt;a single persistent process with moderate throughput&lt;/strong&gt; — think a low-traffic internal tool or a side project where restarts are rare and you run one instance. You'll want to graduate to &lt;a href="https://bullmq.io" rel="noopener noreferrer"&gt;BullMQ&lt;/a&gt; (Redis-backed) or &lt;a href="https://aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;AWS SQS&lt;/a&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running &lt;strong&gt;multiple server instances&lt;/strong&gt; (in-process state won't be shared)&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;event history&lt;/strong&gt; and visibility into failed jobs&lt;/li&gt;
&lt;li&gt;Your event volume exceeds a few hundred per minute consistently&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;scheduled retries&lt;/strong&gt; that survive process restarts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The good news: the handler logic above (&lt;code&gt;handleEvent&lt;/code&gt;, idempotency, backoff) carries over directly. You're just swapping the queue substrate.&lt;/p&gt;
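
&lt;p&gt;To make the swap concrete, here's a minimal sketch of what the same pipeline could look like on BullMQ. Treat it as illustrative, not a drop-in file: it assumes a local Redis instance and reuses the &lt;code&gt;handleEvent&lt;/code&gt; function from above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Replaces the in-process WebhookQueue
const webhookQueue = new Queue('webhooks', { connection });

// Replaces queue.enqueue(event) in the route handler
async function enqueue(event: { id: string; type: string }) {
  await webhookQueue.add('webhook-event', event, {
    jobId: event.id,                                // idempotency: a duplicate id is ignored while the job lives
    attempts: 5,                                    // replaces the manual attempts counter
    backoff: { type: 'exponential', delay: 1000 },  // replaces the hand-rolled backoff math
    removeOnComplete: true,
  });
}

// Replaces the drain() loop; retries survive restarts because jobs live in Redis
new Worker('webhooks', async (job) =&gt; handleEvent(job.data), { connection });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;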

&lt;p&gt;Webhooks are one of those things that look simple until they aren't. Getting these five concerns right means you can receive events reliably at scale — without losing data, without duplicating side effects, and without taking down your server under a burst of retries.&lt;/p&gt;

&lt;p&gt;If you're building something that relies on real-time event delivery, these patterns are worth getting right from the start.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your webhook setup look like? Drop a comment — especially if you've found a gotcha I haven't covered.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>node</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Your Social Media Content Marketing is Failing. Here's Why</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:30:00 +0000</pubDate>
      <link>https://forem.com/dumebii/your-launch-post-got-4-likes-your-product-deserved-better-hmb</link>
      <guid>https://forem.com/dumebii/your-launch-post-got-4-likes-your-product-deserved-better-hmb</guid>
      <description>&lt;p&gt;I'll open this article with my own experience, retold.&lt;/p&gt;

&lt;p&gt;You've spent six weeks building something real. You merged the final PR at 11pm on a Thursday. You pushed to production. You watched the deployment logs scroll clean. And then you did what every builder does: you opened Twitter, typed something like &lt;em&gt;"Just shipped [thing]. Super excited to share this with everyone 🚀"&lt;/em&gt;, hit post, and went to bed.&lt;/p&gt;

&lt;p&gt;You woke up to four likes. Two of them were your teammates.&lt;/p&gt;

&lt;p&gt;The product was solid. The problem it solved was real. But the post? The post was invisible.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody tells you when you're deep in the build: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;shipping is only half the work.&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other half is making people care. And most technical founders, developers, and DevRel professionals are running that half on empty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08fshagxk47lhwnzn2rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08fshagxk47lhwnzn2rx.png" alt="chat vs ozigi" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap Between Building and Being Seen
&lt;/h2&gt;

&lt;p&gt;There's a particular kind of frustration that lives in technical communities: the frustration of people who are genuinely doing interesting things and can't seem to get traction on any of it.&lt;/p&gt;

&lt;p&gt;It's not imposter syndrome. It's just a distribution problem.&lt;/p&gt;

&lt;p&gt;The builders who get seen aren't always the ones building better things, sadly. They're just the ones better at translating what they build into content that lands. Content that makes someone stop mid-scroll and think "wait, this is exactly my problem," or "this is a painpoint I have."&lt;/p&gt;

&lt;p&gt;That translation layer is what most technical people skip, rush, or outsource badly.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://stateofdeveloperrelations.com" rel="noopener noreferrer"&gt;2024 State of DevRel report&lt;/a&gt; found that content creation consistently ranks as one of the top three time drains for developer advocates. This is not because they don't know what to write, but because the gap between "having something worth saying" and "saying it in a way that resonates" is a lot wider than most people expect.&lt;/p&gt;

&lt;p&gt;For founders, it's worse. You're building, selling, hiring, and doing customer calls, and somewhere in that schedule, you're supposed to be producing thought leadership content that grows your personal brand and drives top-of-funnel awareness. It rarely happens at the level it should.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Your Regular AI Doesn't Work
&lt;/h2&gt;

&lt;p&gt;The obvious answer is AI. You paste your notes into ChatGPT, ask it to write a LinkedIn post, and get something back that technically covers the topic. You post it, and nothing happens. No traction.&lt;/p&gt;

&lt;p&gt;It wasn't that the output was wrong. It was just generic. And generic content in technical communities doesn't just underperform; it actively damages credibility.&lt;/p&gt;

&lt;p&gt;Developers, content folks and DevRel professionals are some of the most discerning readers on the internet. They can spot templated, buzzword-heavy content in seconds. The moment a post opens with &lt;em&gt;"In today's fast-paced digital landscape"&lt;/em&gt; or promises to &lt;em&gt;"delve into the nuances"&lt;/em&gt; of anything, it's already dead on arrival.&lt;/p&gt;

&lt;p&gt;The problem isn't that AI tools can't write. It's just that most of them default to the statistical mean of their training data, which is saturated with corporate documentation, SEO copy, and marketing fluff. The output sounds like everybody. It sounds like nobody in particular.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkolocxej7ycug3jx36vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkolocxej7ycug3jx36vq.png" alt="statiscal mean" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you need isn't just generated content. You need generated content that sounds like you: content written with your specific technical depth, your actual voice, your real opinion.&lt;/p&gt;

&lt;p&gt;Tools like &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; approach this differently. Instead of asking the AI to "write professionally" (a soft suggestion it ignores), Ozigi enforces a hard blocklist of AI-default vocabulary at the API level (words like &lt;em&gt;delve, robust, seamlessly, tapestry&lt;/em&gt;), forcing the model to construct sentences from your actual content rather than padding with filler. The output reads less like a press release and more like a Slack message from someone who actually built the thing. You can read exactly how that system works in the &lt;a href="https://ozigi.app/docs/the-banned-lexicon" rel="noopener noreferrer"&gt;Banned Lexicon deep dive&lt;/a&gt;.&lt;/p&gt;
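
&lt;p&gt;For a rough sense of what generation-layer enforcement can look like, here's a sketch. To be clear, this is illustrative and not Ozigi's production code (Ozigi documents its blocklist as a system-prompt-level constraint); the idea is simply to pair the prompt ban with a hard post-generation check and a regenerate loop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative sketch only -- not Ozigi's actual implementation.
const BANNED = ['delve', 'robust', 'seamlessly', 'tapestry', 'testament'];

function findViolations(text: string): string[] {
  // \w* catches variations: "delves", "delving", "robustness"
  return BANNED.filter((word) =&gt; new RegExp(`\\b${word}\\w*`, 'i').test(text));
}

async function generateClean(
  prompt: string,
  generate: (p: string) =&gt; Promise&lt;string&gt;,
  maxTries = 3,
): Promise&lt;string&gt; {
  let current = prompt;
  for (let i = 0; i &lt; maxTries; i++) {
    const draft = await generate(current);
    const hits = findViolations(draft);
    if (hits.length === 0) return draft;
    // Feed the violations back as an explicit constraint and retry
    current = `${prompt}\nSTRICTLY FORBIDDEN WORDS: ${hits.join(', ')}.`;
  }
  throw new Error('Could not produce a draft without banned vocabulary');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;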

&lt;p&gt;But the tool is only part of the answer. The bigger problem is structural.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Reason Your Content Isn't Working
&lt;/h2&gt;

&lt;p&gt;Most builders (myself included, until recently) treat content like a release: something that happens once, at the end, when the thing is done.&lt;/p&gt;

&lt;p&gt;That mental model is the root cause of most distribution failure.&lt;/p&gt;

&lt;p&gt;Content that builds an audience doesn't work like product launches. It works like compounding interest. A single post doesn't build a following. A consistent body of work does: a posting habit that, over time, signals to your audience that you're a reliable source of something worth reading.&lt;/p&gt;

&lt;p&gt;The builders who seem to "go viral" on X or LinkedIn aren't getting lucky. They've usually been shipping content consistently for long enough that when one post breaks through, there's a body of work behind it that converts interest into followers, followers into readers, and readers into users.&lt;/p&gt;

&lt;p&gt;So the real question isn't &lt;em&gt;"how do I write a better launch post?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's &lt;em&gt;"how do I build a content system I can actually sustain?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Sustainable Technical Content System Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's the framework. It's not complicated, but it requires treating content like an engineering problem — which, if you're reading this, is probably how you think best anyway.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Raw material is everywhere. Stop waiting for inspiration.
&lt;/h3&gt;

&lt;p&gt;Every week you're producing more content-worthy material than you realize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PRs you merged and the decisions behind them&lt;/li&gt;
&lt;li&gt;A bug that took you three hours to track down&lt;/li&gt;
&lt;li&gt;A meeting where a customer said something that reframed how you think about the product&lt;/li&gt;
&lt;li&gt;A library you tried that didn't work the way the docs said it would&lt;/li&gt;
&lt;li&gt;An architectural decision you almost made and didn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this requires you to sit down and think of something to write about. It requires you to notice that what's already happening in your work is interesting to other people.&lt;/p&gt;

&lt;p&gt;The shift is from treating content creation as a separate creative task to treating it as a documentation habit. You're already doing the work. You just need a system to capture it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; is built around this principle. You drop in a URL, a block of raw notes, even a PDF, an audio, transcript, basically any piece of information you have at your disposal, and the engine extracts the narrative structure without you needing to summarize or clean it first. That's what the &lt;a href="https://ozigi.app/docs/multimodal-pipeline" rel="noopener noreferrer"&gt;multimodal ingestion pipeline&lt;/a&gt; is built to do: collapse the friction between "I have something worth saying" and "I have a draft worth editing" down to seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Platform matters more than most people think.
&lt;/h3&gt;

&lt;p&gt;A LinkedIn post and an X thread about the same topic are not the same content. They're different formats, different reader expectations, different hooks, different lengths.&lt;/p&gt;

&lt;p&gt;LinkedIn readers expect context and narrative. They'll read three paragraphs before deciding if they care. X readers decide in one sentence, often the first one. Discord announcements need to be skimmable. Newsletters can go long, but they need a reason to exist beyond "here's what I built."&lt;/p&gt;

&lt;p&gt;Most people write one thing and paste it across platforms unchanged. The format stays the same but engagement falls because the content doesn't match where it's landing.&lt;/p&gt;

&lt;p&gt;A proper content system produces platform-native output from the same source material. Your one insight (the rate-limiting decision, the architecture tradeoff, the customer discovery finding) becomes a thread on X, a narrative on LinkedIn, a community update in Discord or Slack, and a newsletter deep-dive. Each piece is formatted for the expectations of its audience, not copy-pasted from the others.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Your voice is the most important part of your content.
&lt;/h3&gt;

&lt;p&gt;Anyone can write about Next.js caching. Anyone can explain what a webhook is. But only you can explain those things with your specific perspective, your specific context, the way you'd describe it to a colleague over lunch.&lt;/p&gt;

&lt;p&gt;That voice — built over hundreds of posts — is what makes people follow &lt;em&gt;you&lt;/em&gt; and not just &lt;em&gt;the topic.&lt;/em&gt; It's what turns a reader into someone who shows up every time you post because they trust it'll be worth their time.&lt;/p&gt;

&lt;p&gt;That voice is also what AI strips out by default. The generic output problem isn't just an aesthetics issue. Every time you publish something that sounds like it came from a template, you're forfeiting the one thing that can't be replicated: the specific way you think about something.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://ozigi.app/docs/system-personas" rel="noopener noreferrer"&gt;Ozigi's System Personas&lt;/a&gt; go beyond setting a "tone." Instead of prompting "write professionally," you define a character: your technical depth, your sentence rhythm, the phrases you actually use, the things you'd never say. That brief gets applied to every piece of content the engine generates, which means every draft is already shaped like you before you touch the edit button.&lt;/p&gt;
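
&lt;p&gt;For illustration, a persona brief might read something like this (a hypothetical example, not a real Ozigi persona):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSONA: Backend engineer, six years in, builds developer tools.
DEPTH: Assumes readers know what a queue and a webhook are. Never explains basics.
RHYTHM: Short declaratives. One-sentence paragraphs are fine. No rhetorical questions.
SAYS: "ship it", "the boring answer is", "in prod".
NEVER SAYS: "game-changer", "excited to announce", anything with more than one emoji.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;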

&lt;h3&gt;
  
  
  4. The 10% rule: the tool gets you 90, you own the rest.
&lt;/h3&gt;

&lt;p&gt;The honest truth about AI-assisted content is that any decent engine can get you 90% of the way there. The last 10% is yours, and it's the part that actually matters.&lt;/p&gt;

&lt;p&gt;That 90% is structure, platform formatting, tone calibration, cutting the filler. Generative AI can handle that by default.&lt;/p&gt;

&lt;p&gt;The 10% is the specific number from your metrics dashboard, the inside joke the AI doesn't know about, the anecdote from your last customer call, or the offhand observation that only makes sense if you know your history with this problem. It's the exact phrasing you'd use if you were explaining this to a friend at 11pm.&lt;/p&gt;

&lt;p&gt;That 10% is what makes content trustworthy. It's what makes someone share it instead of just scrolling past it. And it's irreplaceable because it comes from actually having done the thing.&lt;/p&gt;

&lt;p&gt;The mistake most people make with AI writing tools is expecting the full 100%. When the output is 90% of the way there, they feel cheated. &lt;/p&gt;

&lt;p&gt;The better mental model to have is: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;you're not outsourcing the writing. You're outsourcing the blank page.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ozigi's editing layer is built around exactly this split. Every campaign lands in a staging area — nothing goes live until you've reviewed it. &lt;a href="https://ozigi.app/docs/human-in-the-loop" rel="noopener noreferrer"&gt;The human-in-the-loop architecture&lt;/a&gt; keeps generation and publishing strictly separate, so you're always the last step before your content reaches your audience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2c8ajezdod67inkwfng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2c8ajezdod67inkwfng.png" alt="ozigi's edit area" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Compounding Effect Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you run a consistent content system for six months:&lt;/p&gt;

&lt;p&gt;Your posts start referencing each other. Your audience starts anticipating what you'll say next. When you ship something new, you have enough readers that the launch post gets signal on day one, which means it gets distributed further, which means more people see it.&lt;/p&gt;

&lt;p&gt;So, getting four likes on your launch post isn't a content quality problem. It's a consistency problem. You posted into a vacuum because you hadn't been posting consistently enough to have an audience ready when it mattered.&lt;/p&gt;

&lt;p&gt;The builders who seem to "have an audience already" when they ship something new didn't get lucky. &lt;br&gt;
I know a founder who ran a 100-day posting challenge on X before his product launch. He climbed to $500 in sales in the first week. He already had an audience.&lt;br&gt;
He paid the consistency debt early. He posted about the messy in-progress version, the failed experiments, the decisions he made and unmade. By the time he shipped, the audience was already there.&lt;/p&gt;

&lt;p&gt;Content marketing for technical audiences is a long game. The best time to start was six months ago. The second-best time is right now with a system that makes it sustainable enough to actually keep going.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start Small. Ship Consistently.
&lt;/h2&gt;

&lt;p&gt;You don't need to produce ten pieces of content a week. You don't need a content calendar with color-coded categories and quarterly themes.&lt;/p&gt;

&lt;p&gt;You need one piece of content per week that comes from something you actually did, written in a voice that sounds like you, distributed to the platforms where your audience actually is.&lt;/p&gt;

&lt;p&gt;That's the whole system.&lt;/p&gt;

&lt;p&gt;The tools exist to make it easier. The only thing without a shortcut is starting.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;If you're a technical founder, developer, or DevRel professional trying to build a consistent content presence without it eating your calendar — &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; is worth trying.&lt;/strong&gt; The free tier gives you 5 campaigns a month. Drop in your raw notes from last week, see what comes out, and decide from there. Get one week of Pro free when you sign up today!&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Try Ozigi free&lt;/a&gt; · &lt;a href="https://ozigi.app/docs" rel="noopener noreferrer"&gt;Read the platform docs&lt;/a&gt; · &lt;a href="https://ozigi.app/docs/deep-dives" rel="noopener noreferrer"&gt;See the architecture deep dives&lt;/a&gt; · &lt;a href="https://github.com/Ozigi-app/OziGi" rel="noopener noreferrer"&gt;Star on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have a content system that's actually working for you? Or a launch post that flopped spectacularly and taught you something? Drop it in the comments — genuinely curious what patterns people are seeing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>contentwriting</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:00:34 +0000</pubDate>
      <link>https://forem.com/dumebii/gemini-25-flash-vs-claude-37-sonnet-4-production-constraints-that-made-the-decision-for-me-bib</link>
      <guid>https://forem.com/dumebii/gemini-25-flash-vs-claude-37-sonnet-4-production-constraints-that-made-the-decision-for-me-bib</guid>
      <description>&lt;p&gt;An evaluation of the Gemini 2.5 Flash and Claude 3.7 Sonnet models for an agentic content engine.&lt;/p&gt;

&lt;p&gt;I had a simple rule when choosing an LLM for &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt;: don't pick based on benchmark leaderboards. After my v2 launch, while gathering feedback, a user suggested I use the Claude models because they were better for content generation than Gemini. The suggestion sounded tempting, but I had to pick a model based on the four constraints my production pipeline couldn't negotiate around.&lt;/p&gt;

&lt;p&gt;Most "Gemini vs Claude" comparisons evaluate general-purpose capabilities like coding, reasoning, and creative writing. That's useful if you're building a general-purpose product. &lt;br&gt;
I wasn't. &lt;br&gt;
Ozigi is a content engine. You feed it a URL, a PDF, or raw notes. It returns a structured 3-day social media campaign as a JSON payload that the frontend maps directly into UI cards.&lt;/p&gt;

&lt;p&gt;That specificity made the evaluation easier than I expected: two models, four constraints, one clear winner on three of them.&lt;/p&gt;

&lt;p&gt;This is the third post in the &lt;a href="https://dev.to/dumebii/series/36170"&gt;Ozigi Changelog Series&lt;/a&gt;. If you want the backstory on why Ozigi exists, start with &lt;a href="https://dev.to/dumebii/i-vibe-coded-an-internal-tool-that-slashed-my-content-workflow-by-4-hours-310f"&gt;how I vibe-coded the internal tool&lt;/a&gt; that became it, and the &lt;a href="https://dev.to/dumebii/ozigi-v2-changelog-building-a-modular-agentic-content-engine-with-nextjs-supabase-and-playwright-59mo"&gt;v2 changelog&lt;/a&gt; that introduced the modular architecture this decision was built on.&lt;/p&gt;

&lt;p&gt;Here's the full Architecture Decision Record.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Setup: What the Pipeline Actually Does
&lt;/h2&gt;

&lt;p&gt;The core API route in Ozigi does this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accepts a &lt;code&gt;multipart/form-data&lt;/code&gt; payload containing a URL, raw text, and/or a file (PDF or image)&lt;/li&gt;
&lt;li&gt;Constructs a prompt with strict editorial constraints injected at the system level&lt;/li&gt;
&lt;li&gt;Sends everything to the LLM via the &lt;a href="https://cloud.google.com/vertex-ai/docs/start/client-libraries" rel="noopener noreferrer"&gt;Vertex AI Node.js SDK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Returns the raw text response directly to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The frontend then does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;setCampaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;campaign&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No middleware. No schema validation. No error recovery in the happy path. Raw parse, straight into React state.&lt;/p&gt;

&lt;p&gt;That single line is why model selection mattered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Constraint 1: Comparing Gemini vs Claude Models for JSON Output Stability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The requirement:&lt;/strong&gt; The model must return a valid JSON object — every time, without wrapping it in markdown code fences, without adding a conversational preamble, and without hallucinating a trailing comma that breaks &lt;code&gt;JSON.parse()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The target schema looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"campaign"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"linkedin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"discord"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"linkedin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"discord"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"linkedin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"discord"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema yields nine posts: three platforms across three days, with every field required. &lt;br&gt;
The UI renders each field into a separate card with edit, copy, and publish actions. A missing key doesn't throw a visible error — it silently renders an empty card.&lt;br&gt;
This comparison is specifically between Gemini with &lt;code&gt;responseSchema&lt;/code&gt; enforcement and Claude with prompted JSON, not between each model's structural output ceiling. Claude's tool use with &lt;code&gt;tool_choice: {type: "tool"}&lt;/code&gt; enforces schema at the decoding layer and can reach equivalent reliability. The relevant constraint here was which enforcement mechanism was available and practical within my existing stack. More on that below.&lt;br&gt;
I ran 500 automated test generations against both models targeting this schema, measuring the percentage of responses that &lt;code&gt;JSON.parse()&lt;/code&gt; accepted without exceptions.&lt;/p&gt;
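
&lt;p&gt;The harness itself is nothing exotic. A minimal sketch of the measurement, assuming a &lt;code&gt;generate()&lt;/code&gt; wrapper around each model's API call (the real test code isn't published):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of the format-adherence measurement.
async function formatAdherence(
  generate: () =&gt; Promise&lt;string&gt;,
  runs = 500,
): Promise&lt;number&gt; {
  let passed = 0;
  for (let i = 0; i &lt; runs; i++) {
    try {
      JSON.parse(await generate()); // the only check the frontend cares about
      passed++;
    } catch {
      // markdown fences, conversational preambles, and trailing commas all land here
    }
  }
  return passed / runs;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;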

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Format Adherence Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.7 Sonnet (prompted)&lt;/td&gt;
&lt;td&gt;~88.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sbgjoan2io2r0usee0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sbgjoan2io2r0usee0f.png" alt="Bar chart: Gemini 2.5 Flash 99.9% vs Claude 3.7 Sonnet 88.5% JSON parse success rate across 500 test generations." width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The roughly 11-point gap maps directly to broken UI states for real users. That was not acceptable to me for a core feature.&lt;/p&gt;

&lt;p&gt;Using Gemini's &lt;code&gt;responseSchema&lt;/code&gt; closes this entirely. According to &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output" rel="noopener noreferrer"&gt;Google's controlled generation documentation&lt;/a&gt;, the feature physically prevents the model from returning output that doesn't conform to your schema. It's not prompt-level guidance, it's enforced at the decoding layer. Here's what the production implementation looks like for Ozigi: the schema is defined once at the top of the route and attached directly to the model config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;distributionSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OBJECT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;campaign&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ARRAY&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;A list of 3 daily social media posts.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OBJECT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;day&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INTEGER&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Day number (1, 2, or 3)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content for X/Twitter.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;linkedin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content for LinkedIn.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;discord&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;STRING&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content for Discord.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;day&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linkedin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;discord&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;campaign&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;vertex_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getGenerativeModel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;generationConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;responseMimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;responseSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;distributionSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;response.text()&lt;/code&gt; is now structurally guaranteed to be valid JSON. &lt;code&gt;JSON.parse()&lt;/code&gt; cannot throw on a trailing comma or a conversational preamble, and the &lt;code&gt;required&lt;/code&gt; arrays mean no field can silently go missing: the model is decoded under the schema and cannot produce any of those failure modes. &lt;br&gt;
Claude's tool use and function calling can achieve similar guarantees, but they require a meaningfully different integration architecture. With the Vertex SDK, this is one config block.&lt;/p&gt;
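
&lt;p&gt;For contrast, here's roughly what the Claude-side equivalent looks like through the Anthropic Messages API. This is a sketch based on Anthropic's documented tool-use pattern, not an integration Ozigi runs; the tool name and elided schema are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const prompt = 'Generate the 3-day campaign from this source material: ...';

const msg = await anthropic.messages.create({
  model: 'claude-3-7-sonnet-latest',
  max_tokens: 4096,
  tools: [{
    name: 'emit_campaign',
    description: 'Return the 3-day social media campaign.',
    input_schema: {
      type: 'object',
      properties: { campaign: { type: 'array' } }, // full schema elided for brevity
      required: ['campaign'],
    },
  }],
  // Forces the model to answer through the tool, i.e. schema-conformant output
  tool_choice: { type: 'tool', name: 'emit_campaign' },
  messages: [{ role: 'user', content: prompt }],
});

// The payload arrives as a tool_use content block, not as response text
const block = msg.content.find(
  (b): b is Anthropic.Messages.ToolUseBlock =&gt; b.type === 'tool_use',
);
const campaign = block?.input; // already structured; no JSON.parse step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Workable, but it's a different SDK and a different response shape than the rest of my pipeline.&lt;/p&gt;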

&lt;p&gt;&lt;strong&gt;Winner: Gemini.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Constraint 2: Comparing Gemini vs Claude on Latency on a Live Public Sandbox
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The requirement:&lt;/strong&gt; Ozigi has a free, unauthenticated sandbox. Anyone can generate a full 3-day campaign without signing up.&lt;/p&gt;

&lt;p&gt;That changes the economics of model selection completely. A paying user on a premium plan will tolerate a 20-second wait if the output quality justifies it. An anonymous user who found the product via my wacky marketing efforts will not. They'll close the tab at 10 seconds and probably not come back, sadly.&lt;/p&gt;

&lt;p&gt;I benchmarked both models against a standard 10,000-token input payload via Vercel serverless functions (my production environment):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Response Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;~6.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.7 Sonnet&lt;/td&gt;
&lt;td&gt;~21.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd88o8by58f78rzbzxqdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd88o8by58f78rzbzxqdl.png" alt="Bar chart: Gemini 2.5 Flash 6.2s vs Claude 3.7 Sonnet 21.5s average response latency from Vercel serverless, with 10s tab-close threshold marked" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Methodology: N=100 requests per model, measured end-to-end from Vercel function invocation to full response. Results are environment-dependent and intended for directional comparison, not as absolute benchmarks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The gap holds across payload sizes. Gemini Flash consistently lands inside the 10-to-15-second window, usually well under it. Claude 3.7 Sonnet consistently exceeds 20 seconds on the same inputs, in the same environment.&lt;/p&gt;

&lt;p&gt;This gap would narrow significantly with streaming: getting first tokens in front of the user within 2-3 seconds. Streaming changes the perceived wait time for a user entirely. This is, however, a v4 architecture item that is being worked on. For a non-streaming pipeline with a public sandbox, the 3.5x latency difference is a product decision, not just an engineering one.&lt;/p&gt;
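
&lt;p&gt;For reference, the streaming path is a small change with the Vertex AI SDK. This is a sketch of the planned shape, not shipped code, and it comes with a caveat: under &lt;code&gt;responseSchema&lt;/code&gt;, the JSON only becomes parseable once the stream completes, so the UI would render the partial text progressively rather than from the parsed object.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of a streaming route (same model config as the responseSchema example).
const streamingResult = await model.generateContentStream({
  contents: [{ role: 'user', parts }],
});

for await (const chunk of streamingResult.stream) {
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  res.write(text); // first fragments reach the client in seconds, not at completion
}
res.end();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;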

&lt;p&gt;&lt;strong&gt;Winner: Gemini Flash&lt;/strong&gt; — and it's not close for non-streaming public sandboxes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Constraint 3: Comparing Gemini vs Claude on Native Multimodal Ingestion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The requirement:&lt;/strong&gt; Users can upload PDFs and images directly as context. The pipeline needs to process them without an external preprocessing step.&lt;/p&gt;

&lt;p&gt;With Gemini via the &lt;a href="https://cloud.google.com/vertex-ai/docs/start/client-libraries" rel="noopener noreferrer"&gt;Vertex AI Node.js SDK&lt;/a&gt;, the entire PDF pipeline is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// /app/api/generate/route.ts&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;arrayBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;inlineData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base64Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// "application/pdf", "image/jpeg", etc.&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the SDK handles the buffer natively. Gemini reads the PDF directly as part of the multipart request alongside the text prompt — no OCR step, no preprocessing, no separate service call. &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/overview" rel="noopener noreferrer"&gt;Google's multimodal documentation&lt;/a&gt; confirms that Gemini was designed from the ground up to handle PDF and image buffers natively via &lt;code&gt;inlineData&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;An earlier version of this article claimed that Claude required an external OCR step for PDF ingestion. That was wrong. Claude's Messages API does support native base64 PDF ingestion directly via a document content block: no OCR preprocessing, no external service. The pattern is structurally similar to Vertex AI's &lt;code&gt;inlineData&lt;/code&gt;, just with different field names.&lt;/p&gt;

&lt;p&gt;The real constraint here was ecosystem, not capability. I evaluated Claude 3.7 Sonnet as available in the Google Model Garden within my existing Vertex AI setup. Switching to Claude's native PDF ingestion would have meant moving to the Anthropic Messages API entirely: a different provider, a different SDK, different billing. The Vertex AI path was simpler for the stack I was already running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Gemini (for this stack).&lt;/strong&gt; Both models support native multimodal ingestion without external OCR. The advantage here was ecosystem fit, not a fundamental capability difference.&lt;/p&gt;


&lt;h2&gt;
  
  
  Constraint 4: Comparing Google Gemini vs Claude on Tone Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The requirement:&lt;/strong&gt; Generated social media posts must sound like a human wrote them. Specifically, they must pass AI content detection and avoid the predictable cadence patterns that make AI-generated copy immediately identifiable.&lt;/p&gt;

&lt;p&gt;This is the constraint where Claude wins cleanly on base performance. &lt;br&gt;
Our internal blind A/B evaluations of 50 technical posts (scored on pragmatic sentence structure and absence of AI terminology) gave Claude 3.7 Sonnet a "human cadence quality score" of 9.5/10. Gemini Flash's base score was 5.5/10.&lt;/p&gt;

&lt;p&gt;That's a significant gap. And it's for the feature that is Ozigi's core value proposition.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use Gemini for Tone Engineering?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Because the gap is engineerable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built the Banned Lexicon — a programmatic constraint injected at the system prompt level that explicitly penalizes the vocabulary patterns that make AI copy detectable. You can read the full implementation in the &lt;a href="https://ozigi.app/docs" rel="noopener noreferrer"&gt;Ozigi documentation&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THE BANNED LEXICON: You are strictly forbidden from using the 
following words or their variations: delve, testament, tapestry, 
crucial, vital, landscape, realm, unlock, supercharge, revolutionize, 
paradigm, seamlessly, navigate, robust, cutting-edge, game-changer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with explicit cadence engineering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BURSTINESS (CADENCE): Write with high burstiness. Do not use 
perfectly balanced, medium-length sentences. Mix extremely short, 
punchy sentences (2-4 words) with longer, detailed explanations.

PERPLEXITY: Avoid predictable adjectives. Use strong, active verbs 
and concrete nouns. Talk like a pragmatic subject matter expert 
explaining a concept to people, not a marketer selling a product.

FORMATTING RESTRAINT: You are limited to a MAXIMUM of 1 emoji per 
post. Use a maximum of 2 highly relevant hashtags per post.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these constraints active, Gemini's human cadence score jumps from 5.5 to 9.2 — within acceptable range of Claude's base 9.5.&lt;/p&gt;

&lt;p&gt;The key insight: Claude's tone advantage is a &lt;em&gt;default&lt;/em&gt; advantage, not an &lt;em&gt;absolute&lt;/em&gt; one. Gemini's outputs are more malleable under prompt constraints. For a use case where tone control is the entire product, that malleability is worth more than a higher baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Gemini + engineering constraints.&lt;/strong&gt; The tone gap is closeable. The latency and JSON stability gaps on the other constraints are not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15wn2uuvacy1ws7bldib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15wn2uuvacy1ws7bldib.png" alt="Horizontal bar chart: Gemini base 5.5/10 vs Gemini with Banned Lexicon 9.2/10 vs Claude base 9.5/10 human cadence score." width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Gemini vs Claude Models: The Cost Reality
&lt;/h2&gt;

&lt;p&gt;At this stage, Ozigi runs a free public sandbox, so every anonymous page load that triggers a generation is a billable API call absorbed by the product. Ozigi is pre-revenue, so this matters a lot.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input Cost (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output Cost (per 1M tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;~$0.075&lt;/td&gt;
&lt;td&gt;~$0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.7 Sonnet&lt;/td&gt;
&lt;td&gt;~$3.00&lt;/td&gt;
&lt;td&gt;~$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxydm83hlu2nh0r7t4ny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxydm83hlu2nh0r7t4ny.png" alt="Cost comparison: Gemini $0.075 input / $0.30 output vs Claude $3.00 input / $15.00 output per 1M tokens. 40x to 50x difference." width="800" height="671"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pricing sourced from &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/pricing" rel="noopener noreferrer"&gt;Google Cloud Vertex AI pricing&lt;/a&gt; and &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic API pricing&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pro tip: Verify current rates before production decisions; both have changed multiple times in the past year.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The input cost difference is 40x. The output cost difference is 50x. For a free-tier product with no revenue, the ability to run a public sandbox sustainably is the difference between having a conversion funnel and not having one.&lt;/p&gt;
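&lt;p&gt;To make that concrete, here is a back-of-the-envelope sketch. The per-generation token counts are illustrative assumptions, not measured Ozigi averages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Back-of-the-envelope cost per generation. Token counts are illustrative
// assumptions, not measured Ozigi averages.
const PRICES = {
  gemini25Flash: { inPerM: 0.075, outPerM: 0.3 },
  claude37Sonnet: { inPerM: 3.0, outPerM: 15.0 },
};

function costPerCall(p: { inPerM: number; outPerM: number }, inTok: number, outTok: number): number {
  return (inTok / 1_000_000) * p.inPerM + (outTok / 1_000_000) * p.outPerM;
}

// Assuming ~2,000 input and ~1,500 output tokens per campaign:
console.log(costPerCall(PRICES.gemini25Flash, 2000, 1500));  // ~$0.0006
console.log(costPerCall(PRICES.claude37Sonnet, 2000, 1500)); // ~$0.0285, roughly 47x more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;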




&lt;h2&gt;
  
  
  Where Ozigi Is Going and How It Would Change My Choice of Model, Moving Forward
&lt;/h2&gt;

&lt;p&gt;This is an honest &lt;a href="https://adr.github.io/" rel="noopener noreferrer"&gt;ADR&lt;/a&gt;. Here's what would change my answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Ozigi finally moves behind a paywall&lt;/strong&gt;, latency and cost become secondary concerns. A signed-in user on a paid plan waiting 20 seconds for premium output is a different UX calculation than an anonymous user on a free demo. In that context, Claude's base tone quality becomes much more compelling. I'd be trading economics for output baseline, and the trade might be worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When streaming gets implemented&lt;/strong&gt;, the latency argument against Claude weakens significantly. Claude 3.7 Sonnet's time-to-first-token via streaming is competitive. A user seeing the first post appear in 2-3 seconds experiences the product very differently than a user staring at a progress bar for 21 seconds. Streaming is on the roadmap.&lt;/p&gt;
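&lt;p&gt;For reference, the code change is small. A minimal streaming sketch with the Anthropic TypeScript SDK, where the model alias and prompt are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

// Streaming sketch: the user perceives time-to-first-token, not total
// generation time. The model alias and prompt are illustrative.
const anthropic = new Anthropic();

async function streamPost() {
  const stream = anthropic.messages.stream({
    model: "claude-3-7-sonnet-latest",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Generate the Day 1 post." }],
  });

  // Render each text delta as it arrives instead of waiting for the full reply
  stream.on("text", (delta) =&gt; process.stdout.write(delta));

  await stream.finalMessage();
}

streamPost();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;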

&lt;p&gt;For an in-depth look at how we tested the pipeline that informs these decisions, see &lt;a href="https://dev.to/dumebii/how-to-e2e-test-ai-agents-mocking-api-responses-with-playwright-in-nextjs-nic"&gt;how we E2E test AI agents with Playwright in Next.js&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;Gemini 2.5 Flash&lt;/th&gt;
&lt;th&gt;Claude 3.7 Sonnet&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON Stability (responseSchema)&lt;/td&gt;
&lt;td&gt;99.9% → guaranteed&lt;/td&gt;
&lt;td&gt;~88.5% (prompted)&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (non-streaming)&lt;/td&gt;
&lt;td&gt;~6.2s&lt;/td&gt;
&lt;td&gt;~21.5s&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native PDF/Image ingestion&lt;/td&gt;
&lt;td&gt;Native via Vertex SDK&lt;/td&gt;
&lt;td&gt;Native via Messages API&lt;/td&gt;
&lt;td&gt;Gemini (ecosystem fit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base tone quality&lt;/td&gt;
&lt;td&gt;5.5/10&lt;/td&gt;
&lt;td&gt;9.5/10&lt;/td&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tone quality (+ constraints)&lt;/td&gt;
&lt;td&gt;9.2/10&lt;/td&gt;
&lt;td&gt;9.5/10&lt;/td&gt;
&lt;td&gt;Near tie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per 1M input tokens&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini won on five of six dimensions. Claude won on one — base tone — and that gap was closeable through prompt engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Questions to Ask Before Choosing an LLM for Your Agentic Project/App
&lt;/h2&gt;

&lt;p&gt;If you're building something similar to Ozigi, these are the constraints worth thinking through before you pick an API and start building:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Does your UI depend on structured output?&lt;/strong&gt; If your frontend calls &lt;code&gt;JSON.parse()&lt;/code&gt; on a raw model response, you need API-level schema enforcement — not prompt instructions asking nicely. &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output" rel="noopener noreferrer"&gt;&lt;code&gt;responseSchema&lt;/code&gt; via Vertex AI&lt;/a&gt;, Claude's tool use with forced &lt;code&gt;tool_choice&lt;/code&gt;, or &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;structured outputs via OpenAI&lt;/a&gt; all enforce at the decoding layer. The question isn't which model supports it — most do — it's which enforcement path fits your existing stack.&lt;/p&gt;
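&lt;p&gt;As an example of the decoding-layer path, here is a minimal sketch using &lt;code&gt;responseSchema&lt;/code&gt; with the Vertex AI Node SDK. The project, location, and schema fields are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { VertexAI, SchemaType } from "@google-cloud/vertexai";

// Sketch: JSON enforced at the decoding layer rather than via prompt text.
// Project, location, and the schema fields here are illustrative assumptions.
const vertex = new VertexAI({ project: "my-project", location: "us-central1" });

const model = vertex.getGenerativeModel({
  model: "gemini-2.5-flash",
  generationConfig: {
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        campaign: {
          type: SchemaType.ARRAY,
          items: {
            type: SchemaType.OBJECT,
            properties: {
              day: { type: SchemaType.INTEGER },
              x: { type: SchemaType.STRING },
              linkedin: { type: SchemaType.STRING },
              discord: { type: SchemaType.STRING },
            },
          },
        },
      },
    },
  },
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;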

&lt;p&gt;&lt;strong&gt;2. Do you have a free tier or public sandbox?&lt;/strong&gt; If yes, latency and cost are product decisions that affect conversion, not just infrastructure decisions that affect margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Does your use case require multimodal inputs?&lt;/strong&gt; Most major models now support native PDF and image ingestion without external preprocessing. Map out what the integration looks like within your existing API provider before assuming you need to switch or add infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Where is the base model weakest, and is that gap engineerable?&lt;/strong&gt; Claude's tone advantage is real. It's also not the only path to human-sounding copy. Engineering constraints at the prompt level can close gaps that feel insurmountable when you're just looking at base benchmarks.&lt;/p&gt;

&lt;p&gt;The best model for your product is rarely the one with the highest aggregate score. It's the one that fails least on the constraints you actually can't work around.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;The full Ozigi architecture — including the generate API route, the Banned Lexicon implementation, and the Vertex AI configuration — is open source on &lt;a href="https://github.com/Dumebii/OziGi" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;The live context engine is at &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;ozigi.app&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;The interactive version of this ADR, &lt;a href="https://ozigi.app/architecture" rel="noopener noreferrer"&gt;with Chart.js visualisations of each benchmark&lt;/a&gt;, is also available.&lt;/li&gt;
&lt;li&gt;Ozigi is currently looking for user experience testers to give honest feedback on the product and point out areas for improvement.&lt;/li&gt;
&lt;li&gt;We have some &lt;a href="https://github.com/Dumebii/OziGi/issues" rel="noopener noreferrer"&gt;open issues&lt;/a&gt; on GitHub that are open to contributions from the community. 
&lt;em&gt;PS: this app has been entirely vibe-coded so far, so we welcome vibe-coded contributions too!&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;Connect with me on &lt;a href="//www.linkedin.com/in/dumebi-okolo"&gt;LinkedIn&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Send me an email at &lt;a href="mailto:okolodumebi@gmail.com"&gt;okolodumebi@gmail.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Building something cool? Talk about it in the comments!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>showdev</category>
      <category>nextjs</category>
    </item>
    <item>
      <title>How to End-to-end (E2E) Test AI Agents: Mocking API Responses with Playwright in Next.js</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Fri, 06 Mar 2026 12:50:33 +0000</pubDate>
      <link>https://forem.com/dumebii/how-to-e2e-test-ai-agents-mocking-api-responses-with-playwright-in-nextjs-nic</link>
      <guid>https://forem.com/dumebii/how-to-e2e-test-ai-agents-mocking-api-responses-with-playwright-in-nextjs-nic</guid>
      <description>&lt;p&gt;Building an AI agent is fun. At least, I have had so much fun building out &lt;a href="//ozigi.app"&gt;Ozigi&lt;/a&gt;, a social media content manager agent (ps, we are in need of user experience testers!).&lt;/p&gt;

&lt;p&gt;But!&lt;br&gt;
Testing it in a CI/CD pipeline is a nightmare.&lt;/p&gt;

&lt;p&gt;If you are building an application that relies on an LLM (like OpenAI, Anthropic, or Google's Vertex AI), you quickly run into these three challenges when writing End-to-End (E2E) tests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Every time your test suite runs, you are burning API credits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; LLMs are slow. Waiting 10-15 seconds per test will grind your deployment pipeline to a halt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Determinism:&lt;/strong&gt; LLMs never return the &lt;em&gt;exact&lt;/em&gt; same string twice. If your Playwright test relies on &lt;code&gt;expect(page.getByText('exact phrase')).toBeVisible()&lt;/code&gt;, your tests will randomly fail.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While building &lt;a href="//ozigi.app"&gt;Ozigi&lt;/a&gt;—an agentic content engine designed to turn raw technical research into structured social campaigns—I needed a way to test the complex UI state transitions (like custom loaders and dynamic grids) without actually hitting the Vertex AI API, especially seeing as I am managing very conservatively my $300 in credits!&lt;/p&gt;
&lt;h2&gt;
  
  
  Playwright Network Interception
&lt;/h2&gt;

&lt;p&gt;Here is how to completely decouple your frontend E2E tests from your LLM backend using Next.js and Playwright.&lt;/p&gt;

&lt;p&gt;In Ozigi, the user flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user selects a custom persona and inputs raw context (a URL or text dump).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kvlkn8yd38avujkcbub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kvlkn8yd38avujkcbub.png" alt="create persona" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;They click "Generate Campaign."&lt;/li&gt;
&lt;li&gt;The UI swaps to a &lt;code&gt;&amp;lt;DynamicLoader /&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm0lgwr41z9esjp8qq7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm0lgwr41z9esjp8qq7a.png" alt="dynamic loader" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;The Next.js API route (&lt;code&gt;/api/generate&lt;/code&gt;) sends the context to Gemini 2.5 Pro.&lt;/li&gt;
&lt;li&gt;The LLM returns a strictly formatted JSON object.&lt;/li&gt;
&lt;li&gt;The UI renders the multi-platform campaign grid.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft68x94rga0zppqyomm40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft68x94rga0zppqyomm40.png" alt="distribution grid" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I tested this live, it would introduce latency and flakiness. &lt;br&gt;
Instead, I intercept the API call and instantly return a fake JSON payload.&lt;/p&gt;
&lt;h2&gt;
  
  
  Network Mocking (Interception with &lt;code&gt;page.route&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Playwright allows us to hijack outbound network requests directly from the browser. When the frontend tries to call our Next.js API route, Playwright intercepts the &lt;code&gt;POST&lt;/code&gt; request, blocks it from ever hitting the server, and fulfills it with our own static data.&lt;/p&gt;

&lt;p&gt;Here is the exact test script I use to validate the Ozigi content engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Ozigi Context Engine &amp;amp; AI Mocking&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should generate a campaign by intercepting the LLM response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 1. Navigate to the dashboard&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 2. Fill out the Context fields&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByPlaceholder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Paste a URL or raw notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://ozigi.app/docs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByPlaceholder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Additional directives...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Keep it technical.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 🚀 THE MAGIC: Intercept the AI generation API route&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;**/api/generate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="c1"&gt;// Define the exact JSON structure your frontend expects from the LLM&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mockedAIResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;campaign&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;day&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Day 1 Thread: Ozigi is tested and working! 1/2&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;[The content engine is officially alive.]&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;linkedin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;LinkedIn Post: Ozigi testing complete.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;discord&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Discord Update: Systems green.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;};&lt;/span&gt;

      &lt;span class="c1"&gt;// Fulfill the route instantly with the mocked data&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fulfill&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mockedAIResponse&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. Trigger the generation&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/Generate Campaign/i&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// 4. Assert the UI state transitions correctly&lt;/span&gt;
    &lt;span class="c1"&gt;// Verify the loader appears while the "network" request is happening&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loaderContainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.animate-in.fade-in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loaderContainer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// 5. Assert the final UI renders our mocked data perfectly&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Ozigi is tested and working!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[The content engine is officially alive.]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why You Should Mock LLM/API Responses In Playwright
&lt;/h2&gt;

&lt;p&gt;By using this testing pattern, I achieved three of my engineering goals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero Cost:&lt;/strong&gt; The test suite can run 1,000 times a day on GitHub Actions without costing a single cent in Vertex AI compute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lightning Fast:&lt;/strong&gt; The entire E2E test finishes in seconds, as I bypass the LLM's generation latency entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Absolute Determinism:&lt;/strong&gt; Because I injected a static JSON payload, my text assertions (&lt;code&gt;toBeVisible&lt;/code&gt;) will never fail due to an AI hallucination or a slightly altered adjective.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When building AI wrappers or agentic workflows, your testing strategy must isolate the LLM from the UI. Let the LLM be unpredictable in production, but demand strict predictability in your test suite.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I built this network mocking (interception) pattern into &lt;a href="//ozigi.app"&gt;Ozigi&lt;/a&gt;, an agentic content engine that helps pretty much anyone turn their raw notes/ideas into structured, multi-platform campaigns without dealing with cheesy AI buzzwords. You can check it out at &lt;a href="https://ozigi.app" rel="noopener noreferrer"&gt;ozigi.app&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's connect on &lt;a href="//www.linkedin.com/in/dumebi-okolo"&gt;LinkedIn&lt;/a&gt;!&lt;br&gt;
You can find my spaghetti code &lt;a href="https://github.com/Dumebii/OziGi" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consider this the unofficial v3 changelog of Ozigi. As always, we welcome your feedback and can't wait to hear from you!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>webdev</category>
      <category>playwright</category>
      <category>nextjs</category>
      <category>api</category>
    </item>
    <item>
      <title>Ozigi v2 Changelog: Building a Modular Agentic Content Engine with Next.js, Supabase, and Playwright</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Mon, 02 Mar 2026 11:37:31 +0000</pubDate>
      <link>https://forem.com/dumebii/ozigi-v2-changelog-building-a-modular-agentic-content-engine-with-nextjs-supabase-and-playwright-59mo</link>
      <guid>https://forem.com/dumebii/ozigi-v2-changelog-building-a-modular-agentic-content-engine-with-nextjs-supabase-and-playwright-59mo</guid>
      <description>&lt;p&gt;When I first built &lt;a href="https://blogger-helper-tau.vercel.app/" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; (initially WriterHelper), the goal was simple: give content professionals in my team a way to break down their articles into high-signal social media campaigns.&lt;/p&gt;

&lt;p&gt;OziGi has now evolved into an open-source SaaS product, open to the public to use and improve.&lt;/p&gt;

&lt;p&gt;Here is the complete technical changelog of how I turned Ozigi from a monolithic v1 MVP into a production-ready v2 SaaS.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Modular Refactoring of app/page.tsx (Separation of Concerns)
&lt;/h2&gt;

&lt;p&gt;In v1, my entire application (auth, API calls, and UI) lived inside a long &lt;code&gt;app/page.tsx&lt;/code&gt; file. The more changes I made, the harder it became to manage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular Component Library:&lt;/strong&gt; I stripped down the monolith and broke the UI into pure, single-responsibility React components (&lt;code&gt;Header&lt;/code&gt;, &lt;code&gt;Hero&lt;/code&gt;, &lt;code&gt;Distillery&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96cypkydgo446zrjcfwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96cypkydgo446zrjcfwt.png" alt="modular architecture" width="800" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Type Safety:&lt;/strong&gt; I created a global &lt;code&gt;lib/types.ts&lt;/code&gt; file with a strict &lt;code&gt;CampaignDay&lt;/code&gt; interface (complete with index signatures) to finally eliminate the TypeScript "shadow type" build errors I was fighting. A sketch follows after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Persistence:&lt;/strong&gt; Implemented &lt;code&gt;localStorage&lt;/code&gt; syncing so the app "remembers" if a user is in the dashboard or the landing page, preventing frustrating resets on browser refresh.&lt;/li&gt;
&lt;/ul&gt;
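
&lt;p&gt;Here is a minimal sketch of that interface. The field names mirror the campaign payload the UI renders; the exact shape in the repo's &lt;code&gt;lib/types.ts&lt;/code&gt; may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// lib/types.ts (sketch). The index signature lets platform keys be read
// dynamically while the named fields stay strictly typed.
export interface CampaignDay {
  day: number;
  x: string;        // X (Twitter) thread copy
  linkedin: string; // LinkedIn post copy
  discord: string;  // Discord update copy
  [platform: string]: string | number;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;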

&lt;h2&gt;
  
  
  2. Using Supabase as the Database and Tightening the Backend
&lt;/h2&gt;

&lt;p&gt;A major UX flaw in v1 was that refreshing the page wiped the user's progress.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relational Database &amp;amp; OAuth:&lt;/strong&gt; I replaced anonymous access with secure GitHub OAuth via Supabase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Context History:&lt;/strong&gt; I engineered a system that auto-saves every generated campaign to a PostgreSQL database. Users can now restore past URLs, notes, and outputs with a single click. A sketch of the save call follows below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodnvsgusnc26sy44u8sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodnvsgusnc26sy44u8sw.png" alt="strategy history" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity Storage:&lt;/strong&gt; Built a settings flow to permanently save a user's custom "Persona Voice" and Discord Webhook URLs directly to their profile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8b09ue6myvb4a0jxhzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8b09ue6myvb4a0jxhzx.png" alt="discord webhook upload and added context" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Core Feature Additions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Modal Ingestion:&lt;/strong&gt; Upgraded the input engine to accept both a live URL &lt;em&gt;and&lt;/em&gt; raw custom text simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkdo3urlx7xmhx3pu9th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkdo3urlx7xmhx3pu9th.png" alt="context engine dashboard" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native Discord Deployment:&lt;/strong&gt; Built a dedicated API route and UI webhook integration to push generated content directly to Discord servers with one click. A sketch of the route follows below.&lt;/li&gt;
&lt;/ul&gt;
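
&lt;p&gt;Here is a minimal sketch of that route. Discord webhooks accept a plain JSON body with a &lt;code&gt;content&lt;/code&gt; field; the route path and error handling are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// app/api/discord/route.ts (sketch). Discord webhooks accept a plain JSON
// body with a "content" field; the route path and shape are assumptions.
export async function POST(req: Request) {
  const { webhookUrl, content } = await req.json();

  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content }),
  });

  if (!res.ok) {
    return Response.json({ error: "Discord rejected the post" }, { status: 502 });
  }
  return Response.json({ ok: true });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;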

&lt;h2&gt;
  
  
  4. UI/UX Updates &amp;amp; Professional Branding
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Rebrand:&lt;/strong&gt; Pivoted the app's messaging to focus entirely on content professionals, positioning it as an engine to generate social media content with ease and in your own voice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open-First Onboarding:&lt;/strong&gt; Designed a "Try Before You Buy" workflow. Unauthenticated users can test the AI generation seamlessly, but are gated from premium features (History, Personas, Discord) via an Upgrade Banner.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyaxoafb4dsuh3dqtt89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyaxoafb4dsuh3dqtt89.png" alt="guest mode" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pixel-Perfect Layouts &amp;amp; SEO:&lt;/strong&gt; Eliminated rogue whitespace and &lt;code&gt;z-index&lt;/code&gt; issues using precise CSS Flexbox rules. Upgraded &lt;code&gt;app/layout.tsx&lt;/code&gt; with professional OpenGraph and Twitter Card metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqbf31p44p1p5e5sjyj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqbf31p44p1p5e5sjyj4.png" alt="ozigi homepage" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Quality Assurance &amp;amp; DevOps (Automated Playwright E2E Tests)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated E2E Testing:&lt;/strong&gt; Completely rewrote the Playwright test suite (&lt;code&gt;engine.spec.ts&lt;/code&gt;) to verify the new landing page copy, test the navigation flow, and confirm security rules apply correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linux Dependency Fixes:&lt;/strong&gt; Patched my CI/CD pipeline by ensuring underlying Linux browser dependencies (&lt;code&gt;--with-deps&lt;/code&gt;) are installed so headless Chromium tests pass flawlessly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next? (v3 Roadmap)
&lt;/h2&gt;

&lt;p&gt;With the Context Engine now stable, the foundation is set. &lt;br&gt;
My plan for V3 is to fix the deployment pipeline: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;integrating the native X (Twitter) API, and&lt;/li&gt;
&lt;li&gt;the LinkedIn API, so users can publish directly from the Ozigi dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;What has been your biggest challenge scaling a Next.js MVP? Let me know in the comments!&lt;/em&gt;&lt;br&gt;
Try out &lt;a href="https://blogger-helper-tau.vercel.app/" rel="noopener noreferrer"&gt;Ozigi&lt;/a&gt; and let me know if you have any feature suggestions!&lt;br&gt;
Want to see my poorly written code? Find &lt;a href="https://github.com/Dumebii/OziGi" rel="noopener noreferrer"&gt;OziGi on Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="//www.linkedin.com/in/dumebi-okolo"&gt;LinkedIn&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;What came next:&lt;br&gt;
After shipping v2, the next hard question was model selection. A reader suggested switching to Claude for better content quality. I ran the benchmarks instead of just taking the advice. The results across JSON stability, latency, multimodal ingestion, and tone were clearer than I expected: &lt;a href="https://dev.to/dumebii/gemini-25-flash-vs-claude-37-sonnet-4-production-constraints-that-made-the-decision-for-me-bib"&gt;Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>showdev</category>
      <category>nextjs</category>
      <category>playwright</category>
    </item>
    <item>
      <title>I vibe-coded an internal tool that slashed my content workflow by 4 hours</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Fri, 27 Feb 2026 14:52:17 +0000</pubDate>
      <link>https://forem.com/dumebii/i-vibe-coded-an-internal-tool-that-slashed-my-content-workflow-by-4-hours-310f</link>
      <guid>https://forem.com/dumebii/i-vibe-coded-an-internal-tool-that-slashed-my-content-workflow-by-4-hours-310f</guid>
      <description>&lt;p&gt;One of the biggest challenges I face as a content expert is repurposing my written blogs for social media. Before now, I had to ask AI for summaries or try to get them myself. I became very busy recently, and I don't have time for that anymore. &lt;br&gt;
The best solution for me was building a tool that helps me generate social media content from my blog and posts on my behalf. &lt;br&gt;
I was in a meeting of content professionals recently. A key point that was hammered home regarding the use of AI in content creation is the need to maintain a strict Human-in-the-Loop (HITL) workflow. &lt;br&gt;
This resonated well with me. &lt;br&gt;
I had initially planned to build an agent to automate and schedule social media posts. This, however, leaves out the HITL factor, so I restrategized. &lt;/p&gt;

&lt;p&gt;Here is the technical breakdown of how I built an Agentic Content Engine using Next.js 15, Gemini 3.1 Pro, and Discord Webhooks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Agentic Human-in-the-Loop (HITL) architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: The "Context Gap"&lt;/strong&gt;&lt;br&gt;
Most AI social media tools are just wrappers for generic prompts. They don't know my research, they don't know my voice, and they definitely don't know the technical nuances of my articles.&lt;br&gt;
So,&lt;br&gt;
I needed a tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reads my actual dev.to articles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strategizes a 3-day multi-platform campaign.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Displays it in a way that I can audit, edit, and then—with one click—Deploy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though this app was "vibe coded" (shoutout to the AI for keeping up with my pivots 😂😂), the architecture is solid.&lt;/p&gt;

&lt;p&gt;The core philosophy of this build is Agency over Automation. The agent doesn't just act; it reasons, structures, and then waits for human approval before posting.&lt;/p&gt;
&lt;h3&gt;
  
  
  The AI Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Engine:&lt;/strong&gt; Gemini 3.1 Pro (Tier 1 Billing). I opted for Pro over Flash to handle complex instruction following and strict JSON schema enforcement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js 15 (App Router) for server-side rendering and SEO efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Styling:&lt;/strong&gt; Tailwind CSS with &lt;code&gt;@tailwindcss/typography&lt;/code&gt; for professional markdown rendering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Discord Webhooks for an immediate, zero-auth execution pipeline.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Handling AI Hallucinations in Next.js
&lt;/h2&gt;

&lt;p&gt;A common failure in vibe coding, I have found, is the LLM returning "chatty" text when the UI expects structured data. &lt;br&gt;
To solve this, I implemented a Strict JSON Enforcement pattern in the API route.&lt;/p&gt;

&lt;p&gt;Gemini often wraps its JSON output in markdown code blocks. If you pass this directly to &lt;code&gt;JSON.parse()&lt;/code&gt;, the app crashes.&lt;/p&gt;

&lt;p&gt;To solve this, I used &lt;em&gt;Sanitization Middleware.&lt;/em&gt;&lt;br&gt;
I built a regex-based sanitization layer to strip the noise and ensure the frontend receives a clean array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/api/generate/route.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rawOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// The raw string from Gemini&lt;/span&gt;

&lt;span class="c1"&gt;// Regex to extract only the JSON content&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cleanJson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rawOutput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/g, "").trim();

try {
  const campaignData = JSON.parse(cleanJson);
  return NextResponse.json({ campaign: campaignData.campaign });
} catch (error) {
  console.error("JSON Parsing failed:", rawOutput);
  return NextResponse.json({ error: "Failed to parse Agent strategy" }, { status: 500 });
}

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  UI/UX Strategy: The Kanban "Board" Approach
&lt;/h2&gt;

&lt;p&gt;The v1 of the UI was so messy. The tool worked but you'd have to dig through mountains of text to even understand what was going on. &lt;br&gt;
I tried formatting it into a table for some structure. Somehow, that was worse! &lt;br&gt;
Finally, to optimize for a &lt;strong&gt;"Human-in-the-Loop"&lt;/strong&gt; workflow, I moved to a columnar dashboard.&lt;br&gt;
Social posts, especially threads on X, can be long, and that would have made even the boards clumsy and unkempt. &lt;br&gt;
To keep the UI clean, I built a &lt;code&gt;PostCard&lt;/code&gt; component that caps content at &lt;strong&gt;250 characters&lt;/strong&gt; with a state-managed "Read More" toggle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;isExpanded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setIsExpanded&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;displayContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;isExpanded&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the user can audit the text without scrolling for "miles."&lt;/p&gt;




&lt;h2&gt;
  
  
  Photo dump: Agentic Content Flow in Action
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Starting Point
Here’s the clean, minimal dashboard before the magic happens. I wanted it to feel like a professional "Command Centre," not a messy chatbot window.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz667kmrpd5nboj21qwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz667kmrpd5nboj21qwt.png" alt="homepage" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The 3-Day Campaign Map
Once I paste my URL, the Agent goes to work. It returns a structured 3x3 grid. I added a 250-character truncation with a "Read More" toggle because, let's face it, nobody wants a wall of text when they're trying to strategise.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjmvcttzzxt193z17459.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjmvcttzzxt193z17459.png" alt="content generation" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;The Deployment
Here is the best part. I hit "Post to Discord," and boom—success. No manual copy-pasting, no switching tabs. It’s live.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hrm5rdokebrklk6a8lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hrm5rdokebrklk6a8lr.png" alt="posted to discord success message" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysle2meczj9ykxpthcgh.png" alt="discord success" width="800" height="403"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This is what I have built so far. I am calling it BloggerHelper v1.&lt;br&gt;
My next updates are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Integrating the X and LinkedIn features.&lt;/li&gt;
&lt;li&gt;Putting more work into the context tank. So far, the agent's context has been obtained from the article and some instructions in the agents_instruction.md file. I will work more on this.&lt;/li&gt;
&lt;li&gt;Adding an edit feature, so I can edit a post before it goes out.&lt;/li&gt;
&lt;li&gt;Making it take in more context than just my blog posts.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion: The Engineering of Presence
&lt;/h2&gt;

&lt;p&gt;Even though this tool was designed to help me cut down on work hours, it was also meant to take me from being just a technical writer to being a content engineer/architect, whose primary goal isn't just to create content but to build solutions that make for an easy content flow.&lt;br&gt;
Also, as I position myself as an AI influencer, I want to show myself building more with AI and evangelising its adoption.&lt;/p&gt;

&lt;p&gt;Let's connect on &lt;a href="//www.linkedin.com/in/dumebi-okolo"&gt;LinkedIn&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s your take on Agentic Workflows? Are you building for full automation, or are you keeping the human in the loop?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s discuss below. 👇&lt;/p&gt;

&lt;h3&gt;
  
  
  UPDATE!!!!
&lt;/h3&gt;

&lt;p&gt;I just used my tool to get my social media caption/content for this post. See below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sjw1nvyg84389ofynqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sjw1nvyg84389ofynqq.png" alt="am content generator" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can try it out &lt;a href="https://3nz2kx-3000.csb.app/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but have mercy on my API credits!! &lt;/p&gt;

&lt;p&gt;UPDATE 2 — March 2026:&lt;br&gt;
Several people in the comments asked about forcing structured JSON output without the regex sanitisation layer. I ended up going deep on this for Ozigi v3. The answer is responseSchema via the Vertex AI SDK — it enforces structure at the decoding layer, not the prompt level. I benchmarked it alongside Claude 3.7 Sonnet across four production constraints. The full write-up, with numbers, is here: &lt;a href="https://dev.to/dumebii/gemini-25-flash-vs-claude-37-sonnet-4-production-constraints-that-made-the-decision-for-me-bib"&gt;Gemini 2.5 Flash vs Claude 3.7 Sonnet: 4 Production Constraints That Made the Decision for Me&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>nextjs</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Using Perplexity AI and Gemini 3 (Pro) for Academic Research and Writing</title>
      <dc:creator>Dumebi Okolo</dc:creator>
      <pubDate>Thu, 19 Feb 2026 15:07:41 +0000</pubDate>
      <link>https://forem.com/dumebii/using-perplexity-ai-and-gemini-3-pro-for-academic-research-cji</link>
      <guid>https://forem.com/dumebii/using-perplexity-ai-and-gemini-3-pro-for-academic-research-cji</guid>
      <description>&lt;p&gt;I’m currently in the trenches of my Master’s thesis, focusing on &lt;strong&gt;5G Anomaly Detection using TensorFlow Lite at the Edge&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I wrote a paper on &lt;a href="https://www.globalscientificjournal.com/researchpaper/EDGE_DEPLOYABLE_TENSORFLOW_LITE_AUTOENCODER_FOR_REAL_TIME_5G_ANOMALY_DETECTION_AND_COST_AWARE_OPTIMIZATION.pdf" rel="noopener noreferrer"&gt;EDGE-DEPLOYABLE TENSORFLOW LITE AUTOENCODER FOR REAL-TIME 5G ANOMALY DETECTION AND COST-AWARE OPTIMIZATION&lt;/a&gt; that you can check out. &lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post is part of my short-form content series, where I write straight-to-the-point blog posts of less than 1,000 words.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Before building my AI research workflow, I used to spend hours just "pre-reading," trying to build the literature review section of my thesis.&lt;/p&gt;

&lt;p&gt;Not anymore!&lt;br&gt;
I built my own "Research Stack" with already existing AI tools that does all the heavy lifting for me in a matter of minutes. &lt;/p&gt;

&lt;p&gt;I don’t use just one tool. I use an &lt;a href="https://graygrids.com/blog/ai-aggregators-multiple-models-platform" rel="noopener noreferrer"&gt;AI aggregator&lt;/a&gt; and an &lt;a href="//gemini.google.com"&gt;AI-native Pro model&lt;/a&gt; together.&lt;/p&gt;


&lt;h2&gt;
  
  
  Perplexity is the AI Aggregator
&lt;/h2&gt;

&lt;p&gt;Many people, like me before making this discovery, think of &lt;a href="//perplexity.ai"&gt;Perplexity&lt;/a&gt; as just a model; it’s actually more of a "librarian." &lt;br&gt;
It doesn't just rely on its own model; it uses some of the best in the industry—Claude 4, GPT-5, and Gemini 3—to scour the web and find citations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh74p2vo5d4un556ni8g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh74p2vo5d4un556ni8g5.png" alt="perplexity models" width="448" height="599"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.perplexity.ai/hub/blog/meet-new-sonar" rel="noopener noreferrer"&gt;Sonar&lt;/a&gt; is Perplexity's own model.&lt;/p&gt;

&lt;p&gt;I've come to learn that Perplexity is the "king" of finding where the information is. &lt;/p&gt;

&lt;p&gt;However, when it comes to understanding/making sense of the 20 or so PDFs I just found? That’s where the "Aggregator" model hits a wall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x1s394ghc1ebvjfauj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x1s394ghc1ebvjfauj2.png" alt="perplexity ai interface" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Native Pro Advantage (Gemini Advanced)
&lt;/h2&gt;

&lt;p&gt;Because I have a &lt;strong&gt;Gemini Pro&lt;/strong&gt; subscription, I have access to something Perplexity’s implementation can’t match: Gemini's &lt;a href="https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/" rel="noopener noreferrer"&gt;2-million-token context window&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While Perplexity gives me snippets and links, I can feed those entire PDFs or papers it gives me into Gemini Pro. &lt;br&gt;
This way, Gemini doesn't just look up the research papers; it "lives" in them. &lt;br&gt;
That is, it remembers a conflict in data on page 4 and compares it to a conclusion on page 48.&lt;/p&gt;
&lt;h2&gt;
  
  
  My Research Workflow
&lt;/h2&gt;

&lt;p&gt;Here is exactly how I use Perplexity AI and Google Gemini to speed up my thesis research:&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 1: Using Perplexity to find research papers and material
&lt;/h3&gt;

&lt;p&gt;I ask Perplexity to find the most recent 2026 papers on Federated Learning in 5G. It gives me URLs and citations.&lt;br&gt;
Here's an example of my prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Find the top 5 most cited research papers from late 2025 and 2026 regarding 'Anomaly Detection in 5G Core Networks using Federated Learning.' Provide the direct URLs and a 2-sentence summary of their core methodology

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 2: Using Gemini Pro to go through research materials
&lt;/h3&gt;

&lt;p&gt;I download those papers and upload them to Gemini and use it for things like comparing, reasoning, or critiquing. &lt;br&gt;
Here's an example prompt I've used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I have these 5 research papers [Paste links/sources]. Using your 2M token context, analyze how these papers address the 'latency vs. accuracy' trade-off in Edge computing. Then, draft a 1,000-word skeleton for my literature review that explains why AI automation is the solution to 5G network failures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 3: Direct editing in the Google Docs Workspace
&lt;/h3&gt;

&lt;p&gt;Since Gemini is integrated with my Google Workspace, I edit the literature review draft directly into a Google Doc.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Comparison: Perplexity AI vs Google Gemini for Research
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Perplexity (The Librarian)&lt;/th&gt;
&lt;th&gt;Gemini Pro (The Architect)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Strength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time search &amp;amp; citations.&lt;/td&gt;
&lt;td&gt;Massive context &amp;amp; reasoning.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aggregator (Claude, GPT, Gemini).&lt;/td&gt;
&lt;td&gt;Native (Google's best).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Small (Snippet-based).&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2M+ Tokens&lt;/strong&gt; (Entire libraries).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For...&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Finding "The What" &amp;amp; URLs.&lt;/td&gt;
&lt;td&gt;Analyzing "The How" &amp;amp; Drafting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web-only.&lt;/td&gt;
&lt;td&gt;Google Workspace (Docs/Gmail).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;What I have learned in my AI use is that looking for the one tool that does everything leads to failure or inaccuracies. &lt;br&gt;
I prefer a "separation of concerns" type of workflow, which leads to better accuracy.&lt;br&gt;
This only works, though, when you know how to build the right stack for your workflow and how to work your way around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you still using a single LLM for your research, or have you started "stacking" your tools? Let's discuss in the comments!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can find me on &lt;a href="//www.linkedin.com/in/dumebi-okolo"&gt;LinkedIn!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
