<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aakash Gour</title>
    <description>The latest articles on Forem by Aakash Gour (@aakash_gour).</description>
    <link>https://forem.com/aakash_gour</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3861923%2Ffa03288d-9c97-4d68-ab57-2485fc056a66.jpg</url>
      <title>Forem: Aakash Gour</title>
      <link>https://forem.com/aakash_gour</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aakash_gour"/>
    <language>en</language>
    <item>
      <title>I Tested OpenAI, Anthropic, and Cohere for Bulk Content Generation. Here's What the Data Actually Shows.</title>
      <dc:creator>Aakash Gour</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:30:05 +0000</pubDate>
      <link>https://forem.com/aakash_gour/i-tested-openai-anthropic-and-cohere-for-bulk-content-generation-heres-what-the-data-actually-17j2</link>
      <guid>https://forem.com/aakash_gour/i-tested-openai-anthropic-and-cohere-for-bulk-content-generation-heres-what-the-data-actually-17j2</guid>
      <description>&lt;p&gt;My content pipeline needed to process 10,000 articles a month.&lt;br&gt;
I had three serious API options: OpenAI, Anthropic, and Cohere.&lt;/p&gt;

&lt;p&gt;Every comparison article I found online was either two years old, benchmarked on toy examples, or written by someone with a vendor relationship. So I ran my own.&lt;/p&gt;

&lt;p&gt;Three weeks, 4,200 test requests, one specific use case: bulk content generation at production scale. Here's what happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Evaluation Criteria (And Why These Specific Metrics)&lt;/strong&gt;&lt;br&gt;
Before I get to the numbers, let me be clear about what I was optimizing for. "Best LLM API" is a meaningless question; "best for my use case" is what I cared about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Output quality on structured content&lt;/strong&gt; — I needed articles with consistent heading structure, tone, and word count. Not just fluent text.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost per 1,000 words&lt;/strong&gt; — At 10K articles/month, a $0.002 difference per article is $20/month. A $0.02 difference is $200/month.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency (p50 and p95)&lt;/strong&gt; — The p95 matters more than the p50 for bulk work. One slow request holds up a queue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instruction adherence&lt;/strong&gt; — If I say "use h2 headers, not h3," does it actually do that across 1,000 requests? Or does it drift?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error rate over volume&lt;/strong&gt; — Rate-limit errors, context errors, malformed responses. What breaks at scale?&lt;/li&gt;
&lt;/ul&gt;
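&lt;p&gt;To sanity-check numbers like these against your own workload, a back-of-envelope estimator is enough. The token-per-word ratio and the per-million prices in this sketch are illustrative assumptions, not figures from my test; plug in your provider's current pricing:&lt;/p&gt;

```javascript
const TOKENS_PER_WORD = 1.33; // rough English average; an assumption, not measured

function costPerArticle(words, promptTokens, inPricePerM, outPricePerM) {
  const completionTokens = Math.round(words * TOKENS_PER_WORD);
  return (promptTokens * inPricePerM + completionTokens * outPricePerM) / 1e6;
}

// e.g. a 600-word article with a 300-token prompt, at hypothetical
// $2.50 input / $10.00 output per 1M tokens
const perArticle = costPerArticle(600, 300, 2.5, 10.0); // ~$0.0087
const perMonth = perArticle * 10_000;                   // ~$87 at 10K articles/month
```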

&lt;p&gt;I didn't test: coding tasks, reasoning, math, or anything multimodal. Those benchmarks exist everywhere. This one doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Test Setup&lt;/strong&gt;&lt;br&gt;
Same prompt, same word count target, same structural requirements, across all three providers. I wrote a simple Node.js harness to run the tests and log results to a SQLite database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CohereClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cohere-ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;better-sqlite3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;benchmark.db&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
  CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    provider TEXT,
    model TEXT,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    latency_ms INTEGER,
    cost_usd REAL,
    heading_count INTEGER,
    word_count INTEGER,
    error TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
  )
`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;runBenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;generateFn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Count h2 headings to measure instruction adherence&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^## /gm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
      INSERT INTO results (provider, model, prompt_tokens, completion_tokens, 
                           latency_ms, cost_usd, heading_count, word_count)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    `&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;promptTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completionTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;headings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;words&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;latency&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
      INSERT INTO results (provider, model, error) VALUES (?, ?, ?)
    `&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran 1,400 requests against each provider — enough to get stable percentiles and surface intermittent errors. The prompt asked for a 600-word article with exactly 3 ## sections and a specific tone. Straightforward structural requirements. The kind of thing you'd run 10,000 times.&lt;/p&gt;
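&lt;p&gt;The p50 and p95 figures that follow come from a nearest-rank percentile over the logged latency samples. A minimal sketch of that calculation (not the harness's exact code):&lt;/p&gt;

```javascript
// Nearest-rank percentile: sort, then pick the value at the p-th rank.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = [850, 920, 1100, 4200, 980]; // ms, illustrative values only
const p50 = percentile(latencies, 50); // 980
const p95 = percentile(latencies, 95); // 4200 -- one outlier dominates the tail
```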

&lt;p&gt;&lt;strong&gt;The Numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v2e1a3tk2ok7m9gcfm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v2e1a3tk2ok7m9gcfm4.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 10,000 articles a month (averaging 600 words each), the cost difference between gpt-4o and gpt-4o-mini is roughly $167/month. That's not nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4jo73yvust0v8lt3gyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4jo73yvust0v8lt3gyc.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The p95 numbers are where things get interesting. Cohere's command-r-plus had the highest p95 latency by a significant margin — nearly 2x OpenAI's gpt-4o and almost 4x Anthropic's claude-haiku-4-5. For synchronous use cases this would be painful. For queued bulk generation it's manageable, but you need to account for it in your timeout settings.&lt;/p&gt;

&lt;p&gt;Claude Haiku had the best p95 of any capable model. If latency matters more than cost in your use case, that's worth noting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instruction Adherence&lt;/strong&gt;&lt;br&gt;
This is the metric nobody else was measuring. I asked for exactly 3 ## sections in every request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn54ef8xmbbkd48tvsixn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn54ef8xmbbkd48tvsixn.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Sonnet followed structural instructions the most consistently. This surprised me — I expected the output quality gap between Sonnet and Haiku to be larger than the instruction adherence gap. It wasn't. Haiku drifted noticeably more on structure.&lt;/p&gt;

&lt;p&gt;Cohere's models had the most drift. command-r would frequently add extra sections or collapse two sections into one. For casual content this is fine. For template-driven content pipelines where downstream parsing depends on consistent structure, it's a problem.&lt;/p&gt;
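&lt;p&gt;Adherence here is simply the fraction of outputs whose heading count matches the request. Against the logged outputs, that's a few lines (a sketch reusing the same heading regex as the harness):&lt;/p&gt;

```javascript
// Fraction of outputs containing exactly the requested number of "## " sections.
function adherenceRate(outputs, expectedSections) {
  const compliant = outputs.filter(
    (text) => (text.match(/^## /gm) || []).length === expectedSections
  ).length;
  return compliant / outputs.length;
}

const sample = [
  "## One\nbody\n## Two\nbody\n## Three\nbody", // 3 sections: compliant
  "## One\nbody\n## Two\nbody",                 // drifted: only 2 sections
];
adherenceRate(sample, 3); // 0.5 for this toy sample
```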

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuw52lip6zh1g5bc7lub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhuw52lip6zh1g5bc7lub.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 10,000 requests/month, a 2.8% error rate means 280 failed generations that need retries. That's not catastrophic, but it's a cost: retry logic, queue overhead, and the occasional job that fails three times and needs manual intervention.&lt;/p&gt;
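&lt;p&gt;If failures were independent (they aren't quite; rate-limit errors cluster), the retry math would look like this. Treat the second number as a floor, not a promise:&lt;/p&gt;

```javascript
// Expected number of jobs still failing after `attempts` tries, assuming
// each attempt fails independently with probability `errorRate`. Real
// failures correlate, so this understates the worst case.
function expectedStillFailing(jobs, errorRate, attempts) {
  return jobs * Math.pow(errorRate, attempts);
}

expectedStillFailing(10_000, 0.028, 1); // 280 first-pass failures
expectedStillFailing(10_000, 0.028, 3); // ~0.22 jobs dead after three attempts
```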

&lt;p&gt;OpenAI and Anthropic both had error rates under 1% in my test. Cohere's error rate was high enough that I'd budget for retry infrastructure before relying on it at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output Quality: The Part That's Hard to Put in a Table&lt;/strong&gt;&lt;br&gt;
I spot-checked 150 outputs across providers — 50 per provider, sampled across model tiers. I evaluated them on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tone consistency with the prompt&lt;/li&gt;
&lt;li&gt;Logical flow between sections&lt;/li&gt;
&lt;li&gt;Avoidance of filler phrases ("In conclusion...", "It's important to note...")&lt;/li&gt;
&lt;li&gt;Whether the content was actually useful or just plausible-sounding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The honest assessment:&lt;/strong&gt;&lt;br&gt;
Claude Sonnet produced the most editable drafts. The structure was clean, the tone held throughout, and it was less prone to the kind of filler-heavy conclusions that make AI content feel generic. If I were generating content that humans would lightly edit before publishing, Sonnet gave editors the least work.&lt;/p&gt;

&lt;p&gt;GPT-4o was close behind. Slightly more verbose, occasionally padded, but strong structural instincts and good default tone. If you're already in the OpenAI ecosystem and using the Assistants API, there's no compelling reason to switch just for content generation.&lt;/p&gt;

&lt;p&gt;Claude Haiku surprised me on quality given its cost. The outputs weren't Sonnet-level, but they were significantly better than I expected from a model at that price point. For high-volume, lower-stakes content (product tags, meta descriptions, brief blurbs), Haiku is underrated.&lt;/p&gt;

&lt;p&gt;Cohere command-r-plus had the most inconsistent quality. Some outputs were excellent. Others had structural problems or tonal drift mid-article. For human-reviewed content pipelines this is manageable. For automated pipelines where content goes straight to a CMS, the variance is a real issue.&lt;/p&gt;

&lt;p&gt;GPT-4o-mini was fine. Not inspiring. Solid enough for use cases where you're generating high volumes of content that gets human review anyway. At its price point, the quality-per-dollar ratio is hard to beat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Retry Handler I Ended Up Writing&lt;/strong&gt;&lt;br&gt;
Every provider needs retry logic. Here's the one I landed on after testing various approaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateWithRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;generateFn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;maxRetries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;baseDelay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;maxDelay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isRateLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rate limit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;too many requests&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isRetryable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;isRateLimit&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isRetryable&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;// Exponential backoff with jitter&lt;/span&gt;
      &lt;span class="c1"&gt;// Without jitter, retrying clients hit the API in waves and cause more rate limits&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exponential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseDelay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;exponential&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exponential&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxDelay&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`Attempt &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;). Retrying in &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;ms...`&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The jitter is important. I learned this the hard way: without it, rate-limited requests all retry at the same interval, which creates another burst that triggers another rate limit. Jitter spreads the retry load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Can Go Wrong (And Did)&lt;/strong&gt;&lt;br&gt;
Rate limits hit differently at scale than in testing. My test harness ran requests at a controlled rate. In production, queue drains aren't that clean — you get bursts when a lot of jobs land at once. I hit OpenAI rate limits in production that I never hit in testing because of this. Solution: implement a token bucket limiter, not just a fixed delay between requests.&lt;/p&gt;

&lt;p&gt;Instruction adherence degrades with longer prompts. The 3-section test used a clean, short prompt. When I added more context (brand guidelines, examples, negative constraints), adherence dropped across all models. Claude Sonnet held up best under prompt complexity. GPT-4o-mini degraded the most.&lt;/p&gt;

&lt;p&gt;Cohere's failure modes don't always surface as errors. A few of my requests hit a content-filtering response that wasn't a standard API error — it returned a 200 with a specific response body structure. My generic error handler missed it and logged a successful request with garbled output. Read Cohere's error documentation more carefully than I did.&lt;/p&gt;
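&lt;p&gt;The defensive pattern that would have caught it: validate the response body even when the status is 200. The field names below (text, finishReason) and the thresholds are illustrative, not Cohere's exact schema; check your provider's actual response shape:&lt;/p&gt;

```javascript
// Treat a 200 with a suspicious body as a failure. Field names are
// hypothetical placeholders -- adapt to the real response schema.
function validateGeneration(response, minWords = 50) {
  if (!response || typeof response.text !== "string") {
    return { ok: false, reason: "missing text field" };
  }
  // Tolerate providers that omit a finish reason entirely
  const finish = response.finishReason || "COMPLETE";
  if (finish !== "COMPLETE") {
    return { ok: false, reason: "non-complete finish reason: " + finish };
  }
  const wordCount = response.text.trim().split(/\s+/).length;
  if (wordCount >= minWords) {
    return { ok: true };
  }
  return { ok: false, reason: "suspiciously short output" };
}
```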

&lt;p&gt;Cold starts are real with Anthropic's API. A small percentage of Haiku requests (roughly 2-3% in my data) had latencies 3-5x higher than normal. Not errors — just slow. I don't know if this is model loading, infrastructure, or something else, but it showed up consistently enough to affect the p95.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Recommendation (Conditional, As It Should Be)&lt;/strong&gt;&lt;br&gt;
For high-volume, cost-sensitive generation where content gets human review: &lt;em&gt;gpt-4o-mini or command-r.&lt;/em&gt; The cost savings are significant. The quality gap is real but acceptable if humans are in the loop.&lt;/p&gt;

&lt;p&gt;For high-volume, automated pipelines where structure consistency matters: &lt;em&gt;claude-haiku-4-5.&lt;/em&gt; Best p95 latency, solid instruction adherence, reasonable cost. The quality is better than the price suggests.&lt;/p&gt;

&lt;p&gt;For lower-volume, higher-quality generation that feeds into editorial workflows: &lt;em&gt;claude-sonnet-4-5.&lt;/em&gt; The instruction adherence and output editability are worth the cost premium when you're generating content that humans will touch.&lt;/p&gt;

&lt;p&gt;For Cohere: If you have a specific reason to use it (enterprise contract, data residency requirements, a use case where Command-R performs unusually well for your specific domain), fine. For general content generation benchmarked against my criteria, it didn't compete with the OpenAI and Anthropic options.&lt;/p&gt;

&lt;p&gt;One thing I'd do differently: I didn't test Anthropic's prompt caching for repeated system prompts. For bulk content generation where you're sending the same lengthy system prompt with each request, caching can significantly reduce input token costs. That's the next benchmark I'm running.&lt;/p&gt;
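&lt;p&gt;The potential savings are easy to estimate. The 90% cached-read discount below is an assumption for illustration (check Anthropic's current pricing), and the sketch ignores the one-time cache-write premium on the first request:&lt;/p&gt;

```javascript
// Rough input-token savings from caching a repeated system prompt.
// readDiscount = 0.9 is an assumed discount, not a quoted price.
function cachingSavings(requests, systemTokens, inPricePerM, readDiscount = 0.9) {
  const withoutCache = (requests * systemTokens * inPricePerM) / 1e6;
  const withCache = withoutCache * (1 - readDiscount);
  return { withoutCache, withCache, saved: withoutCache - withCache };
}

// e.g. a 2,000-token system prompt sent 10,000 times at an assumed $3.00/1M input rate
cachingSavings(10_000, 2_000, 3.0);
```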

&lt;p&gt;What's your use case? I'm curious whether the instruction adherence numbers match what others are seeing — especially if you're doing high-volume structured generation. Different domains might surface different failure modes than content generation did.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>api</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>How I Built a Keyword-to-Blog-Post Pipeline in Python (Under 50 Lines)</title>
      <dc:creator>Aakash Gour</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:24:34 +0000</pubDate>
      <link>https://forem.com/aakash_gour/how-i-built-a-keyword-to-blog-post-pipeline-in-python-under-50-lines-264c</link>
      <guid>https://forem.com/aakash_gour/how-i-built-a-keyword-to-blog-post-pipeline-in-python-under-50-lines-264c</guid>
      <description>&lt;p&gt;I had a list of 40 keywords and needed a blog post for each one.&lt;br&gt;
Writing them manually would take two weeks.&lt;br&gt;
Writing a script to generate them took one afternoon — and 47 lines of Python.&lt;br&gt;
Here's exactly how I built it.&lt;/p&gt;

&lt;p&gt;This isn't a tutorial about AI being magic. It's a tutorial about the specific, unsexy plumbing you need to turn a keyword into a structured, usable blog post — with retry logic, output formatting, and a folder of &lt;em&gt;.md&lt;/em&gt; files you can actually work with.&lt;/p&gt;

&lt;p&gt;By the end of this, you'll have a working script that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes a list of keywords from a &lt;em&gt;.txt&lt;/em&gt; file&lt;/li&gt;
&lt;li&gt;Sends a structured prompt to the OpenAI API&lt;/li&gt;
&lt;li&gt;Parses and saves each response as a Markdown file&lt;/li&gt;
&lt;li&gt;Handles rate limit errors without crashing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this is harder than it looks&lt;/strong&gt;&lt;br&gt;
The naive version is 5 lines:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmzz154k8p4rftwurucv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmzz154k8p4rftwurucv.png" alt=" " width="800" height="172"&gt;&lt;/a&gt;&lt;br&gt;
That works exactly once, in a Jupyter notebook, for a demo.&lt;br&gt;
In practice, you hit three problems immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Rate limits. OpenAI's default tier for &lt;em&gt;gpt-4o&lt;/em&gt; is 3 requests per minute. Try to fire 40 at once and you'll get a &lt;em&gt;RateLimitError&lt;/em&gt; on request 4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unstructured output. "Write a blog post" gets you anything from a 200-word paragraph to a 3,000-word essay with inconsistent headers. If you're using this content anywhere, you need predictable structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No persistence. If the script crashes on keyword 22, you've lost the first 21. You need to write each output to disk as it completes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 47-line version handles all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;An OpenAI API key&lt;/li&gt;
&lt;li&gt;openai library: &lt;em&gt;pip install openai&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No frameworks, no databases, no Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup&lt;/strong&gt;&lt;br&gt;
Create this file structure:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrq5s6x3kiksn7cw3s0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrq5s6x3kiksn7cw3s0l.png" alt=" " width="800" height="126"&gt;&lt;/a&gt;&lt;br&gt;
Your &lt;em&gt;keywords.txt&lt;/em&gt; should look like this — one keyword per line:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yd9pn8vqhnxe0wjwpdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yd9pn8vqhnxe0wjwpdw.png" alt=" " width="753" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The script&lt;/strong&gt;&lt;br&gt;
Here's &lt;em&gt;generate.py&lt;/em&gt; in full:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import time
from pathlib import Path
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

PROMPT_TEMPLATE = """Write a blog post about: "{keyword}"

Use this exact structure:

# [Title]

## Introduction
[2-3 sentences introducing the topic]

## [Section 1 heading]
[3-4 sentences]

## [Section 2 heading]
[3-4 sentences]

## [Section 3 heading]
[3-4 sentences]

## Conclusion
[2-3 sentences wrapping up with a practical takeaway]

Tone: conversational and practical. Avoid fluff. Total length: ~400 words."""

def slugify(keyword: str) -&amp;gt; str:
    return keyword.lower().replace(" ", "-").replace("/", "-")

def generate_post(keyword: str, retries: int = 3) -&amp;gt; str:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(keyword=keyword)}],
                max_tokens=700,
                temperature=0.7,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 20 * (attempt + 1)  # back off: 20s, 40s, 60s
            print(f"  Rate limit hit. Waiting {wait}s before retry {attempt + 1}/{retries}...")
            time.sleep(wait)
    raise RuntimeError(f"Failed to generate post for '{keyword}' after {retries} retries.")

def run(keywords_file: str = "keywords.txt"):
    keywords = Path(keywords_file).read_text().strip().splitlines()
    print(f"Processing {len(keywords)} keywords...\n")

    for i, keyword in enumerate(keywords, 1):
        output_path = OUTPUT_DIR / f"{slugify(keyword)}.md"

        if output_path.exists():
            print(f"[{i}/{len(keywords)}] Skipping '{keyword}' — already generated.")
            continue

        print(f"[{i}/{len(keywords)}] Generating: '{keyword}'")
        content = generate_post(keyword)
        output_path.write_text(content, encoding="utf-8")

        # OpenAI free tier: ~3 requests/minute. This keeps us just under.
        time.sleep(22)

    print("\nDone. Check the output/ folder.")

if __name__ == "__main__":
    run()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What each part is actually doing&lt;/strong&gt;&lt;br&gt;
The prompt template is doing the real work here. "Write a blog post" is too open-ended — the model will vary wildly in length and structure. The template locks in a specific H2 structure, a word count target, and a tone instruction. Your output becomes predictable enough to actually use.&lt;/p&gt;

&lt;p&gt;The retry loop in &lt;em&gt;generate_post&lt;/em&gt; backs off linearly (20s, 40s, 60s) because OpenAI's rate limit errors are temporary. Most of the time, waiting 20 seconds is enough. The loop means the script keeps running instead of crashing and forcing you to restart.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;if output_path.exists(): continue&lt;/em&gt; check is the most important line for long runs. If you're processing 100 keywords and the script dies at #73, you don't want to regenerate the first 72. This check skips already-completed files and resumes from where you left off.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;time.sleep(22)&lt;/em&gt; at the bottom of the loop is tuned for the free tier rate limit. If you're on a paid OpenAI tier with higher limits, you can drop this to &lt;em&gt;time.sleep(2)&lt;/em&gt; or remove it entirely and let the retry logic handle any occasional errors.&lt;/p&gt;
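&lt;p&gt;If you'd rather derive the delay from whatever requests-per-minute limit your tier actually gives you, a small helper does it (a sketch; &lt;em&gt;pacing_delay&lt;/em&gt; is not part of the original 47-line script):&lt;/p&gt;

```python
def pacing_delay(requests_per_minute: int, safety_margin: float = 0.1) -> float:
    # Minimum delay between requests to stay under a per-minute limit,
    # padded by a small safety margin so clock jitter doesn't trip the limit.
    return (60.0 / requests_per_minute) * (1 + safety_margin)
```

&lt;p&gt;At 3 requests/minute this gives 22 seconds, which is where the hard-coded &lt;em&gt;time.sleep(22)&lt;/em&gt; comes from.&lt;/p&gt;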

&lt;p&gt;&lt;strong&gt;Running it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8945fg35t6psxsdw01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8945fg35t6psxsdw01.png" alt=" " width="800" height="111"&gt;&lt;/a&gt;&lt;br&gt;
Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zhz0j6hj79pxujxifgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zhz0j6hj79pxujxifgd.png" alt=" " width="777" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your &lt;em&gt;output/&lt;/em&gt; folder now has three &lt;em&gt;.md&lt;/em&gt; files, each with consistent structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fechj874fmowdybg6qwnu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fechj874fmowdybg6qwnu.png" alt=" " width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can go wrong&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;RateLimitError&lt;/em&gt; even with the sleep. This happens if you've been running the script multiple times in the same minute. The rate limit is per-minute across all your requests, not just this script. Fix: increase &lt;em&gt;time.sleep(22)&lt;/em&gt; to &lt;em&gt;time.sleep(30)&lt;/em&gt; if you're hitting it consistently.&lt;/p&gt;

&lt;p&gt;The model ignores your structure prompt. This happens more with &lt;em&gt;gpt-3.5-turbo&lt;/em&gt; than &lt;em&gt;gpt-4o.&lt;/em&gt; If you switch models to reduce cost, test with 5 keywords first and inspect the output structure before running it on your full list.&lt;/p&gt;
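&lt;p&gt;One cheap way to run that 5-keyword test is a structural sanity check instead of eyeballing every file (a sketch; the section names assume the prompt template used in this script):&lt;/p&gt;

```python
REQUIRED_SECTIONS = ("## Introduction", "## Conclusion")

def has_required_structure(markdown: str) -> bool:
    # True only if every required H2 heading appears in the generated post.
    return all(section in markdown for section in REQUIRED_SECTIONS)
```

&lt;p&gt;Run it over the 5 sample outputs; if any come back &lt;em&gt;False&lt;/em&gt;, the cheaper model isn't holding the structure.&lt;/p&gt;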

&lt;p&gt;You get back an empty string. Rare, but it happens. The &lt;em&gt;generate_post&lt;/em&gt; function returns the raw content string — add a check after the call: &lt;em&gt;if not content.strip(): raise ValueError(...)&lt;/em&gt; to catch and flag empty responses before they get written to disk as empty files.&lt;/p&gt;
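&lt;p&gt;That check can live in a tiny guard function (a hypothetical helper, not in the 47-line script):&lt;/p&gt;

```python
def require_content(content: str, keyword: str) -> str:
    # Fail loudly instead of silently writing an empty .md file to disk.
    if not content or not content.strip():
        raise ValueError(f"Empty response for keyword: {keyword!r}")
    return content
```

&lt;p&gt;Wrap the call site as &lt;em&gt;content = require_content(generate_post(keyword), keyword)&lt;/em&gt; and empty responses stop the run instead of polluting your output folder.&lt;/p&gt;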

&lt;p&gt;Special characters in keywords break the filename. The &lt;em&gt;slugify&lt;/em&gt; function handles spaces and forward slashes, but if your keywords have apostrophes, colons, or question marks, you'll get OS-level errors. Add &lt;em&gt;.replace("'", "").replace(":", "").replace("?", "")&lt;/em&gt; to the slugify function if your keyword list is user-generated.&lt;/p&gt;
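&lt;p&gt;If you'd rather not keep chaining &lt;em&gt;.replace()&lt;/em&gt; calls, a regex-based version handles all punctuation at once (a drop-in sketch for the original &lt;em&gt;slugify&lt;/em&gt;):&lt;/p&gt;

```python
import re

def safe_slugify(keyword: str) -> str:
    # Collapse every run of non-alphanumeric characters into one hyphen,
    # then trim stray hyphens from the ends.
    return re.sub(r"[^a-z0-9]+", "-", keyword.lower()).strip("-")
```

&lt;p&gt;This turns "What's Python?" into "what-s-python" with no OS-level surprises.&lt;/p&gt;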

&lt;p&gt;&lt;strong&gt;What I'd add next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This script is intentionally minimal — 47 lines, no external dependencies beyond &lt;em&gt;openai.&lt;/em&gt; But if you're running this in production for more than a few hundred keywords, here's what breaks next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost tracking. Add a token counter so you know what each run costs before you've spent $40 without noticing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Quality validation. A second API call that checks whether the output meets a minimum quality bar (does it have all the required sections? Is it close to the target word count?). This sounds expensive, but catching bad outputs early is cheaper than rewriting them manually.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Concurrent requests. The serial approach (one keyword at a time, sleep 22 seconds) is slow. With &lt;em&gt;asyncio&lt;/em&gt; and a proper rate limiter, you can process 3 keywords simultaneously and cut wall-clock time by ~60%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
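&lt;p&gt;The cost-tracking idea is mostly arithmetic on the &lt;em&gt;usage&lt;/em&gt; field the API returns with each response (a sketch; the per-million-token rates below are placeholders, so check current pricing before trusting the number):&lt;/p&gt;

```python
def estimate_cost(usage_records, input_rate=2.50, output_rate=10.00):
    # usage_records: (prompt_tokens, completion_tokens) pairs, e.g. collected
    # from response.usage after each call. Rates are USD per 1M tokens
    # (placeholder values, not official pricing).
    total = 0.0
    for prompt_tokens, completion_tokens in usage_records:
        total += prompt_tokens / 1_000_000 * input_rate
        total += completion_tokens / 1_000_000 * output_rate
    return total
```

&lt;p&gt;Log the pairs during the run and print the estimate at the end, so a surprise bill becomes a surprise log line instead.&lt;/p&gt;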

&lt;p&gt;I'll cover the async version in the next post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full script on GitHub&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The complete source, with a &lt;em&gt;requirements.txt&lt;/em&gt; and a sample &lt;em&gt;keywords.txt&lt;/em&gt; to test with, is at:&lt;br&gt;
&lt;a href="//github.com/your-username/keyword-pipeline"&gt;&lt;/a&gt;&lt;br&gt;
Drop a star if it saved you time.&lt;/p&gt;

&lt;p&gt;Have you hit rate limits running keyword pipelines at scale? What's your retry strategy — and are you running requests serially or concurrently? Curious what others have landed on.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>ai</category>
      <category>contentwriting</category>
    </item>
  </channel>
</rss>
