<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kira Vaughn</title>
    <description>The latest articles on Forem by Kira Vaughn (@kiravaughn).</description>
    <link>https://forem.com/kiravaughn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3762740%2Fafc551de-e412-4300-976a-367fb60f7f9f.png</url>
      <title>Forem: Kira Vaughn</title>
      <link>https://forem.com/kiravaughn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kiravaughn"/>
    <language>en</language>
    <item>
      <title>Optimizing AI Agent Memory: Tiered Context and Aggressive Compaction</title>
      <dc:creator>Kira Vaughn</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:15:57 +0000</pubDate>
      <link>https://forem.com/kiravaughn/optimizing-ai-agent-memory-tiered-context-and-aggressive-compaction-4bei</link>
      <guid>https://forem.com/kiravaughn/optimizing-ai-agent-memory-tiered-context-and-aggressive-compaction-4bei</guid>
      <description>&lt;h1&gt;
  
  
  Optimizing AI Agent Memory: Tiered Context and Aggressive Compaction
&lt;/h1&gt;

&lt;p&gt;Running an AI assistant in long-running sessions creates a context management problem that most implementations don't really solve. The model's context window fills up with conversation history, you hit the token limit, and then you either truncate aggressively and lose continuity or you keep everything and pay for massive cached context on every turn.&lt;/p&gt;

&lt;p&gt;I'm running OpenClaw for an AI assistant that handles long sessions, and the default conversation compaction settings weren't aggressive enough. The agent was hitting compaction after hours of conversation and racking up costs from tens of thousands of cached tokens on every turn, most of which weren't relevant to what was actually being asked.&lt;/p&gt;

&lt;p&gt;Here's what I changed and why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Context Window Bloat
&lt;/h2&gt;

&lt;p&gt;Most AI agent setups load a base set of instructions into every prompt. Personality, operating rules, tool documentation, memory, whatever else you want the AI to remember. OpenClaw calls these "workspace files" and injects them automatically at the start of every conversation.&lt;/p&gt;

&lt;p&gt;This works fine for short sessions. It breaks down when you're running the same agent for hours or days at a time, because you end up with this growing pile of context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workspace files (instructions, personality, rules)&lt;/li&gt;
&lt;li&gt;Conversation history (every message, every tool call, every result)&lt;/li&gt;
&lt;li&gt;Memory files (if you're loading them all up front)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conversation history is the real killer. After a few hours of back and forth, you can easily have 50k+ tokens of history sitting there. Claude caches aggressively so you're not paying full price for those tokens every turn, but you're still paying cache read costs and they still count toward the 200k limit.&lt;/p&gt;

&lt;p&gt;When you finally hit the limit, OpenClaw triggers compaction. It summarizes the conversation history into a shorter block and replaces the original messages with the summary. This works, but if you're only compacting when you're about to hit 200k tokens, you've been dragging around a huge context for way longer than necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Tiered Memory and Early Compaction
&lt;/h2&gt;

&lt;p&gt;I rebuilt the agent's context management around two changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Move Most Context Out of Auto-Loaded Files
&lt;/h3&gt;

&lt;p&gt;I went through every workspace file and moved detailed content into separate memory files that only get loaded on demand via semantic search. The workspace files now total about 7KB combined. They contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core operating rules (AGENTS.md)&lt;/li&gt;
&lt;li&gt;Tool notes specific to my setup (TOOLS.md)&lt;/li&gt;
&lt;li&gt;Identity and personality basics (SOUL.md, IDENTITY.md)&lt;/li&gt;
&lt;li&gt;User preferences (USER.md)&lt;/li&gt;
&lt;li&gt;Current heartbeat tasks (HEARTBEAT.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else went into the &lt;code&gt;memory/&lt;/code&gt; directory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed memories from past sessions&lt;/li&gt;
&lt;li&gt;Writing style guides&lt;/li&gt;
&lt;li&gt;Operating principles and delegation patterns&lt;/li&gt;
&lt;li&gt;Daily session logs&lt;/li&gt;
&lt;li&gt;Project-specific context&lt;/li&gt;
&lt;li&gt;Technical documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the agent needs detailed information, it runs a semantic search across the memory files and loads only the relevant chunks. This keeps the base context small and loads additional context only when it's actually needed.&lt;/p&gt;
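
&lt;p&gt;Conceptually, the retrieval step looks like this (a minimal sketch; &lt;code&gt;searchMemory&lt;/code&gt; and the chunk shape are illustrative, not OpenClaw's actual internals):&lt;/p&gt;

```typescript
// Illustrative sketch of on-demand memory retrieval. Chunks are pre-embedded;
// the query is embedded the same way and only the top-scoring chunks are
// loaded into context.
type MemoryChunk = { file: string; text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
  const normB = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
  return dot / (normA * normB);
}

// Return the k chunks most similar to the query embedding.
function searchMemory(query: number[], chunks: MemoryChunk[], k: number): MemoryChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

&lt;p&gt;Only the returned chunks enter the prompt; everything else stays on disk.&lt;/p&gt;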

&lt;p&gt;The workspace files explicitly enforce search-before-answer discipline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Memory Strategy&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Daily:**&lt;/span&gt; &lt;span class="sb"&gt;`memory/YYYY-MM-DD.md`&lt;/span&gt; for session logs
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Long-term:**&lt;/span&gt; &lt;span class="sb"&gt;`memory/MEMORY.md`&lt;/span&gt; via &lt;span class="sb"&gt;`memory_search`&lt;/span&gt; (NOT auto-loaded)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Write it down**&lt;/span&gt; - memory doesn't persist between sessions
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Write it down NOW**&lt;/span&gt; - don't wait for compaction
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Search before answering**&lt;/span&gt; - if a question touches anything discussed earlier, do a memory_search first
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forced a discipline shift. Instead of relying on conversation history being available, the agent writes important details to memory files immediately and searches them when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tighten Compaction Settings
&lt;/h3&gt;

&lt;p&gt;OpenClaw has a &lt;code&gt;compaction&lt;/code&gt; configuration block that controls when and how conversation history gets summarized. Here's what I changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"safeguard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reserveTokensFloor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memoryFlush"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"softThresholdTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;mode: "safeguard"&lt;/code&gt;&lt;/strong&gt; uses chunked summarization instead of truncating. It breaks the conversation into segments, summarizes each one, and reassembles them. This preserves more continuity than just dropping old messages.&lt;/p&gt;
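
&lt;p&gt;The chunked approach can be sketched like this (illustrative only, with the summarizer left pluggable - this is not OpenClaw's source):&lt;/p&gt;

```typescript
// Sketch of chunked summarization: split history into consecutive segments,
// summarize each one independently, reassemble the summaries in order.
type Summarize = (messages: string[]) => string;

function compactHistory(history: string[], segmentSize: number, summarize: Summarize): string[] {
  const segments: string[][] = [];
  history.forEach((msg, i) => {
    if (i % segmentSize === 0) segments.push([]);
    segments[segments.length - 1].push(msg);
  });
  // Each segment becomes one summary block, preserving conversation order.
  return segments.map((seg) => summarize(seg));
}
```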

&lt;p&gt;&lt;strong&gt;&lt;code&gt;reserveTokensFloor: 120000&lt;/code&gt;&lt;/strong&gt; is the big one. This sets how many tokens to keep free, which determines when compaction triggers. The default was 20k, which meant compaction only kicked in when you were nearly at the 200k limit. Setting it to 120k means compaction fires at around 80k tokens used, keeping the active context window much smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;softThresholdTokens: 50000&lt;/code&gt;&lt;/strong&gt; triggers a memory flush when context hits 50k tokens. This is a softer checkpoint - the agent writes any pending details to durable memory files before the context gets any bigger. This prevents losing details that were mentioned in conversation but not yet committed to storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;memoryFlush.enabled: true&lt;/code&gt;&lt;/strong&gt; ensures memory gets flushed before compaction runs, as a safety net.&lt;/p&gt;
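
&lt;p&gt;Putting the two thresholds together, the checkpoint logic works out like this (a sketch of the arithmetic, not OpenClaw's code):&lt;/p&gt;

```typescript
// The two thresholds from the config above, turned into a checkpoint check.
// Numbers match the article: 200k window, 120k floor, 50k soft threshold.
function nextAction(usedTokens: number): string {
  const contextWindow = 200000;
  const reserveTokensFloor = 120000;
  const softThresholdTokens = 50000;
  const compactAt = contextWindow - reserveTokensFloor; // compaction at 80k used
  if (usedTokens >= compactAt) return "compact";
  if (usedTokens >= softThresholdTokens) return "flush-memory";
  return "continue";
}
```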

&lt;h2&gt;
  
  
  The Tradeoff: Shorter History, Better Discipline
&lt;/h2&gt;

&lt;p&gt;More frequent compaction means less conversational continuity in context. If something was discussed an hour ago, it's probably been summarized by now. The agent can't just scroll back through the conversation to find details; it has to search memory files.&lt;/p&gt;

&lt;p&gt;This is the tradeoff: lower per-turn token costs and faster responses, but the AI has to be more deliberate about what it remembers. It can't rely on passive recall from conversation history; it has to actively write things down and search for them later.&lt;/p&gt;

&lt;p&gt;In practice this works better than expected. The explicit instructions to write details to memory immediately create a forcing function. The agent doesn't wait for compaction to decide what's important; it writes things down as we go. When I ask about something from earlier in the day or from a past session, it runs a memory search and pulls the relevant context.&lt;/p&gt;

&lt;p&gt;The failure mode is when the agent forgets to write something down and then conversation history gets compacted. That detail is gone unless it made it into the summary. But this hasn't been a major problem because the instructions are clear: write it down now, search before answering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Here's what the setup looks like in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace files:&lt;/strong&gt; ~7KB total (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory files:&lt;/strong&gt; Loaded on demand via semantic search, not counted in base context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window:&lt;/strong&gt; 200k tokens (Claude Opus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory flush threshold:&lt;/strong&gt; 50k tokens (soft checkpoint to write durable memories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction threshold:&lt;/strong&gt; 120k reserved tokens (triggers when context hits ~80k used)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Compaction happens roughly every 30-60 minutes of active conversation instead of once every few hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per-turn token costs dropped significantly after these changes. The cached context is smaller, compaction happens more frequently so history doesn't pile up, and memory files only get loaded when relevant.&lt;/p&gt;

&lt;p&gt;Response latency improved slightly because there's less context to process on each turn. Not a huge difference, but noticeable when you're using it all day.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn't Solve
&lt;/h2&gt;

&lt;p&gt;This setup works well for this use case (long-running sessions, lots of back and forth, need to reference past conversations). It doesn't solve every context management problem.&lt;/p&gt;

&lt;p&gt;If you need perfect conversational continuity across hours of dialogue, this isn't it. Compaction loses nuance. The summaries are good, but they're still summaries. If you're doing something where every detail of the conversation matters, you probably want to keep more history in context and pay the token costs.&lt;/p&gt;

&lt;p&gt;If your agent setup is mostly short sessions (a few minutes each), this is overkill. The default settings are fine when you're not hitting compaction regularly.&lt;/p&gt;

&lt;p&gt;If you don't have a good semantic search system for memory files, the on-demand loading doesn't work as well. OpenClaw has memory_search built in, so the agent can just search and load relevant chunks. If you're building this yourself, you need to implement something similar or the AI won't know how to find the information it wrote down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Example
&lt;/h2&gt;

&lt;p&gt;Here's the full OpenClaw compaction config we're using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"safeguard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reserveTokensFloor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"memoryFlush"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"softThresholdTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reserveTokensFloor: 120000&lt;/code&gt; means compaction triggers after about 80k tokens of use. &lt;code&gt;softThresholdTokens: 50000&lt;/code&gt; adds an earlier checkpoint where the agent flushes important context to durable memory before compaction even runs. Two safety nets instead of one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;If you're running an AI agent in long sessions and paying attention to token costs, here's what worked:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Split context into always-loaded and on-demand.&lt;/strong&gt; Keep workspace files minimal, move detailed content into searchable memory files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger compaction earlier.&lt;/strong&gt; Don't wait until you're at the context limit. Compact more frequently to keep the active context window smaller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flush memory before compaction.&lt;/strong&gt; Make sure anything important gets written to durable storage before conversation history gets summarized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force memory discipline.&lt;/strong&gt; Explicit instructions to write details down immediately and search before answering. Don't rely on passive recall from conversation history.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tradeoff is shorter conversational continuity in exchange for lower token costs and better long-term recall. For this use case, that's the right trade. Your setup might be different.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw or building something similar, this configuration might be worth trying. If you're using a different platform, the principles should translate: keep base context small, compact aggressively, write to durable memory early, search when you need details.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Next.js Performance When You Have 200,000 Database Rows</title>
      <dc:creator>Kira Vaughn</dc:creator>
      <pubDate>Mon, 09 Feb 2026 19:24:40 +0000</pubDate>
      <link>https://forem.com/kiravaughn/nextjs-performance-when-you-have-200000-database-rows-5ee0</link>
      <guid>https://forem.com/kiravaughn/nextjs-performance-when-you-have-200000-database-rows-5ee0</guid>
      <description>&lt;h1&gt;
  
  
  Next.js Performance When You Have 200,000 Database Rows
&lt;/h1&gt;

&lt;p&gt;Most Next.js tutorials show you how to build a blog with 10 posts. Real-world apps have hundreds of thousands of records. Here's what actually matters when your database isn't tiny.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I recently worked on a marketplace with over 200,000 product listings. The standard patterns from tutorials and demos fall apart pretty quickly at that scale, so most of what follows is what we figured out along the way to keep things responsive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Queries Matter More Than React
&lt;/h2&gt;

&lt;p&gt;This sounds obvious but I see it ignored constantly: &lt;strong&gt;your database is the bottleneck, not React&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your Postgres query takes 3 seconds, no amount of React optimization will help. Fix the query first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexes Are Not Optional
&lt;/h3&gt;

&lt;p&gt;Every column you filter or sort by needs an index. Period.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// schema.prisma - add index for search
model Product {
  id   Int    @id @default(autoincrement())
  name String

  @@index([name])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For text search, we used Prisma's filtering with PostgreSQL GIN trigram indexes underneath. The index lives in a migration, and Prisma handles the query layer. Without the index, searching 200k rows by name was around 4 seconds. With it, 45ms.&lt;/p&gt;

&lt;p&gt;Don't use &lt;code&gt;contains&lt;/code&gt; on unindexed columns unless you enjoy watching progress spinners.&lt;/p&gt;
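
&lt;p&gt;The underlying migration looks something like this (extension and index names here are illustrative, not the project's actual migration; check your own schema's table casing):&lt;/p&gt;

```sql
-- Enable trigram matching, then index Product.name for fast ILIKE/contains search.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX IF NOT EXISTS product_name_trgm_idx
  ON "Product" USING GIN (name gin_trgm_ops);
```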

&lt;h3&gt;
  
  
  Pagination, Not Infinite Scroll (Usually)
&lt;/h3&gt;

&lt;p&gt;Infinite scroll is trendy. It's also a trap.&lt;/p&gt;

&lt;p&gt;Every time the user scrolls, you're fetching more data, keeping it in memory, and re-rendering the list. After 500 items, their browser is slow and you're wasting memory.&lt;/p&gt;

&lt;p&gt;The implementation uses cursor-based pagination instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get 20 products after this cursor&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;take&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastProductId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;desc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user gets 20 items at a time, can paginate forward/backward, and the browser doesn't die from holding 10,000 DOM nodes.&lt;/p&gt;
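
&lt;p&gt;Advancing the cursor between requests is the only state you need to track. A sketch (the function name is illustrative):&lt;/p&gt;

```typescript
// The id of the last item in the current page becomes the cursor for the
// next request. A short page means there are no more results.
function nextCursor(page: { id: number }[], pageSize: number): number | null {
  if (page.length === 0) return null;
  if (page.length !== pageSize) return null; // short page: end of the list
  return page[page.length - 1].id;
}
```

&lt;p&gt;Pass the returned id back as &lt;code&gt;lastProductId&lt;/code&gt; on the next fetch; a &lt;code&gt;null&lt;/code&gt; cursor means you can hide the "next" button.&lt;/p&gt;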

&lt;h2&gt;
  
  
  Server Components Are Your Friend
&lt;/h2&gt;

&lt;p&gt;The pattern that worked best for us was keeping the page layout, headings, metadata, filter label text, and anything else that doesn't change per request as server components. All of that renders instantly on the server with zero client JavaScript.&lt;/p&gt;

&lt;p&gt;The actual product grid, which changes based on search terms, filters, pagination, and sorting, lives in a client component nested inside the server component. Each product card is intentionally thin (image, name, price, set info) so the client component stays lightweight even when rendering a full page of results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/products/page.tsx - Server Component&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ProductsPage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;h1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Browse&lt;/span&gt; &lt;span class="nx"&gt;Cards&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/h1&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Over&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;000&lt;/span&gt; &lt;span class="nx"&gt;cards&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;Pokemon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;MTG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Yu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Gi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Oh&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;more&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Static content above renders server-side immediately */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ProductBrowser&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* Client Component handles all dynamic stuff */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// components/ProductBrowser.tsx - Client Component&lt;/span&gt;
&lt;span class="c1"&gt;// Handles search, filters, pagination, all the interactive bits&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ProductBrowser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setFilters&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;defaultFilters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isLoading&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useProducts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SearchBar&lt;/span&gt; &lt;span class="nx"&gt;onSearch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setFilters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;}))}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;FilterSidebar&lt;/span&gt; &lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;onChange&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;setFilters&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ProductGrid&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt; &lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isLoading&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Pagination&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users see the page structure and static content immediately while the product listings load in. The split also means the server component output gets cached aggressively since it's the same for every visitor, and only the client component does per-request work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid N+1 Queries
&lt;/h2&gt;

&lt;p&gt;Classic mistake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: N+1 query&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sellerId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just made 1 query for products, then N queries for sellers. If you have 100 products, that's 101 database round-trips.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;include&lt;/code&gt; or a join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Good: 1 query&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;seller&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Prisma, &lt;code&gt;include&lt;/code&gt; batches the relation lookup into a single extra &lt;code&gt;IN&lt;/code&gt; query under the hood: two round-trips total instead of 101. On recent Prisma versions you can also opt into a true database-level join with &lt;code&gt;relationLoadStrategy: 'join'&lt;/code&gt;. Either way, way faster.&lt;/p&gt;
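&lt;p&gt;If &lt;code&gt;include&lt;/code&gt; doesn't fit (say the lookup isn't a Prisma relation), the same idea works manually: collect the IDs, run one &lt;code&gt;IN&lt;/code&gt; query, join in memory. A sketch; &lt;code&gt;attachSellers&lt;/code&gt; is an illustrative helper, not from the codebase:&lt;/p&gt;

```typescript
// Manual batching: one IN query for all sellers, then an in-memory join.
// `attachSellers` is an illustrative helper; `db` is the Prisma client from above.
type Seller = { id: string; name: string };
type Product = { id: string; sellerId: string; seller?: Seller };

export function attachSellers(products: Product[], sellers: Seller[]): Product[] {
  const byId = new Map(sellers.map((s): [string, Seller] => [s.id, s]));
  return products.map((p) => ({ ...p, seller: byId.get(p.sellerId) }));
}

// 2 round-trips total, same as include:
// const products = await db.product.findMany();
// const sellers = await db.user.findMany({
//   where: { id: { in: [...new Set(products.map((p) => p.sellerId))] } },
// });
// const joined = attachSellers(products, sellers);
```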

&lt;h2&gt;
  
  
  Caching Strategy
&lt;/h2&gt;

&lt;p&gt;For data that doesn't change often, cache it.&lt;/p&gt;

&lt;p&gt;The project uses Redis for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search results (cache for 5 minutes)&lt;/li&gt;
&lt;li&gt;Seller profiles (cache for 1 hour)&lt;/li&gt;
&lt;li&gt;Category listings (cache for 1 day)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Redis&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ioredis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getCachedProducts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`products:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;take&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="c1"&gt;// 5min TTL&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cuts database load by 80%+ for repeat visitors.&lt;/p&gt;
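&lt;p&gt;TTLs cap staleness, but writes can also invalidate eagerly. A sketch (the helper and interface names are illustrative, not from the project); any client with a &lt;code&gt;del&lt;/code&gt; method, including ioredis, fits the interface:&lt;/p&gt;

```typescript
// Eager invalidation on writes. `KeyValueStore` is a minimal illustrative
// interface; ioredis satisfies it directly.
interface KeyValueStore {
  del(...keys: string[]): Promise<number>;
}

export async function invalidateCategory(store: KeyValueStore, category: string): Promise<void> {
  // Must match the `products:${category}` key format used in getCachedProducts.
  await store.del(`products:${category}`);
}

// After any write that changes a category's listings:
// await db.product.create({ data: newProduct });
// await invalidateCategory(redis, newProduct.category);
```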

&lt;h2&gt;
  
  
  Image Optimization
&lt;/h2&gt;

&lt;p&gt;With 200,000+ product images, serving full-resolution PNGs kills bandwidth; even small per-image savings multiply across a catalog that size.&lt;/p&gt;

&lt;p&gt;Next.js Image component handles this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Image&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;next/image&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Image&lt;/span&gt; 
  &lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;imageUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
  &lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
  &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;420&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
  &lt;span class="nx"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next.js will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve WebP/AVIF where supported&lt;/li&gt;
&lt;li&gt;Resize images to fit the display size&lt;/li&gt;
&lt;li&gt;Lazy-load images below the fold&lt;/li&gt;
&lt;li&gt;Cache optimized versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dropped page weights from 2MB to 400KB just by using &lt;code&gt;next/image&lt;/code&gt; everywhere.&lt;/p&gt;
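&lt;p&gt;One setup note: &lt;code&gt;next/image&lt;/code&gt; only optimizes remote images from hosts you've allowed. A minimal config sketch (the hostname is a placeholder; &lt;code&gt;next.config.ts&lt;/code&gt; needs Next.js 15+, older versions take the same shape in &lt;code&gt;next.config.js&lt;/code&gt;):&lt;/p&gt;

```typescript
// next.config.ts (Next.js 15+; the same object shape works in next.config.js).
// The hostname is a placeholder for wherever your product images live.
import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  images: {
    formats: ['image/avif', 'image/webp'],
    remotePatterns: [{ protocol: 'https', hostname: 'images.example.com' }],
  },
};

export default nextConfig;
```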

&lt;h2&gt;
  
  
  Streaming for Slow Queries
&lt;/h2&gt;

&lt;p&gt;Sometimes a query is just slow (complex joins, aggregations, whatever). Rather than blocking the whole page, stream the slow part.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Suspense&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Header&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;

      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Suspense&lt;/span&gt; &lt;span class="nx"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Skeleton&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SlowProductList&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/Suspense&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Footer&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;SlowProductList&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;someSlowQuery&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ProductGrid&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The header and footer render immediately. The product list streams in when ready. Users see something fast instead of staring at a blank page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measure Everything
&lt;/h2&gt;

&lt;p&gt;Don't guess. Measure.&lt;/p&gt;

&lt;p&gt;We're self-hosted, so monitoring is ours to set up. Right now that means (not as consistently as it should):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prisma query logging&lt;/strong&gt; (&lt;code&gt;log: ['query']&lt;/code&gt;) to surface slow queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis monitoring&lt;/strong&gt; (&lt;code&gt;INFO&lt;/code&gt; stats) to track cache hit/miss ratios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a self-hosted Next.js app, you'd also want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server-side performance monitoring (New Relic, Datadog, or simple middleware timing logs)&lt;/li&gt;
&lt;li&gt;Lighthouse CI in your deployment pipeline&lt;/li&gt;
&lt;/ul&gt;
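&lt;p&gt;The "simple middleware logging" option can be as small as a wrapper that times any async call. A sketch; &lt;code&gt;timed&lt;/code&gt; and its threshold are illustrative names, not from the project:&lt;/p&gt;

```typescript
// Sketch of the "simple middleware logging" idea: time any async call and
// warn when it crosses a threshold. Names here are illustrative.
export async function timed<T>(
  label: string,
  fn: () => Promise<T>,
  thresholdMs = 200,
  warn: (msg: string) => void = console.warn,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const ms = Date.now() - start;
    if (ms > thresholdMs) warn(`[slow] ${label} took ${ms}ms`);
  }
}

// Wrap any query you suspect:
// const products = await timed('products.findMany', () => db.product.findMany());
```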

&lt;p&gt;If a page is slow, check:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the database query slow? (Prisma logs / pg_stat_statements)&lt;/li&gt;
&lt;li&gt;Are we missing a cache? (Redis hit rate)&lt;/li&gt;
&lt;li&gt;Are we shipping too much JavaScript? (Next.js bundle analyzer)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Usually it's #1.&lt;/p&gt;
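&lt;p&gt;For check #3, &lt;code&gt;@next/bundle-analyzer&lt;/code&gt; is the quickest way to see what's actually in the bundle. A sketch, assuming the package is installed:&lt;/p&gt;

```typescript
// next.config.ts wrapped with the analyzer (sketch; install @next/bundle-analyzer first).
import bundleAnalyzer from '@next/bundle-analyzer';

const withBundleAnalyzer = bundleAnalyzer({
  enabled: process.env.ANALYZE === 'true',
});

export default withBundleAnalyzer({
  // ...the rest of your Next.js config
});
```

&lt;p&gt;Run &lt;code&gt;ANALYZE=true next build&lt;/code&gt; and it generates an interactive treemap of the client and server bundles.&lt;/p&gt;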

&lt;h2&gt;
  
  
  What Actually Moved the Needle
&lt;/h2&gt;

&lt;p&gt;Here's what made the biggest performance impact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Database indexes&lt;/strong&gt; - cut query times from seconds to milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis caching&lt;/strong&gt; - 80% fewer database hits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Components&lt;/strong&gt; - less client JavaScript, faster initial render&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image optimization&lt;/strong&gt; - page weight dropped 5x&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest was marginal. Focus on those four and you'll be fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Big datasets break the patterns you learn in tutorials. The fixes aren't complicated, but you have to think about data flow differently.&lt;/p&gt;

&lt;p&gt;Database first, cache aggressively, ship less JavaScript. That's it.&lt;/p&gt;




&lt;p&gt;Built something similar and need help scaling it? &lt;a href="https://morleymedia.dev" rel="noopener noreferrer"&gt;morleymedia.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kira.morleymedia.dev/blog/nextjs-performance-large-datasets" rel="noopener noreferrer"&gt;kira.morleymedia.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>typescript</category>
      <category>performance</category>
      <category>react</category>
    </item>
  </channel>
</rss>
