<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: LayerZero</title>
    <description>The latest articles on Forem by LayerZero (@layzerzero105).</description>
    <link>https://forem.com/layzerzero105</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3886969%2F83917794-7873-4114-92dd-33ca3c6996d4.jpeg</url>
      <title>Forem: LayerZero</title>
      <link>https://forem.com/layzerzero105</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/layzerzero105"/>
    <language>en</language>
    <item>
      <title>A Roblox Cheat + One AI Tool Took Down Vercel. Your Stack Is Probably Next.</title>
      <dc:creator>LayerZero</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:13:07 +0000</pubDate>
      <link>https://forem.com/layzerzero105/a-roblox-cheat-one-ai-tool-took-down-vercel-your-stack-is-probably-next-1f47</link>
      <guid>https://forem.com/layzerzero105/a-roblox-cheat-one-ai-tool-took-down-vercel-your-stack-is-probably-next-1f47</guid>
      <description>&lt;p&gt;A Roblox cheat.&lt;/p&gt;

&lt;p&gt;That's what the story starts with. Not a nation-state APT, not a zero-day in the kernel, not some genius Stuxnet-grade payload. A cheat a teenager downloaded to get infinite Robux.&lt;/p&gt;

&lt;p&gt;And one AI dev tool.&lt;/p&gt;

&lt;p&gt;Together, that combo took Vercel's platform offline earlier this month. If you shipped anything on a preview URL that day, you remember. The post-mortem is still circulating in security channels and the pattern it exposes is quietly devastating — because almost every vibe-coded SaaS in 2026 is built the same way.&lt;/p&gt;

&lt;p&gt;Let me walk you through what actually happened and why your stack is almost certainly vulnerable to the same class of attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;p&gt;Here's the chain, compressed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A developer's personal machine got infected by a Roblox cheat bundled with an infostealer — the cheat was the candy, malware was the hook.&lt;/li&gt;
&lt;li&gt;The infostealer grabbed session cookies and API tokens sitting in the developer's environment. Standard malware playbook — boring, effective.&lt;/li&gt;
&lt;li&gt;One of those tokens belonged to an &lt;strong&gt;AI-powered development tool&lt;/strong&gt; the developer had connected to their Vercel account. The tool had broad deploy and environment-variable permissions, because it needed them to "help you ship faster."&lt;/li&gt;
&lt;li&gt;The attacker didn't even need to write exploit code. They fed the stolen token to the same AI tool and asked it, in plain English, to deploy malicious code and exfiltrate secrets across connected projects.&lt;/li&gt;
&lt;li&gt;The tool, doing its job, fanned out. Because it was trusted. Because it had keys. Because nobody had modeled "what if the AI gets prompted by the wrong human?"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. That's the whole attack. No CVE. No memory corruption. Just stolen credentials and an obedient AI with too much power.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this class of incident is about to explode
&lt;/h2&gt;

&lt;p&gt;Every hot dev tool in 2026 is bolting on the same architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An OAuth connection to GitHub, Vercel, Supabase, AWS.&lt;/li&gt;
&lt;li&gt;A long-lived token stored locally or on a vendor server.&lt;/li&gt;
&lt;li&gt;An AI agent that can take actions on your behalf.&lt;/li&gt;
&lt;li&gt;Permission scopes that are effectively admin because scoping down "breaks the magic."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the same architecture as the Vercel breach. And it's sitting on tens of thousands of developer laptops right now.&lt;/p&gt;

&lt;p&gt;The security community has a name for this failure mode: &lt;strong&gt;confused deputy&lt;/strong&gt;. A trusted actor with broad privileges is tricked into using those privileges on behalf of an attacker. The AI tool wasn't compromised. It wasn't even misbehaving. It was doing exactly what it was told to do — by the wrong person, holding the right token.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five mistakes every one of these incidents repeats
&lt;/h2&gt;

&lt;p&gt;I've read a dozen post-mortems with the same skeleton. It's always one or more of these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Over-scoped tokens.&lt;/strong&gt; The AI tool needs read access to one project; you gave it write access to your entire org. Why? Because that was the default button in the consent screen and you were in a hurry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No token expiry.&lt;/strong&gt; OAuth refresh tokens that live forever. A token stolen in January still works in December. If a token can outlive an employee's tenure, it will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No action auditing.&lt;/strong&gt; You can't see what the AI tool did yesterday, let alone at 3am when it "helpfully" deployed a compromised build. No audit trail means no early detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. No second factor on destructive actions.&lt;/strong&gt; "Deploy to production," "add a new environment variable," and "grant access to another user" all execute with one token. A human admin would face a 2FA prompt. The AI faces nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Single-machine trust boundary.&lt;/strong&gt; Your dev laptop is also your production deployer, your database admin, and your secrets manager. One piece of malware collapses all of those at once.&lt;/p&gt;

&lt;p&gt;Each one alone is manageable. Stacked, they become Vercel's Tuesday.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week — concrete actions, not fluff
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Audit your AI tool permissions
&lt;/h3&gt;

&lt;p&gt;Right now, open every AI dev tool you've connected — Claude Code, Cursor, Copilot Workspace, Devin, whatever. For each, check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Which orgs / repos / projects can this tool touch?
- What actions can it take? (read, write, deploy, admin)
- When was the token issued? Can I rotate it?
- Is there an audit log? Have I ever looked at it?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't answer any of those in 30 seconds, assume the worst and revoke.&lt;/p&gt;

&lt;h3&gt;
  
  
  Move secrets off the laptop
&lt;/h3&gt;

&lt;p&gt;Stop putting production API keys in &lt;code&gt;.env.local&lt;/code&gt;. Use a proper secret manager — Doppler, Infisical, AWS Secrets Manager — and have your tools fetch secrets at runtime via short-lived tokens. An infostealer grabbing your &lt;code&gt;.env&lt;/code&gt; should grab nothing useful.&lt;/p&gt;
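<p>The shape of that pattern, as a minimal sketch (the env-var name and error handling are illustrative; in production the value would be injected by a runner like <code>doppler run -- node server.js</code> or fetched via your secret manager's SDK):</p>

```python
import os

def get_secret(name: str) -> str:
    # Resolve a secret at runtime from env vars injected by the runner,
    # rather than reading a .env file off disk. An infostealer that copies
    # the repo directory then walks away with nothing useful.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} was not injected at runtime")
    return value
```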

&lt;p&gt;This is 15 minutes of setup and eliminates 80% of the "my laptop got owned" impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Short-lived tokens, always
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: GitHub fine-grained PAT — expires in 30 days, scoped to one repo&lt;/span&gt;
gh auth token &lt;span class="nt"&gt;--scope&lt;/span&gt; repo &lt;span class="nt"&gt;--expiration&lt;/span&gt; 30d &lt;span class="nt"&gt;--repo&lt;/span&gt; org/project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your AI tool doesn't support short-lived tokens, that's a red flag. Treat vendor token hygiene as a product-selection criterion now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable "dangerous action" confirmations
&lt;/h3&gt;

&lt;p&gt;Most modern AI dev tools have a setting buried somewhere — human-in-the-loop approval for destructive actions (deploys, deletes, permission changes, database writes). Find it. Turn it on. Yes, it slows you down. No, it doesn't slow you down as much as a breach does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate dev and deploy identities
&lt;/h3&gt;

&lt;p&gt;Your laptop shouldn't be the thing with prod deploy permissions. Run deploys from CI where the token lives for 10 minutes and is bounded by a pipeline definition. If an attacker gets your laptop, the worst they should be able to do is push to a branch — not deploy to customers.&lt;/p&gt;
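<p>One concrete shape of that boundary: a CI job that mints a short-lived cloud credential via OIDC instead of storing a long-lived key anywhere. This is a sketch; the role ARN, region, and deploy script are placeholders for your own setup:</p>

```yaml
# GitHub Actions: deploys run with an OIDC-minted credential that expires
# in minutes. No long-lived cloud key exists on a laptop or in repo secrets.
permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # placeholder
          aws-region: us-east-1                                     # placeholder
      - run: ./scripts/deploy.sh   # placeholder deploy entrypoint
```

A stolen laptop token can then push a branch at worst; only the pipeline identity, bounded by this file, can touch production.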

&lt;h2&gt;
  
  
  The non-obvious takeaway
&lt;/h2&gt;

&lt;p&gt;The Vercel incident wasn't an AI safety story. It was a classic credential management failure with an AI amplifier bolted on.&lt;/p&gt;

&lt;p&gt;That's the pattern to internalize. AI agents don't create new categories of security failure — they take old categories and multiply their blast radius. A stolen token used to mean a human attacker manually poking around until they found something juicy. A stolen token in 2026 means an obedient, tireless, English-speaking agent that will fan out across everything you've connected in 90 seconds.&lt;/p&gt;

&lt;p&gt;The security fundamentals haven't changed. &lt;strong&gt;The margin for ignoring them has collapsed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The business angle
&lt;/h2&gt;

&lt;p&gt;If you're building a SaaS that ships AI-agent integrations — and everyone is — your customers are about to get very, very opinionated about the security posture of the tools they connect. The companies that figure out short-lived scoped tokens, action-level audit logs, and human-in-the-loop approval as product features will win enterprise deals. The ones that ship "connect your org, let Claude cook" will eat the next breach.&lt;/p&gt;

&lt;p&gt;That's not speculation. That's where the buyer psychology is heading the day a Fortune 500 gets popped by this exact chain — which, given current trajectory, is maybe six months away.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do next
&lt;/h2&gt;

&lt;p&gt;Go audit your AI tool permissions. I mean now — before you close this tab. The five minutes you spend revoking one over-scoped token is the cheapest insurance premium you'll pay this year.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow LayerZero for decoded security for builders. Next up: how to design an AI agent with least-privilege from day one — so a stolen token stays boring.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your Agent Isn't Dumb. Your Context Is. — A Field Guide to Context Engineering</title>
      <dc:creator>LayerZero</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:32:53 +0000</pubDate>
      <link>https://forem.com/layzerzero105/your-agent-isnt-dumb-your-context-is-a-field-guide-to-context-engineering-4cj5</link>
      <guid>https://forem.com/layzerzero105/your-agent-isnt-dumb-your-context-is-a-field-guide-to-context-engineering-4cj5</guid>
      <description>&lt;p&gt;Prompt engineering is dead. Nobody told you because the influencers still sell courses on it.&lt;/p&gt;

&lt;p&gt;The real skill in 2026 is context engineering — the discipline of deciding what information, tools, and memory go into the model's window on every single turn. It's the difference between an agent that ships a pull request and one that hallucinates a function name and rage-quits.&lt;/p&gt;

&lt;p&gt;And almost nobody is doing it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;A year ago, "prompt engineering" meant crafting the perfect system message. Add a persona, stack some few-shots, wrap in XML tags, done.&lt;/p&gt;

&lt;p&gt;That worked when the model was a stateless Q&amp;amp;A box.&lt;/p&gt;

&lt;p&gt;It doesn't work when the model is an agent running 40 tool calls across 6 files to fix a bug. The system prompt is 200 tokens. The &lt;em&gt;context&lt;/em&gt; is 80,000 tokens of tool results, file contents, user messages, and prior reasoning — and every one of those tokens is either helping or hurting.&lt;/p&gt;

&lt;p&gt;Context engineering is the job of keeping the signal-to-noise ratio high across that entire window, turn after turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four levers
&lt;/h2&gt;

&lt;p&gt;Only four things go into an LLM call. Master these and you control the agent.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt; — the system prompt. Goals, constraints, tone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge&lt;/strong&gt; — the facts the model needs right now (RAG chunks, API docs, file contents).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — what actions the model can take and how their results come back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History&lt;/strong&gt; — prior turns, including tool calls and their outputs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every bug in every agent is one of these four going wrong. Always.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent loops forever? History is bloated with stale tool results.&lt;/li&gt;
&lt;li&gt;Agent calls a function that doesn't exist? Knowledge missing or instructions too vague.&lt;/li&gt;
&lt;li&gt;Agent picks the wrong tool? Tool descriptions are ambiguous.&lt;/li&gt;
&lt;li&gt;Agent contradicts itself across turns? Instructions got drowned out by history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is never "try a different prompt." The fix is deciding what to put in — and what to leave out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 1: the context window is a budget, not a bag
&lt;/h2&gt;

&lt;p&gt;The number one mistake: treating the context window like storage. "I have 200k tokens, I'll just throw everything in."&lt;/p&gt;

&lt;p&gt;That's how you burn $4 per agent turn and get worse answers.&lt;/p&gt;

&lt;p&gt;Long context is lossy. Models attend less to the middle of a long window, hallucinate more when the signal is buried in noise, and run slower in ways that compound across tool calls. A 2026 Anthropic benchmark found agent task completion drops by roughly &lt;strong&gt;28% when you pad a working context from 20k to 120k tokens&lt;/strong&gt; — even when the relevant information is unchanged.&lt;/p&gt;

&lt;p&gt;You're not saving the model time. You're drowning it.&lt;/p&gt;

&lt;p&gt;Treat every token like you're paying rent on it. Because you are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 2: compact aggressively
&lt;/h2&gt;

&lt;p&gt;When your agent's history crosses some threshold — say 50% of the model's window — summarize it.&lt;/p&gt;

&lt;p&gt;Pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compact_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;token_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

    &lt;span class="c1"&gt;# Keep the last 3 turns verbatim (recent context matters most)
&lt;/span&gt;    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;older&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Summarize the older turns into a single system note
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;older&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;focus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decisions made&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files modified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open questions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools that failed and why&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRIOR WORK SUMMARY:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You lose the verbatim trace. You keep the signal. And you reset your token budget so the agent can go another 50 turns without collapsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 3: retrieve at the tool level, not the prompt level
&lt;/h2&gt;

&lt;p&gt;Old RAG: stuff the top-5 chunks into the system prompt at startup.&lt;/p&gt;

&lt;p&gt;New RAG: give the agent a &lt;code&gt;search_docs&lt;/code&gt; tool and let &lt;em&gt;it&lt;/em&gt; decide when to retrieve.&lt;/p&gt;

&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Tokens at turn 1&lt;/th&gt;
&lt;th&gt;Tokens at turn 10&lt;/th&gt;
&lt;th&gt;Relevance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt-level RAG&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;Guessing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool-level RAG&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;500 (+1,200 on demand)&lt;/td&gt;
&lt;td&gt;Targeted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most agent turns don't need retrieval. Why pay the tax on every call? Let the model pull knowledge the way a developer opens a doc tab — only when they need it.&lt;/p&gt;

&lt;p&gt;This is "just-in-time context" and it's the single biggest unlock in modern agent design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 4: tool descriptions are prompts
&lt;/h2&gt;

&lt;p&gt;Your &lt;code&gt;search_database&lt;/code&gt; tool's description &lt;em&gt;is&lt;/em&gt; a system prompt for how the model reasons about querying data. If it says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Searches the database."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;...you deserve the hallucinations you get.&lt;/p&gt;

&lt;p&gt;Write it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;search_database&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Retrieves customer records by exact email or user_id.&lt;/span&gt;
  &lt;span class="s"&gt;Use this BEFORE suggesting account changes — never guess a user_id.&lt;/span&gt;
  &lt;span class="s"&gt;Returns at most 10 results. If you need more, narrow the query.&lt;/span&gt;
  &lt;span class="s"&gt;Fails if the email format is invalid — validate first.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That description teaches the agent when to call it, what it can't do, and how to recover. Every minute you spend rewriting tool descriptions saves ten minutes of debugging agent behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 5: separate durable memory from working context
&lt;/h2&gt;

&lt;p&gt;Working context = what's in the window right now.&lt;br&gt;
Memory = persistent notes the agent writes across sessions (to a file, a vector store, a scratchpad).&lt;/p&gt;

&lt;p&gt;If your agent needs to remember that a user prefers Python over Rust, don't shove it into every system prompt forever. Write it to a memory file. Retrieve it when relevant. Trim it when stale.&lt;/p&gt;

&lt;p&gt;Memory is context engineering across time. Working context is context engineering within a turn. They're different problems with different solutions — and teams that treat them as one always hit a wall.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business angle
&lt;/h2&gt;

&lt;p&gt;This matters because AI infrastructure cost is now a line on your P&amp;amp;L.&lt;/p&gt;

&lt;p&gt;A well-engineered context window runs an agent task for $0.20.&lt;br&gt;
A lazy one runs the same task for $2.50.&lt;br&gt;
The output quality is often &lt;em&gt;worse&lt;/em&gt; on the expensive one.&lt;/p&gt;

&lt;p&gt;Multiply that $2.30 gap across a product doing 100,000 agent runs a month and you've got a $230,000/month difference in gross margin. That's a hire. That's your Series A runway extension. That's whether you ship.&lt;/p&gt;

&lt;p&gt;The teams who figure this out in 2026 aren't the ones with the biggest GPU budgets. They're the ones who treat context as a design discipline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The non-obvious takeaway
&lt;/h2&gt;

&lt;p&gt;Context engineering is what prompt engineering wanted to be when it grew up.&lt;/p&gt;

&lt;p&gt;Prompt engineering asked: "how do I phrase this question?"&lt;br&gt;
Context engineering asks: "what does the model need to see, at what moment, with what tools, to produce the right action?"&lt;/p&gt;

&lt;p&gt;The first is a writing exercise. The second is systems design. And systems design is a moat — prompt tricks are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Audit one agent you're running. Log the full context at each turn. Find the 30% that isn't earning its tokens. Cut it.&lt;/li&gt;
&lt;li&gt;Move your RAG from prompt-level to tool-level. Measure the quality delta — it usually goes up.&lt;/li&gt;
&lt;li&gt;Rewrite your top 5 tool descriptions with the "when to use / what it can't do / how to recover" structure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your agents will get cheaper, faster, and smarter — in that order.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow LayerZero for more decoded AI infrastructure. Next up: the memory-file pattern that makes agents actually learn from their mistakes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Your LLM Bill Is 45% Too High. Here's the One Prompt Trick That Fixes It</title>
      <dc:creator>LayerZero</dc:creator>
      <pubDate>Sun, 19 Apr 2026 07:30:18 +0000</pubDate>
      <link>https://forem.com/layzerzero105/your-llm-bill-is-45-too-high-heres-the-one-prompt-trick-that-fixes-it-3793</link>
      <guid>https://forem.com/layzerzero105/your-llm-bill-is-45-too-high-heres-the-one-prompt-trick-that-fixes-it-3793</guid>
      <description>&lt;p&gt;Most developers ship AI features without looking at the bill. Then the bill arrives, and it's five figures.&lt;/p&gt;

&lt;p&gt;Here's the part nobody tells you: &lt;strong&gt;up to 45% of your tokens are pure fluff.&lt;/strong&gt; Filler words, restated questions, "As an AI assistant...", apologies, repeated context. You're paying Claude and GPT to be polite.&lt;/p&gt;

&lt;p&gt;That stops today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The politeness tax
&lt;/h2&gt;

&lt;p&gt;Every LLM response is padded with tokens that add zero value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Certainly! I'd be happy to help you with that."&lt;/li&gt;
&lt;li&gt;"Based on the information you've provided..."&lt;/li&gt;
&lt;li&gt;"I hope this helps! Let me know if you have any other questions."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiply that across thousands of API calls a day. You're literally renting GPUs to generate pleasantries.&lt;/p&gt;

&lt;p&gt;A recent production experiment ran 500 prompts through a small "defluffer" preprocessor that strips filler from both inputs and outputs. &lt;strong&gt;Token usage dropped 45%. Quality stayed identical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not a rounding error. That's your Q3 AI budget.&lt;/p&gt;
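<p>A defluffer doesn't need to be clever. A sketch of the idea (the phrase list here is illustrative, not the one from that experiment; a production list would be tuned per model):</p>

```python
import re

# Illustrative filler patterns to strip from model output before
# re-sending it as context or surfacing it to users.
FILLER = [
    r"^(Certainly|Sure|Of course)[,!]?\s+",
    r"I('d| would) be (happy|delighted) to help( you)?( with that)?[.!]?\s*",
    r"I hope this helps[.!]?\s*",
    r"Let me know if you have any (other|further) questions[.!]?\s*",
]

def defluff(text: str) -> str:
    for pattern in FILLER:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text.strip()
```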

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;LLMs are trained on human conversation. Humans are polite. So the model learned to open with "Certainly!" and close with "Let me know if you need anything else!"&lt;/p&gt;

&lt;p&gt;This was fine when LLMs were chatbots. It's expensive when they're backend infrastructure.&lt;/p&gt;

&lt;p&gt;The worst part: most devs copy-paste "Act as a helpful assistant" into their system prompt without realizing they're explicitly asking for the fluff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix (30 seconds)
&lt;/h2&gt;

&lt;p&gt;Add this to your system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Respond in the fewest tokens required to be correct and complete.
No preamble, no apologies, no restating the question, no closing remarks.
If the answer is a single word, respond with a single word.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Drop it in, rerun your evals, watch your token count.&lt;/p&gt;

&lt;p&gt;In a test across 200 real user queries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg output tokens&lt;/td&gt;
&lt;td&gt;412&lt;/td&gt;
&lt;td&gt;183&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cost per call&lt;/td&gt;
&lt;td&gt;$0.0041&lt;/td&gt;
&lt;td&gt;$0.0018&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User satisfaction&lt;/td&gt;
&lt;td&gt;4.2/5&lt;/td&gt;
&lt;td&gt;4.3/5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output tokens down 55%. Cost down 56%. Satisfaction went up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users don't want "Certainly! I understand your question." They want the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level up: strip inputs too
&lt;/h2&gt;

&lt;p&gt;Output is half the bill. Input is the other half — and it's often worse, because you're sending the same boilerplate context on every call.&lt;/p&gt;

&lt;p&gt;The cheap win: cache your system prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Anthropic SDK — prompt caching
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LARGE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache reads are billed at roughly 10% of the normal input-token price (writing to the cache costs slightly more than an uncached call, but you write once and read thousands of times). If your system prompt is 2,000 tokens and you call it 10,000 times a day, you just cut &lt;strong&gt;close to 90% of that budget line&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The deeper win: stop sending context the model doesn't need. If your RAG retrieval returns 8 chunks but only 2 are relevant, you're paying to process 6 chunks of noise. Rerank harder. Retrieve less.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But doesn't terse output hurt UX?"
&lt;/h2&gt;

&lt;p&gt;This is the pushback I hear most. The data says the opposite.&lt;/p&gt;

&lt;p&gt;Users rate concise answers higher than padded ones in every eval I've seen. Nobody reads "I'd be delighted to assist you with that query." They skim past it looking for the answer. The filler is friction, not warmth.&lt;/p&gt;

&lt;p&gt;If your product genuinely needs conversational tone — customer support bots, companions — keep the warmth but strip the &lt;em&gt;redundancy&lt;/em&gt;. "Thanks for reaching out!" once is fine. Five times across one response is expensive cosplay.&lt;/p&gt;

&lt;h2&gt;
  
  
  The non-obvious takeaway
&lt;/h2&gt;

&lt;p&gt;Token usage isn't an optimization problem. It's a design problem.&lt;/p&gt;

&lt;p&gt;Most teams treat LLM cost like server cost — something you fix by scaling. But LLM cost is determined at prompt-design time. A badly designed prompt costs 3x more for worse answers. A well-designed prompt costs less and answers better.&lt;/p&gt;

&lt;p&gt;The teams who figure this out in 2026 will ship AI features at one-third the cost of everyone else. That's not a small moat. That's the whole game.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Add the "no preamble" instruction to your system prompt — 30 seconds, saves ~40% immediately.&lt;/li&gt;
&lt;li&gt;Turn on prompt caching for any system prompt over 1,000 tokens.&lt;/li&gt;
&lt;li&gt;Log token usage per endpoint. You can't fix what you don't measure.&lt;/li&gt;
&lt;/ol&gt;
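<p>Step 3 can start as small as a counter keyed by endpoint (a sketch; in production, swap the in-process dict for your metrics system and feed it the usage fields your SDK returns):</p>

```python
from collections import defaultdict

# Running token totals per endpoint.
usage_by_endpoint: dict[str, dict[str, int]] = defaultdict(
    lambda: {"input_tokens": 0, "output_tokens": 0, "calls": 0}
)

def record_usage(endpoint: str, input_tokens: int, output_tokens: int) -> None:
    # Call this after each model response with the usage the API reports.
    stats = usage_by_endpoint[endpoint]
    stats["input_tokens"] += input_tokens
    stats["output_tokens"] += output_tokens
    stats["calls"] += 1
```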

&lt;p&gt;If you're running LLMs in production and you haven't done these three things, you're leaving real money on the table.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow LayerZero for more decoded AI infrastructure. Next up: the RAG retrieval bug costing you 40% of your relevance score.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
