<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Fabricio Quagliariello</title>
    <description>The latest articles on Forem by Fabricio Quagliariello (@fmquaglia).</description>
    <link>https://forem.com/fmquaglia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F111468%2F9461ba86-307e-4fb0-acca-a512f3f4d6f7.png</url>
      <title>Forem: Fabricio Quagliariello</title>
      <link>https://forem.com/fmquaglia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/fmquaglia"/>
    <language>en</language>
    <item>
      <title>The OptiPFair Series #1: Forging the Future with Small Models — An Architectural Analysis with Pere Martra</title>
      <dc:creator>Fabricio Quagliariello</dc:creator>
      <pubDate>Tue, 16 Dec 2025 10:49:16 +0000</pubDate>
      <link>https://forem.com/fmquaglia/the-optipfair-series-1-forging-the-future-with-small-models-an-architectural-analysis-with-pere-4lge</link>
      <guid>https://forem.com/fmquaglia/the-optipfair-series-1-forging-the-future-with-small-models-an-architectural-analysis-with-pere-4lge</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Originally published on &lt;a&gt;Principia Agentica&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This is the first episode of The OptiPFair Series, a deep-dive exploration of Small Language Model optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI race has prioritized parameter count, but for real-world systems, the equation has changed.&lt;br&gt;
We've entered the efficiency era. In this first OptiPFair Series episode, I speak with Pere Martra—engineer, educator, and OptiPFair creator—to dissect his tool and its philosophy.&lt;br&gt;
From depth vs. width pruning to surgical bias removal, this architect-to-architect conversation explores building the next generation of Small Language Models. The future belongs to specialists: small, fast, and fair.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction: When "Bigger" Stopped Being "Better"
&lt;/h2&gt;

&lt;p&gt;We live in the age of giants—and perhaps we're witnessing their fall?&lt;/p&gt;

&lt;p&gt;Over the past few years, the AI race has been defined by a brutal metric: the number of parameters. Bigger seemed, invariably, better. But for those of us building systems in the real world—those who have to deal with cloud budgets, real-time latency, and edge devices—the equation has changed.&lt;/p&gt;

&lt;p&gt;We've entered the age of &lt;strong&gt;efficiency&lt;/strong&gt;. The rise of &lt;em&gt;Small Language Models&lt;/em&gt; (SLMs) isn't a passing fad; it's a necessary market correction. But how do we take these models and make them even faster, lighter, and fairer without destroying their intelligence in the process?&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Pere Martra&lt;/strong&gt; and his new creation come in: &lt;strong&gt;OptiPFair&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Pere isn't an ivory tower academic. He's a seasoned engineer, a prolific educator (his LLM course repository is a must-read reference that I highly recommend), and above all, a pragmatic builder. I had the privilege of sitting down with him to dissect not just his tool, but the philosophy behind it.&lt;/p&gt;

&lt;p&gt;What follows isn't a simple interview; it's a deep dive into the mind of an architect who is defining how we'll build the next generation of efficient AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Act I: The Pragmatic Spark and the Secret of Productivity
&lt;/h2&gt;

&lt;p&gt;The first thing I wanted to know was the origin. We often imagine that open-source libraries are born from grand theoretical epiphanies. Pere's story, however, is refreshingly human and pragmatic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fabricio Q:&lt;/strong&gt; Pere, OptiPFair is a sophisticated tool. What was the specific pain point or "spark" that led you to say "I need to build this"?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; Well, it came from a technical test. I was asked to create an optimized version of a model, and I decided to do &lt;em&gt;pruning&lt;/em&gt;. From that test I started researching, and over these months SLMs have gained importance and various papers have appeared that I've based my work on. The most important one was from NVIDIA, explaining how they created their model families using &lt;em&gt;structured pruning&lt;/em&gt; plus &lt;em&gt;knowledge distillation&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Architect's Analysis:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This answer reveals two fundamental truths about good engineering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Innovation is born from necessity:&lt;/strong&gt; OptiPFair wasn't born looking for a problem; it was born solving one. That's the best guarantee of usefulness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curiosity as a driver:&lt;/strong&gt; Pere didn't just pass the technical test. He used that challenge as a springboard to investigate the state of the art (Nvidia papers) and democratize that complex technology into an accessible tool.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But there's something deeper in Pere's way of working. When I asked him how he manages to maintain such high output—books, courses, libraries, private work—he revealed his personal "algorithm" for productivity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; "I try to leverage everything I do; everything I do has at least two uses. OptiPFair came from a commission... from that problem came a notebook for my course, and from that notebook came the library.When I do development, depending on how rushed I am: I can start with a notebook that goes to the course and from the notebook it moves to the library, or I go straight to the library to solve what needs to be known in the project and then, when I have time, that moves toward educational notebooks."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; For Pere, code is never an end in itself. It's a vehicle. &lt;strong&gt;OptiPFair&lt;/strong&gt; is the crystallization of his knowledge, packaged so others can use it (&lt;em&gt;the library&lt;/em&gt;) and understand it (&lt;em&gt;the book and the course&lt;/em&gt;). It's the perfect cycle of learning and teaching.&lt;/p&gt;




&lt;h2&gt;
  
  
  Act II: The Architectural "Sweet Spot" and the Ethics of Code
&lt;/h2&gt;

&lt;p&gt;Once the origin was understood, it was time to talk architecture. The optimization ecosystem is full of noise. There are a thousand ways to make a model smaller (quantization, distillation, unstructured pruning). I asked Pere where exactly OptiPFair fits. His answer was a lesson in &lt;strong&gt;knowing your terrain&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; "OptiPFair doesn't compete in the 70B parameter range. Its 'sweet spot' is sub-13B models, and specifically, deployment efficiency through &lt;strong&gt;Depth Pruning&lt;/strong&gt;.Many width pruning methods theoretically reduce parameters, but often fail to improve actual inference speed in small batch scenarios (like local devices), because they break the memory alignment that GPUs love. By removing complete transformer blocks (&lt;em&gt;depth pruning&lt;/em&gt;), we achieve hardware-agnostic acceleration."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From the Principia Agentica Laboratory: The Acid Test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Inspired by this distinction, I didn't stop at theory: I took OptiPFair to my own lab to test the premise with a 90-minute "Hello, Speedup" recipe.&lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;Llama-3.2-1B&lt;/code&gt; model as baseline, I ran two strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Width Pruning (MLP_GLU):&lt;/strong&gt; Removing individual neurons from the MLP expansion layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depth Pruning:&lt;/strong&gt; Removing the last three transformer blocks outright.&lt;/li&gt;
&lt;/ol&gt;
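&lt;p&gt;The depth-pruning strategy above can be sketched with plain Hugging Face &lt;code&gt;transformers&lt;/code&gt;. To be clear, this is my own illustration, not OptiPFair's API, and it uses a tiny, randomly initialized Llama-style config so nothing needs to be downloaded:&lt;/p&gt;

```python
# Illustrative sketch of depth pruning (dropping whole transformer blocks).
# NOT OptiPFair's API: shown on a tiny random Llama-style model for clarity.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=4,
    vocab_size=1000,
)
model = LlamaForCausalLM(config)

# Depth pruning: remove the last 3 transformer blocks wholesale,
# then keep the config in sync with the new architecture.
n_remove = 3
model.model.layers = model.model.layers[:-n_remove]
model.config.num_hidden_layers = len(model.model.layers)

print(model.config.num_hidden_layers)  # 5 blocks remain
```

&lt;p&gt;Because entire blocks disappear, every remaining matrix keeps its original shape, which is exactly why this approach preserves the memory alignment that GPUs love.&lt;/p&gt;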

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwdvl9au90ylg0pi2sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwdvl9au90ylg0pi2sx.png" alt="Depth vs Width Pruning Speed" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Laboratory Verdict:&lt;/strong&gt; The results validated Pere's thesis. While width pruning maintained the global structure more faithfully, &lt;strong&gt;depth pruning delivered a significantly larger performance gain&lt;/strong&gt;: a 15.6% improvement in Tokens Per Second (TPS) compared to width pruning's 4.3%, with controllable quality degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproduce these results yourself&lt;/strong&gt;: all benchmarks are documented in an interactive Jupyter notebook.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/github/fmquaglia/principia-agentica/blob/master/articles/091925-memory-in-agents/optipfair_series_1.ipynb" rel="noopener noreferrer"&gt;Open in Colab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/fmquaglia/principia-agentica/blob/master/articles/091925-memory-in-agents/optipfair_series_1.ipynb" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Visualizing the Invisible: Bias&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;But speed isn't everything. And this is where OptiPFair plays its hidden card. Pere showed me a demo that stopped me cold. It wasn't about TPS; it was about &lt;strong&gt;ethics&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; "It's not enough to make the model fast. We need to know if pruning it amplifies biases. OptiPFair includes a bias visualization module that analyzes how layers activate in response to protected attributes."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He shared an example with a recent &lt;code&gt;Llama-3.2&lt;/code&gt; model. Given a prompt about a Black man in an ambiguous situation, the original model hallucinated a violent response (a shooting). After surgical intervention using OptiPFair's analysis tools—removing just 0.1% of specific neurons—the model changed its response: the police officer no longer shot, but called for help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architect's Analysis:&lt;/strong&gt; This is a game-changer. Normally, we treat "ethics" and "optimization" as separate silos. Pere has integrated them into the same toolbox. He reminds us that an "efficient" model that amplifies prejudices isn't production-ready; it's a liability risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Act III: "We're Going to Run Out of Planet" and the Master's Advice
&lt;/h2&gt;

&lt;p&gt;Toward the end of our conversation, the discussion turned to the future. I asked Pere where he thinks all this is going. His answer was a sobering reminder of why efficiency isn't just a cost issue, but a sustainability one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; "If for every specific need we use a 700 billion parameter model... we're going to run out of planet in five years. We need generalist models, yes, but the future belongs to specialists: small models, fast and consuming less."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This vision drives OptiPFair's &lt;em&gt;roadmap&lt;/em&gt;. It doesn't stop here. Pere is already working on &lt;strong&gt;Knowledge Distillation&lt;/strong&gt; and &lt;strong&gt;attention layer pruning&lt;/strong&gt;, seeking that holy grail where a small model doesn't just mimic a large one, but competes with it in its niche.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deep Dive: Notes for the Advanced Architect&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before closing, I took the opportunity to ask Pere some "architect to architect" questions about the technical limits of these techniques. Here are the key &lt;em&gt;insights&lt;/em&gt; for those who want to take this to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is there a "safe" pruning range?&lt;/strong&gt; It depends drastically on the family. "Llama handles MLP layer pruning very well (up to 400% of original expansion), while families like Gemma are more fragile. The safe limit usually hovers around 140% remaining expansion, but it will almost always require a recovery process (retraining or distillation)".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Last" layers heuristic:&lt;/strong&gt; Although depth pruning often targets the last layers, Pere clarified that this is an oversimplification. The recommended practice is to protect the first 4 blocks (fundamental for input processing) and the last 2 (essential for output consolidation). The "fat" is usually in the middle.&lt;/li&gt;
&lt;/ul&gt;
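&lt;p&gt;Pere's "protect the edges, prune the middle" heuristic is simple enough to sketch in a few lines. The helper below is my own hypothetical illustration of the rule, not something shipped in OptiPFair:&lt;/p&gt;

```python
# Hypothetical helper illustrating the heuristic quoted above: protect the
# first 4 transformer blocks and the last 2, and treat the middle as
# candidates for depth pruning. Not part of OptiPFair's actual API.
def prunable_blocks(num_layers, protect_head=4, protect_tail=2):
    """Return indices of blocks that are fair game for depth pruning."""
    if num_layers <= protect_head + protect_tail:
        return []  # model too shallow: nothing is safely removable
    return list(range(protect_head, num_layers - protect_tail))

# A 16-layer model (the Llama-3.2-1B ballpark): the "fat" is in the middle.
print(prunable_blocks(16))  # [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```

&lt;p&gt;Note how the naive "cut the last N layers" shortcut would violate the &lt;code&gt;protect_tail&lt;/code&gt; guard, which is exactly the oversimplification Pere warns about.&lt;/p&gt;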

&lt;h3&gt;
  
  
  &lt;strong&gt;The Final Advice: Top to Bottom&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To finish, I asked for advice for engineers who are starting out in this dizzying field. His answer validates the path many of us are taking.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pere Martra:&lt;/strong&gt; "Don't get bored. Study from top to bottom. Start using an API, doing something easy that you like. Once you have it, go down. Go to the foundations. Understand how a Transformer works, what a GLU structure is. Those 'aha!' moments when you connect practice with theory are what make you an expert."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Conclusion: The Lighthouse Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OptiPFair&lt;/strong&gt; isn't just another library in the Python ocean. It's a statement of principles.&lt;/p&gt;

&lt;p&gt;For the modern AI architect, it represents the perfect tool for the "Edge AI" and efficiency era. If your goal is to deploy language models in constrained environments, controlling both latency and ethical bias, this is an essential piece in your toolbelt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I take away from Pere&lt;/strong&gt;: The most sophisticated technology is born from the simplest pragmatism. You don't need to start with a grand theory; you need to start solving a real problem. And if in the process you can teach others and build tools that make work fairer and more efficient, then you're building a legacy.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;principia-agentica&lt;/code&gt; laboratory approves and recommends &lt;strong&gt;OptiPFair&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I want to use OptiPFair. Where do I start?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official OptiPFair repository:&lt;/strong&gt; &lt;a href="http://github.com/peremartra/optipfair" rel="noopener noreferrer"&gt;github.com/peremartra/optipfair&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pere's complete LLM course (free):&lt;/strong&gt; An educational treasure that covers everything from fundamentals to advanced techniques. Highly recommended. &lt;a href="https://github.com/peremartra/Large-Language-Model-Notebooks-Course" rel="noopener noreferrer"&gt;https://github.com/peremartra/Large-Language-Model-Notebooks-Course&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Large Language Models Projects" book (Apress, 2024):&lt;/strong&gt; Pere's definitive guide on LLMs, now available. &lt;a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8" rel="noopener noreferrer"&gt;https://link.springer.com/book/10.1007/979-8-8688-0515-8&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upcoming book with Manning:&lt;/strong&gt; Pere is working on a book about model architecture and optimization that will delve deeper into OptiPFair and related techniques. Stay tuned.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Connect with Pere Martra:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; Follow his updates on OptiPFair, SLMs, and the future of efficient AI

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/pere-martra/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pere-martra/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Hugging Face:&lt;/strong&gt; Explore his optimized models and experiments with SLMs

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/oopere" rel="noopener noreferrer"&gt;https://huggingface.co/oopere&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Medium:&lt;/strong&gt; Read his articles on model optimization and advanced ML techniques

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@peremartra" rel="noopener noreferrer"&gt;https://medium.com/@peremartra&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Community:&lt;/strong&gt; Pere is an active mentor at &lt;a href="https://community.deeplearning.ai/u/pere_martra/summary" rel="noopener noreferrer"&gt;DeepLearning.AI&lt;/a&gt; and regularly contributes to &lt;a href="https://towardsai.net/?s=pere%20martra" rel="noopener noreferrer"&gt;TowardsAI&lt;/a&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you found this article useful:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://peremartra.github.io/optipfair/" rel="noopener noreferrer"&gt;Try OptiPFair&lt;/a&gt; in your next optimization project&lt;/li&gt;
&lt;li&gt;Share this analysis with your ML team&lt;/li&gt;
&lt;li&gt;Consider supporting Pere's open source work by giving it a star on GitHub&lt;/li&gt;
&lt;li&gt;Follow Principia Agentica's work for more in-depth architectural analyses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Efficiency isn't just a technical metric. It's a commitment to a sustainable future for AI. Pere Martra is leading that path, one line of code at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Editor's Note (December 2025)&lt;/strong&gt;: While this article was being prepared for publication, Pere released significant improvements to OptiPFair that address precisely the memory alignment limitation mentioned.&lt;br&gt;&lt;br&gt;
Now &lt;code&gt;width pruning&lt;/code&gt; supports an &lt;code&gt;expansion_divisor&lt;/code&gt; parameter (32, 64, 128, or 256) to keep pruned tensor sizes aligned with GPU tensor cores, and accepts a &lt;code&gt;dataloader&lt;/code&gt; for data-driven neuron selection. It's a sign of how quickly OptiPFair is evolving.&lt;br&gt;
A complete update will come in the OptiPFair Series from Principia Agentica.&lt;/p&gt;
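&lt;p&gt;The alignment idea behind an &lt;code&gt;expansion_divisor&lt;/code&gt; can be sketched as rounding the pruned MLP width to a hardware-friendly multiple. This is my own illustration of the concept, not OptiPFair's internals:&lt;/p&gt;

```python
# Sketch of the idea behind an expansion_divisor: after width pruning, keep
# the MLP intermediate size a multiple of a hardware-friendly divisor so GPU
# kernels stay aligned. My illustration, not OptiPFair's implementation.
def aligned_intermediate_size(target_size, divisor=64):
    """Round a pruned intermediate size up to the nearest multiple of divisor."""
    return ((target_size + divisor - 1) // divisor) * divisor

# Pruning an 8192-wide MLP by ~30% gives 5734 neurons;
# alignment nudges that up to 5760 (a multiple of 64).
print(aligned_intermediate_size(5734))  # 5760
```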




&lt;p&gt;&lt;strong&gt;More from Principia Agentica:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Follow the series and explore hands-on labs, architectural analyses, and AI agent deep-dives at &lt;a href="https://principia-agentica.io" rel="noopener noreferrer"&gt;principia-agentica.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Beyond the Notebook: 4 Architectural Patterns for Production-Ready AI Agents</title>
      <dc:creator>Fabricio Quagliariello</dc:creator>
      <pubDate>Wed, 10 Dec 2025 21:57:39 +0000</pubDate>
      <link>https://forem.com/fmquaglia/beyond-the-notebook-4-architectural-patterns-for-production-ready-ai-agents-3a16</link>
      <guid>https://forem.com/fmquaglia/beyond-the-notebook-4-architectural-patterns-for-production-ready-ai-agents-3a16</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-kaggle-ai-agents-2025-11-10"&gt;Google AI Agents Writing Challenge&lt;/a&gt;: Learning Reflections&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The gap between a "Hello World" agent running in a Jupyter Notebook and a reliable, production-grade system is not a step—it's a chasm (and it is not an easy one to cross).&lt;/p&gt;

&lt;p&gt;I recently had the privilege of participating in the &lt;strong&gt;5-Day AI Agents Intensive Course with Google and Kaggle&lt;/strong&gt;. After completing the coursework and finalizing the capstone project, I realized that beyond the course's many assets (insightful white papers, carefully designed notebooks, and exceptional expert panels in the live sessions), the real treasure wasn't just learning the ADK syntax—it was the &lt;strong&gt;architectural patterns&lt;/strong&gt; subtly embedded within the lessons.&lt;/p&gt;

&lt;p&gt;As an architect building production systems for over 20 years, including multi-agent workflows and enterprise integrations, I've seen firsthand where theoretical agents break under real-world constraints.&lt;/p&gt;

&lt;p&gt;We are moving from an era of "prompt engineering" to "agent architecture" where "context engineering" is key. As with any other emerging architectural paradigm, this shift demands blueprints that ensure reliability, efficiency, and ethical safety. Without them, we risk agents that silently degrade, violate user privacy, or execute irreversible actions without oversight.&lt;/p&gt;

&lt;p&gt;Drawing from the course and my own experience as an AI Architect, I have distilled the curriculum into four essential patterns that transform fragile prototypes into robust production systems:&lt;/p&gt;

&lt;h3&gt;
  
  
  The 4 Core Patterns
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outside-In Evaluation Hierarchy:&lt;/strong&gt; Shifting focus from the final answer to the decision-making trajectory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-Layer Memory Architecture:&lt;/strong&gt; Balancing ephemeral session context with persistent, consolidated knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol-First Interoperability:&lt;/strong&gt; Decoupling agents from tools using standardized protocols like MCP and A2A.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Running Operations &amp;amp; Resumability:&lt;/strong&gt; Managing state for asynchronous tasks and human-in-the-loop workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Throughout this analysis, I'll apply a &lt;strong&gt;6-point framework&lt;/strong&gt; grounded in the principles of &lt;strong&gt;Principia Agentica&lt;/strong&gt;—ensuring these patterns respect human sovereignty, fiduciary responsibility, and meaningful human control.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analysis Framework
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Production Problem:&lt;/strong&gt; Why naive approaches fail at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Architectural Solution:&lt;/strong&gt; The specific design pattern taught in the course.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Implementation Details:&lt;/strong&gt; Concrete code-level insights from the ADK notebooks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Considerations:&lt;/strong&gt; Real-world deployment implications (latency, cost, scale).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection to Ethical Design:&lt;/strong&gt; How the pattern supports human sovereignty, fiduciary responsibility, or ethical agent architecture. I will include a "failure scenario" where I'll try to illustrate what could happen without the ethical safeguard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Takeaways:&lt;/strong&gt; A distilled summary of each pattern's production principle, implementation guidance, and ethical anchor—designed to serve as a quick reference for architects moving from prototype to production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's do this!&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 1: Outside-In Evaluation Hierarchy (Trajectory as Truth)
&lt;/h2&gt;

&lt;p&gt;In traditional software, if the output is correct, the test passes. In agentic AI, a correct answer derived from a hallucination or a dangerous logic path is a ticking time bomb.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Production Problem
&lt;/h3&gt;

&lt;p&gt;Naive evaluation strategies often fail in production due to the non-deterministic nature of LLMs. We face two specific traps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The "Lucky Guess" Trap:&lt;/strong&gt; Imagine an agent asked to "Get the weather in Tokyo." A bad agent might hallucinate "It is sunny in Tokyo" without calling the weather tool. If it happens to be sunny, a traditional &lt;code&gt;assert result == expected&lt;/code&gt; test passes. This hides a critical failure in logic that will break as soon as the weather changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Silent Failure" of Efficiency:&lt;/strong&gt; An agent might solve a user request but take 25 steps to do what should have taken 3. This bloats token costs and latency—a failure mode that boolean output checks completely miss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The Architectural Solution
&lt;/h3&gt;

&lt;p&gt;Day 4 of the course introduced the concept of &lt;strong&gt;Glass Box Evaluation&lt;/strong&gt;. We move away from simple output verification to a hierarchical approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Level 1: Black Box (End-to-End):&lt;/strong&gt; Did the user get the right result?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2: Glass Box (Trajectory):&lt;/strong&gt; Did the agent use the correct tools in the correct order?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3: Component (Unit):&lt;/strong&gt; Did the individual tools perform as expected?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This shift treats the &lt;strong&gt;trajectory&lt;/strong&gt; (Thought → Action → Observation) as the unit of truth. By evaluating the trajectory, we ensure the agent isn't just "getting lucky," but is actually &lt;em&gt;reasoning&lt;/em&gt; correctly.&lt;/p&gt;
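&lt;p&gt;The "trajectory as truth" check can be sketched as a comparison of expected versus actual tool calls. The toy function below is my own illustration of the idea, not ADK's built-in evaluator:&lt;/p&gt;

```python
# Minimal sketch of "trajectory as truth": compare the tool calls an agent
# actually made against the expected trajectory, instead of only checking
# the final answer. A toy illustration, not ADK's evaluation machinery.
def trajectory_matches(expected_calls, actual_calls):
    """Exact in-order match of (tool_name, tool_input) pairs."""
    if len(expected_calls) != len(actual_calls):
        return False
    return all(
        exp["tool_name"] == act["tool_name"]
        and exp["tool_input"] == act["tool_input"]
        for exp, act in zip(expected_calls, actual_calls)
    )

expected = [{"tool_name": "get_stock_price", "tool_input": {"symbol": "GOOG"}}]

# A "lucky guess" agent that never called the tool fails the trajectory
# check, even if its final answer happened to be right.
print(trajectory_matches(expected, []))        # False
print(trajectory_matches(expected, expected))  # True
```

&lt;p&gt;This is exactly what closes the "Lucky Guess" trap: a correct answer with an empty or wrong trajectory is still a failing test.&lt;/p&gt;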

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkltnk27rcr2v6gsp5xd9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkltnk27rcr2v6gsp5xd9.webp" alt="pattern1_1" width="800" height="784"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Implementation Details: Field Notes from the ADK
&lt;/h3&gt;

&lt;p&gt;The ADK provides specific primitives to capture and score these trajectories without writing custom parsers for every test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From &lt;code&gt;adk web&lt;/code&gt; to &lt;code&gt;evalset.json&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Instead of manually writing test cases, the ADK encourages a "Capture and Replay" workflow. During development (using &lt;code&gt;adk web&lt;/code&gt;), when you spot a successful interaction, you can persist that session state. This generates an &lt;code&gt;evalset.json&lt;/code&gt; that captures not just the input/output, but the &lt;em&gt;expected tool calls&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Conceptual&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;structure&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ADK&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evalset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;entry&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Traditional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;test:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;just&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input/output&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ADK&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evalset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;contains&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evalcases&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;invocations:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(queries)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;expected_tool_use&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;reference&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(output)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ask_GOOGLE_price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;given&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evaluation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;set&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evaluation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;cases&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;included&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;here&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is the stock price of GOOG?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"reference"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The price is $175..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;semantic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;output&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"expected_tool_use"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;trajectory&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_stock_price"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="nl"&gt;"tool_input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;arguments&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;passed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GOOG"&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;other&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;evaluation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;cases&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initial_session"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"app_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello_world"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;specific&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This JSON represents an &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/eval_set.py#L22" rel="noopener noreferrer"&gt;&lt;code&gt;EvalSet&lt;/code&gt;&lt;/a&gt; containing one &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/eval_case.py#L132" rel="noopener noreferrer"&gt;&lt;code&gt;EvalCase&lt;/code&gt;&lt;/a&gt;. Each &lt;code&gt;EvalCase&lt;/code&gt; has a &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt; (a list of invocations), and an optional &lt;code&gt;initial_session&lt;/code&gt;. 
Each invocation within the &lt;code&gt;data&lt;/code&gt; list includes a &lt;code&gt;query&lt;/code&gt;, &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/evaluation_constants.py#L20" rel="noopener noreferrer"&gt;&lt;code&gt;expected_tool_use&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/local_eval_sets_manager.py#L55" rel="noopener noreferrer"&gt;&lt;code&gt;expected_intermediate_agent_responses&lt;/code&gt;&lt;/a&gt;, and a &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/evaluation_constants.py#L22" rel="noopener noreferrer"&gt;&lt;code&gt;reference&lt;/code&gt;&lt;/a&gt; response. &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/eval_set.py#L22" rel="noopener noreferrer"&gt;&lt;code&gt;EvalSet&lt;/code&gt;&lt;/a&gt; object itself also includes &lt;code&gt;eval_set_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;eval_cases&lt;/code&gt;, and &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/evaluation/eval_set.py#L38" rel="noopener noreferrer"&gt;&lt;code&gt;creation_timestamp&lt;/code&gt;&lt;/a&gt; fields. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring the Judge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;test_config.json&lt;/code&gt;, we can move beyond simple string matching. The course demonstrated configuring &lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt; evaluators.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naive Approach:&lt;/strong&gt; Uses an exact match evaluator (brittle, fails on phrasing differences).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Approach:&lt;/strong&gt; Uses &lt;code&gt;TrajectoryEvaluator&lt;/code&gt; alongside &lt;code&gt;SemanticSimilarity&lt;/code&gt;. The ADK allows us to define "Golden Sets" where the &lt;em&gt;reasoning path&lt;/em&gt; is the standard, allowing the LLM judge to penalize agents that skip steps or hallucinate data, even if the final text looks plausible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Configuration Components&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To configure an LLM-as-a-Judge effectively, you must construct a specific input payload with four components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Agent's Output:&lt;/strong&gt; The actual response generated by the agent you are testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Original Prompt:&lt;/strong&gt; The specific instruction or query the user provided.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Golden" Answer:&lt;/strong&gt; A reference answer or ground truth to serve as a benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Detailed Evaluation Rubric:&lt;/strong&gt; Specific criteria (e.g., "Rate helpfulness on a scale of 1-5") and requirements for the judge to explain its reasoning.&lt;/li&gt;
&lt;/ol&gt;
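&lt;p&gt;As a concrete sketch, the four components above can be assembled into a single judge prompt. The helper function and wording are illustrative, not part of the ADK API:&lt;/p&gt;

```python
# Hypothetical helper: assemble the four judge-input components into one prompt.
def build_judge_payload(agent_output, original_prompt, golden_answer, rubric):
    """Combine output, prompt, golden answer, and rubric into a judge prompt."""
    return (
        "You are an impartial evaluator.\n"
        f"Rubric: {rubric}\n"
        f"User prompt: {original_prompt}\n"
        f"Reference (golden) answer: {golden_answer}\n"
        f"Candidate answer: {agent_output}\n"
        "Score the candidate against the rubric and explain your reasoning."
    )

payload = build_judge_payload(
    agent_output="The price is $175.",
    original_prompt="What is the stock price of GOOG?",
    golden_answer="The price is $175...",
    rubric="Rate helpfulness on a scale of 1-5.",
)
print(payload)
```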

&lt;p&gt;&lt;strong&gt;ADK Default Evaluators&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ADK Evaluation Framework includes several default evaluators, accessible via the &lt;code&gt;MetricEvaluatorRegistry&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RougeEvaluator&lt;/code&gt;&lt;/strong&gt;: Uses the ROUGE-1 metric to score similarity between an agent's final response and a golden response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FinalResponseMatchV2Evaluator&lt;/code&gt;&lt;/strong&gt;: Uses an LLM-as-a-judge approach to determine if an agent's response is valid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TrajectoryEvaluator&lt;/code&gt;&lt;/strong&gt;: Assesses the accuracy of an agent's tool use trajectories by comparing the sequence of tool calls against expected calls. It supports various match types (&lt;code&gt;EXACT&lt;/code&gt;, &lt;code&gt;IN_ORDER&lt;/code&gt;, &lt;code&gt;ANY_ORDER&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SafetyEvaluatorV1&lt;/code&gt;&lt;/strong&gt;: Assesses the safety (harmlessness) of an agent's response, delegating to the Vertex Gen AI Eval SDK.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HallucinationsV1Evaluator&lt;/code&gt;&lt;/strong&gt;: Checks if a model response contains any false, contradictory, or unsupported claims by segmenting the response into sentences and validating each against the provided context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RubricBasedFinalResponseQualityV1Evaluator&lt;/code&gt;&lt;/strong&gt;: Assesses the quality of an agent's final response against user-defined rubrics, using an LLM as a judge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RubricBasedToolUseV1Evaluator&lt;/code&gt;&lt;/strong&gt;: Assesses the quality of an agent's tool usage against user-defined rubrics, employing an LLM as a judge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These evaluators can be configured using &lt;code&gt;EvalConfig&lt;/code&gt; objects, which specify the criteria and thresholds for assessment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bias Mitigation Strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A major challenge is handling bias, such as the tendency for models to give average scores or prefer the first option presented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pairwise Comparison (A/B Testing):&lt;/strong&gt; Instead of asking for an absolute score, configure the judge to compare two different responses (Answer A vs. Answer B) and force a choice. This yields a "win rate," which is often a more reliable signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swapping Operation:&lt;/strong&gt; To counter &lt;strong&gt;position bias&lt;/strong&gt;, invoke the judge twice, swapping the order of the candidates. If the results are inconsistent, the result can be labeled as a "tie".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule Augmentation:&lt;/strong&gt; Embed specific evaluation principles, references, and rubrics directly into the judge's system prompt.&lt;/li&gt;
&lt;/ul&gt;
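&lt;p&gt;The swapping operation reduces to a few lines of control flow. A minimal sketch, with a stub standing in for the real LLM judge call:&lt;/p&gt;

```python
# Illustrative sketch of the swapping operation for position-bias control.
# `judge` stands in for an LLM call returning "A", "B", or "tie".
def debiased_compare(judge, answer_a, answer_b):
    """Run the judge twice with candidates swapped; inconsistent verdicts become a tie."""
    first = judge(answer_a, answer_b)        # original order
    second = judge(answer_b, answer_a)       # swapped order
    # Map the second verdict back to the original labels.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_unswapped else "tie"

# A position-biased stub that always prefers whichever answer came first:
biased_judge = lambda a, b: "A"
print(debiased_compare(biased_judge, "answer one", "answer two"))  # → tie
```

The swap exposes the bias: a judge that always picks position one contradicts itself when the order flips, so its verdict collapses to a tie.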

&lt;p&gt;&lt;strong&gt;Advanced Configuration: Agent-as-a-Judge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's a distinction between standard &lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt; (which evaluates final text outputs) and &lt;strong&gt;Agent-as-a-Judge&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard LLM-as-a-Judge:&lt;/strong&gt; Best for evaluating the final response (e.g., "Is this summary accurate?").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-as-a-Judge:&lt;/strong&gt; Necessary when you need to evaluate the &lt;strong&gt;process&lt;/strong&gt;, not just the result. You configure the judge to ingest the agent's full &lt;strong&gt;execution trace&lt;/strong&gt; (including internal thoughts, tool calls, and tool arguments). This allows the judge to assess intermediate steps, such as whether the correct tool was chosen or if the plan was logically structured.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Architectures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can use several architectural approaches when configuring your judge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Point-wise:&lt;/strong&gt; The judge evaluates a single candidate in isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair-wise / List-wise:&lt;/strong&gt; The judge compares two or more candidates simultaneously to produce a ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Collaboration:&lt;/strong&gt; For high-stakes evaluation, you can configure multiple LLM judges to debate or vote (e.g., "Peer Rank" algorithms) to produce a final consensus, rather than relying on a single model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a pairwise comparison judge, prompt the model to emit its verdict as structured JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"winner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"B"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tie"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Answer A provided more specific delivery details..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structured output allows you to programmatically parse the judge's decision and calculate metrics like win/loss rates at scale.&lt;/p&gt;
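&lt;p&gt;Given verdicts in that schema, the parsing and win-rate math is straightforward. A small sketch with hard-coded example verdicts:&lt;/p&gt;

```python
import json

# Parse structured judge verdicts and compute a win rate for candidate A.
# The verdict strings are illustrative examples following the schema above.
raw_verdicts = [
    '{"winner": "A", "rationale": "More specific delivery details."}',
    '{"winner": "B", "rationale": "Better formatting."}',
    '{"winner": "A", "rationale": "Correct tool arguments."}',
    '{"winner": "tie", "rationale": "Inconsistent across orderings."}',
]

verdicts = [json.loads(v)["winner"] for v in raw_verdicts]
wins_a = verdicts.count("A")
decided = sum(1 for v in verdicts if v != "tie")   # exclude ties
win_rate_a = wins_a / decided if decided else 0.0
print(f"A win rate (excluding ties): {win_rate_a:.0%}")  # → 67%
```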

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can think of configuring an &lt;strong&gt;LLM-as-a-Judge&lt;/strong&gt; like setting up a &lt;strong&gt;blind taste test&lt;/strong&gt;. If you just hand a judge a cake and ask "Is this good?", they might be polite and say "Yes." But if you provide them with a &lt;strong&gt;Golden Answer&lt;/strong&gt; (a cake baked by a master chef) and use &lt;strong&gt;Pairwise Comparison&lt;/strong&gt; (ask "Which of these two is better?"), you force them to make a critical distinction, resulting in far more accurate and actionable feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Production Considerations
&lt;/h3&gt;

&lt;p&gt;Moving this pattern from a notebook to a live system requires handling scale and cost.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Sampling:&lt;/strong&gt; You cannot trace and judge every single production interaction with an LLM—it’s too expensive. A robust pattern is &lt;strong&gt;100/10 sampling&lt;/strong&gt;: capture 100% of traces that result in user errors or negative feedback, but only sample 10% of successful sessions to monitor for latency drift (P99) and token bloat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Evaluation Flywheel:&lt;/strong&gt; Evaluation isn't a one-time gate before launch. Production traces (captured via OpenTelemetry) must be fed back into the development cycle. Every time an agent fails in production, that specific trajectory should be anonymized and added to the &lt;code&gt;evalset.json&lt;/code&gt; as a regression test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Impact:&lt;/strong&gt; Trajectory logging must be asynchronous. The user should receive their response immediately, while the trace data is pushed to the observability store (like LangSmith or a custom SQL db) in a background thread to avoid degrading the user experience.&lt;/li&gt;
&lt;/ul&gt;
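&lt;p&gt;The 100/10 sampling rule is a few lines of gating logic. An illustrative sketch, not an ADK or OpenTelemetry API:&lt;/p&gt;

```python
import random

# 100/10 sampling: always trace failed or negatively rated sessions,
# sample a fraction (default 10%) of successful ones.
def should_trace(session_had_error, user_feedback_negative,
                 success_sample_rate=0.10):
    if session_had_error or user_feedback_negative:
        return True                                  # capture 100% of failures
    return random.random() < success_sample_rate     # sample successes

# Failures are always traced; successes roughly one time in ten.
assert should_trace(True, False)
```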

&lt;h3&gt;
  
  
  5. Ethical Connection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"The Trajectory is the Truth"&lt;/strong&gt; is the technical implementation of &lt;strong&gt;Fiduciary Responsibility&lt;/strong&gt;. We cannot claim an agent is acting in the user's best interest if we only validate the &lt;em&gt;result&lt;/em&gt; (the "what") while ignoring the &lt;em&gt;process&lt;/em&gt; (the "how"). We must ensure the agent isn't achieving the right ends through manipulative, inefficient, or unethical means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete Failure Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a hiring agent that filters job candidates. Without trajectory validation, it could discriminate based on protected characteristics (age, gender, ethnicity) during the filtering process, yet pass all output tests by producing a "diverse" final shortlist through cherry-picking. The bias hides in the &lt;em&gt;how&lt;/em&gt;—which resumes were read, which criteria were weighted, which candidates were never considered. Output validation alone cannot detect this algorithmic discrimination. Only trajectory evaluation exposes the unethical reasoning path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production Principle:&lt;/strong&gt; Trust the reasoning process, not just the output. Trajectory validation is the difference between lucky guesses and reliable intelligence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt; Use ADK's &lt;code&gt;TrajectoryEvaluator&lt;/code&gt; with &lt;code&gt;EvalSet&lt;/code&gt; objects to capture expected tool calls alongside expected outputs. Configure LLM-as-a-Judge with Golden Sets and pairwise comparison to avoid evaluation bias.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Anchor:&lt;/strong&gt; This pattern operationalizes &lt;strong&gt;Fiduciary Responsibility&lt;/strong&gt;—we validate that the agent serves the user's interests through sound reasoning, not through shortcuts, hallucinations, or hidden bias.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Validating the &lt;em&gt;how&lt;/em&gt; is critical, but what happens when the reasoning path spans not just one conversation turn, but weeks or months? An agent that reasons correctly in the moment can still fail catastrophically if it forgets what it learned yesterday. This brings us to our second pattern: managing the agent's memory architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: Dual-Layer Memory Architecture (Session vs. Memory)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Production Problem
&lt;/h3&gt;

&lt;p&gt;Although models like Gemini 1.5 have introduced massive context windows, treating context as infinite is an architectural anti-pattern.&lt;/p&gt;

&lt;p&gt;Consider a &lt;strong&gt;Travel Agent Bot&lt;/strong&gt;: In Session 1, the user mentions a "shellfish allergy." By Session 10, months later, that critical fact is buried under thousands of tokens of hotel searches and flight comparisons.&lt;/p&gt;

&lt;p&gt;This leads to two concrete failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Rot:&lt;/strong&gt; As the context window fills with noise, the model's ability to attend to specific, older instructions (like the allergy) degrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Spiral:&lt;/strong&gt; Re-sending the entire history of every past interaction for every new query creates a linear cost increase that makes the system economically unviable at scale.&lt;/li&gt;
&lt;/ul&gt;
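&lt;p&gt;The cost spiral is easy to quantify: if every turn re-sends the full history, each turn's input cost grows linearly with the number of turns, so cumulative spend grows quadratically. A back-of-the-envelope sketch with illustrative numbers:&lt;/p&gt;

```python
# Each turn adds ~500 tokens of history, and every turn re-sends all of it.
tokens_per_turn = 500
history = 0      # tokens of accumulated history
cumulative = 0   # total input tokens billed across the conversation

for turn in range(1, 11):
    history += tokens_per_turn   # history grows linearly per turn
    cumulative += history        # each turn pays for the whole history again

print(cumulative)  # → 27500 input tokens after 10 turns, vs 5000 without history
```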

&lt;h3&gt;
  
  
  2. The Architectural Solution
&lt;/h3&gt;

&lt;p&gt;We must distinguish between the &lt;strong&gt;Workbench&lt;/strong&gt; and the &lt;strong&gt;Filing Cabinet&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Session (Workbench):&lt;/strong&gt; An ephemeral, mutable space for the current task. It holds the immediate "Hot Path" context. To keep it performant, we apply &lt;strong&gt;Context Compaction&lt;/strong&gt;—automatically summarizing or truncating older turns while keeping the most recent ones raw.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Memory (Filing Cabinet):&lt;/strong&gt; A persistent layer for consolidated facts. This requires an ETL (Extract, Transform, Load) pipeline where the agent &lt;em&gt;Extracts&lt;/em&gt; facts from the session, &lt;em&gt;Consolidates&lt;/em&gt; them (deduplicating against existing knowledge), and &lt;em&gt;Stores&lt;/em&gt; them for semantic retrieval later.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Implementation Details: Code Insights
&lt;/h3&gt;

&lt;p&gt;The ADK moves memory management from manual implementation to configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Hygiene via Compaction&lt;/strong&gt;&lt;br&gt;
In the ADK, we don't manually trim strings. We configure the agent to handle its own hygiene using &lt;code&gt;EventsCompactionConfig&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents.base_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.apps.app&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EventsCompactionConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.apps.llm_event_summarizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmEventSummarizer&lt;/span&gt; &lt;span class="c1"&gt;# Assuming this is your summarizer
&lt;/span&gt;
&lt;span class="c1"&gt;# Define a simple BaseAgent for the example
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A simple agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Create an instance of LlmEventSummarizer or your custom summarizer
&lt;/span&gt;&lt;span class="n"&gt;my_summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmEventSummarizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create an EventsCompactionConfig
&lt;/span&gt;&lt;span class="n"&gt;events_compaction_config_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EventsCompactionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;summarizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_summarizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compaction_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;overlap_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create an App instance with the EventsCompactionConfig
&lt;/span&gt;&lt;span class="n"&gt;my_app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_application&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MyAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;events_compaction_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;events_compaction_config_instance&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Persistence: From RAM to DB&lt;/strong&gt;&lt;br&gt;
In notebooks, we often use &lt;code&gt;InMemorySessionService&lt;/code&gt;. This is dangerous in production because a container restart wipes the conversation. The architectural shift is moving to &lt;code&gt;DatabaseSessionService&lt;/code&gt; (backed by a SQL database), which persists the &lt;code&gt;Session&lt;/code&gt; object state and lets users resume conversations across devices.&lt;/p&gt;
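&lt;p&gt;Conceptually, the shift looks like this: session state written through to a SQL store outlives the process that wrote it, while an in-memory dict does not. The sketch below mimics what a database-backed session service does; it is not ADK code:&lt;/p&gt;

```python
import json
import sqlite3

# Toy persistence layer standing in for a database-backed session service.
conn = sqlite3.connect(":memory:")  # use a file path (or real DB) in production
conn.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, state TEXT)")

def save_session(session_id, state):
    """Write the session state through to the SQL store."""
    conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)",
                 (session_id, json.dumps(state)))
    conn.commit()

def load_session(session_id):
    """Rehydrate session state; an unknown id yields a fresh empty state."""
    row = conn.execute("SELECT state FROM sessions WHERE id = ?",
                       (session_id,)).fetchone()
    return json.loads(row[0]) if row else {}

save_session("user_42", {"dietary_restrictions": "shellfish allergy"})
print(load_session("user_42"))  # state survives beyond the objects that wrote it
```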

&lt;p&gt;&lt;strong&gt;The Memory Consolidation Pipeline&lt;/strong&gt;&lt;br&gt;
Day 3b introduced the framework for moving from raw storage to intelligent consolidation. This is where the "Filing Cabinet" becomes smart. The workflow is an &lt;strong&gt;LLM-driven ETL pipeline&lt;/strong&gt; with four stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion:&lt;/strong&gt; The system receives raw session history.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extraction &amp;amp; Filtering:&lt;/strong&gt; An LLM analyzes the conversation and extracts &lt;em&gt;only&lt;/em&gt; facts matching developer-defined &lt;strong&gt;Memory Topics&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual configuration (Vertex AI Memory Bank, Day 5)
&lt;/span&gt;&lt;span class="n"&gt;memory_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# "Prefers window seats"
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dietary_restrictions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# "Allergic to shellfish"
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# "Leading Q4 marketing campaign"
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Consolidation (The "Transform" Phase):&lt;/strong&gt; The LLM retrieves existing memories and decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CREATE:&lt;/strong&gt; Novel information → new memory entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UPDATE:&lt;/strong&gt; New info refines existing memory → merge (e.g., "Likes marketing" becomes "Leading Q4 marketing project").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE:&lt;/strong&gt; New info contradicts old → invalidate (e.g., Dietary restrictions change).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt; Consolidated memories persist to a vector database for semantic retrieval.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Note: While Day 3b uses &lt;code&gt;InMemoryMemoryService&lt;/code&gt; to teach the API, it stores raw events without consolidation. For production-grade consolidation, we look to the Vertex AI Memory Bank integration introduced in Day 5.&lt;/em&gt;&lt;/p&gt;
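&lt;p&gt;The CREATE/UPDATE/DELETE decision can be sketched with a naive rule-based stand-in for the LLM's judgment; a production system would let the model decide which branch applies:&lt;/p&gt;

```python
# Toy consolidation step: decide CREATE / UPDATE / DELETE for a topic.
def consolidate(existing, topic, new_fact):
    """Return a new memory dict after applying one consolidation decision."""
    memories = dict(existing)
    if new_fact is None:
        memories.pop(topic, None)     # DELETE: the fact was invalidated
    elif topic in memories:
        memories[topic] = new_fact    # UPDATE: refine the existing entry
    else:
        memories[topic] = new_fact    # CREATE: novel information
    return memories

m = consolidate({}, "project_context", "Likes marketing")                 # CREATE
m = consolidate(m, "project_context", "Leading Q4 marketing project")     # UPDATE
m = consolidate(m, "dietary_restrictions", None)                          # DELETE
print(m)  # → {'project_context': 'Leading Q4 marketing project'}
```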

&lt;p&gt;&lt;strong&gt;Retrieval Strategies: Proactive vs. Reactive&lt;/strong&gt;&lt;br&gt;
The course highlighted two distinct patterns for getting data &lt;em&gt;out&lt;/em&gt; of the Filing Cabinet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proactive (&lt;code&gt;preload_memory&lt;/code&gt;):&lt;/strong&gt; Injects relevant user facts into the system prompt &lt;em&gt;before&lt;/em&gt; the model generates a response. Best for high-frequency preferences (e.g., "User always prefers aisle seats").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive (&lt;code&gt;load_memory&lt;/code&gt;):&lt;/strong&gt; Gives the agent a tool to search the database. The agent decides &lt;em&gt;if&lt;/em&gt; it needs to look something up. Best for obscure facts to save tokens.&lt;/li&gt;
&lt;/ol&gt;
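&lt;p&gt;The contrast between the two patterns can be shown with a stub memory store. The &lt;code&gt;preload&lt;/code&gt; and &lt;code&gt;search_memory&lt;/code&gt; helpers below are hypothetical illustrations, not the ADK &lt;code&gt;preload_memory&lt;/code&gt; and &lt;code&gt;load_memory&lt;/code&gt; tools themselves:&lt;/p&gt;

```python
# Stub memory store standing in for a real vector-backed memory service.
MEMORY = {
    "seating": "User always prefers aisle seats.",
    "diet": "Allergic to shellfish.",
}

def preload(system_prompt):
    """Proactive: inject high-frequency facts before the model generates."""
    facts = " ".join(MEMORY.values())
    return f"{system_prompt}\nKnown user facts: {facts}"

def search_memory(query):
    """Reactive: a tool the agent calls only when it decides it must."""
    return [fact for fact in MEMORY.values() if query.lower() in fact.lower()]

print(preload("You are a travel assistant."))
print(search_memory("shellfish"))  # → ['Allergic to shellfish.']
```

Proactive injection pays tokens on every turn; the reactive tool pays only when the agent actually needs a lookup.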
&lt;h3&gt;
  
  
  4. Production Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous Consolidation:&lt;/strong&gt; Moving data from the Workbench to the Filing Cabinet is expensive. In production, this ETL process should happen &lt;strong&gt;asynchronously&lt;/strong&gt;. Do not make the user wait for the agent to "file its paperwork." Trigger the memory extraction logic in a background job after the session concludes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Search:&lt;/strong&gt; Keyword search is insufficient for the Filing Cabinet. Production memory requires vector embeddings. If a user asks for "romantic dining," the system must be able to retrieve a past note about "candlelight dinners," even if the word "romantic" wasn't used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Context Stuffing" Trade-off:&lt;/strong&gt; While &lt;code&gt;preload_memory&lt;/code&gt; reduces latency (no extra tool roundtrip), it increases input token costs on every turn. &lt;code&gt;load_memory&lt;/code&gt; is cheaper on average but adds latency when retrieval is needed.&lt;/li&gt;
&lt;/ul&gt;
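&lt;p&gt;The cost side of that trade-off is easy to put numbers on. The figures below are illustrative assumptions, not measurements, but they show why reactive retrieval wins when lookups are rare:&lt;/p&gt;

```python
# Back-of-the-envelope token costs for the two retrieval styles.
# All numbers are illustrative assumptions, not measurements.

MEMORY_TOKENS = 400     # facts injected by preload_memory each turn
TURNS = 50              # turns in the session
RETRIEVAL_RATE = 0.1    # fraction of turns that actually need a fact
TOOL_CALL_TOKENS = 120  # overhead of one load_memory tool roundtrip

# Proactive: memory is stuffed into the context on every turn.
preload_cost = MEMORY_TOKENS * TURNS

# Reactive: pay the tool roundtrip plus the retrieved facts,
# but only on the turns that need them.
reactive_cost = (TOOL_CALL_TOKENS + MEMORY_TOKENS) * TURNS * RETRIEVAL_RATE

print(f"preload_memory: {preload_cost} input tokens per session")
print(f"load_memory:    {int(reactive_cost)} input tokens per session")
```

&lt;p&gt;Push &lt;code&gt;RETRIEVAL_RATE&lt;/code&gt; toward 1.0 and the proactive pattern becomes the cheaper one, which is exactly why high-frequency preferences belong in &lt;code&gt;preload_memory&lt;/code&gt;.&lt;/p&gt;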
&lt;h3&gt;
  
  
  5. Ethical Design Note
&lt;/h3&gt;

&lt;p&gt;This architecture embodies &lt;strong&gt;Privacy by Design&lt;/strong&gt;. By distinguishing between the transient session and persistent memory, we can implement rigorous "forgetting" protocols.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rpwqy62zpbieauqys90.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rpwqy62zpbieauqys90.webp" alt="Pattern 2" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We scrub Personally Identifiable Information (PII) from the session log &lt;em&gt;before&lt;/em&gt; it undergoes consolidation into long-term memory, ensuring we act as fiduciaries of user data rather than creating an unmanageable surveillance log.&lt;/p&gt;
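&lt;p&gt;As a deliberately simplified illustration, a regex-based scrubber could run over each event before consolidation. A production system would use a dedicated PII service (e.g., Cloud DLP) rather than hand-rolled patterns, which miss many real-world formats:&lt;/p&gt;

```python
import re

# Deliberately simplified PII scrubber applied before consolidation.
# Real deployments should use a dedicated service (e.g., Cloud DLP);
# patterns like these miss many real-world PII formats.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text):
    """Replace recognized PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

event = "Reach me at jane.doe@example.com or 555-867-5309."
print(scrub(event))  # prints "Reach me at [EMAIL] or [PHONE]."
```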

&lt;p&gt;&lt;strong&gt;Concrete Failure Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a healthcare agent that remembers a patient mentioned their HIV status in Session 1. Without a dual-layer architecture, this fact sits in plain text in the session log forever, accessible to any system with database read permissions. If the system is breached, or if a support engineer needs to debug a session, the patient's private health information is exposed. Worse, without consolidation logic, the system doesn't know to &lt;em&gt;delete&lt;/em&gt; this information if the patient later says "I was misdiagnosed—I don't have HIV." The agent treats every utterance as equally permanent, creating a privacy nightmare where sensitive data proliferates uncontrollably across logs and backups.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production Principle:&lt;/strong&gt; Context is expensive, but privacy is priceless. Design memory systems that distinguish between what an agent needs now (hot session) and what it needs forever (consolidated memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt; Use &lt;code&gt;EventsCompactionConfig&lt;/code&gt; for session hygiene and implement a PII scrubber in your ETL pipeline before consolidation. Leverage Vertex AI Memory Bank for production-grade semantic memory with built-in privacy controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Anchor:&lt;/strong&gt; This pattern operationalizes &lt;strong&gt;Privacy by Design&lt;/strong&gt;—we build forgetfulness and data minimization into the architecture, treating user data as a liability to protect, not an asset to hoard.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;With robust evaluation validating our agent's reasoning and a dual-layer memory preserving context over time, we might assume our system is production-ready. But there's a hidden fragility: these capabilities are only as good as the tools and data sources the agent can access. When every integration is a bespoke API wrapper, scaling becomes a maintenance nightmare. This brings us to the third pattern: decoupling agents from their dependencies through standardized protocols.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pattern 3: Protocol-First Interoperability (MCP &amp;amp; A2A)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. The Production Problem
&lt;/h3&gt;

&lt;p&gt;We are facing an &lt;strong&gt;"N×M Integration Trap."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine building a Customer Support Agent. It needs to check &lt;strong&gt;GitHub&lt;/strong&gt; for bugs, message &lt;strong&gt;Slack&lt;/strong&gt; for alerts, and update &lt;strong&gt;Jira&lt;/strong&gt; tickets. Without a standard protocol, you write three custom API wrappers. When GitHub changes an endpoint, your agent breaks.&lt;/p&gt;

&lt;p&gt;Now, multiply this across an enterprise. You have 10 different agents needing access to 20 different data sources. You are suddenly maintaining 200 brittle integration points. Furthermore, these agents become &lt;strong&gt;isolated silos&lt;/strong&gt;—the Sales Agent has no way to dynamically discover or ask the Engineering Agent for help because they speak different "languages."&lt;/p&gt;
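&lt;p&gt;The arithmetic of the trap is worth making explicit: bespoke wiring scales multiplicatively, while a shared protocol scales additively:&lt;/p&gt;

```python
# Integration counts: bespoke wrappers vs. a shared protocol.

agents = 10
sources = 20

# Without a protocol, every agent wires directly to every source:
bespoke = agents * sources           # N x M = 200 brittle integrations

# With MCP, each agent implements one client and each source one server:
protocol_based = agents + sources    # N + M = 30 endpoints to maintain

print(bespoke, protocol_based)
```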
&lt;h3&gt;
  
  
  2. The Architectural Solution
&lt;/h3&gt;

&lt;p&gt;The solution is to invert the dependency. Instead of the agent knowing about the specific tool implementation, we adopt a &lt;strong&gt;Protocol-First Architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol (MCP):&lt;/strong&gt; For &lt;strong&gt;Tools and Data&lt;/strong&gt;. It decouples the agent (client) from the tool (server). The agent doesn't need to know &lt;em&gt;how&lt;/em&gt; to query a Postgres DB; it just needs to know the MCP interface to ask for data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent2Agent (A2A):&lt;/strong&gt; For &lt;strong&gt;Peers and Delegation&lt;/strong&gt;. It allows for high-level goal delegation. An agent doesn't execute a task; it hands off a goal to another agent via a standardized handshake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Discovery:&lt;/strong&gt; Instead of hardcoding tools, agents query an MCP Server or an Agent Card at runtime to discover capabilities dynamically.&lt;/li&gt;
&lt;/ul&gt;
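&lt;p&gt;Runtime discovery ultimately reduces to fetching and parsing the peer's Agent Card. The sketch below inlines a trimmed card instead of fetching &lt;code&gt;.well-known/agent-card.json&lt;/code&gt; over HTTP; the field names mirror the card format generated by &lt;code&gt;to_a2a&lt;/code&gt;:&lt;/p&gt;

```python
import json

# Runtime-discovery sketch: parse a peer's Agent Card and list its skills.
# A real client would fetch the card from the server's
# .well-known/agent-card.json endpoint; it is inlined here for brevity.

CARD_JSON = """
{
  "name": "hello_world_agent",
  "url": "http://localhost:8001/",
  "skills": [
    {"id": "hello_world_agent-roll_die", "name": "roll_die"},
    {"id": "hello_world_agent-check_prime", "name": "check_prime"}
  ]
}
"""

def discover_skills(card_text):
    """Return the skill names a peer advertises in its Agent Card."""
    card = json.loads(card_text)
    return [skill["name"] for skill in card.get("skills", [])]

# The calling agent now knows, at runtime, what it can delegate:
print(discover_skills(CARD_JSON))  # prints ['roll_die', 'check_prime']
```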

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p8gnabnvxbb7d91d3cy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p8gnabnvxbb7d91d3cy.webp" alt="Pattern 3" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Implementation Details: Code Examples from the ADK
&lt;/h3&gt;

&lt;p&gt;The ADK abstracts the heavy lifting of these protocols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connecting Data via MCP&lt;/strong&gt;&lt;br&gt;
The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; connects an agent to external tools and data sources without custom API clients. Instead of writing wrappers, we instantiate an &lt;code&gt;McpToolset&lt;/code&gt; around an MCP server configuration; the ADK handles the handshake, lists the available tools, and injects their schemas into the context window automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Connecting an agent to the "Everything" MCP server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;McpToolset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.mcp_tool.mcp_session_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StdioConnectionParams&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.mcp_tool.mcp_session_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.runners&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt; &lt;span class="c1"&gt;# Assuming Runner is defined elsewhere
&lt;/span&gt;
&lt;span class="c1"&gt;# 1. Define the connection to the MCP Server
# Here we use 'npx' to run a Node-based MCP server directly
&lt;/span&gt;&lt;span class="n"&gt;mcp_toolset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;McpToolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioConnectionParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;server_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-everything&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt; &lt;span class="c1"&gt;# Optional: specify a timeout for connection establishment
&lt;/span&gt;    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# Optionally filter to specific tools provided by the server
&lt;/span&gt;    &lt;span class="n"&gt;tool_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;getTinyImage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Add the MCP tools to your Agent
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You can generate tiny images using the tools provided.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# The toolset exposes the MCP capabilities as standard ADK tools
&lt;/span&gt;    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# tools expects a list of ToolUnion
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Run the agent
# The agent can now call 'getTinyImage' as if it were a local Python function
&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="c1"&gt;# Fill in Runner details to run
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delegating via A2A (Agent-to-Agent)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Agent2Agent (A2A)&lt;/strong&gt; protocol is used to enable collaboration between different autonomous agents, potentially running on different servers or frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Exposing an Agent (&lt;code&gt;to_a2a&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
To make a local ADK agent discoverable, we wrap it with the &lt;code&gt;to_a2a()&lt;/code&gt; utility. This converts it into an A2A-compliant server that publishes an &lt;strong&gt;Agent Card&lt;/strong&gt;, a standardized manifest hosted at &lt;code&gt;.well-known/agent-card.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.a2a.utils.agent_to_a2a&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_a2a&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.tool_context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="c1"&gt;# Define the tools
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a die and return the rolled result.

  Args:
    sides: The integer number of sides the die has.
    tool_context: the tool context
  Returns:
    An integer of the result of rolling the die.
  &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rolls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rolls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rolls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rolls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_prime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a given list of numbers are prime.

  Args:
    nums: The list of numbers to check.

  Returns:
    A str indicating which number is prime.
  &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
  &lt;span class="n"&gt;primes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;is_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;is_prime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_prime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;primes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No prime numbers found.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;primes&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;primes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; are prime numbers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define your local agent with relevant tools and instructions
# This example uses the 'hello_world' agent's logic for rolling dice and checking primes.
&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello_world_agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello world agent that can roll a die of 8 sides and check prime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; numbers.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
      You roll dice and answer questions about the outcome of the dice rolls.
      When you are asked to roll a die, you must call the roll_die tool with the number of sides.
      When checking prime numbers, call the check_prime tool with a list of integers.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;roll_die&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;check_prime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;generate_content_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;safety_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SafetySetting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HarmCategory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HARM_CATEGORY_DANGEROUS_CONTENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HarmBlockThreshold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Convert to A2A application
# This automatically generates the Agent Card and sets up the HTTP endpoints
&lt;/span&gt;&lt;span class="n"&gt;a2a_app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;to_a2a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8001&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# To run this application, save it as a Python file (e.g., `my_a2a_agent.py`)
# and execute it using uvicorn:
# uvicorn my_a2a_agent:a2a_app --host localhost --port 8001
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The Agent Card (Discovery):&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Agent Card&lt;/strong&gt; is a standardized JSON file that acts as a "business card" for an agent, allowing other agents to discover its capabilities, security requirements, and endpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello_world_agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello world agent that can roll a die of 8 sides and check prime numbers. You roll dice and answer questions about the outcome of the dice rolls. When you are asked to roll a die, you must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"doc_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8001/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"capabilities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello_world_agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello world agent that can roll a die of 8 sides and check prime numbers. I roll dice and answer questions about the outcome of the dice rolls. When I am asked to roll a die, I must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"examples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"input_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"output_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello_world_agent-roll_die"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"roll_die"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Roll a die and return the rolled result."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"examples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"input_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"output_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"tools"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hello_world_agent-check_prime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"check_prime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Check if a given list of numbers are prime."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"examples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"input_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"output_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"tools"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"default_input_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"text/plain"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"default_output_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"text/plain"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"supports_authenticated_extended_card"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"security_schemes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;B. Consuming a Remote Agent (&lt;code&gt;RemoteA2aAgent&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To consume this, the parent agent simply points to the URL. The ADK treats the remote agent exactly like a local sub-agent.&lt;/p&gt;

&lt;p&gt;This allows a local agent to delegate tasks to a remote agent by reading its Agent Card.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents.remote_a2a_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AGENT_CARD_WELL_KNOWN_PATH&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents.remote_a2a_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RemoteA2aAgent&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Define the remote agent interface
# Points to the .well-known/agent.json of the running A2A server
&lt;/span&gt;&lt;span class="n"&gt;prime_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RemoteA2aAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote_prime_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent that handles checking if numbers are prime.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_card&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8001/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AGENT_CARD_WELL_KNOWN_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Use the remote agent as a sub-agent
&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coordinator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Explicitly define the model
&lt;/span&gt;    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
      You are a coordinator agent.
      Your primary task is to delegate any requests related to prime number checking to the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remote_prime_agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
      Do not attempt to check prime numbers yourself.
      Ensure to pass the numbers to be checked to the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remote_prime_agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; correctly.
      Clarify the results from the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;remote_prime_agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to the user.
      &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prime_agent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# You can then use this root_agent with a Runner, for example:
# from google.adk.runners import Runner
# runner = Runner(agent=root_agent)
# async for event in runner.run_async(user_id="test_user", session_id="test_session", new_message="Is 13 a prime number?"):
#     print(event)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While both protocols connect AI systems, they operate at different levels of abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use which?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;MCP&lt;/strong&gt; when you need deterministic execution of specific functions (stateless).&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;A2A&lt;/strong&gt; when you need to offload a fuzzy goal that requires reasoning and state management (stateful).&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;A2A (Agent2Agent Protocol)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Domain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Tools &amp;amp; Resources&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Autonomous Agents&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;"Do this specific thing"&lt;/strong&gt;. Stateless execution of functions (e.g., "query database," "fetch file").&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;"Achieve this complex goal"&lt;/strong&gt;. Stateful, multi-turn collaboration where the remote agent plans and reasons.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Abstraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low-level plumbing.&lt;/strong&gt; Connects LLMs to data sources and APIs (like a USB-C port for AI).&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High-level collaboration.&lt;/strong&gt; Connects intelligent agents to other intelligent agents to delegate responsibility.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardizes &lt;strong&gt;tool definitions&lt;/strong&gt;, &lt;strong&gt;prompts&lt;/strong&gt;, and &lt;strong&gt;resource reading&lt;/strong&gt;.&lt;/td&gt;
&lt;td&gt;Standardizes &lt;strong&gt;agent discovery&lt;/strong&gt; (Agent Card), &lt;strong&gt;task lifecycles&lt;/strong&gt;, and &lt;strong&gt;asynchronous communication&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Using a specific wrench or diagnostic scanner.&lt;/td&gt;
&lt;td&gt;Asking a specialized mechanic to fix a car engine.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How they work together:&lt;/strong&gt;&lt;br&gt;
An application might use &lt;strong&gt;A2A&lt;/strong&gt; to orchestrate high-level collaboration between a "Manager Agent" and a "Coder Agent." &lt;/p&gt;

&lt;p&gt;The "Coder Agent," in turn, uses &lt;strong&gt;MCP&lt;/strong&gt; internally to connect to GitHub tools and a local file system to execute the work.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Production Considerations
&lt;/h3&gt;

&lt;p&gt;Moving protocols from &lt;code&gt;stdio&lt;/code&gt; (local process) to HTTP (production network) introduces critical security challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The "Confused Deputy" Problem:&lt;/strong&gt; Protocols decouple execution, but they also expose risks. A malicious user might trick a privileged agent (the deputy) into using an MCP file-system tool to read sensitive configs. Production architectures must enforce &lt;strong&gt;Least Privilege&lt;/strong&gt; by placing MCP servers behind API Gateways that enforce policy checks &lt;em&gt;before&lt;/em&gt; the tool is executed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery vs. Latency:&lt;/strong&gt; Dynamic discovery adds a round-trip latency cost at startup (handshaking). In production, we often cache tool definitions (static binding) for performance, while keeping the execution dynamic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; To prevent "Tool Sprawl" where agents connect to unverified servers, enterprises need a &lt;strong&gt;Centralized Registry&lt;/strong&gt;—an allowlist of approved MCP servers and Agent Cards that act as the single source of truth for capabilities.&lt;/li&gt;
&lt;/ul&gt;
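&lt;p&gt;The registry and caching tactics above can be sketched in a few lines of plain Python. &lt;code&gt;APPROVED_SERVERS&lt;/code&gt; and &lt;code&gt;fetch_tool_definitions&lt;/code&gt; are hypothetical names, and the discovery call is a stub, not a real MCP handshake:&lt;/p&gt;

```python
# Sketch of two production tactics: an allowlist registry gate plus
# startup-time caching of tool definitions (static binding).
# All names here are hypothetical; no real MCP client is involved.

APPROVED_SERVERS = {"mcp://crm.internal", "mcp://files.internal"}  # centralized registry

_definition_cache = {}  # discovery results, keyed by server URL

def fetch_tool_definitions(server_url):
    # Stand-in for the MCP discovery handshake (one network round trip).
    return [{"name": "query_db", "server": server_url}]

def get_tools(server_url):
    # Governance: refuse servers that are not in the approved registry.
    if server_url not in APPROVED_SERVERS:
        raise PermissionError(f"{server_url} is not in the approved registry")
    # Performance: pay the discovery latency once, then bind statically.
    if server_url not in _definition_cache:
        _definition_cache[server_url] = fetch_tool_definitions(server_url)
    return _definition_cache[server_url]

tools = get_tools("mcp://crm.internal")
```

&lt;p&gt;Execution of the tools themselves stays dynamic; only their definitions are cached.&lt;/p&gt;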
&lt;h3&gt;
  
  
  5. Ethical Design Note
&lt;/h3&gt;

&lt;p&gt;Protocol-first architectures are the technical foundation for &lt;strong&gt;Human Sovereignty&lt;/strong&gt; and &lt;strong&gt;Data Portability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Standardizing the interface (MCP) helps us prevent vendor lock-in, among many other advantages. A user can swap out a "Google Drive" data source for a "Local Hard Drive" source without breaking the agent, ensuring the user—not the platform—controls where the data lives and how it is accessed.&lt;/p&gt;

&lt;p&gt;This abstraction acts as a bulwark against &lt;strong&gt;algorithmic lock-in&lt;/strong&gt;, ensuring that an agent's reasoning capabilities are decoupled from proprietary tool implementations, preserving the user's freedom to migrate their digital ecosystem without losing their intelligent assistants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete Failure Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a small business builds a customer service agent tightly coupled to Salesforce's proprietary API. Over three years, the agent accumulates thousands of lines of custom integration code. When Salesforce raises prices 300%, the business wants to migrate to HubSpot—but their agent is fundamentally Salesforce-shaped. Every tool, every data query, every workflow assumption is hardcoded. Migration means rebuilding the agent from scratch, which the business can't afford. They're trapped. This is &lt;strong&gt;algorithmic lock-in&lt;/strong&gt;—not just vendor lock-in of data, but vendor lock-in of intelligence. Without protocol-first design, the agent becomes a hostage to the platform, and the user loses sovereignty over their own automation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production Principle:&lt;/strong&gt; Agents should depend on interfaces, not implementations. Protocol-first design (MCP for tools, A2A for peers) inverts the dependency and prevents the N×M integration trap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt; Use &lt;code&gt;McpToolset&lt;/code&gt; to connect agents to data sources via the Model Context Protocol. Use &lt;code&gt;RemoteA2aAgent&lt;/code&gt; and &lt;code&gt;to_a2a()&lt;/code&gt; for agent-to-agent delegation. Cache tool definitions at startup for performance, but keep execution dynamic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Anchor:&lt;/strong&gt; This pattern operationalizes &lt;strong&gt;Human Sovereignty&lt;/strong&gt; and &lt;strong&gt;Data Portability&lt;/strong&gt;—users control where their data lives and which tools their agents use, free from vendor lock-in or algorithmic hostage-taking.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;We now have agents that reason correctly, remember what matters, and connect to any tool or peer through standard protocols. But there's one final constraint that threatens to unravel everything: the assumption that every interaction completes in a single request-response cycle. Real business workflows don't work that way. Approvals take hours. External APIs time out. Humans need time to think. This is where our fourth pattern becomes essential: teaching agents to pause, persist, and resume across the boundaries of time itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pattern 4: Long-Running Operations &amp;amp; Resumability
&lt;/h2&gt;

&lt;p&gt;This is perhaps the most critical pattern for integrating agents into real-world business logic where human approval is non-negotiable.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Production Problem
&lt;/h3&gt;

&lt;p&gt;Naive agents fall into the &lt;strong&gt;"Stateless Trap."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a &lt;strong&gt;Procurement Agent&lt;/strong&gt; tasked with ordering 1,000 servers. &lt;/p&gt;

&lt;p&gt;The workflow is: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyze quotes&lt;/li&gt;
&lt;li&gt;Propose the best option&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wait for CFO approval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Place the order&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a Mermaid sequence diagram illustrating the procurement workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8ka0cekcbx0bmz2y88j.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8ka0cekcbx0bmz2y88j.webp" alt="Pattern 4_1" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram shows the sequential flow from analyzing quotes through to placing the order, with the critical approval step from the CFO in the middle.&lt;/p&gt;

&lt;p&gt;If the CFO takes 2 hours to review the proposal, a standard HTTP request will time out in seconds. When the CFO finally clicks "Approve," the agent has lost its memory. It doesn't know which vendor it selected, the quote ID, or why it made that recommendation. It essentially has to start over.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. The Architectural Solution
&lt;/h3&gt;

&lt;p&gt;The solution is a &lt;strong&gt;Pause, Persist, Resume&lt;/strong&gt; architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event-Driven Interruption:&lt;/strong&gt; The agent doesn't just "wait." It emits a specific system event (&lt;code&gt;adk_request_confirmation&lt;/code&gt;) and &lt;strong&gt;halts execution immediately&lt;/strong&gt;, releasing compute resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Persistence:&lt;/strong&gt; The agent's full state (conversation history, tool parameters, reasoning scratchpad) is serialized and stored in a database, keyed by an &lt;code&gt;invocation_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Anchor (&lt;code&gt;invocation_id&lt;/code&gt;):&lt;/strong&gt; This ID becomes the critical "bookmark." When the human acts, the system rehydrates the agent using this ID, allowing it to resume &lt;em&gt;exactly&lt;/em&gt; where it left off—inside the tool call—rather than restarting the conversation.&lt;/li&gt;
&lt;/ul&gt;
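&lt;p&gt;Conceptually, the cycle looks like this. The ADK implements persistence for you; the dict-backed store and helper names below are purely illustrative:&lt;/p&gt;

```python
# Minimal sketch of the pause/persist/resume cycle, using a plain dict as the
# "database". The ADK handles this automatically; this only shows the mechanism.

import json

state_store = {}  # serialized agent state, keyed by invocation_id

def pause(invocation_id, state):
    # Serialize the agent's full state and release compute.
    state_store[invocation_id] = json.dumps(state)

def resume(invocation_id, human_decision):
    # Rehydrate from the bookmark and continue inside the paused tool call.
    state = json.loads(state_store.pop(invocation_id))
    state["approved"] = human_decision
    return state

pause("inv-42", {"vendor": "ACME", "quote_id": "Q-901", "step": "awaiting_cfo"})
# ... hours pass; the process may even restart in between ...
resumed = resume("inv-42", human_decision=True)
```

&lt;p&gt;Because the state is serialized, nothing (vendor, quote ID, reasoning) is lost while the human deliberates.&lt;/p&gt;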

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnugzjbe27qnqyrpwyqcz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnugzjbe27qnqyrpwyqcz.webp" alt="Pattern 4_2" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Implementation Details: Code Insights
&lt;/h3&gt;

&lt;p&gt;The ADK provides the &lt;code&gt;ToolContext&lt;/code&gt; and &lt;code&gt;App&lt;/code&gt; primitives to handle this complexity without writing custom state machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Three-State Tool Pattern&lt;/strong&gt;&lt;br&gt;
Inside your tool definition, you must handle three scenarios: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Automatic approval (low stakes)&lt;/li&gt;
&lt;li&gt;Initial request (pause)&lt;/li&gt;
&lt;li&gt;Resumption (action)
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;place_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Scenario 1: Small orders auto-approve
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_units&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_units&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Scenario 2: First call - request approval (PAUSE)
&lt;/span&gt;    &lt;span class="c1"&gt;# The tool checks if confirmation exists. If not, it requests it and halts.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_confirmation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large order: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_units&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; units. Approve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_units&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Scenario 3: Resume - check decision (ACTION)
&lt;/span&gt;    &lt;span class="c1"&gt;# The tool runs again, but this time confirmation exists.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_confirmation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confirmed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_units&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rejected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Approval (Scenario 1):&lt;/strong&gt; The initial &lt;code&gt;if num_units &amp;lt;= 5:&lt;/code&gt; block handles immediate, non-long-running scenarios, which is a common pattern for tools that can quickly resolve simple requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial Request (Pause - Scenario 2):&lt;/strong&gt; The &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/tools/function_tool.py#L4" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;if not tool_context.tool_confirmation:&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; block leverages &lt;code&gt;tool_context.request_confirmation()&lt;/code&gt; to signal that the tool requires external input to proceed. The return of &lt;code&gt;{"status": "pending"}&lt;/code&gt; indicates that the operation is not yet complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumption (Action - Scenario 3):&lt;/strong&gt; The final &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/tools/function_tool.py#L56" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;if tool_context.tool_confirmation.confirmed:&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; block demonstrates how the tool re-executes, this time finding &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/tools/function_tool.py#L196" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;tool_context.tool_confirmation&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; present, indicating that the external input has been provided. The tool then acts based on the &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/tools/tool_confirmation.py#L41" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;confirmed&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; status. The &lt;a href="https://codewiki.google/github.com/google/adk-python#contribution-guidelines-and-samples-human-in-the-loop" rel="noopener noreferrer"&gt;Human-in-the-Loop Workflow Samples&lt;/a&gt; also highlight how the application constructs a &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/events/event.py#L108" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;types.FunctionResponse&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; with the updated status and sends it back to the agent to resume its task.&lt;/li&gt;
&lt;/ol&gt;
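&lt;p&gt;To see the three states in isolation, you can replay the tool against a hand-rolled stub of &lt;code&gt;ToolContext&lt;/code&gt;. &lt;code&gt;StubToolContext&lt;/code&gt; and &lt;code&gt;StubConfirmation&lt;/code&gt; are hypothetical test doubles, not ADK classes:&lt;/p&gt;

```python
# Hypothetical stubs that mimic the two ToolContext members place_order reads,
# so the three-state pattern can be exercised without a running agent.

class StubConfirmation:
    def __init__(self, confirmed):
        self.confirmed = confirmed

class StubToolContext:
    def __init__(self, tool_confirmation=None):
        self.tool_confirmation = tool_confirmation
        self.requested = None

    def request_confirmation(self, hint, payload):
        # Records the pause request instead of emitting an ADK event.
        self.requested = {"hint": hint, "payload": payload}

def place_order(num_units, tool_context):
    # Same logic as the tool above, condensed.
    if num_units <= 5:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}          # Scenario 1
    if not tool_context.tool_confirmation:
        tool_context.request_confirmation(
            hint=f"Large order: {num_units} units. Approve?",
            payload={"num_units": num_units},
        )
        return {"status": "pending"}                                           # Scenario 2
    if tool_context.tool_confirmation.confirmed:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}          # Scenario 3
    return {"status": "rejected"}

small = place_order(3, StubToolContext())                                      # auto-approves
first = place_order(1000, StubToolContext())                                   # pauses
approved = place_order(1000, StubToolContext(StubConfirmation(True)))          # resumes
```

&lt;p&gt;Each call is a fresh invocation of the same function; only the presence and value of the confirmation changes the path taken.&lt;/p&gt;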

&lt;p&gt;&lt;strong&gt;The Application Wrapper&lt;/strong&gt;&lt;br&gt;
To enable persistence, we wrap the agent in an &lt;code&gt;App&lt;/code&gt; with &lt;code&gt;ResumabilityConfig&lt;/code&gt;. This tells the ADK to automatically handle state serialization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.apps&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ResumabilityConfig&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;procurement_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resumability_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ResumabilityConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_resumable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Workflow Loop&lt;/strong&gt;&lt;br&gt;
The runner loop must detect the pause and, crucially, use the same &lt;code&gt;invocation_id&lt;/code&gt; to resume.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Initial Execution
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Detect Pause &amp;amp; Get ID
&lt;/span&gt;&lt;span class="n"&gt;approval_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_for_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;approval_info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... Wait for user input (hours/days) ...
&lt;/span&gt;    &lt;span class="n"&gt;user_decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user_decision&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# True/False
&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Resume with INTENT
&lt;/span&gt;    &lt;span class="c1"&gt;# We pass the original invocation_id to rehydrate state
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;invocation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;approval_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invocation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;create_approval_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_decision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent continues execution from inside place_order()
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow shows the mechanism for resuming an agent's execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Execution:&lt;/strong&gt; The first &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/AGENTS.md?plain=1#L45" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;runner.run_async()&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; call initiates the agent's interaction, which eventually leads to the &lt;code&gt;place_order&lt;/code&gt; tool returning a "pending" status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detecting Pause &amp;amp; Getting ID:&lt;/strong&gt; Detect the "pending" state and extract the &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/agents/readonly_context.py#L44" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;invocation_id&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt;. See the &lt;a href="https://codewiki.google/github.com/google/adk-python#agent-components-and-types-invocation-context-and-state-management" rel="noopener noreferrer"&gt;Invocation Context and State Management&lt;/a&gt; code wiki section for how &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/agents/invocation_context.py#L98" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;InvocationContext&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; tracks an agent's state and supports resumable operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resuming with Intent:&lt;/strong&gt; The crucial part is calling &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/AGENTS.md?plain=1#L45" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;runner.run_async()&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; again with the &lt;em&gt;same&lt;/em&gt; &lt;a href="https://github.com/google/adk-python/blob/2b6471550591ee7fc5f70f79e66a6e4080df442b/src/google/adk/agents/readonly_context.py#L44" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;invocation_id&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt;. This tells the ADK to rehydrate the session state and resume execution from where it left off, providing the new message (the approval decision) as input. This behavior is used in the &lt;a href="https://codewiki.google/github.com/google/adk-python#contribution-guidelines-and-samples-human-in-the-loop" rel="noopener noreferrer"&gt;Human-in-the-Loop Workflow Samples&lt;/a&gt;, where the runner orchestrates agent execution and handles multi-agent coordination.&lt;/li&gt;
&lt;/ul&gt;
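
&lt;p&gt;The &lt;code&gt;check_for_approval&lt;/code&gt; helper above is left abstract. A minimal sketch, assuming a simplified dictionary event shape (the real ADK event type differs), scans the collected events for the &lt;code&gt;adk_request_confirmation&lt;/code&gt; signal and pulls out the &lt;code&gt;invocation_id&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical helper: scan collected events for a pending approval
# request. The dict-based event shape is illustrative, not the ADK type.
def check_for_approval(events):
    """Return resume info if a confirmation request is pending, else None."""
    for event in events:
        for call in event.get("function_calls", []):
            if call.get("name") == "adk_request_confirmation":
                return {
                    "invocation_id": event["invocation_id"],
                    "tool": call.get("args", {}).get("tool"),
                }
    return None  # no pause detected; the run completed normally
```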

&lt;h3&gt;
  
  
  4. Production Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence Strategy:&lt;/strong&gt; &lt;code&gt;InMemorySessionService&lt;/code&gt; is insufficient for production resumability because a server restart kills pending approvals. You must use a persistent store like &lt;strong&gt;Redis&lt;/strong&gt; or &lt;strong&gt;PostgreSQL&lt;/strong&gt; to save the serialized agent state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Signaling:&lt;/strong&gt; The &lt;code&gt;adk_request_confirmation&lt;/code&gt; event should trigger a real-time notification (via WebSockets) to the user's frontend, rendering an "Approve/Reject" card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-To-Live (TTL):&lt;/strong&gt; Pending approvals shouldn't live forever. Implement a TTL policy (e.g., 24 hours) after which the state is garbage collected and the order is auto-rejected to prevent stale context rehydration.&lt;/li&gt;
&lt;/ul&gt;
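
&lt;p&gt;The TTL policy can be as simple as a periodic sweep over pending approvals. This is a sketch under assumed names (the record shape and the 24-hour window are illustrative, not ADK APIs); with Redis you would typically delegate expiry to native key TTLs instead:&lt;/p&gt;

```python
# Illustrative TTL sweep for pending approvals. Record fields and the
# 24-hour window are assumptions, not part of the ADK.
import time

APPROVAL_TTL_SECONDS = 24 * 60 * 60  # e.g. 24 hours

def sweep_pending(pending, now=None, ttl=APPROVAL_TTL_SECONDS):
    """Auto-reject approvals older than the TTL; return the survivors."""
    now = now if now is not None else time.time()
    fresh = {}
    for invocation_id, record in pending.items():
        age = now - record["created_at"]
        if age >= ttl:
            # Garbage-collect: mark auto-rejected so the order never
            # resumes against stale, rehydrated context.
            record["status"] = "auto_rejected"
        else:
            fresh[invocation_id] = record
    return fresh
```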

&lt;h3&gt;
  
  
  5. Ethical Design Note
&lt;/h3&gt;

&lt;p&gt;This pattern is the technical implementation of &lt;strong&gt;Meaningful Human Control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It ensures high-stakes actions (Agency) remain subservient to human authorization (Sovereignty), preventing "rogue actions" where an agent executes irreversible decisions (like spending budget) without explicit oversight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete Failure Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a financial trading agent receives a signal to liquidate a portfolio position. Without resumability, the agent operates in a stateless, atomic transaction: detect signal → execute trade. There's no pause for human review. If the signal is based on a data glitch (a "flash crash"), or if market conditions have changed in the seconds between signal and execution, the agent completes an irreversible $10M trade that wipes out a quarter's earnings. The human operator sees the confirmation &lt;em&gt;after&lt;/em&gt; the damage is done. Worse, if the system crashes mid-execution, the agent loses context and might try to execute the same trade twice, compounding the disaster. Without &lt;strong&gt;Meaningful Human Control&lt;/strong&gt; embedded in the architecture, the agent becomes a runaway train.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production Principle:&lt;/strong&gt; High-stakes actions require human-in-the-loop workflows. Design agents that can pause, wait for approval, and resume execution without losing context—spanning hours or days, not just seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation:&lt;/strong&gt; Use &lt;code&gt;ToolContext.request_confirmation()&lt;/code&gt; for tools that need approval. Configure &lt;code&gt;ResumabilityConfig&lt;/code&gt; in your &lt;code&gt;App&lt;/code&gt; to enable state persistence. Use the &lt;code&gt;invocation_id&lt;/code&gt; to resume execution from the exact point of interruption. Store state in Redis or PostgreSQL, never in-memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Anchor:&lt;/strong&gt; This pattern operationalizes &lt;strong&gt;Meaningful Human Control&lt;/strong&gt;—we architecturally prevent agents from executing irreversible, high-stakes actions without explicit human authorization, preserving human sovereignty over consequential decisions.&lt;/li&gt;
&lt;/ul&gt;
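
&lt;p&gt;Wiring the takeaways together, enabling resumability is a small configuration change. This sketch uses the &lt;code&gt;App&lt;/code&gt; and &lt;code&gt;ResumabilityConfig&lt;/code&gt; names referenced above; exact import paths and field names may differ across ADK versions:&lt;/p&gt;

```python
# Hedged sketch: enable resumable invocations on an ADK App.
# Import path and ResumabilityConfig fields may vary by ADK version.
from google.adk.apps import App, ResumabilityConfig

app = App(
    name="ordering_app",
    root_agent=root_agent,  # your existing agent with the place_order tool
    resumability_config=ResumabilityConfig(is_resumable=True),
)
```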




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Google &amp;amp; Kaggle Intensive was a masterclass not just in coding, but in &lt;strong&gt;thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Building agents is not just about chaining prompts; it is about designing resilient systems that can handle the messiness of the real world.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt; ensures we trust the process, not just the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-Layer Memory&lt;/strong&gt; solves the economic and context limits of LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol-First (MCP)&lt;/strong&gt; prevents integration spaghetti and silos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumability&lt;/strong&gt; allows agents to participate in human-speed workflows safely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to Start: A Prioritization Guide
&lt;/h3&gt;

&lt;p&gt;If you're moving your first agent from prototype to production, consider implementing these patterns in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with Pattern 1 (Evaluation).&lt;/strong&gt; Without trajectory validation, you're flying blind. Capture a handful of golden trajectories from your &lt;code&gt;adk web&lt;/code&gt; sessions, configure a &lt;code&gt;TrajectoryEvaluator&lt;/code&gt;, and establish your evaluation baseline before writing another line of agent code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Pattern 4 (Resumability) early&lt;/strong&gt; if your agent performs &lt;em&gt;any&lt;/em&gt; action that requires human approval or waits on external systems (payment processing, legal review, third-party APIs). The cost of refactoring a stateless agent into a resumable one later is enormous. Build with &lt;code&gt;invocation_id&lt;/code&gt; and &lt;code&gt;ToolContext.request_confirmation()&lt;/code&gt; from day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Pattern 2 (Dual-Layer Memory)&lt;/strong&gt; when your agent starts handling multi-turn conversations or personalization. If you see users repeating themselves across sessions ("I'm allergic to shellfish" → 3 months later → "I'm allergic to shellfish"), or if your context costs are climbing, it's time for the Workbench/Filing Cabinet split.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adopt Pattern 3 (Protocol-First Interoperability)&lt;/strong&gt; when you need to integrate your &lt;em&gt;second&lt;/em&gt; data source or agent. The first integration is always bespoke; the second is where you refactor to MCP/A2A or accept technical debt forever. Don't wait until you have ten brittle integrations to wish you'd used protocols.&lt;/li&gt;
&lt;/ol&gt;
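
&lt;p&gt;The Workbench/Filing Cabinet split in step 3 can be illustrated with a toy memory layer. All names here are illustrative; the ADK provides its own session and memory services:&lt;/p&gt;

```python
# Toy illustration of the dual-layer split: an ephemeral session
# "workbench" plus a durable "filing cabinet" for facts worth
# recalling across sessions. Names are illustrative only.
class DualLayerMemory:
    def __init__(self):
        self.session = []    # workbench: current conversation turns
        self.long_term = {}  # filing cabinet: durable user facts

    def add_turn(self, text):
        self.session.append(text)

    def remember(self, key, fact):
        # Promote a fact so it survives session resets.
        self.long_term[key] = fact

    def end_session(self):
        # The workbench is cleared; the filing cabinet persists.
        self.session = []

memory = DualLayerMemory()
memory.add_turn("I'm allergic to shellfish")
memory.remember("allergy", "shellfish")
memory.end_session()
```

The user never has to repeat the allergy: the next session starts with an empty workbench but the fact is still in the filing cabinet.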

&lt;h3&gt;
  
  
  The Architect's Responsibility
&lt;/h3&gt;

&lt;p&gt;As we move forward, our job as architects is to ensure these systems are not just smart, but &lt;strong&gt;reliable, efficient, and ethical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are not just building tools—we are defining the interface between human intention and machine action. Every architectural decision we make either preserves or erodes human sovereignty, privacy, and meaningful control.&lt;/p&gt;

&lt;p&gt;When you choose to validate trajectories, you're not just improving test coverage—you're building &lt;strong&gt;fiduciary responsibility&lt;/strong&gt; into the system.&lt;/p&gt;

&lt;p&gt;When you separate session from memory, you're not just optimizing token costs—you're designing for &lt;strong&gt;privacy by default&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you adopt MCP and A2A, you're not just reducing integration complexity—you're preserving &lt;strong&gt;user freedom&lt;/strong&gt; from algorithmic lock-in.&lt;/p&gt;

&lt;p&gt;When you implement resumability, you're not just handling timeouts—you're enforcing &lt;strong&gt;meaningful human control&lt;/strong&gt; over consequential actions.&lt;/p&gt;

&lt;p&gt;These patterns are not neutral technical choices. They are ethical choices encoded in architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's build responsibly.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googleaichallenge</category>
      <category>ai</category>
      <category>agents</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
