<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Printo Tom</title>
    <description>The latest articles on Forem by Printo Tom (@printo_tom).</description>
    <link>https://forem.com/printo_tom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3472218%2F8e04e452-db8b-46bb-bc94-b389abc2ae09.png</url>
      <title>Forem: Printo Tom</title>
      <link>https://forem.com/printo_tom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/printo_tom"/>
    <language>en</language>
    <item>
      <title>Launching Claude for Legal: A Toolkit for Modern Legal Workflows</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Thu, 14 May 2026 13:24:36 +0000</pubDate>
      <link>https://forem.com/printo_tom/launching-claude-for-legal-a-toolkit-for-modern-legal-workflows-4551</link>
      <guid>https://forem.com/printo_tom/launching-claude-for-legal-a-toolkit-for-modern-legal-workflows-4551</guid>
      <description>&lt;p&gt;&lt;strong&gt;Intro:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Legal teams today juggle everything from vendor agreements and privacy impact assessments to litigation prep and law school training. I wanted to create a repo that brings all of these workflows together in one place — practical, extensible, and open for the community. That’s how &lt;em&gt;Claude for Legal&lt;/em&gt; was born.&lt;/p&gt;




&lt;h3&gt;🔑 What’s Inside&lt;/h3&gt;

&lt;p&gt;This repo is a &lt;strong&gt;comprehensive collection of agents, skills, and connectors&lt;/strong&gt; designed for legal professionals, students, and researchers.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚖️ &lt;strong&gt;Practice‑area plugins&lt;/strong&gt;: In‑house commercial, corporate, employment, privacy, product, regulatory, AI governance, IP, litigation, clinics, and law school.
&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;Named agents&lt;/strong&gt;: Vendor Agreement Reviewer, DSAR Responder, Claim Chart Builder, Termination Reviewer, NDA Triager, and many more.
&lt;/li&gt;
&lt;li&gt;🔌 &lt;strong&gt;MCP connectors&lt;/strong&gt;: Integrations with Slack, Google Drive, DocuSign, iManage, Everlaw, CourtListener, and other legal‑specific systems.
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Managed agent cookbooks&lt;/strong&gt;: Renewal watcher, docket watcher, regulatory feed monitor, diligence grid, launch radar — ready for scheduled deployment.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;⚡ Benefits&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accelerates legal analysis&lt;/strong&gt; while keeping attorney review at the center.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured workflows&lt;/strong&gt; with guardrails for compliance, privilege, and risk management.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning tools&lt;/strong&gt; for students and clinics — IRAC graders, case briefers, bar prep coaches, and Socratic drills.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verified citations&lt;/strong&gt; through research connectors like CourtListener and Trellis.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;🚀 Getting Started&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Install as a Claude Cowork or Claude Code plugin.
&lt;/li&gt;
&lt;li&gt;Run the &lt;strong&gt;cold‑start interview&lt;/strong&gt; to tailor each plugin to your practice.
&lt;/li&gt;
&lt;li&gt;Connect a research tool for authoritative citations.
&lt;/li&gt;
&lt;li&gt;Explore the scheduled agents for automated monitoring and reporting.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;🌟 Why It Matters&lt;/h3&gt;

&lt;p&gt;The law is evolving fast — privacy, AI governance, regulatory feeds, and litigation workflows all demand agility. This repo helps legal teams and students &lt;strong&gt;move faster without cutting corners&lt;/strong&gt;, combining automation with professional responsibility.&lt;/p&gt;




&lt;p&gt;👉 &lt;strong&gt;Explore the repo here:&lt;/strong&gt; &lt;a href="https://github.com/printotomp/claude-legal-assistant-.git" rel="noopener noreferrer"&gt;https://github.com/printotomp/claude-legal-assistant-.git&lt;/a&gt;&lt;br&gt;&lt;br&gt;
I’d love for you to go through it, try the plugins, and share feedback. Contributions are welcome — let’s build the future of legal AI together!&lt;/p&gt;

</description>
      <category>legaltech</category>
      <category>ai</category>
      <category>opensource</category>
      <category>github</category>
    </item>
    <item>
      <title>The AI system that worked in staging destroyed us in production. Here's what we missed.</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Thu, 14 May 2026 05:30:00 +0000</pubDate>
      <link>https://forem.com/printo_tom/the-ai-system-that-worked-in-staging-destroyed-us-in-production-heres-what-we-missed-28p9</link>
      <guid>https://forem.com/printo_tom/the-ai-system-that-worked-in-staging-destroyed-us-in-production-heres-what-we-missed-28p9</guid>
      <description>&lt;p&gt;I've been a software and enterprise architect for over twelve years. I've shipped pricing platforms, fraud detection systems, and order management infrastructure at scale — most recently at one of the UK's largest retailers. I say that not to flex, but to explain why I'm writing this post with a specific kind of frustration.&lt;/p&gt;

&lt;p&gt;Because almost every article I read about AI in enterprise sounds like it was written by someone who has never been paged at 2am because an LLM-backed pricing rule repriced 40,000 product lines to zero.&lt;/p&gt;

&lt;p&gt;So here's what actually happens when you put AI into systems where the decisions have consequences.&lt;/p&gt;

&lt;h2&gt;The staging trap&lt;/h2&gt;

&lt;p&gt;Staging environments lie. They lie about load, they lie about data shape, and — critically for AI systems — they lie about context drift.&lt;/p&gt;

&lt;p&gt;Context drift is when the world changes between the moment you assembled the input to your model and the moment the model's output takes effect. In a pricing engine, that gap can be milliseconds. In those milliseconds: a competitor might have repriced, a promotional rule might have fired, a stock threshold might have been crossed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What this looks like in practice: your orchestrator assembles context — product cost, margin floor, competitor price, stock level — and sends it to the model. The model reasons and returns a recommended price. Validation passes. But by the time you write to the pricing store, the stock level has changed and the margin floor has been updated by a concurrent batch job. The model's recommendation was correct for a world that no longer exists.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fix isn't faster models. It's a &lt;strong&gt;snapshot contract&lt;/strong&gt;: a bounded, versioned, immutable view of state captured at orchestration time and passed all the way through to the action layer. Every downstream system confirms against the snapshot version before committing. If the snapshot is stale, you abort and re-orchestrate.&lt;/p&gt;

&lt;p&gt;This pattern is borrowed directly from event sourcing. Most AI architects I've met have never heard of it.&lt;/p&gt;
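A minimal sketch of the snapshot contract in Python. The field names and the `store` interface are illustrative assumptions, not part of any particular framework:

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: the snapshot is immutable once captured
class PricingSnapshot:
    product_cost: float
    margin_floor: float
    competitor_price: float
    stock_level: int
    snapshot_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class StaleSnapshotError(Exception):
    """Raised when state moved between orchestration and commit."""

def commit_price(snapshot: PricingSnapshot, recommended_price: float, store):
    # The action layer confirms the snapshot still matches live state
    # before committing; on any mismatch it aborts so the caller can
    # re-orchestrate with a fresh snapshot.
    live = store.current_state()
    if (live["margin_floor"] != snapshot.margin_floor
            or live["stock_level"] != snapshot.stock_level):
        raise StaleSnapshotError(snapshot.snapshot_id)
    store.write_price(recommended_price, snapshot_id=snapshot.snapshot_id)
```

The key property is that the snapshot is captured once, passed through unchanged, and re-verified at the moment of commit.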

&lt;h2&gt;Fraud signals don't behave like pricing signals — and that matters architecturally&lt;/h2&gt;

&lt;p&gt;One of the most useful things I've done is build &lt;em&gt;both&lt;/em&gt; a fraud detection system and a pricing platform, because the contrast forces architectural clarity.&lt;/p&gt;

&lt;p&gt;Fraud signals are high-frequency, low-latency, and the cost of a false negative is asymmetric — you can recover from a false positive (apologise to a good customer) but you can't unwind a fraudulent transaction. This pushes the architecture toward &lt;strong&gt;fail-closed defaults&lt;/strong&gt;: when confidence is low, decline and escalate.&lt;/p&gt;

&lt;p&gt;Pricing signals are lower frequency, higher context, and the cost structure is different — a bad price for 10 minutes on a low-velocity SKU costs less than a declined checkout. This pushes toward &lt;strong&gt;fail-open defaults&lt;/strong&gt; with aggressive post-hoc monitoring.&lt;/p&gt;

&lt;p&gt;The point is that "AI system" is not a single architecture. The trust posture of your validation layer, your fallback strategy, your human-in-the-loop gates — all of these should be derived from the asymmetry of your failure modes, not from a generic best-practice blog post (including this one).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before you design the system, map your failure modes. A false positive in fraud is not the same as a false positive in pricing. Your architecture should know the difference.&lt;/p&gt;
&lt;/blockquote&gt;
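One way to make that asymmetry explicit is to derive the low-confidence default from the domain. A hedged sketch, where the threshold and domain names are illustrative assumptions:

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    DECLINE = "decline"
    ESCALATE = "escalate"

def route(domain: str, confidence: float, threshold: float = 0.7) -> Action:
    """Derive the low-confidence default from the failure-mode asymmetry."""
    if confidence >= threshold:
        return Action.ALLOW
    if domain == "fraud":
        # A missed fraudulent transaction is unrecoverable:
        # fail closed and send to a human queue.
        return Action.ESCALATE
    # Pricing: a briefly wrong price is cheaper than a blocked action,
    # so fail open and rely on post-hoc monitoring.
    return Action.ALLOW
```

The point is not the specific numbers; it is that the default path for uncertainty is a deliberate, per-domain decision rather than an accident of the framework.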

&lt;h2&gt;The prompt is a contract. Treat it like one.&lt;/h2&gt;

&lt;p&gt;Your codebase versions your APIs. It versions your database schemas. It does not version your prompts — and that is a production incident waiting to happen.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. A well-intentioned tweak to the system prompt of a fraud classification model changed the output structure enough to break the downstream parser. Silently. For six hours. Because the validation layer was checking for the presence of a field, not its semantic content.&lt;/p&gt;

&lt;p&gt;Prompt versioning isn't complicated. It's a git-tracked file, a version identifier injected into every API call, and a log entry that records which version produced which output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fraud-classifier-v2.4.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input_snapshot_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"snap_01JV..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action_taken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flag_for_review"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every LLM-influenced decision that touches production state should produce a record like this. Not for debugging — for auditability. In retail, in finance, in any regulated domain, the question "why did the system do that?" will be asked by someone whose salary is higher than yours. You want a clean answer.&lt;/p&gt;

&lt;h2&gt;The layer nobody builds until they need it&lt;/h2&gt;

&lt;p&gt;Teams build the orchestration layer. They build the reasoning layer (the model call). They often skip the trust and validation layer, tell themselves they'll add it later, and then spend six months retrofitting it after their first production incident.&lt;/p&gt;

&lt;p&gt;The trust layer is not a safety net. It's load-bearing infrastructure. It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema enforcement&lt;/strong&gt; — structured output validation before anything downstream sees the result. Not "does the JSON parse" but "does this output satisfy the business constraints it was supposed to satisfy."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence routing&lt;/strong&gt; — when the model signals uncertainty, the output should not go to production. Route to a fallback rule, a human queue, or a conservative default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic drift detection&lt;/strong&gt; — over time, the distribution of what your model produces drifts. Not because the model changed, but because the world feeding it changed. Monitor output distributions the same way you'd monitor latency percentiles.&lt;/li&gt;
&lt;/ul&gt;
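As a concrete illustration of the first point, a schema check that encodes business constraints rather than JSON shape might look like this. The constraint names and limits are illustrative assumptions, not a real API:

```python
def validate_price_output(out: dict, margin_floor: float,
                          last_price: float, max_move: float = 0.15) -> list:
    """Business-constraint validation, not just 'does the JSON parse'.

    Returns a list of violations; an empty list means the output may proceed.
    """
    errors = []
    price = out.get("recommended_price")
    if not isinstance(price, (int, float)):
        errors.append("recommended_price missing or non-numeric")
        return errors
    if margin_floor > price:
        errors.append("price is below the margin floor")
    if abs(price - last_price) > max_move * last_price:
        errors.append("price moved more than the allowed per-step change")
    return errors
```

Anything with a non-empty violation list gets routed to the fallback path instead of production state.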

&lt;h2&gt;What I'd tell myself three years ago&lt;/h2&gt;

&lt;p&gt;The model is not the system. The model is one component inside a system that has to earn the right to touch production state. It earns that right through versioned contracts, explicit validation, bounded context, and audit trails.&lt;/p&gt;

&lt;p&gt;Every shortcut you take on those four things will come back as a production incident. I know because I've taken most of them.&lt;/p&gt;

&lt;p&gt;Build boring AI systems. Your on-call rotation will thank you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've been through something similar — or disagree with any of this — I'd genuinely like to hear it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>lessonslearned</category>
    </item>
    <item>
      <title>Choosing the Right Gemma 4 Model: A Practical Guide</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Mon, 11 May 2026 06:00:00 +0000</pubDate>
      <link>https://forem.com/printo_tom/choosing-the-right-gemma-4-model-a-practical-guide-325p</link>
      <guid>https://forem.com/printo_tom/choosing-the-right-gemma-4-model-a-practical-guide-325p</guid>
      <description>&lt;p&gt;Gemma 4 isn’t just one model — it’s three distinct flavors, and picking the right one can make or break your project. With Google’s latest open model family, developers now have access to native multimodal capabilities, advanced reasoning, and a massive 128K context window. But the real power lies in choosing the right variant for your use case.&lt;/p&gt;

&lt;h3&gt;🧩 The Three Flavors of Gemma 4&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Small (2B / 4B):&lt;/strong&gt;&lt;br&gt;
Built for ultra‑mobile, edge, and browser deployment. Perfect for IoT projects, mobile apps, or even running on a Raspberry Pi. If you want AI that lives close to the user, this is your pick.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dense (31B):&lt;/strong&gt;&lt;br&gt;
A powerhouse that bridges server‑grade performance with local execution. Ideal for enterprise prototypes, chatbots, or applications that need strong reasoning without relying on cloud‑only solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mixture‑of‑Experts (26B MoE):&lt;/strong&gt;&lt;br&gt;
Highly efficient and designed for advanced reasoning at scale. Best suited for research, high‑throughput tasks, or scenarios where efficiency matters as much as raw capability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;⚙️ Practical Scenarios&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Smart Home IoT Assistant → Small Model&lt;/strong&gt;&lt;br&gt;
Runs locally, respects privacy, and handles multimodal inputs like voice + sensor data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enterprise Knowledge Bot → Dense Model&lt;/strong&gt;&lt;br&gt;
Balances performance with practicality, enabling long‑context reasoning for business workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Research Reasoning Engine → MoE Model&lt;/strong&gt;&lt;br&gt;
Efficiently processes complex queries, making it ideal for labs or academic projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;💡 Key Insight&lt;/h3&gt;

&lt;p&gt;Choosing a model isn’t about “bigger is better.” It’s about &lt;strong&gt;fit for purpose&lt;/strong&gt;. A Raspberry Pi project thrives on the Small model, while a multimodal research tool demands the MoE. Intentional selection shows you understand both the technology and the problem you’re solving.&lt;/p&gt;

&lt;h3&gt;📣 Final Thoughts&lt;/h3&gt;

&lt;p&gt;Gemma 4 opens the door to local AI that’s powerful, flexible, and accessible. The real challenge — and opportunity — is matching the right model to the right context. Experiment, build, and share your journey with the community. That’s how we’ll unlock the full potential of open AI.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Rise of AI-Native Cloud Architectures: Beyond “Lift and Shift”</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Fri, 08 May 2026 13:29:43 +0000</pubDate>
      <link>https://forem.com/printo_tom/the-rise-of-ai-native-cloud-architectures-beyond-lift-and-shift-1a7d</link>
      <guid>https://forem.com/printo_tom/the-rise-of-ai-native-cloud-architectures-beyond-lift-and-shift-1a7d</guid>
      <description>&lt;p&gt;Cloud isn’t just infrastructure anymore — it’s becoming intelligent.&lt;br&gt;
We’ve moved past the era of simply migrating workloads (“lift and shift”) into the cloud. The new frontier is AI-native cloud architectures — systems designed from the ground up to leverage artificial intelligence as a core capability, not an add-on.&lt;/p&gt;

&lt;h3&gt;🌐 What’s Changing?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI at the foundation:&lt;/strong&gt; Instead of bolting on ML models, platforms like Azure OpenAI and Amazon Bedrock are embedding intelligence into orchestration, monitoring, and scaling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-optimizing workloads:&lt;/strong&gt; Imagine microservices that auto-tune themselves based on traffic patterns, cost constraints, and user behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer experience redefined:&lt;/strong&gt; Prompt engineering, AI-assisted coding, and intelligent CI/CD pipelines are becoming standard practice.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🔑 Why It Matters&lt;/h3&gt;

&lt;p&gt;Traditional cloud patterns (containers, serverless, event-driven) solved scalability.&lt;br&gt;
AI-native patterns solve adaptability — systems that learn, predict, and evolve without constant human intervention. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower operational overhead&lt;/li&gt;
&lt;li&gt;Smarter resource allocation&lt;/li&gt;
&lt;li&gt;Faster innovation cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🛠 Example in Action&lt;/h3&gt;

&lt;p&gt;A retail platform running on Azure can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict demand spikes using AI models trained on historical sales + external signals (weather, holidays).&lt;/li&gt;
&lt;li&gt;Auto-scale microservices before the spike hits.&lt;/li&gt;
&lt;li&gt;Adjust pricing dynamically, feeding back into the system for continuous learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t theory — it’s happening in production today.&lt;/p&gt;
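The pre-spike scaling step is simple to sketch. In this tiny Python example, the throughput figures and headroom factor are invented for illustration, not drawn from any real platform:

```python
import math

def desired_replicas(predicted_rps: float, rps_per_replica: float,
                     headroom: float = 1.2, min_replicas: int = 2) -> int:
    """Scale ahead of a predicted spike rather than reacting to it.

    predicted_rps comes from the demand model; headroom leaves slack
    in case the prediction undershoots.
    """
    needed = math.ceil(predicted_rps * headroom / rps_per_replica)
    return max(min_replicas, needed)

# A predicted spike of 1000 req/s, with each replica handling ~100 req/s:
replicas = desired_replicas(1000, 100)  # 12 replicas, provisioned early
```

The intelligence lives in the prediction; the scaling rule itself stays boring and auditable.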

&lt;h3&gt;💡 Takeaway&lt;/h3&gt;

&lt;p&gt;The future of cloud isn’t just about where you run workloads.&lt;br&gt;
It’s about how intelligently those workloads run themselves.&lt;br&gt;
If you’re designing systems in 2026, ask yourself: Am I building for scale, or am I building for intelligence?&lt;/p&gt;

&lt;p&gt;👉 What do you think — are AI-native architectures the next Kubernetes moment?&lt;/p&gt;

</description>
      <category>cloudcomputing</category>
      <category>digitaltransformation</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>How We Caught Fraud Before the Payment Cleared</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Thu, 30 Apr 2026 06:39:24 +0000</pubDate>
      <link>https://forem.com/printo_tom/how-we-caught-fraud-before-the-payment-cleared-4jcl</link>
      <guid>https://forem.com/printo_tom/how-we-caught-fraud-before-the-payment-cleared-4jcl</guid>
      <description>&lt;p&gt;I want to tell you about the day our fraud system started losing. Not to fraudsters. To the clock.&lt;/p&gt;

&lt;p&gt;We had built what looked, on paper, like a solid fraud detection pipeline. An in-house ML model trained on months of transaction data, a feature store we were proud of, a scoring service sitting neatly between checkout and payment authorisation. Fraud catch rates were good. The data science team was happy.&lt;/p&gt;

&lt;p&gt;Then we scaled past a million transactions a day, and everything started to crack.&lt;/p&gt;

&lt;p&gt;Not in an obvious, alarms-blaring kind of way. In the quiet, insidious way where your p99 latency creeps up, your SLA breaches start appearing in dashboards nobody checks, and one morning you realise your fraud score is arriving after the payment gateway has already made a decision without it.&lt;/p&gt;

&lt;p&gt;This is the story of how we fixed that. And more importantly, what we learned about building real-time fraud systems that actually work at scale — not in a notebook, not in a staging environment, but in production, on millions of real transactions, with real money on the line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Fraud scoring must complete inside the payment authorisation window (~80–120ms available in practice)
• Sequential feature store lookups were our biggest latency killer — parallelising them cut retrieval time by ~60%
• Velocity counters must be pre-aggregated via a stream processor, not computed at request time
• Model inference should run in an isolated sidecar to avoid resource contention
• Always design your fallback scoring path before you need it in an incident
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The constraint nobody puts in the architecture diagram&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is something that doesn’t get written about enough: fraud detection at checkout isn’t a background job. It isn’t something you can do asynchronously and reconcile later. The fraud decision has to fit inside the payment authorisation window — typically somewhere between 150ms and 300ms end-to-end, depending on your payment provider and your checkout UX tolerance.&lt;/p&gt;

&lt;p&gt;That window includes network round-trips, payment gateway processing, and your own service calls. By the time your fraud scoring service gets invoked, you might have 80 to 120ms left. Sometimes less.&lt;/p&gt;

&lt;p&gt;Miss that window and you have two bad options: block the transaction because you couldn’t score it in time (false positive, frustrated customer, lost revenue), or let it through unscored (potential fraud, chargeback, loss anyway). Neither is acceptable at scale.&lt;/p&gt;

&lt;p&gt;This latency constraint is the architectural forcing function that shapes everything else. It’s not a performance nice-to-have. It’s a hard boundary the entire system has to be designed around from day one.&lt;/p&gt;

&lt;p&gt;We didn’t fully internalise this early enough. That was our first mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the pipeline actually looked like&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before I get into what went wrong, let me walk through how the system was structured — because the architecture itself was sound, and understanding it matters.&lt;/p&gt;

&lt;p&gt;When a customer hits checkout, a request comes into our fraud scoring service. That service has one job: take a transaction, produce a risk score between 0 and 1, and do it fast enough to matter.&lt;/p&gt;

&lt;p&gt;To produce that score, it needs features — signals about the transaction that the model has learned to interpret. These fall into three categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transaction signals&lt;/strong&gt; — the basics. Order value, payment method, billing and delivery address match, whether it’s a new card, whether the delivery address has been seen before.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Velocity signals&lt;/strong&gt; — behavioural patterns over time. How many orders has this account placed in the last hour? Last 24 hours? How many declined payments? Has this device been seen across multiple accounts?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session and device signals&lt;/strong&gt; — fingerprinting, typing cadence, how long the customer spent on each page, whether the checkout flow looked human.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these features could be computed in real time. Most couldn’t — not within our latency budget. Velocity counts, account history, device reputation — these needed to be pre-computed and cached in a feature store, ready to be looked up in single-digit milliseconds.&lt;/p&gt;

&lt;p&gt;The model itself was a gradient boosted tree — not the most glamorous choice, but interpretable, fast at inference, and well understood by the team. We’d evaluated neural approaches and they weren’t worth the inference overhead for our use case.&lt;/p&gt;

&lt;p&gt;The output fed into a decision layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Checkout Request
      │
      ▼
Fraud Scoring Service
      │
      ├──► Feature Store ──► account features
      │         │         ──► device features
      │         │         ──► velocity counters
      │         ▼
      ├──► ML Model (gradient boosted tree)
      │         │
      │         ▼  score: 0.0 ──────────── 1.0
      └──► Decision Layer
                │
      ┌─────────┼──────────┐
      ▼         ▼          ▼
   BLOCK    CHALLENGE    ALLOW
  (&amp;gt;0.85)  (0.4–0.85)  (&amp;lt;0.40)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Clean. Logical. And for a while, fine.&lt;/p&gt;
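The decision layer is small enough to sketch directly. A minimal Python version of the threshold logic in the diagram (the thresholds are the ones shown; everything else is illustrative):

```python
def decide(score: float) -> str:
    """Map a fraud risk score in the range 0.0 to 1.0 to a checkout decision.

    Thresholds: BLOCK above 0.85, CHALLENGE from 0.40 to 0.85,
    ALLOW below 0.40.
    """
    if score > 0.85:
        return "BLOCK"
    if score >= 0.40:
        return "CHALLENGE"
    return "ALLOW"
```

In the real system the thresholds were tunable per segment, but the shape of the logic was exactly this simple.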

&lt;p&gt;&lt;strong&gt;Where it started breaking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Growth is a pressure test. When we were processing hundreds of thousands of transactions a day, our p95 latency was comfortable. The feature store lookups were fast, the model inference was fast, the whole thing felt snappy.&lt;/p&gt;

&lt;p&gt;Then volumes grew. And the cracks appeared in a specific order that, in retrospect, told a clear story.&lt;/p&gt;

&lt;p&gt;First, our feature store reads started slowing down. Not catastrophically — a few milliseconds here and there. But we were doing multiple lookups per transaction (account-level features, device-level features, velocity counters), and those milliseconds added up. What had been 15ms of feature retrieval became 30ms, then 40ms under peak load.&lt;/p&gt;

&lt;p&gt;Second, we discovered that our velocity counters — the “how many orders in the last hour” signals — were being computed differently depending on which service instance handled the request. We had a consistency problem. Different replicas were seeing slightly different views of recent transaction history. The model was being fed subtly wrong features, and we had no idea.&lt;/p&gt;

&lt;p&gt;Third, and this was the one that hurt, our model inference time started varying in ways we hadn’t anticipated. A gradient boosted tree should be deterministic and fast — and it was, on average. But at p99, something was happening. JVM garbage collection pauses, resource contention under load, the occasional cold path through the tree for unusual feature combinations.&lt;/p&gt;

&lt;p&gt;Our average inference time was 12ms. Our p99 was 47ms.&lt;/p&gt;

&lt;p&gt;Add those three problems together and you have a system that works most of the time but fails exactly when you need it most — under peak load, during flash sales, at Christmas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The latency audit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first thing we did was stop guessing and start measuring. Properly measuring.&lt;/p&gt;

&lt;p&gt;We instrumented every stage of the pipeline with percentile-level metrics — not just averages, because averages lie in systems like this. We wanted to know what was happening at p95, p99, and p99.9. We added trace IDs to every request so we could follow a single transaction through every hop and see exactly where time was being spent.&lt;/p&gt;

&lt;p&gt;What we found surprised us. The model inference itself was not the primary problem. The real culprit was the feature store lookups — specifically the fact that we were doing them sequentially.&lt;/p&gt;
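Percentile instrumentation does not need heavy tooling to start with. A nearest-rank sketch in Python, with illustrative numbers shaped like ours, shows how a healthy-looking average can hide an ugly tail:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies = [12.0] * 98 + [47.0, 52.0]   # mostly fast, with a slow tail
mean = sum(latencies) / len(latencies)   # 12.75 ms: looks healthy
p99 = percentile(latencies, 99)          # 47.0 ms: the real story
```

Production systems should use a streaming estimator (histograms or digests) rather than sorting raw samples, but the lesson is the same: track the tail, not the middle.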

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ BEFORE — Sequential lookups
─────────────────────────────────────────────────────
[account lookup 15ms] → [device lookup 15ms] → [velocity lookup 15ms]
= 45ms minimum, 90–120ms under load
─────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ AFTER — Parallel lookups
─────────────────────────────────────────────────────
[account lookup]  ─┐
[device lookup]   ─┼─► all fire simultaneously ─► await all → ~15ms
[velocity lookup] ─┘
─────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With async feature fetching, our retrieval time dropped from the sum of three lookups to the slowest of three lookups — roughly a 60% reduction in that part of the pipeline.&lt;/p&gt;

&lt;p&gt;That single change brought us back under SLA. But we didn’t stop there, because we knew it was fragile.&lt;/p&gt;
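The fan-out itself is a few lines with asyncio. The `store` methods here are assumed to be async lookups; the names are illustrative, not our actual client:

```python
import asyncio

async def fetch_features(txn_id, store):
    """Fire all three lookups at once; total cost is the slowest lookup,
    not the sum of all three."""
    account, device, velocity = await asyncio.gather(
        store.account_features(txn_id),
        store.device_features(txn_id),
        store.velocity_counters(txn_id),
    )
    # Merge into a single feature dict for the model.
    return {**account, **device, **velocity}
```

One caveat worth noting: fan-out multiplies concurrent load on the feature store, so connection pooling and backpressure need to be sized for it.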

&lt;p&gt;&lt;strong&gt;What we actually rebuilt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we understood the problem clearly, we made three structural changes that transformed the system’s behaviour under load.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pre-aggregated velocity counters.&lt;/strong&gt; Rather than computing velocity signals at request time — even with a cache — we moved to a streaming aggregation model. A separate process consumed our transaction event stream and maintained rolling counters in Redis with atomic increments.&lt;/p&gt;
&lt;p&gt;By the time a fraud scoring request arrived, the velocity features were already there, updated in real time by the stream processor. The lookup became a simple Redis GET — sub-millisecond, every time.&lt;/p&gt;
&lt;p&gt;This also solved the consistency problem. All scoring service instances were now reading from the same source of truth. The subtle feature drift disappeared.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model serving as a dedicated sidecar.&lt;/strong&gt; We moved model inference out of the main scoring service and into a dedicated sidecar process — a lightweight model server running alongside each instance. This gave us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Isolated resource allocation for inference&lt;/li&gt;
&lt;li&gt;Ability to pre-warm the model at startup (no cold path variance)&lt;/li&gt;
&lt;li&gt;Clean separation of concerns — model updates no longer required a full service redeploy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The p99 inference variance dropped significantly. Not because the model changed, but because it was no longer competing for resources with feature retrieval, logging, and everything else the scoring service was doing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A hard timeout with a safe fallback.&lt;/strong&gt; This was the most important thing we did, and it’s the one I see teams most often skip.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We set a hard internal timeout on the fraud scoring pipeline — tighter than the external SLA. If the pipeline didn’t complete within that budget, we applied a pre-computed risk tier based on the customer’s account history instead.&lt;/p&gt;

&lt;p&gt;This meant that even in the worst case — a Redis hiccup, a model serving blip, a network spike — the system degraded gracefully. Transactions still got a risk decision. It just came from a different place.&lt;br&gt;
The fallback was never as accurate as the full model. But it was infinitely better than no decision at all.&lt;/p&gt;
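A sketch of that budget-and-fallback pattern in Python. The budget, tier names, and risk values are illustrative assumptions:

```python
import asyncio

# Pre-computed risk tiers by account history (values are illustrative).
FALLBACK_RISK = {"long_standing": 0.1, "new_account": 0.6}

async def score_with_budget(txn, pipeline, budget_s=0.08):
    """Run the full pipeline inside a hard internal budget; if it cannot
    finish in time, fall back to a pre-computed risk tier instead of
    leaving the transaction unscored."""
    try:
        return await asyncio.wait_for(pipeline.score(txn), timeout=budget_s)
    except asyncio.TimeoutError:
        # Degrade gracefully: less accurate, but always a decision.
        return FALLBACK_RISK.get(txn.get("account_tier"), 0.5)
```

The internal budget is deliberately tighter than the external SLA, so the fallback fires before the payment gateway gives up on us.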

&lt;p&gt;&lt;strong&gt;Five things I’d tell any team building this&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the latency budget, not the model.
Before you write a line of feature engineering code, map out every hop between checkout click and payment authorisation. Know exactly how much time your fraud system has. Build backwards from that constraint.&lt;/li&gt;
&lt;li&gt;Parallelise your feature fetching from day one.
Sequential lookups are a trap. They feel fine at low volume and will absolutely bite you at scale. Async fan-out for feature retrieval is not premature optimisation — it’s table stakes.&lt;/li&gt;
&lt;li&gt;Measure percentiles, not averages.
Your p99 latency is what determines whether you breach SLA under peak load. An average that looks healthy can hide a p99 that’s a disaster. Instrument accordingly.&lt;/li&gt;
&lt;li&gt;Build your fallback before you need it.
Fallback logic written under pressure during an incident is fallback logic that will have bugs. Design your degraded-mode behaviour deliberately, test it, and know exactly what triggers it.&lt;/li&gt;
&lt;li&gt;Feature consistency is a correctness problem, not just a performance problem.
If different instances of your scoring service are seeing different views of the same data, your model is being fed lies. This is hard to detect because everything looks like it’s working — the scores just aren’t quite right. Treat feature consistency as a first-class concern.&lt;/li&gt;
&lt;/ol&gt;
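&lt;p&gt;Point 2 above is worth a sketch. With async fan-out, total feature-retrieval time is roughly the slowest single lookup rather than the sum of all of them (the simulated 5 ms lookups stand in for Redis or feature-store calls):&lt;/p&gt;

```python
import asyncio
import time

async def fetch_feature(name: str) -> tuple:
    # Each call stands in for one Redis / feature-store lookup (~5 ms).
    await asyncio.sleep(0.005)
    return name, 1.0

async def fetch_all(names: list) -> dict:
    # Fan out every lookup concurrently instead of awaiting them in turn.
    results = await asyncio.gather(*(fetch_feature(n) for n in names))
    return dict(results)

names = [f"feature_{i}" for i in range(20)]
t0 = time.perf_counter()
features = asyncio.run(fetch_all(names))
elapsed = time.perf_counter() - t0
# Sequential lookups would take ~20 x 5 ms = 100 ms; fan-out stays close
# to a single lookup's latency.
```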

&lt;p&gt;&lt;strong&gt;What I’d do differently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I were starting this from scratch, I’d separate the feature store from the scoring service earlier and more aggressively. We treated the feature store as an internal implementation detail for longer than we should have. Making it a proper platform — with clear ownership, explicit SLAs, and a defined contract for the scoring service to consume — would have saved us several painful debugging sessions.&lt;br&gt;
I’d also invest earlier in shadow scoring: running a new model version in parallel against live traffic without acting on its decisions. We did eventually build this, and it transformed our ability to validate model changes safely. But we built it reactively, after a bad model deployment caused a spike in false positives. It should have been part of the platform from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fraud detection at scale is one of those problems that looks straightforward until you’re in it. The ML part — the model, the features, the training pipeline — that’s what gets written about. The infrastructure problem, the latency problem, the question of how you make a confident decision about a transaction in less time than it takes to blink — that’s the part that actually determines whether your system works in production.&lt;br&gt;
Get the architecture right and the model has a chance to do its job. Get it wrong and it doesn’t matter how good your model is.&lt;br&gt;
The clock doesn’t care about your accuracy metrics.&lt;/p&gt;

&lt;p&gt;If you’re building something similar or have dealt with these problems differently, I’d genuinely love to hear about it in the comments. There’s no single right answer here — just tradeoffs, and the specific constraints of your system.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>backend</category>
      <category>security</category>
    </item>
    <item>
      <title>How We Built a Real-Time Credit Card Fraud Detection System: An Architect's Perspective</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Mon, 27 Apr 2026 07:17:45 +0000</pubDate>
      <link>https://forem.com/printo_tom/how-we-built-a-real-time-credit-card-fraud-detection-system-an-architects-perspective-4ib7</link>
      <guid>https://forem.com/printo_tom/how-we-built-a-real-time-credit-card-fraud-detection-system-an-architects-perspective-4ib7</guid>
      <description>&lt;p&gt;Every millisecond counts when it comes to fraud. A fraudulent transaction approved in 200ms costs real money. A legitimate transaction declined in 200ms costs a customer. Getting this balance right — at scale — is one of the hardest engineering problems in financial services.&lt;/p&gt;

&lt;p&gt;This is a deep dive into the architectural decisions, trade-offs, and hard lessons from building a production-grade credit card fraud detection system. No toy datasets. No Jupyter notebooks. Real architecture, real constraints.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Is Not What You Think
&lt;/h2&gt;

&lt;p&gt;Most tutorials frame fraud detection as a machine learning problem. Pick the right model, tune your F1 score, ship it. &lt;/p&gt;

&lt;p&gt;In production, it's an &lt;strong&gt;engineering and systems problem&lt;/strong&gt; with ML embedded inside it.&lt;/p&gt;

&lt;p&gt;The real challenges are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: you have ~150–300ms to make a decision before the payment network times out&lt;/li&gt;
&lt;li&gt;Scale: millions of transactions per day, with spikes you cannot always predict&lt;/li&gt;
&lt;li&gt;Imbalance: fraudulent transactions can be as low as 0.1% of total volume — your system must be hypersensitive to a signal buried in 99.9% noise&lt;/li&gt;
&lt;li&gt;Drift: fraud patterns change constantly; yesterday's model is today's liability&lt;/li&gt;
&lt;li&gt;Explainability: regulators and customers will ask &lt;em&gt;why&lt;/em&gt; a transaction was declined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are model problems. All of them are architecture problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture: The Big Picture
&lt;/h2&gt;

&lt;p&gt;Here's the high-level design we settled on after several iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transaction Request
       │
       ▼
┌─────────────────────┐
│   API Gateway        │  ← Rate limiting, auth, routing
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Feature Service     │  ← Real-time feature assembly (&amp;lt;20ms)
│  (Redis + Flink)     │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Scoring Engine      │  ← ML model inference (&amp;lt;50ms)
│  (Rule layer +       │
│   ML model layer)    │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Decision Engine     │  ← Threshold logic, risk bands, actions
└────────┬────────────┘
         │
    ┌────┴─────┐
    ▼          ▼
 Approve    Decline / Step-up (3DS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire synchronous path — from transaction in to decision out — must complete in under 300ms. Everything else (model retraining, alerting, feedback loops) is asynchronous.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Feature Engineering is Everything
&lt;/h2&gt;

&lt;p&gt;The biggest performance gains we saw did not come from switching models — they came from better features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Features (assembled per transaction)
&lt;/h3&gt;

&lt;p&gt;These are computed on-the-fly and pulled from Redis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Velocity features&lt;/strong&gt;: number of transactions in last 1m / 5m / 1h / 24h per card&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amount deviation&lt;/strong&gt;: how far this transaction deviates from the cardholder's average spend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merchant risk score&lt;/strong&gt;: pre-computed score for the merchant category / MCC code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-of-day signal&lt;/strong&gt;: is this transaction outside the cardholder's normal hours?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographic anomaly&lt;/strong&gt;: is the country/city inconsistent with recent usage?&lt;/li&gt;
&lt;/ul&gt;
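&lt;p&gt;The velocity features above are the ones the stream processor maintains. A toy version of the windowed counting (in-process, with a deque standing in for Flink's keyed state; production writes the counts to Redis rather than serving them directly):&lt;/p&gt;

```python
import time
from collections import defaultdict, deque

WINDOWS_S = {"1m": 60, "5m": 300, "1h": 3600}

# Per-card transaction timestamps; Flink keeps equivalent keyed state.
_events = defaultdict(deque)

def record_txn(card_id: str, ts: float) -> None:
    # Stream events arrive in time order, so the deque stays sorted.
    _events[card_id].append(ts)

def velocity(card_id: str, now: float) -> dict:
    q = _events[card_id]
    # Evict events older than the widest window.
    while q and now - q[0] > max(WINDOWS_S.values()):
        q.popleft()
    return {w: sum(1 for t in q if now - t <= s) for w, s in WINDOWS_S.items()}

now = time.time()
record_txn("4242", now - 7200)   # 2 h ago: outside every window, evicted
record_txn("4242", now - 200)    # ~3 min ago: counts in 5m and 1h only
record_txn("4242", now - 10)     # 10 s ago: counts in every window
counts = velocity("4242", now)
```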

&lt;h3&gt;
  
  
  Behavioural Baseline Features (batch, updated hourly)
&lt;/h3&gt;

&lt;p&gt;Pulled from a feature store (we used Feast, but Redis + Spark works too):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30-day average transaction amount&lt;/li&gt;
&lt;li&gt;Typical merchant categories&lt;/li&gt;
&lt;li&gt;Device fingerprint history&lt;/li&gt;
&lt;li&gt;Known trusted locations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Hard Part: Consistency
&lt;/h3&gt;

&lt;p&gt;The training data must use &lt;strong&gt;exactly the same feature definitions&lt;/strong&gt; as inference. Feature drift between training and serving is one of the most common sources of degraded model performance in production — and it's invisible until something breaks.&lt;/p&gt;

&lt;p&gt;We solved this by owning feature computation in a shared library used by both the batch training pipeline and the real-time feature service. Same code. No exceptions.&lt;/p&gt;
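&lt;p&gt;The shape of that shared library, reduced to one feature (the names here are illustrative; the point is that both pipelines import the same function):&lt;/p&gt;

```python
def amount_deviation(amount: float, avg_30d: float) -> float:
    """How far a transaction sits from the cardholder's 30-day average,
    as a ratio. Guards against brand-new cards with no history."""
    if avg_30d <= 0:
        return 0.0
    return (amount - avg_30d) / avg_30d

# Batch side: applied over historical rows to build the training set.
training_feature = amount_deviation(amount=250.0, avg_30d=100.0)

# Serving side: the real-time feature service calls the exact same code,
# so the training and inference definitions cannot drift apart.
serving_feature = amount_deviation(250.0, 100.0)
```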




&lt;h2&gt;
  
  
  Layer 2: The Scoring Engine — Rules + ML Together
&lt;/h2&gt;

&lt;p&gt;We ran two scoring layers in parallel, not in sequence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rules Engine (first line of defence)
&lt;/h3&gt;

&lt;p&gt;Simple, fast, interpretable. Handles obvious cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="n"&gt;transaction_country&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;cardholder_known_countries&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rules are cheap to update, easy to explain to regulators, and handle known fraud patterns with high precision. They block roughly 30–40% of fraud before the ML model is even invoked.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML Model Layer
&lt;/h3&gt;

&lt;p&gt;We landed on a &lt;strong&gt;gradient boosted tree&lt;/strong&gt; (XGBoost) as the primary model, for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: inference is sub-millisecond even with 200+ features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: consistently outperforms neural networks on tabular transaction data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability&lt;/strong&gt;: SHAP values give you per-prediction feature importance, which is gold for compliance teams&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also ran a secondary &lt;strong&gt;autoencoder-based anomaly detector&lt;/strong&gt; in parallel for catching novel fraud patterns the supervised model had never seen. Its score was blended in with a lower weight.&lt;/p&gt;
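&lt;p&gt;The blend itself is nothing exotic: a weighted sum does the job. The 0.2 weight below is an illustrative choice, not our production value:&lt;/p&gt;

```python
def blended_score(xgb_score: float, anomaly_score: float,
                  anomaly_weight: float = 0.2) -> float:
    """Combine the supervised score with the anomaly detector's score.

    The anomaly signal nudges the final score rather than overriding it."""
    w = anomaly_weight
    return (1 - w) * xgb_score + w * anomaly_score

# A transaction the supervised model thinks is safe, but the autoencoder
# finds highly unusual, moves up the risk bands without jumping straight
# to a decline.
score = blended_score(0.10, 0.90)
```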

&lt;h3&gt;
  
  
  Class Imbalance: What Actually Works
&lt;/h3&gt;

&lt;p&gt;The standard advice is to use SMOTE or random oversampling. In practice, we found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SMOTE helped during initial model development&lt;/strong&gt; but added noise at high volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-sensitive learning&lt;/strong&gt; (setting &lt;code&gt;scale_pos_weight&lt;/code&gt; in XGBoost) was more robust in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold tuning&lt;/strong&gt; post-training gave us far more control over the precision-recall trade-off than any resampling technique&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not optimise for accuracy. Optimise for &lt;strong&gt;precision-recall AUC&lt;/strong&gt; and then tune your operating threshold based on business risk appetite.&lt;/p&gt;
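&lt;p&gt;Both levers in miniature (a pure-Python sketch so it stands alone; in practice the weight goes into XGBoost's &lt;code&gt;scale_pos_weight&lt;/code&gt; parameter and the threshold search runs over a held-out validation set):&lt;/p&gt;

```python
def pos_weight(n_negative: int, n_positive: int) -> float:
    # The standard heuristic for XGBoost's scale_pos_weight.
    return n_negative / n_positive

w = pos_weight(999_000, 1_000)  # a 0.1% fraud rate gives weight 999

def choose_threshold(scored, min_precision=0.90):
    """scored: list of (model_score, is_fraud). Returns the lowest
    threshold whose precision still clears min_precision, i.e. the
    best recall available at that precision."""
    for t in sorted({s for s, _ in scored}):
        preds = [(s >= t, y) for s, y in scored]
        tp = sum(1 for p, y in preds if p and y)
        fp = sum(1 for p, y in preds if p and not y)
        if tp and tp / (tp + fp) >= min_precision:
            return t
    return None

validation = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.20, 0)]
threshold = choose_threshold(validation, min_precision=0.90)
```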




&lt;h2&gt;
  
  
  Layer 3: The Decision Engine
&lt;/h2&gt;

&lt;p&gt;The model outputs a score between 0 and 1. The decision engine translates that into an action:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score Band&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.0 – 0.3&lt;/td&gt;
&lt;td&gt;Approve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.3 – 0.6&lt;/td&gt;
&lt;td&gt;Approve with soft alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.6 – 0.8&lt;/td&gt;
&lt;td&gt;Step-up authentication (3DS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.8 – 1.0&lt;/td&gt;
&lt;td&gt;Decline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These thresholds are &lt;strong&gt;not fixed&lt;/strong&gt;. They shift based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merchant risk profile&lt;/strong&gt;: a high-risk merchant (e.g., crypto exchange, gift cards) shifts the threshold down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time of day / fraud rate spikes&lt;/strong&gt;: during known high-fraud periods, thresholds tighten&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cardholder profile&lt;/strong&gt;: a known high-value customer might get a softer threshold to protect against false positives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dynamic thresholding layer is where a lot of the business logic lives, and it's deliberately kept separate from the ML model so it can be tuned without retraining.&lt;/p&gt;
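&lt;p&gt;A stripped-down sketch of how the bands above shift with context (the adjustment sizes are illustrative; the real layer has many more inputs):&lt;/p&gt;

```python
# Base bands from the table above: (upper bound, action).
BASE_BANDS = [(0.3, "approve"), (0.6, "soft_alert"), (0.8, "step_up")]

def decide(score: float, merchant_risk: str = "normal",
           high_fraud_period: bool = False) -> str:
    shift = 0.0
    if merchant_risk == "high":   # e.g. crypto exchange, gift cards
        shift -= 0.10
    if high_fraud_period:         # known high-fraud windows tighten bands
        shift -= 0.05
    for upper, action in BASE_BANDS:
        if score < upper + shift:
            return action
    return "decline"

normal = decide(0.55)                            # soft alert
tightened = decide(0.55, merchant_risk="high")   # same score, riskier context
```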




&lt;h2&gt;
  
  
  Model Retraining: The Feedback Loop Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;A fraud model trained on 6-month-old data is already becoming stale. Fraud rings adapt quickly.&lt;/p&gt;

&lt;p&gt;Our retraining pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Daily batch&lt;/strong&gt;: new confirmed fraud labels (from manual review + chargeback signals) are fed into the training dataset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly retrain&lt;/strong&gt;: model is retrained on a rolling 90-day window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow scoring&lt;/strong&gt;: new model runs in shadow mode alongside production for 48 hours before promotion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canary release&lt;/strong&gt;: new model serves 5% of traffic before full rollout, with automatic rollback if key metrics degrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key metric we watched during shadow mode was &lt;strong&gt;rank order&lt;/strong&gt; — does the new model score known fraudulent transactions higher than known legitimate ones? A lift chart drift of more than 5% triggered a human review.&lt;/p&gt;
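&lt;p&gt;The rank-order check reduces to a pairwise comparison: across (fraud, legitimate) pairs, how often does the model score the fraud higher? A toy version of the shadow-mode comparison:&lt;/p&gt;

```python
def rank_order(scored) -> float:
    """Fraction of (fraud, legit) pairs ordered correctly.
    scored: list of (score, is_fraud)."""
    fraud = [s for s, y in scored if y]
    legit = [s for s, y in scored if not y]
    pairs = [(f, l) for f in fraud for l in legit]
    if not pairs:
        return 0.0
    return sum(1 for f, l in pairs if f > l) / len(pairs)

# Production and shadow models scored on the same labelled traffic
# (toy numbers).
prod = [(0.9, 1), (0.7, 1), (0.4, 0), (0.2, 0)]
shadow = [(0.8, 1), (0.3, 1), (0.5, 0), (0.1, 0)]

drift = rank_order(prod) - rank_order(shadow)
needs_review = drift > 0.05   # the 5% trigger described above
```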




&lt;h2&gt;
  
  
  Infrastructure Choices and Why
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Feature cache&lt;/td&gt;
&lt;td&gt;Redis Cluster&lt;/td&gt;
&lt;td&gt;Sub-millisecond read latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream processing&lt;/td&gt;
&lt;td&gt;Apache Flink&lt;/td&gt;
&lt;td&gt;Stateful windowed aggregations at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model serving&lt;/td&gt;
&lt;td&gt;Custom FastAPI service&lt;/td&gt;
&lt;td&gt;Full control over batching and concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature store&lt;/td&gt;
&lt;td&gt;Feast&lt;/td&gt;
&lt;td&gt;Consistency between training and serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana&lt;/td&gt;
&lt;td&gt;Real-time score distribution and drift alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experiment tracking&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;td&gt;Model versioning and reproducibility&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We deliberately avoided "full-stack" ML platforms that bundle everything together. The lock-in risk and latency overhead were not worth the developer experience gains at our scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Got Wrong (and Fixed)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. We underestimated feature latency
&lt;/h3&gt;

&lt;p&gt;Early on, our feature service was querying a relational database. At peak load, this added 80–120ms to the decision path — completely unacceptable. Migrating feature reads to Redis brought this down to 3–5ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. We over-rotated to ML and abandoned rules
&lt;/h3&gt;

&lt;p&gt;A period where the ML model was doing all the heavy lifting meant that when it degraded (due to data drift), there was no backstop. Bringing the rules layer back as a first gate significantly improved resilience.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. We optimised the wrong metric
&lt;/h3&gt;

&lt;p&gt;In early iterations, we chased high recall (catch as much fraud as possible). This led to a false positive rate that frustrated legitimate cardholders. The right trade-off is business-specific — understand your chargeback cost vs. your customer attrition cost before setting thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. We didn't monitor score distributions
&lt;/h3&gt;

&lt;p&gt;A model can look healthy on offline metrics while its live score distribution is silently shifting. We now track percentile distribution of scores daily. Any unexpected shift triggers an investigation.&lt;/p&gt;
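&lt;p&gt;The daily check is simple to sketch: compare today's score percentiles against a rolling baseline and alert on any large move (the tolerance below is illustrative):&lt;/p&gt;

```python
def percentiles(scores, qs=(0.5, 0.9, 0.99)) -> dict:
    s = sorted(scores)
    return {q: s[min(int(q * len(s)), len(s) - 1)] for q in qs}

def distribution_shifted(baseline, today, tol=0.05) -> bool:
    b, t = percentiles(baseline), percentiles(today)
    return any(abs(b[q] - t[q]) > tol for q in b)

baseline = [i / 1000 for i in range(1000)]                 # healthy spread
drifted = [min(1.0, i / 1000 + 0.2) for i in range(1000)]  # silently shifted

stable_day = distribution_shifted(baseline, baseline)
bad_day = distribution_shifted(baseline, drifted)
```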




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency is a hard constraint&lt;/strong&gt;, not a nice-to-have. Design your feature assembly and inference path first, then fit your model choices around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rules + ML is better than ML alone&lt;/strong&gt; — rules handle known patterns, ML catches the novel ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature engineering and consistency&lt;/strong&gt; between training and serving will give you more performance gains than model selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor score distributions in production&lt;/strong&gt;, not just model metrics on a test set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate your decision logic from your model&lt;/strong&gt; — thresholds and business rules change faster than models, and that's okay.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fraud detection is a live arms race. The system that catches fraud today will need to evolve to catch fraud tomorrow. Build for adaptability, not just accuracy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you built fraud detection systems? I'd love to hear what architectural decisions you made differently — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>fintech</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Google Cloud NEXT '26 Just Changed How I Think About Real-Time Pricing APIs</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:44:26 +0000</pubDate>
      <link>https://forem.com/printo_tom/google-cloud-next-26-just-changed-how-i-think-about-real-time-pricing-apis-3i3m</link>
      <guid>https://forem.com/printo_tom/google-cloud-next-26-just-changed-how-i-think-about-real-time-pricing-apis-3i3m</guid>
      <description>&lt;p&gt;I've been building Pricing APIs for years.&lt;/p&gt;

&lt;p&gt;And if you've ever built one — a real one, one that has to serve millions of customers, react to stock levels, competitor moves, time of day, demand surges, and promotional calendars — you know the dirty secret at the heart of most of them.&lt;/p&gt;

&lt;p&gt;They're not really real-time.&lt;/p&gt;

&lt;p&gt;They're very fast batch jobs wearing a real-time costume.&lt;/p&gt;

&lt;p&gt;You precompute prices on a schedule. You cache aggressively. You serve stale data with enough confidence that most customers never notice. And you spend half your architectural energy managing the gap between "what the price should be right now" and "what the price actually says right now."&lt;/p&gt;

&lt;p&gt;I've been sitting with that problem for a long time. Then Google Cloud NEXT '26 happened. And something clicked.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Google Just Announced (And Why It's Not What You Think)
&lt;/h2&gt;

&lt;p&gt;The big headline from NEXT '26 is the &lt;strong&gt;Gemini Enterprise Agent Platform&lt;/strong&gt; — Google's full rebranding of Vertex AI into an end-to-end system for building, scaling, governing, and optimising autonomous AI agents.&lt;/p&gt;

&lt;p&gt;Most people read that as an enterprise productivity story. Automate your emails. Summarise your meetings. Build a chatbot for HR.&lt;/p&gt;

&lt;p&gt;I read it as an infrastructure story. Specifically, as the missing architectural layer that makes a genuinely real-time Pricing API actually buildable — without a team of 20 people and a multi-year runway.&lt;/p&gt;

&lt;p&gt;Let me walk you through what I mean.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem With Pricing APIs
&lt;/h2&gt;

&lt;p&gt;Picture Black Friday. 9:00 AM.&lt;/p&gt;

&lt;p&gt;Your traffic spikes 40x in 90 seconds. Your inventory for a flagship product drops from 10,000 units to 200 in the time it takes to read this sentence. A competitor just dropped their price by 15%. Three promotional campaigns are live simultaneously. Your demand forecasting model is screaming that stock will hit zero in 11 minutes.&lt;/p&gt;

&lt;p&gt;What should the price be &lt;em&gt;right now&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Not in 5 minutes when your next batch job runs. Not based on the snapshot from 3 minutes ago. Right now.&lt;/p&gt;

&lt;p&gt;In most systems I've worked with, nobody actually knows. The system serves whatever it last computed and hopes for the best.&lt;/p&gt;

&lt;p&gt;The problem isn't compute. It's &lt;strong&gt;orchestration&lt;/strong&gt;. Getting five different signals — inventory, competitor pricing, demand velocity, promotional state, customer segment — to converge on a single decision, in milliseconds, at scale, without hard-coding every possible combination of conditions.&lt;/p&gt;

&lt;p&gt;That's not a caching problem. That's a reasoning problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Gemini Enterprise Agent Platform Changes the Game
&lt;/h2&gt;

&lt;p&gt;Here's what caught my attention at NEXT '26.&lt;/p&gt;

&lt;p&gt;Google announced the &lt;strong&gt;Agentic Data Cloud&lt;/strong&gt; — a new way of organising enterprise data so that AI agents can take action on it in real time. At the centre of it is something called the &lt;strong&gt;Knowledge Catalog&lt;/strong&gt;: a live, AI-maintained map of your entire data estate, automatically tagged and connected so agents understand the context and relationships between your systems.&lt;/p&gt;

&lt;p&gt;For a Pricing API, think about what that means.&lt;/p&gt;

&lt;p&gt;Today, your pricing logic has to explicitly know about the relationship between your inventory system, your competitor feed, your promotions database, and your demand model. Someone had to write that understanding into code. Someone has to maintain it when things change. Every new data source is a sprint.&lt;/p&gt;

&lt;p&gt;With a Knowledge Catalog backed by Gemini, your pricing agent doesn't need to be told "when inventory drops below 5%, check the demand velocity signal before adjusting price." It can reason over your connected data estate and arrive at that logic itself — and update its reasoning as your business changes, without a redeploy.&lt;/p&gt;

&lt;p&gt;That's a fundamentally different model.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Gemini-Powered Pricing API Looks Like
&lt;/h2&gt;

&lt;p&gt;Let me make this concrete. Here's how I'd architect a real-time pricing system on the Gemini Enterprise Agent Platform, using what was announced this week.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pricing Agent
&lt;/h3&gt;

&lt;p&gt;Built on the &lt;strong&gt;Agent Development Kit (ADK)&lt;/strong&gt;, the Pricing Agent is a persistent, long-running agent that owns pricing decisions for a product catalogue. It's not a model you call once per request. It's a stateful entity with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Bank&lt;/strong&gt; (now GA): persistent context about each product's pricing history, demand patterns, and the reasoning behind previous decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Identity&lt;/strong&gt;: a verified, auditable identity that allows the pricing agent to call authorised tools — pulling from your inventory system, your competitor feed, your ERP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent coordination&lt;/strong&gt;: sub-agents for inventory monitoring, competitor scraping, and promotional eligibility can push updates to the pricing agent in real time, rather than waiting to be polled&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Real-Time Signal Layer
&lt;/h3&gt;

&lt;p&gt;This is where the &lt;strong&gt;Agentic Data Cloud&lt;/strong&gt; and the &lt;strong&gt;Cross-Cloud Lakehouse&lt;/strong&gt; (built on Apache Iceberg) come in. Your data doesn't have to move. Your inventory lives in your warehouse system. Your competitor feed lives wherever it lives. The Knowledge Catalog maps the relationships, and the pricing agent queries across them with zero data migration.&lt;/p&gt;

&lt;p&gt;On Black Friday at 9:00 AM, the inventory sub-agent fires an event: "Product X — 200 units remaining, sell-through rate 18 units/minute." That event hits the pricing agent's inbox. The agent has context (from Memory Bank) that this product has a 94% attachment rate with a premium customer segment. It checks the promotional state. It queries the competitor feed sub-agent. And it issues a price update within the decision window — not because a rule said so, but because it reasoned to that outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Governance Layer
&lt;/h3&gt;

&lt;p&gt;This is the part that most AI architecture conversations skip. You cannot have an autonomous agent making live pricing decisions without an audit trail, human oversight triggers, and rollback capability.&lt;/p&gt;

&lt;p&gt;The Gemini Enterprise Agent Platform has this built in. &lt;strong&gt;Agent Engine&lt;/strong&gt; provides observability into every decision the agent makes — what signals it considered, what it decided, why. For regulated environments (and retail pricing absolutely counts), this auditability is non-negotiable.&lt;/p&gt;

&lt;p&gt;You can also define confidence thresholds: if the pricing agent's recommendation falls outside a predefined band, it doesn't auto-apply — it routes to a human for approval. That's not a limitation. That's the right architecture.&lt;/p&gt;
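&lt;p&gt;That routing rule is easy to sketch in platform-neutral terms (nothing below is a Google API; it is just the guardrail logic itself, with illustrative numbers):&lt;/p&gt;

```python
def route_price_update(current: float, proposed: float,
                       floor: float, ceiling: float,
                       max_step: float = 0.15) -> str:
    """Auto-apply only inside the predefined band; everything else
    routes to a human for approval."""
    if not floor <= proposed <= ceiling:
        return "escalate_to_human"
    if abs(proposed - current) / current > max_step:
        return "escalate_to_human"
    return "auto_apply"

ok = route_price_update(100.0, 92.0, floor=80.0, ceiling=120.0)
too_low = route_price_update(100.0, 70.0, floor=80.0, ceiling=120.0)
```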




&lt;h2&gt;
  
  
  The Part That Actually Surprised Me
&lt;/h2&gt;

&lt;p&gt;I expected the agent platform announcement. I did not expect the &lt;strong&gt;managed MCP servers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Google announced managed Model Context Protocol (MCP) servers across Google Cloud services. For those unfamiliar, MCP is the protocol that lets AI agents call external tools in a standardised way — think of it as USB-C for AI integrations.&lt;/p&gt;

&lt;p&gt;What this means for a Pricing API is that your agent can call your authorised pricing tools — apply discount, update price, trigger campaign — through a governed, observable, rate-limited interface. No custom integration work. No bespoke API wrappers. Just a standard protocol your agent speaks natively.&lt;/p&gt;

&lt;p&gt;Combined with the &lt;strong&gt;A2A (Agent-to-Agent) protocol&lt;/strong&gt; also announced this week, your pricing agent can directly coordinate with a fulfilment agent, a customer service agent, and a demand forecasting agent — across platforms, not just within Google Cloud. That's the multi-agent mesh that enterprise systems have needed for a decade.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Build First
&lt;/h2&gt;

&lt;p&gt;If I were starting a new Pricing API project tomorrow, here's where I'd begin on this stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1–2: Knowledge Catalog setup&lt;/strong&gt;&lt;br&gt;
Map your pricing-relevant data — inventory, promotions, competitor feeds, customer segments — into the Knowledge Catalog. Let Gemini tag the relationships. Don't write the logic yourself. Observe what it surfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3–4: Single-product pilot agent&lt;/strong&gt;&lt;br&gt;
Build a pricing agent for one product category using ADK. Give it read access to your inventory and competitor signals via MCP tools. Run it in shadow mode — it makes recommendations, humans apply them. Compare to your current pricing outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 5–6: Governance baseline&lt;/strong&gt;&lt;br&gt;
Define your guardrails before you go live. Price floor/ceiling bounds. Confidence thresholds. Escalation paths. Audit log review cadence. This is the work that makes the business trust the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 2 onward: Expand and measure&lt;/strong&gt;&lt;br&gt;
Roll out to more categories. Track margin impact, stock turn rate, competitive win rate. The agent gets better as Memory Bank accumulates context. That's the flywheel.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;I'm genuinely excited about what Google announced this week. The Gemini Enterprise Agent Platform, the Agentic Data Cloud, managed MCP servers, A2A — these aren't incremental updates. They're the connective tissue that enterprise AI has been missing.&lt;/p&gt;

&lt;p&gt;But I want to be honest about what this isn't, yet.&lt;/p&gt;

&lt;p&gt;Building a production-grade pricing agent still requires you to own the domain logic. You have to define what "correct pricing" means for your business. You have to set the guardrails. You have to decide when the agent decides autonomously and when a human is in the loop. The platform gives you the infrastructure. The thinking is still yours.&lt;/p&gt;

&lt;p&gt;And that's actually how it should be.&lt;/p&gt;

&lt;p&gt;Pricing is a competitive moat. The businesses that will win with this technology are not the ones who hand the decisions entirely to an agent. They're the ones who use agents to act faster on better information — while keeping the strategic reasoning close to home.&lt;/p&gt;

&lt;p&gt;That's always been the job of architecture. It's just got a lot more interesting.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you worked on real-time pricing systems? I'd love to hear how you're thinking about agents in this space — drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building a Pricing Engine That Actually Works at Scale</title>
      <dc:creator>Printo Tom</dc:creator>
      <pubDate>Thu, 23 Apr 2026 12:45:34 +0000</pubDate>
      <link>https://forem.com/printo_tom/building-a-pricing-engine-that-actually-works-at-scale-486f</link>
      <guid>https://forem.com/printo_tom/building-a-pricing-engine-that-actually-works-at-scale-486f</guid>
      <description>&lt;h1&gt;
  
  
  Building a Pricing Engine That Actually Works at Scale
&lt;/h1&gt;

&lt;p&gt;Most people think pricing is simple.&lt;/p&gt;

&lt;p&gt;Set a price → show it → done.&lt;/p&gt;

&lt;p&gt;But the moment you deal with real traffic, real users, and real business pressure… things get messy very quickly.&lt;/p&gt;

&lt;p&gt;I’ve worked on systems where pricing wasn’t just a number — it was something that had to react in real time to inventory, demand, and campaigns. And honestly, that’s where things get interesting.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Traditional pricing systems are slow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prices updated every few hours
&lt;/li&gt;
&lt;li&gt;Hardcoded rules
&lt;/li&gt;
&lt;li&gt;No flexibility
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That works… until:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic spikes
&lt;/li&gt;
&lt;li&gt;Inventory changes every minute
&lt;/li&gt;
&lt;li&gt;Business wants instant campaign updates
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the system starts breaking.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually needs to happen
&lt;/h2&gt;

&lt;p&gt;A modern pricing system needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Respond in milliseconds
&lt;/li&gt;
&lt;li&gt;Handle huge traffic (millions of requests)
&lt;/li&gt;
&lt;li&gt;Stay available even under load
&lt;/li&gt;
&lt;li&gt;Adapt quickly to business changes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which means — you can’t treat it like a simple backend service.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I think about the architecture
&lt;/h2&gt;

&lt;p&gt;Instead of one big system, you break it down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A service that handles pricing requests
&lt;/li&gt;
&lt;li&gt;A rules layer that controls logic (discounts, campaigns, etc.)
&lt;/li&gt;
&lt;li&gt;A data layer for inventory + product info
&lt;/li&gt;
&lt;li&gt;A messaging system (Kafka/RabbitMQ) to push updates in real time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea:&lt;/p&gt;

&lt;p&gt;👉 Don’t pull data. React to events.&lt;/p&gt;

&lt;p&gt;When inventory changes → pricing updates&lt;br&gt;
When campaigns change → pricing updates&lt;/p&gt;

&lt;p&gt;That shift alone changes everything.&lt;/p&gt;
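&lt;p&gt;Here's roughly what that event-driven flow looks like. This is a minimal in-memory sketch, with made-up class and event names; in a real system the events would arrive from Kafka or RabbitMQ rather than a direct method call:&lt;/p&gt;

```python
# Minimal sketch of event-driven pricing: instead of polling for data,
# the engine reacts to change events and reprices only the affected SKU.
# All names and event shapes here are illustrative.

class PricingEngine:
    def __init__(self, base_prices):
        self.base_prices = dict(base_prices)  # sku: catalog price
        self.stock = {}                       # sku: units on hand
        self.discount = {}                    # sku: active campaign discount
        self.prices = dict(base_prices)       # sku: current sell price

    def on_event(self, event):
        # One handler for both event types: update state, then reprice.
        sku = event["sku"]
        if event["type"] == "inventory_changed":
            self.stock[sku] = event["units"]
        elif event["type"] == "campaign_changed":
            self.discount[sku] = event["discount"]
        self._reprice(sku)

    def _reprice(self, sku):
        # Current price = base price minus any active campaign discount.
        off = self.discount.get(sku, 0.0)
        self.prices[sku] = round(self.base_prices[sku] * (1.0 - off), 2)
```

&lt;p&gt;The point isn't the repricing rule (trivially simple here), it's that no code path ever asks "what's the inventory right now?". State arrives as events, and pricing is always a reaction.&lt;/p&gt;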




&lt;h2&gt;
  
  
  The real challenges (not theory)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Latency kills everything
&lt;/h3&gt;

&lt;p&gt;If your pricing call takes too long, the whole user experience suffers.&lt;/p&gt;

&lt;p&gt;So you end up using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caching (a lot of it)
&lt;/li&gt;
&lt;li&gt;Precomputed values
&lt;/li&gt;
&lt;li&gt;Smart fallbacks
&lt;/li&gt;
&lt;/ul&gt;
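&lt;p&gt;A tiny sketch of what that caching-plus-fallback pattern can look like. The names here are invented, and a real version would add TTLs and cache invalidation driven by the event stream:&lt;/p&gt;

```python
class CachedPriceReader:
    """Serve prices from a fast cache; fall back to the last known value
    when the source of truth is unavailable (a 'smart fallback')."""

    def __init__(self, fetch_price):
        self.fetch_price = fetch_price  # slow call to the pricing service
        self.cache = {}                 # sku: precomputed price
        self.last_known = {}            # sku: stale-but-safe fallback

    def get(self, sku):
        if sku in self.cache:
            return self.cache[sku]      # hot path: no network call at all
        try:
            price = self.fetch_price(sku)
        except Exception:
            # Dependency is down: serve a stale value rather than fail the page.
            return self.last_known.get(sku)
        self.cache[sku] = price
        self.last_known[sku] = price
        return price
```

&lt;p&gt;The hot path never touches the network, and the failure path serves a stale price instead of an error. In pricing, stale-but-plausible usually beats down.&lt;/p&gt;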




&lt;h3&gt;
  
  
  2. Nothing is ever “clean”
&lt;/h3&gt;

&lt;p&gt;In real systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is inconsistent
&lt;/li&gt;
&lt;li&gt;Dependencies break
&lt;/li&gt;
&lt;li&gt;Edge cases show up daily
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t design for perfection — you design for failure.&lt;/p&gt;
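&lt;p&gt;In practice, designing for failure often comes down to small guards like this one (the bounds and names are invented for illustration):&lt;/p&gt;

```python
def safe_price(computed, floor, ceiling, fallback):
    """Never ship a bad price: reject non-numbers and clamp to sane bounds.
    Designing for failure means a broken input degrades to a safe value,
    not a crash or a free product."""
    if not isinstance(computed, (int, float)):
        return fallback                        # inconsistent upstream data
    return max(floor, min(ceiling, computed))  # clamp edge cases into range
```

&lt;p&gt;Every price that leaves the system goes through a guard like this, because sooner or later some upstream feed will send a null, a negative, or a number with six extra zeros.&lt;/p&gt;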




&lt;h3&gt;
  
  
  3. Business moves faster than code
&lt;/h3&gt;

&lt;p&gt;This is the hardest part.&lt;/p&gt;

&lt;p&gt;The business wants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New campaigns tomorrow
&lt;/li&gt;
&lt;li&gt;Pricing changes instantly
&lt;/li&gt;
&lt;li&gt;Experiments running all the time
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So your system has to be flexible, not just scalable.&lt;/p&gt;
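&lt;p&gt;One way to keep up: treat pricing rules as data, not code. A hypothetical sketch, where the rule list would really live in a database or feature-flag system that the business can edit directly:&lt;/p&gt;

```python
# Rules as data: a new campaign is a row, not a deploy.
# The rule shapes below are made up for illustration.
RULES = [
    {"name": "spring-sale", "skus": {"sku-1", "sku-2"}, "discount": 0.10},
    {"name": "clearance",   "skus": {"sku-3"},          "discount": 0.30},
]

def apply_rules(sku, base_price, rules):
    """Apply every matching rule; the best discount wins."""
    discounts = [r["discount"] for r in rules if sku in r["skus"]]
    best = max(discounts, default=0.0)
    return round(base_price * (1.0 - best), 2)
```

&lt;p&gt;Engineering owns the evaluator; the business owns the rows. That split is what lets campaigns launch tomorrow without touching the codebase.&lt;/p&gt;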




&lt;h2&gt;
  
  
  Where AI starts coming in
&lt;/h2&gt;

&lt;p&gt;Now things are evolving.&lt;/p&gt;

&lt;p&gt;Pricing is no longer just rules-based.&lt;/p&gt;

&lt;p&gt;You start seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Demand prediction
&lt;/li&gt;
&lt;li&gt;Competitor-based adjustments
&lt;/li&gt;
&lt;li&gt;Dynamic pricing models
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But even then — the architecture still matters more than the model.&lt;/p&gt;
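&lt;p&gt;Even a model-driven price usually enters the system as just another input with guardrails around it. A deliberately simplified sketch: the multiplier band is invented, and a real demand signal would come from a trained model rather than a parameter:&lt;/p&gt;

```python
def demand_adjusted_price(base_price, predicted_demand, capacity):
    """Illustrative dynamic pricing: scale price with predicted utilization,
    but keep the model's output inside hard guardrails. The 0.9x-1.3x band
    is a made-up policy, not a recommendation."""
    utilization = min(predicted_demand / capacity, 1.0)  # cap at 100%
    multiplier = 0.9 + 0.4 * utilization  # 0.9 at idle, 1.3 at full demand
    return round(base_price * multiplier, 2)
```

&lt;p&gt;The model only proposes; the pipeline around it (caps, fallbacks, events) decides what actually ships. That's why the architecture still matters more than the model.&lt;/p&gt;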




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The biggest lesson for me:&lt;/p&gt;

&lt;p&gt;👉 Pricing systems are not about pricing.&lt;/p&gt;

&lt;p&gt;They’re about building a system that can handle constant change without breaking.&lt;/p&gt;

&lt;p&gt;And that’s where most designs fail.&lt;/p&gt;




&lt;p&gt;If you’ve worked on similar systems, I’m curious — what was the hardest part for you?&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>microservices</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
