<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Retana</title>
    <description>The latest articles on Forem by Alex Retana (@alexretana).</description>
    <link>https://forem.com/alexretana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3339127%2F961c69f0-2bb0-492f-8f51-657ae5b264a8.png</url>
      <title>Forem: Alex Retana</title>
      <link>https://forem.com/alexretana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexretana"/>
    <language>en</language>
    <item>
      <title>Cracking the Medical Coding Challenge: Fine-Tuning BioBERT for ICD-10 Classification (Part 1)</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:20:41 +0000</pubDate>
      <link>https://forem.com/alexretana/cracking-the-medical-coding-challenge-fine-tuning-biobert-for-icd-10-classification-part-1-7oc</link>
      <guid>https://forem.com/alexretana/cracking-the-medical-coding-challenge-fine-tuning-biobert-for-icd-10-classification-part-1-7oc</guid>
      <description>&lt;h2&gt;
  
  
  The Problem That Keeps Medical Coders Up at Night
&lt;/h2&gt;

&lt;p&gt;Imagine you're processing disability claims for veterans. Each claim contains dense medical documentation—thousands of characters describing symptoms, diagnoses, and treatment history. Your job? Extract the correct ICD-10 diagnostic codes from this narrative. Miss a code, and a veteran might not receive the benefits they've earned. Add an incorrect code, and you've created compliance issues.&lt;/p&gt;

&lt;p&gt;Now imagine doing this hundreds of times per day, under pressure, with 158+ possible diagnosis codes to remember.&lt;/p&gt;

&lt;p&gt;This is exactly the type of problem that makes medical coding both critically important and incredibly challenging. And it's the perfect use case for Natural Language Processing (NLP). But here's the catch: training an AI to do this isn't straightforward, especially when you're dealing with limited training data and severe class imbalance.&lt;/p&gt;

&lt;p&gt;In this two-part series, I'll walk you through building an automated medical coding system. &lt;strong&gt;Part 1&lt;/strong&gt; (this article) focuses on fine-tuning BioBERT with advanced techniques to handle real-world constraints. &lt;strong&gt;Part 2&lt;/strong&gt; will explore AWS Comprehend Medical as an alternative approach and compare the two solutions.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/alexretana/clinical-nlp-claims-processing" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Project Matters: Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Before diving into code, let's talk about why automated medical coding matters:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Disability Claims Processing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Veterans Affairs (VA) processes millions of disability claims. Each claim requires accurate ICD-10 coding to determine eligibility and compensation levels. Manual coding creates bottlenecks and inconsistencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Healthcare Revenue Cycle Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Hospitals lose billions annually due to coding errors. Automated coding assistance can flag potential issues before claims are submitted to insurance companies.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Clinical Research&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Large-scale medical studies require consistent coding of patient records. Automated extraction enables researchers to identify patient cohorts more efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Compliance and Auditing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Healthcare organizations must ensure coding accuracy for regulatory compliance. AI systems can audit existing codes and identify discrepancies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje6bq1114ny6k4z9sjk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje6bq1114ny6k4z9sjk0.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dataset: MedCodER and Its Challenges
&lt;/h2&gt;

&lt;p&gt;For this project, we're using the &lt;strong&gt;MedCodER&lt;/strong&gt; (Medical Coding with Explanations and Retrievals) dataset, which contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;500+ clinical documents&lt;/strong&gt; with full SOAP notes (Subjective, Objective, Assessment, Plan)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;158 unique ICD-10-CM codes&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supporting evidence annotations&lt;/strong&gt; showing which text spans support each diagnosis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severe class imbalance&lt;/strong&gt;: Most codes appear fewer than 10 times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what makes this dataset challenging (and realistic):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Class distribution snapshot
&lt;/span&gt;&lt;span class="n"&gt;Total&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;158&lt;/span&gt;
&lt;span class="n"&gt;Codes&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="err"&gt;≥&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="c1"&gt;# Only 11% have sufficient training data!
&lt;/span&gt;&lt;span class="n"&gt;Codes&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="err"&gt;≥&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
&lt;span class="n"&gt;Codes&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;98&lt;/span&gt;  &lt;span class="c1"&gt;# 62% are extremely rare
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors real-world medical data perfectly—common conditions like diabetes and hypertension appear frequently, while rare diseases have minimal examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5a6j7zmjwu6eetotiht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5a6j7zmjwu6eetotiht.png" alt=" " width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach (And Why It Fails Spectacularly)
&lt;/h2&gt;

&lt;p&gt;Let's talk about what &lt;em&gt;doesn't&lt;/em&gt; work. Your first instinct might be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take full 2000+ character clinical documents&lt;/li&gt;
&lt;li&gt;Feed them to BioBERT&lt;/li&gt;
&lt;li&gt;Train on all 158 classes&lt;/li&gt;
&lt;li&gt;Hope for the best&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Macro F1 score of 0.023 (2.3%). Essentially random guessing.&lt;/p&gt;
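
&lt;p&gt;As a refresher, macro F1 averages per-class F1 scores with equal weight, so with 158 classes even decent accuracy on a handful of common codes barely moves the number. A minimal pure-Python sketch with illustrative labels (sklearn's &lt;code&gt;f1_score(average='macro')&lt;/code&gt; computes the same thing):&lt;/p&gt;

```python
def macro_f1(y_true, y_pred, labels):
    """Macro F1: unweighted mean of per-class F1 scores.

    With many classes and near-random predictions, most per-class
    F1 scores are ~0, dragging the macro average toward zero.
    """
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# A model that only ever predicts the majority code scores poorly on macro F1
y_true = ["I10", "E11.9", "I10", "M54.5"]
y_pred = ["I10", "I10", "I10", "I10"]
score = macro_f1(y_true, y_pred, ["I10", "E11.9", "M54.5"])  # roughly 0.222: two classes score 0
```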

&lt;h3&gt;
  
  
  Why does this fail?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Signal Dilution&lt;/strong&gt;&lt;br&gt;
A 2000-character document might contain only 50-100 characters actually describing a specific diagnosis. The rest is noise—patient demographics, vital signs, medication lists, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Insufficient Training Data&lt;/strong&gt;&lt;br&gt;
With only 500 documents and 158 classes, you have an average of ~3 examples per class. Deep learning models need orders of magnitude more data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3: Catastrophic Overfitting&lt;/strong&gt;&lt;br&gt;
BioBERT has 110 million parameters. Training all of them on tiny datasets causes the model to memorize training examples rather than learn generalizable patterns.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: A Five-Pronged Strategy
&lt;/h2&gt;

&lt;p&gt;To achieve a &lt;strong&gt;94.4% Macro F1 score&lt;/strong&gt; (a 4,000% improvement!), we implement five key techniques:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Evidence-Focused Training
&lt;/h3&gt;
&lt;h3&gt;
  
  
  2. Label Space Optimization
&lt;/h3&gt;
&lt;h3&gt;
  
  
  3. Back-Translation Data Augmentation
&lt;/h3&gt;
&lt;h3&gt;
  
  
  4. LoRA Parameter-Efficient Fine-Tuning
&lt;/h3&gt;
&lt;h3&gt;
  
  
  5. Class-Weighted Loss Function
&lt;/h3&gt;

&lt;p&gt;Let's dive into each one.&lt;/p&gt;


&lt;h2&gt;
  
  
  Technique 1: Evidence-Focused Training
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Training on 2000-character documents dilutes the diagnostic signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Use the supporting evidence annotations to extract focused diagnostic spans (~150-200 characters) with context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_evidence_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract evidence span from full document text&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;End&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract with ±50 character context window
&lt;/span&gt;    &lt;span class="n"&gt;context_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medical_record_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medical_record_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;context_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;context_end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: We're giving the model concentrated diagnostic information. Instead of finding a needle in a haystack, we're handing it the needle.&lt;/p&gt;
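
&lt;p&gt;To make the extraction concrete, here is the same logic applied to a toy record (the record text and its offsets are illustrative, not taken from the dataset):&lt;/p&gt;

```python
def extract_evidence_text(row, window=50):
    """Evidence span plus a ±50-character context window (as defined above)."""
    start, end = int(row['Start']), int(row['End'])
    text = row['medical_record_text']
    context_start = max(0, start - window)
    context_end = min(len(text), end + window)
    return text[context_start:context_end]

# Hypothetical record; 'Start'/'End' mark the annotated evidence span
record = {
    'medical_record_text': ('Vitals stable. ' * 10
                            + 'Diagnosis: Essential (primary) hypertension.'
                            + ' Plan: increase lisinopril.' * 5),
    'Start': 150,
    'End': 194,
}
span = extract_evidence_text(record)  # 144 focused chars instead of the full 329
```

&lt;p&gt;With a pandas DataFrame, the same function applies row-wise via &lt;code&gt;df.apply(extract_evidence_text, axis=1)&lt;/code&gt;.&lt;/p&gt;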

&lt;p&gt;&lt;strong&gt;Example transformation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Full Document (2,347 chars)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Long patient history, demographics, vitals, multiple conditions mixed together...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;Evidence Span (189 chars)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"...blood pressure remains elevated at 156/94 despite medication compliance. 
Diagnosis: Essential (primary) hypertension. Will increase lisinopril dose..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Consequence of skipping this step&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Without evidence extraction, the model struggles to differentiate signal from noise. You'd see F1 scores plateau around 20-30% even with other optimizations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technique 2: Label Space Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: 62% of codes have fewer than 10 training examples—impossible to learn from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Filter to codes with ≥80 examples, reducing from 158 codes to 18 viable classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_SAMPLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="n"&gt;code_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evidence_focused&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ICD10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;frequent_codes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;code_freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code_freq&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MIN_SAMPLES&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;evidence_filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evidence_focused&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;evidence_focused&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ICD10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frequent_codes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reduced to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frequent_codes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; codes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 18 codes
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retained &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evidence_filtered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; examples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ~1,200 examples
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Machine learning requires sufficient examples to learn patterns. By focusing on codes with adequate representation, we ensure the model can actually learn meaningful relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off&lt;/strong&gt;: We sacrifice coverage (18 codes vs. 158) for accuracy. This is acceptable in a hybrid system where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom model handles frequent codes&lt;/strong&gt; (high accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commercial API handles rare codes&lt;/strong&gt; (broader coverage, lower accuracy)&lt;/li&gt;
&lt;/ul&gt;
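
&lt;p&gt;A minimal sketch of that routing logic (the model and API calls here are placeholder stubs, not the actual pipeline; &lt;code&gt;hybrid_predict&lt;/code&gt; and the confidence threshold are assumptions for illustration):&lt;/p&gt;

```python
FREQUENT_CODES = {'I10', 'E11.9', 'M54.5'}  # illustrative subset of the 18 retained codes

def predict_frequent(text):
    """Placeholder for the fine-tuned BioBERT model (returns code, confidence)."""
    return 'I10', 0.97

def commercial_api_lookup(text):
    """Placeholder for a broad-coverage commercial service."""
    return 'A15.0', 0.55

def hybrid_predict(text, threshold=0.80):
    code, conf = predict_frequent(text)
    # Trust the custom model only for codes it was trained on, at high confidence
    if code in FREQUENT_CODES and conf >= threshold:
        return code, 'custom-model'
    # Otherwise fall back to the commercial API for coverage
    code, conf = commercial_api_lookup(text)
    return code, 'commercial-api'
```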

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb5proqc5ca6zy3ezxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb5proqc5ca6zy3ezxt.png" alt=" " width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consequence of skipping this step&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Including rare codes creates extreme class imbalance. The model would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ignore rare classes entirely (predicting only common ones)&lt;/li&gt;
&lt;li&gt;Waste capacity trying to memorize insufficient examples&lt;/li&gt;
&lt;li&gt;Achieve poor performance across all classes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Technique 3: Back-Translation Data Augmentation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Even after filtering, we only have ~1,200 training examples for 18 classes (~67 examples per class). Still limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Use back-translation to generate synthetic training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;back_translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pivot_lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;de&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Translate EN→DE→EN to create paraphrased version&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# EN → German
&lt;/span&gt;    &lt;span class="n"&gt;fwd_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarianMTModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Helsinki-NLP/opus-mt-en-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pivot_lang&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fwd_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarianTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Helsinki-NLP/opus-mt-en-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pivot_lang&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fwd_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fwd_tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fwd_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fwd_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;fwd_inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;german_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fwd_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fwd_outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# German → EN
&lt;/span&gt;    &lt;span class="n"&gt;bwd_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarianMTModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Helsinki-NLP/opus-mt-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pivot_lang&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bwd_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MarianTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Helsinki-NLP/opus-mt-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pivot_lang&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;bwd_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bwd_tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;german_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bwd_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bwd_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;bwd_inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;back_translated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bwd_tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bwd_outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;back_translated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example transformation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Patient reports persistent chest pain radiating to left arm with 
shortness of breath during physical exertion."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After EN→DE→EN&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Patient experiences continuous chest pain extending to the left arm 
with breathing difficulty during physical activity."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: The semantic meaning remains identical, but the phrasing varies. This teaches the model to recognize diagnoses regardless of how they're worded—critical for handling real-world clinical variation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best practice&lt;/strong&gt;: Use multiple pivot languages (German, French, Spanish) for 4x data expansion. In our demo, we use German for 1.2x expansion to save time.&lt;/p&gt;
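
&lt;p&gt;A runnable sketch of multi-pivot augmentation (the translator below is an identity stub standing in for the MarianMT round-trip above, so the sketch stays self-contained; each pivot language contributes one paraphrase per example, giving the 4x expansion):&lt;/p&gt;

```python
PIVOT_LANGS = ['de', 'fr', 'es']  # each pivot yields one paraphrase per example

def back_translate_stub(text, pivot_lang):
    """Stand-in for the MarianMT round-trip above; a real run would
    translate EN to pivot_lang and back to EN."""
    return text  # identity placeholder

def augment_examples(examples, pivot_langs=PIVOT_LANGS):
    """Return originals plus one back-translated copy per pivot language."""
    augmented = list(examples)
    for lang in pivot_langs:
        for ex in examples:
            augmented.append({'text': back_translate_stub(ex['text'], lang),
                              'ICD10': ex['ICD10']})  # label is preserved
    return augmented
```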

&lt;h3&gt;
  
  
  &lt;strong&gt;Critical requirement&lt;/strong&gt;: Keep the validation set 100% original data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split BEFORE augmentation
&lt;/span&gt;&lt;span class="n"&gt;train_orig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_orig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Augment ONLY training data
&lt;/span&gt;&lt;span class="n"&gt;train_augmented&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;augment_with_back_translation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_orig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;train_orig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_augmented&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Validation stays 100% original
&lt;/span&gt;&lt;span class="n"&gt;val_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_orig&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;: If augmented data leaks into validation, you'll get overly optimistic metrics. The model might learn artifacts of the translation process rather than true diagnostic patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4u27h7jzle34lmki9lh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4u27h7jzle34lmki9lh.png" alt=" " width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consequence of skipping this step&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Without augmentation, the model has limited exposure to linguistic variation. It might learn to recognize specific phrasings but fail on synonyms or alternative formulations—reducing real-world robustness by 10-15%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technique 4: LoRA (Low-Rank Adaptation) Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: BioBERT has 110 million parameters. Training all of them on 1,200 examples causes severe overfitting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Use LoRA to train only 0.1% of parameters while keeping the rest frozen.&lt;/p&gt;

&lt;h3&gt;
  
  
  How LoRA Works
&lt;/h3&gt;

&lt;p&gt;Instead of updating all weights in the attention layers, LoRA injects trainable low-rank matrices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional: W_new = W_old + ΔW  (update all 768×768 = 589,824 params)
LoRA: W_new = W_old + A×B  (update 768×8 + 8×768 = 12,288 params)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; is a 768×8 matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B&lt;/strong&gt; is an 8×768 matrix
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r=8&lt;/strong&gt; is the rank (a hyperparameter)
&lt;/li&gt;
&lt;/ul&gt;
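
&lt;p&gt;The parameter arithmetic above checks out directly:&lt;/p&gt;

```python
# Verify the counts above for one 768x768 attention projection
hidden, r = 768, 8

full_update = hidden * hidden          # dense delta-W: 589,824 params
lora_update = hidden * r + r * hidden  # A (768x8) plus B (8x768): 12,288 params

reduction = full_update / lora_update  # 48x fewer trainable params per matrix
```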

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskType&lt;/span&gt;

&lt;span class="c1"&gt;# Load base BioBERT model
&lt;/span&gt;&lt;span class="n"&gt;base_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dmis-lab/biobert-v1.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;problem_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;single_label_classification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Configure LoRA
&lt;/span&gt;&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SEQ_CLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Rank: controls capacity vs. overfitting trade-off
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Scaling factor (typically 2×r)
&lt;/span&gt;    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Apply to Q/V attention projections
&lt;/span&gt;    &lt;span class="n"&gt;inference_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply LoRA adapter
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trainable params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trainable params: 148,488 (0.13%)
Total params: 109,629,456 (100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-trained knowledge is preserved&lt;/strong&gt;: BioBERT's medical understanding stays intact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-specific adaptation&lt;/strong&gt;: The small LoRA adapters learn to map BioBERT's features to ICD-10 codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization effect&lt;/strong&gt;: Limited capacity prevents memorization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Choosing the rank (r)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r=4&lt;/strong&gt;: Very lightweight, may underfit complex tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r=8&lt;/strong&gt;: Sweet spot for most tasks (used here)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r=16&lt;/strong&gt;: More capacity, risk of overfitting on small datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;r=32+&lt;/strong&gt;: Approaching full fine-tuning behavior&lt;/li&gt;
&lt;/ul&gt;
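Because each adapted matrix adds 2 × d × r parameters, the rank scales cost linearly; a quick sketch of the per-matrix cost at each rank listed above (assuming BERT-base's d=768):

```python
def lora_params_per_matrix(r, d=768):
    """Trainable parameters one LoRA adapter adds to a d×d weight: A (d×r) plus B (r×d)."""
    return 2 * d * r

for r in (4, 8, 16, 32):
    print(f"r={r:>2}: {lora_params_per_matrix(r):,} trainable params")
# r=4 gives 6,144; r=8 gives 12,288; r=16 gives 24,576; r=32 gives 49,152
```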

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasf6muzm7tg0ja924fxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasf6muzm7tg0ja924fxs.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;br&gt;
The image above is from the Hugging Face PEFT documentation: &lt;a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora" rel="noopener noreferrer"&gt;https://huggingface.co/docs/peft/main/en/conceptual_guides/lora&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Consequence of skipping this step&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Full fine-tuning on this dataset produces F1 scores around 20-30%. The model memorizes training examples and fails to generalize. LoRA's regularization is the difference between failure and success.&lt;/p&gt;


&lt;h2&gt;
  
  
  Technique 5: Class-Weighted Loss Function
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: Even after filtering, the dataset remains imbalanced (some codes have 200 examples, others only 80).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Use weighted cross-entropy loss that penalizes errors on rare classes more heavily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.utils.class_weight&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compute_class_weight&lt;/span&gt;

&lt;span class="c1"&gt;# Compute balanced class weights
&lt;/span&gt;&lt;span class="n"&gt;class_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_class_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;class_weights_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Custom Trainer with weighted loss
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WeightedTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_items_in_batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;

        &lt;span class="c1"&gt;# Weighted cross-entropy loss
&lt;/span&gt;        &lt;span class="n"&gt;loss_fct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class_weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loss_fct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;return_outputs&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How balanced weights work&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weight[c] = n_samples / (n_classes × n_samples_in_class[c])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Class A: 200 examples → weight = 1,200/(18×200) = 0.33&lt;/li&gt;
&lt;li&gt;Class B: 80 examples → weight = 1,200/(18×80) = 0.83&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During training, misclassifying Class B incurs 2.5× the penalty of Class A.&lt;/p&gt;
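The weights above can be reproduced with the same formula that `compute_class_weight('balanced')` applies; a self-contained check using the example counts (pure Python, no sklearn required):

```python
n_samples = 1200
n_classes = 18
counts = {"class_A": 200, "class_B": 80}

# weight[c] = n_samples / (n_classes × n_samples_in_class[c])
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(weights)  # {'class_A': 0.333..., 'class_B': 0.833...}
print(round(weights["class_B"] / weights["class_A"], 2))  # 2.5
```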

&lt;h3&gt;
  
  
  &lt;strong&gt;Consequence of skipping this step&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;Without weighting, the model optimizes for overall accuracy by focusing on frequent classes. Rare classes get ignored, reducing macro F1 by 5-10%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together: Training Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./models/biobert-lora-improved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Higher LR for LoRA (10× standard fine-tuning)
&lt;/span&gt;    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_best_model_at_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_for_best_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;macro_f1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Mixed precision for faster training
&lt;/span&gt;    &lt;span class="n"&gt;warmup_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WeightedTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;class_weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;class_weights_tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key hyperparameters explained&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate (2e-4)&lt;/strong&gt;: Higher than typical fine-tuning (2e-5) because LoRA adapters can handle larger updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch size (16)&lt;/strong&gt;: Balanced between GPU memory and gradient quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Epochs (15)&lt;/strong&gt;: Sufficient for convergence without overfitting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16&lt;/strong&gt;: Reduces memory usage and speeds up training by ~2×&lt;/li&gt;
&lt;/ul&gt;
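One piece the trainer setup references but doesn't show is the `compute_metrics` callback that feeds `metric_for_best_model='macro_f1'`. A plausible implementation, consistent with the metrics reported below (an assumption, not necessarily the exact original):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Hugging Face Trainer callback: eval_pred is a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "weighted_f1": f1_score(labels, preds, average="weighted"),
    }
```

The dictionary key must match `metric_for_best_model` (the Trainer prefixes it internally as `eval_macro_f1`).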




&lt;h2&gt;
  
  
  Results: From Failure to Success
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;94.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macro F1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.944&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted F1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.945&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macro Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.944&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macro Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.950&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Comparison to naive approach&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Macro F1&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive (full docs, all classes, full fine-tuning)&lt;/td&gt;
&lt;td&gt;0.023&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improved (evidence + LoRA + augmentation)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.944&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4,000%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
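The headline improvement figure follows directly from the two macro F1 scores; a one-line check:

```python
naive_f1, improved_f1 = 0.023, 0.944

relative_improvement = (improved_f1 - naive_f1) / naive_f1
print(f"{relative_improvement:.0%}")  # 4004%, i.e. roughly +4,000%
```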

&lt;h3&gt;
  
  
  Per-Class Performance
&lt;/h3&gt;

&lt;p&gt;The model achieves balanced performance across all 18 classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    precision    recall  f1-score   support

         E11.9          0.95      0.95      0.95        20
         I10            0.93      0.97      0.95        15
         E78.5          0.94      0.94      0.94        18
         ...

    macro avg          0.94      0.95      0.94       240
 weighted avg          0.95      0.94      0.95       240
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No class falls below 90% F1—demonstrating that our techniques successfully handle the remaining imbalance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We've Learned: Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ &lt;strong&gt;Do This&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract focused context&lt;/strong&gt;: Don't train on full documents when evidence spans are available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter aggressively&lt;/strong&gt;: Better to excel at 18 codes than fail at 158&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Augment intelligently&lt;/strong&gt;: Back-translation preserves semantics while adding variation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use parameter-efficient methods&lt;/strong&gt;: LoRA prevents overfitting on small datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight your loss&lt;/strong&gt;: Account for remaining class imbalance&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ❌ &lt;strong&gt;Avoid This&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Training on full documents&lt;/strong&gt;: Dilutes diagnostic signals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Including rare classes&lt;/strong&gt;: Classes with &amp;lt;10 examples are effectively unlearnable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixing augmented data into validation&lt;/strong&gt;: Creates overly optimistic metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full fine-tuning&lt;/strong&gt;: Causes catastrophic overfitting on small datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring class imbalance&lt;/strong&gt;: Model will focus only on frequent classes&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Limitations and Future Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Limited Code Coverage&lt;/strong&gt;&lt;br&gt;
We only handle 18 out of 158 codes. For production use, you'd need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More training data for rare codes&lt;/li&gt;
&lt;li&gt;Hierarchical classification (predict ICD chapter first, then specific code)&lt;/li&gt;
&lt;li&gt;Hybrid approach with commercial APIs&lt;/li&gt;
&lt;/ul&gt;
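The hierarchical option is natural because ICD-10 encodes its hierarchy in the code string itself: the first three characters are the category (e.g. E11.9 sits under E11). A minimal sketch of grouping a label space by category, with hypothetical codes (not the article's implementation):

```python
def icd10_category(code: str) -> str:
    """Three-character ICD-10 category, e.g. 'E11.9' -> 'E11'."""
    return code.split(".")[0][:3]

# A two-stage classifier would first predict the category, then pick
# among only that category's codes, shrinking each decision's label space.
codes = ["E11.9", "E11.65", "I10", "E78.5"]
by_category = {}
for code in codes:
    by_category.setdefault(icd10_category(code), []).append(code)

print(by_category)  # {'E11': ['E11.9', 'E11.65'], 'I10': ['I10'], 'E78': ['E78.5']}
```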

&lt;p&gt;&lt;strong&gt;2. Evidence Dependency&lt;/strong&gt;&lt;br&gt;
Our approach requires supporting evidence annotations. For new data without annotations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use attention weights to identify key spans&lt;/li&gt;
&lt;li&gt;Employ named entity recognition (NER) to extract diagnoses&lt;/li&gt;
&lt;li&gt;Apply the trained model to full documents (with performance degradation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Multi-Label Simplification&lt;/strong&gt;&lt;br&gt;
We converted multi-label to single-label (one example per code). True multi-label classification would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict all relevant codes simultaneously&lt;/li&gt;
&lt;li&gt;Model code co-occurrence patterns&lt;/li&gt;
&lt;li&gt;Better reflect real clinical scenarios&lt;/li&gt;
&lt;/ul&gt;
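For context, moving to true multi-label mostly changes the target encoding (one multi-hot vector per document) and the loss (sigmoid plus binary cross-entropy instead of softmax). A minimal sketch of the encoding, using hypothetical labels:

```python
label_list = ["E11.9", "I10", "E78.5", "M54.5"]  # hypothetical label space
label_to_id = {code: i for i, code in enumerate(label_list)}

def to_multi_hot(doc_codes):
    """Encode a document's full code set as one multi-hot target vector."""
    vec = [0.0] * len(label_list)
    for code in doc_codes:
        vec[label_to_id[code]] = 1.0
    return vec

print(to_multi_hot(["E11.9", "I10"]))  # [1.0, 1.0, 0.0, 0.0]
```

With Hugging Face models, this pairs with `problem_type='multi_label_classification'`, which swaps the classification head's loss to `BCEWithLogitsLoss`.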
&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical Classification&lt;/strong&gt;: Leverage ICD-10's tree structure (Chapter → Category → Code)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full Augmentation&lt;/strong&gt;: Implement FR and ES translations for 4× data expansion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensemble Methods&lt;/strong&gt;: Combine multiple augmented models with different random seeds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Label Extension&lt;/strong&gt;: Train on documents with all codes simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Learning&lt;/strong&gt;: Pre-train on medical entity recognition before ICD-10 classification&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Coming Up in Part 2: AWS Comprehend Medical
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll explore a completely different approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot inference&lt;/strong&gt; using AWS's pre-trained medical NLP service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity trait filtering&lt;/strong&gt; to handle negations, hypotheticals, and family history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-label evaluation&lt;/strong&gt; at the document level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head-to-head comparison&lt;/strong&gt; with our BioBERT model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid strategy&lt;/strong&gt; combining both approaches for optimal results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll discover that AWS Comprehend Medical achieves 27% macro F1 on all 158 codes (vs. our 94% on 18 codes)—a fascinating trade-off between coverage and accuracy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All code is available in the GitHub repository:&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/alexretana/clinical-nlp-claims-processing" rel="noopener noreferrer"&gt;clinical-nlp-claims-processing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To run this notebook&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/alexretana/clinical-nlp-claims-processing.git
&lt;span class="nb"&gt;cd &lt;/span&gt;clinical-nlp-claims-processing

&lt;span class="c"&gt;# Install dependencies (using uv)&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
uv &lt;span class="nb"&gt;sync&lt;/span&gt;

&lt;span class="c"&gt;# Launch Jupyter&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate  &lt;span class="c"&gt;# On Windows: .venv\Scripts\activate&lt;/span&gt;
jupyter lab

&lt;span class="c"&gt;# Open notebooks/01_BioBERT_Fine-Tuning_NLP.ipynb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hardware requirements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU with 8GB+ VRAM (RTX 3060, V100, A100) for reasonable training times&lt;/li&gt;
&lt;li&gt;16GB+ system RAM&lt;/li&gt;
&lt;li&gt;Training takes ~2-4 hours on GPU, much longer on CPU&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building production-quality medical NLP systems requires more than throwing data at a pre-trained model. By combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evidence-focused training&lt;/li&gt;
&lt;li&gt;Strategic label filtering&lt;/li&gt;
&lt;li&gt;Back-translation augmentation&lt;/li&gt;
&lt;li&gt;LoRA parameter-efficient fine-tuning&lt;/li&gt;
&lt;li&gt;Class-weighted loss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We transformed a failing system (2.3% F1) into one that performs at 94.4% F1—good enough for real-world deployment with human oversight.&lt;/p&gt;

&lt;p&gt;The techniques we've covered apply far beyond medical coding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Legal document analysis&lt;/strong&gt; (case law classification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scientific literature mining&lt;/strong&gt; (research topic categorization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer support&lt;/strong&gt; (ticket routing and classification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content moderation&lt;/strong&gt; (policy violation detection)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anywhere you face limited training data and class imbalance, this toolkit will serve you well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next time&lt;/strong&gt;, we'll see how AWS Comprehend Medical tackles the same problem without any training data at all—and explore when each approach makes sense.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What challenges have you faced when training NLP models on limited data? Share your experiences in the comments! And if you found this helpful, follow me for Part 2 where we dive into AWS Comprehend Medical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📚 Further Reading&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1901.08746" rel="noopener noreferrer"&gt;BioBERT Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://physionet.org/content/medcoder/1.0.0/" rel="noopener noreferrer"&gt;MedCodER Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/peft" rel="noopener noreferrer"&gt;Hugging Face PEFT Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Tags: #machinelearning #nlp #healthcare #python #biobert #transformers #medicalcoding #datascience&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>healthcare</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Reproducible n8n Environments with CLI-Based Configuration Management</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Wed, 22 Oct 2025 21:01:03 +0000</pubDate>
      <link>https://forem.com/alexretana/building-reproducible-n8n-environments-with-cli-based-configuration-management-2hi</link>
      <guid>https://forem.com/alexretana/building-reproducible-n8n-environments-with-cli-based-configuration-management-2hi</guid>
      <description>&lt;h1&gt;
  
  
  Building Reproducible n8n Environments with CLI-Based Configuration Management
&lt;/h1&gt;

&lt;p&gt;When you're building applications with n8n as a core component—not just using it as a standalone automation tool—you need a way to provision n8n instances with pre-configured credentials, workflows, and integrated services. This article shows you a pattern for creating fully reproducible n8n environments using the n8n CLI and environment variable substitution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem: n8n as an Application Component
&lt;/h2&gt;

&lt;p&gt;Most n8n tutorials focus on getting started quickly. But what if you're building an application where n8n is one piece of a larger system? You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible environments&lt;/strong&gt; - Same setup across dev, staging, production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-configured credentials&lt;/strong&gt; - Database connections ready to use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated services&lt;/strong&gt; - PostgreSQL for data storage, Redis for agent memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero manual setup&lt;/strong&gt; - No clicking through UIs to configure connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version controlled configuration&lt;/strong&gt; - Infrastructure as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about convenience—it's about treating n8n as a first-class component in your application stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: n8n CLI + Environment Variables
&lt;/h2&gt;

&lt;p&gt;n8n ships with CLI commands that are easy to overlook but handle exactly this kind of configuration management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Export all credentials to JSON&lt;/span&gt;
n8n &lt;span class="nb"&gt;export&lt;/span&gt;:credentials &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;creds.json

&lt;span class="c"&gt;# Import credentials from JSON&lt;/span&gt;
n8n import:credentials &lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;creds.json

&lt;span class="c"&gt;# Export workflows&lt;/span&gt;
n8n &lt;span class="nb"&gt;export&lt;/span&gt;:workflow &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;workflows.json

&lt;span class="c"&gt;# Import workflows&lt;/span&gt;
n8n import:workflow &lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;workflows.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These commands are the foundation of reproducible n8n deployments. But there's a problem: exported credentials contain hardcoded values. If you export a PostgreSQL credential, it has a specific password baked in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The envsubst Trick
&lt;/h2&gt;

&lt;p&gt;Here's the key insight: we can use &lt;code&gt;envsubst&lt;/code&gt; to transform n8n credential exports into templates with environment variable placeholders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Export a credential manually (one time)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;n8n &lt;span class="nb"&gt;export&lt;/span&gt;:credentials &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;creds.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Replace hardcoded values with environment variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transform this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres_local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local PostgreSQL Database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"some_hardcoded_password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"n8n_user"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Into this template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${N8N_POSTGRES_CREDENTIAL_ID}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local PostgreSQL Database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_HOST}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_PASSWORD}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_USER}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_DB}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;POSTGRES_PORT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Use envsubst to substitute at runtime&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;envsubst &amp;lt; creds.json.template &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; creds.json
n8n import:credentials &lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;creds.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your credentials are environment-driven. Same template works in dev, staging, production—just different environment variables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: Building Applications with n8n
&lt;/h2&gt;

&lt;p&gt;This pattern unlocks powerful use cases:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;AI Agents with Redis Memory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;n8n's AI Agent node has a "Simple Memory" option that stores conversation history in n8n's database. For production AI applications, you want Redis instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster access times&lt;/li&gt;
&lt;li&gt;Better scalability&lt;/li&gt;
&lt;li&gt;Shared memory across n8n instances&lt;/li&gt;
&lt;li&gt;TTL-based conversation expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our pattern, you can provision n8n with Redis credentials already configured. Your workflows can immediately use Redis memory nodes without manual setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Multi-Tenant SaaS Applications&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Spin up isolated n8n instances per customer, each with credentials for their dedicated database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CUSTOMER_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_db_password"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REDIS_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CUSTOMER_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_redis_password"&lt;/span&gt;
./provision-n8n.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Development Environment Parity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every developer gets the same n8n setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone your-app
./init-credentials.sh  &lt;span class="c"&gt;# Generate dev credentials&lt;/span&gt;
./start.sh             &lt;span class="c"&gt;# Everything works&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Demo: Dockerized n8n with PostgreSQL and Redis
&lt;/h2&gt;

&lt;p&gt;Our example repository demonstrates this pattern with a complete Docker Compose setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Docker Compose Environment             │
│                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────┐ │
│  │ n8n      │→ │ Postgres │  │Redis │ │
│  │          │  │          │  │      │ │
│  │ - Creds  │  │ (auto    │  │(auto │ │
│  │   auto-  │  │  config) │  │ cfg) │ │
│  │  imported│  │          │  │      │ │
│  └──────────┘  └──────────┘  └──────┘ │
│       ↑                                 │
│       │ envsubst at startup             │
│       │                                 │
│  ┌─────────────────────────────┐       │
│  │ .env (generated secrets)    │       │
│  │ POSTGRES_PASSWORD=&amp;lt;random&amp;gt;  │       │
│  │ REDIS_PASSWORD=&amp;lt;random&amp;gt;     │       │
│  └─────────────────────────────┘       │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Generate Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./init-credentials.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script creates a &lt;code&gt;.env&lt;/code&gt; file with generated secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 48 | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"=+/"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c1-32&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REDIS_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 48 | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"=+/"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c1-32&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;N8N_ENCRYPTION_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-hex&lt;/span&gt; 32&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
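As a sanity check on that pipeline: 48 random bytes base64-encode to 64 characters, so even after stripping `=`, `+`, and `/` there are comfortably more than the 32 characters `cut` keeps, and the result never needs shell quoting:

```shell
# The generator pattern from init-credentials.sh: 64 base64 chars,
# minus any "=+/" occurrences, truncated to 32. Always alphanumeric.
pw=$(openssl rand -base64 48 | tr -d "=+/" | cut -c1-32)
printf '%s\n' "$pw"
```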



&lt;p&gt;&lt;strong&gt;2. Start with Exported Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the critical part. Docker Compose can use &lt;code&gt;${VARIABLE}&lt;/code&gt; syntax, but only if those variables are &lt;strong&gt;exported&lt;/strong&gt; to the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# start.sh&lt;/span&gt;

&lt;span class="c"&gt;# This is the key - export all variables&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .env
&lt;span class="nb"&gt;set&lt;/span&gt; +a

docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Gotcha:&lt;/strong&gt; If you just run &lt;code&gt;docker-compose up&lt;/code&gt; without exporting variables, your &lt;code&gt;docker-compose.yml&lt;/code&gt; file won't resolve &lt;code&gt;${POSTGRES_PASSWORD}&lt;/code&gt; and services will fail to authenticate.&lt;/p&gt;
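The behavior is easy to reproduce with a child shell standing in for Docker Compose (the variable name and file below are purely illustrative):

```shell
# Plain `source` creates shell-local variables; a child process (like
# docker-compose) inherits only *exported* ones. `set -a` marks every
# variable assigned while it is active for export.
cd "$(mktemp -d)"
printf 'DEMO_PASSWORD=s3cret\n' > .env

. ./.env
sh -c 'printf "%s" "$DEMO_PASSWORD"'   # prints nothing

set -a
. ./.env
set +a
sh -c 'printf "%s" "$DEMO_PASSWORD"'   # prints s3cret
```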

&lt;p&gt;&lt;strong&gt;3. Docker Compose Provisions Services&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15-alpine&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;  &lt;span class="c1"&gt;# Resolved from exported env&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_USER}&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_DB}&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:7-alpine&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis-server --requirepass ${REDIS_PASSWORD}&lt;/span&gt;  &lt;span class="c1"&gt;# Also resolved&lt;/span&gt;

  &lt;span class="na"&gt;n8n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./n8n&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DB_POSTGRESDB_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${POSTGRES_PASSWORD}&lt;/span&gt;  &lt;span class="c1"&gt;# Same password&lt;/span&gt;
      &lt;span class="c1"&gt;# ... other config&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
      &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. n8n Container Auto-Imports Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On first startup, our custom entrypoint renders the credential template with &lt;code&gt;envsubst&lt;/code&gt; and imports the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# n8n/entrypoint.sh&lt;/span&gt;
envsubst &amp;lt; /data/workflow_creds/decrypt_creds.json.template &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /data/workflow_creds/decrypt_creds.json
n8n import:credentials &lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data/workflow_creds/decrypt_creds.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now n8n has credentials for PostgreSQL and Redis, using the same passwords Docker Compose used to provision those services.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Credential Template
&lt;/h2&gt;

&lt;p&gt;Here's what the template looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${N8N_POSTGRES_CREDENTIAL_ID}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local PostgreSQL Database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_HOST}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_DB}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_USER}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${POSTGRES_PASSWORD}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;POSTGRES_PORT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ssl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disable"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${N8N_REDIS_CREDENTIAL_ID}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local Redis Cache"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${REDIS_PASSWORD}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${REDIS_HOST}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;REDIS_PORT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"redis"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This template was originally created by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually creating credentials in n8n&lt;/li&gt;
&lt;li&gt;Exporting them with &lt;code&gt;n8n export:credentials&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replacing values with &lt;code&gt;${VARIABLE}&lt;/code&gt; placeholders&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now it's reusable across all environments.&lt;/p&gt;
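Step 3 in that list can be scripted rather than hand-edited. A sed pass like this one (the dev values shown are illustrative stand-ins for whatever your export contains) produces the template:

```shell
# Replace known exported values with ${...} placeholders. Single-quoted
# sed expressions keep the placeholders literal for envsubst later.
cd "$(mktemp -d)"
printf '{"password": "some_hardcoded_password", "user": "n8n_user"}\n' > creds.json

sed -e 's/"some_hardcoded_password"/"${POSTGRES_PASSWORD}"/' \
    -e 's/"n8n_user"/"${POSTGRES_USER}"/' \
    creds.json > creds.json.template
```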

&lt;h2&gt;
  
  
  Practical Use Case: Redis for AI Agent Memory
&lt;/h2&gt;

&lt;p&gt;This setup shines when building AI agents with n8n. Instead of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Agent Node → Simple Memory (in n8n database)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Agent Node → Redis Chat Memory Node → Redis (provisioned)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence across restarts&lt;/strong&gt; - Conversation history survives n8n restarts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared state&lt;/strong&gt; - Multiple n8n instances share conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; - Redis is optimized for this access pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; - Redis can handle millions of conversation threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the Redis credential is already configured—no manual setup required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending the Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Add MongoDB
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Add to &lt;code&gt;.env.example&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MONGO_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;__GENERATED__
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Update credential template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mongo_local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local MongoDB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${MONGO_HOST}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${MONGO_PASSWORD}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${MONGO_USER}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${MONGO_DB}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mongoDb"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Add to &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mongo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:7&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;MONGO_INITDB_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${MONGO_PASSWORD}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Import Workflows Too
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In entrypoint.sh, after importing credentials:&lt;/span&gt;
n8n import:workflow &lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data/workflows/default-workflows.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you ship n8n with pre-built workflows for your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Repository
&lt;/h2&gt;

&lt;p&gt;The complete implementation is available at:&lt;br&gt;
&lt;a href="https://github.com/alexretana/n8n-docker-demo" rel="noopener noreferrer"&gt;github.com/alexretana/n8n-docker-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use it as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A starting point&lt;/strong&gt; for your own n8n-based applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A reference&lt;/strong&gt; for the envsubst + CLI pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A foundation&lt;/strong&gt; to build on&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/alexretana/n8n-docker-demo
&lt;span class="nb"&gt;cd &lt;/span&gt;n8n-docker-demo

./init-credentials.sh  &lt;span class="c"&gt;# Generate secrets&lt;/span&gt;
./start.sh            &lt;span class="c"&gt;# Start everything&lt;/span&gt;

&lt;span class="c"&gt;# Access n8n at http://localhost:5678&lt;/span&gt;
&lt;span class="c"&gt;# PostgreSQL and Redis credentials already configured&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building Applications Around n8n
&lt;/h2&gt;

&lt;p&gt;This pattern enables you to build applications where n8n is a component, not the entire application. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SaaS platforms&lt;/strong&gt; - n8n handles workflow orchestration, your app handles user management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI applications&lt;/strong&gt; - n8n orchestrates AI agents, Redis stores conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data pipelines&lt;/strong&gt; - n8n coordinates ETL, PostgreSQL stores results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools&lt;/strong&gt; - n8n automates business processes, your app provides the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is treating n8n configuration as code. With the CLI + envsubst pattern, your n8n setup becomes reproducible, version-controlled, and automatable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Invitation
&lt;/h2&gt;

&lt;p&gt;The repository is open source and free to use. But more than that—I'd love to see what you build.&lt;/p&gt;

&lt;p&gt;If you're creating an application around n8n:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fork the repo&lt;/strong&gt; and adapt it to your needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share what you're building&lt;/strong&gt; - open an issue or discussion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribute improvements&lt;/strong&gt; - PRs welcome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask questions&lt;/strong&gt; - I'm happy to help&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern shown here is just a starting point. The real value comes from what you build on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;n8n CLI is powerful&lt;/strong&gt; - Use &lt;code&gt;export&lt;/code&gt; and &lt;code&gt;import&lt;/code&gt; commands for configuration management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;envsubst bridges the gap&lt;/strong&gt; - Transform exported configs into reusable templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export variables in start scripts&lt;/strong&gt; - Docker Compose needs them exported, not just in .env&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat n8n as a component&lt;/strong&gt; - Build applications around it, not just use it standalone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration as code&lt;/strong&gt; - Make your n8n setup reproducible and version-controlled&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Happy building!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.n8n.io/hosting/cli-commands/" rel="noopener noreferrer"&gt;n8n CLI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/compose/environment-variables/" rel="noopener noreferrer"&gt;Docker Compose Variable Substitution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gnu.org/software/gettext/manual/html_node/envsubst-Invocation.html" rel="noopener noreferrer"&gt;envsubst Manual&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #n8n #automation #docker #devops #postgresql #redis #ai #reproducibility&lt;/p&gt;

</description>
      <category>n8n</category>
      <category>docker</category>
      <category>postgres</category>
      <category>redis</category>
    </item>
    <item>
      <title>Build an AI Research Archivist with n8n: Stop Researching the Same Topics Twice</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Fri, 10 Oct 2025 19:58:02 +0000</pubDate>
      <link>https://forem.com/alexretana/build-an-ai-research-archivist-with-n8n-stop-researching-the-same-topics-twice-4k06</link>
      <guid>https://forem.com/alexretana/build-an-ai-research-archivist-with-n8n-stop-researching-the-same-topics-twice-4k06</guid>
      <description>&lt;h1&gt;
  
  
  Build an AI Research Archivist with n8n: Stop Researching the Same Topics Twice
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The $15K Problem You Didn't Know You Had
&lt;/h2&gt;

&lt;p&gt;Picture this: It's Tuesday morning, and you're diving into researching authentication patterns for your new microservices architecture. You spend two hours reading articles, comparing approaches, and documenting your findings in a scattered collection of browser tabs and sticky notes.&lt;/p&gt;

&lt;p&gt;Fast forward three months. A colleague asks about authentication strategies. You vaguely remember researching this, but where did you save those findings? What were the key takeaways? You end up starting from scratch.&lt;/p&gt;

&lt;p&gt;Studies show that knowledge workers waste nearly &lt;strong&gt;6 hours per week&lt;/strong&gt; duplicating research efforts. For a developer making $80K annually, that's roughly &lt;strong&gt;$15,000 in wasted productivity every year&lt;/strong&gt;. Multiply that across a team, and the numbers become staggering.&lt;/p&gt;

&lt;p&gt;The solution isn't another note-taking app—it's an intelligent system that actively prevents duplicate research by checking what you've already investigated before conducting new searches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you'll build a &lt;strong&gt;Research Archivist Agent&lt;/strong&gt; using n8n that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks your existing research archive before conducting new searches&lt;/li&gt;
&lt;li&gt;Uses Perplexity AI for high-quality research synthesis&lt;/li&gt;
&lt;li&gt;Automatically stores findings in Google Sheets with proper citations&lt;/li&gt;
&lt;li&gt;Maintains searchable keywords for easy retrieval&lt;/li&gt;
&lt;li&gt;Guides users through a structured research workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; (workflow automation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic Claude Sonnet 4.5&lt;/strong&gt; (agent orchestration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity AI&lt;/strong&gt; (research tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Sheets&lt;/strong&gt; (knowledge archive)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n8n instance (&lt;a href="https://n8n.io/" rel="noopener noreferrer"&gt;cloud&lt;/a&gt; or &lt;a href="https://docs.n8n.io/hosting/" rel="noopener noreferrer"&gt;self-hosted&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.anthropic.com/" rel="noopener noreferrer"&gt;Anthropic API key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.perplexity.ai/" rel="noopener noreferrer"&gt;Perplexity API key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost estimate:&lt;/strong&gt; ~$5-10/month for API usage with moderate research volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set Up Your Knowledge Archive
&lt;/h2&gt;

&lt;p&gt;Create a new Google Sheet with these columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document Name | Document Content | Reference Link | Research Date | Keywords
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this structure?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Document Name&lt;/strong&gt;: Human-readable identifier for quick scanning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Content&lt;/strong&gt;: Summary of findings (not full articles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference Link&lt;/strong&gt;: Source URL for verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research Date&lt;/strong&gt;: Helps identify outdated research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keywords&lt;/strong&gt;: Enables semantic search across topics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save the Sheet URL—you'll need it for the n8n workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Import the n8n Template
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Download the template from the &lt;a href="https://github.com/alexretana/n8n-simple-archivist-template" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;In n8n, go to &lt;strong&gt;Workflows&lt;/strong&gt; → &lt;strong&gt;Import from File&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select &lt;code&gt;Archivist Agent Template.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll see seven nodes connected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chat Trigger → Archivist Agent → Claude Model
                        ↓
              [Simple Memory]
                        ↓
        ┌───────────────┴───────────────┐
        ↓                               ↓
  Perplexity Tool            Google Sheets Tools (x2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Configure Credentials
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Anthropic API
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Anthropic Chat Model&lt;/strong&gt; node&lt;/li&gt;
&lt;li&gt;Create credential → Enter your API key&lt;/li&gt;
&lt;li&gt;Ensure model is &lt;code&gt;claude-sonnet-4-5-20250929&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Perplexity API
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Message a model in Perplexity&lt;/strong&gt; node&lt;/li&gt;
&lt;li&gt;Create credential → Enter your API key&lt;/li&gt;
&lt;li&gt;Keep model as &lt;code&gt;sonar-pro&lt;/code&gt; for best research quality&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Google Sheets
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click either Google Sheets node&lt;/li&gt;
&lt;li&gt;Create credential → Select &lt;strong&gt;OAuth2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Follow Google's authorization flow&lt;/li&gt;
&lt;li&gt;Paste your Sheet URL in both nodes:

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Get row(s) in sheet&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Append or update row&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 4: Understanding the Agent System Prompt
&lt;/h2&gt;

&lt;p&gt;The core intelligence comes from the system prompt in the &lt;strong&gt;Archivist Agent&lt;/strong&gt; node. Here's what makes it work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Workflow Process&lt;/span&gt;

&lt;span class="gu"&gt;### Phase 1: Initial Check&lt;/span&gt;
When a user requests research:
&lt;span class="p"&gt;1.&lt;/span&gt; Search existing archive using "Get row(s) in sheet"
&lt;span class="p"&gt;2.&lt;/span&gt; If found, present existing research
&lt;span class="p"&gt;3.&lt;/span&gt; Confirm if user wants updated information

&lt;span class="gu"&gt;### Phase 2: New Research&lt;/span&gt;
If no existing research found:
&lt;span class="p"&gt;1.&lt;/span&gt; Conduct research using Perplexity AI
&lt;span class="p"&gt;2.&lt;/span&gt; Summarize findings
&lt;span class="p"&gt;3.&lt;/span&gt; Store in archive
&lt;span class="p"&gt;4.&lt;/span&gt; Provide summary to user

&lt;span class="gu"&gt;### Phase 3: Archive Management&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Search and retrieve specific topics
&lt;span class="p"&gt;-&lt;/span&gt; Update entries when needed
&lt;span class="p"&gt;-&lt;/span&gt; Organize content
&lt;span class="p"&gt;-&lt;/span&gt; Remove duplicates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This three-phase approach ensures you &lt;strong&gt;never research the same topic twice&lt;/strong&gt; unless you explicitly need updated information.&lt;/p&gt;
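&lt;p&gt;To make the check-before-research behavior concrete, here is a plain-JS sketch of the same decision (the real logic lives in the system prompt; the &lt;code&gt;archive&lt;/code&gt; and &lt;code&gt;research&lt;/code&gt; parameters are illustrative stand-ins for the Google Sheets and Perplexity tools):&lt;/p&gt;

```javascript
// Illustrative sketch only: in the actual workflow, the agent performs
// this logic by calling the Google Sheets and Perplexity tool nodes.
async function handleRequest(topic, archive, research) {
  // Phase 1: check the archive before doing any new research
  const existing = archive.find((row) => row["Document Name"] === topic);
  if (existing) return existing;

  // Phase 2: no hit, so research the topic, store it, and return the entry
  const findings = await research(topic);
  const entry = { "Document Name": topic, "Document Content": findings };
  archive.push(entry);
  return entry;
}
```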

&lt;h2&gt;
  
  
  Step 5: Test Your Agent
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt; and &lt;strong&gt;Activate&lt;/strong&gt; the workflow&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Chat&lt;/strong&gt; button (webhook icon on the trigger node)&lt;/li&gt;
&lt;li&gt;Try these test queries:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;First research request:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Research the benefits of edge computing for web applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the archive (empty for first run)&lt;/li&gt;
&lt;li&gt;Conduct Perplexity research&lt;/li&gt;
&lt;li&gt;Store findings in your Sheet&lt;/li&gt;
&lt;li&gt;Return a summary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Duplicate check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What do we have on edge computing?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find your previous research&lt;/li&gt;
&lt;li&gt;Present existing findings&lt;/li&gt;
&lt;li&gt;Ask if you want updated research&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 6: Advanced Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Adjust Memory Window
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Simple Memory&lt;/strong&gt; node stores conversation context. Default is 15 messages. Increase for longer research sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;contextWindowLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;// stores last 30 messages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Customize Research Depth
&lt;/h3&gt;

&lt;p&gt;In the Perplexity node, adjust for different research needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Quick facts&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonar&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;// Deep research (recommended)&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonar-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add Search Filters
&lt;/h3&gt;

&lt;p&gt;Modify the Google Sheets search node to filter by date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Only search research from last 6 months&lt;/span&gt;
&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Research Date &amp;gt;= DATE(2024, 4, 1)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Usage Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Daily Standup Research
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What research do we have on our current sprint topics?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Technical Decision Making
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Compare our previous research on GraphQL vs REST APIs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Onboarding New Developers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Find all research related to our authentication architecture"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Knowledge Transfer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What did we learn about database sharding last quarter?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; The agent researches instead of checking the archive first&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Verify your Google Sheets credentials, and confirm that the Sheet URL includes the sheet tab name&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Perplexity returns generic results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Craft more specific queries. Bad: "web security." Good: "OWASP top 10 mitigation strategies for Node.js REST APIs."&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Duplicate entries appearing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use consistent naming conventions. Create a naming guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ "JWT Authentication Best Practices"&lt;/li&gt;
&lt;li&gt;❌ "jwt auth", "JWT stuff", "authentication research"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scaling Your Archive
&lt;/h2&gt;

&lt;p&gt;As your knowledge base grows, consider these enhancements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Add Tagging System&lt;/strong&gt;&lt;br&gt;
Add a "Tags" column with comma-separated values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tags: authentication, security, nodejs, jwt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
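&lt;p&gt;A minimal sketch of how an agent (or an n8n Code node) could match a search term against that comma-separated column; the row shape here is assumed:&lt;/p&gt;

```javascript
// Split the Tags cell and check for a case-insensitive match.
function matchesTag(row, query) {
  const tags = (row.Tags || "")
    .split(",")
    .map((tag) => tag.trim().toLowerCase());
  return tags.includes(query.trim().toLowerCase());
}
```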



&lt;p&gt;&lt;strong&gt;2. Create Research Templates&lt;/strong&gt;&lt;br&gt;
Define standard research formats for common topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical Comparisons: Pros, Cons, Performance, Cost&lt;/li&gt;
&lt;li&gt;Tool Evaluations: Features, Integration, Community, Pricing&lt;/li&gt;
&lt;li&gt;Best Practices: Pattern, When to Use, Common Pitfalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Implement Version Control&lt;/strong&gt;&lt;br&gt;
Track research updates by adding columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Version | Last Updated By | Change Summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extension Challenge: Build a Weekly Digest
&lt;/h2&gt;

&lt;p&gt;Ready to level up? Here's your challenge: &lt;strong&gt;Create an automated weekly research digest&lt;/strong&gt; that emails you a summary of all research conducted in the past week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hints:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a &lt;strong&gt;Schedule Trigger&lt;/strong&gt; node that runs weekly&lt;/li&gt;
&lt;li&gt;Query Google Sheets for entries from the last 7 days&lt;/li&gt;
&lt;li&gt;Use Claude to generate a formatted summary&lt;/li&gt;
&lt;li&gt;Send via &lt;strong&gt;Gmail&lt;/strong&gt; or &lt;strong&gt;SendGrid&lt;/strong&gt; node&lt;/li&gt;
&lt;/ol&gt;
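&lt;p&gt;Hint 2 could look like this in a Code node (a sketch under the assumption that rows arrive with the &lt;code&gt;Research Date&lt;/code&gt; column shown earlier):&lt;/p&gt;

```javascript
// Keep only archive rows whose Research Date falls in the last 7 days.
function rowsFromLastWeek(rows, now = new Date()) {
  const cutoff = new Date(now.getTime() - 7 * 24 * 60 * 60 * 1000);
  return rows.filter((row) => new Date(row["Research Date"]) >= cutoff);
}
```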

&lt;p&gt;&lt;strong&gt;Bonus points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Include most-searched keywords&lt;/li&gt;
&lt;li&gt;Highlight research gaps (topics with old data)&lt;/li&gt;
&lt;li&gt;Add "Related research suggestions" using Claude&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Share your solution!&lt;/strong&gt; Post your workflow to the &lt;a href="https://community.n8n.io/" rel="noopener noreferrer"&gt;n8n community&lt;/a&gt; or tweet it with #n8n and tag me—I'd love to see what you build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Personal Knowledge Management isn't just productivity theater—it's a competitive advantage. When you can instantly recall research insights from six months ago, you make faster decisions. When your team shares a searchable knowledge archive, you eliminate duplicate work and accelerate onboarding.&lt;/p&gt;

&lt;p&gt;The Research Archivist Agent isn't just a tool—it's a mindset shift from "search and forget" to "research once, reference forever."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/alexretana/n8n-simple-archivist-template" rel="noopener noreferrer"&gt;Clone the repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Set up your workflow today&lt;/li&gt;
&lt;li&gt;Research your first topic&lt;/li&gt;
&lt;li&gt;Watch your knowledge compound&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three months from now, you'll have a valuable archive of research that would have otherwise been lost to browser history and forgotten bookmarks.&lt;/p&gt;

&lt;p&gt;What will you research first?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this helpful? Drop a ❤️ and share it with your team. Have questions or improvements? Drop them in the comments below—I read and respond to every one.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>agents</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Optimizing Multi-Agent Workflows in n8n: A Context-Aware Approach to Agent Handoffs</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Wed, 01 Oct 2025 19:22:53 +0000</pubDate>
      <link>https://forem.com/alexretana/optimizing-multi-agent-workflows-in-n8n-a-context-aware-approach-to-agent-handoffs-1hc4</link>
      <guid>https://forem.com/alexretana/optimizing-multi-agent-workflows-in-n8n-a-context-aware-approach-to-agent-handoffs-1hc4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqshko5zarwhi2wh839s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqshko5zarwhi2wh839s.png" alt="Sequential agent handoff workflow" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4niw8i2067zht19r9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4niw8i2067zht19r9t.png" alt="Master agent with switch routing" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Optimizing Multi-Agent Workflows in n8n: A Context-Aware Approach to Agent Handoffs
&lt;/h1&gt;

&lt;p&gt;When working with multi-agent systems like the BMAD (Big Model, Agent Design) pattern, context window management becomes critical for model performance and cost efficiency. While you could dump an entire agent bundle into a Claude Project and let it figure things out, you'll quickly burn through tokens on instruction sets that may never be relevant to the current task.&lt;/p&gt;

&lt;p&gt;This tutorial demonstrates how to build intelligent agent routing in n8n—the popular node-based automation tool—that maintains tight control over context and enables direct user-to-subagent communication without wasteful token processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Traditional approaches to multi-agent orchestration often suffer from two key problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context bloat&lt;/strong&gt;: Loading all agent instructions upfront wastes tokens on irrelevant context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indirect communication&lt;/strong&gt;: Routing everything through a master agent doubles processing costs and adds latency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While Claude Projects offers solutions like separating master instructions from agent definitions and using RAG for knowledge retrieval, building a custom workflow in n8n gives you explicit control over data flow and context management. This pattern extends beyond chatbots—use it anywhere you need task-specific agents with optimized context windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build: Two Demonstration Workflows
&lt;/h2&gt;

&lt;p&gt;I've created two n8n workflows that progressively demonstrate agent handoff patterns. Both use intentionally simple agent instructions to focus on the routing mechanics, but these patterns scale to complex production systems.&lt;/p&gt;

&lt;p&gt;You can copy the templates to import into your own n8n instance from my GitHub repo: &lt;a href="https://github.com/alexretana/n8n-multi-agent-handoff-templates" rel="noopener noreferrer"&gt;N8n Multi Agent Handoff Templates&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo 1: Sequential Agent Pass-Through
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqshko5zarwhi2wh839s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqshko5zarwhi2wh839s.png" alt="Sequential agent handoff workflow" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This workflow demonstrates the fundamental pattern: &lt;strong&gt;how to pass control from one agent to another&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chat Trigger&lt;/strong&gt; receives the user message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent 1&lt;/strong&gt; processes the input with access to:

&lt;ul&gt;
&lt;li&gt;OpenAI GPT-4.1-mini (shared language model)&lt;/li&gt;
&lt;li&gt;Simple Memory (conversation history)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Agent 1 outputs to &lt;strong&gt;two&lt;/strong&gt; destinations simultaneously:

&lt;ul&gt;
&lt;li&gt;"Respond to Chat" node (user feedback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent 2&lt;/strong&gt; (next agent in chain)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kqs4e9p8uwaa8lk4sam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kqs4e9p8uwaa8lk4sam.png" alt=" " width="800" height="909"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent 2&lt;/strong&gt; receives Agent 1's output via the prompt template:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   {{ $json.output || $json.chatInput }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This expression handles both the initial user input and subsequent agent outputs.&lt;/p&gt;
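&lt;p&gt;Outside of n8n, the same fallback can be sketched in plain JS (the &lt;code&gt;output&lt;/code&gt; and &lt;code&gt;chatInput&lt;/code&gt; fields mirror what the agent and chat trigger nodes emit):&lt;/p&gt;

```javascript
// On the first turn only the chat trigger's chatInput exists; on a
// handoff, the upstream agent has set output, which takes precedence.
function resolvePrompt(json) {
  return json.output || json.chatInput;
}
```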

&lt;ol start="5"&gt;
&lt;li&gt;Agent 2 responds back to the user through "Respond to Chat1"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn4x6j2o8wpeizy3jbxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn4x6j2o8wpeizy3jbxc.png" alt=" " width="800" height="924"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;The loop continues: Agent 2's response feeds back into itself. As you can see below, when I ask which agent it is, it answers that it is Agent 2, without routing through Agent 1 (Agent 1 is never messaged again).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2ixuw0cv17zmtox1orr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2ixuw0cv17zmtox1orr.png" alt=" " width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key architectural decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory&lt;/strong&gt;: Both agents use the same Simple Memory node to maintain conversation continuity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared model&lt;/strong&gt;: Single OpenAI connection reduces configuration overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branching output&lt;/strong&gt;: Agent 1 uses n8n's multiple output connections to respond AND handoff simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code reference&lt;/strong&gt; (from &lt;code&gt;Demonstrate Agent Pass Off.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"promptType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"define"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"={{ $json.output ||  $json.chatInput}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"systemMessage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are Agent 2. If you're asked to respond to the chat with what agent you are, just say &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Yes, I'm Agent 2&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@n8n/n8n-nodes-langchain.agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI Agent 2"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Demo 2: Dynamic Routing with Master Agent
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4niw8i2067zht19r9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4niw8i2067zht19r9t.png" alt="Master agent with switch routing" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This workflow adds &lt;strong&gt;intelligent routing&lt;/strong&gt;: a master agent decides which specialized agent should handle each request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chat Trigger&lt;/strong&gt; → &lt;strong&gt;AI Master Agent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Master Agent analyzes the request and outputs &lt;strong&gt;structured JSON&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"direct_response_to_user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'm routing you to Agent 2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"agent_to_route_to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent 2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"forwarded_message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User asked about X. Routing because Y."&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Map Master Agent's Response&lt;/strong&gt; extracts these fields using n8n expressions:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;   &lt;span class="nx"&gt;$json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseJson&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;agent_to_route_to&lt;/span&gt;
   &lt;span class="nx"&gt;$json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseJson&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;forwarded_message&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Data splits into two paths:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Master Agent Responds To Chat&lt;/strong&gt;: Sends the routing explanation to the user immediately, without waiting for the subagent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch Node&lt;/strong&gt;: Routes to Agent 1, 2, or 3 based on &lt;code&gt;agent_to_route_to&lt;/code&gt; value&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5uwezdove4726my928.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt5uwezdove4726my928.png" alt=" " width="800" height="914"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selected agent receives a &lt;strong&gt;contextualized prompt&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   User's Original Message:
   ${$('When chat message received').item.json.chatInput}

   Master Agent's message to you:
   ${$json.forwarded_message}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Agent responds through its dedicated chat node&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ndkfuw93lrymgcv9px4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ndkfuw93lrymgcv9px4.png" alt=" " width="800" height="929"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical differences from Demo 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolated memory&lt;/strong&gt;: Each agent (including Master) has separate memory nodes (Simple Memory1/2/3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context preservation&lt;/strong&gt;: The forwarded message includes both the original user input AND the master's routing rationale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution&lt;/strong&gt;: User gets immediate feedback while the selected agent processes in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Master Agent system prompt&lt;/strong&gt; (edited for clarity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are the Master Agent. You route user requests to the correct agent.

IMPORTANT: Output only valid JSON in this format:

{
  "direct_response_to_user": "I'm routing you to Agent 1",
  "agent_to_route_to": "Agent 1",
  "forwarded_message": "**Summary of user request and routing rationale**"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why structured output matters&lt;/strong&gt;: The JSON format enables programmatic routing via the Switch node. In production, you'd add validation to handle malformed responses.&lt;/p&gt;
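As a minimal sketch of that validation (Python used for illustration; the field names come from the system prompt above, while the fallback values and default route are my own assumptions, not part of the workflow):

```python
import json

REQUIRED_FIELDS = {"direct_response_to_user", "agent_to_route_to", "forwarded_message"}
VALID_AGENTS = {"Agent 1", "Agent 2", "Agent 3"}

def validate_routing(raw: str) -> dict:
    """Parse the Master Agent's output; fall back to a safe default
    if the JSON is malformed, incomplete, or names an unknown agent."""
    fallback = {
        "direct_response_to_user": "Sorry, I couldn't route that request.",
        "agent_to_route_to": "Agent 1",  # hypothetical default route
        "forwarded_message": "Routing failed; ask the user to rephrase.",
    }
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not REQUIRED_FIELDS.issubset(parsed) or parsed["agent_to_route_to"] not in VALID_AGENTS:
        return fallback
    return parsed
```

In n8n itself this logic would live in a Code node placed between the Master Agent and the Switch node.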

&lt;h2&gt;
  
  
  Implementation Details You Need to Know
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Context Window Optimization
&lt;/h3&gt;

&lt;p&gt;Each agent only loads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its own system prompt (~100-500 tokens)&lt;/li&gt;
&lt;li&gt;Relevant conversation history (window-buffered)&lt;/li&gt;
&lt;li&gt;The forwarded context from the master agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to loading all 3 agent instruction sets upfront—you'd waste thousands of tokens per request.&lt;/p&gt;
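To make the saving concrete, here is a back-of-the-envelope comparison (all token counts are illustrative assumptions, not measurements):

```python
# Illustrative per-request token budgets (assumed, not measured)
system_prompt_tokens = 400      # one agent's instruction set
history_tokens = 600            # window-buffered conversation history
forwarded_context_tokens = 100  # master agent's forwarded message
num_agents = 3

# Routing approach: only the selected agent's instructions are loaded
routed_cost = system_prompt_tokens + history_tokens + forwarded_context_tokens

# Naive approach: every agent's instructions loaded on every request
naive_cost = num_agents * system_prompt_tokens + history_tokens
```

Under these assumptions the routed request costs 1100 tokens versus 1800 for the naive approach, and the gap widens as agents (and their instruction sets) grow.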

&lt;h3&gt;
  
  
  The Switch Node Configuration
&lt;/h3&gt;

&lt;p&gt;The Switch node uses n8n's rule-based routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"leftValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"={{ $json.agent_to_route_to }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"rightValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent 2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"equals"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three rules match "Agent 1", "Agent 2", or "Agent 3" exactly. Unmatched requests fall through (you'd want error handling in production).&lt;/p&gt;
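The exact-match-plus-fallback behavior can be sketched like this (Python for illustration; in n8n this is the Switch node's rules plus its fallback output, and the branch names are hypothetical):

```python
# Map each exact agent name to its workflow branch (names are illustrative)
ROUTES = {
    "Agent 1": "agent1_branch",
    "Agent 2": "agent2_branch",
    "Agent 3": "agent3_branch",
}

def route(agent_to_route_to: str) -> str:
    """Exact-match routing mirroring the Switch node's three rules,
    with an explicit fallback branch for unmatched values."""
    return ROUTES.get(agent_to_route_to, "fallback_branch")
```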

&lt;h3&gt;
  
  
  Memory Architecture Trade-offs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Demo 1&lt;/strong&gt;: Shared memory allows agents to reference each other's outputs naturally, but blurs agent boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo 2&lt;/strong&gt;: Isolated memory per agent creates cleaner separation but requires explicit context passing via &lt;code&gt;forwarded_message&lt;/code&gt;. This scales better for specialized agents with distinct conversation contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running These Workflows
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Import the JSON files into n8n (both included with this post)&lt;/li&gt;
&lt;li&gt;Configure your OpenAI API credentials in the "OpenAI: gpt-4.1-mini" node&lt;/li&gt;
&lt;li&gt;Activate the workflow&lt;/li&gt;
&lt;li&gt;Open the chat interface via the webhook URL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Test prompts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Who are you?" (tests agent self-identification)&lt;/li&gt;
&lt;li&gt;"Pass me to Agent 2" (tests routing logic)&lt;/li&gt;
&lt;li&gt;"What did Agent 1 say?" (tests memory persistence)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The natural evolution is &lt;strong&gt;bidirectional routing&lt;/strong&gt;: subagents should be able to return control to the master when they complete their task. This creates a true orchestration layer where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Master Agent delegates to specialists&lt;/li&gt;
&lt;li&gt;Specialists execute and report back&lt;/li&gt;
&lt;li&gt;Master Agent synthesizes results or delegates further&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenge for you&lt;/strong&gt;: Can you modify Demo 2 to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let each subagent indicate completion in its output (maybe via JSON like the master)?&lt;/li&gt;
&lt;li&gt;Route completed tasks back to the Master Agent?&lt;/li&gt;
&lt;li&gt;Have the Master Agent decide whether to route again or provide a final response?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pattern mirrors how tools like LangGraph handle cyclic agent flows, but with explicit control over every transition.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Multi-agent systems in n8n benefit from explicit routing and context management. Use sequential pass-through (Demo 1) for simple pipelines; use master-agent routing with structured output (Demo 2) for dynamic task distribution. Both patterns dramatically reduce token waste compared to loading all agent instructions upfront. Next step: implement agent-to-master return logic for full orchestration loops.&lt;/p&gt;

&lt;p&gt;The workflows demonstrated here show that intelligent agent handoffs aren't magic—they're just careful data flow management. n8n's visual interface makes the logic transparent, which is invaluable when debugging complex agent interactions or optimizing for cost.&lt;/p&gt;

&lt;p&gt;Try implementing the return-to-master pattern yourself, and share your solution in the comments. What other agent routing patterns would be useful for your projects?&lt;/p&gt;

</description>
      <category>performance</category>
      <category>tutorial</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Streamlining MCP Management: Bundle Multiple Servers with FastMCP Proxies</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Tue, 23 Sep 2025 18:14:36 +0000</pubDate>
      <link>https://forem.com/alexretana/streamlining-mcp-management-bundle-multiple-servers-with-fastmcp-proxies-n3i</link>
      <guid>https://forem.com/alexretana/streamlining-mcp-management-bundle-multiple-servers-with-fastmcp-proxies-n3i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) servers have revolutionized how AI applications access external tools and data sources. From web browsing with Playwright to documentation search with Context7, MCPs provide a standardized way to extend AI capabilities beyond their training data.&lt;/p&gt;

&lt;p&gt;However, as the MCP ecosystem grows, managing multiple servers becomes increasingly complex. Each MCP server typically requires separate installation, configuration, and maintenance across different clients like Claude Desktop, Cursor, or Claude Code. This fragmentation creates several pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration sprawl&lt;/strong&gt;: Each client needs individual server configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency conflicts&lt;/strong&gt;: Different servers may require conflicting Python versions or packages
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource overhead&lt;/strong&gt;: Multiple server processes consume unnecessary system resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance burden&lt;/strong&gt;: Updates and troubleshooting multiply across installations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FastMCP's proxy capabilities solve these challenges by allowing you to bundle multiple MCP servers behind a single endpoint. Combined with FastMCP's CLI tools, you can easily deploy this unified proxy to any MCP client with a single command.&lt;/p&gt;

&lt;p&gt;I created a small github repo with example code if you'd like to follow along with it. &lt;a href="https://github.com/alexretana/FastMCP-Simple-Proxy-Bundling" rel="noopener noreferrer"&gt;alexretana/FastMCP-Simple-Proxy-Bundling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Warning&lt;/strong&gt;: While bundling MCPs is convenient, be mindful of tool overload. Providing too many tools to an MCP client can overwhelm the AI model and degrade performance. Start with essential tools and add more selectively based on your specific workflow needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install FastMCP
&lt;/h3&gt;

&lt;p&gt;The FastMCP docs recommend using &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; to install and manage FastMCP. You can install it directly with &lt;code&gt;uv pip&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using uv (recommended)&lt;/span&gt;
uv pip &lt;span class="nb"&gt;install &lt;/span&gt;fastmcp

&lt;span class="c"&gt;# Or using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastmcp

&lt;span class="c"&gt;# Add as a tool through uv (My preference)&lt;/span&gt;
uv tool &lt;span class="nb"&gt;install &lt;/span&gt;fastmcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install uv (Required for MCP Client Integration)
&lt;/h3&gt;

&lt;p&gt;FastMCP's CLI tools require &lt;code&gt;uv&lt;/code&gt; for dependency management when installing to MCP clients. Install &lt;code&gt;uv&lt;/code&gt; for your platform:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using PowerShell&lt;/span&gt;
powershell &lt;span class="nt"&gt;-ExecutionPolicy&lt;/span&gt; ByPass &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"irm https://astral.sh/uv/install.ps1 | iex"&lt;/span&gt;

&lt;span class="c"&gt;# Or using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;uv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using Homebrew (recommended)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;uv

&lt;span class="c"&gt;# Or using curl&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Linux:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using curl&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh

&lt;span class="c"&gt;# Or using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;uv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify Installation
&lt;/h3&gt;

&lt;p&gt;To verify that FastMCP is installed correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fastmcp version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;fastmcp version
FastMCP version:                           2.11.3
MCP version:                               1.12.4
Python version:                            3.12.2
Platform:            macOS-15.3.1-arm64-arm-64bit
FastMCP root path:            ~/Developer/fastmcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu1sy94jrug6zv0lqo7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu1sy94jrug6zv0lqo7j.png" alt="Screenshot: Terminal showing fastmcp version output with version details" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running with JSON Configuration
&lt;/h2&gt;

&lt;p&gt;FastMCP can run servers directly from JSON configuration files, making it easy to define and deploy multi-server setups. Let's create a configuration that bundles Context7 (documentation search) and Playwright (web automation) into a single endpoint.&lt;/p&gt;

&lt;p&gt;Create a file named &lt;code&gt;fastmcp.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gofastmcp.com/public/schemas/fastmcp.json/v1.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;gt;=3.10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"fastmcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"deployment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log_level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DEBUG"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context7"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@upstash/context7-mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--api-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"playwright"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@playwright/mcp@latest"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration breakdown&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;environment&lt;/strong&gt;: Specifies Python version and FastMCP dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deployment&lt;/strong&gt;: Sets transport method and logging level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mcpServers&lt;/strong&gt;: Defines the backend servers to proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replace &lt;code&gt;YOUR_API_KEY&lt;/code&gt; with your actual Context7 API key from &lt;a href="https://upstash.com/" rel="noopener noreferrer"&gt;Upstash&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now run the server using FastMCP CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fastmcp run fastmcp.json &lt;span class="nt"&gt;--transport&lt;/span&gt; http &lt;span class="nt"&gt;--host&lt;/span&gt; localhost &lt;span class="nt"&gt;--port&lt;/span&gt; 53456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: the values passed on the CLI override the options set in the &lt;code&gt;fastmcp.json&lt;/code&gt; file. I intentionally made them conflict to demonstrate this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7iijxtr2zku5qmtkbgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7iijxtr2zku5qmtkbgl.png" alt="Screenshot: Terminal showing FastMCP server starting up with log messages about loading both servers" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The server will start on &lt;code&gt;http://localhost:53456&lt;/code&gt; and automatically proxy requests to both Context7 and Playwright servers. You can test it by accessing the server endpoint directly or integrating it with MCP clients.&lt;/p&gt;

&lt;p&gt;Although I didn't demonstrate it here, the GitHub repo also includes a Dockerfile example if you need help getting started with dockerizing FastMCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Define a Python File and CLI Install Feature
&lt;/h2&gt;

&lt;p&gt;While JSON configuration works well for direct server execution, you might prefer a Python-based approach for more complex scenarios or better IDE support. Let's create a simple proxy server file.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;mcp-proxy.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="c1"&gt;# Your MCP servers configuration
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcpServers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@upstash/context7-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;playwright&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@playwright/mcp@latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create the proxy
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_proxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi-MCP-Proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Python file defines the same proxy configuration as our JSON, but in a more programmatic format that allows for easier customization and extension.&lt;/p&gt;
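For example, one benefit of the Python format is that secrets don't have to be hardcoded. A minimal sketch reading the Context7 key from an environment variable (the variable name &lt;code&gt;CONTEXT7_API_KEY&lt;/code&gt; is my own choice for this example, not a FastMCP convention):

```python
import os

# Read the Context7 key from the environment instead of hardcoding it.
# CONTEXT7_API_KEY is a name chosen for this example.
api_key = os.environ.get("CONTEXT7_API_KEY", "YOUR_API_KEY")

config = {
    "mcpServers": {
        "context7": {
            "command": "npx",
            "args": ["-y", "@upstash/context7-mcp", "--api-key", api_key],
        },
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
        },
    }
}
```

The resulting `config` dict can be passed to `FastMCP.as_proxy()` exactly as in `mcp-proxy.py` above.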

&lt;h3&gt;
  
  
  Installing in Claude Code
&lt;/h3&gt;

&lt;p&gt;FastMCP's CLI makes installation trivial. For Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fastmcp &lt;span class="nb"&gt;install &lt;/span&gt;claude-code mcp-proxy.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesh8c1zowo4a6a322wpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesh8c1zowo4a6a322wpq.png" alt="Screenshot: Terminal showing successful installation message for Claude Code" width="800" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command automatically configures Claude Code to run your proxy server with all necessary dependencies managed by &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing in Claude Desktop
&lt;/h3&gt;

&lt;p&gt;For Claude Desktop installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;fastmcp &lt;span class="nb"&gt;install &lt;/span&gt;claude-desktop mcp-proxy.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbi5r2b0f03233ctuak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbi5r2b0f03233ctuak.png" alt="Screenshot: Terminal showing successful installation message and configuration file path" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On a freshly installed Claude Desktop, this command failed for me because it couldn't find &lt;code&gt;claude_desktop_config.json&lt;/code&gt;. To fix this, go to the Settings &amp;gt; Developer tab and click 'Edit Config', which creates the file automatically. Then run the &lt;code&gt;fastmcp install&lt;/code&gt; command again, and it should succeed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84cem1cto6a1kbauq1ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84cem1cto6a1kbauq1ff.png" alt="Claude Desktop's Settings Developer page prior to installing config file" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The install command automatically updates Claude Desktop's configuration file with the proper server entry, including dependency management through &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing the Installation
&lt;/h3&gt;

&lt;p&gt;Open Claude Desktop (note: you will need to restart Claude Desktop after installing the FastMCP server) and verify the installation by asking it to search for FastMCP documentation. In my testing, this worked fine in Claude Code, but in Claude Desktop only Playwright worked; I couldn't get context7 to respond. Claude Code is generally better at using tools anyway, so don't lean too heavily on Claude Desktop for tool use. That said, context7 is quite useful during the planning phase of agentic development, so you'll probably still want it installed. Also, as of this writing, MCP tool support in Claude Desktop is a beta feature (so expect it to become more reliable over time), and my guess is that the team is focusing more on extensions, since Claude Desktop serves a more general audience than Claude Code's programmer base. That's just my speculation.&lt;/p&gt;

&lt;p&gt;Just to finish the demonstration, I temporarily removed context7 from &lt;code&gt;mcp-proxy.py&lt;/code&gt; to show what a working MCP tool looks like in Claude Desktop.&lt;/p&gt;

&lt;p&gt;From a new chat, you can click the plus button to see which MCP servers are available and what tools they expose. You can even enable or disable each server or tool individually. You should definitely leverage this feature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43t1xnfwv1p3kfpeplzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43t1xnfwv1p3kfpeplzz.png" alt="Screen shot of Claude Desktop showing the confirmation that tools are available and enabled from the installed FastMCP proxy server" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, to try out the tool, make a simple request like:&lt;br&gt;
"Using the playwright mcp tool, can you go to gofastmcp.com?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc0bulksdzhjkhw2ibbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc0bulksdzhjkhw2ibbd.png" alt="Screenshot: Claude Desktop interface showing a successful search result from Playwright with FastMCP documentation" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The response should show that Claude successfully used the Playwright tool to go to the FastMCP homepage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Create JSON or Python proxy configs, run with &lt;code&gt;fastmcp run&lt;/code&gt;, install to clients with &lt;code&gt;fastmcp install&lt;/code&gt;. Automatic dependency management via &lt;code&gt;uv&lt;/code&gt; handles the complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;FastMCP's proxy capabilities transform MCP server management from a fragmented, per-client configuration nightmare into a streamlined, centralized approach. By bundling multiple servers behind a single endpoint, you gain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplified deployment&lt;/strong&gt;: One proxy serves all your MCP tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent configuration&lt;/strong&gt;: Single source of truth across all clients
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource efficiency&lt;/strong&gt;: Fewer running processes and managed dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy maintenance&lt;/strong&gt;: Update proxy configuration once, benefit everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CLI tools make integration seamless—whether you prefer JSON configurations for simplicity or Python files for programmability, FastMCP handles the complexity of dependency management and client integration automatically.&lt;/p&gt;

&lt;p&gt;As the MCP ecosystem continues growing, this proxy pattern will become increasingly valuable for developers who want to harness multiple specialized tools without the operational overhead. Start with your most essential MCPs, test the performance impact, and gradually expand your toolkit as needed.&lt;/p&gt;

&lt;p&gt;Remember: the goal isn't to bundle every available MCP, but to create a curated, efficient collection that enhances your AI workflows without overwhelming the underlying models.&lt;/p&gt;

</description>
      <category>fastmcp</category>
      <category>claude</category>
      <category>mcp</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Job Pilot Chronicles: 94 Commits, 27 Days, and the Brutal Reality of AI-Assisted Development</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Tue, 16 Sep 2025 22:54:11 +0000</pubDate>
      <link>https://forem.com/alexretana/the-job-pilot-chronicles-94-commits-27-days-and-the-brutal-reality-of-ai-assisted-development-2cek</link>
      <guid>https://forem.com/alexretana/the-job-pilot-chronicles-94-commits-27-days-and-the-brutal-reality-of-ai-assisted-development-2cek</guid>
      <description>&lt;p&gt;&lt;em&gt;A brutally honest story of building a full-stack app in the AI age - where every "firts commit" typo and late-night debugging session reveals what we're really signing up for&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hook That Started Everything
&lt;/h2&gt;

&lt;p&gt;It was August 20th, 2025, 11:47 PM. I typed &lt;code&gt;git commit -m "firts commit"&lt;/code&gt; and hit enter.&lt;/p&gt;

&lt;p&gt;Yes, "firts." With a typo. Because apparently, even in the age of AI coding assistants that can write entire applications, I still can't spell "first" correctly when I'm excited about a new project.&lt;/p&gt;

&lt;p&gt;That typo-laden commit would become the first of 94 commits across 27 days - a journey that perfectly captures the paradox every developer faces in 2025: &lt;strong&gt;AI tools promise to make us faster and smarter, but somehow we're still debugging our own mistakes at 2 AM, wondering if we're more productive or just more confused.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl935q0clnd9a6fx8i1bg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl935q0clnd9a6fx8i1bg.jpg" alt="Ai Generated Image of a Desktop with contrasting warm and cool colors" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Don't Lie (But They Don't Tell the Whole Story Either)
&lt;/h2&gt;

&lt;p&gt;Let me hit you with the raw data from my git archaeology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;94 commits&lt;/strong&gt; in 27 days (that's 3.5 commits per day, for those keeping score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% solo development&lt;/strong&gt; (just me, my coffee, and an army of AI assistants)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak intensity&lt;/strong&gt;: 78 commits in August alone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top commit message keywords&lt;/strong&gt;: "api" (23 times), "implement" (21 times), "tests" (13 times)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's what those numbers &lt;em&gt;don't&lt;/em&gt; show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hours spent arguing with AI about why its "perfect" code wouldn't compile&lt;/li&gt;
&lt;li&gt;The number of times I copy-pasted AI suggestions that looked brilliant but were subtly, devastatingly wrong&lt;/li&gt;
&lt;li&gt;How many "implement frontend" commits were actually "please God just make this work" in disguise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sound familiar? You're not alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Great AI Coding Reality Check
&lt;/h2&gt;

&lt;p&gt;While I was grinding through Job Pilot, researchers were documenting what every developer using AI tools secretly knows but rarely admits: &lt;strong&gt;we're not actually as productive as we think we are.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recent studies show that 84% of developers now use AI coding assistants, but here's the kicker - experienced developers actually take &lt;strong&gt;19% longer&lt;/strong&gt; to complete tasks when using AI tools. We expected a 24% speedup. Instead, we got a productivity drag.&lt;/p&gt;

&lt;p&gt;Why? Because AI solutions are "almost right, but not quite" 45% of the time. And debugging almost-right code is somehow more painful than writing it from scratch.&lt;/p&gt;

&lt;p&gt;My commit history tells this exact story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act I: The Foundation Fantasy (August 20-25)
&lt;/h2&gt;

&lt;p&gt;After that infamous "firts commit," I did what any developer does when starting fresh - I immediately restructured everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;August 23rd&lt;/strong&gt;: &lt;em&gt;"Complete project restructure: app → backend"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This wasn't procrastination. This was wisdom. I was setting up proper separation of concerns before they became technical debt. The AI tools were great at generating boilerplate, but they had zero opinions about project architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;August 25th&lt;/strong&gt;: The API explosion began.&lt;/p&gt;

&lt;p&gt;In a single day, I built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication &amp;amp; Authorization endpoints&lt;/li&gt;
&lt;li&gt;Job listings CRUD&lt;/li&gt;
&lt;li&gt;User profiles API&lt;/li&gt;
&lt;li&gt;Companies API&lt;/li&gt;
&lt;li&gt;Job Applications API&lt;/li&gt;
&lt;li&gt;Resumes API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My commit messages from that day read like a developer's fever dream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"feat: Implement FastAPI backend structure with TDD approach"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Add comprehensive job listings endpoints with validation"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Implement user authentication with JWT tokens"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I felt unstoppable. The AI was cranking out endpoint after endpoint. I was a full-stack architect, building the future of job searching.&lt;/p&gt;

&lt;p&gt;Then I tried to connect it all together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act II: The API Renaissance Meets Reality (August 25-26)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;August 26th&lt;/strong&gt; was when I learned the first brutal lesson about AI-assisted development: &lt;strong&gt;AI is excellent at writing isolated components, terrible at making them work together.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While GitHub Copilot was suggesting perfect-looking API routes, and Claude was generating comprehensive test suites, none of them understood how my authentication middleware should interact with my database models, or why my job deduplication logic was creating infinite loops.&lt;/p&gt;

&lt;p&gt;The commit messages tell the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Fix authentication middleware integration"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Debug job deduplication infinite loop"&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Implement proper error handling across all endpoints"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgziwjwj11v1zjycesdc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgziwjwj11v1zjycesdc.jpg" alt="AI Generated Cartoon Developer in denial of surrounding flames" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This mirrors what 90% of developers report: AI tools struggle with large codebase context. They don't understand your existing patterns, your architectural decisions, or the subtle dependencies between your modules.&lt;/p&gt;

&lt;p&gt;I was becoming a translator between different AI suggestions, spending more time debugging AI-generated integration bugs than I would have spent just writing the damn thing myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act III: The Frontend Struggle is Real (August 27-30)
&lt;/h2&gt;

&lt;p&gt;And then came the frontend. Oh, the frontend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;August 27th&lt;/strong&gt;: &lt;em&gt;"Write out plan for migrating jsx component to new reworked frontend service"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Translation: "The AI generated a bunch of React components that look perfect in isolation but form a Frankenstein's monster when assembled."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;August 28th&lt;/strong&gt;: &lt;em&gt;"Restart on making the tsx rendering layer; using playwright mcp this time"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Translation: "Nothing works. Starting over. Again."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;August 30th&lt;/strong&gt;: The commits from this day perfectly capture the AI development experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"Ai wrote a bunch for some reason"&lt;/em&gt; (When you let Claude take the wheel and it generated 200 lines of code you didn't ask for)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Done; no human validation, hope this went well"&lt;/em&gt; (Every developer's prayer when using AI)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Finish reworking frontend; still bugs"&lt;/em&gt; (Brutal honesty)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the reality nobody talks about: &lt;strong&gt;AI tools excel at the easy stuff but struggle with the integration layer where real applications live.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Act IV: The Testing Enlightenment (Late August)
&lt;/h2&gt;

&lt;p&gt;Here's where my story diverges from the typical "AI made me super productive" narrative. When everything was falling apart, I doubled down on testing.&lt;/p&gt;

&lt;p&gt;Not because I'm some testing evangelist, but because &lt;strong&gt;testing was the only way to verify that AI suggestions actually worked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My commits became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Big effort to implement playwright testing"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Implement full-stack test suite, and first integration tests"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Create foundation for Playwright Testing"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Testing became my AI reality check. When Claude confidently assured me that its authentication flow was "production-ready," my tests caught the security holes. When Copilot generated database queries that looked elegant, my integration tests revealed they'd break under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the pattern successful AI-assisted developers follow&lt;/strong&gt;: Use AI for generation, humans for validation, and tests for truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Act V: The September Reset (The Plot Twist)
&lt;/h2&gt;

&lt;p&gt;After August's 78-commit marathon, something interesting happened. I burned out. Hard.&lt;/p&gt;

&lt;p&gt;But instead of abandoning the project, I did something that separates experienced developers from beginners: &lt;strong&gt;I made a strategic reset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;September 15th&lt;/strong&gt;: &lt;em&gt;"Branch reset: Reset main branch to ff06fcd"&lt;/em&gt; followed by &lt;em&gt;"Restart Frontend Components"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This wasn't failure. This was wisdom. Sometimes the best progress is admitting your current approach isn't working and starting fresh with better knowledge.&lt;/p&gt;

&lt;p&gt;The September commits show a more measured approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Remove a bunch of the bloat"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Begin frontend rework"&lt;/em&gt; (yes, another typo - some things never change)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Create first draft of frontend rewrite"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Lessons (What They Don't Tell You About AI Development)
&lt;/h2&gt;

&lt;p&gt;After 94 commits and 27 days, here's what I learned about developing with AI in 2025:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AI is a Powerful Intern, Not a Senior Developer
&lt;/h3&gt;

&lt;p&gt;AI tools are incredible at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating boilerplate code&lt;/li&gt;
&lt;li&gt;Writing isolated functions&lt;/li&gt;
&lt;li&gt;Creating comprehensive test cases&lt;/li&gt;
&lt;li&gt;Suggesting patterns you forgot existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they're terrible at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding your existing codebase context&lt;/li&gt;
&lt;li&gt;Making architectural decisions&lt;/li&gt;
&lt;li&gt;Debugging integration issues&lt;/li&gt;
&lt;li&gt;Knowing when to stop generating code&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The "Almost Right" Problem is Real
&lt;/h3&gt;

&lt;p&gt;That 45% statistic about AI being "almost right, but not quite"? It's devastatingly accurate. &lt;/p&gt;

&lt;p&gt;Almost-right code is worse than obviously broken code because it looks correct until it fails in production. You spend more time debugging subtle AI mistakes than you would writing correct code from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Testing Becomes Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;In the pre-AI era, you could sometimes get away with light testing if you wrote careful code. With AI assistance, comprehensive testing isn't optional - it's the only way to verify that generated code actually works.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Humans Excel at Integration and Architecture
&lt;/h3&gt;

&lt;p&gt;AI generates components. Humans integrate systems. The real skill in AI-assisted development isn't prompting the AI to write perfect code - it's knowing how to combine AI-generated pieces into a coherent, maintainable system.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The Restart is a Feature, Not a Bug
&lt;/h3&gt;

&lt;p&gt;Traditional development wisdom says "never rewrite." AI-assisted development changes this. Sometimes, starting fresh with better prompts and clearer architecture is faster than debugging a messy AI-generated codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradox We're All Living
&lt;/h2&gt;

&lt;p&gt;Here's the thing that nobody wants to admit: &lt;strong&gt;AI coding tools are simultaneously the best and most frustrating thing to happen to development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They're the best because they can generate complex, sophisticated code in seconds. They handle boilerplate, suggest patterns, and can kickstart projects that would take days to set up manually.&lt;/p&gt;

&lt;p&gt;They're the most frustrating because they create a false sense of progress. You feel incredibly productive generating hundreds of lines of code, until you realize none of it works together and you're debugging problems you don't understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for You
&lt;/h2&gt;

&lt;p&gt;If you're using AI coding tools (and statistically, you probably are), here's my advice based on 94 commits of real experience:&lt;/p&gt;

&lt;h3&gt;
  
  
  Start with Architecture, Not Code
&lt;/h3&gt;

&lt;p&gt;Don't let AI drive your technical decisions. Plan your system architecture first, then use AI to implement the pieces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embrace the Human-AI Workflow
&lt;/h3&gt;

&lt;p&gt;Use AI for generation, yourself for validation, and tests for verification. This trinity is your safeguard against the "almost right" problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget for Integration Time
&lt;/h3&gt;

&lt;p&gt;AI can generate components in minutes, but integrating them takes human time. Plan accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Everything
&lt;/h3&gt;

&lt;p&gt;If AI wrote it, test it. If you modified AI code, test it again. If it looks too good to be true, definitely test it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Know When to Reset
&lt;/h3&gt;

&lt;p&gt;Sometimes starting fresh with better prompts is faster than debugging a tangled AI-generated mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Job Pilot Epilogue
&lt;/h2&gt;

&lt;p&gt;As I write this, Job Pilot is alive and actively being developed. It's not the perfect app I envisioned on August 20th, but it's something better - a real application built through the messy, iterative process of human-AI collaboration.&lt;/p&gt;

&lt;p&gt;The final commit message as of September 16th reads: &lt;em&gt;"Document frontend-backend connection"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not "Revolutionary AI-Generated App Launches" or "Perfect Code Generated by AI." Just the humble work of documenting how pieces fit together - the most human part of development.&lt;/p&gt;

&lt;p&gt;That's the real story of coding with AI in 2025. It's not about AI replacing developers or making us obsolete. It's about learning to collaborate with incredibly powerful but fundamentally limited tools.&lt;/p&gt;

&lt;p&gt;We're not just writing code anymore. We're conducting an orchestra where half the musicians are brilliant but can't read music, and the other half (us) need to make sure everyone plays in harmony.&lt;/p&gt;

&lt;p&gt;And sometimes, just sometimes, when all the pieces align and the tests pass and the user clicks through your app without encountering a single bug, you remember why you fell in love with building software in the first place.&lt;/p&gt;

&lt;p&gt;Even if your first commit had a typo.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Your AI Development Story?
&lt;/h2&gt;

&lt;p&gt;I've shared my 94-commit journey into the reality of AI-assisted development. Now I want to hear yours. &lt;/p&gt;

&lt;p&gt;Have you experienced the "almost right, but not quite" problem? How do you balance AI assistance with human judgment? What's your biggest AI development win or failure?&lt;/p&gt;

&lt;p&gt;Share your story in the comments - let's build a real picture of what development looks like in the AI age, beyond the hype and the headlines.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to see the complete commit history that inspired this post? Check out the Job Pilot repository (coming soon) or follow my development journey for more real stories from the trenches of AI-assisted coding.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>beginners</category>
      <category>qwen</category>
    </item>
    <item>
      <title>Crafting a Monster Hunter Wilds AI Assistant: Scrapy, Vector Search &amp; Prompt Engineering</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Tue, 12 Aug 2025 21:54:59 +0000</pubDate>
      <link>https://forem.com/alexretana/crafting-a-monster-hunter-wilds-ai-assistant-scrapy-vector-search-prompt-engineering-5253</link>
      <guid>https://forem.com/alexretana/crafting-a-monster-hunter-wilds-ai-assistant-scrapy-vector-search-prompt-engineering-5253</guid>
      <description>&lt;h1&gt;
  
  
  Building a Local Monster Hunter Wilds RAG System: From Web Scraping to Prompt Engineering
&lt;/h1&gt;

&lt;p&gt;Gaming wikis are treasure troves of detailed information, but finding the right answer to specific questions can be like hunting a Rathalos in a thunderstorm. What if you could have a personal Monster Hunter expert that knows every weapon combo, monster weakness, and crafting recipe? That's exactly what I built with my Monster Hunter Wilds RAG (Retrieval-Augmented Generation) system.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through building a complete RAG pipeline that scrapes gaming wiki content, vectorizes it for fast retrieval, and serves intelligent answers through a local web interface. Along the way, we'll explore why certain architectural decisions were made and how prompt engineering can dramatically improve system performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ System Architecture: Two-Part Approach
&lt;/h2&gt;

&lt;p&gt;The system consists of two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent Web Scraper&lt;/strong&gt;: Harvests and structures wiki content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG Pipeline&lt;/strong&gt;: Retrieves relevant content and generates contextual answers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's dive into each part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Building the Web Scraper with Scrapy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Scrapy Over Custom Solutions?
&lt;/h3&gt;

&lt;p&gt;When building a web scraper, you have several options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a custom scraper with &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use browser automation tools like Selenium&lt;/li&gt;
&lt;li&gt;Leverage a professional scraping framework like Scrapy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I chose &lt;strong&gt;Scrapy&lt;/strong&gt; for several compelling reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Built-in Politeness&lt;/strong&gt;: Scrapy respects &lt;code&gt;robots.txt&lt;/code&gt; files and implements automatic delays between requests, making it respectful to target servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Robust Crawling Features&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pause/resume functionality through &lt;code&gt;JOBDIR&lt;/code&gt; settings&lt;/li&gt;
&lt;li&gt;Automatic duplicate detection and filtering&lt;/li&gt;
&lt;li&gt;Depth limiting to prevent infinite crawling&lt;/li&gt;
&lt;li&gt;Built-in retry mechanisms for failed requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Scalability&lt;/strong&gt;: Scrapy handles concurrent requests efficiently and can scale from small wikis to massive sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Extensibility&lt;/strong&gt;: The pipeline architecture allows for easy data processing and storage customization.&lt;/p&gt;
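&lt;p&gt;To make those politeness and robustness guarantees concrete, here is a minimal sketch of the settings involved. The setting names are real Scrapy settings; the specific values are illustrative, not the exact ones from my project.&lt;/p&gt;

```python
# Illustrative Scrapy politeness/robustness settings (values are examples).
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds between requests to the same site
    "AUTOTHROTTLE_ENABLED": True,         # back off automatically when the server slows down
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap parallel requests per domain
    "RETRY_ENABLED": True,                # retry failed requests
    "RETRY_TIMES": 2,                     # up to two retries per request
    "DEPTH_LIMIT": 6,                     # stop following links past this depth
}
```

&lt;p&gt;Dropping settings like these into &lt;code&gt;settings.py&lt;/code&gt; (or a spider's &lt;code&gt;custom_settings&lt;/code&gt;, as shown later) is all it takes; with a hand-rolled &lt;code&gt;requests&lt;/code&gt; scraper you would be implementing each of these behaviors yourself.&lt;/p&gt;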

&lt;h3&gt;
  
  
  Spider Implementation
&lt;/h3&gt;

&lt;p&gt;Here's the core of my Fextralife spider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyFextralifeSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myfextralifespider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monsterhunterwilds.wiki.fextralife.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://monsterhunterwilds.wiki.fextralife.com/Monster+Hunter+Wilds+Wiki&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOBDIR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jobs/daily-fextralife-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEPTH_LIMIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSESPIDER_TIMEOUT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ITEM_PIPELINES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wikiproject.pipelines.WikiprojectPipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spider automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follows internal links within the domain&lt;/li&gt;
&lt;li&gt;Skips static assets (images, CSS, JS files)&lt;/li&gt;
&lt;li&gt;Limits crawl depth to prevent infinite loops&lt;/li&gt;
&lt;li&gt;Saves progress for pause/resume functionality&lt;/li&gt;
&lt;/ul&gt;
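&lt;p&gt;The domain and asset filtering boils down to a simple URL check before following each link. Here is a self-contained sketch; the helper name and extension list are my illustration, not the exact code from the spider.&lt;/p&gt;

```python
from urllib.parse import urlparse

# File extensions treated as static assets and skipped (illustrative list).
SKIP_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".svg", ".ico")

def should_follow(url: str, allowed_domain: str) -> bool:
    """Return True if the spider should crawl this URL:
    it must stay on the wiki's domain and not point at a static asset."""
    parsed = urlparse(url)
    if parsed.netloc and parsed.netloc != allowed_domain:
        return False  # off-domain link: stay inside the wiki
    return not parsed.path.lower().endswith(SKIP_EXTENSIONS)
```

&lt;p&gt;Inside a Scrapy callback, each candidate href would pass through a filter like this before being handed to &lt;code&gt;response.follow(href, callback=self.parse)&lt;/code&gt;; depth limiting and duplicate detection are then handled by the framework itself.&lt;/p&gt;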

&lt;h3&gt;
  
  
  Intelligent Content Extraction
&lt;/h3&gt;

&lt;p&gt;The magic happens in the content parsing. Wiki pages contain both structured (tables) and unstructured (text) content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_wiki_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract clean text content from the main wiki content block
&lt;/span&gt;    &lt;span class="n"&gt;wikicontent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//div[@id=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wiki-content-block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]//text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xa0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wikicontent&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_wiki_tables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert HTML tables to structured JSON
&lt;/span&gt;    &lt;span class="c1"&gt;# Handles nested tables, images with alt text, and complex structures
&lt;/span&gt;    &lt;span class="c1"&gt;# Returns normalized data ready for vectorization
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system extracts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Breadcrumb navigation&lt;/strong&gt; for content categorization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean text content&lt;/strong&gt; from the main wiki areas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured table data&lt;/strong&gt; converted to JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL references&lt;/strong&gt; for source attribution&lt;/li&gt;
&lt;/ul&gt;
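&lt;p&gt;The clean-text step boils down to joining stripped text nodes and normalizing non-breaking spaces; a standalone sketch of that logic (the input fragments here are illustrative):&lt;/p&gt;

```python
def normalize_wiki_text(fragments):
    """Join raw text nodes from a wiki content block into one clean string.

    Same idea as the spider's one-liner: strip each fragment, join with
    spaces, and swap non-breaking spaces for regular ones. Empty fragments
    are dropped to avoid doubled spaces.
    """
    joined = " ".join(x.strip() for x in fragments if x.strip())
    return joined.replace("\xa0", " ")

fragments = ["  Great Sword  ", "\xa0", "A heavy blade weapon.", ""]
print(normalize_wiki_text(fragments))  # Great Sword A heavy blade weapon.
```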

&lt;p&gt;Each page is transformed into a structured document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the user's answer is answered by information in this file, please direct them to {url}
URL: {url}
####################
Page Title: {title}
####################
Breadcrumb: {breadcrumb}
####################
Page Content:
{clean_text_content}
####################
Page Tables Stored as JSON:
{structured_table_data}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
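&lt;p&gt;Producing that format is plain string templating; a minimal sketch (the helper name and example values are mine, not the project's):&lt;/p&gt;

```python
def build_document(url, title, breadcrumb, content, tables_json):
    """Assemble one scraped page into the prompt-ready document format.

    Helper name and example values are illustrative; the real pipeline
    fills these fields from the spider's parsed output.
    """
    sep = "#" * 20
    return (
        f"If the user's answer is answered by information in this file, "
        f"please direct them to {url}\n"
        f"URL: {url}\n{sep}\n"
        f"Page Title: {title}\n{sep}\n"
        f"Breadcrumb: {breadcrumb}\n{sep}\n"
        f"Page Content:\n{content}\n{sep}\n"
        f"Page Tables Stored as JSON:\n{tables_json}"
    )

doc = build_document(
    "https://example.com/wiki/Great-Sword",
    "Great Sword",
    "Home / Weapons / Great Sword",
    "A heavy blade weapon.",
    "[]",
)
print(doc.splitlines()[1])  # URL: https://example.com/wiki/Great-Sword
```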



&lt;h3&gt;
  
  
  The Critical close_spider() Function
  The Critical close_spider() Function
&lt;/h3&gt;

&lt;p&gt;Here's where the scraped data gets vectorized and stored. In Scrapy's pipeline system, the &lt;code&gt;close_spider&lt;/code&gt; method in &lt;code&gt;pipelines.py&lt;/code&gt; is called when crawling finishes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_spider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Deduplicate scraped content and build breadcrumb map
&lt;/span&gt;    &lt;span class="n"&gt;breadcrumb_map&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_page_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dedupe_and_build_breadcrumb_map&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total Pages Scraped: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_page_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has been ingested into Chroma vector store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deduplication, the &lt;code&gt;upsert_into_chroma()&lt;/code&gt; function handles the final data processing, vectorizing the content and writing it to Chroma:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upsert_into_chroma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Upserts DataFrame content into Chroma vector store.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting Chroma ingestion...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize embedding model
&lt;/span&gt;    &lt;span class="n"&gt;embed_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FixedOllamaEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nomic-embed-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create persistent Chroma client
&lt;/span&gt;    &lt;span class="n"&gt;chroma_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chroma_collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monsterhunter_fextralife_wiki&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChromaVectorStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chroma_collection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chroma_collection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert to LlamaIndex Documents and create vector index
&lt;/span&gt;    &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wiki_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embed_model&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully ingested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures all scraped content is automatically vectorized and ready for semantic search.&lt;/p&gt;
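&lt;p&gt;Here, "ready for semantic search" means nearest-neighbor lookup over embedding vectors. A toy illustration with 2-D vectors standing in for real nomic-embed-text embeddings:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

# 2-D stand-ins for real embedding vectors
docs = {
    "great-sword": [0.9, 0.1],
    "rathalos": [0.1, 0.9],
    "crafting": [0.6, 0.4],
}
print(top_k([1.0, 0.0], docs))  # ['great-sword', 'crafting']
```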

&lt;h2&gt;
  
  
  Part 2: The RAG System with OpenWebUI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenWebUI + Pipelines Architecture
&lt;/h3&gt;

&lt;p&gt;I chose OpenWebUI as the frontend because it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Familiar Chat Interface&lt;/strong&gt;: ChatGPT-like experience for users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline System&lt;/strong&gt;: Custom processing between user input and LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Hosting&lt;/strong&gt;: Complete control over data and privacy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Model Support&lt;/strong&gt;: Works with Ollama's local models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline architecture works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → OpenWebUI → Custom Pipeline → Chroma Search → Context + Query → LLM → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Early Implementation: Simple Interception
&lt;/h3&gt;

&lt;p&gt;Initially, the pipeline was quite basic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;__user__&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Simply intercept the message
&lt;/span&gt;    &lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Search Chroma for relevant content  
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Lazily combine results with query
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;enhanced_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Pass to LLM
&lt;/span&gt;    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enhanced_query&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked, but responses were generic and often missed domain-specific nuances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;Before diving into improvements, I built a comprehensive evaluation system to measure performance objectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Metrics
&lt;/h3&gt;

&lt;p&gt;Following LlamaIndex best practices, I implemented both end-to-end and component-wise evaluation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End-to-End Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; (0-1): Are responses faithful to retrieved context? (No hallucinations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevancy&lt;/strong&gt; (0-1): Are responses relevant to the query?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt; (0-1): Are responses factually correct?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Similarity&lt;/strong&gt; (0-1): How similar are responses to expected answers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Component-Wise Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hit Rate&lt;/strong&gt;: Percentage of queries where relevant documents are retrieved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean Reciprocal Rank (MRR)&lt;/strong&gt;: Quality of retrieval ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Time&lt;/strong&gt;: Performance measurement&lt;/li&gt;
&lt;/ul&gt;
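&lt;p&gt;Hit Rate and MRR are simple to compute once you know, per query, the rank of the first relevant retrieved document; a sketch (the input data is invented):&lt;/p&gt;

```python
def hit_rate_and_mrr(first_relevant_rank):
    """Compute Hit Rate and MRR from per-query retrieval results.

    Input maps each query to the 1-based rank of the first relevant
    document retrieved, or None when nothing relevant came back.
    """
    n = len(first_relevant_rank)
    hits = [rank for rank in first_relevant_rank.values() if rank is not None]
    hit_rate = len(hits) / n
    mrr = sum(1.0 / rank for rank in hits) / n
    return hit_rate, mrr

ranks = {"q1": 1, "q2": 3, "q3": None, "q4": 2}
hit_rate, mrr = hit_rate_and_mrr(ranks)
print(hit_rate)  # 0.75
```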

&lt;h3&gt;
  
  
  Dataset Generation
&lt;/h3&gt;

&lt;p&gt;I created two approaches for generating evaluation data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automated Question Generation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_questions_from_vectorstore&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Sample random documents from Chroma
&lt;/span&gt;    &lt;span class="c1"&gt;# Use LLM to generate realistic questions
&lt;/span&gt;    &lt;span class="c1"&gt;# Create diverse query types (factual, procedural, comparative)
&lt;/span&gt;    &lt;span class="c1"&gt;# Categorize by content type (weapons, monsters, crafting, etc.)
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
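&lt;p&gt;The elided generator samples pages and asks an LLM to write questions about them; the sampling and prompt-building half can be sketched without the model call (prompt wording and query types here are illustrative):&lt;/p&gt;

```python
import random

QUESTION_PROMPT = (
    "You are generating evaluation questions for a Monster Hunter wiki.\n"
    "Write one {qtype} question answerable from this page:\n\n{page}"
)

def build_question_prompts(documents, n=2, seed=42):
    """Sample pages and build one question-generation prompt per page.

    The prompt wording and query types are illustrative; the real generator
    also categorizes output by content type (weapons, monsters, crafting).
    """
    rng = random.Random(seed)
    qtypes = ["factual", "procedural", "comparative"]
    return [
        QUESTION_PROMPT.format(qtype=rng.choice(qtypes), page=page)
        for page in rng.sample(documents, n)
    ]

docs = ["Great Sword page text", "Rathalos page text", "Crafting page text"]
prompts = build_question_prompts(docs)
print(len(prompts))  # 2
```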



&lt;p&gt;&lt;strong&gt;2. Manual Answer Annotation:&lt;/strong&gt;&lt;br&gt;
I built a separate annotation program that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes generated questions&lt;/li&gt;
&lt;li&gt;Retrieves potential answers from the RAG system&lt;/li&gt;
&lt;li&gt;Presents them to human reviewers for validation&lt;/li&gt;
&lt;li&gt;Builds high-quality ground truth datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hybrid approach ensured both scale (144 auto-generated questions) and quality (15 carefully curated sample queries).&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 4: The Power of Prompt Engineering
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Pre-Prompt Engineering Results
&lt;/h3&gt;

&lt;p&gt;Running evaluation on the basic system revealed significant issues:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Faithfulness&lt;/th&gt;
&lt;th&gt;Relevancy&lt;/th&gt;
&lt;th&gt;Correctness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample Queries (15)&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;86.67%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generated Questions (144)&lt;/td&gt;
&lt;td&gt;77.08%&lt;/td&gt;
&lt;td&gt;90.97%&lt;/td&gt;
&lt;td&gt;83.33%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The correctness scores revealed a major problem: while the system could find relevant information, it struggled to provide accurate, domain-specific answers.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Prompt Engineering?
&lt;/h3&gt;

&lt;p&gt;Prompt engineering is the practice of designing, optimizing, and refining the instructions given to language models to achieve better performance on specific tasks. It involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role Definition&lt;/strong&gt;: Establishing the AI's persona and expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Guidelines&lt;/strong&gt;: Specifying how to use provided information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Formatting&lt;/strong&gt;: Defining response structure and style&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Instructions for edge cases and missing information&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Custom Monster Hunter Prompts
&lt;/h3&gt;

&lt;p&gt;I implemented domain-specific prompts that transformed the system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mh_qa_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert Monster Hunter guide and wiki assistant with deep knowledge &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;of Monster Hunter: Wilds. Your role is to provide accurate, helpful information &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;about weapons, monsters, gameplay mechanics, and strategies.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMPORTANT GUIDELINES:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Use ONLY the information provided in the context below&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Use correct Monster Hunter terminology (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Great Sword&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Greatsword&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- If information is insufficient, clearly state what you cannot answer&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Include relevant URLs when directing users to specific pages&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- Structure responses clearly with sections when appropriate&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context Information:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{context_str}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User Question: {query_str}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide a comprehensive answer based on the context above:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key improvements included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expert Persona&lt;/strong&gt;: "You are an expert Monster Hunter guide"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminology Enforcement&lt;/strong&gt;: Specific language requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Boundaries&lt;/strong&gt;: "Use ONLY the information provided"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Structure&lt;/strong&gt;: Clear formatting guidelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Attribution&lt;/strong&gt;: Including URLs for references&lt;/li&gt;
&lt;/ul&gt;
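&lt;p&gt;In LlamaIndex this template is typically attached to a query engine via &lt;code&gt;update_prompts&lt;/code&gt;; the substitution itself is ordinary string formatting, sketched here with the template abbreviated:&lt;/p&gt;

```python
# Abbreviated copy of the prompt; LlamaIndex's PromptTemplate fills
# {context_str} with the retrieved documents and {query_str} with the
# user's question.
MH_QA_TEMPLATE = (
    "You are an expert Monster Hunter guide...\n\n"
    "Context Information:\n{context_str}\n\n"
    "User Question: {query_str}\n\n"
    "Provide a comprehensive answer based on the context above:"
)

def format_prompt(context_docs, query):
    """The substitution step, written out as plain string formatting."""
    context_str = "\n\n".join(context_docs)
    return MH_QA_TEMPLATE.format(context_str=context_str, query_str=query)

prompt = format_prompt(
    ["Great Sword: a heavy blade weapon."],
    "What is the Great Sword?",
)
print(prompt.splitlines()[-1])  # Provide a comprehensive answer based on the context above:
```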

&lt;h3&gt;
  
  
  Results After Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;The impact was dramatic:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sample Queries&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Correctness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26.67%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.33%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+250%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sample Queries&lt;/td&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;Maintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generated Questions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Correctness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;83.33%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.67%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+10%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generated Questions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;77.08%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.11%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+12%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Performance Highlights
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Exceptional Correctness Improvement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sample dataset correctness jumped from 26.67% to 93.33% - a 250% improvement&lt;/li&gt;
&lt;li&gt;Large dataset correctness increased from 83.33% to 91.67%&lt;/li&gt;
&lt;li&gt;Users now receive significantly more accurate responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Faithfulness:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12% improvement on large dataset (reduced hallucinations)&lt;/li&gt;
&lt;li&gt;Better adherence to source material&lt;/li&gt;
&lt;li&gt;Increased system reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Domain Expertise Integration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proper Monster Hunter terminology usage&lt;/li&gt;
&lt;li&gt;Contextually appropriate responses&lt;/li&gt;
&lt;li&gt;Category-specific performance improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system now provides accurate answers 9 out of 10 times, with responses that stay true to the source material while being highly relevant to user queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Local Hosting and Hardware Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Local Over Cloud?
&lt;/h3&gt;

&lt;p&gt;I made the conscious decision to keep this system local rather than hosting it online for several reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU Requirements&lt;/strong&gt;: The system performs best with GPU acceleration for embeddings and LLM inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Memory Usage&lt;/strong&gt;: Running multiple large language models (embedding + chat model) requires significant RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Needs&lt;/strong&gt;: Vector databases and model files consume substantial disk space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Costs&lt;/strong&gt;: Cloud GPU instances are expensive for continuous operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Privacy Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete control over data&lt;/li&gt;
&lt;li&gt;No external API dependencies&lt;/li&gt;
&lt;li&gt;Gaming queries remain private&lt;/li&gt;
&lt;li&gt;Can customize without service restrictions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hardware Requirements
&lt;/h3&gt;

&lt;p&gt;The system runs smoothly on my RTX 3090 setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 3090 (24GB VRAM) - handles both embedding and LLM inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM&lt;/strong&gt;: 32GB system RAM for vector operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: SSD storage for fast vector database access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance with RTX 3090:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Generation&lt;/strong&gt;: ~2-3 seconds for query embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search&lt;/strong&gt;: Sub-second retrieval from Chroma&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Inference&lt;/strong&gt;: 8-15 seconds for complete responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Response Time&lt;/strong&gt;: 10-20 seconds end-to-end&lt;/li&gt;
&lt;/ul&gt;
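&lt;p&gt;To attribute the end-to-end budget across stages like these, a simple wall-clock wrapper is enough; a sketch with a stubbed search stage in place of the real Chroma call:&lt;/p&gt;

```python
import time

def timed(fn, *args, **kwargs):
    """Run one pipeline stage and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_vector_search(query):
    # Stand-in for a sub-second Chroma lookup
    time.sleep(0.01)
    return ["doc-1", "doc-2"]

results, elapsed = timed(fake_vector_search, "great sword combos")
print(len(results))  # 2
```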

&lt;h3&gt;
  
  
  Automated Setup Scripts
&lt;/h3&gt;

&lt;p&gt;To make the system accessible, I created comprehensive build and startup scripts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend Build Process:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows&lt;/span&gt;
build.bat

&lt;span class="c"&gt;# Linux/macOS  &lt;/span&gt;
./build.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;System Startup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows&lt;/span&gt;
start_windows.bat

&lt;span class="c"&gt;# Linux/macOS&lt;/span&gt;
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scripts automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install and configure Ollama server&lt;/li&gt;
&lt;li&gt;Download required AI models (llama3:8b, nomic-embed-text)&lt;/li&gt;
&lt;li&gt;Set up conda environments for different components&lt;/li&gt;
&lt;li&gt;Build the OpenWebUI frontend&lt;/li&gt;
&lt;li&gt;Launch all services in separate terminal windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This automation transforms a complex multi-component system into a simple double-click experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Scrapy's Professional Features Matter
&lt;/h3&gt;

&lt;p&gt;The built-in politeness, retry mechanisms, and pause/resume capabilities saved countless hours compared to custom solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data Quality Trumps Quantity
&lt;/h3&gt;

&lt;p&gt;150 well-processed, structured documents outperformed thousands of poorly parsed pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Prompt Engineering is Critical
&lt;/h3&gt;

&lt;p&gt;Generic prompts led to 26.67% correctness; domain-specific prompts achieved 93.33% - a game-changing difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Evaluation Drives Improvement
&lt;/h3&gt;

&lt;p&gt;Without quantitative metrics, I would have never discovered the correctness issues or measured the dramatic improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Local Hosting is Viable for Personal Projects
&lt;/h3&gt;

&lt;p&gt;Modern consumer GPUs like the RTX 3090 make sophisticated AI systems accessible for personal use without ongoing cloud costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;Several improvements could further enhance the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-game Support&lt;/strong&gt;: Extend to other gaming wikis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Context&lt;/strong&gt;: Conversation history and user preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimization&lt;/strong&gt;: Reduce response times while maintaining quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile Interface&lt;/strong&gt;: Responsive design for gaming on-the-go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Features&lt;/strong&gt;: Shared question libraries and answer validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this Monster Hunter RAG system taught me that modern AI tools can transform how we interact with domain-specific knowledge. The combination of intelligent web scraping, vector search, and carefully engineered prompts creates an experience far superior to traditional wiki browsing.&lt;/p&gt;

&lt;p&gt;The system went from providing correct answers 1 in 4 times to 9 in 10 times through prompt engineering alone. This demonstrates the critical importance of domain-specific customization in RAG systems.&lt;/p&gt;

&lt;p&gt;For gaming enthusiasts, researchers, or anyone working with specialized knowledge domains, this architecture provides a blueprint for building your own intelligent information systems. The complete codebase, evaluation framework, and setup scripts make it accessible even for those new to RAG systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to build your own gaming RAG system?&lt;/strong&gt; The complete project is open source and includes automated setup scripts, comprehensive evaluation tools, and detailed documentation to get you started.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy hunting! 🏹&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tech Stack Used:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Scraping&lt;/strong&gt;: Scrapy, BeautifulSoup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database&lt;/strong&gt;: ChromaDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Framework&lt;/strong&gt;: LlamaIndex
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Ollama (Llama 3, Nomic Embed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: OpenWebUI (SvelteKit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt;: Custom framework with automated metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Languages&lt;/strong&gt;: Python, JavaScript, Shell scripting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This project showcases the power of combining modern AI tools with careful engineering to create practical, high-performance systems for specialized domains.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>webscraping</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>From SageMaker to Static Site: Hosting a Deep Learning Model on the Frontend</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Mon, 28 Jul 2025 20:13:16 +0000</pubDate>
      <link>https://forem.com/alexretana/from-sagemaker-to-static-site-hosting-a-deep-learning-model-on-the-frontend-26kf</link>
      <guid>https://forem.com/alexretana/from-sagemaker-to-static-site-hosting-a-deep-learning-model-on-the-frontend-26kf</guid>
      <description>&lt;p&gt;A couple of weeks ago, I revisited an old project: a face mask classifier I originally built in Keras.&lt;br&gt;
In my last article, I retrained it three different ways:&lt;/p&gt;

&lt;p&gt;✅ Classic deep learning (TensorFlow inside SageMaker Studio)&lt;/p&gt;

&lt;p&gt;⚙️ Low-code SageMaker Canvas&lt;/p&gt;

&lt;p&gt;🧠 Fully managed Rekognition Custom Labels&lt;/p&gt;

&lt;p&gt;This time, I wanted to see if I could make the model run entirely in the browser. Not just for fun, but because it felt like a win on multiple fronts: it would remove backend inference costs since there’d be no server running 24/7; keep the webcam feed local to the user’s machine, improving privacy; and create a live demo anyone could try instantly, without waiting on an API call or spinning up infrastructure.&lt;/p&gt;

&lt;p&gt;Here’s how that turned out, and why it was trickier than I first thought.&lt;/p&gt;
&lt;h2&gt;
  
  
  ⚙️ Step 1: Converting my Keras model for TensorFlow.js
&lt;/h2&gt;

&lt;p&gt;My original classifier was trained in SageMaker Studio, saved in the latest Keras v3 format.&lt;br&gt;
Problem: TensorFlow.js only supports converting the old Keras v2 .h5 format.&lt;/p&gt;

&lt;p&gt;So the first thing I had to do:&lt;/p&gt;

&lt;p&gt;Retrain the model (same code, but explicitly save it to .h5)&lt;/p&gt;

&lt;p&gt;Use the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tensorflowjs_converter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input_format&lt;/span&gt; keras &lt;span class="se"&gt;\&lt;/span&gt;
  my_model.h5 &lt;span class="se"&gt;\&lt;/span&gt;
  ./model_web/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That produced a browser-ready model.json + weight files. Loading it in JS was simple:&lt;br&gt;
&lt;/p&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadLayersModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/model_web/model.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This felt like a small win — but the model only takes cropped face images.&lt;br&gt;
Next problem: how do I detect faces?&lt;/p&gt;
&lt;h2&gt;
  
  
  📦 Step 2: Adding face detection
&lt;/h2&gt;

&lt;p&gt;In the original pipeline, I used a YOLOv3 model to detect faces, then classified them.&lt;br&gt;
But YOLOv3 is heavy for browser use.&lt;/p&gt;

&lt;p&gt;I needed something smaller that worked in TensorFlow.js.&lt;/p&gt;

&lt;p&gt;Luckily, TensorFlow.js has some pretrained lightweight face detectors.&lt;br&gt;
I picked one, tested it on the webcam stream, and it worked surprisingly well:&lt;/p&gt;

&lt;p&gt;Detect faces → crop → run classifier → draw predictions&lt;/p&gt;

&lt;p&gt;All in real time.&lt;/p&gt;

&lt;p&gt;Suddenly, I had a browser app that could see your face and tell if you were wearing a mask — without sending anything to a server.&lt;/p&gt;
&lt;h2&gt;
  
  
  🚫 Honorable Mention: Why Canvas &amp;amp; Rekognition models didn’t make it
&lt;/h2&gt;

&lt;p&gt;At this point, I was hoping I could also bring over the models I built in SageMaker Canvas and Rekognition to run directly in the browser. But pretty quickly, I ran into hard limits: SageMaker Canvas only lets you export a model meant for Python or TensorFlow Serving, with no option to get a .h5 or SavedModel that I could convert for TensorFlow.js; and Rekognition Custom Labels doesn’t let you download the trained model at all — it’s locked behind AWS’s API. Since the whole goal was to keep everything frontend-only and client-side, these two paths just didn’t fit. It was a good reminder that the more managed and abstracted a tool is, the less portable your model ends up being.&lt;/p&gt;
&lt;h2&gt;
  
  
  🧰 Step 3: Building the demo &amp;amp; making it repeatable
&lt;/h2&gt;

&lt;p&gt;With the model running locally in the browser, I wanted to take the next step: actually host it online so anyone could try it, and make deployments effortless. To do that, I built a small React frontend that grabs the webcam feed, detects faces, runs the mask classifier, and draws the predictions on screen in real time. Then I wrote some Terraform to handle the infrastructure: provisioning a public S3 bucket for static hosting, a CloudFront distribution for global CDN, and IAM roles to support CI/CD. Finally, I set up GitHub Actions so that every time I push to the repository, it automatically builds the site and deploys it to S3.&lt;/p&gt;

&lt;p&gt;Now it’s fully repeatable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it’s live.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✅ Wrapping it up
&lt;/h2&gt;

&lt;p&gt;In the end, what started as an old Keras side project turned into a modern, privacy‑friendly browser demo — running real‑time face detection and mask classification entirely on the client. To clean up the frontend, I rebuilt it using Solid.js for fast reactivity, styled it with Tailwind CSS and daisyUI, and added subtle animations with auto‑animate and solid‑transition‑group to make the UI feel more alive.&lt;/p&gt;

&lt;p&gt;I even tried to get it working on mobile devices, but ran into a familiar wall: the model was just too big to run smoothly in the browser on most phones. At that point, training a new, smaller model felt like it deserved to be its own project — and I decided to leave it for another day.&lt;/p&gt;

&lt;p&gt;Still, I’m happy with how it turned out: a repeatable, low‑cost, fully frontend ML demo that anyone can try without sending a single frame to a backend. And while it’s not production‑ready, it’s proof that with the right tools and some cloud glue, you can bring even an old deep learning project back to life — and make it feel brand new.&lt;/p&gt;

&lt;p&gt;If you’ve tried something similar, run into the same Keras/TensorFlow.js headaches, or have ideas on building lighter models for mobile, I’d love to hear about it in the comments!&lt;/p&gt;

&lt;p&gt;You can &lt;strong&gt;try the live demo here → &lt;a href="https://face-mask-classifier-demo.retanatech.com" rel="noopener noreferrer"&gt;Face Mask Classifier Demo&lt;/a&gt;&lt;/strong&gt;, and if you’re curious about my other projects, check out my portfolio at &lt;strong&gt;&lt;a href="https://retanatech.com" rel="noopener noreferrer"&gt;retanatech.com&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>tensorflow</category>
      <category>terraform</category>
      <category>webdev</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Comparing 3 ways to Train a Face Mask Classifier: Tensorflow, AWS Canvas, and Rekognition</title>
      <dc:creator>Alex Retana</dc:creator>
      <pubDate>Thu, 10 Jul 2025 21:48:27 +0000</pubDate>
      <link>https://forem.com/alexretana/comparing-3-ways-to-deploy-a-face-mask-classifier-tensorflow-aws-canvas-and-rekognition-266d</link>
      <guid>https://forem.com/alexretana/comparing-3-ways-to-deploy-a-face-mask-classifier-tensorflow-aws-canvas-and-rekognition-266d</guid>
      <description>&lt;h2&gt;
  
  
  🛠️ Introduction
&lt;/h2&gt;

&lt;p&gt;A few years ago, I built a simple face mask image classifier using Keras and TensorFlow, trained locally on my own hardware. Recently, I decided to revisit this project for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To see how easy (or hard) it would be to rerun my old Jupyter notebook from 4–5 years ago.&lt;/li&gt;
&lt;li&gt;To try running custom training jobs inside Amazon SageMaker Studio, instead of relying on my own machine.&lt;/li&gt;
&lt;li&gt;And while I was at it, I wanted to compare my custom-trained model against other ways of building and deploying models on AWS, including low-code/no-code tools and out-of-the-box computer vision APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the three approaches I tested:&lt;/p&gt;

&lt;p&gt;✅ Classic deep learning: Running my original Jupyter notebook inside a SageMaker Studio JupyterLab instance, retraining the model with TensorFlow, then hosting it for a front-end demo using TensorFlow.js + S3.&lt;/p&gt;

&lt;p&gt;⚙️ Low-code/no-code: Using AWS SageMaker Canvas, which lets you upload images and build models through a point-and-click UI, without writing code.&lt;/p&gt;

&lt;p&gt;🧠 Fully managed pre-trained service: Using AWS Rekognition’s facial analysis API to see if it can detect masks directly — no training required.&lt;/p&gt;

&lt;p&gt;For each method, I wanted to evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ease of training/setup&lt;/li&gt;
&lt;li&gt;Options for deployment (can it run in the frontend? backend only? real-time or batch?)&lt;/li&gt;
&lt;li&gt;AWS pricing cost&lt;/li&gt;
&lt;li&gt;Computational cost &amp;amp; latency (how fast can it return predictions?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the rest of this article, I’ll walk through each method, compare their results, and share what I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  📦 Method 1: Classic Deep Learning (TensorFlow + Jupyter)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📜 Revisiting the Old Project
&lt;/h3&gt;

&lt;p&gt;The starting point for this method is my older project:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://github.com/alexretana/facemaskclassifier" rel="noopener noreferrer"&gt;alexretana/facemaskclassifier on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was a small computer vision project I created a few years ago to explore transfer learning and pretrained models. The goal was to build a pipeline that could detect faces in an image and classify whether each face was wearing a mask. To do this, I combined a YOLOv3 model (pretrained to detect faces) with a custom classifier trained to recognize masks.&lt;/p&gt;

&lt;p&gt;The workflow was straightforward: given an input image, the YOLOv3 model would identify and draw bounding boxes around the faces. Each detected face would then be cropped and passed to the mask classifier, which predicted “mask” or “no mask” along with a confidence score. Finally, the pipeline overlaid labels on the image to show the results.&lt;/p&gt;

&lt;p&gt;I learned a lot during this process, especially about loading and fine-tuning pretrained models, feature extraction, and how to stitch multiple models together into a single pipeline.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a href="https://pyimagesearch.com/" rel="noopener noreferrer"&gt;PyImageSearch&lt;/a&gt; by Adrian Rosebrock. Many tutorials there helped me build this!&lt;/p&gt;

&lt;p&gt;If you’re curious, the repo contains several notebooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PlayAroundWithPretrainModels.ipynb&lt;/code&gt; – experimenting with pretrained models&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TransferLearning-FeatureExtraction.ipynb&lt;/code&gt; – logistic regression on extracted features&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TransferLearning-FineTurning.ipynb&lt;/code&gt; – fine-tuning pretrained model layers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;predict.ipynb&lt;/code&gt; – final pipeline: detection → cropping → classification → annotated output&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;Next, I'll describe how I retrained and ran this project &lt;strong&gt;inside SageMaker Studio&lt;/strong&gt; instead of on my local machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚙️ Running in SageMaker Studio
&lt;/h3&gt;

&lt;p&gt;With my old notebooks ready, I wanted to see how easy it would be to train the same model using AWS SageMaker Studio, instead of my local machine.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🛠 If you haven’t set up SageMaker Studio yet, here’s &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html" rel="noopener noreferrer"&gt;AWS’s quick start guide&lt;/a&gt; — it walks through creating the Studio environment in a few clicks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once my SageMaker Studio was provisioned, the workflow was surprisingly smooth. From the Studio home dashboard, it’s straightforward to launch new compute instances to run Jupyter notebooks or other tools. I started by spinning up an &lt;code&gt;ml.t3.medium&lt;/code&gt; instance, the cheapest option at the time of writing, just to get started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu32aximvaj9gun83w25y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu32aximvaj9gun83w25y.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UI makes it easy to open a terminal or create a new notebook. I opened the terminal to clone my old project repo from GitHub. One thing I quickly realized: my original project didn’t include a &lt;code&gt;requirements.txt&lt;/code&gt; file (lesson learned for the future!). Thankfully, SageMaker’s default environments already come with many common libraries pre-installed, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;numpy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tensorflow&lt;/code&gt; / &lt;code&gt;keras&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scikit-learn&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only extra dependencies I had to install were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;imutils&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;opencv-python&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For OpenCV to work properly, it also needed an additional system package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libgl1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest hiccup I ran into was around dataset preparation: my old notebooks didn’t include clear instructions or scripts to recreate the train/validation/test splits. I had to figure that part out again before training could actually run. The dataset itself has over 10,000 images (but is thankfully only around 20 MB). At first, I tried simply dragging and dropping the dataset into the JupyterLab web interface, but this turned out to be unreliable: not every file transferred, and it took a long time.&lt;/p&gt;
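&lt;p&gt;As a hedged sketch, that missing split step can be recreated with a few lines of standard-library Python. The 80/10/10 ratios, folder names, and &lt;code&gt;.jpg&lt;/code&gt; extension here are my assumptions, not the original notebook’s:&lt;/p&gt;

```python
import random
import shutil
from pathlib import Path

# Hedged reconstruction of the missing split step. The 80/10/10 ratios,
# folder names, and .jpg extension are assumptions, not the original's.

def split_dataset(src_dir, out_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    files = sorted(Path(src_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)  # seeded, so the split is reproducible
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    splits = {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for name, members in splits.items():
        dest = Path(out_dir) / name
        dest.mkdir(parents=True, exist_ok=True)
        for f in members:
            shutil.copy2(f, dest / f.name)
    return {name: len(members) for name, members in splits.items()}
```

&lt;p&gt;Seeding the shuffle keeps the split reproducible across reruns, which matters if you later want to compare training curves apples to apples.&lt;/p&gt;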

&lt;p&gt;From reading the docs and best practices, a better solution (and a common pattern for larger file transfers) was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upload the dataset to an S3 bucket&lt;/li&gt;
&lt;li&gt;Download it from S3 to the notebook instance using the terminal&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Uploading to S3 took about 20 minutes, but copying it down to the notebook instance was much faster, probably under a minute. This workflow felt much cleaner and avoided partial transfers.&lt;/p&gt;

&lt;p&gt;Aside from that, the first notebook &lt;code&gt;TransferLearning-FeatureExtraction.ipynb&lt;/code&gt; ran without any code changes. But I did run into another practical issue: the ml.t3.medium instance didn’t have enough RAM, and the process kept running out of memory, which would crash the kernel and restart the instance.&lt;/p&gt;

&lt;p&gt;The fix was simple: I shut down the notebook instance and upgraded it to an ml.m5d.2xlarge (about 32GB of RAM, roughly what my local dev machine had). After restarting, everything picked up right where it left off: no need to re-clone the repo or redownload the images, though the packages did have to be reinstalled.&lt;/p&gt;

&lt;p&gt;After training my model in the new SageMaker environment, I wanted to compare the training curves to those from my earlier runs a few years ago.&lt;/p&gt;

&lt;p&gt;In the charts below, there are two graphs for each year. That’s because transfer learning involves two rounds of training: first training only the network head, and then fine-tuning the entire model after unfreezing more layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgym1ar55a446nganuom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgym1ar55a446nganuom.png" alt=" " width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwnzvjuysvs2my4wtqbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwnzvjuysvs2my4wtqbo.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the overall accuracy results are similar, I noticed that the training loss and training accuracy curves are much noisier and more sporadic in the recent run.&lt;/p&gt;

&lt;p&gt;From what I’ve read, improvements in data augmentation, optimizer updates, and weight initialization defaults in frameworks like Keras and TensorFlow over the last few years can produce this kind of noisier but potentially more robust training process. If anyone has experience or thoughts on why this might happen, I’d love to hear your perspective in the comments!&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Method 2: Low-Code / No-Code with SageMaker Canvas
&lt;/h2&gt;

&lt;p&gt;For the second approach, I wanted to try AWS SageMaker Canvas — a no-code tool that lets you build machine learning models through a web UI, without writing a single line of code.&lt;/p&gt;

&lt;p&gt;The first step was to prepare my dataset in a format Canvas could use. To do this, I reorganized the images into labeled folders (e.g., &lt;code&gt;mask/&lt;/code&gt; and &lt;code&gt;no_mask/&lt;/code&gt;). When you import data in Canvas, it can automatically use the folder names as class labels. I then uploaded this new dataset structure into an S3 bucket.&lt;/p&gt;
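&lt;p&gt;The reorganization step is simple enough to sketch. This is a hypothetical version that assumes the label can be read from the filename prefix (e.g. files named &lt;code&gt;no_mask_0001.jpg&lt;/code&gt;); adapt the rule to however your own dataset encodes labels:&lt;/p&gt;

```python
import shutil
from pathlib import Path

# Hypothetical sketch of the folder reorganization for Canvas. It assumes
# the label is encoded in the filename prefix ("no_mask_..." vs "mask_...");
# adjust the rule to match how your own dataset encodes labels.

def organize_by_label(src_dir, dst_dir):
    for img in Path(src_dir).glob("*.jpg"):
        label = "no_mask" if img.name.startswith("no_mask") else "mask"
        target = Path(dst_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(img, target / img.name)  # copy, leaving the flat dump intact
```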

&lt;p&gt;In Canvas, creating the dataset is straightforward: you create a new dataset and point it at your S3 bucket location. Once imported, you can see the list of images and labels Canvas detected.&lt;/p&gt;




&lt;h3&gt;
  
  
  🏗 Training the model
&lt;/h3&gt;

&lt;p&gt;I kicked off a &lt;strong&gt;standard training job&lt;/strong&gt; (since the quick mode couldn't handle the size of my dataset). Canvas estimated it might take 3–5 hours, but in reality it completed in under 2 hours — maybe even less than one.&lt;/p&gt;

&lt;p&gt;The best part? It was truly one-click training: Canvas doesn’t ask you to choose architectures or tune hyperparameters. Instead, it quietly evaluates multiple candidate models behind the scenes, though it doesn’t disclose exactly which models it tried or what metrics guided the selection.&lt;/p&gt;




&lt;h3&gt;
  
  
  📊 Model evaluation &amp;amp; explainability
&lt;/h3&gt;

&lt;p&gt;For evaluation, Canvas automatically showed me per-label accuracy so I could see which class performed better, along with actual examples of images it got right or wrong. It also generated heatmaps (using Class Activation Maps) that highlighted where the model focused when making decisions, and included a confusion matrix to visualize where it confused “masked” vs “unmasked.” All of this appeared right after training finished, without needing to write any visualization code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F670nnwgrpbocecfkqhh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F670nnwgrpbocecfkqhh5.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftohw89vwg10ezk2vu4l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftohw89vwg10ezk2vu4l6.png" alt=" " width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cij2ob996yrj8htxpg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cij2ob996yrj8htxpg4.png" alt=" " width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚡ Making predictions
&lt;/h3&gt;

&lt;p&gt;When it came time to test the model, Canvas offered two options: upload a single image to get an instant prediction, or run a batch prediction over multiple images at once. I tried both, but unfortunately the outputs either came back empty or had “FAILED” values in the CSV results, so I decided to skip ahead and deploy the model as an inference endpoint instead.&lt;/p&gt;

&lt;p&gt;With just a few clicks, Canvas can deploy your trained model to an endpoint you can call via API, and I did that so I could finish my evaluation outside of the Canvas UI.&lt;/p&gt;

&lt;p&gt;Starting from the evaluation code in my fine-tuning notebook, I adapted a similar function to measure the accuracy of this model’s predictions.&lt;/p&gt;
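&lt;p&gt;The shape of that evaluation loop looks roughly like this. Here &lt;code&gt;predict_fn&lt;/code&gt; is a hypothetical stand-in for whatever invokes the deployed Canvas endpoint (e.g. a boto3 &lt;code&gt;sagemaker-runtime&lt;/code&gt; call); this is a sketch, not the code I actually ran:&lt;/p&gt;

```python
from collections import Counter

# Sketch of the adapted evaluation loop. `predict_fn` is a hypothetical
# stand-in for the call to the deployed Canvas endpoint; a real version
# would invoke it via boto3's sagemaker-runtime client.

def evaluate(samples, predict_fn):
    """samples: iterable of (image_path, true_label) pairs.
    Returns (overall accuracy, per-class accuracy dict)."""
    total = correct = 0
    per_class_total = Counter()
    per_class_correct = Counter()
    for path, truth in samples:
        total += 1
        per_class_total[truth] += 1
        if predict_fn(path) == truth:
            correct += 1
            per_class_correct[truth] += 1
    per_class = {
        label: per_class_correct[label] / per_class_total[label]
        for label in per_class_total
    }
    return correct / total, per_class
```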

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmm6nduhrioe56n64xbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmm6nduhrioe56n64xbe.png" alt=" " width="585" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results were surprisingly good: Canvas’s model ended up with slightly better accuracy than my manually trained TensorFlow model, though batch processing took a bit longer overall. It’s worth noting that both models ran inference on the same instance type, &lt;code&gt;ml.m5d.2xlarge&lt;/code&gt;, so the comparison is fair in terms of hardware. The classification report above shows the final accuracy and per-class metrics.&lt;/p&gt;

&lt;p&gt;In the end, SageMaker Canvas impressed me: it handled training, visualization, and deployment with almost no code. While I did run into some quirks with the batch prediction UI, the overall experience was very beginner-friendly — and the final model quality was competitive with a hand-crafted TensorFlow pipeline (granted my model is 5 years old).&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Method 3: Fully Managed Pre-Trained Service (Rekognition Custom Labels)
&lt;/h2&gt;

&lt;p&gt;For the last approach, I wanted to explore Amazon Rekognition’s &lt;strong&gt;Custom Labels&lt;/strong&gt; feature, which lets you train your own image classifier on a custom dataset — still without writing code, but built directly into Rekognition’s console rather than SageMaker. The interface makes each step of developing your model straightforward and streamlined.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfko9h8pwa2l3prwdgif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfko9h8pwa2l3prwdgif.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The setup was familiar: I uploaded my dataset to an S3 bucket, using labeled folders (&lt;code&gt;masked/&lt;/code&gt; and &lt;code&gt;unmasked/&lt;/code&gt;) so Rekognition could automatically detect the classes. After confirming the dataset, training was supposed to be as simple as clicking a button and waiting for it to finish.&lt;/p&gt;

&lt;p&gt;However, the training failed on my first attempt. After digging into the documentation, I realized Rekognition requires all images in the training and test datasets to meet a minimum resolution. My original dataset included images smaller than that threshold. To fix this, I wrote a quick script to resize all images to an acceptable resolution, re-uploaded the updated dataset to S3, and restarted the training job.&lt;/p&gt;
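&lt;p&gt;My resize script looked roughly like the following. The 64-pixel floor is an assumption on my part (check Rekognition’s current documented minimums), and Pillow is just one convenient way to do the resizing:&lt;/p&gt;

```python
from pathlib import Path

from PIL import Image  # Pillow is an assumption; any image library works

MIN_SIDE = 64  # assumed floor; verify against Rekognition's current limits

def upscale_small_images(src_dir, dst_dir):
    """Copy every image, upscaling any whose shorter side is under MIN_SIDE."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in Path(src_dir).glob("*.jpg"):
        with Image.open(img_path) as im:
            w, h = im.size
            scale = max(1.0, MIN_SIDE / min(w, h))  # 1.0 means already big enough
            if scale > 1.0:
                im = im.resize((round(w * scale), round(h * scale)))
            im.save(dst / img_path.name)
```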

&lt;p&gt;In hindsight, this might explain why the prediction feature in Canvas also struggled with the same dataset, although it’s interesting that the inference endpoint created by Canvas worked fine with those smaller images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zmddx6js0rlxiv931hs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zmddx6js0rlxiv931hs.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the training completed (which took about an hour), the results ended up being pretty comparable to what I got with SageMaker Canvas, and noticeably better than my old YOLOv3-based code.  &lt;/p&gt;

&lt;p&gt;One important limitation, though: unlike Canvas, Rekognition Custom Labels doesn’t let you register and download the raw model artifact. Instead, you’re fully dependent on calling Rekognition’s API for inference. That makes the solution less portable if you ever want to run the model outside AWS. On the plus side, this also means it’s incredibly quick to get started: after training finishes, you can deploy and start making predictions right away. Overall, this makes Rekognition Custom Labels a strong option for &lt;strong&gt;proof-of-concept projects&lt;/strong&gt; or when you need to get something running with minimal setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  💰 Cost Analysis
&lt;/h2&gt;

&lt;p&gt;While testing each method, I kept track of the costs I saw in my AWS billing dashboard. Running everything manually through the Jupyter notebook (inside SageMaker Studio) ended up costing me less than &lt;strong&gt;$12&lt;/strong&gt; total — even after upgrading to a more expensive instance for training.&lt;/p&gt;

&lt;p&gt;In contrast, SageMaker Canvas cost quite a bit more: about &lt;strong&gt;$49&lt;/strong&gt;. To be fair, a lot of that cost probably came from my repeated attempts to run batch predictions, which ultimately didn’t work but still counted as billed time. If things had run smoothly, I’d estimate the cost at &lt;strong&gt;$10-$20&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rekognition Custom Labels was by far the cheapest in my experiment: I was only charged &lt;strong&gt;$7.90&lt;/strong&gt;. It’s worth noting, though, that this only covers training costs — &lt;strong&gt;not&lt;/strong&gt; the cost of hosting the model or running real-time inference in production. I’m also curious how well Rekognition pricing scales over time as usage increases.&lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ Final Review &amp;amp; Comparison
&lt;/h2&gt;

&lt;p&gt;Here’s how the three approaches stack up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Control &amp;amp; Flexibility&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;th&gt;Cost in Test&lt;/th&gt;
&lt;th&gt;Portability&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classic Jupyter + TensorFlow&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;~$12&lt;/td&gt;
&lt;td&gt;Can export / host anywhere&lt;/td&gt;
&lt;td&gt;Most setup &amp;amp; coding required; fully customizable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SageMaker Canvas&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;~$49 (probably ~$10-$20 without the failed retries)&lt;/td&gt;
&lt;td&gt;Can export model artifact&lt;/td&gt;
&lt;td&gt;Great built-in visualizations; had issues with batch predictions; higher cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rekognition Custom Labels&lt;/td&gt;
&lt;td&gt;⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;~$8&lt;/td&gt;
&lt;td&gt;Must use Rekognition API&lt;/td&gt;
&lt;td&gt;Fastest setup; lowest upfront cost; can't download model; great for proof of concept&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;In the end, each option had its place:&lt;br&gt;&lt;br&gt;
If you want &lt;strong&gt;full control and portability&lt;/strong&gt;, running your own TensorFlow notebooks (even inside SageMaker Studio) still feels best.&lt;br&gt;&lt;br&gt;
If you prefer &lt;strong&gt;no-code training and easy visualization tools&lt;/strong&gt;, Canvas makes it remarkably simple to build, analyze, and deploy models — though at a higher cost and occasional quirks.&lt;br&gt;&lt;br&gt;
And if you just need to &lt;strong&gt;get something working fast&lt;/strong&gt;, Rekognition Custom Labels is incredibly quick to set up and cheap to run — as long as you’re okay relying on AWS’s API for hosting.&lt;/p&gt;

&lt;p&gt;Overall, revisiting this project showed me that today’s cloud tools can save a huge amount of time — but there are still trade-offs in cost, control, and portability. In the next article, I’ll look at deploying these models and providing a usable live demo so you can see them in action.&lt;br&gt;
I’d love to hear if you’ve tried similar experiments, or what your experience has been — drop a comment below!&lt;/p&gt;




</description>
      <category>computervision</category>
      <category>aws</category>
      <category>tensorflow</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
