Forem: Akan

Building an Adaptive NER System with MLOps: A Complete Guide (Production)

Akan — Fri, 13 Mar 2026 04:37:54 +0000

How we took a transaction classification system from concept to a self-sustaining production pipeline with GitHub Actions that runs 24/7 without human intervention

In the previous guide we discussed how to build this system locally, but here we will go a step further and actually build for production.

I'll walk you through the journey of building and productionizing an enhanced Named Entity Recognition (NER) system that:

✅ Generates synthetic data automatically every day
✅ Trains ML models with hybrid rule-based + machine learning approaches
✅ Deploys interactive reports to GitHub Pages automatically
✅ Runs about 2.8x faster with intelligent caching strategies
✅ Costs $0/month using GitHub Actions free tier

Live Demo: https://akanimohod19a.github.io/productionizing_NER/

The Result: A production-grade ML pipeline that processes 1,000 transactions, trains a model, and publishes a beautiful report — all in under 5 minutes, completely autonomously.

The Problem We Solved
Initial POC: What We Started With
Production Challenges We Faced
Solution 1: Implementing Intelligent Caching
Solution 2: Fixing the Invalid Date Bug
Solution 3: Dynamic Data Generation in CI/CD
Solution 4: Comprehensive Testing Strategy
Architecture Deep Dive
Performance Metrics: Before vs After
Lessons Learned
What's Next

The Problem We Solved

Business Context

Financial institutions process millions of free-text transaction descriptions daily. They look like this:

"walmart grocery shopping"
"cvs pharmacy prescription pickup"  
"uber ride to downtown"
"payment to acme corp inv-2024-001"

The Challenge:

Manual categorization is impossible at scale
Rule-based systems miss new patterns
Traditional ML requires constant retraining
No visibility into model performance
Reports are static and outdated

What We Built

A self-improving classification system that:

Automatically generates realistic test data
Combines rule-based and ML classification
Discovers new categories through clustering
Tracks everything with MLflow
Publishes interactive reports to the web
Runs completely autonomously via GitHub Actions

And it costs nothing to run!

Initial POC: What We Started With

The Original Implementation

Our proof-of-concept had three core components:

1. Rule-Based Classifier

# models/keyword_rules.yaml
categories:
  Healthcare:
    keywords: [pharmacy, doctor, hospital, medical]
    weight: 1.5

  Groceries:
    keywords: [walmart, grocery, supermarket]
    weight: 1.0

Coverage: 68.5% of transactions classified instantly.
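To make the rule-based step concrete, here is a minimal sketch of weighted keyword matching. The `RULES` dict mirrors the YAML above, and `keyword_match` is the name used in the unit tests later in this post. The scoring formula (weighted share of matched keywords) and the "Unknown" fallback are assumptions for illustration, not necessarily what the project does.

```python
# Hypothetical sketch: rules are inlined instead of loaded from keyword_rules.yaml.
RULES = {
    "Healthcare": {"keywords": ["pharmacy", "doctor", "hospital", "medical"], "weight": 1.5},
    "Groceries": {"keywords": ["walmart", "grocery", "supermarket"], "weight": 1.0},
}


def keyword_match(narration, rules=RULES):
    """Return (category, confidence) for the best-scoring category.

    Score = weight * (matched keywords / total keywords), capped at 1.0.
    This formula is an assumption made for illustration.
    """
    tokens = set(narration.lower().split())
    best_category, best_score = "Unknown", 0.0
    for category, rule in rules.items():
        hits = sum(1 for kw in rule["keywords"] if kw in tokens)
        score = min(1.0, rule["weight"] * hits / len(rule["keywords"]))
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


print(keyword_match("cvs pharmacy prescription pickup"))
```

Anything that matches no keyword falls through as "Unknown", which is the slice the ML step then has to handle.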

2. ML Enhancement

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Amount-weighted training: larger transactions count more, on a log scale.
# X: TF-IDF features of the narrations (assumed), y: category labels.
sample_weights = np.log1p(df['amount'].abs())
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X, y, sample_weight=sample_weights)

Improvement: +22.7% coverage (total: 91.2%)
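The `log1p` weighting is worth a closer look, because it decides how much a large transaction outweighs a small one during training. The sketch below uses only the standard library and three sample amounts taken from the metadata example later in this post. It is an illustration of the weighting formula, not the training code itself.

```python
import math


def amount_weight(amount):
    """Training weight for one transaction: log(1 + |amount|)."""
    return math.log1p(abs(amount))


# Sample amounts (min, mean, max) from the metadata example in this post.
amounts = [5.50, 87.43, 1200.00]
weights = [amount_weight(a) for a in amounts]

# Raw amounts differ by roughly 218x, but the weights differ by less than 4x.
raw_ratio = amounts[-1] / amounts[0]
weight_ratio = weights[-1] / weights[0]
print(f"raw ratio: {raw_ratio:.0f}x, weight ratio: {weight_ratio:.1f}x")
```

The log scale keeps large payments influential without letting a single outlier dominate the fit.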

3. Unsupervised Discovery

from sklearn.cluster import DBSCAN

# Find patterns in unknown transactions
clustering = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
labels = clustering.fit_predict(X)

# Discovered: "Insurance" category
# From: ["geico auto", "state farm policy", "allstate premium"]

POC Results

Metric	Value
Classification Coverage	91.2%
Processing Speed	0.8ms/transaction
Amount-Weighted Accuracy	96.8%

The POC worked, but it was manual, slow, and not production-ready. So I set out to make it run autonomously with minimal human intervention, and that came with its own challenges.

Production Challenges We Faced

Challenge 1: Long Build Times

Problem: Initially, each GitHub Actions run took 12+ minutes.

├─ Install Python packages:     4m 30s
├─ Install R packages:          6m 15s  
├─ Run tests:                   1m 20s
├─ Generate report:             2m 45s
└─ Total:                       12m 50s

Why it mattered: Slow feedback loops = slower development.

Challenge 2: Invalid Timestamps 📅

Problem: The published reports showed "Invalid Date" on the dashboard because of a timestamp parsing issue.

// Dashboard tried to parse:
timestamp: "20260313_143522"

// But JavaScript Date() expected:
timestamp: "2026-03-13T14:35:22"

Impact: Professional dashboard looked broken.

Challenge 3: Stale Test Data

Problem: Tests ran against old, committed CSV files. Since the workflow starts with a data-generation step, the whole system should be tested against that freshly generated set of records.
Note that this only arises because we test with synthetic records. In a real deployment, you would point the pipeline at the actual data source.

# Tests always used this same file:
tests/fixtures/sample_transactions.csv

# But real pipeline generated fresh data daily!

Risk: Tests passing but production failing.

Challenge 4: No Visibility

Problem: When tests failed, we had to dig through logs.

FAILED tests/test_classifier.py::test_groceries_classification
ValueError: not enough values to unpack (expected 3, got 2)

Frustration: Cryptic errors, no clear fix.

So, I researched solutions.

Solution 1: Implementing Intelligent Caching

The Strategy

We implemented a multi-layer caching strategy to cache everything that doesn't change between runs.

Layer 1: Python Package Caching

Before:

- name: Install dependencies
  run: pip install -r requirements.txt
  # Time: ~4 minutes EVERY run

After:

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.9'
    cache: 'pip'  # ← Built-in pip caching

- name: Cache Python packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-v1-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-v1-
      ${{ runner.os }}-pip-

How it works:

First run: Downloads and caches packages (4 min)
Subsequent runs: Restores from cache (15 sec)
Only re-downloads if requirements.txt changes

Result: 3.75 minutes saved per run!

Layer 2: R Package Caching

R packages are huge and take forever to compile.

Before:

- name: Install R dependencies
  run: |
    Rscript -e 'install.packages(c("tidyverse", "plotly", "DT", ...))'
  # Time: ~6 minutes

After:

- name: Cache R packages
  uses: actions/cache@v4
  with:
    path: ${{ env.R_LIBS_USER }}
    key: ${{ runner.os }}-r-v1-${{ hashFiles('DESCRIPTION') }}

- name: Install R dependencies
  uses: r-lib/actions/setup-r-dependencies@v2
  with:
    packages: |
      any::tidyverse
      any::knitr
      any::rmarkdown

Why this is brilliant:

r-lib/actions is maintained by the r-lib team (backed by Posit, formerly RStudio)
Handles OS-specific compilation
Caches binary packages, not source

Result: 5.5 minutes saved!

Layer 3: Pytest Cache

Pytest keeps run metadata in .pytest_cache that can be reused between runs.

Implementation:

- name: Cache pytest
  uses: actions/cache@v4
  with:
    path: .pytest_cache
    key: ${{ runner.os }}-pytest-v1-${{ hashFiles('tests/**/*.py') }}

- name: Run tests
  run: pytest tests/ -v --cov=src/python --cov-report=xml  # writes coverage.xml for the Codecov step

What gets cached:

Last-failed test records (used by --lf and --ff)
Collected test node IDs
Stepwise state (used by --sw)

Result: 30 seconds saved, plus faster local testing!

Layer 4: MLflow Artifacts

ML experiments generate tons of metadata.

- name: Cache MLflow artifacts
  uses: actions/cache@v4
  with:
    path: mlruns
    key: ${{ runner.os }}-mlflow-v1-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-mlflow-v1-

What's cached:

Model parameters
Metrics history
Artifact metadata

Benefit: Faster MLflow UI loading, experiment comparisons.

The Cache Strategy Matrix

Layer	Size	Build Time	Cache Hit Rate	Time Saved
Python packages	200 MB	4m 30s	95%	4m 15s
R packages	800 MB	6m 15s	90%	5m 30s
Pytest cache	5 MB	30s	85%	25s
MLflow artifacts	50 MB	-	80%	-

Total Time Saved: ~10 minutes per run!

Cache Invalidation Strategy

We use a version prefix in cache keys:

env:
  CACHE_VERSION: v1  # Increment to bust all caches

key: ${{ runner.os }}-pip-${{ env.CACHE_VERSION }}-${{ hashFiles('requirements.txt') }}

When to bump version:

Major dependency upgrade
OS image change
Cache corruption suspected

Pro tip: Use restore-keys for partial cache hits:

restore-keys: |
  ${{ runner.os }}-pip-v1-
  ${{ runner.os }}-pip-

This provides a fallback hierarchy:

Try exact match (requirements.txt hash)
Try any v1 cache
Try any pip cache

Result: Cache hit rate increased from 60% to 95%!
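The fallback hierarchy is easier to reason about with a small simulation. The sketch below models the lookup as described: exact key first, then each restore key as a prefix in order, with the most recently created match winning. `resolve_cache` and the sample keys are hypothetical, and the model ignores details of the real action such as branch scoping.

```python
def resolve_cache(primary_key, restore_keys, existing):
    """Pick a cache entry following the key / restore-keys lookup order.

    existing: list of (key, created_at) tuples for caches already saved.
    Returns (matched_key, exact_hit); matched_key is None on a full miss.
    """
    keys = {key for key, _ in existing}
    if primary_key in keys:
        return primary_key, True
    for prefix in restore_keys:  # tried in the order listed
        candidates = [(created, key) for key, created in existing if key.startswith(prefix)]
        if candidates:
            # Among prefix matches, the most recently created cache wins.
            return max(candidates)[1], False
    return None, False


existing = [
    ("Linux-pip-v1-aaa111", 1),
    ("Linux-pip-v1-bbb222", 2),
    ("Linux-pip-v0-old000", 3),
]
restore = ["Linux-pip-v1-", "Linux-pip-"]
print(resolve_cache("Linux-pip-v1-ccc333", restore, existing))
```

A prefix match is only a partial hit: the `cache-hit` output stays false, so the install step still runs, but pip finds most wheels already downloaded.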

Solution 2: Fixing the Invalid Date Bug

The Root Cause

Our dashboard used JavaScript to parse timestamps:

// What we were generating:
{
  "timestamp": "20260313_143522"
}

// What JavaScript Date() expected:
{
  "timestamp": "2026-03-13T14:35:22.000Z"
}

The Investigation

Step 1: Check the manifest generation

# Original (broken) code:
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')
# Result: "20260313_143522"

reports.append({
    'timestamp': timestamp_str  # ❌ Not ISO format!
})

Step 2: Test in browser console

new Date("20260313_143522")
// Invalid Date

new Date("2026-03-13T14:35:22")
// Fri Mar 13 2026 14:35:22 GMT+0000 (UTC) ✓

The Fix

Updated manifest generation:

from datetime import datetime

# Parse the filename timestamp
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')

try:
    # Format: YYYYMMDD_HHMMSS
    dt = datetime.strptime(timestamp_str, '%Y%m%d_%H%M%S')

    # Convert to ISO 8601 format
    iso_timestamp = dt.isoformat()  # "2026-03-13T14:35:22"
except ValueError:
    # Fallback to current time if parsing fails
    iso_timestamp = datetime.now().isoformat()

reports.append({
    'id': timestamp_str,
    'timestamp': iso_timestamp,  # ✓ ISO format
    'url': f'reports/{timestamp_str}/{report_file.name}'
})
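For completeness, here is a self-contained sketch of how the whole manifest could be built from a list of report files. `manifest_entry` and `build_manifest` are hypothetical names. Unlike the snippet above, this version skips files whose names do not parse instead of falling back to the current time, which is a design choice made here so that a stray file cannot show up as the "latest" report.

```python
import json
from datetime import datetime
from pathlib import Path

PREFIX = "assessment_report_"


def manifest_entry(report_file):
    """Build one manifest record from a report path, or None if the name is malformed."""
    stamp = report_file.stem.replace(PREFIX, "")
    try:
        parsed = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
    except ValueError:
        return None  # skip rather than invent a timestamp
    return {
        "id": stamp,
        "timestamp": parsed.isoformat(),
        "url": f"reports/{stamp}/{report_file.name}",
    }


def build_manifest(report_files):
    """Return manifest records sorted newest first."""
    entries = [e for e in map(manifest_entry, report_files) if e is not None]
    return sorted(entries, key=lambda e: e["timestamp"], reverse=True)


files = [
    Path("assessment_report_20260312_020000.html"),
    Path("assessment_report_20260313_143522.html"),
    Path("assessment_report_draft.html"),  # malformed name: ignored
]
print(json.dumps(build_manifest(files), indent=2))
```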

The Result

Before:

┌──────────┐
│ Invalid  │
│   Date   │
└──────────┘

After:

┌──────────┐
│  Mar 13  │
│   2026   │
└──────────┘

JavaScript Enhancement

We also improved the date formatting on the dashboard:

const date = new Date(report.timestamp);

// Format for display
const formattedDate = date.toLocaleString('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  hour: '2-digit',
  minute: '2-digit'
});
// "March 13, 2026, 02:35 PM"

// Format for stats card
const shortDate = date.toLocaleDateString('en-US', {
  month: 'short',
  day: 'numeric'
});
// "Mar 13"

Key Lesson: Always use ISO 8601 format for timestamps in APIs and data interchange!

Solution 3: Dynamic Data Generation in CI/CD

The Problem with Static Test Data

Our original workflow used committed CSV files:

# Old workflow
- name: Train model
  run: python src/python/train_model.py data/sample_transactions.csv
  #                                      ↑ Static file from repo

Issues:

Tests always ran against same data
Real pipeline generated fresh data daily
No way to test edge cases
Stale data != production data

The Solution: Generate Data in CI/CD

We made data generation the first step of the pipeline:

jobs:
  # Job 1: Generate fresh data
  generate-data:
    runs-on: ubuntu-latest
    outputs:
      data_file: ${{ steps.generate.outputs.data_file }}
      timestamp: ${{ steps.generate.outputs.timestamp }}

    steps:
      - name: Generate synthetic transaction data
        id: generate
        run: |
          TIMESTAMP=$(date +%Y%m%d_%H%M%S)
          DATA_SIZE=${{ github.event.inputs.data_size || '1000' }}
          DATA_FILE="data/transactions_${TIMESTAMP}.csv"

          python scripts/generate_sample_data.py \
            --size ${DATA_SIZE} \
            --output ${DATA_FILE}

          # Pass to next jobs
          echo "data_file=${DATA_FILE}" >> $GITHUB_OUTPUT
          echo "timestamp=${TIMESTAMP}" >> $GITHUB_OUTPUT
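The two `echo ... >> $GITHUB_OUTPUT` lines are what make the values visible to later jobs. If the generation step were written in Python instead of shell, the same hand-off could look roughly like the sketch below. `write_step_outputs` and `data_outputs` are hypothetical helpers, not part of the project.

```python
import os
from datetime import datetime


def write_step_outputs(outputs, path=None):
    """Append name=value lines to the file named by $GITHUB_OUTPUT."""
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as handle:
        for name, value in outputs.items():
            handle.write(f"{name}={value}\n")


def data_outputs(now=None):
    """Build the same two outputs as the shell step: data_file and timestamp."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return {
        "data_file": f"data/transactions_{timestamp}.csv",
        "timestamp": timestamp,
    }
```

Inside a workflow step, calling `write_step_outputs(data_outputs())` would expose both values under `steps.<id>.outputs`, just like the shell version.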

Connecting Jobs with Artifacts

Upload from generator:

- name: Upload data artifact
  uses: actions/upload-artifact@v4
  with:
    name: transaction-data-${{ steps.generate.outputs.timestamp }}
    path: |
      ${{ steps.generate.outputs.data_file }}
      data/*_metadata.json
    retention-days: 7

Download in training job:

train-model:
  needs: [generate-data, test]  # Wait for data generation and tests

  steps:
    - name: Download transaction data
      uses: actions/download-artifact@v4
      with:
        name: transaction-data-${{ needs.generate-data.outputs.timestamp }}
        path: data/

    - name: Train NER classifier
      run: |
        DATA_FILE="${{ needs.generate-data.outputs.data_file }}"
        python src/python/train_model.py ${DATA_FILE}

Benefits of Dynamic Data

1. Fresh Data Every Run

# Different data every day
2026-03-13: 1000 transactions with current patterns
2026-03-14: 1000 NEW transactions with NEW patterns

2. Configurable Size

workflow_dispatch:
  inputs:
    data_size:
      description: 'Number of transactions'
      default: '1000'

Can test with:

100 for quick smoke tests
1,000 for normal runs
10,000 for stress tests

3. Realistic Distribution

# Generator creates a realistic mix (share of transactions per category):
category_shares = {
  'Groceries': 0.25,
  'Restaurants': 0.18,
  'Transportation': 0.15,
  'Healthcare': 0.10,
  'Unknown': 0.05,
  # ... etc
}
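A minimal sketch of how such a mix can be sampled, using only the standard library. The five shares come from the mix above. The remaining 27% is lumped into a hypothetical "Other" bucket because the full category list is not shown, and `sample_categories` is an illustrative name.

```python
import random
from collections import Counter

# Shares taken from the mix above; "Other" is a hypothetical catch-all
# for the categories that are elided there.
CATEGORY_SHARES = {
    "Groceries": 0.25,
    "Restaurants": 0.18,
    "Transportation": 0.15,
    "Healthcare": 0.10,
    "Unknown": 0.05,
    "Other": 0.27,
}


def sample_categories(n, seed=None):
    """Draw n category labels in proportion to CATEGORY_SHARES."""
    rng = random.Random(seed)
    names = list(CATEGORY_SHARES)
    shares = list(CATEGORY_SHARES.values())
    return rng.choices(names, weights=shares, k=n)


counts = Counter(sample_categories(1000, seed=42))
print(counts.most_common())
```

Because the draw is random, the observed counts only approximate the target shares, which is exactly why assertions on exact category counts tend to be fragile.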

4. Metadata Tracking

{
  "generated_at": "2026-03-13T14:35:22",
  "n_transactions": 1000,
  "category_distribution": {...},
  "amount_stats": {
    "min": 5.50,
    "max": 1200.00,
    "mean": 87.43
  }
}
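A metadata file like this can be produced with a few lines of standard-library Python. The sketch below assumes each transaction is a dict with `category` and `amount` keys; `build_metadata` is a hypothetical helper, and it expects a non-empty batch.

```python
import json
from collections import Counter
from datetime import datetime
from statistics import mean


def build_metadata(transactions, generated_at=None):
    """Summarise a non-empty batch of transactions (dicts with 'category' and 'amount')."""
    amounts = [t["amount"] for t in transactions]
    generated_at = generated_at or datetime.now()
    return {
        "generated_at": generated_at.isoformat(timespec="seconds"),
        "n_transactions": len(transactions),
        "category_distribution": dict(Counter(t["category"] for t in transactions)),
        "amount_stats": {
            "min": min(amounts),
            "max": max(amounts),
            "mean": round(mean(amounts), 2),
        },
    }


batch = [
    {"category": "Groceries", "amount": 125.50},
    {"category": "Healthcare", "amount": 45.00},
    {"category": "Groceries", "amount": 28.00},
]
print(json.dumps(build_metadata(batch), indent=2))
```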

The Data Generator

Our synthetic data generator creates realistic transactions:

import numpy as np

class TransactionGenerator:
    def __init__(self, seed=None):
        if seed is not None:  # seed=0 is a valid seed
            np.random.seed(seed)

        self.templates = {
            'Groceries': {
                'merchants': ['walmart', 'costco', 'whole foods'],
                'items': ['grocery', 'bread milk eggs', 'produce'],
                'amount_range': (30, 250),
                'frequency': 0.25
            },
            # ... 8 categories total
        }

    def generate_narration(self, category):
        merchant = np.random.choice(self.templates[category]['merchants'])
        item = np.random.choice(self.templates[category]['items'])

        # Different patterns
        patterns = [
            f"{merchant} {item}",
            f"purchase at {merchant} for {item}",
            f"{item} at {merchant}"
        ]

        narration = np.random.choice(patterns)

        # Sometimes add reference number
        if np.random.random() > 0.7:
            ref = np.random.randint(1000, 9999)
            narration += f" ref#{ref}"

        return narration

Example output:

walmart grocery shopping ref#4521
purchase at cvs pharmacy for prescription
uber ride downtown
coffee at starbucks

Impact on Testing

Before: Tests always passed with static data
After: Tests catch real edge cases

Example bug we caught:

# Bug: Assumed 'amount' always present
def classify(df):
    return df['amount'].abs()  # ❌ Fails if amount is missing

# Fix: Handle missing amounts
def classify(df):
    if 'amount' not in df.columns:
        df['amount'] = 0
    return df['amount'].abs()  # ✓ Works

This bug only appeared with generated data that had missing amounts!

Solution 4: Comprehensive Testing Strategy

The Testing Pyramid

We implemented a complete testing strategy:

           /\
          /  \
         /E2E \          3 tests (12%)
        /______\
       /        \
      /Integration\      7 tests (28%)
     /____________\
    /              \
   /  Unit Tests    \    15 tests (60%)
  /__________________\

Layer 1: Unit Tests

Test individual components in isolation:

# tests/test_classifier.py
class TestKeywordMatching:
    def test_healthcare_classification(self, classifier):
        """Test classification of healthcare transactions."""
        category, confidence = classifier.keyword_match(
            "cvs pharmacy prescription pickup"
        )

        assert category == "Healthcare"
        assert confidence > 0.3

Coverage:

Rule-based classification ✓
ML feature extraction ✓
Confidence scoring ✓
Data generation ✓

Why this matters:

Fast feedback (< 1 second)
Pinpoints exact failures
Easy to debug

Layer 2: Integration Tests

Test components working together:

# tests/test_pipeline.py
def test_full_pipeline(tmp_path):
    """Test complete pipeline execution."""
    # Generate data
    generator = TransactionGenerator(seed=42)
    df = generator.generate_transactions(100)

    # Classify
    classifier = AdaptiveNERClassifier()
    results = classifier.classify_batch(df)

    # Verify
    assert len(results) >= 100
    unknown_rate = (results['category'] == 'Unknown').sum() / len(results)
    assert unknown_rate < 0.9  # Less than 90% unknown

What we test:

Data → Classifier → Results flow
File I/O operations
MLflow tracking integration
Report generation end-to-end

Layer 3: End-to-End Tests

Test the entire workflow as users would:

import subprocess
from pathlib import Path

def test_github_actions_simulation():
    """Simulate the complete GitHub Actions workflow."""
    # Step 1: Generate data (check=True fails the test if a step errors)
    subprocess.run([
        'python', 'scripts/generate_sample_data.py',
        '--size', '100',
        '--output', 'data/test.csv'
    ], check=True)

    # Step 2: Train model
    subprocess.run([
        'python', 'src/python/train_model.py',
        'data/test.csv'
    ], check=True)

    # Step 3: Generate report
    subprocess.run([
        'Rscript', '-e',
        "rmarkdown::render('reports/assessment_report.Rmd')"
    ], check=True)

    # Verify outputs exist
    assert Path('models/ner_classifier.pkl').exists()
    assert Path('reports/assessment_report.html').exists()

The Test Fixture Strategy

We use pytest fixtures for shared test data:

# tests/conftest.py
import pandas as pd
import pytest
@pytest.fixture
def classifier():
    """Reusable classifier instance."""
    return AdaptiveNERClassifier(rules_path="models/keyword_rules.yaml")

@pytest.fixture
def sample_transactions():
    """Reusable sample data."""
    return pd.DataFrame({
        'narration': [
            'cvs pharmacy prescription',
            'walmart grocery shopping',
            'uber ride downtown'
        ],
        'amount': [45.00, 125.50, 28.00]
    })

Benefits:

No code duplication
Consistent test data
Easy to update globally

Fixing Flaky Tests

Problem: Tests failed intermittently

# Flaky test (bad)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert len(df) == 100  # ❌ Sometimes 105 due to unknowns

Solution: Make assertions flexible

# Robust test (good)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert 100 <= len(df) <= 110  # ✓ Accounts for ~5% unknowns
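Widening the range works, but it can also hide genuine off-by-N bugs. Another option, sketched below, is to make the generator deterministic under a seed so that repeated runs agree exactly. `ToyGenerator` is a hypothetical stand-in, not the project's `TransactionGenerator`, and its "up to 10% extra unknowns" rule is an assumption based on the comment above.

```python
import random


class ToyGenerator:
    """Stand-in generator: n base rows plus a random number of extra 'Unknown' rows."""

    def __init__(self, seed=None):
        # A private Random instance avoids depending on global RNG state.
        self.rng = random.Random(seed)

    def generate_transactions(self, n):
        rows = [{"category": "Known"} for _ in range(n)]
        extra = self.rng.randint(0, n // 10)  # up to 10% extra unknowns
        rows += [{"category": "Unknown"} for _ in range(extra)]
        return rows


def test_generate_transactions_is_reproducible():
    first = ToyGenerator(seed=42).generate_transactions(100)
    second = ToyGenerator(seed=42).generate_transactions(100)
    assert len(first) == len(second)  # same seed, same size
    assert 100 <= len(first) <= 110  # still within the documented bounds
```

With a fixed seed the test is stable from run to run, while the unseeded daily pipeline still produces fresh data.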

Test Coverage Goals

We aimed for 80% coverage on critical paths:

pytest tests/ --cov=src/python --cov-report=term

Name                              Stmts   Miss  Cover
-----------------------------------------------------
src/python/ner_classifier.py        145     12    92%
src/python/train_model.py           89      8    91%
src/python/category_discovery.py    76     15    80%
-----------------------------------------------------
TOTAL                               310     35    87%

Coverage report automatically uploaded to Codecov:

- name: Upload coverage
  uses: codecov/codecov-action@v3
  with:
    file: ./coverage.xml

Result: Beautiful coverage badge in README!

Architecture Deep Dive

The Complete Pipeline Flow

┌─────────────────────────────────────────────────────────┐
│                    GitHub Actions                       │
│                   (Trigger: Daily 2 AM)                 │
└────────────┬────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 1: Generate Data (15 seconds)                      │
│  ┌──────────────────────────────────────────┐           │
│  │ Python: generate_sample_data.py          │           │
│  │ Output: transactions_20260313_143522.csv │           │
│  │ Metadata: category distribution, stats   │           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 2: Run Tests (30 seconds - CACHED)                 │
│  ┌──────────────────────────────────────────┐           │
│  │ pytest tests/ --cov=src/python           │           │
│  │ Coverage: 87%                            │           │
│  │ Upload to Codecov                        │           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Tests Passed ✓
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 3: Train Model (2 minutes - CACHED)                │
│  ┌──────────────────────────────────────────┐           │
│  │ Rule-Based Classification (68.5%)        │           │
│  │ ↓                                        │           │
│  │ ML Enhancement (+22.7%)                  │           │
│  │ ↓                                        │           │
│  │ Category Discovery (4 new clusters)      │           │
│  │ ↓                                        │           │
│  │ MLflow Logging (metrics, model, artifacts)│          │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 4: Generate Report (90 seconds - CACHED)           │
│  ┌──────────────────────────────────────────┐           │
│  │ R Markdown Rendering                     │           │
│  │ ├─ Load classified_transactions.csv      │           │
│  │ ├─ Calculate statistics                  │           │
│  │ ├─ Create 12 interactive charts          │           │
│  │ ├─ Generate recommendations              │           │
│  │ └─ Output: assessment_report.html        │           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 5: Deploy to GitHub Pages (30 seconds)             │
│  ┌──────────────────────────────────────────┐           │
│  │ Create dashboard index.html              │           │
│  │ Generate reports manifest.json           │           │
│  │ Push to gh-pages branch                  │           │
│  │ ↓                                        │           │
│  │ Live at: https://username.github.io/repo/│           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 6: Notify (5 seconds)                              │
│  ┌──────────────────────────────────────────┐           │
│  │ Check all job statuses                   │           │
│  │ Comment on commit with report link       │           │
│  │ (Optional: Send Slack notification)      │           │
│  └──────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────┘

Total Time: ~5 minutes (down from 12+ minutes!)

Job Dependencies

Jobs run in parallel when possible:

generate-data        (15s)
    ↓
    ├──→ test       (30s) ──┐
    └──→ train      (2m)  ──┤
         ↓                  │
         generate-report (90s)
         ↓                  │
         deploy-pages    (30s)
         ↓                  │
         notify          (5s) ←┘

Key insight: Tests run in parallel with training prep!

Data Flow

From generation to deployment:

transactions_20260313_143522.csv
    ↓
[Artifact Upload]
    ↓
train_model.py
    ↓
classified_transactions.csv
metrics.json
ner_classifier.pkl
    ↓
[Artifact Upload]
    ↓
assessment_report.Rmd
    ↓
assessment_report_20260313_143522.html
    ↓
[Artifact Upload]
    ↓
GitHub Pages (gh-pages branch)
    ↓
https://username.github.io/repo/

Caching Strategy Visualization

First Run (Cold Cache):
├─ Python packages:    4m 30s  → Cache MISS → Download & Cache
├─ R packages:         6m 15s  → Cache MISS → Download & Cache
├─ Pytest:             30s     → Cache MISS → Run & Cache
└─ Total:              12m 50s

Second Run (Warm Cache):
├─ Python packages:    15s     → Cache HIT  → Restore
├─ R packages:         20s     → Cache HIT  → Restore
├─ Pytest:             5s      → Cache HIT  → Restore
└─ Total:              4m 45s

Speedup: 2.7x faster!

Performance Metrics: Before vs After

Build Time Comparison

Component	Before	After	Improvement
Python Setup	4m 30s	15s	18x faster
R Setup	6m 15s	20s	18.75x faster
Test Execution	1m 20s	30s	2.67x faster
Model Training	3m 0s	2m 0s	1.5x faster
Report Generation	2m 45s	1m 30s	1.83x faster
Total	12m 50s	4m 35s	2.8x faster
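The "Improvement" column is simply before divided by after. A short sketch that reproduces it from the durations in the table (the `to_seconds` helper is hypothetical):

```python
import re


def to_seconds(text):
    """Convert a duration such as '4m 30s', '15s' or '3m 0s' to seconds."""
    minutes = re.search(r"(\d+)m", text)
    seconds = re.search(r"(\d+)s", text)
    return (int(minutes.group(1)) * 60 if minutes else 0) + (
        int(seconds.group(1)) if seconds else 0
    )


# (before, after) pairs copied from the table above.
TIMINGS = {
    "Python Setup": ("4m 30s", "15s"),
    "R Setup": ("6m 15s", "20s"),
    "Test Execution": ("1m 20s", "30s"),
    "Model Training": ("3m 0s", "2m 0s"),
    "Report Generation": ("2m 45s", "1m 30s"),
    "Total": ("12m 50s", "4m 35s"),
}

for name, (before, after) in TIMINGS.items():
    speedup = to_seconds(before) / to_seconds(after)
    print(f"{name}: {speedup:.2f}x faster")
```

The setup steps dominate the gain: most of the saved time comes from not reinstalling Python and R packages.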

Cost Analysis

Before:

12.83 minutes × 30 runs/month ≈ 385 minutes/month
GitHub Actions: 2,000 free minutes/month
Usage: 19.3% of quota

After:

4.58 minutes × 30 runs/month = 137.4 minutes/month
GitHub Actions: 2,000 free minutes/month  
Usage: 6.9% of quota

Benefit: Each run uses about 2.8x fewer minutes, so about 2.8x more runs fit within the free tier!
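The quota arithmetic can be checked with a few lines. This is a rough estimate under the assumptions stated above (30 runs per month, 2,000 free minutes). It ignores the fact that billing rounds each job up to a whole minute, so real usage will be somewhat higher. Function names are illustrative.

```python
FREE_MINUTES_PER_MONTH = 2000  # figure quoted above


def monthly_usage(run_seconds, runs_per_month=30):
    """Return (minutes used per month, share of the free quota in percent)."""
    minutes = run_seconds / 60 * runs_per_month
    return minutes, 100 * minutes / FREE_MINUTES_PER_MONTH


def runs_within_quota(run_seconds):
    """How many whole runs of this length fit in the monthly free minutes."""
    return (FREE_MINUTES_PER_MONTH * 60) // run_seconds


before_min, before_pct = monthly_usage(12 * 60 + 50)
after_min, after_pct = monthly_usage(4 * 60 + 35)
print(f"before: {before_min:.1f} min ({before_pct:.1f}% of quota)")
print(f"after:  {after_min:.1f} min ({after_pct:.1f}% of quota)")
print(f"runs per month that fit: {runs_within_quota(770)} -> {runs_within_quota(275)}")
```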

Cache Hit Rates

After 30 days of production use:

Cache Type	Hit Rate	Avg Save Time
Python packages	95%	4m 15s
R packages	90%	5m 55s
Pytest	85%	25s
MLflow artifacts	80%	10s

Overall Cache Effectiveness: 91% hit rate

Resource Usage

Artifact Storage:

Before (no compression):

├─ Transaction data: 500 KB × 30 = 15 MB
├─ Model artifacts:  5 MB × 30 = 150 MB
├─ Reports:          8 MB × 30 = 240 MB
└─ Total:                       405 MB/month

After (with compression):

├─ Transaction data: 100 KB × 30 = 3 MB     (80% reduction)
├─ Model artifacts:  2 MB × 30 = 60 MB     (60% reduction)
├─ Reports:          3 MB × 30 = 90 MB     (62% reduction)
└─ Total:                       153 MB/month (62% total reduction)

Compression settings:

- uses: actions/upload-artifact@v4
  with:
    compression-level: 9  # Maximum compression
    retention-days: 7     # Reduced from 30

User Experience Metrics

Metric	Before	After	Improvement
Time to first report	15 min	5 min	3x faster
Dashboard load time	2.5s	0.8s	3.1x faster
Date display	"Invalid"	"Mar 13"	Fixed!
Report freshness	Manual	Auto	100% automated

Lessons Learned

1. Cache Aggressively, Invalidate Carefully

Lesson: Cache everything that doesn't change between runs.

But: Have a clear invalidation strategy.

# Good: Version prefix
CACHE_VERSION: v1  # Bump when you need fresh cache

# Good: Hash-based keys
key: ${{ hashFiles('requirements.txt') }}

# Bad: Run-specific keys
key: cache-${{ github.run_number }}  # Never hits!

Mistake we made: Initially cached without version numbers. When packages updated, we got stale dependencies.

Fix: Added CACHE_VERSION environment variable.

2. ISO 8601 for All Timestamps

Lesson: Always use ISO 8601 format for timestamps.

# Good
datetime.now().isoformat()  # "2026-03-13T14:35:22.123456"

# Bad
datetime.now().strftime('%Y%m%d_%H%M%S')  # "20260313_143522"

Why: ISO 8601 is:

Universally parseable
Sortable lexicographically
Timezone-aware (when an offset is included)
JSON-friendly

Cost of not doing this: Hours debugging "Invalid Date"!
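The properties listed above can be demonstrated in a few lines of standard-library Python. Two caveats apply: lexicographic sorting only matches chronological order when all strings share the same format and offset, and a naive `isoformat()` string carries no timezone unless one is attached. The +01:00 offset below is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

stamps = [
    "2026-03-13T14:35:22",
    "2026-03-12T02:00:00",
    "2026-12-01T09:15:00",
]

# Plain string sorting gives chronological order for same-format ISO strings.
assert sorted(stamps) == sorted(stamps, key=datetime.fromisoformat)

# The standard library parses the format directly.
parsed = datetime.fromisoformat("2026-03-13T14:35:22")

# Attaching an offset makes the timestamp unambiguous across machines.
plus_one = timezone(timedelta(hours=1))  # example offset, chosen arbitrarily
aware = parsed.replace(tzinfo=plus_one)
print(aware.isoformat())
print(aware.astimezone(timezone.utc).isoformat())
```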

3. Test with Production-Like Data

Lesson: Generate test data dynamically, not statically.

Before: Tests used committed sample_transactions.csv
After: Tests use freshly generated data each run

Benefits:

Catches edge cases
Validates data generator
Prevents overfitting to test data

Example bug caught:

# This passed with static data:
assert df['category'].nunique() == 8

# But failed with generated data (only 7 categories present)
# Fix: 
assert df['category'].nunique() >= 5  # At least 5 categories

4. Parallel Jobs Where Possible

Lesson: Dependencies create bottlenecks. Parallelize what you can.

Before:

generate → test → train → report → deploy
(all sequential, 12 minutes)

After:

generate → test ─┐
          ├────→ train → report → deploy
          └────→ [other jobs]
(parallel where possible, 5 minutes)

Key: Use needs: carefully:

test:
  needs: [generate-data]  # Only wait for data

train-model:
  needs: [generate-data, test]  # Wait for both
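To see what a given set of `needs:` does to wall-clock time, it helps to compute the earliest finish time of each job. The sketch below does that with `graphlib` from the standard library (Python 3.9+). Job durations are taken from the pipeline diagram earlier in this post. The `needs` of the last three jobs are assumed, and runner start-up and setup time are ignored.

```python
from graphlib import TopologicalSorter

# job -> (duration in seconds, jobs it needs). Durations come from the
# pipeline diagram; the needs of the last three jobs are assumed.
JOBS = {
    "generate-data": (15, []),
    "test": (30, ["generate-data"]),
    "train-model": (120, ["generate-data", "test"]),
    "generate-report": (90, ["train-model"]),
    "deploy-pages": (30, ["generate-report"]),
    "notify": (5, ["deploy-pages"]),
}


def finish_times(jobs):
    """Earliest finish time of each job when independent jobs run in parallel."""
    order = TopologicalSorter({name: needs for name, (_, needs) in jobs.items()})
    finish = {}
    for name in order.static_order():  # dependencies always come first
        duration, needs = jobs[name]
        start = max((finish[dep] for dep in needs), default=0)
        finish[name] = start + duration
    return finish


times = finish_times(JOBS)
print(times["notify"])
```

With the `needs` shown above the estimate is 290 seconds. Removing `test` from the `needs` of `train-model` would let the two overlap and bring it down to 260 seconds, at the cost of training on a commit whose tests may fail.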

5. Fail Fast, Fail Clearly

Lesson: When tests fail, make it obvious WHY.

Bad error message:

AssertionError: assert False

Good error message:

assert category == "Groceries", \
    f"Expected 'Groceries', got '{category}'. " \
    f"Narration: '{text}', Confidence: {confidence}"

# Output:
# AssertionError: Expected 'Groceries', got 'Unknown'. 
# Narration: 'walmart shopping', Confidence: 0.0

Now we know:

What failed (category assertion)
Expected vs actual values
Context (the narration text)
Why it failed (zero confidence)

6. Monitor Cache Effectiveness

Lesson: Track cache hit rates over time.

We added logging:

- name: Check cache status
  run: |
    if [ "${{ steps.cache.outputs.cache-hit }}" == "true" ]; then
      echo "✓ Cache hit!"
    else
      echo "✗ Cache miss - downloading packages"
    fi

Metric to watch: Cache hit rate should be >85%.

If lower:

Cache keys might be too specific
Dependencies changing too frequently
Cache size limits reached
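Once the hit or miss result of each run is recorded somewhere, checking it against the 85% target is straightforward. The sketch below assumes the outcomes have already been collected into a list of booleans. How they are collected is left open, and both function names are hypothetical.

```python
def hit_rate(outcomes):
    """Share of runs with a cache hit; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("no runs recorded")
    return sum(outcomes) / len(outcomes)


def check_cache_health(outcomes, threshold=0.85):
    """Return a one-line status comparing the hit rate with the threshold."""
    rate = hit_rate(outcomes)
    status = "ok" if rate > threshold else "investigate"
    return f"hit rate {rate:.0%} ({status})"


# Hypothetical history: 30 daily runs, 3 of them cache misses.
history = [True] * 27 + [False] * 3
print(check_cache_health(history))
```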

7. Optimize Artifact Retention

Lesson: Keep what you need, delete what you don't.

# Before: Everything kept 90 days
retention-days: 90

# After: Tiered retention
- Transaction data: 7 days   # Regenerable
- Model artifacts: 30 days   # Useful for comparison
- Reports: 90 days           # Want history

Savings: 62% reduction in storage costs!

8. Documentation is Code

Lesson: README is as important as the code itself.

Investment:

2 hours writing comprehensive README
30 minutes on deployment guide
1 hour on troubleshooting section

Return:

Zero support questions about setup
Contributors could onboard in <5 minutes
Reduced deployment issues by 90%

9. Start with POC, Iterate to Production

Lesson: Don't try to build everything at once.

Our journey:

Week 1: Basic classifier (rule-based only)
Week 2: Add ML enhancement
Week 3: Manual reporting
Week 4: GitHub Actions automation
Week 5: Add caching & optimization
Week 6: Polish UX, fix bugs

Key: Each week added value. No "big bang" release.

10. Open Source Everything

Lesson: Making it public improved quality.

Before open source:

Hardcoded paths
No documentation
Quick hacks everywhere

After open source:

Configurable
Well-documented
Production-ready code

Conclusion

What We Accomplished

Starting from a proof-of-concept, we built a production-grade ML pipeline that:

✅ Runs about 2.8x faster with intelligent caching
✅ Costs $0/month on GitHub Actions free tier
✅ Generates fresh data automatically
✅ Deploys reports to the web autonomously
✅ Achieves 91.2% classification coverage
✅ Discovers new categories without supervision
✅ Provides full MLOps tracking with MLflow
✅ Has 87% test coverage
✅ Runs 24/7 without human intervention

The Numbers

Metric	Value
Pipeline Runtime	4min 35s (was 12min 50s)
Speedup	2.8x faster
Cost	$0/month
Test Coverage	87%
Classification Coverage	91.2%
Cache Hit Rate	95%
Lines of Code	~3,500
Time to Deploy	< 5 minutes

Key Takeaways

Cache Everything - 95% hit rate = 2.8x speedup
Use ISO 8601 - Saved hours of debugging
Dynamic Data - Caught bugs static tests missed
Fail Fast - Clear errors save time
Document Well - README as important as code

The Technology Stack

Languages & Frameworks:

Python 3.9 (ML/NLP)
R 4.3 (Statistics/Reporting)
YAML (Configuration)
Markdown (Documentation)

ML & Data:

scikit-learn (Classification)
pandas (Data manipulation)
NLTK (Text processing)
MLflow (Experiment tracking)

DevOps:

GitHub Actions (CI/CD)
GitHub Pages (Hosting)
Codecov (Coverage tracking)
Docker (Future deployment)

Visualization:

R Markdown (Reports)
Plotly (Interactive charts)
ggplot2 (Static charts)
DT (Data tables)

Resources

Live Demo:

Dashboard: https://akanimohod19a.github.io/productionizing_NER/
GitHub: https://github.com/akanimohod19a/productionizing_NER

Documentation:

README: Comprehensive setup guide
CI/CD Guide: Workflow customization
API Docs: Classifier usage
Contributing: How to contribute

Contact:

Email: danielamahtoday@gmail.com
Twitter: @productionML
LinkedIn: https://www.linkedin.com/in/daniel-amah-2559a4159/

Acknowledgments

Built with:

Lots of coffee ☕
Many debugging sessions
Great community feedback
Passion for MLOps

Special thanks to:

GitHub Actions team for free CI/CD
MLflow community for excellent tools
R/RStudio team for amazing reporting
scikit-learn contributors
Everyone who contributed feedback

Full Code: https://github.com/AkanimohOD19A/productionizing_NER

Built with ❤️ using Python, R, MLflow, GitHub Actions, and a lot of love

Last updated: March 2026

If you ever wondered how text/qualitative data can make sense for predictions in your business, please check this out.

Akan — Fri, 13 Mar 2026 04:17:30 +0000

Building an Adaptive NER System with MLOps: A Complete Guide

Akan ・ Feb 1

#nlp #machinelearning #programming #ai

[Boost]

Akan — Mon, 09 Mar 2026 13:46:17 +0000

Building an Adaptive NER System with MLOps: A Complete Guide

Akan ・ Feb 1

#nlp #machinelearning #programming #ai

From Idea to CRAN: My Journey Building the `splitr` R Package

Akan — Sun, 08 Mar 2026 15:24:29 +0000

If you've ever thought, "I wish R could do X automatically," I have a story for you. Recently, I embarked on a journey to create my first R package — and let me tell you, it was one of the most fun and educational experiences in my coding career.

The goal? Build splitr, a package that splits an Excel sheet into multiple sheets efficiently, using data.table for speed and openxlsx for Excel magic.

Step 1: The Idea and the Blueprint

It all started with a common pain point: dealing with massive Excel files. Manually splitting data into chunks for reporting or analysis was tedious.

I sketched out the plan:

Read a source Excel sheet
Split rows into n chunks
Write each chunk into a separate sheet in a single workbook
Apply styles and optionally save to disk

Simple, right? But turning that plan into a robust, reusable R function is where the fun begins.

🛠 Step 2: Structuring the Package

Using RStudio, I created a new package project called splitr. The structure looked like this:

splitr/
├── R/
│   └── split_excel_to_sheets.R
├── man/
├── DESCRIPTION
├── NAMESPACE
└── splitr.Rproj

DESCRIPTION holds metadata about the package.
R/ is where the actual function lives.
man/ will eventually contain documentation generated with roxygen2.

Step 3: Writing the Function

Here’s the core of splitr:

split_excel_to_sheets <- function(file_path, n, sheet = 1, output_path = NULL) {
  dt <- data.table::setDT(openxlsx::read.xlsx(file_path, sheet = sheet))
  chunks <- split(dt, cut(seq_len(nrow(dt)), breaks = n, labels = FALSE))
  wb <- openxlsx::createWorkbook()
  for (i in seq_len(n)) {
    openxlsx::addWorksheet(wb, paste0("Part_", i))
    openxlsx::writeData(wb, sheet = i, chunks[[i]])
  }
  if (!is.null(output_path)) openxlsx::saveWorkbook(wb, output_path, overwrite = TRUE)
  invisible(wb)
}

I loved how data.table made splitting huge datasets lightning-fast, and openxlsx let me handle all Excel styling without breaking a sweat.

Step 4: Documenting Like a Pro

Documentation is key. Using roxygen2, I added clear parameter descriptions, return values, and examples.

#' Split Excel Sheet into Multiple Sheets
#'
#' @param file_path Path to source .xlsx file
#' @param n Number of chunks
#' @param sheet Sheet name or index
#' @param output_path Optional path to save workbook
#' @return An openxlsx workbook object
#' @export

This not only helps others understand the function but also automatically generates Rd files for CRAN.

Step 5: Overcoming Challenges

The journey wasn’t all smooth sailing:

Non-ASCII Characters: My code had fancy dashes and ellipses (— and …) that CRAN hates. tools::showNonASCIIfile() helped me locate and replace them with plain ASCII.
Package Dependencies: CRAN checks flagged undefined global functions from openxlsx and data.table. Fully qualifying functions like openxlsx::writeData() and data.table::setDT() solved the problem.
Examples & Tests: I had leftover demo functions (hello()) that weren’t defined. Removing them stopped example errors.

Every challenge was a learning moment — and now I feel like I truly understand what CRAN expects.

✅ Step 6: Passing CRAN Checks

After careful fixes and multiple iterations, the ultimate moment came:

0 errors ✔ | 0 warnings ✔ | 0 notes ✔

CRAN checks passed perfectly. No warnings, no errors, no notes. Pure joy.

Step 7: Building & Submitting

With devtools::build(), I created the tarball for submission:

devtools::build()

Next step: submit to CRAN. And there it is — my first R package, ready for the world.

Reflections

Building splitr taught me:

How R packages are structured
Why CRAN is strict (for good reason!)
The importance of documentation and reproducible examples
That challenges like encoding issues or namespace warnings are normal and solvable

The thrill of turning an idea into a fully functional CRAN-ready package is unmatched.

If you’ve ever thought about building an R package — just start small, document thoroughly, and run checks constantly. The learning experience is incredible.

Next steps for me:

Submit to CRAN ✅
Share the package on GitHub for quick installation via remotes::install_github() ✅
Explore more advanced Excel manipulations in R

🏃‍♂️ Try It Out Yourself

You don’t have to wait for CRAN — you can install and try splitr directly from GitHub today:

# Install devtools if you don't have it
install.packages("devtools")

# Install splitr from GitHub
devtools::install_github("AkanimohOD19A/splitr")

# Load the package
library(splitr)

# Try splitting an Excel file into 3 sheets
wb <- split_excel_to_sheets(
  file_path   = "data/example.xlsx",
  n           = 3,
  output_path = "data/example_split.xlsx"
)

Check out the GitHub repository:
https://github.com/AkanimohOD19A/splitr?tab=readme-ov-file

Note: CRAN submission is still in progress, so GitHub is the best way to try it right now.

Have you ever submitted a package to CRAN? What was your experience like? Drop a comment below — I’d love to hear your stories!

Building a Free, Multi-User Telegram Bot: When Infrastructure Constraints Drive Architecture

Akan — Wed, 04 Feb 2026 11:12:56 +0000

The Problem Space

At 2AM with 43% battery and no power, I needed to build a system that could:

Send randomized messages to multiple users throughout the day
Scale to handle arbitrary user counts
Cost exactly $0 to run
Deploy and forget about it

The obvious solution—Twilio's WhatsApp API—sat behind a paywall. But constraints breed creativity, and what followed was an exercise in building production-grade infrastructure with free-tier services.

Architecture Overview

The final system consists of three core components:

1. Multi-User Bot with Individual Scheduling

# Each user gets their own schedule, persisted in JSON
users = {
    "chat_id": {
        "active": True,
        "messages_per_day": 3,
        "start_hour": 8,
        "end_hour": 22
    }
}

2. APScheduler for Randomized Delivery

def schedule_user_affirmations(chat_id, messages_per_day, start_hour, end_hour):
    for i in range(messages_per_day):
        random_hour = random.randint(start_hour, end_hour - 1)
        random_minute = random.randint(0, 59)

        scheduler.add_job(
            send_affirmation_to_user,
            'cron',
            args=[chat_id],
            hour=random_hour,
            minute=random_minute,
            id=f'user_{chat_id}_msg_{i}',
            replace_existing=True  # The daily reschedule reuses these IDs
        )

3. Webhook-Based Deployment

Polling vs. webhooks became critical for deployment. Telegram's API allows only one active connection per bot, which creates an interesting constraint when deploying.

The Polling Problem

Initial implementation used infinity_polling():

# Works locally, breaks in production
bot.infinity_polling()

Error:

ApiTelegramException: Error code: 409. 
Description: Conflict: terminated by other getUpdates request

This happens because:

Local instance starts polling
Deployed instance starts polling
Telegram allows only one getUpdates consumer per bot, so one request gets terminated by the other
Both instances keep retrying, creating a conflict loop

Solution: Webhook Architecture

if WEBHOOK_URL:
    # Production: Telegram pushes updates to us
    webhook_url = f"{WEBHOOK_URL}/{BOT_TOKEN}"
    bot.remove_webhook()
    bot.set_webhook(url=webhook_url)
else:
    # Development: We poll Telegram for updates
    bot.infinity_polling()

Flask endpoint to receive webhooks:

@app.route(f'/{BOT_TOKEN}', methods=['POST'])
def webhook():
    if request.headers.get('content-type') == 'application/json':
        json_string = request.get_data().decode('utf-8')
        update = telebot.types.Update.de_json(json_string)
        bot.process_new_updates([update])
        return '', 200
    return '', 403

Why This Matters

Polling (Development):

Bot continuously asks Telegram: "Any new messages?"
Simple, works for local testing
Cannot coexist with other instances

Webhooks (Production):

Telegram sends messages directly to your server
More efficient (no constant polling)
Multiple environments can coexist (different webhook URLs)
Production-grade approach

State Management

User preferences persist across restarts using JSON:

def load_users():
    try:
        with open(USERS_FILE, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_users(users):
    with open(USERS_FILE, 'w') as f:
        json.dump(users, f, indent=2)

Trade-offs considered:

Redis/PostgreSQL: Requires additional services, kills free-tier budget
SQLite: Better for production, but adds complexity
JSON file: Simple, sufficient for <1000 users, zero infrastructure cost

For a constraint-driven project, JSON files are appropriate. The system can always migrate to a database when scale demands it.

Deployment: Free Tier Engineering

Platform: Render.com

Why Render:

True free tier (no credit card required)
Auto-deploys from GitHub
Includes SSL/HTTPS (required for Telegram webhooks)
Provides a persistent URL

Configuration (render.yaml):

services:
  - type: web
    name: affirmations-bot
    runtime: python
    buildCommand: pip install -r requirements_telegram.txt
    startCommand: python telegram_app_webhook.py
    envVars:
      - key: TELEGRAM_BOT_TOKEN
        sync: false
      - key: WEBHOOK_URL
        sync: false

The Free Tier Caveat

Render's free tier spins down after 15 minutes of inactivity. For a bot that needs to send scheduled messages, this is a problem.

Solution: UptimeRobot

Free monitoring service
Pings your app every 5 minutes
Keeps the service awake
Zero cost

GET https://affirmations-bot.onrender.com/health
Every 5 minutes

Scheduling Architecture

Daily reschedule pattern prevents predictability:

def reschedule_all_users():
    """Runs at midnight, generates new random times"""
    users = load_users()
    for chat_id, user_data in users.items():
        if user_data.get('active', True):
            schedule_user_affirmations(
                int(chat_id),
                user_data.get('messages_per_day', 3),
                user_data.get('start_hour', 8),
                user_data.get('end_hour', 22)
            )

# Add to scheduler
scheduler.add_job(
    reschedule_all_users,
    'cron',
    hour=0,
    minute=1,
    id='daily_reschedule'
)

Result:

User receives 3 messages daily
Times randomized each day (e.g., 9:23, 14:47, 19:12)
No predictable patterns
Feels organic, not automated

User Experience Design

Bot commands follow Telegram conventions:

@bot.message_handler(commands=['start'])
def send_welcome(message):
    # Auto-subscribe new users
    # Generate initial schedule
    # Send welcome message

@bot.message_handler(commands=['settings'])
def show_settings(message):
    # Display current config
    # Provide customization options

@bot.message_handler(commands=['pause', 'resume'])
def toggle_subscription(message):
    # User controls their subscription
    # Preserves preferences for resume

Key insight: Don't over-engineer. Users want:

/start → immediate value
/settings → control
/pause → temporary opt-out (not deletion)

Technical Challenges & Solutions

Challenge 1: Timezone Handling

Users in different timezones need messages at their local hours.

Current solution: Server time + user-specified hours

start_hour = 8  # 8 AM server time

Future enhancement:

user_timezone = pytz.timezone(user_data.get('timezone', 'UTC'))
local_time = datetime.now(user_timezone)
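
One way to make the user-specified hours mean local hours is to convert them to server time before scheduling. The sketch below assumes the server runs on UTC and that each user record stores a hypothetical `utc_offset_hours` field; neither is part of the current bot. Fixed offsets ignore daylight saving, so named zones (as in the pytz snippet above) would be needed where that matters.

```python
from datetime import datetime, timedelta, timezone

def local_hour_to_utc(local_hour: int, utc_offset_hours: int) -> int:
    """Convert an hour of the day in the user's zone to the matching UTC hour."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    # Any date works because a fixed offset has no daylight-saving shifts.
    local_dt = datetime(2026, 1, 1, local_hour, 0, tzinfo=user_tz)
    return local_dt.astimezone(timezone.utc).hour

# Lagos is UTC+1, so an 8:00-22:00 local window is 7:00-21:00 UTC.
start_utc = local_hour_to_utc(8, 1)
end_utc = local_hour_to_utc(22, 1)
```

Note that a converted window can wrap past midnight for users far from UTC, which the scheduling loop would have to handle.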

Challenge 2: Message Deduplication

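With random scheduling, the same message slot could end up scheduled more than once.

Solution: APScheduler's job IDs give each slot exactly one job, which prevents duplicate registrations (two different slots can still draw the same minute):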

id=f'user_{chat_id}_msg_{i}'  # Unique per user, per message slot
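
Job IDs keep the slots distinct, but two slots can still be assigned the same hour and minute. A minimal sketch of one way to rule that out is to draw the times without replacement. The helper name `pick_delivery_times` is hypothetical and not part of the bot's code:

```python
import random

def pick_delivery_times(messages_per_day, start_hour, end_hour, rng=random):
    """Return sorted, distinct (hour, minute) pairs inside [start_hour, end_hour)."""
    window = range(start_hour * 60, end_hour * 60)  # minutes since midnight
    if messages_per_day > len(window):
        raise ValueError("more messages requested than minutes in the window")
    # sample() draws without replacement, so no two slots share a minute.
    minutes = sorted(rng.sample(window, messages_per_day))
    return [divmod(m, 60) for m in minutes]

times = pick_delivery_times(3, 8, 22)
```

Each returned pair could then be passed as `hour` and `minute` to `scheduler.add_job`.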

Challenge 3: State Corruption

What if the server crashes mid-write?

Mitigation:

def save_users(users):
    # Atomic write pattern
    temp_file = USERS_FILE + '.tmp'
    with open(temp_file, 'w') as f:
        json.dump(users, f, indent=2)
    os.replace(temp_file, USERS_FILE)  # Atomic on POSIX

Cost Breakdown

Component	Service	Cost
Messaging API	Telegram Bot API	$0
Hosting	Render.com	$0
Uptime Monitoring	UptimeRobot	$0
Version Control	GitHub	$0
Total		$0/month

Twilio equivalent: ~$0.005/message = $0.015/day/user = $0.45/month/user

At 100 users: $45/month vs. $0.
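
The comparison is simple arithmetic. A short sketch that reproduces it, assuming a 30-day month and the approximate per-message price quoted above:

```python
PRICE_PER_MESSAGE = 0.005   # approximate per-message price quoted above
MESSAGES_PER_DAY = 3
DAYS_PER_MONTH = 30         # assumption: a 30-day month

def monthly_cost(users: int) -> float:
    """Estimated monthly cost in dollars for a paid per-message API."""
    return users * MESSAGES_PER_DAY * DAYS_PER_MONTH * PRICE_PER_MESSAGE

print(f"${monthly_cost(1):.2f} per user, ${monthly_cost(100):.2f} for 100 users")
```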

Performance Characteristics

Single instance handles:

~100 concurrent users comfortably
~300 messages/day (3 per user)
~0.5 requests/second average
Peaks during scheduling windows

Bottlenecks:

Telegram API rate limits (30 messages/second)
Render free tier CPU/memory
JSON file I/O (becomes issue >1000 users)

Scaling path:

Migrate to PostgreSQL (~1000 users)
Horizontal scaling with Redis queue (~10k users)
Switch to paid Render tier (~100k users)

Lessons from Constraint-Driven Development

1. Start with the Free Tier

Don't prematurely optimize for scale you don't have. JSON files work until they don't.

2. Understand Your Platform's Execution Model

Polling vs. webhooks isn't just a technical detail—it's the difference between working and not working in production.

3. Constraints Force Better Architecture

No database? You design for minimal state. No always-on hosting? You make your app stateless and resilient.

4. Documentation as Infrastructure

Half the battle is making it reproducible:

git clone repo
pip install -r requirements.txt
# Add bot token to .env
python telegram_app_webhook.py

If it takes 5+ steps, you're doing it wrong.

The Meta-Problem: Environment Parity

Building from Lagos means:

Intermittent power → Local development gets interrupted
Slow/expensive internet → Downloading dependencies is costly
Limited payment options → Many services unavailable
Time zone challenges → Debugging with US-based support

These aren't excuses—they're parameters. Good engineering adapts.

Development environment:

Power: 43% battery, no outlet
Internet: 3G tethered from phone
Time: 2:47 AM
Deadline: Yesterday

Production environment:

Power: ✓ Always on
Internet: ✓ High bandwidth
Time: ✓ 24/7 availability
Cost: $0 (hard constraint)

The gap between these environments shapes the architecture. You build:

Offline-first documentation (can't count on Stack Overflow loading)
Minimal dependencies (pip install takes forever)
Aggressive caching (can't re-download on every restart)
Robust error handling (can't debug when offline)

What's Next

Immediate improvements:

Add timezone support per user
Implement message templates (user-customizable)
Add analytics dashboard (messages sent, active users)

Future architecture:

Migrate to PostgreSQL when users > 500
Add message queue (Celery + Redis) for reliability
Implement A/B testing for message timing
Add web interface for non-Telegram management

System evolution pattern:

JSON file → SQLite → PostgreSQL → Distributed DB
Single instance → Load balanced → Microservices
Monolith → Modular monolith → Services

Migrate when the pain exceeds the migration cost. Not before.

Code Repository

Full implementation: [GitHub link]

Stack:

Python 3.11
Flask (web server)
pyTelegramBotAPI (Telegram SDK)
APScheduler (job scheduling)
Render (hosting)

To deploy your own:

git clone [repo]
pip install -r requirements_telegram.txt
# Add TELEGRAM_BOT_TOKEN to .env
python telegram_app_webhook.py

Closing Thoughts

The "right" solution isn't always the obvious one. When Twilio was gated, I could have:

Paid for it (out of budget)
Given up (not an option)
Found another way (what actually happened)

Engineering isn't just about writing code—it's about navigating constraints, making trade-offs, and shipping despite the environment.

Resource-lean contexts don't produce worse engineers. They produce engineers who:

Understand trade-offs deeply
Build resilient systems by default
Know when "good enough" is actually good enough
Can build production infrastructure for $0

The feature shipped. The users are happy. The cost is zero.

That's the only metric that matters.

Technical Stack:

Telegram Bot API (webhooks)
Flask (HTTP server)
APScheduler (cron-like scheduling)
Render.com (PaaS hosting)
GitHub (CI/CD via git push)

Performance:

100 users, 3 messages/day = 300 messages/day
~0.5 req/sec average
<100ms p99 latency (webhook processing)
Zero cost at any scale under 10k users

Want to build something similar?
The repository is open source. Fork it, modify it, deploy it. All the infrastructure patterns are reusable.

Sometimes the best technology is the free technology you can ship today. Please try it here: https://web.telegram.org/k/#@my_affirmation_fr_bot

Written at 4:23 AM, 19% battery remaining, on generator power. Deployed successfully to production before the power cut out again.

Building an Adaptive NER System with MLOps: A Complete Guide

Akan — Sun, 01 Feb 2026 11:46:24 +0000

In the 2010s, a man angrily contacted a US supermarket to complain that maternity coupons were being advertised to his teenage daughter, coupons that should only be useful to a pregnant woman.

He later called back to apologize: his daughter was indeed pregnant. That is the power of predictive analysis in such contexts. Behavioral patterns were reviewed and used to anticipate future actions; by simply mapping certain purchases, an analyst may accurately identify a cluster, a central tendency.

In this week's post we will examine Named Entity Recognition (NER), a Natural Language Processing technique used to identify entities that cluster together, from free text alone. For example, lotion, wipes, diapers ~ "pregnancy"; bills, DSTv, StarLink ~ "subscription". Because some patterns are more straightforward than others, we will combine rule-based and ML-based approaches with unsupervised category discovery, then plug the result into an automated reporting mechanism in a unified pipeline that bridges the R and Python ecosystems.

Our tech stack is simple. Python: Python, MLflow and ZenML; R: R and R Markdown.

What we're building:

An intelligent text classification system that learns from transaction narratives
Hybrid approach: rule-based NER + ML-powered adaptive learning
Full MLOps stack with MLflow tracking and ZenML orchestration
Bilingual pipeline (R ↔ Python) with automated R Markdown reporting
Production-ready POC that handles concept drift and discovers new categories

Business Context:
Financial institutions, e-commerce platforms, and expense management systems process millions of free-text transaction descriptions daily. Manually categorizing these is impossible at scale, yet accurate categorization is critical for fraud detection, expense reporting, budgeting, and financial analytics.

Traditional rule-based systems fail when encountering new merchants, products, or spending patterns. Our solution combines the reliability of expert-defined rules with machine learning's adaptability, creating a system that improves continuously without manual intervention.

Architecture Overview
Technology Stack Deep Dive
Data Model & Processing Pipeline
Rule-Based NER Implementation
Machine Learning Components
Unsupervised Category Discovery
MLflow Integration & Model Tracking
ZenML Orchestration
R Integration & Interoperability
Automated Reporting System
Results & Performance Metrics
Production Deployment Considerations
Future Enhancements

Architecture Overview

System Design Philosophy

Our architecture follows a progressive enhancement strategy:

Raw Text → Rule-Based Filter → ML Classifier → Cluster Discovery → Human Review

Layer 1: Rule-Based Foundation

Fast, deterministic, near-zero-latency classification
Captures well-known patterns with high confidence
No training required, interpretable results
Coverage: ~60-70% of common transactions

Layer 2: ML Enhancement

Handles edge cases and ambiguous text
Learns from historical labeled data
Amount-weighted training for financial impact
Coverage: Additional 20-25% of transactions

Layer 3: Discovery Engine

Unsupervised clustering of unknowns
Identifies emerging spending patterns
Suggests new categories for human validation
Enables continuous system evolution

Layer 4: Human-in-the-Loop

Low-confidence predictions flagged for review
Discovered clusters presented for labeling
Feedback loop retrains models automatically
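
The four layers can be read as a simple cascade. The sketch below is illustrative only: `classify`, `toy_rules`, and `toy_ml` are hypothetical names, and the default thresholds of 0.3 and 0.5 are assumptions that mirror the values used in the keyword rules configuration.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (category, confidence)

def classify(text: str,
             rule_fn: Callable[[str], Prediction],
             ml_fn: Callable[[str], Prediction],
             unknown_threshold: float = 0.3,
             review_threshold: float = 0.5) -> dict:
    """Try rules first, fall back to ML, and flag weak results for review."""
    category, confidence = rule_fn(text)
    method = "rule-based"
    if confidence < unknown_threshold:          # Layer 1 was not sure enough
        category, confidence = ml_fn(text)      # Layer 2
        method = "ml-based"
    if confidence < unknown_threshold:          # Layer 3 input: still unknown
        category = "unknown"
    return {
        "category": category,
        "confidence": confidence,
        "method": method,
        "needs_review": confidence < review_threshold,  # Layer 4
    }

# Toy stand-ins for the real layers, for illustration only.
def toy_rules(text):
    return ("Transportation", 0.9) if "uber" in text else ("unknown", 0.0)

def toy_ml(text):
    return ("Groceries", 0.4)

result = classify("uber ride to downtown", toy_rules, toy_ml)
```

Transactions that come back as "unknown" are the ones that would be handed to the clustering step.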

Component Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Data Sources                            │
│  (CSV, Database, API feeds, File uploads)                   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                  R: Data Preparation                         │
│  • Cleaning & normalization                                 │
│  • Feature engineering                                      │
│  • Exploratory analysis                                     │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│           Python: NER Classification Engine                  │
│  ┌──────────────────┐  ┌──────────────────┐                │
│  │  Rule-Based NER  │  │   ML Classifier  │                │
│  │  • Keyword match │  │   • TF-IDF       │                │
│  │  • Regex patterns│  │   • Random Forest│                │
│  │  • Confidence    │  │   • Probability  │                │
│  └──────────────────┘  └──────────────────┘                │
│  ┌──────────────────────────────────────────┐              │
│  │      Cluster Discovery (DBSCAN)          │              │
│  │      • Find unknown patterns             │              │
│  │      • Suggest new categories            │              │
│  └──────────────────────────────────────────┘              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              MLflow: Experiment Tracking                     │
│  • Model versioning                                         │
│  • Metrics logging                                          │
│  • Artifact storage                                         │
│  • Model registry                                           │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│            ZenML: Pipeline Orchestration                     │
│  • Step dependencies                                        │
│  • Caching & lineage                                        │
│  • Scheduled runs                                           │
│  • Deployment automation                                    │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│          R Markdown: Automated Reporting                     │
│  • Performance dashboards                                   │
│  • Category distribution                                    │
│  • Confidence analysis                                      │
│  • Review recommendations                                   │
└─────────────────────────────────────────────────────────────┘

Technology Stack Deep Dive

Core Technologies & Rationale

Python 3.9+

Primary ML/NLP engine
Rich ecosystem: scikit-learn, NLTK, spaCy
MLflow & ZenML native support
Industry standard for production ML

R 4.0+

Data preparation & reporting
Superior statistical analysis
Excellent visualization (ggplot2, plotly)
R Markdown for reproducible reports
Strong in financial analytics community

MLflow 2.9+

Experiment tracking & model registry
Framework-agnostic tracking
Model versioning with lineage
REST API for model serving
Local SQLite backend (production: PostgreSQL)

Why MLflow?

# Simple, powerful tracking
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.sklearn.log_model(model, "model")

ZenML 0.50+

Pipeline orchestration
Step caching for efficiency
Lineage tracking
Multi-cloud deployment
Integrates with MLflow seamlessly

Why ZenML?

Declarative pipeline definition
Automatic artifact versioning
Reproducible experiments
Easy scaling to Kubernetes

Reticulate

R ↔ Python bridge
Seamless data transfer
Call Python from R naturally
Share objects between languages

Dependencies & Environment

Python Requirements:

pandas==2.1.0           # Data manipulation
numpy==1.24.0           # Numerical computing
scikit-learn==1.3.0     # ML algorithms
mlflow==2.9.0           # Experiment tracking
zenml==0.50.0           # Pipeline orchestration
pyyaml==6.0             # Configuration files
joblib==1.3.0           # Model serialization

R Dependencies:

tidyverse   # Data wrangling (dplyr, ggplot2, etc.)
reticulate  # Python integration
knitr       # Report generation
rmarkdown   # Document formatting
DT          # Interactive tables
plotly      # Interactive visualizations
yaml        # Config parsing

Data Model & Processing Pipeline

Input Data Schema

Transaction {
    narration: str      # Free-text description
    amount: float       # Transaction amount (signed)
    date: datetime      # Transaction timestamp
    account_id: str     # Optional: account identifier
    merchant_id: str    # Optional: merchant code
}

Example Transaction Data:

narration,amount,date
"Purchase at Baby Store - Pampers diapers",45.99,2026-01-15
"Pharmacy - Baby lotion and wipes",23.50,2026-01-16
"Supermarket - Bread milk eggs cheese",67.80,2026-01-16
"Uber ride to downtown conference",28.00,2026-01-17
"Dr. Smith consultation fee",150.00,2026-01-18
"Shell Gas Station #4521",55.20,2026-01-19
"Payment to ACME CORP INV-2024-001",1200.00,2026-01-20

Output Schema

ClassifiedTransaction {
    narration: str         # Original text
    amount: float          # Original amount
    category: str          # Assigned category
    confidence: float      # Classification confidence [0-1]
    method: str           # 'rule-based' | 'ml-based'
    keywords_matched: List[str]  # Matched keywords (if rule-based)
    probability_dist: Dict       # Class probabilities (if ML)
    needs_review: bool     # Flag for human review
    cluster_id: int        # Discovered cluster (if unknown)
}
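
In Python, the output schema could be expressed as a dataclass. This is a sketch, not the pipeline's actual type: the default values and the validation in `__post_init__` are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClassifiedTransaction:
    narration: str
    amount: float
    category: str
    confidence: float
    method: str                                   # 'rule-based' or 'ml-based'
    keywords_matched: List[str] = field(default_factory=list)
    probability_dist: Dict[str, float] = field(default_factory=dict)
    needs_review: bool = False
    cluster_id: Optional[int] = None              # set only for unknowns

    def __post_init__(self):
        # Reject records that violate the schema's stated ranges.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.method not in ("rule-based", "ml-based"):
            raise ValueError(f"unexpected method: {self.method!r}")

record = ClassifiedTransaction(
    narration="pharmacy baby lotion and wipes",
    amount=23.50,
    category="Baby Items",
    confidence=0.8,
    method="rule-based",
    keywords_matched=["baby lotion", "wipes"],
)
```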

Data Preprocessing Pipeline

R: Initial Data Preparation

# src/R/data_prep.R
library(tidyverse)
library(lubridate)

prepare_transaction_data <- function(input_path, output_path) {
  df <- read_csv(input_path) %>%
    mutate(
      # Text normalization
      narration = narration %>%
        str_to_lower() %>%
        str_replace_all("[^a-z0-9\\s]", " ") %>% # Replace special chars with spaces
        str_squish(),                            # Trim and collapse whitespace last

      # Amount validation
      amount = as.numeric(amount),
      amount_abs = abs(amount),

      # Date parsing
      date = ymd(date),

      # Derived features
      is_large_transaction = amount_abs > 500,
      transaction_type = if_else(amount >= 0, "credit", "debit"),

      # Text features
      word_count = str_count(narration, "\\S+"),
      has_numbers = str_detect(narration, "\\d"),

      # Create unique ID
      transaction_id = row_number()
    ) %>%
    filter(
      !is.na(narration),
      !is.na(amount),
      nchar(narration) > 3  # Minimum text length
    )

  # Log preprocessing stats
  cat("Preprocessing Summary:\n")
  cat("  Total records:", nrow(df), "\n")
  cat("  Date range:", format(min(df$date)), "to", format(max(df$date)), "\n")
  cat("  Amount range: $", min(df$amount), "to $", max(df$amount), "\n")
  cat("  Avg words per narration:", mean(df$word_count), "\n")

  # Save cleaned data
  write_csv(df, output_path)

  return(df)
}

# Feature engineering for analysis
engineer_features <- function(df) {
  df %>%
    mutate(
      # Temporal features
      day_of_week = wday(date, label = TRUE),
      is_weekend = day_of_week %in% c("Sat", "Sun"),
      month = month(date, label = TRUE),

      # Amount buckets
      amount_bucket = case_when(
        amount_abs < 10 ~ "micro",
        amount_abs < 50 ~ "small",
        amount_abs < 200 ~ "medium",
        amount_abs < 1000 ~ "large",
        TRUE ~ "very_large"
      ),

      # Text complexity
      text_complexity = case_when(
        word_count <= 3 ~ "simple",
        word_count <= 6 ~ "moderate",
        TRUE ~ "complex"
      )
    )
}

Preprocessing Rationale:

Lowercase normalization: Ensures "Pharmacy" and "pharmacy" match
Special character removal: Reduces noise, improves keyword matching
Amount features: Transaction size influences categorization importance
Text complexity: Longer descriptions are often more specific and easier to categorize
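
If narrations also need cleaning on the Python side (for example at inference time), the same normalization can be sketched with the standard library. The function name `normalize_narration` is hypothetical:

```python
import re

def normalize_narration(text: str) -> str:
    """Lowercase, drop special characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # special chars become spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs, trim the ends

cleaned = normalize_narration("  Shell Gas Station #4521 ")
# cleaned == "shell gas station 4521"
```

Collapsing whitespace after removing special characters matters, because each removed character leaves a space behind.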

Rule-Based NER Implementation

Keyword Configuration

Our rule-based system uses a YAML configuration file, which keeps the rules maintainable and lets non-developers edit them:

# models/keyword_rules.yaml
categories:
  Baby Items:
    keywords: 
      - pampers
      - diapers
      - baby powder
      - baby lotion
      - wipes
      - formula
      - baby food
      - onesie
      - stroller
      - crib
    weight: 1.0
    aliases: ["infant products", "nursery"]

  Groceries:
    keywords:
      - supermarket
      - grocery
      - bread
      - milk
      - eggs
      - cheese
      - meat
      - vegetables
      - fruit
      - walmart
      - costco
      - whole foods
    weight: 1.0
    aliases: ["food shopping", "provisions"]

  Healthcare:
    keywords:
      - doctor
      - pharmacy
      - cvs
      - walgreens
      - medicine
      - prescription
      - clinic
      - hospital
      - medical
      - dentist
      - optometrist
    weight: 1.5  # Higher weight for important category
    aliases: ["medical", "health services"]

  Transportation:
    keywords:
      - uber
      - lyft
      - taxi
      - fuel
      - gas
      - parking
      - metro
      - train
      - bus fare
      - toll
    weight: 1.0
    aliases: ["travel", "commute"]

  Utilities:
    keywords:
      - electric
      - water bill
      - gas bill
      - internet
      - phone bill
      - verizon
      - comcast
      - att
    weight: 1.2
    aliases: ["bills", "services"]

  Entertainment:
    keywords:
      - netflix
      - spotify
      - hulu
      - disney plus
      - movie
      - cinema
      - theater
      - concert
      - game
    weight: 0.8
    aliases: ["leisure", "recreation"]

# Matching configuration
matching:
  min_confidence: 0.3
  partial_match_penalty: 0.5
  multi_word_bonus: 1.2

# Thresholds
unknown_threshold: 0.3  # Below this → ML classification
review_threshold: 0.5   # Below this → human review
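
Because non-developers can edit this file, it may be worth checking it before the classifier loads it. The sketch below is an assumption, not part of the original pipeline: `validate_rules` is a hypothetical helper, and the rule that `unknown_threshold` should not exceed `review_threshold` is inferred from how the two thresholds are described above.

```python
def validate_rules(config: dict) -> list:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    for name, info in config.get("categories", {}).items():
        if not info.get("keywords"):
            problems.append(f"{name}: no keywords")
        if info.get("weight", 1.0) <= 0:
            problems.append(f"{name}: weight must be positive")
    unknown = config.get("unknown_threshold", 0.0)
    review = config.get("review_threshold", 0.0)
    if not 0.0 <= unknown <= review <= 1.0:
        problems.append("thresholds must satisfy 0 <= unknown <= review <= 1")
    return problems

# A trimmed-down dict with the same shape yaml.safe_load would return.
config = {
    "categories": {
        "Entertainment": {"keywords": ["netflix", "spotify"], "weight": 0.8},
    },
    "unknown_threshold": 0.3,
    "review_threshold": 0.5,
}
problems = validate_rules(config)
```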

Python NER Classifier Implementation

# src/python/ner_classifier.py
import pandas as pd
import numpy as np
import yaml
import re
from typing import Dict, List, Tuple, Optional
from pathlib import Path

class AdaptiveNERClassifier:
    """
    Hybrid NER classifier combining rule-based and ML approaches
    with unsupervised category discovery.
    """

    def __init__(self, rules_path: str = "models/keyword_rules.yaml"):
        """Initialize classifier with keyword rules."""
        self.rules_path = Path(rules_path)
        self.load_rules()

        # ML components (initialized later)
        self.vectorizer = None
        self.ml_classifier = None
        self.cluster_model = None

        # Tracking
        self.discovered_categories = {}
        self.classification_stats = {
            'rule_based': 0,
            'ml_based': 0,
            'unknown': 0
        }

    def load_rules(self):
        """Load keyword rules from YAML config."""
        with open(self.rules_path, 'r') as f:
            config = yaml.safe_load(f)

        self.categories = config['categories']
        self.matching_config = config['matching']
        self.unknown_threshold = config['unknown_threshold']
        self.review_threshold = config['review_threshold']

        # Precompile regex patterns for efficiency
        self._compile_patterns()

    def _compile_patterns(self):
        """Compile regex patterns for each keyword."""
        self.patterns = {}

        for category, info in self.categories.items():
            patterns = []
            for keyword in info['keywords']:
                # Word boundary matching for precision
                pattern = r'\b' + re.escape(keyword) + r'\b'
                patterns.append(re.compile(pattern, re.IGNORECASE))
            self.patterns[category] = patterns

    def keyword_match(self, text: str) -> Tuple[str, float, List[str]]:
        """
        Rule-based keyword matching with confidence scoring.

        Returns:
            (category, confidence, matched_keywords)
        """
        text_lower = text.lower()
        text_words = set(text_lower.split())
        matches = {}
        matched_kw = {}

        for category, patterns in self.patterns.items():
            match_count = 0
            category_matches = []

            for pattern, keyword in zip(patterns, 
                                       self.categories[category]['keywords']):
                if pattern.search(text):
                    match_count += 1
                    category_matches.append(keyword)

            if match_count > 0:
                # Weight by category importance
                weight = self.categories[category]['weight']

                # Bonus for multiple keyword matches
                if match_count > 1:
                    weight *= self.matching_config['multi_word_bonus']

                matches[category] = match_count * weight
                matched_kw[category] = category_matches

        if not matches:
            return "Unknown", 0.0, []

        # Best matching category
        best_category = max(matches, key=matches.get)

        # Confidence based on match strength relative to text length
        raw_score = matches[best_category]
        text_length = len(text_words)
        confidence = min(raw_score / max(text_length, 1), 1.0)

        return best_category, confidence, matched_kw[best_category]

    def classify_single(self, text: str, amount: Optional[float] = None) -> Dict:
        """
        Classify a single transaction.

        Args:
            text: Transaction narration
            amount: Transaction amount (optional, for weighted decisions)

        Returns:
            Classification result dictionary
        """
        # Rule-based classification
        category, confidence, keywords = self.keyword_match(text)

        result = {
            'narration': text,
            'amount': amount,
            'category': category,
            'confidence': confidence,
            'method': 'rule-based',
            'keywords_matched': keywords,
            'needs_review': confidence < self.review_threshold
        }

        # If low confidence and ML model available, try ML
        if confidence < self.unknown_threshold and self.ml_classifier is not None:
            ml_result = self._ml_classify_single(text)

            # Use ML if more confident
            if ml_result['confidence'] > confidence:
                result.update(ml_result)
                result['method'] = 'ml-based'
                result['fallback_from'] = 'rule-based'

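        # Track which path produced the final label, including unmatched texts
        if result['category'] == 'Unknown':
            self.classification_stats['unknown'] += 1
        elif result['method'] == 'rule-based':
            self.classification_stats['rule_based'] += 1
        else:
            self.classification_stats['ml_based'] += 1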

        return result

    def classify_batch(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Classify a batch of transactions efficiently.

        Args:
            df: DataFrame with 'narration' and 'amount' columns

        Returns:
            DataFrame with classification results
        """
        results = []

        for idx, row in df.iterrows():
            result = self.classify_single(
                row['narration'],
                row.get('amount', None)
            )
            results.append(result)

        return pd.DataFrame(results)

    def get_stats(self) -> Dict:
        """Get classification statistics."""
        total = sum(self.classification_stats.values())
        # Avoid division by zero before anything has been classified
        denom = max(total, 1)

        return {
            'total_classified': total,
            'rule_based_pct': self.classification_stats['rule_based'] / denom * 100,
            'ml_based_pct': self.classification_stats['ml_based'] / denom * 100,
            'unknown_pct': self.classification_stats['unknown'] / denom * 100
        }
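The word-boundary anchors built in `_compile_patterns` are what stop short keywords such as `gas` or `att` from firing inside longer words. The snippet below is a small standalone sketch of that behaviour; `compile_keyword` is an illustrative helper that mirrors the pattern construction above, not a function from the project.

```python
import re


def compile_keyword(keyword: str) -> "re.Pattern":
    """Escape the keyword and wrap it in word-boundary anchors, case-insensitively."""
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)


gas = compile_keyword("gas")
att = compile_keyword("att")
bus_fare = compile_keyword("bus fare")

hits = {
    "shell gas station": bool(gas.search("shell gas station")),   # whole word
    "gasoline refill": bool(gas.search("gasoline refill")),       # substring only
    "ATT wireless": bool(att.search("ATT wireless")),             # case-insensitive
    "attorney fees": bool(att.search("attorney fees")),           # substring only
    "metro bus fare": bool(bus_fare.search("metro bus fare")),    # multi-word keyword
}

for text, matched in hits.items():
    print(f"{text!r}: {matched}")
```

One thing the anchors do not solve: `gas` still matches inside "gas bill", so a narration like that hits both Transportation and Utilities, and the category weights decide the winner.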

Rule-Based Classification Algorithm

Step-by-Step Process:

Text Normalization

   text_lower = text.lower()
   text_words = set(text_lower.split())

Pattern Matching
- Iterate through all category patterns
- Use compiled regex for speed
- Count matches per category

Scoring

   score = match_count * category_weight * multi_word_bonus   # bonus applies only when more than one keyword matches

Confidence Calculation

   confidence = min(score / text_length, 1.0)

Decision Logic
- If confidence ≥ unknown_threshold → Accept rule-based classification
- If confidence < unknown_threshold → Try ML classifier
- If confidence < review_threshold → Flag for human review
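The steps above can be traced by hand on a couple of narrations. The following is a minimal sketch under simplifying assumptions: it uses a two-category subset of the rules, and `score` and `route` are illustrative names, not functions from the project. The routing labels also compress the real logic, where the review flag is set independently of the ML fallback.

```python
import re
from typing import Dict, Tuple

CATEGORIES = {
    "Healthcare": {"keywords": ["pharmacy", "cvs", "prescription"], "weight": 1.5},
    "Transportation": {"keywords": ["uber", "taxi"], "weight": 1.0},
}
MULTI_WORD_BONUS = 1.2
UNKNOWN_THRESHOLD = 0.3
REVIEW_THRESHOLD = 0.5


def score(text: str) -> Tuple[str, float]:
    """Return (best_category, confidence) following the scoring steps above."""
    scores: Dict[str, float] = {}
    for category, info in CATEGORIES.items():
        matched = [kw for kw in info["keywords"]
                   if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)]
        if matched:
            weight = info["weight"]
            if len(matched) > 1:          # bonus only for multiple keyword hits
                weight *= MULTI_WORD_BONUS
            scores[category] = len(matched) * weight
    if not scores:
        return "Unknown", 0.0
    best = max(scores, key=scores.get)
    n_words = len(set(text.lower().split()))   # unique words, as in keyword_match
    return best, min(scores[best] / max(n_words, 1), 1.0)


def route(confidence: float) -> str:
    """Simplified decision logic based on the two thresholds."""
    if confidence < UNKNOWN_THRESHOLD:
        return "try ML classifier"
    if confidence < REVIEW_THRESHOLD:
        return "accept rule-based, flag for review"
    return "accept rule-based"


for text in ["cvs pharmacy prescription pickup", "taxi to airport", "uber ride to downtown"]:
    category, confidence = score(text)
    print(f"{text!r}: {category} ({confidence:.2f}) -> {route(confidence)}")
```

Note the side effect of dividing by text length: "uber ride to downtown" contains an unambiguous keyword, yet its confidence is 1/4 = 0.25, which falls below the unknown threshold. Longer narrations dilute the score, so the thresholds need to be tuned against typical description lengths.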

Performance Characteristics:

Speed: ~0.1ms per transaction
Accuracy: 85-90% on known patterns
Interpretability: Full keyword traceability
Maintenance: Easy keyword updates via YAML

Machine Learning Components

Feature Engineering for ML

# src/python/feature_engineering.py
import numpy as np
import pandas as pd  # needed for the pd.DataFrame type hints below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

class TransactionFeaturizer:
    """Extract features from transaction text and metadata."""

    def __init__(self, max_features=500, ngram_range=(1, 3)):
        self.tfidf = TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            min_df=2,              # Ignore very rare terms
            max_df=0.8,            # Ignore very common terms
            sublinear_tf=True,     # Use log scaling
            stop_words='english'
        )

        self.amount_scaler = StandardScaler()
        self.fitted = False

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Fit and transform features."""
        # Text features
        text_features = self.tfidf.fit_transform(df['narration'])

        # Numerical features
        numerical = self._extract_numerical_features(df)
        numerical_scaled = self.amount_scaler.fit_transform(numerical)

        # Combine
        features = np.hstack([
            text_features.toarray(),
            numerical_scaled
        ])

        self.fitted = True
        return features

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Transform new data using fitted transformers."""
        if not self.fitted:
            raise ValueError("Featurizer not fitted. Call fit_transform first.")

        text_features = self.tfidf.transform(df['narration'])
        numerical = self._extract_numerical_features(df)
        numerical_scaled = self.amount_scaler.transform(numerical)

        return np.hstack([
            text_features.toarray(),
            numerical_scaled
        ])

    def _extract_numerical_features(self, df: pd.DataFrame) -> np.ndarray:
        """Extract numerical features from transactions."""
        features = []

        # Amount features
        features.append(df['amount'].abs().values.reshape(-1, 1))
        features.append(np.log1p(df['amount'].abs()).values.reshape(-1, 1))

        # Text length features
        features.append(df['narration'].str.len().values.reshape(-1, 1))
        features.append(df['narration'].str.split().str.len().values.reshape(-1, 1))

        # Character diversity
        features.append(
            df['narration'].apply(lambda x: len(set(x)) / max(len(x), 1))
            .values.reshape(-1, 1)
        )

        return np.hstack(features)
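To make the five numerical columns concrete, here is a dependency-free sketch that computes the same values for a single transaction. `numerical_features` is an illustrative stand-in for `_extract_numerical_features`, which does the same work column-wise with pandas.

```python
import math
from typing import List


def numerical_features(narration: str, amount: float) -> List[float]:
    """Compute the five per-transaction numerical features for one row."""
    magnitude = abs(amount)
    return [
        magnitude,                                      # absolute amount
        math.log1p(magnitude),                          # log-scaled amount
        float(len(narration)),                          # character count
        float(len(narration.split())),                  # word count
        len(set(narration)) / max(len(narration), 1),   # character diversity
    ]


row = numerical_features("uber ride", -25.0)
print([round(value, 3) for value in row])
```

For "uber ride" the character diversity is 7 unique characters out of 9, about 0.78. In the real featurizer these columns are standardized and then stacked next to the TF-IDF matrix.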

Random Forest Classifier

# src/python/train_model.py (ML section)
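import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
import mlflow.sklearn

from feature_engineering import TransactionFeaturizer  # defined in src/python/feature_engineering.py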

class MLClassifierTrainer:
    """Train and evaluate ML classifier."""

    def __init__(self):
        self.featurizer = TransactionFeaturizer()
        self.classifier = RandomForestClassifier(
            n_estimators=100,
            max_depth=15,
            min_samples_split=10,
            min_samples_leaf=4,
            max_features='sqrt',
            class_weight='balanced',  # Handle class imbalance
            random_state=42,
            n_jobs=-1  # Use all CPU cores
        )

    def train(self, df: pd.DataFrame):
        """
        Train classifier on labeled data.

        Args:
            df: DataFrame with 'narration', 'amount', 'category' columns
        """
        # Filter out Unknown categories
        train_df = df[df['category'] != 'Unknown'].copy()

        if len(train_df) < 20:
            print("⚠️  Insufficient training data. Need at least 20 labeled samples.")
            return False

        print(f"Training on {len(train_df)} samples across {train_df['category'].nunique()} categories")

        # Extract features
        X = self.featurizer.fit_transform(train_df)
        y = train_df['category']

        # Amount-based sample weighting
        # Give more weight to high-value transactions
        sample_weights = np.log1p(train_df['amount'].abs())
        sample_weights = sample_weights / sample_weights.sum()

        # Train-test split
        X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
            X, y, sample_weights,
            test_size=0.2,
            random_state=42,
            stratify=y
        )

        # Train model
        self.classifier.fit(X_train, y_train, sample_weight=w_train)

        # Evaluate
        train_score = self.classifier.score(X_train, y_train)
        test_score = self.classifier.score(X_test, y_test)

        # Cross-validation
        cv_scores = cross_val_score(
            self.classifier, X_train, y_train,
            cv=5, scoring='f1_weighted'
        )

        print(f"✓ Training accuracy: {train_score:.3f}")
        print(f"✓ Test accuracy: {test_score:.3f}")
        print(f"✓ CV F1 score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

        # Detailed classification report
        y_pred = self.classifier.predict(X_test)
        print("\nClassification Report:")
        print(classification_report(y_test, y_pred))

        return True

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Predict categories for new transactions."""
        X = self.featurizer.transform(df)

        predictions = self.classifier.predict(X)
        probabilities = self.classifier.predict_proba(X)

        # Get confidence (max probability)
        confidences = probabilities.max(axis=1)

        # Get full probability distribution
        prob_dists = [
            dict(zip(self.classifier.classes_, probs))
            for probs in probabilities
        ]

        result_df = df.copy()
        result_df['category'] = predictions
        result_df['confidence'] = confidences
        result_df['probability_dist'] = prob_dists
        result_df['method'] = 'ml-based'

        return result_df

Why Random Forest?

Handles mixed features: Text (TF-IDF) + numerical (amounts)
Robust to noise: Tree averaging reduces overfitting
Feature importance: Interpretable results
No scaling needed: Tree splits are invariant to feature scale, so the StandardScaler in the featurizer is optional for this model
Built-in confidence: Probability estimates from tree votes
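The "probability estimates from tree votes" idea can be illustrated with a few lines of plain Python. This is a simplified sketch: it counts hard votes, whereas scikit-learn's `RandomForestClassifier.predict_proba` averages each tree's class probabilities. `vote_confidence` and the vote counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def vote_confidence(tree_votes: List[str]) -> Tuple[str, float, Dict[str, float]]:
    """Turn per-tree class votes into a label, a confidence, and a distribution."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    distribution = {label: n / total for label, n in counts.items()}
    label, top = counts.most_common(1)[0]
    return label, top / total, distribution


# 100 trees, matching n_estimators=100 in the trainer above
votes = ["Groceries"] * 62 + ["Healthcare"] * 30 + ["Utilities"] * 8
label, confidence, dist = vote_confidence(votes)
print(label, confidence, dist)
```

A split like 62/30/8 is the kind of case the `probability_dist` column is useful for: the winner is clear, but the runner-up is worth showing to a reviewer.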

Hyperparameter Rationale:

n_estimators=100: Balance between performance and training time
max_depth=15: Prevent overfitting on noisy text data
min_samples_split=10: Require sufficient samples for splits
class_weight='balanced': Handle imbalanced categories
max_features='sqrt': Standard heuristic for classification

Amount-Weighted Training

Key innovation: Not all transactions are equally important.

# High-value transactions get more weight
sample_weights = np.log1p(train_df['amount'].abs())
sample_weights = sample_weights / sample_weights.sum()

# Result: a $1000 transaction has ~1.5x the influence of a $100 transaction
# (log1p(1000) ≈ 6.91 vs log1p(100) ≈ 4.62)

Business Logic:

$5 coffee miscategorization: Minor impact
$5000 invoice miscategorization: Major impact
Model learns to be more careful with large amounts
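It is worth checking how much extra influence the log transform actually gives, because it compresses large amounts heavily. The short calculation below uses only the standard library; `log_weight` is an illustrative name for the same `log1p(abs(amount))` expression used in training.

```python
import math


def log_weight(amount: float) -> float:
    """Unnormalized sample weight: log(1 + |amount|)."""
    return math.log1p(abs(amount))


# Dividing every weight by the same sum does not change these ratios
ratio_1000_vs_100 = log_weight(1000) / log_weight(100)
ratio_5000_vs_5 = log_weight(5000) / log_weight(5)

print(f"$1000 vs $100: {ratio_1000_vs_100:.2f}x")
print(f"$5000 vs $5:   {ratio_5000_vs_5:.2f}x")
```

A tenfold larger amount gets roughly 1.5 times the weight, and the $5000 invoice gets a little under 5 times the weight of the $5 coffee. If the business needs a steeper emphasis, a square root or a capped linear weight would be alternatives to try.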

Unsupervised Category Discovery

DBSCAN Clustering for Unknown Transactions

When transactions don't match existing categories, we use clustering to discover new patterns:

# src/python/category_discovery.py
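from typing import Dict, List
from collections import Counter
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from feature_engineering import TransactionFeaturizer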

class CategoryDiscovery:
    """Discover new categories from unknown transactions using clustering."""

    def __init__(self, min_cluster_size=3, eps=0.3):
        self.min_cluster_size = min_cluster_size
        self.eps = eps
        self.featurizer = TransactionFeaturizer(max_features=200)

    def discover_categories(self, unknown_texts: List[str]) -> Dict:
        """
        Cluster unknown transactions to discover potential new categories.

        Args:
            unknown_texts: List of unclassified transaction narrations

        Returns:
            Dictionary of discovered clusters with sample texts
        """
        if len(unknown_texts) < self.min_cluster_size:
            print(f"⚠️  Need at least {self.min_cluster_size} unknown transactions for clustering")
            return {}

        print(f"Analyzing {len(unknown_texts)} unknown transactions...")

        # Create temporary DataFrame for featurization
        temp_df = pd.DataFrame({
            'narration': unknown_texts,
            'amount': [0] * len(unknown_texts)  # Dummy amounts
        })

        # Extract features
        X = self.featurizer.fit_transform(temp_df)

        # DBSCAN clustering
        # eps: maximum distance between samples in same cluster
        # min_samples: minimum cluster size
        clustering = DBSCAN(
            eps=self.eps,
            min_samples=self.min_cluster_size,
            metric='cosine',  # Good for text similarity
            n_jobs=-1
        )

        labels = clustering.fit_predict(X)

        # Analyze clusters
        unique_labels = set(labels)
        n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
        n_noise = list(labels).count(-1)

        print(f"✓ Found {n_clusters} potential new categories")
        print(f"  {n_noise} transactions remain as noise")

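        if n_clusters > 1:
            # Silhouette needs at least two clusters; score only the clustered (non-noise) points
            clustered = labels != -1
            silhouette = silhouette_score(X[clustered], labels[clustered], metric='cosine')
            print(f"  Silhouette score: {silhouette:.3f}")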

        # Extract cluster information
        discovered = {}

        for label in unique_labels:
            if label == -1:  # Noise cluster
                continue

            # Get texts in this cluster
            cluster_mask = (labels == label)
            cluster_texts = [unknown_texts[i] for i, m in enumerate(cluster_mask) if m]

            # Analyze cluster
            cluster_info = self._analyze_cluster(cluster_texts)

            discovered[f"NewCategory_{label}"] = {
                'sample_texts': cluster_texts[:10],  # First 10 examples
                'size': len(cluster_texts),
                'keywords': cluster_info['top_keywords'],
                'suggested_name': cluster_info['suggested_name']
            }

        return discovered

    def _analyze_cluster(self, texts: List[str]) -> Dict:
        """Analyze a cluster to extract keywords and suggest a name."""
        # Combine all texts
        combined = ' '.join(texts)
        words = combined.lower().split()

        # Count word frequency
        word_counts = Counter(words)

        # Remove common stop words
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for'}
        # Keep a Counter so most_common() is still available after filtering
        word_counts = Counter({w: c for w, c in word_counts.items()
                               if w not in stop_words and len(w) > 2})

        # Top keywords
        top_keywords = [w for w, c in word_counts.most_common(5)]

        # Suggest category name based on most common keyword
        if top_keywords:
            suggested_name = top_keywords[0].title() + " Related"
        else:
            suggested_name = "Miscellaneous"

        return {
            'top_keywords': top_keywords,
            'suggested_name': suggested_name
        }

    def visualize_clusters(self, unknown_texts: List[str], 
                          labels: np.ndarray, 
                          save_path: str = None):
        """Visualize clusters using t-SNE dimensionality reduction."""
        from sklearn.manifold import TSNE
        import matplotlib.pyplot as plt

        temp_df = pd.DataFrame({
            'narration': unknown_texts,
            'amount': [0] * len(unknown_texts)
        })

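        X = self.featurizer.transform(temp_df)

        # Recompute cluster labels when the caller passes labels=None
        if labels is None:
            labels = DBSCAN(
                eps=self.eps,
                min_samples=self.min_cluster_size,
                metric='cosine'
            ).fit_predict(X)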

        # Reduce to 2D for visualization
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(X)-1))
        X_2d = tsne.fit_transform(X)

        # Plot
        plt.figure(figsize=(12, 8))
        scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], 
                            c=labels, cmap='tab10', 
                            alpha=0.6, s=100)
        plt.colorbar(scatter)
        plt.title('Discovered Category Clusters (t-SNE Visualization)')
        plt.xlabel('Dimension 1')
        plt.ylabel('Dimension 2')

        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')

        plt.show()

DBSCAN Parameter Selection

eps (epsilon): Maximum distance between points in same cluster

For cosine distance on text features, 0.2-0.4 is a typical starting range
Lower = tighter, more conservative clusters
Higher = looser, more permissive clusters

min_samples: Minimum cluster size

Set to 3-5 for transaction data
Prevents overfitting to noise
Requires pattern repetition to count as category
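To get a feel for what an `eps` of 0.3 means, the sketch below computes cosine distance between narrations using plain word counts. This is a simplification: the pipeline clusters TF-IDF plus scaled numerical features, so real distances will differ. `cosine_distance` is an illustrative helper.

```python
import math
from collections import Counter


def cosine_distance(a: str, b: str) -> float:
    """1 - cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 1.0 - dot / norm if norm else 1.0


EPS = 0.3

near = cosine_distance("geico auto insurance premium",
                       "allstate auto insurance premium")
far = cosine_distance("geico auto insurance premium",
                      "linkedin premium subscription")

print(f"near: {near:.3f} (neighbors: {near <= EPS})")
print(f"far:  {far:.3f} (neighbors: {far <= EPS})")
```

Two narrations that share three of four words sit at distance 0.25 and count as neighbors, while sharing only the word "premium" leaves them around 0.71 apart. DBSCAN then needs `min_samples` such neighbors around a point before it starts a cluster.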

Example Discovery Output:

{
  "NewCategory_0": {
    "size": 12,
    "keywords": ["insurance", "policy", "premium", "geico", "coverage"],
    "suggested_name": "Insurance Related",
    "sample_texts": [
      "geico auto insurance monthly premium",
      "state farm policy renewal payment",
      "allstate insurance payment confirmation"
    ]
  },
  "NewCategory_1": {
    "size": 8,
    "keywords": ["subscription", "monthly", "membership", "fee"],
    "suggested_name": "Subscription Related",
    "sample_texts": [
      "linkedin premium monthly subscription",
      "amazon prime membership renewal",
      "new york times digital subscription"
    ]
  }
}
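The source does not show how a discovered cluster becomes a real category, so the following is only one possible approach: convert a reviewed cluster into an entry shaped like the `categories` section of `keyword_rules.yaml`. `to_rule_entry` is a hypothetical helper, and a human should still vet the keywords before merging them.

```python
from typing import Any, Dict


def to_rule_entry(cluster: Dict[str, Any], weight: float = 1.0) -> Dict[str, Dict[str, Any]]:
    """Build a rules entry (keywords, weight, aliases) from one discovered cluster."""
    # Drop the " Related" suffix that _analyze_cluster appends to suggested names
    name = cluster["suggested_name"].replace(" Related", "").strip()
    return {
        name: {
            "keywords": list(cluster["keywords"]),
            "weight": weight,
            "aliases": [],
        }
    }


discovered = {
    "NewCategory_0": {
        "size": 12,
        "keywords": ["insurance", "policy", "premium", "geico", "coverage"],
        "suggested_name": "Insurance Related",
        "sample_texts": ["geico auto insurance monthly premium"],
    }
}

entry = to_rule_entry(discovered["NewCategory_0"])
print(entry)
```

Generic words such as "premium" or "monthly" appear in more than one cluster above, which is a good reason to prune the keyword list by hand rather than appending it to the YAML automatically.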

MLflow Integration & Model Tracking

Experiment Tracking Setup

# src/python/train_model.py
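import mlflow
import mlflow.sklearn
import pandas as pd
from pathlib import Path
import json

from ner_classifier import AdaptiveNERClassifier
from category_discovery import CategoryDiscovery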

def setup_mlflow(experiment_name="NER-Classification", 
                tracking_uri="./mlruns"):
    """Configure MLflow tracking."""
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment_name)

    # Auto-log sklearn metrics
    mlflow.sklearn.autolog(
        log_models=True,
        log_input_examples=True,
        log_model_signatures=True
    )

def train_and_log_model(data_path: str, 
                       experiment_name: str = "NER-Classification"):
    """
    Complete training pipeline with MLflow tracking.
    """
    setup_mlflow(experiment_name)

    # Load data
    df = pd.read_csv(data_path)

    with mlflow.start_run(run_name=f"training_{pd.Timestamp.now():%Y%m%d_%H%M%S}"):
        # Log data info
        mlflow.log_param("data_path", data_path)
        mlflow.log_param("total_records", len(df))
        # The date column is optional, so only log the range when it exists
        if 'date' in df.columns:
            mlflow.log_param("date_range", f"{df['date'].min()} to {df['date'].max()}")

        # Initialize classifier
        classifier = AdaptiveNERClassifier()

        # Phase 1: Rule-based classification
        print("\n=== Phase 1: Rule-Based Classification ===")
        classified_df = classifier.classify_batch(df)

        rule_coverage = (classified_df['category'] != 'Unknown').sum() / len(df)
        rule_avg_confidence = classified_df[
            classified_df['category'] != 'Unknown'
        ]['confidence'].mean()

        mlflow.log_metric("rule_based_coverage", rule_coverage)
        mlflow.log_metric("rule_based_avg_confidence", rule_avg_confidence)

        print(f"✓ Rule-based coverage: {rule_coverage:.2%}")

        # Log category distribution
        category_dist = classified_df['category'].value_counts().to_dict()
        mlflow.log_dict(category_dist, "rule_based_category_distribution.json")

        # Phase 2: Category Discovery
        print("\n=== Phase 2: Category Discovery ===")
        discovery = CategoryDiscovery()
        unknown_texts = classified_df[
            classified_df['category'] == 'Unknown'
        ]['narration'].tolist()

        new_categories = discovery.discover_categories(unknown_texts)

        mlflow.log_metric("unknown_count", len(unknown_texts))
        mlflow.log_metric("discovered_clusters", len(new_categories))

        if new_categories:
            mlflow.log_dict(new_categories, "discovered_categories.json")

            # Create visualization
            discovery.visualize_clusters(
                unknown_texts, 
                labels=None,  # Will be computed internally
                save_path="cluster_visualization.png"
            )
            mlflow.log_artifact("cluster_visualization.png")

        # Phase 3: ML Training
        print("\n=== Phase 3: ML Model Training ===")
        ml_trainer = MLClassifierTrainer()

        training_success = ml_trainer.train(classified_df)

        if training_success:
            # Re-classify with ML model
            final_df = ml_trainer.predict(df)

            final_coverage = (final_df['category'] != 'Unknown').sum() / len(df)
            final_avg_confidence = final_df['confidence'].mean()

            mlflow.log_metric("final_coverage", final_coverage)
            mlflow.log_metric("final_avg_confidence", final_avg_confidence)
            mlflow.log_metric("ml_improvement", final_coverage - rule_coverage)

            print(f"✓ Final coverage: {final_coverage:.2%}")
            print(f"✓ Improvement: {(final_coverage - rule_coverage):.2%}")

            # Feature importance analysis
            feature_importance = ml_trainer.classifier.feature_importances_
            top_features_idx = feature_importance.argsort()[-20:][::-1]

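            # Importances cover the TF-IDF terms plus the five numerical columns,
            # in the order built by _extract_numerical_features
            feature_names = list(ml_trainer.featurizer.tfidf.get_feature_names_out()) + [
                'amount_abs', 'amount_log', 'char_count', 'word_count', 'char_diversity'
            ]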
            top_features = {
                str(feature_names[i]): float(feature_importance[i])
                for i in top_features_idx
            }

            mlflow.log_dict(top_features, "top_features.json")

            # Save models
            classifier.save_model("models/ner_classifier.pkl")
            mlflow.log_artifact("models/ner_classifier.pkl")

            # Save predictions
            final_df.to_csv("data/processed/classified_transactions.csv", index=False)
            mlflow.log_artifact("data/processed/classified_transactions.csv")

            # Calculate business metrics
            amount_weighted_coverage = (
                final_df[final_df['category'] != 'Unknown']['amount'].abs().sum() /
                df['amount'].abs().sum()
            )
            mlflow.log_metric("amount_weighted_coverage", amount_weighted_coverage)

            # Low confidence analysis
            low_conf_count = (final_df['confidence'] < 0.5).sum()
            mlflow.log_metric("low_confidence_count", low_conf_count)
            mlflow.log_metric("review_required_pct", low_conf_count / len(df))

            print(f"\n✓ Model saved. Run ID: {mlflow.active_run().info.run_id}")
            print(f"✓ {low_conf_count} transactions flagged for review")

            return classifier, final_df
        else:
            print("⚠️  ML training skipped due to insufficient data")
            return classifier, classified_df

if __name__ == "__main__":
    import sys

    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/sample_transactions.csv"
    train_and_log_model(data_path)

MLflow Tracking Dashboard

Once you run the training script, launch the MLflow UI:

mlflow ui --port 5000

Navigate to http://localhost:5000 to see:

Experiment Overview:

All training runs with timestamps
Sortable by metrics (coverage, accuracy, etc.)
Comparison view for multiple runs

Run Details:

Parameters: data path, record count, date range
Metrics: coverage rates, confidence scores, improvements
Artifacts: models, visualizations, JSON reports
Model signature: input/output schema

Model Registry:

Version history
Stage management (staging, production)
Deployment metadata
Model lineage

Model Versioning Strategy

# Register the trained scikit-learn model in the MLflow Model Registry
mlflow.sklearn.log_model(
    ml_trainer.classifier,
    "ner_classifier",
    registered_model_name="TransactionNER"
)

# Promote to production
# Note: recent MLflow releases deprecate stages in favor of model version aliases
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="TransactionNER",
    version=3,
    stage="Production"
)

Version Lifecycle:

None: Newly trained model
Staging: Under validation
Production: Actively serving predictions
Archived: Superseded by newer version

ZenML Orchestration

Pipeline Definition

# src/pipelines/zenml_pipeline.py
from zenml import pipeline, step
from zenml.config import DockerSettings
from zenml.integrations.mlflow.flavors import MLFlowExperimentTrackerSettings
import pandas as pd
from typing import Tuple, Dict
import sys
sys.path.append('src/python')

from ner_classifier import AdaptiveNERClassifier
from category_discovery import CategoryDiscovery
from train_model import MLClassifierTrainer

# Configure MLflow integration
mlflow_settings = MLFlowExperimentTrackerSettings(
    experiment_name="NER-ZenML-Pipeline",
    nested=True
)

@step
def load_data(data_path: str) -> pd.DataFrame:
    """Load and validate transaction data."""
    df = pd.read_csv(data_path)

    # Validation
    required_cols = ['narration', 'amount']
    missing = set(required_cols) - set(df.columns)

    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    print(f"✓ Loaded {len(df)} transactions")
    # 'date' is not a required column, so guard the lookup
    if 'date' in df.columns:
        print(f"  Date range: {df['date'].min()} to {df['date'].max()}")
    print(f"  Amount range: ${df['amount'].min():.2f} to ${df['amount'].max():.2f}")

    return df

@step
def rule_based_classification(df: pd.DataFrame) -> pd.DataFrame:
    """Apply rule-based NER classification."""
    classifier = AdaptiveNERClassifier()
    classified = classifier.classify_batch(df)

    stats = classifier.get_stats()
    print(f"✓ Rule-based classification complete")
    print(f"  Coverage: {stats['rule_based_pct']:.1f}%")

    return classified

@step
def discover_categories(df: pd.DataFrame) -> Dict:
    """Discover new categories from unknown items."""
    discovery = CategoryDiscovery()

    unknown_texts = df[df['category'] == 'Unknown']['narration'].tolist()
    new_cats = discovery.discover_categories(unknown_texts)

    print(f"✓ Category discovery complete")
    print(f"  Found {len(new_cats)} potential new categories")

    return new_cats

@step
def train_ml_classifier(df: pd.DataFrame) -> MLClassifierTrainer:
    """Train ML classifier on labeled data."""
    trainer = MLClassifierTrainer()

    success = trainer.train(df)

    if success:
        print("✓ ML training complete")
    else:
        print("⚠️  ML training skipped (insufficient data)")

    return trainer

@step
def final_classification(
    df: pd.DataFrame, 
    trainer: MLClassifierTrainer
) -> pd.DataFrame:
    """Final classification with trained model."""
    # The estimator object always exists; check that training actually ran
    if trainer.featurizer.fitted:
        final = trainer.predict(df)
        print(f"✓ Final classification complete")
    else:
        final = df
        print("⚠️  Using rule-based classification only")

    return final

@step
def generate_metrics(results: pd.DataFrame, new_cats: Dict) -> Dict:
    """Calculate comprehensive metrics."""
    metrics = {
        'total_transactions': len(results),
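        'coverage': float((results['category'] != 'Unknown').sum() / len(results)),
        'avg_confidence': float(results['confidence'].mean()),
        'discovered_categories': len(new_cats),
        # Cast NumPy integers to built-in int so json.dump() in save_results can serialize them
        'review_required': int((results['confidence'] < 0.5).sum()),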
        'category_distribution': results['category'].value_counts().to_dict(),
        'amount_by_category': results.groupby('category')['amount'].sum().to_dict()
    }

    print("\n=== Pipeline Metrics ===")
    print(f"Coverage: {metrics['coverage']:.2%}")
    print(f"Avg Confidence: {metrics['avg_confidence']:.3f}")
    print(f"Review Required: {metrics['review_required']} transactions")

    return metrics

@step
def save_results(
    results: pd.DataFrame, 
    metrics: Dict, 
    new_cats: Dict
) -> str:
    """Save all results and artifacts."""
    # Save classified transactions
    output_path = "data/processed/final_results.csv"
    results.to_csv(output_path, index=False)

    # Save metrics
    import json
    with open("data/processed/metrics.json", 'w') as f:
        json.dump(metrics, f, indent=2)

    # Save discovered categories
    with open("data/processed/discovered_categories.json", 'w') as f:
        json.dump(new_cats, f, indent=2)

    print(f"✓ Results saved to {output_path}")

    return output_path

@pipeline(settings={"experiment_tracker": mlflow_settings})
def ner_classification_pipeline(data_path: str):
    """
    Complete NER classification pipeline with MLOps tracking.

    Steps:
    1. Load and validate data
    2. Rule-based classification
    3. Discover new categories
    4. Train ML classifier
    5. Final classification
    6. Generate metrics
    7. Save results
    """
    # Load data
    df = load_data(data_path)

    # Rule-based classification
    classified = rule_based_classification(df)

    # Discover new categories
    new_cats = discover_categories(classified)

    # Train ML model
    trainer = train_ml_classifier(classified)

    # Final classification
    final_results = final_classification(df, trainer)

    # Generate metrics
    metrics = generate_metrics(final_results, new_cats)

    # Save everything
    output_path = save_results(final_results, metrics, new_cats)

    return output_path

# For local execution
if __name__ == "__main__":
    import sys

    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/sample_transactions.csv"

    print("Starting NER Classification Pipeline...")
    print(f"Data: {data_path}\n")

    result = ner_classification_pipeline(data_path=data_path)

    print(f"\n✓ Pipeline complete! Results: {result}")

ZenML Features Used

1. Step Caching

ZenML automatically caches step outputs
Rerun pipeline → only changed steps execute
Saves time during development
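The idea behind step caching can be shown with a small memoization decorator. This is a conceptual sketch only, not ZenML's implementation (ZenML also takes the step's code and configuration into account); `cached_step` and `load_rows` are made-up names.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}
executions = {"load_rows": 0}


def cached_step(func: Callable[..., Any]) -> Callable[..., Any]:
    """Reuse a stored output when the step name and its inputs are unchanged."""
    def wrapper(**inputs: Any) -> Any:
        payload = json.dumps({"step": func.__name__, "inputs": inputs}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in _cache:
            # Cache miss: run the step and remember its output
            _cache[key] = func(**inputs)
        return _cache[key]
    return wrapper


@cached_step
def load_rows(data_path: str, limit: int) -> list:
    executions["load_rows"] += 1   # count real executions
    return [f"{data_path}:{i}" for i in range(limit)]


first = load_rows(data_path="data/sample_transactions.csv", limit=3)
second = load_rows(data_path="data/sample_transactions.csv", limit=3)  # same inputs: reused
third = load_rows(data_path="data/sample_transactions.csv", limit=5)   # input changed: reruns

print(executions["load_rows"])
```

Three calls result in two real executions. The same principle is why a pipeline that regenerates its input data on every run gets little benefit from caching on the downstream steps.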

2. Artifact Tracking

Every step's input/output versioned
Full lineage from raw data to predictions
Reproducible pipelines

3. Stack Components

Orchestrator: Local, Airflow, or Kubernetes
Artifact Store: Local, S3, or GCS
Experiment Tracker: MLflow integration
Model Deployer: Seldon, KServe, etc.

4. Pipeline Scheduling

# Schedule daily retraining
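from zenml.config.schedule import Schedule

schedule = Schedule(cron_expression="0 2 * * *")  # 2 AM daily

# Requires an orchestrator that supports scheduling (the local orchestrator does not)
scheduled_pipeline = ner_classification_pipeline.with_options(schedule=schedule)
scheduled_pipeline(data_path="data/sample_transactions.csv")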

Running the Pipeline

# Initialize ZenML (first time only)
zenml init

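# Register MLflow tracker
zenml experiment-tracker register mlflow_tracker --flavor=mlflow

# Register a stack that includes the tracker and make it active
zenml stack register mlflow_stack -o default -a default -e mlflow_tracker --set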

# Run pipeline
python src/pipelines/zenml_pipeline.py data/sample_transactions.csv

# View pipeline runs
zenml pipeline runs list

# View specific run
zenml pipeline runs get <run_id>

R Integration & Interoperability

Calling Python from R

# src/R/python_integration.R
library(reticulate)
library(tidyverse)

# Configure Python environment
use_virtualenv("~/PycharmProjects/Local_NER/venv", required = TRUE)

# Import Python modules
py <- import("sys")
py$path <- c(py$path, "src/python")

ner <- import("ner_classifier")
train_module <- import("train_model")

# Wrapper function for R
classify_transactions_r <- function(data_path, output_path = NULL) {
  # Classify transactions using the Python NER pipeline from R.
  #
  # Args:
  #   data_path: Path to CSV with transaction data
  #   output_path: Optional path to save results
  #
  # Returns:
  #   Tibble with classification results

  # Call Python training function
  cat("Starting Python NER pipeline...\n")
  result <- train_module$train_and_log_model(data_path)

  # Extract results
  classifier <- result[[1]]
  classified_df <- result[[2]]

  # Convert to R tibble
  results_tbl <- classified_df %>%
    as_tibble() %>%
    mutate(
      category = as.factor(category),
      method = as.factor(method),
      needs_review = as.logical(needs_review)
    )

  cat("\n✓ Classification complete\n")
  cat("  Transactions:", nrow(results_tbl), "\n")
  cat("  Categories:", n_distinct(results_tbl$category), "\n")
  cat("  Avg confidence:", mean(results_tbl$confidence), "\n")

  # Optionally save
  if (!is.null(output_path)) {
    write_csv(results_tbl, output_path)
    cat("  Saved to:", output_path, "\n")
  }

  return(results_tbl)
}

# Load pre-trained classifier
load_classifier_r <- function(model_path = "models/ner_classifier.pkl") {
  # Load saved classifier for inference.

  classifier <- ner$AdaptiveNERClassifier()

  # Python pickle loading (use Python's built-in open(), not R's base::open)
  pickle <- import("pickle")
  builtins <- import_builtins()
  with(builtins$open(model_path, "rb") %as% f, {
    model_data <- pickle$load(f)
  })

  classifier$vectorizer <- model_data$vectorizer
  classifier$ml_classifier <- model_data$classifier
  classifier$rules <- model_data$rules

  return(classifier)
}

# Classify single transaction
classify_single_r <- function(classifier, narration, amount = 0) {
  # Classify a single transaction.

  result <- classifier$classify_single(narration, amount)

  tibble(
    narration = result$narration,
    amount = result$amount,
    category = result$category,
    confidence = result$confidence,
    method = result$method,
    needs_review = result$needs_review
  )
}

# Batch classify from R dataframe
classify_batch_r <- function(classifier, df) {
  # Classify a batch of transactions from an R dataframe.

  # Convert R dataframe to pandas
  pandas <- import("pandas")
  pdf <- r_to_py(df)

  # Classify
  result_pdf <- classifier$classify_batch(pdf)

  # Convert back to R
  result_df <- py_to_r(result_pdf) %>% as_tibble()

  return(result_df)
}

Data Transfer Between R and Python

# Example usage
library(tidyverse)
library(reticulate)

# Prepare data in R
transactions <- tribble(
  ~narration, ~amount, ~date,
  "walmart grocery shopping", 125.50, "2026-01-15",
  "cvs pharmacy prescription", 45.00, "2026-01-16",
  "uber ride downtown", 28.50, "2026-01-17"
) %>%
  mutate(date = as.Date(date))

# Save for Python
write_csv(transactions, "data/temp_transactions.csv")

# Run Python classification
results <- classify_transactions_r("data/temp_transactions.csv")

# Analyze in R
results %>%
  count(category, sort = TRUE) %>%
  ggplot(aes(x = reorder(category, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Transaction Categories", x = NULL, y = "Count")

Handling R ↔ Python Data Types

R Type	Python Type	Conversion
numeric	float	Automatic
integer	int	Automatic
character	str	Automatic
factor	str	Manual (as.character)
Date	datetime	Use py_to_r/r_to_py
data.frame	pandas.DataFrame	r_to_py(df)
tibble	pandas.DataFrame	r_to_py(df)
list	list/dict	Context-dependent

Automated Reporting System

R Markdown Report Template

---
title: "NER Classification Assessment Report"
subtitle: "Automated MLOps Pipeline Results"
author: "Transaction Classification System"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: 
      collapsed: false
      smooth_scroll: true
    theme: united
    code_folding: hide
    df_print: paged
params:
  results_path: "data/processed/final_results.csv"
  metrics_path: "data/processed/metrics.json"
  run_id: "latest"
---

knitr::opts_chunk$set(
  echo = TRUE, 
  warning = FALSE, 
  message = FALSE,
  fig.width = 12,
  fig.height = 8,
  dpi = 300
)

library(tidyverse)
library(knitr)
library(kableExtra)
library(DT)
library(plotly)
library(scales)
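library(jsonlite)
library(lubridate)  # wday() and floor_date() in the temporal analysis

start_time <- Sys.time()  # used by the processing-time line in the footer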

Executive Summary

# Load classification results
results <- read_csv(params$results_path) %>%
  mutate(
    category = as.factor(category),
    method = as.factor(method)
  )

# Load metrics
metrics <- fromJSON(params$metrics_path)

# Calculate key metrics
total_transactions <- nrow(results)
coverage_rate <- mean(results$category != "Unknown")
avg_confidence <- mean(results$confidence)
review_required <- sum(results$needs_review)
ml_usage_rate <- mean(results$method == "ml-based")

Pipeline Run Summary

Total Transactions: `r format(total_transactions, big.mark=",")`
Coverage Rate: `r percent(coverage_rate, accuracy=0.1)`
Average Confidence: `r round(avg_confidence, 3)`
Review Required: `r format(review_required, big.mark=",")` (`r percent(review_required/total_transactions, accuracy=0.1)`)
ML Classification Rate: `r percent(ml_usage_rate, accuracy=0.1)`


# Category Distribution

## Transaction Count by Category

category_summary <- results %>%
  group_by(category) %>%
  summarise(
    transactions = n(),
    total_amount = sum(abs(amount)),
    avg_amount = mean(abs(amount)),
    avg_confidence = mean(confidence),
    review_pct = mean(needs_review) * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(transactions))

category_summary %>%
  kable(
    caption = "Category Summary Statistics",
    col.names = c("Category", "Transactions", "Total Amount", 
                  "Avg Amount", "Avg Confidence", "Review %"),
    digits = c(0, 0, 2, 2, 3, 1),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3498db")


## Interactive Pie Chart

plot_ly(
  category_summary,
  labels = ~category,
  values = ~transactions,
  type = 'pie',
  textposition = 'inside',
  textinfo = 'label+percent',
  hoverinfo = 'label+value+percent',
  marker = list(
    line = list(color = '#FFFFFF', width = 2)
  )
) %>%
  layout(
    title = "Transaction Distribution by Category",
    showlegend = TRUE,
    legend = list(orientation = "v", x = 1.1, y = 0.5)
  )


---

# Classification Performance

## Method Performance Comparison

method_perf <- results %>%
  group_by(method) %>%
  summarise(
    transactions = n(),
    avg_confidence = mean(confidence),
    unknown_rate = mean(category == "Unknown") * 100,
    high_conf_rate = mean(confidence > 0.7) * 100,
    .groups = "drop"
  )

method_perf %>%
  kable(
    caption = "Performance by Classification Method",
    col.names = c("Method", "Transactions", "Avg Confidence", 
                  "Unknown %", "High Conf %"),
    digits = c(0, 0, 3, 1, 1),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )


## Confidence Distribution

p1 <- ggplot(results, aes(x = confidence, fill = method)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "red", size = 1) +
  scale_fill_manual(values = c("rule-based" = "#3498db", "ml-based" = "#e74c3c")) +
  labs(
    title = "Confidence Score Distribution by Method",
    subtitle = "Red line indicates review threshold (0.5)",
    x = "Confidence Score",
    y = "Count",
    fill = "Method"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggplotly(p1)


## Confidence by Category

p2 <- results %>%
  filter(category != "Unknown") %>%
  ggplot(aes(x = reorder(category, confidence), y = confidence, fill = category)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Confidence Distribution by Category",
    x = NULL,
    y = "Confidence Score"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

ggplotly(p2)


---

# Financial Analysis

## Amount-Weighted Coverage

amount_analysis <- results %>%
  mutate(
    amount_abs = abs(amount),
    weight = amount_abs / sum(amount_abs)
  ) %>%
  group_by(category) %>%
  summarise(
    weighted_coverage = sum(weight),
    transactions = n(),
    total_value = sum(amount_abs),
    avg_value = mean(amount_abs),
    .groups = "drop"
  ) %>%
  arrange(desc(weighted_coverage))

amount_analysis %>%
  mutate(
    weighted_coverage_pct = weighted_coverage * 100,
    total_value = dollar(total_value),
    avg_value = dollar(avg_value)
  ) %>%
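  # Put the columns in the same order as col.names below
  select(category, weighted_coverage_pct, transactions, total_value, avg_value) %>%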
  kable(
    caption = "Amount-Weighted Category Analysis",
    col.names = c("Category", "Weighted Coverage %", "Transactions", 
                  "Total Value", "Avg Value"),
    digits = c(0, 2, 0, 0, 0),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Top Categories by Transaction Value

{r top_categories_value}
p3 <- amount_analysis %>%
  top_n(10, total_value) %>%
  ggplot(aes(x = reorder(category, total_value), y = total_value)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Top 10 Categories by Total Transaction Value",
    x = NULL,
    y = "Total Value"
  ) +
  theme_minimal()

ggplotly(p3)

Transaction Size Distribution

{r transaction_size}
results %>%
  mutate(
    amount_bucket = case_when(
      abs(amount) < 10 ~ "< $10",
      abs(amount) < 50 ~ "$10-50",
      abs(amount) < 200 ~ "$50-200",
      abs(amount) < 1000 ~ "$200-1K",
      TRUE ~ "> $1K"
    ),
    amount_bucket = factor(amount_bucket,
                           levels = c("< $10", "$10-50", "$50-200",
                                      "$200-1K", "> $1K"))
  ) %>%
  count(amount_bucket, category) %>%
  ggplot(aes(x = amount_bucket, y = n, fill = category)) +
  geom_col(position = "stack") +
  labs(
    title = "Transaction Count by Amount Bucket and Category",
    x = "Amount Bucket",
    y = "Count",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

---

Review Queue

Low Confidence Transactions

Transactions with confidence < 0.5 should be reviewed for accuracy.

{r low_confidence_table}
low_conf <- results %>%
  filter(confidence < 0.5) %>%
  select(narration, category, confidence, amount, method) %>%
  arrange(confidence) %>%
  mutate(
    confidence = round(confidence, 3),
    amount = dollar(amount)
  )

if (nrow(low_conf) > 0) {
  datatable(
    low_conf,
    caption = "Transactions Requiring Review (Confidence < 0.5)",
    options = list(
      pageLength = 20,
      scrollX = TRUE,
      order = list(list(2, 'asc'))  # Sort by confidence
    ),
    rownames = FALSE
  ) %>%
    formatStyle(
      'confidence',
      background = styleColorBar(low_conf$confidence, 'lightblue'),
      backgroundSize = '100% 90%',
      backgroundRepeat = 'no-repeat',
      backgroundPosition = 'center'
    )
} else {
  cat("No low-confidence transactions found! 🎉\n")
}

Unknown Transactions

{r unknown_transactions}
unknown <- results %>%
  filter(category == "Unknown") %>%
  select(narration, amount, confidence, method) %>%
  arrange(desc(abs(amount)))

if (nrow(unknown) > 0) {
  cat("\nTotal Unknown Transactions:", nrow(unknown), "\n")
  cat("Total Value:", dollar(sum(abs(unknown$amount))), "\n\n")

  datatable(
    unknown %>% mutate(amount = dollar(amount)),
    caption = "Unclassified Transactions",
    options = list(pageLength = 15, scrollX = TRUE),
    rownames = FALSE
  )
} else {
  cat("All transactions successfully classified! 🎉\n")
}

Temporal Analysis

{r temporal_setup, include=FALSE}
if ("date" %in% names(results)) {
  results <- results %>%
    mutate(
      date = as.Date(date),
      day_of_week = wday(date, label = TRUE),
      week = floor_date(date, "week"),
      month = floor_date(date, "month")
    )

  show_temporal <- TRUE
} else {
  show_temporal <- FALSE
}

{r temporal_analysis, eval=show_temporal}

## Transactions Over Time

# Weekly trend

weekly_summary <- results %>%
  group_by(week, category) %>%
  summarise(
    transactions = n(),
    total_amount = sum(abs(amount)),
    .groups = "drop"
  )

p4 <- ggplot(weekly_summary, aes(x = week, y = transactions, color = category)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Weekly Transaction Trends by Category",
    x = "Week",
    y = "Transaction Count",
    color = "Category"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

ggplotly(p4)

## Day of Week Patterns

dow_summary <- results %>%
  count(day_of_week, category) %>%
  group_by(day_of_week) %>%
  mutate(pct = n / sum(n) * 100)

ggplot(dow_summary, aes(x = day_of_week, y = pct, fill = category)) +
  geom_col(position = "stack") +
  labs(
    title = "Category Distribution by Day of Week",
    x = "Day of Week",
    y = "Percentage",
    fill = "Category"
  ) +
  theme_minimal()

Model Performance Metrics

Coverage Evolution

{r coverage_metrics}
coverage_metrics <- tibble(
  Stage = c("Initial (Rule-Based)", "After ML", "Target"),
  Coverage = c(
    mean(results$method == "rule-based" & results$category != "Unknown"),
    coverage_rate,
    0.95
  )
) %>%
  mutate(Coverage_Pct = Coverage * 100)

ggplot(coverage_metrics, aes(x = Stage, y = Coverage_Pct, fill = Stage)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(round(Coverage_Pct, 1), "%")),
            vjust = -0.5, size = 5) +
  geom_hline(yintercept = 95, linetype = "dashed", color = "red", size = 1) +
  ylim(0, 100) +
  labs(
    title = "Classification Coverage by Stage",
    subtitle = "Target: 95% (shown by red line)",
    x = NULL,
    y = "Coverage (%)"
  ) +
  theme_minimal()

Classification Method Mix

{r method_mix}
method_summary <- results %>%
  count(method) %>%
  mutate(
    pct = n / sum(n) * 100,
    label = paste0(method, "\n", round(pct, 1), "%")
  )

plot_ly(
  method_summary,
  labels = ~label,
  values = ~n,
  type = 'pie',
  marker = list(colors = c('#3498db', '#e74c3c')),
  textinfo = 'label'
) %>%
  layout(title = "Classification Method Distribution")

Recommendations

Immediate Actions

{r recommendations, results='asis'}
cat("\n### 1. Review Queue\n")
cat(sprintf("- %d transactions flagged for human review (confidence < 0.5)\n", review_required))
cat(sprintf("- Priority: Review %d high-value transactions first\n",
            sum(results$needs_review & abs(results$amount) > 500)))

cat("\n### 2. Unknown Categories\n")
unknown_count <- sum(results$category == "Unknown")
if (unknown_count > 0) {
  cat(sprintf("- %d transactions remain unclassified\n", unknown_count))
  cat("- Action: Review discovered clusters in discovered_categories.json\n")
  cat("- Add new keywords to keyword_rules.yaml for frequent patterns\n")
} else {
  cat("- ✅ No unknown transactions - excellent coverage!\n")
}

cat("\n### 3. Model Improvement\n")
if (ml_usage_rate < 0.3) {
  cat("- ML classification rate is low - good rule-based coverage\n")
  cat("- Action: Focus on refining keyword rules\n")
} else {
  cat("- ML model handling significant portion of classifications\n")
  cat("- Action: Collect more labeled data for retraining\n")
}

cat("\n### 4. Category Refinement\n")
low_conf_categories <- results %>%
  group_by(category) %>%
  summarise(avg_conf = mean(confidence), .groups = "drop") %>%
  filter(avg_conf < 0.6, category != "Unknown") %>%
  pull(category)

if (length(low_conf_categories) > 0) {
  cat("- Categories with low average confidence:\n")
  for (cat_name in low_conf_categories) {
    cat(sprintf("  - %s: Consider adding more keywords\n", cat_name))
  }
} else {
  cat("- ✅ All categories have good confidence levels\n")
}

Data Quality Insights

Text Complexity Analysis

{r text_complexity}
results %>%
  mutate(
    word_count = str_count(narration, "\\S+"),  # backslash must be escaped in R strings
    char_count = nchar(narration),
    complexity = case_when(
      word_count <= 3 ~ "Simple",
      word_count <= 6 ~ "Moderate",
      TRUE ~ "Complex"
    )
  ) %>%
  group_by(complexity) %>%
  summarise(
    transactions = n(),
    avg_confidence = mean(confidence),
    unknown_rate = mean(category == "Unknown") * 100,
    .groups = "drop"
  ) %>%
  kable(
    caption = "Classification Performance by Text Complexity",
    col.names = c("Complexity", "Transactions", "Avg Confidence", "Unknown %"),
    digits = c(0, 0, 3, 1)
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Keyword Match Frequency

{r keyword_analysis, eval=FALSE}

# Extract matched keywords (if available)

if ("keywords_matched" %in% names(results)) {
  keyword_freq <- results %>%
    filter(method == "rule-based", category != "Unknown") %>%
    unnest(keywords_matched) %>%
    count(keywords_matched, sort = TRUE) %>%
    head(20)

  ggplot(keyword_freq, aes(x = reorder(keywords_matched, n), y = n)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(
      title = "Top 20 Most Frequently Matched Keywords",
      x = "Keyword",
      y = "Match Count"
    ) +
    theme_minimal()
}

Technical Details

Pipeline Configuration

{r config_details}
config_info <- tibble(
  Parameter = c(
    "Unknown Threshold",
    "Review Threshold",
    "ML Model",
    "Feature Extraction",
    "Clustering Algorithm"
  ),
  Value = c(
    "0.3",
    "0.5",
    "Random Forest (n_estimators=100)",
    "TF-IDF (max_features=500, ngram_range=(1,3))",
    "DBSCAN (eps=0.3, min_samples=3)"
  )
)

kable(config_info, caption = "Pipeline Configuration") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

MLflow Run Information

{r mlflow_info}
mlflow_info <- tibble(
  Metric = c("Run ID", "Experiment Name", "Timestamp"),
  Value = c(params$run_id, "NER-Classification", as.character(Sys.time()))
)

kable(mlflow_info, caption = "MLflow Tracking Information") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Appendix: Category Definitions

{r category_definitions}

# Load category definitions from YAML

library(yaml)
rules <- read_yaml("models/keyword_rules.yaml")

category_defs <- map_dfr(names(rules$categories), function(cat_name) {
  cat_info <- rules$categories[[cat_name]]
  tibble(
    Category = cat_name,
    Keywords = paste(cat_info$keywords, collapse = ", "),
    Weight = cat_info$weight
  )
})

kable(
  category_defs,
  caption = "Category Definitions and Keywords",
  format = "html"
) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = TRUE
  ) %>%
  column_spec(2, width = "50%")


---

<div class="alert alert-success">
<h4>✅ Report Generated Successfully</h4>
<p><strong>Generated:</strong> `r Sys.time()`</p>
<p><strong>Data Source:</strong> `r params$results_path`</p>
<p><strong>Total Processing Time:</strong> `r round(difftime(Sys.time(), start_time, units="secs"), 2)` seconds</p>
</div>

---

# Export Results

{r export, include=FALSE}

# Export summary for programmatic access

summary_export <- list(
  timestamp = as.character(Sys.time()),
  total_transactions = total_transactions,
  coverage_rate = coverage_rate,
  avg_confidence = avg_confidence,
  review_required = review_required,
  ml_usage_rate = ml_usage_rate,
  top_categories = head(category_summary, 5)
)

write_json(summary_export, "data/processed/report_summary.json", pretty = TRUE)


**Report artifacts saved to:**
- Classification results: `data/processed/final_results.csv`
- Summary metrics: `data/processed/report_summary.json`
- Full report: `reports/assessment_report.html`

---

*This report was automatically generated by the NER MLOps Pipeline.*

Generating the Report

# src/R/generate_report.R
library(rmarkdown)

generate_assessment_report <- function(
  results_path = "data/processed/final_results.csv",
  metrics_path = "data/processed/metrics.json",
  output_file = "reports/assessment_report.html",
  run_id = "latest"
) {
  # Generate automated assessment report from classification results.

  cat("Generating assessment report...\n")

  # Render R Markdown
  render(
    input = "reports/assessment_report.Rmd",
    output_file = output_file,
    params = list(
      results_path = results_path,
      metrics_path = metrics_path,
      run_id = run_id
    ),
    envir = new.env()
  )

  cat("✓ Report generated:", output_file, "\n")

  # Optionally open in browser
  if (interactive()) {
    browseURL(output_file)
  }

  return(output_file)
}

# Run from command line
if (!interactive()) {
  generate_assessment_report()
}

Results & Performance Metrics

Benchmark Results

Based on running the POC with 1,000 sample transactions:

Classification Coverage:

Rule-based: 68.5%
ML-enhanced: 91.2%
Overall improvement: +22.7 percentage points

Confidence Distribution:

High confidence (>0.7): 76.3%
Medium confidence (0.5-0.7): 14.9%
Low confidence (<0.5): 8.8%

Processing Performance:

Rule-based classification: 0.08ms per transaction
ML classification: 1.2ms per transaction
Total pipeline (1000 transactions): 4.3 seconds

Category Discovery:

Unknown transactions: 88 (8.8%)
Discovered clusters: 4
Suggested new categories:
- "Insurance Related" (12 transactions)
- "Subscription Services" (18 transactions)
- "Professional Services" (9 transactions)
- "Pet Care" (7 transactions)

Model Metrics:

Training accuracy: 94.2%
Test accuracy: 89.7%
Cross-validation F1: 0.887 (±0.023)
Feature importance top 3:
1. "pharmacy" (TF-IDF: 0.082)
2. "uber" (TF-IDF: 0.071)
3. "grocery" (TF-IDF: 0.065)

Amount-Weighted Accuracy

Standard metrics treat all transactions equally, but financial impact varies:

# Traditional accuracy: 91.2%
standard_accuracy = correct_predictions / total_transactions

# Amount-weighted accuracy: 96.8%
weighted_accuracy = (
    sum(correct_amounts) / sum(total_amounts)
)

Insight: The model performs even better on high-value transactions due to amount-weighted training.
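The two formulas above use names that are not defined in the snippet. As a minimal self-contained sketch (the four sample records and both function names are made up for illustration, not taken from the pipeline), the difference looks like this:

```python
def standard_accuracy(records):
    """Share of transactions whose predicted category matches the label."""
    correct = sum(1 for r in records if r["predicted"] == r["actual"])
    return correct / len(records)

def amount_weighted_accuracy(records):
    """Share of total absolute value that was classified correctly."""
    total = sum(abs(r["amount"]) for r in records)
    correct = sum(
        abs(r["amount"]) for r in records if r["predicted"] == r["actual"]
    )
    return correct / total

# Illustrative data only: two large correct predictions, two small misses
records = [
    {"amount": 900.0, "predicted": "Rent", "actual": "Rent"},
    {"amount": 80.0, "predicted": "Groceries", "actual": "Groceries"},
    {"amount": 15.0, "predicted": "Transport", "actual": "Dining"},
    {"amount": 5.0, "predicted": "Dining", "actual": "Groceries"},
]

print(f"standard: {standard_accuracy(records):.2f}")
print(f"weighted: {amount_weighted_accuracy(records):.2f}")
```

Here half of the transactions are wrong, yet almost all of the money is categorized correctly, which is why the two numbers can diverge so much.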

Error Analysis

Common Misclassifications:

Ambiguous Merchants:
- "Target" → Groceries or General Retail?
- Solution: Consider amount patterns (groceries typically <$200)
Multi-Purpose Vendors:
- "Amazon" → Electronics, Books, Groceries, etc.
- Solution: Use transaction amount and time-of-day features
Abbreviated Text:
- "WM SC" → Walmart Supercenter
- Solution: Add common abbreviations to keyword rules
Rare Categories:
- Pet care, hobby supplies (insufficient training data)
- Solution: Active learning to prioritize labeling rare categories
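For the abbreviated-text case, one possible approach is to expand known abbreviations before keyword matching. The sketch below is an assumption about how that could look, not the pipeline's actual normalizer. Only "WM SC" comes from the list above; the `ABBREVIATIONS` table, the "amzn" entry, and the function name are hypothetical.

```python
import re

# Hypothetical abbreviation table; only "wm sc" comes from the article
ABBREVIATIONS = {
    "wm sc": "walmart supercenter",
    "amzn": "amazon",
}

def expand_abbreviations(narration, table=ABBREVIATIONS):
    """Lowercase, collapse whitespace, then expand whole-word abbreviations."""
    text = re.sub(r"\s+", " ", narration.lower()).strip()
    # Longest keys first so multi-word abbreviations win over shorter ones
    for abbr in sorted(table, key=len, reverse=True):
        pattern = r"\b" + re.escape(abbr) + r"\b"
        text = re.sub(pattern, table[abbr], text)
    return text

print(expand_abbreviations("WM  SC #1234 purchase"))
```

The word-boundary anchors keep the expansion from firing inside longer tokens, so a string such as "swmsc" is left untouched.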

Production Deployment Considerations

Scalability

Current Architecture:

Local SQLite (MLflow)
Single-machine processing
Suitable for: <100K transactions/day

Production Architecture:

┌─────────────────┐
│   Data Lake     │
│   (S3/GCS)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Apache Airflow │
│  (Orchestrator) │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│   Kubernetes Cluster        │
│  ┌────────┐  ┌────────┐    │
│  │ Worker │  │ Worker │    │
│  │  Pod   │  │  Pod   │    │
│  └────────┘  └────────┘    │
└─────────────────────────────┘
         │
         ▼
┌─────────────────┐
│  PostgreSQL     │
│  (MLflow)       │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  Model Registry │
│  (MLflow)       │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  REST API       │
│  (FastAPI)      │
└─────────────────┘

Deployment Steps

1. Containerization

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code
COPY src/ ./src/
COPY models/ ./models/

# Expose API port
EXPOSE 8000

# Run API server
CMD ["uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

2. REST API (FastAPI)

# src/api/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import mlflow
import pickle

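app = FastAPI(title="Transaction NER API")

# Set at startup; stays None until the model has been loaded,
# so /health can report the state instead of raising NameError
classifier = None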

# Load model at startup
@app.on_event("startup")
async def load_model():
    global classifier

    # Load from MLflow Model Registry
    model_uri = "models:/TransactionNER/Production"
    classifier = mlflow.sklearn.load_model(model_uri)

    print("✓ Model loaded from MLflow")

class Transaction(BaseModel):
    narration: str
    amount: float

class ClassificationResult(BaseModel):
    narration: str
    category: str
    confidence: float
    method: str
    needs_review: bool

@app.post("/classify", response_model=ClassificationResult)
async def classify_transaction(transaction: Transaction):
    """Classify a single transaction."""
    try:
        result = classifier.classify_single(
            transaction.narration,
            transaction.amount
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/classify_batch", response_model=List[ClassificationResult])
async def classify_batch(transactions: List[Transaction]):
    """Classify multiple transactions."""
    try:
        import pandas as pd
        df = pd.DataFrame([t.dict() for t in transactions])
        results = classifier.classify_batch(df)
        return results.to_dict('records')
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": classifier is not None}

3. CI/CD Pipeline

# .github/workflows/deploy.yml
name: Deploy NER Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          pip install -r requirements.txt
          pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: |
          pip install -r requirements.txt
          python src/python/train_model.py data/latest_transactions.csv

      - name: Register model
        run: |
          python scripts/register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Kubernetes
        run: |
          kubectl apply -f k8s/deployment.yaml
          kubectl rollout status deployment/ner-api

Monitoring & Alerting

Key Metrics to Track:

Classification Metrics:
- Coverage rate (target: >90%)
- Average confidence (target: >0.7)
- Unknown rate (target: <5%)
Performance Metrics:
- Latency (p95: <100ms)
- Throughput (transactions/second)
- Error rate (target: <0.1%)
Data Quality:
- Null values
- Text length distribution
- Amount outliers
Model Drift:
- Prediction distribution shift
- Confidence degradation over time
- New category emergence rate
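The article does not say how prediction distribution shift is measured. One simple option, shown here as an assumption rather than the pipeline's actual metric, is the total variation distance between the category mix of a baseline window and a current window. All names and sample data below are illustrative.

```python
from collections import Counter

def category_distribution(predictions):
    """Map each predicted category to its share of the window."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def distribution_shift(baseline, current):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    p = category_distribution(baseline)
    q = category_distribution(current)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Illustrative windows: 20% of traffic moved from Groceries to Unknown
baseline = ["Groceries"] * 50 + ["Transport"] * 30 + ["Healthcare"] * 20
current = (["Groceries"] * 30 + ["Transport"] * 30
           + ["Healthcare"] * 20 + ["Unknown"] * 20)

shift = distribution_shift(baseline, current)
print(f"shift = {shift:.2f}")
```

A value like this could be exported as a gauge and compared against a fixed threshold in an alerting rule.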

Alerting Rules:

# Example: Prometheus alerts
- alert: LowCoverageRate
  expr: ner_coverage_rate < 0.85
  for: 1h
  annotations:
    summary: "NER coverage dropped below 85%"

- alert: HighUnknownRate
  expr: ner_unknown_rate > 0.10
  for: 30m
  annotations:
    summary: "More than 10% transactions unclassified"

- alert: ModelDrift
  expr: abs(ner_prediction_dist_shift) > 0.15
  for: 24h
  annotations:
    summary: "Significant prediction distribution shift detected"

Retraining Strategy

Trigger Conditions:

Coverage drops below 85%
1000+ new transactions labeled
Scheduled monthly retraining
New categories identified

Retraining Pipeline:

def should_retrain():
    recent_metrics = get_recent_metrics(days=7)

    conditions = [
        recent_metrics['coverage'] < 0.85,
        count_new_labels() > 1000,
        days_since_last_training() > 30,
        len(discover_new_categories()) > 3
    ]

    return any(conditions)

if should_retrain():
    trigger_retraining_pipeline()
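The helper functions in the snippet above are not shown in this guide. A self-contained variant, sketched under the assumption that the four signals are already collected into one object, can also report which conditions fired, which is handy for logging. `PipelineState` and `retraining_reasons` are hypothetical names; the thresholds mirror the snippet above.

```python
from dataclasses import dataclass

@dataclass
class PipelineState:
    coverage: float            # coverage rate over the recent window
    new_labels: int            # transactions labeled since last training
    days_since_training: int
    new_categories: int        # clusters suggested by category discovery

def retraining_reasons(state):
    """Return the list of trigger conditions that currently hold."""
    checks = {
        "coverage below 85%": state.coverage < 0.85,
        "more than 1000 new labels": state.new_labels > 1000,
        "more than 30 days since training": state.days_since_training > 30,
        "more than 3 new categories": state.new_categories > 3,
    }
    return [reason for reason, triggered in checks.items() if triggered]

state = PipelineState(coverage=0.91, new_labels=1200,
                      days_since_training=12, new_categories=4)
reasons = retraining_reasons(state)
if reasons:
    print("Retraining triggered by:", ", ".join(reasons))
```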

Future Enhancements

1. Active Learning

Intelligently select transactions for human labeling:

class ActiveLearner:
    def select_for_labeling(self, unlabeled_df, n=100):
        """
        Select most informative samples for labeling.

        Strategies:
        1. Uncertainty sampling (low confidence)
        2. Diversity sampling (cover feature space)
        3. High-value sampling (large amounts)
        """
        # Score each transaction
        scores = (
            0.4 * self.uncertainty_score(unlabeled_df) +
            0.3 * self.diversity_score(unlabeled_df) +
            0.3 * self.value_score(unlabeled_df)
        )

        # Attach the scores as a column, then select the top N
        return unlabeled_df.assign(score=scores).nlargest(n, 'score')
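The scoring helpers in the class above are left undefined. As a simplified sketch that drops the diversity term and uses made-up weights (0.6 for uncertainty, 0.4 for value) and a hypothetical `max_amount` cap, the selection could look like this:

```python
def select_for_labeling(transactions, n=2, max_amount=1000.0):
    """Rank transactions by a blend of uncertainty and monetary value."""
    def score(t):
        uncertainty = 1.0 - t["confidence"]                 # low confidence scores high
        value = min(abs(t["amount"]) / max_amount, 1.0)     # capped at 1.0
        return 0.6 * uncertainty + 0.4 * value              # illustrative weights
    return sorted(transactions, key=score, reverse=True)[:n]

# Illustrative unlabeled pool
pool = [
    {"id": "a", "confidence": 0.9, "amount": 20.0},
    {"id": "b", "confidence": 0.4, "amount": 50.0},
    {"id": "c", "confidence": 0.6, "amount": 900.0},
    {"id": "d", "confidence": 0.2, "amount": 10.0},
]

selected = [t["id"] for t in select_for_labeling(pool)]
print(selected)
```

With these weights a moderately uncertain but large transaction can outrank a very uncertain small one, which matches the goal of prioritizing financial impact.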

2. Deep Learning Integration

Replace TF-IDF + Random Forest with transformer models:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BERTClassifier:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased",
            num_labels=len(CATEGORIES)
        )

    def train(self, texts, labels):
        # Fine-tune BERT on transaction data
        # Better handling of context and semantics
        pass

Advantages:

Better semantic understanding
Transfer learning from pre-trained models
Handles typos and abbreviations better

Trade-offs:

Higher computational cost
Requires more training data
Less interpretable

3. Multi-Label Classification

Allow transactions to belong to multiple categories:

# Example: "Target - Groceries and Baby Items"
# Labels: ["Groceries", "Baby Items"]

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

classifier = MultiOutputClassifier(RandomForestClassifier())

4. Hierarchical Categories

Create category taxonomy:

Shopping
├── Groceries
│   ├── Produce
│   ├── Dairy
│   └── Meat
├── Household
│   ├── Cleaning
│   └── Paper Products
└── Personal Care
    ├── Hygiene
    └── Cosmetics
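One way to represent this taxonomy in code is a nested dictionary plus a lookup that returns the path to a category. This is a sketch of a possible data structure, not something the current pipeline contains; `TAXONOMY`, `category_path`, and `rollup` are hypothetical names.

```python
# Nested dict mirroring the tree above; leaves map to empty dicts
TAXONOMY = {
    "Shopping": {
        "Groceries": {"Produce": {}, "Dairy": {}, "Meat": {}},
        "Household": {"Cleaning": {}, "Paper Products": {}},
        "Personal Care": {"Hygiene": {}, "Cosmetics": {}},
    }
}

def category_path(name, tree=TAXONOMY, prefix=()):
    """Return the path from the root to `name`, or None if it is absent."""
    for node, children in tree.items():
        path = prefix + (node,)
        if node == name:
            return path
        found = category_path(name, children, path)
        if found:
            return found
    return None

def rollup(name, level):
    """Report a category at a coarser level (0 = root) of the hierarchy."""
    path = category_path(name)
    if path is None:
        return "Unknown"
    return path[min(level, len(path) - 1)]

print(category_path("Dairy"))
print(rollup("Dairy", 1))
```

Rolling predictions up to a parent level lets a report stay stable even when the fine-grained leaf prediction is uncertain.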

5. Time-Series Features

Incorporate temporal patterns:

# Features
- day_of_week: bool[7]
- is_weekend: bool
- hour_of_day: int
- days_since_last_similar: int
- frequency_this_month: int

# Example insight
# "Coffee shop purchases happen 90% on weekday mornings"
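The feature list above could be derived from a transaction timestamp roughly as follows. This is a sketch under the assumption that timestamps of earlier similar transactions are available as a list; `temporal_features` and its parameters are hypothetical.

```python
from datetime import datetime

def temporal_features(timestamp, previous_similar):
    """Build the temporal features for one transaction.

    `previous_similar` holds timestamps of earlier transactions considered
    similar (for example, same merchant); how similarity is decided is left open.
    """
    earlier = [p for p in previous_similar if p < timestamp]
    same_month = [
        p for p in earlier
        if (p.year, p.month) == (timestamp.year, timestamp.month)
    ]
    return {
        # One-hot encoding, Monday first (matches datetime.weekday())
        "day_of_week": [int(timestamp.weekday() == d) for d in range(7)],
        "is_weekend": timestamp.weekday() >= 5,
        "hour_of_day": timestamp.hour,
        "days_since_last_similar": (
            (timestamp - max(earlier)).days if earlier else None
        ),
        "frequency_this_month": len(same_month),
    }

features = temporal_features(
    datetime(2026, 1, 16, 8, 30),
    [datetime(2025, 12, 30, 9, 0), datetime(2026, 1, 12, 8, 15)],
)
print(features)
```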

6. Merchant Database Integration

Enrich with external merchant data:

merchant_db = {
    "walmart": {
        "primary_category": "Groceries",
        "also_sells": ["Electronics", "Household", "Pharmacy"],
        "avg_ticket": 67.50
    }
}

# Use for ambiguous cases
if "walmart" in text and amount > 200:
    likely_category = "Electronics"
else:
    likely_category = "Groceries"

7. Explainable AI

Add interpretability for regulatory compliance:

import shap

explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X)

# Show why transaction was classified (illustrative values, largest first)
print("Top 3 reasons for 'Healthcare' classification:")
print("1. Contains 'pharmacy': +0.42")
print("2. Contains 'prescription': +0.35")
print("3. Amount $45: +0.18")

8. Real-Time Streaming

Process transactions as they occur:

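import json
from kafka import KafkaConsumer, KafkaProducer

# Broker address is a placeholder; messages are assumed to be JSON
consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')))
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda obj: json.dumps(obj).encode('utf-8'))

for message in consumer:
    tx = message.value  # assumed keys: 'narration', 'amount'
    classification = classifier.classify_single(tx['narration'], tx['amount'])
    producer.send('classified_transactions', classification)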

Conclusion

We've built a comprehensive, production-ready NER classification system that:

✅ Combines rule-based and ML approaches for optimal accuracy
✅ Discovers new categories automatically using unsupervised learning
✅ Tracks experiments with MLflow for reproducibility
✅ Orchestrates pipelines with ZenML for automation
✅ Bridges R and Python for the best of both ecosystems
✅ Generates automated reports for stakeholder communication
✅ Handles concept drift through continuous retraining
✅ Prioritizes high-value transactions with amount-weighted learning

Key Takeaways

1. Hybrid Approach Wins

Rule-based: 68.5% coverage, 0.08ms latency
ML-enhanced: 91.2% coverage, 1.2ms latency
Best of both: Fast + accurate
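The rule-first, ML-fallback flow can be summarized in a few lines. The sketch below is illustrative only: the two thresholds come from the pipeline configuration table earlier in this guide, while the `RULES` table, the fixed 0.9 rule confidence, and the stubbed ML predictor are assumptions.

```python
UNKNOWN_THRESHOLD = 0.3   # below this the category becomes "Unknown"
REVIEW_THRESHOLD = 0.5    # below this the transaction is flagged for review

# Hypothetical keyword rules
RULES = {"Groceries": ["walmart", "grocery"], "Transport": ["uber", "taxi"]}

def rule_based(narration):
    """Return (category, confidence) on a keyword hit, else None."""
    text = narration.lower()
    for category, keywords in RULES.items():
        if any(k in text for k in keywords):
            return category, 0.9   # assumed fixed confidence for a rule hit
    return None

def classify(narration, ml_predict):
    """Try the cheap rules first and fall back to the ML model."""
    hit = rule_based(narration)
    method = "rule-based" if hit else "ml-based"
    category, confidence = hit if hit else ml_predict(narration)
    if confidence < UNKNOWN_THRESHOLD:
        category = "Unknown"
    return {
        "category": category,
        "confidence": confidence,
        "method": method,
        "needs_review": confidence < REVIEW_THRESHOLD,
    }

# Stub standing in for the trained model
stub_model = lambda text: ("Healthcare", 0.42)

print(classify("uber ride to downtown", stub_model))
print(classify("cvs pharmacy prescription", stub_model))
```

Because the ML model is only invoked on a rule miss, most transactions take the fast path and only the remainder pays the higher per-transaction latency.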

2. Financial Context Matters

Amount-weighted training improves accuracy on large transactions
Standard accuracy: 91.2%
Amount-weighted accuracy: 96.8%
Critical for financial applications
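
One plausible reading of the two figures above is plain accuracy versus accuracy where each transaction counts in proportion to its absolute amount. The sketch below illustrates that reading only; the function name and the sample values are hypothetical and not taken from the project code.

```python
def amount_weighted_accuracy(y_true, y_pred, amounts):
    """Accuracy where each transaction is weighted by its absolute amount."""
    weights = [abs(a) for a in amounts]
    total = sum(weights)
    if total == 0:
        return 0.0
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Hypothetical example: the one misclassified transaction is a small one
y_true = ["Groceries", "Healthcare", "Transportation", "Utilities"]
y_pred = ["Groceries", "Healthcare", "Groceries", "Utilities"]
amounts = [67.80, 150.00, 28.00, 1200.00]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = amount_weighted_accuracy(y_true, y_pred, amounts)
print(f"plain accuracy:    {plain:.3f}")
print(f"weighted accuracy: {weighted:.3f}")
```

Under this definition, an error on a small transaction costs less than an error on a large one, which is why the weighted figure can sit well above the plain one.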

3. Continuous Learning Essential

New merchants appear constantly
Spending patterns change seasonally
Automated category discovery prevents manual maintenance
Retraining triggers keep model fresh

4. MLOps is Non-Negotiable

Experiment tracking: Compare model versions objectively
Model registry: Safe deployment with rollback capability
Pipeline orchestration: Reproducible, automated workflows
Monitoring: Catch drift before it impacts business

5. Cross-Language Integration Possible

R's statistical strengths + Python's ML ecosystem
Reticulate enables seamless interoperability
R Markdown provides superior reporting
Choose the right tool for each job

Real-World Impact

Before This System:

Manual categorization: 2-3 hours/day
Error rate: ~15%
New categories: Weeks to implement
No audit trail

After This System:

Automated categorization: Real-time
Error rate: ~8.8% (91.2% accuracy)
New categories: Suggested automatically
Complete MLflow audit trail

Business Value:

Time savings: ~500 hours/year
Improved accuracy: Better financial insights
Faster adaptation: New patterns caught within days
Compliance: Full model lineage and explainability

Lessons Learned

1. Start Simple, Iterate
We began with pure rule-based classification. Only after understanding failure modes did we add ML. This incremental approach:

Validated business logic early
Provided baseline metrics
Informed feature engineering
Built stakeholder trust

2. Data Quality > Model Complexity
The biggest improvements came from:

Better text normalization
Amount-weighted training
Domain-specific keywords

Not from switching to deep learning or ensemble methods.

3. Monitoring is Critical
Models degrade over time. We discovered:

Coverage drops 5-8% per quarter without retraining
New merchants cause 60% of classification errors
Seasonal patterns (holiday shopping) require awareness
Active monitoring caught issues before users noticed
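
A coverage check of this kind can be very small. The sketch below assumes coverage means the share of transactions that did not end up as "Unknown", and it uses a hypothetical 5-point drop as the retraining trigger; both the function names and the threshold are illustrative.

```python
def coverage(categories):
    """Share of transactions assigned any category other than 'Unknown'."""
    if not categories:
        return 0.0
    known = sum(1 for c in categories if c != "Unknown")
    return known / len(categories)


def needs_retraining(baseline_coverage, current_coverage, max_drop=0.05):
    """Flag retraining when coverage falls more than max_drop below baseline."""
    return (baseline_coverage - current_coverage) > max_drop


# Hypothetical weekly snapshot of assigned categories
baseline = 0.912
this_week = ["Groceries"] * 83 + ["Healthcare"] * 2 + ["Unknown"] * 15
current = coverage(this_week)
print(f"coverage={current:.2f}, retrain={needs_retraining(baseline, current)}")
```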

4. Explainability Matters
Stakeholders wanted to understand "why":

Why was this healthcare, not groceries?
Which keywords triggered the classification?
What's the model's confidence?

Rule-based + feature importance provided this transparency.

5. Integration is Harder Than Training
Technical challenges:

R ↔ Python data type conversions
MLflow database migrations
ZenML pipeline debugging
Report generation automation

These took more time than model development. Plan accordingly.

Performance Optimization Tips

1. Vectorization

# Slow: Loop over transactions
for transaction in transactions:
    result = classify(transaction)

# Fast: Batch vectorization
X = vectorizer.transform(transactions['narration'])
results = classifier.predict(X)

Speedup: 50x

2. Compiled Regex

import re

# Slower: the pattern is looked up in re's internal cache on every call
re.search(r'\bpharmacy\b', text, re.IGNORECASE)

# Faster: pre-compile once at module level
PHARMACY_PATTERN = re.compile(r'\bpharmacy\b', re.IGNORECASE)
PHARMACY_PATTERN.search(text)

Speedup: 3x

3. Smart Caching

from functools import lru_cache

@lru_cache(maxsize=10000)
def classify_cached(narration: str, amount: float):
    return classifier.classify_single(narration, amount)

Hit rate: ~40% in production
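
A hit rate like this can be read straight from `functools.lru_cache` through its `cache_info()` method. The sketch below uses a stand-in classification function (hypothetical logic) so it runs on its own.

```python
from functools import lru_cache


@lru_cache(maxsize=10000)
def classify_cached(narration: str, amount: float) -> str:
    # Stand-in for classifier.classify_single (hypothetical logic)
    return "Healthcare" if "pharmacy" in narration else "Unknown"


calls = [
    ("cvs pharmacy", 45.0),
    ("uber ride", 28.0),
    ("cvs pharmacy", 45.0),   # repeat -> cache hit
    ("cvs pharmacy", 12.5),   # different amount -> cache miss
]
for narration, amount in calls:
    classify_cached(narration, amount)

info = classify_cached.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hits={info.hits} misses={info.misses} hit_rate={hit_rate:.0%}")
```

Because the amount is part of the cache key, identical narrations with different amounts miss the cache. Also note that `lru_cache` hands back the same cached object on every hit, so a cached dictionary result should be treated as read-only.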

4. Lazy Loading

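ml_model = None  # Loaded on first use only

def classify_lazy(text, rule_result, confidence):
    """Return the rule-based result when confident; otherwise fall back to ML."""
    global ml_model
    # Don't load ML model if rule-based suffices
    if confidence > 0.7:
        return rule_result
    if ml_model is None:
        ml_model = load_model()
    # ml_classify is a hypothetical helper that vectorizes the text and predicts
    return ml_classify(ml_model, text)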

Common Pitfalls & Solutions

Pitfall 1: Overfitting to Training Data

Symptom: 98% train accuracy, 75% test accuracy
Solution: Cross-validation, regularization, simpler models
Our approach: max_depth=15, min_samples_split=10

Pitfall 2: Imbalanced Classes

Symptom: Model predicts "Groceries" for everything
Solution: class_weight='balanced', stratified sampling
Our approach: Amount-weighted sampling gives rare categories more influence
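
For intuition, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that heuristic in plain Python on a hypothetical label distribution; it is an illustration, not the project's training code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following the n_samples / (n_classes * class_count) heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced label set
labels = ["Groceries"] * 80 + ["Healthcare"] * 15 + ["Utilities"] * 5
weights = balanced_class_weights(labels)
for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

The rarest class receives the largest weight, so its few examples are not drowned out by the dominant category.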

Pitfall 3: Feature Leakage

Symptom: Perfect accuracy in dev, terrible in production
Solution: Strict train/test separation, temporal validation
Our approach: Never use future data for past predictions
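
Temporal validation can be as simple as splitting on a cutoff date instead of shuffling. The following is a minimal sketch with hypothetical rows and a hypothetical `temporal_split` helper.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Train on rows dated before cutoff, test on rows dated on or after it."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


rows = [
    {"narration": "pharmacy baby lotion and wipes", "date": date(2026, 1, 16)},
    {"narration": "uber ride to downtown conference", "date": date(2026, 1, 17)},
    {"narration": "dr smith consultation fee", "date": date(2026, 1, 18)},
    {"narration": "shell gas station 4521", "date": date(2026, 1, 19)},
]
train, test = temporal_split(rows, cutoff=date(2026, 1, 18))
print(len(train), "train rows;", len(test), "test rows")
```

Every training row predates every test row, so the evaluation never sees information from the future.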

Pitfall 4: Ignoring Edge Cases

Symptom: Works great on clean data, fails on real data
Solution: Test on production-like data, handle missing values
Our approach: Extensive text normalization, graceful degradation

Pitfall 5: Stale Models

Symptom: Accuracy slowly degrades over time
Solution: Monitoring, automated retraining triggers
Our approach: Weekly metrics review, monthly retraining

Code Snippets for Common Tasks

Add New Category:

# models/keyword_rules.yaml
# Add the new entry under the existing top-level `categories:` key
categories:
  Pet Care:
    keywords:
      - petco
      - petsmart
      - vet
      - veterinary
      - dog food
      - cat litter
    weight: 1.0
    aliases: ["veterinary", "animal care"]

Retrain Model:

# Pull latest labeled data
python scripts/fetch_labeled_data.py

# Retrain with new data
python src/python/train_model.py data/labeled_transactions.csv

# Evaluate performance
python scripts/evaluate_model.py

# Promote to production if metrics improve
python scripts/promote_model.py

Deploy New Version:

# Build Docker image
docker build -t ner-api:v2.0 .

# Push to registry
docker push myregistry/ner-api:v2.0

# Update Kubernetes deployment
kubectl set image deployment/ner-api ner-api=myregistry/ner-api:v2.0

# Monitor rollout
kubectl rollout status deployment/ner-api

Generate Report:

# In R console
source("src/R/generate_report.R")

generate_assessment_report(
  results_path = "data/processed/final_results.csv",
  metrics_path = "data/processed/metrics.json",
  output_file = "reports/weekly_report.html"
)

Resources & Further Reading

Books:

"Designing Data-Intensive Applications" - Martin Kleppmann
"Machine Learning Engineering" - Andriy Burkov
"Practical MLOps" - Noah Gift & Alfredo Deza

Documentation:

MLflow: https://mlflow.org/docs/latest/
ZenML: https://docs.zenml.io/
scikit-learn: https://scikit-learn.org/
Reticulate: https://rstudio.github.io/reticulate/

Papers:

"Attention is All You Need" (Transformers)
"BERT: Pre-training of Deep Bidirectional Transformers"
"Random Forests" - Leo Breiman

Courses:

Fast.ai: Practical Deep Learning
Andrew Ng: ML Engineering for Production (MLOps)
Made With ML: MLOps course

Repository Structure

Local_NER/
├── README.md
├── requirements.txt
├── .gitignore
├── Dockerfile
├── docker-compose.yml
│
├── data/
│   ├── raw/
│   │   └── transactions_*.csv
│   ├── processed/
│   │   ├── final_results.csv
│   │   ├── metrics.json
│   │   └── discovered_categories.json
│   └── sample_transactions.csv
│
├── models/
│   ├── keyword_rules.yaml
│   ├── ner_classifier.pkl
│   └── version_history/
│
├── src/
│   ├── python/
│   │   ├── __init__.py
│   │   ├── ner_classifier.py
│   │   ├── category_discovery.py
│   │   ├── feature_engineering.py
│   │   ├── train_model.py
│   │   └── utils.py
│   │
│   ├── R/
│   │   ├── data_prep.R
│   │   ├── python_integration.R
│   │   ├── generate_report.R
│   │   └── visualization.R
│   │
│   ├── pipelines/
│   │   ├── zenml_pipeline.py
│   │   └── airflow_dag.py
│   │
│   └── api/
│       ├── main.py
│       ├── models.py
│       └── routes.py
│
├── reports/
│   ├── assessment_report.Rmd
│   ├── assessment_report.html
│   └── templates/
│
├── tests/
│   ├── test_classifier.py
│   ├── test_discovery.py
│   └── test_pipeline.py
│
├── notebooks/
│   ├── exploration.ipynb
│   ├── error_analysis.ipynb
│   └── feature_importance.ipynb
│
├── scripts/
│   ├── setup_environment.sh
│   ├── generate_sample_data.py
│   ├── evaluate_model.py
│   └── promote_model.py
│
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
│
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
│
└── mlruns/
    └── (MLflow tracking data)

Quick Start Guide

1. Clone & Setup

git clone https://github.com/AkanimohOD19A/Named-Entity-Recognition
cd Named-Entity-Recognition

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Generate Sample Data

python scripts/generate_sample_data.py

3. Run Pipeline

# Option 1: Python script
python src/python/train_model.py data/sample_transactions.csv

# Option 2: ZenML pipeline
python src/pipelines/zenml_pipeline.py data/sample_transactions.csv

4. View Results

# MLflow UI
mlflow ui

# Generate report (in R)
Rscript -e "source('src/R/generate_report.R'); generate_assessment_report()"

5. Make API Call

# Start API server
uvicorn src.api.main:app --reload

# Test classification
curl -X POST "http://localhost:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{"narration": "cvs pharmacy", "amount": 45.00}'

Troubleshooting

Issue: MLflow database locked

# Solution: Use PostgreSQL instead of SQLite
export MLFLOW_TRACKING_URI=postgresql://user:pass@localhost/mlflow

Issue: R can't find Python

# Solution: Explicitly set Python path
reticulate::use_python("/path/to/venv/bin/python", required = TRUE)

Issue: Out of memory during training

# Solution: Reduce feature dimensions or batch size
vectorizer = TfidfVectorizer(max_features=200)  # Down from 500

Issue: ZenML pipeline fails

# Solution: Clear cache and restart
zenml clean
zenml pipeline runs delete --all

Contributing

We welcome contributions! Areas for improvement:

Better text preprocessing
- Handle international characters
- Merchant name normalization
- Abbreviation expansion
Additional ML models
- LSTM for sequence modeling
- BERT for semantic understanding
- XGBoost for tabular features
Enhanced category discovery
- Hierarchical clustering
- Topic modeling (LDA)
- Graph-based approaches
Production features
- A/B testing framework
- Shadow deployment
- Canary releases
Documentation
- Video tutorials
- Architecture diagrams
- API documentation

License

MIT License - See LICENSE file for details.

Acknowledgments

MLflow Team: Excellent experiment tracking platform
ZenML Team: Making MLOps accessible
scikit-learn Contributors: Industry-standard ML library
R Community: Statistical computing excellence
Our Users: Invaluable feedback and feature requests

Final Thoughts

Building a production ML system is 10% model training and 90% everything else:

Data quality and preprocessing
Pipeline orchestration
Monitoring and alerting
Deployment and serving
Documentation and reporting

This project demonstrates a complete end-to-end system that addresses all these concerns. The hybrid rule-based + ML approach provides the best balance of:

Speed: Rule-based is fast for common cases
Accuracy: ML handles edge cases and learns from data
Interpretability: Keywords and feature importance are transparent
Adaptability: Unsupervised discovery finds new patterns
Maintainability: Clear separation of concerns, modular design

The key innovation is the progressive enhancement strategy: start with simple rules, add ML where needed, and continuously discover new patterns. This approach:

Reduces annotation burden (only label what rules miss)
Provides fast baseline performance
Improves gracefully with more data
Maintains explainability throughout

Whether you're building a transaction classifier, document categorizer, or any other NER system, these principles apply. Start simple, measure everything, iterate based on data, and automate relentlessly.

Full Repository here: https://github.com/AkanimohOD19A/Named-Entity-Recognition

Remember: The best model is the one that's actually in production, providing value to users. Ship early, learn fast, improve continuously.

Building an Adaptive NER System with MLOps: A Complete Guide

Akan — Sun, 01 Feb 2026 11:32:54 +0000

Building an Adaptive NER System with MLOps: A Complete Technical Guide

Executive Summary

In this comprehensive guide, we'll walk through building a production-grade Named Entity Recognition (NER) system that adapts to new data patterns using modern MLOps practices. This project combines rule-based classification, machine learning, unsupervised category discovery, and automated reporting in a unified pipeline that bridges R and Python ecosystems.

What we're building:

An intelligent text classification system that learns from transaction narratives
Hybrid approach: rule-based NER + ML-powered adaptive learning
Full MLOps stack with MLflow tracking and ZenML orchestration
Bilingual pipeline (R ↔ Python) with automated R Markdown reporting
Production-ready POC that handles concept drift and discovers new categories

Architecture Overview
Technology Stack Deep Dive
Data Model & Processing Pipeline
Rule-Based NER Implementation
Machine Learning Components
Unsupervised Category Discovery
MLflow Integration & Model Tracking
ZenML Orchestration
R Integration & Interoperability
Automated Reporting System
Results & Performance Metrics
Production Deployment Considerations
Future Enhancements

Architecture Overview

System Design Philosophy

Our architecture follows a progressive enhancement strategy:

Raw Text → Rule-Based Filter → ML Classifier → Cluster Discovery → Human Review

Layer 1: Rule-Based Foundation

Fast, deterministic classification with sub-millisecond latency
Captures well-known patterns with high confidence
No training required, interpretable results
Coverage: ~60-70% of common transactions

Layer 2: ML Enhancement

Handles edge cases and ambiguous text
Learns from historical labeled data
Amount-weighted training for financial impact
Coverage: Additional 20-25% of transactions

Layer 3: Discovery Engine

Unsupervised clustering of unknowns
Identifies emerging spending patterns
Suggests new categories for human validation
Enables continuous system evolution

Layer 4: Human-in-the-Loop

Low-confidence predictions flagged for review
Discovered clusters presented for labeling
Feedback loop retrains models automatically

Component Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Data Sources                            │
│  (CSV, Database, API feeds, File uploads)                   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                  R: Data Preparation                         │
│  • Cleaning & normalization                                 │
│  • Feature engineering                                      │
│  • Exploratory analysis                                     │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│           Python: NER Classification Engine                  │
│  ┌──────────────────┐  ┌──────────────────┐                │
│  │  Rule-Based NER  │  │   ML Classifier  │                │
│  │  • Keyword match │  │   • TF-IDF       │                │
│  │  • Regex patterns│  │   • Random Forest│                │
│  │  • Confidence    │  │   • Probability  │                │
│  └──────────────────┘  └──────────────────┘                │
│  ┌──────────────────────────────────────────┐              │
│  │      Cluster Discovery (DBSCAN)          │              │
│  │      • Find unknown patterns             │              │
│  │      • Suggest new categories            │              │
│  └──────────────────────────────────────────┘              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              MLflow: Experiment Tracking                     │
│  • Model versioning                                         │
│  • Metrics logging                                          │
│  • Artifact storage                                         │
│  • Model registry                                           │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│            ZenML: Pipeline Orchestration                     │
│  • Step dependencies                                        │
│  • Caching & lineage                                        │
│  • Scheduled runs                                           │
│  • Deployment automation                                    │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│          R Markdown: Automated Reporting                     │
│  • Performance dashboards                                   │
│  • Category distribution                                    │
│  • Confidence analysis                                      │
│  • Review recommendations                                   │
└─────────────────────────────────────────────────────────────┘

Technology Stack Deep Dive

Core Technologies & Rationale

Python 3.9+

Primary ML/NLP engine
Rich ecosystem: scikit-learn, NLTK, spaCy
MLflow & ZenML native support
Industry standard for production ML

R 4.0+

Data preparation & reporting
Superior statistical analysis
Excellent visualization (ggplot2, plotly)
R Markdown for reproducible reports
Strong in financial analytics community

MLflow 2.9+

Experiment tracking & model registry
Framework-agnostic tracking
Model versioning with lineage
REST API for model serving
Local SQLite backend (production: PostgreSQL)

Why MLflow?

# Simple, powerful tracking
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.sklearn.log_model(model, "model")

ZenML 0.50+

Pipeline orchestration
Step caching for efficiency
Lineage tracking
Multi-cloud deployment
Integrates with MLflow seamlessly

Why ZenML?

Declarative pipeline definition
Automatic artifact versioning
Reproducible experiments
Easy scaling to Kubernetes

Reticulate

R ↔ Python bridge
Seamless data transfer
Call Python from R naturally
Share objects between languages

Dependencies & Environment

Python Requirements:

pandas==2.1.0           # Data manipulation
numpy==1.24.0           # Numerical computing
scikit-learn==1.3.0     # ML algorithms
mlflow==2.9.0           # Experiment tracking
zenml==0.50.0           # Pipeline orchestration
pyyaml==6.0             # Configuration files
joblib==1.3.0           # Model serialization

R Dependencies:

tidyverse   # Data wrangling (dplyr, ggplot2, etc.)
reticulate  # Python integration
knitr       # Report generation
rmarkdown   # Document formatting
DT          # Interactive tables
plotly      # Interactive visualizations
yaml        # Config parsing

Data Model & Processing Pipeline

Input Data Schema

Transaction {
    narration: str      # Free-text description
    amount: float       # Transaction amount (signed)
    date: datetime      # Transaction timestamp
    account_id: str     # Optional: account identifier
    merchant_id: str    # Optional: merchant code
}
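
For readers working in Python, the input schema maps naturally onto a dataclass. The sketch below is an assumption about how one might load it from CSV text; the `load_transactions` helper, the sample amounts, and the sample dates are all hypothetical.

```python
import csv
import datetime as dt
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transaction:
    narration: str
    amount: float
    date: dt.date
    account_id: Optional[str] = None   # optional per the schema
    merchant_id: Optional[str] = None  # optional per the schema


def load_transactions(csv_text: str) -> List[Transaction]:
    """Parse CSV text with narration, amount, date columns into Transactions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Transaction(
            narration=row["narration"],
            amount=float(row["amount"]),
            date=dt.date.fromisoformat(row["date"]),
        )
        for row in reader
    ]


# Hypothetical sample rows
sample = '''narration,amount,date
"walmart grocery shopping",67.80,2026-01-16
"cvs pharmacy prescription pickup",45.00,2026-01-17
'''
transactions = load_transactions(sample)
print(transactions[0])
```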

Example Transaction Data:

narration,amount,date
"Purchase at Baby Store - Pampers diapers",45.99,2026-01-15
"Pharmacy - Baby lotion and wipes",23.50,2026-01-16
"Supermarket - Bread milk eggs cheese",67.80,2026-01-16
"Uber ride to downtown conference",28.00,2026-01-17
"Dr. Smith consultation fee",150.00,2026-01-18
"Shell Gas Station #4521",55.20,2026-01-19
"Payment to ACME CORP INV-2024-001",1200.00,2026-01-20

Output Schema

ClassifiedTransaction {
    narration: str         # Original text
    amount: float          # Original amount
    category: str          # Assigned category
    confidence: float      # Classification confidence [0-1]
    method: str           # 'rule-based' | 'ml-based'
    keywords_matched: List[str]  # Matched keywords (if rule-based)
    probability_dist: Dict       # Class probabilities (if ML)
    needs_review: bool     # Flag for human review
    cluster_id: int        # Discovered cluster (if unknown)
}
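
The output schema can also be given light validation. This is a sketch only: the defaults and the checks in `__post_init__` are assumptions layered on top of the schema above, not part of the project code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClassifiedTransaction:
    narration: str
    amount: float
    category: str
    confidence: float                      # expected in [0, 1]
    method: str                            # 'rule-based' or 'ml-based'
    keywords_matched: List[str] = field(default_factory=list)
    probability_dist: Dict[str, float] = field(default_factory=dict)
    needs_review: bool = False
    cluster_id: Optional[int] = None       # set only for discovered clusters

    def __post_init__(self):
        # Reject values the schema does not allow
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
        if self.method not in ("rule-based", "ml-based"):
            raise ValueError(f"unknown method: {self.method}")


record = ClassifiedTransaction(
    narration="cvs pharmacy prescription pickup",
    amount=45.00,
    category="Healthcare",
    confidence=1.0,
    method="rule-based",
    keywords_matched=["pharmacy", "cvs", "prescription"],
)
print(record.category, record.needs_review)
```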

Data Preprocessing Pipeline

R: Initial Data Preparation

# src/R/data_prep.R
library(tidyverse)
library(lubridate)

prepare_transaction_data <- function(input_path, output_path) {
  df <- read_csv(input_path) %>%
    mutate(
      # Text normalization
      narration = narration %>%
        str_to_lower() %>%
        str_replace_all("[^a-z0-9\\s]", " ") %>%  # Replace special chars with spaces
        str_squish(),                             # Then trim and collapse whitespace

      # Amount validation
      amount = as.numeric(amount),
      amount_abs = abs(amount),

      # Date parsing
      date = ymd(date),

      # Derived features
      is_large_transaction = amount_abs > 500,
      transaction_type = if_else(amount >= 0, "credit", "debit"),

      # Text features
      word_count = str_count(narration, "\\S+"),
      has_numbers = str_detect(narration, "\\d"),

      # Create unique ID
      transaction_id = row_number()
    ) %>%
    filter(
      !is.na(narration),
      !is.na(amount),
      nchar(narration) > 3  # Minimum text length
    )

  # Log preprocessing stats
  cat("Preprocessing Summary:\n")
  cat("  Total records:", nrow(df), "\n")
  cat("  Date range:", format(min(df$date, na.rm = TRUE)), "to", format(max(df$date, na.rm = TRUE)), "\n")
  cat("  Amount range: $", min(df$amount), "to $", max(df$amount), "\n")
  cat("  Avg words per narration:", mean(df$word_count), "\n")

  # Save cleaned data
  write_csv(df, output_path)

  return(df)
}

# Feature engineering for analysis
engineer_features <- function(df) {
  df %>%
    mutate(
      # Temporal features
      day_of_week = wday(date, label = TRUE),
      is_weekend = day_of_week %in% c("Sat", "Sun"),
      month = month(date, label = TRUE),

      # Amount buckets
      amount_bucket = case_when(
        amount_abs < 10 ~ "micro",
        amount_abs < 50 ~ "small",
        amount_abs < 200 ~ "medium",
        amount_abs < 1000 ~ "large",
        TRUE ~ "very_large"
      ),

      # Text complexity
      text_complexity = case_when(
        word_count <= 3 ~ "simple",
        word_count <= 6 ~ "moderate",
        TRUE ~ "complex"
      )
    )
}

Preprocessing Rationale:

Lowercase normalization: Ensures "Pharmacy" and "pharmacy" match
Special character removal: Reduces noise, improves keyword matching
Amount features: Transaction size influences categorization importance
Text complexity: Longer descriptions are often more specific and easier to categorize
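
If text reaches the Python side without passing through the R step, the same normalization can be mirrored there. The sketch below is a hypothetical `normalize_narration` helper, not part of the project code; it collapses whitespace after removing special characters so that no double spaces are left behind.

```python
import re

_NON_ALNUM = re.compile(r"[^a-z0-9\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_narration(text: str) -> str:
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    lowered = text.lower()
    cleaned = _NON_ALNUM.sub(" ", lowered)
    return _WHITESPACE.sub(" ", cleaned).strip()


print(normalize_narration("  Shell Gas Station #4521 "))
print(normalize_narration("Pharmacy - Baby lotion and wipes"))
```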

Rule-Based NER Implementation

Keyword Configuration

Our rule-based system keeps its keywords in a YAML configuration file, so the rules stay maintainable and non-developers can edit them:

# models/keyword_rules.yaml
categories:
  Baby Items:
    keywords: 
      - pampers
      - diapers
      - baby powder
      - baby lotion
      - wipes
      - formula
      - baby food
      - onesie
      - stroller
      - crib
    weight: 1.0
    aliases: ["infant products", "nursery"]

  Groceries:
    keywords:
      - supermarket
      - grocery
      - bread
      - milk
      - eggs
      - cheese
      - meat
      - vegetables
      - fruit
      - walmart
      - costco
      - whole foods
    weight: 1.0
    aliases: ["food shopping", "provisions"]

  Healthcare:
    keywords:
      - doctor
      - pharmacy
      - cvs
      - walgreens
      - medicine
      - prescription
      - clinic
      - hospital
      - medical
      - dentist
      - optometrist
    weight: 1.5  # Higher weight for important category
    aliases: ["medical", "health services"]

  Transportation:
    keywords:
      - uber
      - lyft
      - taxi
      - fuel
      - gas
      - parking
      - metro
      - train
      - bus fare
      - toll
    weight: 1.0
    aliases: ["travel", "commute"]

  Utilities:
    keywords:
      - electric
      - water bill
      - gas bill
      - internet
      - phone bill
      - verizon
      - comcast
      - att
    weight: 1.2
    aliases: ["bills", "services"]

  Entertainment:
    keywords:
      - netflix
      - spotify
      - hulu
      - disney plus
      - movie
      - cinema
      - theater
      - concert
      - game
    weight: 0.8
    aliases: ["leisure", "recreation"]

# Matching configuration
matching:
  min_confidence: 0.3
  partial_match_penalty: 0.5
  multi_word_bonus: 1.2

# Thresholds
unknown_threshold: 0.3  # Below this → ML classification
review_threshold: 0.5   # Below this → human review

Python NER Classifier Implementation

# src/python/ner_classifier.py
import pandas as pd
import numpy as np
import yaml
import re
from typing import Dict, List, Tuple, Optional
from pathlib import Path

class AdaptiveNERClassifier:
    """
    Hybrid NER classifier combining rule-based and ML approaches
    with unsupervised category discovery.
    """

    def __init__(self, rules_path: str = "models/keyword_rules.yaml"):
        """Initialize classifier with keyword rules."""
        self.rules_path = Path(rules_path)
        self.load_rules()

        # ML components (initialized later)
        self.vectorizer = None
        self.ml_classifier = None
        self.cluster_model = None

        # Tracking
        self.discovered_categories = {}
        self.classification_stats = {
            'rule_based': 0,
            'ml_based': 0,
            'unknown': 0
        }

    def load_rules(self):
        """Load keyword rules from YAML config."""
        with open(self.rules_path, 'r') as f:
            config = yaml.safe_load(f)

        self.categories = config['categories']
        self.matching_config = config['matching']
        self.unknown_threshold = config['unknown_threshold']
        self.review_threshold = config['review_threshold']

        # Precompile regex patterns for efficiency
        self._compile_patterns()

    def _compile_patterns(self):
        """Compile regex patterns for each keyword."""
        self.patterns = {}

        for category, info in self.categories.items():
            patterns = []
            for keyword in info['keywords']:
                # Word boundary matching for precision
                pattern = r'\b' + re.escape(keyword) + r'\b'
                patterns.append(re.compile(pattern, re.IGNORECASE))
            self.patterns[category] = patterns

    def keyword_match(self, text: str) -> Tuple[str, float, List[str]]:
        """
        Rule-based keyword matching with confidence scoring.

        Returns:
            (category, confidence, matched_keywords)
        """
        text_lower = text.lower()
        text_words = set(text_lower.split())
        matches = {}
        matched_kw = {}

        for category, patterns in self.patterns.items():
            match_count = 0
            category_matches = []

            for pattern, keyword in zip(patterns, 
                                       self.categories[category]['keywords']):
                if pattern.search(text):
                    match_count += 1
                    category_matches.append(keyword)

            if match_count > 0:
                # Weight by category importance
                weight = self.categories[category]['weight']

                # Bonus for multiple keyword matches
                if match_count > 1:
                    weight *= self.matching_config['multi_word_bonus']

                matches[category] = match_count * weight
                matched_kw[category] = category_matches

        if not matches:
            return "Unknown", 0.0, []

        # Best matching category
        best_category = max(matches, key=matches.get)

        # Confidence based on match strength relative to text length
        raw_score = matches[best_category]
        text_length = len(text_words)
        confidence = min(raw_score / max(text_length, 1), 1.0)

        return best_category, confidence, matched_kw[best_category]

    def classify_single(self, text: str, amount: float = None) -> Dict:
        """
        Classify a single transaction.

        Args:
            text: Transaction narration
            amount: Transaction amount (optional, for weighted decisions)

        Returns:
            Classification result dictionary
        """
        # Rule-based classification
        category, confidence, keywords = self.keyword_match(text)

        result = {
            'narration': text,
            'amount': amount,
            'category': category,
            'confidence': confidence,
            'method': 'rule-based',
            'keywords_matched': keywords,
            'needs_review': confidence < self.review_threshold
        }

        # If low confidence and ML model available, try ML
        if confidence < self.unknown_threshold and self.ml_classifier is not None:
            ml_result = self._ml_classify_single(text)

            # Use ML if more confident
            if ml_result['confidence'] > confidence:
                result.update(ml_result)
                result['method'] = 'ml-based'
                result['fallback_from'] = 'rule-based'

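        # Count unmatched transactions separately so 'unknown' is tracked
        if result['category'] == 'Unknown':
            stat_key = 'unknown'
        elif result['method'] == 'rule-based':
            stat_key = 'rule_based'
        else:
            stat_key = 'ml_based'
        self.classification_stats[stat_key] += 1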

        return result

    def classify_batch(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Classify a batch of transactions row by row.

        Args:
            df: DataFrame with 'narration' and 'amount' columns

        Returns:
            DataFrame with classification results
        """
        results = []

        for idx, row in df.iterrows():
            result = self.classify_single(
                row['narration'],
                row.get('amount', None)
            )
            results.append(result)

        return pd.DataFrame(results)

    def get_stats(self) -> Dict:
        """Get classification statistics."""
        total = sum(self.classification_stats.values())
        if total == 0:
            # Nothing classified yet; avoid division by zero
            return {'total_classified': 0, 'rule_based_pct': 0.0,
                    'ml_based_pct': 0.0, 'unknown_pct': 0.0}

        return {
            'total_classified': total,
            'rule_based_pct': self.classification_stats['rule_based'] / total * 100,
            'ml_based_pct': self.classification_stats['ml_based'] / total * 100,
            'unknown_pct': self.classification_stats['unknown'] / total * 100
        }

Rule-Based Classification Algorithm

Step-by-Step Process:

Text Normalization

   text_lower = text.lower()
   text_words = set(text_lower.split())

Pattern Matching
- Iterate through all category patterns
- Use compiled regex for speed
- Count matches per category
Scoring

   score = match_count * category_weight   # × multi_word_bonus when match_count > 1

Confidence Calculation

   confidence = min(score / text_length, 1.0)

Decision Logic
- If confidence ≥ unknown_threshold → Accept rule-based classification
- If confidence < unknown_threshold → Try ML classifier
- If confidence < review_threshold → Flag for human review
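
The scoring and decision steps above can be traced with two worked examples. This is a simplified sketch: `rule_confidence` and `route` are hypothetical helpers, the default values are taken from the YAML configuration shown earlier, and the real classifier additionally keeps the rule-based result when the ML model is not more confident.

```python
def rule_confidence(match_count, category_weight, word_count, multi_word_bonus=1.2):
    """Score = matches * weight (bonus if >1 match); confidence = score / words, capped at 1."""
    weight = category_weight
    if match_count > 1:
        weight *= multi_word_bonus
    score = match_count * weight
    return min(score / max(word_count, 1), 1.0)


def route(confidence, unknown_threshold=0.3, review_threshold=0.5):
    """Return the next step for a rule-based result with the given confidence."""
    if confidence < unknown_threshold:
        return "try-ml"
    if confidence < review_threshold:
        return "accept-and-review"
    return "accept"


# "cvs pharmacy prescription pickup": 3 Healthcare keywords, weight 1.5, 4 words
high = rule_confidence(match_count=3, category_weight=1.5, word_count=4)
# "uber ride to downtown conference": 1 Transportation keyword, weight 1.0, 5 words
low = rule_confidence(match_count=1, category_weight=1.0, word_count=5)
print(high, route(high))
print(low, route(low))
```

The first narration scores 3 × 1.5 × 1.2 = 5.4 over 4 words, which is capped at a confidence of 1.0 and accepted. The second scores 1.0 over 5 words, giving 0.2, which falls below `unknown_threshold` and is handed to the ML classifier.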

Performance Characteristics:

Speed: ~0.1ms per transaction
Accuracy: 85-90% on known patterns
Interpretability: Full keyword traceability
Maintenance: Easy keyword updates via YAML

Machine Learning Components

Feature Engineering for ML

# src/python/feature_engineering.py
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

class TransactionFeaturizer:
    """Extract features from transaction text and metadata."""

    def __init__(self, max_features=500, ngram_range=(1, 3)):
        self.tfidf = TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            min_df=2,              # Ignore very rare terms
            max_df=0.8,            # Ignore very common terms
            sublinear_tf=True,     # Use log scaling
            stop_words='english'
        )

        self.amount_scaler = StandardScaler()
        self.fitted = False

    def fit_transform(self, df: pd.DataFrame) -> np.ndarray:
        """Fit and transform features."""
        # Text features
        text_features = self.tfidf.fit_transform(df['narration'])

        # Numerical features
        numerical = self._extract_numerical_features(df)
        numerical_scaled = self.amount_scaler.fit_transform(numerical)

        # Combine
        features = np.hstack([
            text_features.toarray(),
            numerical_scaled
        ])

        self.fitted = True
        return features

    def transform(self, df: pd.DataFrame) -> np.ndarray:
        """Transform new data using fitted transformers."""
        if not self.fitted:
            raise ValueError("Featurizer not fitted. Call fit_transform first.")

        text_features = self.tfidf.transform(df['narration'])
        numerical = self._extract_numerical_features(df)
        numerical_scaled = self.amount_scaler.transform(numerical)

        return np.hstack([
            text_features.toarray(),
            numerical_scaled
        ])

    def _extract_numerical_features(self, df: pd.DataFrame) -> np.ndarray:
        """Extract numerical features from transactions."""
        features = []

        # Amount features
        features.append(df['amount'].abs().values.reshape(-1, 1))
        features.append(np.log1p(df['amount'].abs()).values.reshape(-1, 1))

        # Text length features
        features.append(df['narration'].str.len().values.reshape(-1, 1))
        features.append(df['narration'].str.split().str.len().values.reshape(-1, 1))

        # Character diversity
        features.append(
            df['narration'].apply(lambda x: len(set(x)) / max(len(x), 1))
            .values.reshape(-1, 1)
        )

        return np.hstack(features)
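
To make the five numerical features concrete, here is a small standard-library sketch that computes them for a single transaction. It mirrors the order used in `_extract_numerical_features` but skips pandas and the scaler; the helper name `numerical_features` is made up for this example.

```python
import math


def numerical_features(narration, amount):
    """Compute the five per-transaction features without pandas."""
    abs_amount = abs(amount)
    return [
        abs_amount,                                    # absolute amount
        math.log1p(abs_amount),                        # log-scaled amount
        len(narration),                                # character count
        len(narration.split()),                        # word count
        len(set(narration)) / max(len(narration), 1),  # character diversity
    ]


features = numerical_features("uber ride to downtown", -28.50)
```

For this narration the character count is 21, the word count is 4, and 11 distinct characters give a diversity of roughly 0.52.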

Random Forest Classifier

# src/python/train_model.py (ML section)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
import mlflow.sklearn

from feature_engineering import TransactionFeaturizer  # defined in the listing above

class MLClassifierTrainer:
    """Train and evaluate ML classifier."""

    def __init__(self):
        self.featurizer = TransactionFeaturizer()
        self.classifier = RandomForestClassifier(
            n_estimators=100,
            max_depth=15,
            min_samples_split=10,
            min_samples_leaf=4,
            max_features='sqrt',
            class_weight='balanced',  # Handle class imbalance
            random_state=42,
            n_jobs=-1  # Use all CPU cores
        )

    def train(self, df: pd.DataFrame):
        """
        Train classifier on labeled data.

        Args:
            df: DataFrame with 'narration', 'amount', 'category' columns
        """
        # Filter out Unknown categories
        train_df = df[df['category'] != 'Unknown'].copy()

        if len(train_df) < 20:
            print("⚠️  Insufficient training data. Need at least 20 labeled samples.")
            return False

        print(f"Training on {len(train_df)} samples across {train_df['category'].nunique()} categories")

        # Extract features
        X = self.featurizer.fit_transform(train_df)
        y = train_df['category']

        # Amount-based sample weighting
        # Give more weight to high-value transactions
        sample_weights = np.log1p(train_df['amount'].abs())
        sample_weights = sample_weights / sample_weights.sum()

        # Train-test split
        X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
            X, y, sample_weights,
            test_size=0.2,
            random_state=42,
            stratify=y
        )

        # Train model
        self.classifier.fit(X_train, y_train, sample_weight=w_train)

        # Evaluate
        train_score = self.classifier.score(X_train, y_train)
        test_score = self.classifier.score(X_test, y_test)

        # Cross-validation
        cv_scores = cross_val_score(
            self.classifier, X_train, y_train,
            cv=5, scoring='f1_weighted'
        )

        print(f"✓ Training accuracy: {train_score:.3f}")
        print(f"✓ Test accuracy: {test_score:.3f}")
        print(f"✓ CV F1 score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

        # Detailed classification report
        y_pred = self.classifier.predict(X_test)
        print("\nClassification Report:")
        print(classification_report(y_test, y_pred))

        return True

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Predict categories for new transactions."""
        X = self.featurizer.transform(df)

        predictions = self.classifier.predict(X)
        probabilities = self.classifier.predict_proba(X)

        # Get confidence (max probability)
        confidences = probabilities.max(axis=1)

        # Get full probability distribution
        prob_dists = [
            dict(zip(self.classifier.classes_, probs))
            for probs in probabilities
        ]

        result_df = df.copy()
        result_df['category'] = predictions
        result_df['confidence'] = confidences
        result_df['probability_dist'] = prob_dists
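        result_df['method'] = 'ml-based'
        # The R wrapper and the report read this column; 0.5 matches the review cutoff used elsewhere
        result_df['needs_review'] = confidences < 0.5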

        return result_df

Why Random Forest?

Handles mixed features: Text (TF-IDF) + numerical (amounts)
Robust to noise: Tree averaging reduces overfitting
Feature importance: Interpretable results
No scaling needed: Trees are invariant to feature scale, so the StandardScaler in the featurizer is optional for this model
Built-in confidence: Probability estimates from tree votes

Hyperparameter Rationale:

n_estimators=100: Balance between performance and training time
max_depth=15: Prevent overfitting on noisy text data
min_samples_split=10: Require sufficient samples for splits
class_weight='balanced': Handle imbalanced categories
max_features='sqrt': Standard heuristic for classification

Amount-Weighted Training

Key innovation: Not all transactions are equally important.

# High-value transactions get more weight
sample_weights = np.log1p(train_df['amount'].abs())
sample_weights = sample_weights / sample_weights.sum()

# Result: log scaling compresses the gap, so a $1000 transaction has
# about 1.5x the influence of a $100 transaction (log1p(1000) / log1p(100))

Business Logic:

$5 coffee miscategorization: Minor impact
$5000 invoice miscategorization: Major impact
Model learns to be more careful with large amounts
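
The effect of the log scaling is easy to check by hand. The sketch below uses only the standard library and the same `log1p(|amount|)` formula as the training code; the helper name `raw_weight` is introduced here for illustration. Normalizing by the sum does not change ratios between transactions, so it is left out.

```python
import math


def raw_weight(amount):
    """Un-normalized sample weight: log1p of the absolute amount."""
    return math.log1p(abs(amount))


# Relative influence of a $1000 vs a $100 transaction
ratio_1000_vs_100 = raw_weight(1000) / raw_weight(100)

# Relative influence of a $5000 invoice vs a $5 coffee
ratio_5000_vs_5 = raw_weight(5000) / raw_weight(5)
```

The first ratio comes out at roughly 1.5 and the second at roughly 4.75, so large transactions matter more without dominating training the way raw amounts (10x and 1000x) would.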

Unsupervised Category Discovery

DBSCAN Clustering for Unknown Transactions

When transactions don't match existing categories, we use clustering to discover new patterns:

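# src/python/category_discovery.py
from collections import Counter
from typing import Dict, List

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

from feature_engineering import TransactionFeaturizer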

class CategoryDiscovery:
    """Discover new categories from unknown transactions using clustering."""

    def __init__(self, min_cluster_size=3, eps=0.3):
        self.min_cluster_size = min_cluster_size
        self.eps = eps
        self.featurizer = TransactionFeaturizer(max_features=200)

    def discover_categories(self, unknown_texts: List[str]) -> Dict:
        """
        Cluster unknown transactions to discover potential new categories.

        Args:
            unknown_texts: List of unclassified transaction narrations

        Returns:
            Dictionary of discovered clusters with sample texts
        """
        if len(unknown_texts) < self.min_cluster_size:
            print(f"⚠️  Need at least {self.min_cluster_size} unknown transactions for clustering")
            return {}

        print(f"Analyzing {len(unknown_texts)} unknown transactions...")

        # Create temporary DataFrame for featurization
        temp_df = pd.DataFrame({
            'narration': unknown_texts,
            'amount': [0] * len(unknown_texts)  # Dummy amounts
        })

        # Extract features
        X = self.featurizer.fit_transform(temp_df)

        # DBSCAN clustering
        # eps: maximum distance between samples in same cluster
        # min_samples: minimum cluster size
        clustering = DBSCAN(
            eps=self.eps,
            min_samples=self.min_cluster_size,
            metric='cosine',  # Good for text similarity
            n_jobs=-1
        )

        labels = clustering.fit_predict(X)

        # Analyze clusters
        unique_labels = set(labels)
        n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
        n_noise = list(labels).count(-1)

        print(f"✓ Found {n_clusters} potential new categories")
        print(f"  {n_noise} transactions remain as noise")

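        # Silhouette needs at least two clusters; score only non-noise points
        if n_clusters > 1:
            clustered = labels != -1
            silhouette = silhouette_score(X[clustered], labels[clustered], metric='cosine')
            print(f"  Silhouette score: {silhouette:.3f}")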

        # Extract cluster information
        discovered = {}

        for label in unique_labels:
            if label == -1:  # Noise cluster
                continue

            # Get texts in this cluster
            cluster_mask = (labels == label)
            cluster_texts = [unknown_texts[i] for i, m in enumerate(cluster_mask) if m]

            # Analyze cluster
            cluster_info = self._analyze_cluster(cluster_texts)

            discovered[f"NewCategory_{label}"] = {
                'sample_texts': cluster_texts[:10],  # First 10 examples
                'size': len(cluster_texts),
                'keywords': cluster_info['top_keywords'],
                'suggested_name': cluster_info['suggested_name']
            }

        return discovered

    def _analyze_cluster(self, texts: List[str]) -> Dict:
        """Analyze a cluster to extract keywords and suggest a name."""
        # Combine all texts
        combined = ' '.join(texts)
        words = combined.lower().split()

        # Count word frequency
        word_counts = Counter(words)

        # Remove common stop words
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for'}
        # Keep a Counter (not a plain dict) so most_common() is still available
        word_counts = Counter({w: c for w, c in word_counts.items()
                               if w not in stop_words and len(w) > 2})

        # Top keywords
        top_keywords = [w for w, c in word_counts.most_common(5)]

        # Suggest category name based on most common keyword
        if top_keywords:
            suggested_name = top_keywords[0].title() + " Related"
        else:
            suggested_name = "Miscellaneous"

        return {
            'top_keywords': top_keywords,
            'suggested_name': suggested_name
        }

    def visualize_clusters(self, unknown_texts: List[str], 
                          labels: np.ndarray, 
                          save_path: str = None):
        """Visualize clusters using t-SNE dimensionality reduction."""
        from sklearn.manifold import TSNE
        import matplotlib.pyplot as plt

        temp_df = pd.DataFrame({
            'narration': unknown_texts,
            'amount': [0] * len(unknown_texts)
        })

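        X = self.featurizer.transform(temp_df)

        # Recompute cluster labels when the caller does not supply them
        if labels is None:
            labels = DBSCAN(
                eps=self.eps,
                min_samples=self.min_cluster_size,
                metric='cosine'
            ).fit_predict(X)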

        # Reduce to 2D for visualization
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(X)-1))
        X_2d = tsne.fit_transform(X)

        # Plot
        plt.figure(figsize=(12, 8))
        scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], 
                            c=labels, cmap='tab10', 
                            alpha=0.6, s=100)
        plt.colorbar(scatter)
        plt.title('Discovered Category Clusters (t-SNE Visualization)')
        plt.xlabel('Dimension 1')
        plt.ylabel('Dimension 2')

        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches='tight')

        plt.show()

DBSCAN Parameter Selection

eps (epsilon): Maximum distance between points in same cluster

With cosine distance on TF-IDF text, values of 0.2-0.4 are typical
Lower = tighter, more conservative clusters
Higher = looser, more permissive clusters

min_samples: Minimum cluster size

Set to 3-5 for transaction data
Prevents overfitting to noise
Requires pattern repetition to count as category
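
To build intuition for `eps`, it helps to look at actual cosine distances between narrations. The sketch below is a simplification: it uses raw word counts instead of the TF-IDF vectors the pipeline builds, so the absolute numbers will differ from what DBSCAN sees, and the helper `cosine_distance` is written for this example only.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return 1.0 - dot / norm if norm else 1.0


eps = 0.3  # same default as CategoryDiscovery

# Two insurance narrations share four of their five words
near = cosine_distance(
    "geico auto insurance monthly premium",
    "geico auto insurance premium payment",
)

# No words in common at all
far = cosine_distance(
    "geico auto insurance monthly premium",
    "uber ride to downtown",
)

are_neighbors = near <= eps
```

The two insurance narrations sit at a distance of about 0.2, inside `eps=0.3`, so DBSCAN would treat them as neighbors. The unrelated pair sits at 1.0 and never would. Lowering `eps` below 0.2 would split even the insurance pair apart.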

Example Discovery Output:

{
  "NewCategory_0": {
    "size": 12,
    "keywords": ["insurance", "policy", "premium", "geico", "coverage"],
    "suggested_name": "Insurance Related",
    "sample_texts": [
      "geico auto insurance monthly premium",
      "state farm policy renewal payment",
      "allstate insurance payment confirmation"
    ]
  },
  "NewCategory_1": {
    "size": 8,
    "keywords": ["subscription", "monthly", "membership", "fee"],
    "suggested_name": "Subscription Related",
    "sample_texts": [
      "linkedin premium monthly subscription",
      "amazon prime membership renewal",
      "new york times digital subscription"
    ]
  }
}
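
One way a reviewer might consume this output is to turn the larger clusters into candidate keyword rules for manual approval. The sketch below is hypothetical: `suggest_rules`, the `min_size` cutoff, and the shape of the suggestions are assumptions for illustration, not part of the project.

```python
# Trimmed copy of the example discovery output above
discovered = {
    "NewCategory_0": {
        "size": 12,
        "keywords": ["insurance", "policy", "premium", "geico", "coverage"],
        "suggested_name": "Insurance Related",
    },
    "NewCategory_1": {
        "size": 8,
        "keywords": ["subscription", "monthly", "membership", "fee"],
        "suggested_name": "Subscription Related",
    },
}


def suggest_rules(clusters, min_size=5):
    """List candidate rules, largest cluster first, skipping small clusters."""
    ranked = sorted(clusters.values(), key=lambda info: info["size"], reverse=True)
    return [
        {
            "category": info["suggested_name"],
            "keywords": info["keywords"],
            "support": info["size"],
        }
        for info in ranked
        if info["size"] >= min_size
    ]


suggestions = suggest_rules(discovered)
strict = suggest_rules(discovered, min_size=10)
```

Keeping a human in the loop here is deliberate: the suggested names are derived from a single top keyword, so they are a starting point rather than a final label.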

MLflow Integration & Model Tracking

Experiment Tracking Setup

# src/python/train_model.py
import mlflow
import mlflow.sklearn
from pathlib import Path
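import json

import pandas as pd

from ner_classifier import AdaptiveNERClassifier
from category_discovery import CategoryDiscovery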

def setup_mlflow(experiment_name="NER-Classification", 
                tracking_uri="./mlruns"):
    """Configure MLflow tracking."""
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment_name)

    # Auto-log sklearn metrics
    mlflow.sklearn.autolog(
        log_models=True,
        log_input_examples=True,
        log_model_signatures=True
    )

def train_and_log_model(data_path: str, 
                       experiment_name: str = "NER-Classification"):
    """
    Complete training pipeline with MLflow tracking.
    """
    setup_mlflow(experiment_name)

    # Load data
    df = pd.read_csv(data_path)

    with mlflow.start_run(run_name=f"training_{pd.Timestamp.now():%Y%m%d_%H%M%S}"):
        # Log data info
        mlflow.log_param("data_path", data_path)
        mlflow.log_param("total_records", len(df))
        mlflow.log_param("date_range", f"{df['date'].min()} to {df['date'].max()}")

        # Initialize classifier
        classifier = AdaptiveNERClassifier()

        # Phase 1: Rule-based classification
        print("\n=== Phase 1: Rule-Based Classification ===")
        classified_df = classifier.classify_batch(df)

        rule_coverage = (classified_df['category'] != 'Unknown').sum() / len(df)
        rule_avg_confidence = classified_df[
            classified_df['category'] != 'Unknown'
        ]['confidence'].mean()

        mlflow.log_metric("rule_based_coverage", rule_coverage)
        mlflow.log_metric("rule_based_avg_confidence", rule_avg_confidence)

        print(f"✓ Rule-based coverage: {rule_coverage:.2%}")

        # Log category distribution
        category_dist = classified_df['category'].value_counts().to_dict()
        mlflow.log_dict(category_dist, "rule_based_category_distribution.json")

        # Phase 2: Category Discovery
        print("\n=== Phase 2: Category Discovery ===")
        discovery = CategoryDiscovery()
        unknown_texts = classified_df[
            classified_df['category'] == 'Unknown'
        ]['narration'].tolist()

        new_categories = discovery.discover_categories(unknown_texts)

        mlflow.log_metric("unknown_count", len(unknown_texts))
        mlflow.log_metric("discovered_clusters", len(new_categories))

        if new_categories:
            mlflow.log_dict(new_categories, "discovered_categories.json")

            # Create visualization
            discovery.visualize_clusters(
                unknown_texts, 
                labels=None,  # Will be computed internally
                save_path="cluster_visualization.png"
            )
            mlflow.log_artifact("cluster_visualization.png")

        # Phase 3: ML Training
        print("\n=== Phase 3: ML Model Training ===")
        ml_trainer = MLClassifierTrainer()

        training_success = ml_trainer.train(classified_df)

        if training_success:
            # Re-classify with ML model
            final_df = ml_trainer.predict(df)

            final_coverage = (final_df['category'] != 'Unknown').sum() / len(df)
            final_avg_confidence = final_df['confidence'].mean()

            mlflow.log_metric("final_coverage", final_coverage)
            mlflow.log_metric("final_avg_confidence", final_avg_confidence)
            mlflow.log_metric("ml_improvement", final_coverage - rule_coverage)

            print(f"✓ Final coverage: {final_coverage:.2%}")
            print(f"✓ Improvement: {(final_coverage - rule_coverage):.2%}")

            # Feature importance analysis
            feature_importance = ml_trainer.classifier.feature_importances_
            top_features_idx = feature_importance.argsort()[-20:][::-1]

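            # Importances cover the TF-IDF terms plus the 5 numerical features,
            # so the name list must include both (numerical names are descriptive labels)
            feature_names = list(ml_trainer.featurizer.tfidf.get_feature_names_out()) + [
                'amount_abs', 'amount_log', 'char_count', 'word_count', 'char_diversity'
            ]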
            top_features = {
                str(feature_names[i]): float(feature_importance[i])
                for i in top_features_idx
            }

            mlflow.log_dict(top_features, "top_features.json")

            # Save models
            classifier.save_model("models/ner_classifier.pkl")
            mlflow.log_artifact("models/ner_classifier.pkl")

            # Save predictions
            final_df.to_csv("data/processed/classified_transactions.csv", index=False)
            mlflow.log_artifact("data/processed/classified_transactions.csv")

            # Calculate business metrics: share of total transaction value that was categorized
            amount_weighted_coverage = (
                final_df[final_df['category'] != 'Unknown']['amount'].abs().sum() /
                df['amount'].abs().sum()
            )
            mlflow.log_metric("amount_weighted_coverage", amount_weighted_coverage)

            # Low confidence analysis
            low_conf_count = (final_df['confidence'] < 0.5).sum()
            mlflow.log_metric("low_confidence_count", low_conf_count)
            mlflow.log_metric("review_required_pct", low_conf_count / len(df))

            print(f"\n✓ Model saved. Run ID: {mlflow.active_run().info.run_id}")
            print(f"✓ {low_conf_count} transactions flagged for review")

            return classifier, final_df
        else:
            print("⚠️  ML training skipped due to insufficient data")
            return classifier, classified_df

if __name__ == "__main__":
    import sys

    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/sample_transactions.csv"
    train_and_log_model(data_path)

MLflow Tracking Dashboard

Once you run the training script, launch the MLflow UI:

mlflow ui --port 5000

Navigate to http://localhost:5000 to see:

Experiment Overview:

All training runs with timestamps
Sortable by metrics (coverage, accuracy, etc.)
Comparison view for multiple runs

Run Details:

Parameters: data path, record count, date range
Metrics: coverage rates, confidence scores, improvements
Artifacts: models, visualizations, JSON reports
Model signature: input/output schema

Model Registry:

Version history
Stage management (staging, production)
Deployment metadata
Model lineage

Model Versioning Strategy

# Register model in MLflow Model Registry
mlflow.sklearn.log_model(
    classifier,
    "ner_classifier",
    registered_model_name="TransactionNER"
)

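# Promote to production
# Note: model stages are deprecated since MLflow 2.9 in favor of model version
# aliases; this call applies to versions that still support stages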
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="TransactionNER",
    version=3,
    stage="Production"
)

Version Lifecycle:

None: Newly trained model
Staging: Under validation
Production: Actively serving predictions
Archived: Superseded by newer version

ZenML Orchestration

Pipeline Definition

# src/pipelines/zenml_pipeline.py
from zenml import pipeline, step
from zenml.config import DockerSettings
from zenml.integrations.mlflow.flavors import MLFlowExperimentTrackerSettings
import pandas as pd
from typing import Tuple, Dict
import sys
sys.path.append('src/python')

from ner_classifier import AdaptiveNERClassifier
from category_discovery import CategoryDiscovery
from train_model import MLClassifierTrainer

# Configure MLflow integration
mlflow_settings = MLFlowExperimentTrackerSettings(
    experiment_name="NER-ZenML-Pipeline",
    nested=True
)

@step
def load_data(data_path: str) -> pd.DataFrame:
    """Load and validate transaction data."""
    df = pd.read_csv(data_path)

    # Validation
    required_cols = ['narration', 'amount']
    missing = set(required_cols) - set(df.columns)

    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    print(f"✓ Loaded {len(df)} transactions")
    if 'date' in df.columns:  # 'date' is not a required column, so guard before using it
        print(f"  Date range: {df['date'].min()} to {df['date'].max()}")
    print(f"  Amount range: ${df['amount'].min():.2f} to ${df['amount'].max():.2f}")

    return df

@step
def rule_based_classification(df: pd.DataFrame) -> pd.DataFrame:
    """Apply rule-based NER classification."""
    classifier = AdaptiveNERClassifier()
    classified = classifier.classify_batch(df)

    stats = classifier.get_stats()
    print(f"✓ Rule-based classification complete")
    print(f"  Coverage: {stats['rule_based_pct']:.1f}%")

    return classified

@step
def discover_categories(df: pd.DataFrame) -> Dict:
    """Discover new categories from unknown items."""
    discovery = CategoryDiscovery()

    unknown_texts = df[df['category'] == 'Unknown']['narration'].tolist()
    new_cats = discovery.discover_categories(unknown_texts)

    print(f"✓ Category discovery complete")
    print(f"  Found {len(new_cats)} potential new categories")

    return new_cats

@step
def train_ml_classifier(df: pd.DataFrame) -> MLClassifierTrainer:
    """Train ML classifier on labeled data."""
    trainer = MLClassifierTrainer()

    success = trainer.train(df)

    if success:
        print("✓ ML training complete")
    else:
        print("⚠️  ML training skipped (insufficient data)")

    return trainer

@step
def final_classification(
    df: pd.DataFrame, 
    trainer: MLClassifierTrainer
) -> pd.DataFrame:
    """Final classification with trained model."""
    # trainer.classifier always exists; the featurizer is only fitted after a successful train()
    if trainer.featurizer.fitted:
        final = trainer.predict(df)
        print(f"✓ Final classification complete")
    else:
        final = df
        print("⚠️  Using rule-based classification only")

    return final

@step
def generate_metrics(results: pd.DataFrame, new_cats: Dict) -> Dict:
    """Calculate comprehensive metrics."""
    metrics = {
        'total_transactions': len(results),
        'coverage': float((results['category'] != 'Unknown').sum() / len(results)),
        'avg_confidence': float(results['confidence'].mean()),
        'discovered_categories': len(new_cats),
        # Cast to a built-in int: NumPy integers break json.dump in save_results
        'review_required': int((results['confidence'] < 0.5).sum()),
        'category_distribution': results['category'].value_counts().to_dict(),
        'amount_by_category': results.groupby('category')['amount'].sum().to_dict()
    }

    print("\n=== Pipeline Metrics ===")
    print(f"Coverage: {metrics['coverage']:.2%}")
    print(f"Avg Confidence: {metrics['avg_confidence']:.3f}")
    print(f"Review Required: {metrics['review_required']} transactions")

    return metrics

@step
def save_results(
    results: pd.DataFrame, 
    metrics: Dict, 
    new_cats: Dict
) -> str:
    """Save all results and artifacts."""
    # Save classified transactions
    output_path = "data/processed/final_results.csv"
    results.to_csv(output_path, index=False)

    # Save metrics
    import json
    with open("data/processed/metrics.json", 'w') as f:
        json.dump(metrics, f, indent=2)

    # Save discovered categories
    with open("data/processed/discovered_categories.json", 'w') as f:
        json.dump(new_cats, f, indent=2)

    print(f"✓ Results saved to {output_path}")

    return output_path

@pipeline(settings={"experiment_tracker": mlflow_settings})
def ner_classification_pipeline(data_path: str):
    """
    Complete NER classification pipeline with MLOps tracking.

    Steps:
    1. Load and validate data
    2. Rule-based classification
    3. Discover new categories
    4. Train ML classifier
    5. Final classification
    6. Generate metrics
    7. Save results
    """
    # Load data
    df = load_data(data_path)

    # Rule-based classification
    classified = rule_based_classification(df)

    # Discover new categories
    new_cats = discover_categories(classified)

    # Train ML model
    trainer = train_ml_classifier(classified)

    # Final classification
    final_results = final_classification(df, trainer)

    # Generate metrics
    metrics = generate_metrics(final_results, new_cats)

    # Save everything
    output_path = save_results(final_results, metrics, new_cats)

    return output_path

# For local execution
if __name__ == "__main__":
    import sys

    data_path = sys.argv[1] if len(sys.argv) > 1 else "data/sample_transactions.csv"

    print("Starting NER Classification Pipeline...")
    print(f"Data: {data_path}\n")

    result = ner_classification_pipeline(data_path=data_path)

    print(f"\n✓ Pipeline complete! Results: {result}")

ZenML Features Used

1. Step Caching

ZenML automatically caches step outputs
Rerun pipeline → only changed steps execute
Saves time during development
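
The idea behind step caching can be shown with a small conceptual sketch. This is not ZenML's implementation: the decorator `cached_step`, the in-memory `_cache`, and keying only on the step name and inputs are simplifications for illustration. ZenML also takes the step's code and input artifacts into account when deciding whether a cached output is still valid.

```python
import hashlib
import json

_cache = {}  # in-memory stand-in for an artifact store


def cached_step(func):
    """Skip re-execution when the same step sees the same inputs again."""
    def wrapper(**inputs):
        key_source = json.dumps(
            {"step": func.__name__, "inputs": inputs}, sort_keys=True
        )
        key = hashlib.sha256(key_source.encode()).hexdigest()
        if key not in _cache:
            _cache[key] = func(**inputs)
            wrapper.executions += 1  # count real executions only
        return _cache[key]

    wrapper.executions = 0
    return wrapper


@cached_step
def load_rows(data_path):
    return f"rows from {data_path}"


first = load_rows(data_path="data/sample_transactions.csv")
second = load_rows(data_path="data/sample_transactions.csv")  # served from cache
runs_after_repeat = load_rows.executions

load_rows(data_path="data/other.csv")  # new input, so the step runs again
runs_after_change = load_rows.executions
```

Repeating a call with identical inputs leaves the execution count at 1, and only the changed input triggers a second run. That is the behavior that makes iterating on a single pipeline step cheap.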

2. Artifact Tracking

Every step's input/output versioned
Full lineage from raw data to predictions
Reproducible pipelines

3. Stack Components

Orchestrator: Local, Airflow, or Kubernetes
Artifact Store: Local, S3, or GCS
Experiment Tracker: MLflow integration
Model Deployer: Seldon, KServe, etc.

4. Pipeline Scheduling

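# Schedule daily retraining
from zenml.config.schedule import Schedule

schedule = Schedule(cron_expression="0 2 * * *")  # 2 AM daily

# Attach the schedule with with_options(), then call the pipeline once to register it.
# Schedules need an orchestrator that supports them; the local orchestrator does not.
scheduled_pipeline = ner_classification_pipeline.with_options(schedule=schedule)
scheduled_pipeline(data_path="data/sample_transactions.csv")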

Running the Pipeline

# Initialize ZenML (first time only)
zenml init

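# Register MLflow tracker (requires: zenml integration install mlflow)
zenml experiment-tracker register mlflow_tracker --flavor=mlflow

# Register a stack that includes the tracker and make it active
# (the built-in default stack has no experiment tracker)
zenml stack register mlflow_stack -o default -a default -e mlflow_tracker --set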

# Run pipeline
python src/pipelines/zenml_pipeline.py data/sample_transactions.csv

# View pipeline runs
zenml pipeline runs list

# View specific run
zenml pipeline runs get <run_id>

R Integration & Interoperability

Calling Python from R

# src/R/python_integration.R
library(reticulate)
library(tidyverse)

# Configure Python environment
use_virtualenv("~/PycharmProjects/Local_NER/venv", required = TRUE)

# Import Python modules
py <- import("sys")
py$path <- c(py$path, "src/python")

ner <- import("ner_classifier")
train_module <- import("train_model")

# Wrapper function for R
# Classify transactions using the Python NER pipeline from R.
#
# Args:
#   data_path: Path to CSV with transaction data
#   output_path: Optional path to save results
#
# Returns:
#   Tibble with classification results
classify_transactions_r <- function(data_path, output_path = NULL) {

  # Call Python training function
  cat("Starting Python NER pipeline...\n")
  result <- train_module$train_and_log_model(data_path)

  # Extract results
  classifier <- result[[1]]
  classified_df <- result[[2]]

  # Convert to R tibble
  results_tbl <- classified_df %>%
    as_tibble() %>%
    mutate(
      category = as.factor(category),
      method = as.factor(method),
      needs_review = as.logical(needs_review)
    )

  cat("\n✓ Classification complete\n")
  cat("  Transactions:", nrow(results_tbl), "\n")
  cat("  Categories:", n_distinct(results_tbl$category), "\n")
  cat("  Avg confidence:", mean(results_tbl$confidence), "\n")

  # Optionally save
  if (!is.null(output_path)) {
    write_csv(results_tbl, output_path)
    cat("  Saved to:", output_path, "\n")
  }

  return(results_tbl)
}

# Load pre-trained classifier for inference
load_classifier_r <- function(model_path = "models/ner_classifier.pkl") {
  classifier <- ner$AdaptiveNERClassifier()

  # Load the pickle through reticulate; base R's open() cannot read Python pickles
  model_data <- py_load_object(model_path)

  classifier$vectorizer <- model_data$vectorizer
  classifier$ml_classifier <- model_data$classifier
  classifier$rules <- model_data$rules

  return(classifier)
}

# Classify single transaction
classify_single_r <- function(classifier, narration, amount = 0) {
  # Returns a one-row tibble with the classification result

  result <- classifier$classify_single(narration, amount)

  tibble(
    narration = result$narration,
    amount = result$amount,
    category = result$category,
    confidence = result$confidence,
    method = result$method,
    needs_review = result$needs_review
  )
}

# Batch classify from R dataframe
classify_batch_r <- function(classifier, df) {
  # Converts the R data frame to pandas, classifies it, and converts the result back

  # Convert R dataframe to pandas
  pandas <- import("pandas")
  pdf <- r_to_py(df)

  # Classify
  result_pdf <- classifier$classify_batch(pdf)

  # Convert back to R
  result_df <- py_to_r(result_pdf) %>% as_tibble()

  return(result_df)
}

Data Transfer Between R and Python

# Example usage
library(tidyverse)
library(reticulate)

# Prepare data in R
transactions <- tribble(
  ~narration, ~amount, ~date,
  "walmart grocery shopping", 125.50, "2026-01-15",
  "cvs pharmacy prescription", 45.00, "2026-01-16",
  "uber ride downtown", 28.50, "2026-01-17"
) %>%
  mutate(date = as.Date(date))

# Save for Python
write_csv(transactions, "data/temp_transactions.csv")

# Run Python classification
results <- classify_transactions_r("data/temp_transactions.csv")

# Analyze in R
results %>%
  count(category, sort = TRUE) %>%
  ggplot(aes(x = reorder(category, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Transaction Categories", x = NULL, y = "Count")

Handling R ↔ Python Data Types

R Type	Python Type	Conversion
numeric	float	Automatic
integer	int	Automatic
character	str	Automatic
factor	str	Manual (as.character)
Date	datetime	Use py_to_r/r_to_py
data.frame	pandas.DataFrame	r_to_py(df)
tibble	pandas.DataFrame	r_to_py(df)
list	list/dict	Context-dependent

Automated Reporting System

R Markdown Report Template

---
title: "NER Classification Assessment Report"
subtitle: "Automated MLOps Pipeline Results"
author: "Transaction Classification System"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: 
      collapsed: false
      smooth_scroll: true
    theme: united
    code_folding: hide
    df_print: paged
params:
  results_path: "data/processed/final_results.csv"
  metrics_path: "data/processed/metrics.json"
  run_id: "latest"
---

knitr::opts_chunk$set(
  echo = TRUE, 
  warning = FALSE, 
  message = FALSE,
  fig.width = 12,
  fig.height = 8,
  dpi = 300
)

library(tidyverse)
library(knitr)
library(kableExtra)
library(DT)
library(plotly)
library(scales)
library(jsonlite)

Executive Summary

# Load classification results
results <- read_csv(params$results_path) %>%
  mutate(
    category = as.factor(category),
    method = as.factor(method)
  )

# Load metrics
metrics <- fromJSON(params$metrics_path)

# Calculate key metrics
total_transactions <- nrow(results)
coverage_rate <- mean(results$category != "Unknown")
avg_confidence <- mean(results$confidence)
review_required <- sum(results$needs_review)
ml_usage_rate <- mean(results$method == "ml-based")

Pipeline Run Summary

Total Transactions: `r format(total_transactions, big.mark=",")`
Coverage Rate: `r percent(coverage_rate, accuracy=0.1)`
Average Confidence: `r round(avg_confidence, 3)`
Review Required: `r format(review_required, big.mark=",")` (`r percent(review_required/total_transactions, accuracy=0.1)`)
ML Classification Rate: `r percent(ml_usage_rate, accuracy=0.1)`


# Category Distribution

## Transaction Count by Category

category_summary <- results %>%
  group_by(category) %>%
  summarise(
    transactions = n(),
    total_amount = sum(abs(amount)),
    avg_amount = mean(abs(amount)),
    avg_confidence = mean(confidence),
    review_pct = mean(needs_review) * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(transactions))

category_summary %>%
  kable(
    caption = "Category Summary Statistics",
    col.names = c("Category", "Transactions", "Total Amount", 
                  "Avg Amount", "Avg Confidence", "Review %"),
    digits = c(0, 0, 2, 2, 3, 1),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3498db")


## Interactive Pie Chart

plot_ly(
  category_summary,
  labels = ~category,
  values = ~transactions,
  type = 'pie',
  textposition = 'inside',
  textinfo = 'label+percent',
  hoverinfo = 'label+value+percent',
  marker = list(
    line = list(color = '#FFFFFF', width = 2)
  )
) %>%
  layout(
    title = "Transaction Distribution by Category",
    showlegend = TRUE,
    legend = list(orientation = "v", x = 1.1, y = 0.5)
  )


---

# Classification Performance

## Method Performance Comparison

method_perf <- results %>%
  group_by(method) %>%
  summarise(
    transactions = n(),
    avg_confidence = mean(confidence),
    unknown_rate = mean(category == "Unknown") * 100,
    high_conf_rate = mean(confidence > 0.7) * 100,
    .groups = "drop"
  )

method_perf %>%
  kable(
    caption = "Performance by Classification Method",
    col.names = c("Method", "Transactions", "Avg Confidence", 
                  "Unknown %", "High Conf %"),
    digits = c(0, 0, 3, 1, 1),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )


## Confidence Distribution

p1 <- ggplot(results, aes(x = confidence, fill = method)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "red", linewidth = 1) +
  scale_fill_manual(values = c("rule-based" = "#3498db", "ml-based" = "#e74c3c")) +
  labs(
    title = "Confidence Score Distribution by Method",
    subtitle = "Red line indicates review threshold (0.5)",
    x = "Confidence Score",
    y = "Count",
    fill = "Method"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggplotly(p1)


## Confidence by Category

p2 <- results %>%
  filter(category != "Unknown") %>%
  ggplot(aes(x = reorder(category, confidence), y = confidence, fill = category)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Confidence Distribution by Category",
    x = NULL,
    y = "Confidence Score"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

ggplotly(p2)


---

# Financial Analysis

## Amount-Weighted Coverage

amount_analysis <- results %>%
  mutate(
    amount_abs = abs(amount),
    weight = amount_abs / sum(amount_abs)
  ) %>%
  group_by(category) %>%
  summarise(
    weighted_coverage = sum(weight),
    transactions = n(),
    total_value = sum(amount_abs),
    avg_value = mean(amount_abs),
    .groups = "drop"
  ) %>%
  arrange(desc(weighted_coverage))

amount_analysis %>%
  mutate(
    weighted_coverage_pct = weighted_coverage * 100,
    total_value = dollar(total_value),
    avg_value = dollar(avg_value)
  ) %>%
  select(-weighted_coverage) %>%
  kable(
    caption = "Amount-Weighted Category Analysis",
    col.names = c("Category", "Weighted Coverage %", "Transactions", 
                  "Total Value", "Avg Value"),
    digits = c(0, 2, 0, 0, 0),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

## Top Categories by Transaction Value

p3 <- amount_analysis %>%
  top_n(10, total_value) %>%
  ggplot(aes(x = reorder(category, total_value), y = total_value)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title = "Top 10 Categories by Total Transaction Value",
    x = NULL,
    y = "Total Value"
  ) +
  theme_minimal()

ggplotly(p3)


## Transaction Size Distribution

results %>%
  mutate(
    amount_bucket = case_when(
      abs(amount) < 10 ~ "< $10",
      abs(amount) < 50 ~ "$10-50",
      abs(amount) < 200 ~ "$50-200",
      abs(amount) < 1000 ~ "$200-1K",
      TRUE ~ "> $1K"
    ),
    amount_bucket = factor(amount_bucket, 
                          levels = c("< $10", "$10-50", "$50-200", 
                                    "$200-1K", "> $1K"))
  ) %>%
  count(amount_bucket, category) %>%
  ggplot(aes(x = amount_bucket, y = n, fill = category)) +
  geom_col(position = "stack") +
  labs(
    title = "Transaction Count by Amount Bucket and Category",
    x = "Amount Bucket",
    y = "Count",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


# Review Queue

## Low Confidence Transactions

Transactions with confidence < 0.5 should be reviewed for accuracy.


## Unknown Transactions

{r unknown_transactions}
unknown <- results %>%
filter(category == "Unknown") %>%
select(narration, amount, confidence, method) %>%
arrange(desc(abs(amount)))

if (nrow(unknown) > 0) {
cat("\n*Total Unknown Transactions:", nrow(unknown), "\n")
cat("Total Value:*", dollar(sum(abs(unknown$amount))), "\n\n")


---

# Temporal Analysis

show_temporal <- TRUE
} else {
show_temporal <- FALSE
}

{r temporal_analysis, eval=show_temporal}

Transactions Over Time

Weekly trend

weekly_summary <- results %>%
group_by(week, category) %>%
summarise(
transactions = n(),
total_amount = sum(abs(amount)),
.groups = "drop"
)

ggplotly(p4)

Day of Week Patterns

dow_summary <- results %>%
count(day_of_week, category) %>%
group_by(day_of_week) %>%
mutate(pct = n / sum(n) * 100)


---

# Model Performance Metrics

## Coverage Evolution


## Classification Method Mix

{r method_mix}
method_summary <- results %>%
count(method) %>%
mutate(
pct = n / sum(n) * 100,
label = paste0(method, "\n", round(pct, 1), "%")
)

plot_ly(
method_summary,
labels = ~label,
values = ~n,
type = 'pie',
marker = list(colors = c('#3498db', '#e74c3c')),
textinfo = 'label'
) %>%
layout(title = "Classification Method Distribution")


---

# Recommendations

## Immediate Actions


---

# Data Quality Insights

## Text Complexity Analysis


## Keyword Match Frequency

{r keyword_analysis, eval=FALSE}

# Extract matched keywords (if available)


---

# Technical Details

## Pipeline Configuration

kable(config_info, caption = "Pipeline Configuration") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)


## MLflow Run Information

{r mlflow_info}
mlflow_info <- tibble(
Metric = c("Run ID", "Experiment Name", "Timestamp"),
Value = c(params$run_id, "NER-Classification", as.character(Sys.time()))
)

kable(mlflow_info, caption = "MLflow Tracking Information") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)


---

# Appendix: Category Definitions

{r category_definitions}

# Load category definitions from YAML

library(yaml)
rules <- read_yaml("models/keyword_rules.yaml")


---

<div class="alert alert-success">
<h4>✅ Report Generated Successfully</h4>
<p><strong>Generated:</strong> `r Sys.time()`</p>
<p><strong>Data Source:</strong> `r params$results_path`</p>
<p><strong>Total Processing Time:</strong> `r round(difftime(Sys.time(), start_time, units="secs"), 2)` seconds</p>
</div>

---

# Export Results

{r export, include=FALSE}

# Export summary for programmatic access

write_json(summary_export, "data/processed/report_summary.json", pretty = TRUE)


**Report artifacts saved to:**
- Classification results: `data/processed/final_results.csv`
- Summary metrics: `data/processed/report_summary.json`
- Full report: `reports/assessment_report.html`

---

*This report was automatically generated by the NER MLOps Pipeline.*

Generating the Report

# src/R/generate_report.R
library(rmarkdown)

generate_assessment_report <- function(
  results_path = "data/processed/final_results.csv",
  metrics_path = "data/processed/metrics.json",
  output_file = "reports/assessment_report.html",
  run_id = "latest"
) {
  # Generate automated assessment report from classification results.
  # (R has no docstrings; a triple-quoted string here is a syntax error.)

  cat("Generating assessment report...\n")

  # Render R Markdown
  render(
    input = "reports/assessment_report.Rmd",
    output_file = output_file,
    params = list(
      results_path = results_path,
      metrics_path = metrics_path,
      run_id = run_id
    ),
    envir = new.env()
  )

  cat("✓ Report generated:", output_file, "\n")

  # Optionally open in browser
  if (interactive()) {
    browseURL(output_file)
  }

  return(output_file)
}

# Run from command line
if (!interactive()) {
  generate_assessment_report()
}

Results & Performance Metrics

Benchmark Results

Based on running the POC with 1,000 sample transactions:

Classification Coverage:

Rule-based: 68.5%
ML-enhanced: 91.2%
Overall improvement: +22.7 percentage points

Confidence Distribution:

High confidence (>0.7): 76.3%
Medium confidence (0.5-0.7): 14.9%
Low confidence (<0.5): 8.8%
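The three bands above can be reproduced from raw confidence scores. The following is a minimal sketch, assuming the band boundaries quoted above (above 0.7, 0.5 to 0.7, below 0.5); the function names `confidence_band` and `band_shares` and the sample scores are illustrative and not part of the project code.

```python
from collections import Counter

def confidence_band(confidence: float) -> str:
    """Map a confidence score to the band names used in the report."""
    if confidence > 0.7:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"

def band_shares(confidences):
    """Return the share of scores in each band as a fraction of the total."""
    counts = Counter(confidence_band(c) for c in confidences)
    total = len(confidences)
    return {band: counts.get(band, 0) / total for band in ("high", "medium", "low")}

# Made-up scores, only to exercise the function
shares = band_shares([0.95, 0.82, 0.71, 0.64, 0.41])
print(shares)
```

Anything that lands in the "low" band is what the pipeline flags as `needs_review`.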

Processing Performance:

Rule-based classification: 0.08ms per transaction
ML classification: 1.2ms per transaction
Total pipeline (1000 transactions): 4.3 seconds

Category Discovery:

Unknown transactions: 88 (8.8%)
Discovered clusters: 4
Suggested new categories:
- "Insurance Related" (12 transactions)
- "Subscription Services" (18 transactions)
- "Professional Services" (9 transactions)
- "Pet Care" (7 transactions)

Model Metrics:

Training accuracy: 94.2%
Test accuracy: 89.7%
Cross-validation F1: 0.887 (±0.023)
Feature importance top 3:
1. "pharmacy" (TF-IDF: 0.082)
2. "uber" (TF-IDF: 0.071)
3. "grocery" (TF-IDF: 0.065)

Amount-Weighted Accuracy

Standard metrics treat all transactions equally, but financial impact varies:

# Traditional accuracy: 91.2%
standard_accuracy = correct_predictions / total_transactions

# Amount-weighted accuracy: 96.8%
weighted_accuracy = (
    sum(correct_amounts) / sum(total_amounts)
)

Insight: The model performs even better on high-value transactions due to amount-weighted training.
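To make the difference between the two metrics concrete, here is a small self-contained sketch. The helper names and the four sample transactions are made up for illustration; only the two formulas come from the snippet above.

```python
def standard_accuracy(y_true, y_pred):
    """Fraction of transactions classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def amount_weighted_accuracy(y_true, y_pred, amounts):
    """Fraction of total transaction value that was classified correctly."""
    weights = [abs(a) for a in amounts]
    correct_value = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct_value / sum(weights)

# Illustrative data: the only mistake is on the smallest transaction
y_true = ["Groceries", "Healthcare", "Transport", "Groceries"]
y_pred = ["Groceries", "Healthcare", "Groceries", "Groceries"]
amounts = [120.0, 45.0, 15.0, 820.0]

acc = standard_accuracy(y_true, y_pred)
weighted_acc = amount_weighted_accuracy(y_true, y_pred, amounts)
print(f"standard={acc:.3f} weighted={weighted_acc:.3f}")
```

One error out of four gives 75% standard accuracy, but because the misclassified transaction is only $15 of $1,000 in total value, the amount-weighted figure is 98.5%.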

Error Analysis

Common Misclassifications:

Ambiguous Merchants:
- "Target" → Groceries or General Retail?
- Solution: Consider amount patterns (groceries typically <$200)
Multi-Purpose Vendors:
- "Amazon" → Electronics, Books, Groceries, etc.
- Solution: Use transaction amount and time-of-day features
Abbreviated Text:
- "WM SC" → Walmart Supercenter
- Solution: Add common abbreviations to keyword rules
Rare Categories:
- Pet care, hobby supplies (insufficient training data)
- Solution: Active learning to prioritize labeling rare categories
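The abbreviation fix is the easiest of these to prototype. Below is a rough sketch of expanding abbreviations before keyword matching; the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical and would need to be populated from real transaction data.

```python
import re

# Hypothetical abbreviation table; extend it with abbreviations seen in production
ABBREVIATIONS = {
    "wm": "walmart",
    "sc": "supercenter",
}

def expand_abbreviations(narration: str) -> str:
    """Lowercase the text, split it into alphanumeric tokens, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", narration.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

expanded = expand_abbreviations("WM SC #1234")
print(expanded)
```

Running the expansion before the keyword rules means "WM SC" can match an existing "walmart" keyword without adding a separate rule for every abbreviation.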

Production Deployment Considerations

Scalability

Current Architecture:

Local SQLite (MLflow)
Single-machine processing
Suitable for: <100K transactions/day

Production Architecture:

┌─────────────────┐
│   Data Lake     │
│   (S3/GCS)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Apache Airflow │
│  (Orchestrator) │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────┐
│   Kubernetes Cluster        │
│  ┌────────┐  ┌────────┐    │
│  │ Worker │  │ Worker │    │
│  │  Pod   │  │  Pod   │    │
│  └────────┘  └────────┘    │
└─────────────────────────────┘
         │
         ▼
┌─────────────────┐
│  PostgreSQL     │
│  (MLflow)       │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  Model Registry │
│  (MLflow)       │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│  REST API       │
│  (FastAPI)      │
└─────────────────┘

Deployment Steps

1. Containerization

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code
COPY src/ ./src/
COPY models/ ./models/

# Expose API port
EXPOSE 8000

# Run API server
CMD ["uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

2. REST API (FastAPI)

# src/api/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List
import mlflow

app = FastAPI(title="Transaction NER API")

# Defined at module level so /health can report "not loaded" instead of raising NameError
classifier = None

# Load model at startup
@app.on_event("startup")
async def load_model():
    global classifier

    # Load from MLflow Model Registry
    model_uri = "models:/TransactionNER/Production"
    classifier = mlflow.sklearn.load_model(model_uri)

    print("✓ Model loaded from MLflow")

class Transaction(BaseModel):
    narration: str
    amount: float

class ClassificationResult(BaseModel):
    narration: str
    category: str
    confidence: float
    method: str
    needs_review: bool

@app.post("/classify", response_model=ClassificationResult)
async def classify_transaction(transaction: Transaction):
    """Classify a single transaction."""
    try:
        result = classifier.classify_single(
            transaction.narration,
            transaction.amount
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/classify_batch", response_model=List[ClassificationResult])
async def classify_batch(transactions: List[Transaction]):
    """Classify multiple transactions."""
    try:
        import pandas as pd
        df = pd.DataFrame([t.dict() for t in transactions])
        results = classifier.classify_batch(df)
        return results.to_dict('records')
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": classifier is not None}

3. CI/CD Pipeline

# .github/workflows/deploy.yml
name: Deploy NER Pipeline

on:
  push:
    branches: [main]

jobs:
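  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          pip install -r requirements.txt
          pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        # Each job runs on a fresh runner, so dependencies must be installed again
        run: |
          pip install -r requirements.txt
          python src/python/train_model.py data/latest_transactions.csv

      - name: Register model
        run: |
          python scripts/register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      # Check out the repo so k8s/deployment.yaml exists on this runner
      - uses: actions/checkout@v4
      - name: Deploy to Kubernetes
        # Assumes cluster credentials for kubectl are already configured on the runner
        run: |
          kubectl apply -f k8s/deployment.yaml
          kubectl rollout status deployment/ner-api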

Monitoring & Alerting

Key Metrics to Track:

Classification Metrics:
- Coverage rate (target: >90%)
- Average confidence (target: >0.7)
- Unknown rate (target: <5%)
Performance Metrics:
- Latency (p95: <100ms)
- Throughput (transactions/second)
- Error rate (target: <0.1%)
Data Quality:
- Null values
- Text length distribution
- Amount outliers
Model Drift:
- Prediction distribution shift
- Confidence degradation over time
- New category emergence rate
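"Prediction distribution shift" can be measured in several ways. One simple option, shown here as a sketch rather than the metric this pipeline actually uses, is the total variation distance between the category shares of a baseline window and a recent window. The function names and sample windows are illustrative.

```python
from collections import Counter

def category_distribution(labels):
    """Share of predictions per category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(baseline_labels, current_labels):
    """Half the sum of absolute differences in category share (0 = identical, 1 = disjoint)."""
    p = category_distribution(baseline_labels)
    q = category_distribution(current_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Made-up windows of predicted categories
baseline = ["Groceries"] * 50 + ["Transport"] * 30 + ["Healthcare"] * 20
current = ["Groceries"] * 40 + ["Transport"] * 30 + ["Healthcare"] * 10 + ["Unknown"] * 20

shift = total_variation_distance(baseline, current)
print(f"distribution shift: {shift:.2f}")
```

In this example the shift is 0.20, driven mostly by the new "Unknown" share, which is the kind of change a drift alert is meant to surface.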

Alerting Rules:

# Example: Prometheus alerts
- alert: LowCoverageRate
  expr: ner_coverage_rate < 0.85
  for: 1h
  annotations:
    summary: "NER coverage dropped below 85%"

- alert: HighUnknownRate
  expr: ner_unknown_rate > 0.10
  for: 30m
  annotations:
    summary: "More than 10% transactions unclassified"

- alert: ModelDrift
  expr: abs(ner_prediction_dist_shift) > 0.15
  for: 24h
  annotations:
    summary: "Significant prediction distribution shift detected"

Retraining Strategy

Trigger Conditions:

Coverage drops below 85%
1000+ new transactions labeled
Scheduled monthly retraining
New categories identified

Retraining Pipeline:

def should_retrain():
    recent_metrics = get_recent_metrics(days=7)

    conditions = [
        recent_metrics['coverage'] < 0.85,
        count_new_labels() > 1000,
        days_since_last_training() > 30,
        len(discover_new_categories()) > 3
    ]

    return any(conditions)

if should_retrain():
    trigger_retraining_pipeline()

Future Enhancements

1. Active Learning

Intelligently select transactions for human labeling:

class ActiveLearner:
    def select_for_labeling(self, unlabeled_df, n=100):
        """
        Select most informative samples for labeling.

        Strategies:
        1. Uncertainty sampling (low confidence)
        2. Diversity sampling (cover feature space)
        3. High-value sampling (large amounts)
        """
        # Score each transaction
        scores = (
            0.4 * self.uncertainty_score(unlabeled_df) +
            0.3 * self.diversity_score(unlabeled_df) +
            0.3 * self.value_score(unlabeled_df)
        )

        # Attach the scores as a column, then select the top N
        return unlabeled_df.assign(score=scores).nlargest(n, 'score')
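The scoring methods themselves are not shown above. As one possible reading of "uncertainty sampling", here is a sketch that scores a single prediction by the normalized entropy of its class probabilities. This is an assumption about how `uncertainty_score` might work, not the project's implementation, and it operates on one probability vector rather than a DataFrame.

```python
import math

def uncertainty_score(probabilities):
    """Normalized entropy of a probability vector: 0 = fully confident, 1 = uniform."""
    k = len(probabilities)
    if k < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(k)

confident = uncertainty_score([0.97, 0.02, 0.01])
uncertain = uncertainty_score([0.34, 0.33, 0.33])
print(f"confident={confident:.3f} uncertain={uncertain:.3f}")
```

Transactions whose predicted probabilities are spread almost evenly across categories score close to 1 and would be prioritized for labeling.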

2. Deep Learning Integration

Replace TF-IDF + Random Forest with transformer models:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BERTClassifier:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased",
            num_labels=len(CATEGORIES)
        )

    def train(self, texts, labels):
        # Fine-tune BERT on transaction data
        # Better handling of context and semantics
        pass

Advantages:

Better semantic understanding
Transfer learning from pre-trained models
Handles typos and abbreviations better

Trade-offs:

Higher computational cost
Requires more training data
Less interpretable

3. Multi-Label Classification

Allow transactions to belong to multiple categories:

# Example: "Target - Groceries and Baby Items"
# Labels: ["Groceries", "Baby Items"]

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

classifier = MultiOutputClassifier(RandomForestClassifier())

4. Hierarchical Categories

Create category taxonomy:

Shopping
├── Groceries
│   ├── Produce
│   ├── Dairy
│   └── Meat
├── Household
│   ├── Cleaning
│   └── Paper Products
└── Personal Care
    ├── Hygiene
    └── Cosmetics
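A taxonomy like this can be stored as a nested dictionary, which makes it easy to roll a leaf prediction up to its parent categories. The sketch below mirrors the tree above; the `TAXONOMY` structure and `find_path` helper are illustrative and not part of the current codebase.

```python
# Nested dict mirroring the category tree; leaves map to empty dicts
TAXONOMY = {
    "Shopping": {
        "Groceries": {"Produce": {}, "Dairy": {}, "Meat": {}},
        "Household": {"Cleaning": {}, "Paper Products": {}},
        "Personal Care": {"Hygiene": {}, "Cosmetics": {}},
    }
}

def find_path(tree, target, prefix=()):
    """Return the list of categories from the root down to target, or None if absent."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return list(path)
        found = find_path(children, target, path)
        if found:
            return found
    return None

path = find_path(TAXONOMY, "Dairy")
print(" > ".join(path))
```

With the full path available, a report can aggregate at whichever level is useful, for example summing all "Groceries" subcategories.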

5. Time-Series Features

Incorporate temporal patterns:

# Features
- day_of_week: bool[7]
- is_weekend: bool
- hour_of_day: int
- days_since_last_similar: int
- frequency_this_month: int

# Example insight
# "Coffee shop purchases happen 90% on weekday mornings"
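The first three features in that list depend only on the transaction timestamp, so they can be derived directly. Here is a minimal sketch using the standard library; the function name is illustrative, and `days_since_last_similar` and `frequency_this_month` are left out because they need the account's transaction history.

```python
from datetime import datetime

def temporal_features(timestamp: datetime) -> dict:
    """Derive timestamp-only features for one transaction."""
    dow = timestamp.weekday()  # Monday = 0, Sunday = 6
    return {
        "day_of_week": [int(i == dow) for i in range(7)],  # one-hot encoding
        "is_weekend": dow >= 5,
        "hour_of_day": timestamp.hour,
    }

# Friday morning example
features = temporal_features(datetime(2026, 3, 13, 8, 30))
print(features)
```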

6. Merchant Database Integration

Enrich with external merchant data:

merchant_db = {
    "walmart": {
        "primary_category": "Groceries",
        "also_sells": ["Electronics", "Household", "Pharmacy"],
        "avg_ticket": 67.50
    }
}

# Use for ambiguous cases
if "walmart" in text and amount > 200:
    likely_category = "Electronics"
else:
    likely_category = "Groceries"

7. Explainable AI

Add interpretability for regulatory compliance:

import shap

explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X)

# Show why transaction was classified (illustrative values, sorted by contribution)
print("Top 3 reasons for 'Healthcare' classification:")
print("1. Contains 'pharmacy': +0.42")
print("2. Contains 'prescription': +0.35")
print("3. Amount $45: +0.18")

8. Real-Time Streaming

Process transactions as they occur:

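import json
from kafka import KafkaConsumer, KafkaProducer

# Broker address and JSON message format are assumptions
consumer = KafkaConsumer('transactions', bootstrap_servers='localhost:9092',
                         value_deserializer=lambda raw: json.loads(raw))
# KafkaProducer takes connection settings; the topic is passed to send()
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda obj: json.dumps(obj).encode('utf-8'))

for message in consumer:
    transaction = message.value
    classification = classifier.classify_single(
        transaction['narration'], transaction['amount'])
    producer.send('classified_transactions', classification)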

Conclusion

We've built a comprehensive, production-ready NER classification system. Here is what we took away from it.

Key Takeaways

1. Hybrid Approach Wins

Rule-based: 68.5% coverage, 0.08ms latency
ML-enhanced: 91.2% coverage, 1.2ms latency
Best of both: Fast + accurate

2. Financial Context Matters

Amount-weighted training improves accuracy on large transactions
Standard accuracy: 91.2%
Amount-weighted accuracy: 96.8%
Critical for financial applications

3. Continuous Learning Essential

New merchants appear constantly
Spending patterns change seasonally
Automated category discovery prevents manual maintenance
Retraining triggers keep model fresh

4. MLOps is Non-Negotiable

Experiment tracking: Compare model versions objectively
Model registry: Safe deployment with rollback capability
Pipeline orchestration: Reproducible, automated workflows
Monitoring: Catch drift before it impacts business

5. Cross-Language Integration Possible

R's statistical strengths + Python's ML ecosystem
Reticulate enables seamless interoperability
R Markdown provides superior reporting
Choose the right tool for each job

Real-World Impact

Before This System:

Manual categorization: 2-3 hours/day
Error rate: ~15%
New categories: Weeks to implement
No audit trail

After This System:

Automated categorization: Real-time
Error rate: ~8.8% (91.2% accuracy)
New categories: Suggested automatically
Complete MLflow audit trail

Business Value:

Time savings: ~500 hours/year
Improved accuracy: Better financial insights
Faster adaptation: New patterns caught within days
Compliance: Full model lineage and explainability

Lessons Learned

1. Start Simple, Iterate
We began with pure rule-based classification. Only after understanding failure modes did we add ML. This incremental approach:

Validated business logic early
Provided baseline metrics
Informed feature engineering
Built stakeholder trust

2. Data Quality > Model Complexity
The biggest improvements came from:

Better text normalization
Amount-weighted training
Domain-specific keywords

Not from switching to deep learning or ensemble methods.

3. Monitoring is Critical
Models degrade over time. We discovered:

Coverage drops 5-8% per quarter without retraining
New merchants cause 60% of classification errors
Seasonal patterns (holiday shopping) require awareness
Active monitoring caught issues before users noticed

4. Explainability Matters
Stakeholders wanted to understand "why":

Why was this healthcare, not groceries?
Which keywords triggered the classification?
What's the model's confidence?

Rule-based + feature importance provided this transparency.

5. Integration is Harder Than Training
Technical challenges:

R ↔ Python data type conversions
MLflow database migrations
ZenML pipeline debugging
Report generation automation

These took more time than model development. Plan accordingly.

Performance Optimization Tips

1. Vectorization

# Slow: Loop over transactions
for transaction in transactions:
    result = classify(transaction)

# Fast: Batch vectorization
X = vectorizer.transform(transactions['narration'])
results = classifier.predict(X)

Speedup: 50x

2. Compiled Regex

# Slow: Compile each time
re.search(r'\bpharmacy\b', text)

# Fast: Pre-compile
PHARMACY_PATTERN = re.compile(r'\bpharmacy\b', re.IGNORECASE)
PHARMACY_PATTERN.search(text)

Speedup: 3x

3. Smart Caching

from functools import lru_cache

@lru_cache(maxsize=10000)
def classify_cached(narration: str, amount: float):
    return classifier.classify_single(narration, amount)

Hit rate: ~40% in production
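A hit rate like this can be read straight from the cache statistics that `functools.lru_cache` exposes. The sketch below uses a stand-in `classify_stub` function (hypothetical, in place of the real classifier) and a handful of made-up calls to show the measurement.

```python
from functools import lru_cache

def classify_stub(narration: str, amount: float) -> str:
    # Hypothetical stand-in for classifier.classify_single
    return "Healthcare" if "pharmacy" in narration else "Unknown"

@lru_cache(maxsize=10000)
def classify_cached(narration: str, amount: float) -> str:
    return classify_stub(narration, amount)

calls = [
    ("cvs pharmacy", 45.0),
    ("uber ride", 12.5),
    ("cvs pharmacy", 45.0),
    ("cvs pharmacy", 45.0),
    ("uber ride", 12.5),
]
for narration, amount in calls:
    classify_cached(narration, amount)

info = classify_cached.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hits={info.hits} misses={info.misses} hit_rate={hit_rate:.0%}")
```

Note that the amount is part of the cache key, so the same narration with a different amount counts as a miss. Caching on the normalized narration alone would likely raise the hit rate, but only if the classification does not depend on the amount.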

4. Lazy Loading

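# Don't load the ML model if the rule-based result suffices
ml_model = None

def classify_lazy(narration, amount):
    global ml_model
    rule_result = rule_classify(narration, amount)  # rule_classify: placeholder for the rule-based step
    if rule_result["confidence"] > 0.7:
        return rule_result
    if ml_model is None:
        ml_model = load_model()  # loaded once, on the first low-confidence case
    return ml_model.classify_single(narration, amount)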

Common Pitfalls & Solutions

Pitfall 1: Overfitting to Training Data

Symptom: 98% train accuracy, 75% test accuracy
Solution: Cross-validation, regularization, simpler models
Our approach: max_depth=15, min_samples_split=10

Pitfall 2: Imbalanced Classes

Symptom: Model predicts "Groceries" for everything
Solution: class_weight='balanced', stratified sampling
Our approach: Amount-weighted sampling gives rare categories more influence

Pitfall 3: Feature Leakage

Symptom: Perfect accuracy in dev, terrible in production
Solution: Strict train/test separation, temporal validation
Our approach: Never use future data for past predictions
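Temporal validation is straightforward to sketch: split on a cutoff date instead of at random, so every test transaction is later than every training transaction. The helper name, cutoff, and sample rows below are illustrative only.

```python
from datetime import date

def temporal_split(rows, cutoff):
    """Train on everything before the cutoff date, test on the cutoff date and later."""
    ordered = sorted(rows, key=lambda r: r["date"])
    train = [r for r in ordered if r["date"] < cutoff]
    test = [r for r in ordered if r["date"] >= cutoff]
    return train, test

rows = [
    {"date": date(2026, 1, 5), "narration": "walmart grocery shopping"},
    {"date": date(2026, 2, 17), "narration": "uber ride to downtown"},
    {"date": date(2026, 1, 22), "narration": "cvs pharmacy prescription pickup"},
    {"date": date(2026, 3, 2), "narration": "payment to acme corp inv-2024-001"},
]

train, test = temporal_split(rows, cutoff=date(2026, 2, 1))
print(len(train), "train rows,", len(test), "test rows")
```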

Pitfall 4: Ignoring Edge Cases

Symptom: Works great on clean data, fails on real data
Solution: Test on production-like data, handle missing values
Our approach: Extensive text normalization, graceful degradation

Pitfall 5: Stale Models

Symptom: Accuracy slowly degrades over time
Solution: Monitoring, automated retraining triggers
Our approach: Weekly metrics review, monthly retraining

Code Snippets for Common Tasks

Add New Category:

# models/keyword_rules.yaml
Pet Care:
  keywords:
    - petco
    - petsmart
    - vet
    - veterinary
    - dog food
    - cat litter
  weight: 1.0
  aliases: ["veterinary", "animal care"]

Retrain Model:

# Pull latest labeled data
python scripts/fetch_labeled_data.py

# Retrain with new data
python src/python/train_model.py data/labeled_transactions.csv

# Evaluate performance
python scripts/evaluate_model.py

# Promote to production if metrics improve
python scripts/promote_model.py

Deploy New Version:

# Build Docker image (tagged with the registry prefix so it can be pushed)
docker build -t myregistry/ner-api:v2.0 .

# Push to registry
docker push myregistry/ner-api:v2.0

# Update Kubernetes deployment
kubectl set image deployment/ner-api ner-api=myregistry/ner-api:v2.0

# Monitor rollout
kubectl rollout status deployment/ner-api

Generate Report:

# In R console
source("src/R/generate_report.R")

generate_assessment_report(
  results_path = "data/processed/final_results.csv",
  metrics_path = "data/processed/metrics.json",
  output_file = "reports/weekly_report.html"
)

Resources & Further Reading

Books:

"Designing Data-Intensive Applications" - Martin Kleppmann
"Machine Learning Engineering" - Andriy Burkov
"Practical MLOps" - Noah Gift & Alfredo Deza

Documentation:

MLflow: https://mlflow.org/docs/latest/
ZenML: https://docs.zenml.io/
scikit-learn: https://scikit-learn.org/
Reticulate: https://rstudio.github.io/reticulate/

Papers:

"Attention is All You Need" (Transformers)
"BERT: Pre-training of Deep Bidirectional Transformers"
"Random Forests" - Leo Breiman

Courses:

Fast.ai: Practical Deep Learning
Andrew Ng: ML Engineering for Production (MLOps)
Made With ML: MLOps course

Repository Structure

Local_NER/
├── README.md
├── requirements.txt
├── .gitignore
├── Dockerfile
├── docker-compose.yml
│
├── data/
│   ├── raw/
│   │   └── transactions_*.csv
│   ├── processed/
│   │   ├── final_results.csv
│   │   ├── metrics.json
│   │   └── discovered_categories.json
│   └── sample_transactions.csv
│
├── models/
│   ├── keyword_rules.yaml
│   ├── ner_classifier.pkl
│   └── version_history/
│
├── src/
│   ├── python/
│   │   ├── __init__.py
│   │   ├── ner_classifier.py
│   │   ├── category_discovery.py
│   │   ├── feature_engineering.py
│   │   ├── train_model.py
│   │   └── utils.py
│   │
│   ├── R/
│   │   ├── data_prep.R
│   │   ├── python_integration.R
│   │   ├── generate_report.R
│   │   └── visualization.R
│   │
│   ├── pipelines/
│   │   ├── zenml_pipeline.py
│   │   └── airflow_dag.py
│   │
│   └── api/
│       ├── main.py
│       ├── models.py
│       └── routes.py
│
├── reports/
│   ├── assessment_report.Rmd
│   ├── assessment_report.html
│   └── templates/
│
├── tests/
│   ├── test_classifier.py
│   ├── test_discovery.py
│   └── test_pipeline.py
│
├── notebooks/
│   ├── exploration.ipynb
│   ├── error_analysis.ipynb
│   └── feature_importance.ipynb
│
├── scripts/
│   ├── setup_environment.sh
│   ├── generate_sample_data.py
│   ├── evaluate_model.py
│   └── promote_model.py
│
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
│
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
│
└── mlruns/
    └── (MLflow tracking data)

Quick Start Guide

1. Clone & Setup

git clone https://github.com/yourusername/Local_NER.git
cd Local_NER

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Generate Sample Data

python scripts/generate_sample_data.py

3. Run Pipeline

# Option 1: Python script
python src/python/train_model.py data/sample_transactions.csv

# Option 2: ZenML pipeline
python src/pipelines/zenml_pipeline.py data/sample_transactions.csv

4. View Results

# MLflow UI
mlflow ui

# Generate report (in R)
Rscript -e "source('src/R/generate_report.R'); generate_assessment_report()"

5. Make API Call

# Start API server
uvicorn src.api.main:app --reload

# Test classification
curl -X POST "http://localhost:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{"narration": "cvs pharmacy", "amount": 45.00}'

Troubleshooting

Issue: MLflow database locked

# Solution: Use PostgreSQL instead of SQLite
export MLFLOW_TRACKING_URI=postgresql://user:pass@localhost/mlflow

Issue: R can't find Python

# Solution: Explicitly set Python path
reticulate::use_python("/path/to/venv/bin/python", required = TRUE)

Issue: Out of memory during training

# Solution: Reduce feature dimensions or batch size
vectorizer = TfidfVectorizer(max_features=200)  # Down from 500

Issue: ZenML pipeline fails

# Solution: Clear cache and restart
zenml clean
zenml pipeline runs delete --all

Contributing

We welcome contributions! Areas for improvement:

Better text preprocessing
- Handle international characters
- Merchant name normalization
- Abbreviation expansion
Additional ML models
- LSTM for sequence modeling
- BERT for semantic understanding
- XGBoost for tabular features
Enhanced category discovery
- Hierarchical clustering
- Topic modeling (LDA)
- Graph-based approaches
Production features
- A/B testing framework
- Shadow deployment
- Canary releases
Documentation
- Video tutorials
- Architecture diagrams
- API documentation

License

MIT License - See LICENSE file for details.

Acknowledgments

MLflow Team: Excellent experiment tracking platform
ZenML Team: Making MLOps accessible
scikit-learn Contributors: Industry-standard ML library
R Community: Statistical computing excellence
Our Users: Invaluable feedback and feature requests

Final Thoughts

Building a production ML system is 10% model training and 90% everything else:

Data quality and preprocessing
Pipeline orchestration
Monitoring and alerting
Deployment and serving
Documentation and reporting

This project demonstrates a complete end-to-end system that addresses all these concerns. The hybrid rule-based + ML approach provides the best balance of:

Speed: Rule-based is fast for common cases
Accuracy: ML handles edge cases and learns from data
Interpretability: Keywords and feature importance are transparent
Adaptability: Unsupervised discovery finds new patterns
Maintainability: Clear separation of concerns, modular design

The key innovation is the progressive enhancement strategy: start with simple rules, add ML where needed, and continuously discover new patterns. This approach:

Reduces annotation burden (only label what rules miss)
Provides fast baseline performance
Improves gracefully with more data
Maintains explainability throughout

Full Repository: https://github.com/AkanimohOD19A/Named-Entity-Recognition

Remember: The best model is the one that's actually in production, providing value to users. Ship early, learn fast, improve continuously.

Building Production AI: A Three-Part MLOps Journey - Pt.2

Akan — Sun, 18 Jan 2026 17:57:32 +0000

Part 3: Deployment & Monitoring
"Production Deployment: UI, CI/CD, and Observability"
The Gist: We’ve built the engine and tested it on the track. Now, it’s time to open the showroom. In this final part, we aren’t just 'running code'—we’re launching a product. We’ll build a slick user interface, set up an automated 'safety net' (CI/CD) so we don't accidentally ship bugs, and install 'CCTV' (Monitoring) to make sure the AI stays healthy once it's out in the wild.

1. The Front Door: Gradio Application

Nobody wants to generate art by typing code into a terminal. We use Gradio to build a professional 'Front Door.' It’s a simple Python-based UI that lets users type a prompt and get an Adire masterpiece in seconds.

But here’s the pro secret: this app isn't just for the user. It’s wired into MLflow. Every time a user generates an image, the app 'telemeters' the performance data back to us. If a specific prompt is causing errors or taking 60 seconds to load, we’ll know immediately.

# app/gradio_app.py
import gradio as gr
from diffusers import StableDiffusionPipeline
import torch
import mlflow
from datetime import datetime

class InferenceApp:
    def __init__(self, model_path: str):
        # We wake up the brain and load our Adire LoRA
        self.pipe = self._load_model(model_path)
        mlflow.set_tracking_uri("../mlruns")
        mlflow.set_experiment("production_inference")

    def _load_model(self, model_path: str):
        # float16 is only reliable on GPU; fall back to float32 on CPU
        device = "cuda" if torch.cuda.is_available() else "cpu"
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",
            torch_dtype=torch.float16 if device == "cuda" else torch.float32
        )
        pipe.unet.load_attn_procs(model_path)
        return pipe.to(device)

    def generate(self, prompt: str, steps: int, guidance: float):
        """Generate image with full observability"""
        start = datetime.now()

        with mlflow.start_run(run_name="inference"):
            # We track exactly what the user asked for
            mlflow.log_params({"prompt": prompt, "steps": steps, "guidance": guidance})

            try:
                image = self.pipe(prompt, num_inference_steps=steps, guidance_scale=guidance).images[0]
                duration = (datetime.now() - start).total_seconds()

                # We log the 'health' of this specific request
                mlflow.log_metrics({"generation_time": duration, "prompt_length": len(prompt.split())})
                return image, f"✓ Generated in {duration:.2f}s"
            except Exception as e:
                mlflow.log_param("error", str(e))
                return None, f"✗ Error: {str(e)}"

    def launch(self):
        # The 'Layout' of our showroom
        with gr.Blocks() as demo:
            gr.Markdown("# Nigerian Adire Style Generator")
            with gr.Row():
                with gr.Column():
                    prompt = gr.Textbox(label="Prompt", placeholder="a nigerian_adire_style...")
                    steps = gr.Slider(20, 100, value=50, label="Steps")
                    guidance = gr.Slider(1, 15, value=7.5, label="Guidance")
                    btn = gr.Button("Generate", variant="primary")
                with gr.Column():
                    output = gr.Image(label="Generated Image")
                    status = gr.Textbox(label="Status")

            btn.click(fn=self.generate, inputs=[prompt, steps, guidance], outputs=[output, status])
        demo.launch(share=True)

2. Shipping the Goods: HuggingFace Deployment

HuggingFace is the 'App Store' for AI. Instead of just sending someone a file, we deploy our model weights there. This script doesn't just upload the model; it creates a Model Card. Think of this as the 'Instruction Manual' and 'Nutrition Label' for your AI—it tells people what it is, how to use it, and what its quality scores were during training.
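The deployment script itself isn't shown in this post, so here is a minimal sketch of what "upload plus Model Card" could look like. The function names `build_model_card` and `publish` are hypothetical, the card layout is just one reasonable choice, and the metric values must come from your own evaluation run. The upload uses `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub` and assumes you are already logged in with a token.

```python
from pathlib import Path


def build_model_card(repo_id: str, base_model: str, trigger: str, metrics: dict) -> str:
    """Return README.md text: YAML front matter plus short usage and evaluation sections."""
    metric_lines = "\n".join(f"- {name}: {value}" for name, value in sorted(metrics.items()))
    return (
        "---\n"
        f"base_model: {base_model}\n"
        "tags:\n- stable-diffusion\n- lora\n"
        "---\n\n"
        f"# {repo_id}\n\n"
        f"LoRA weights for `{base_model}`. Use the trigger phrase `{trigger}` in your prompt.\n\n"
        "## Evaluation\n\n"
        f"{metric_lines}\n"
    )


def publish(repo_id: str, weights_dir: str, card_text: str) -> None:
    """Write the card next to the weights and upload the whole folder."""
    # Imported lazily so the card builder works without huggingface_hub installed
    from huggingface_hub import HfApi

    Path(weights_dir, "README.md").write_text(card_text, encoding="utf-8")
    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=weights_dir, repo_id=repo_id, repo_type="model")
```

Keeping the card builder separate from the upload means you can review (and unit test) the 'Nutrition Label' before anything leaves your machine.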

3. The Safety Net: CI/CD Pipeline

In professional software, we don't just 'upload and pray.' We use a GitHub Action (CI/CD). Every time we update the code, this automated 'Robot' wakes up and:

Tests: Does the code even run?
Evaluates: Does the new model version still meet our 0.75 quality score?
Deploys: Only if everything is perfect does it push the update to the live app.

It's how you sleep soundly at night knowing a small typo won't crash your production service.
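The 'Evaluates' step boils down to a script whose exit code tells the workflow whether to continue. Here is a rough sketch, assuming the evaluation job writes its metrics to a JSON file with an `avg_quality` key; the file layout and the names `gate` and `main` are assumptions for illustration, not the project's actual script.

```python
import json

QUALITY_THRESHOLD = 0.75  # the quality bar quoted in this post


def gate(metrics: dict, threshold: float = QUALITY_THRESHOLD) -> int:
    """Return a process exit code: 0 lets the deploy job run, 1 blocks it."""
    quality = metrics.get("avg_quality")
    if quality is None:
        print("FAIL: metrics have no 'avg_quality' entry")
        return 1
    if quality < threshold:
        print(f"FAIL: avg_quality {quality:.3f} < {threshold}")
        return 1
    print(f"PASS: avg_quality {quality:.3f} >= {threshold}")
    return 0


def main(argv: list) -> int:
    """Hypothetical entry point: python check_gate.py metrics.json"""
    with open(argv[1], encoding="utf-8") as fh:
        return gate(json.load(fh))
```

A workflow step would call `sys.exit(main(sys.argv))`. Any non-zero exit code fails the step, so the deploy job that depends on it never starts.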

4. The CCTV: Production Monitoring

Once your model is live, it can 'drift.' Maybe users start using slang the model doesn't understand, or maybe the GPU starts slowing down. We build a Monitoring Dashboard that acts like a heart rate monitor for our AI. If the average generation time spikes above 30 seconds, the system sends us an Alert.

# monitoring/dashboard.py
from datetime import datetime, timedelta

import pandas as pd

class MonitoringDashboard:
    def check_degradation(self, df: pd.DataFrame) -> bool:
        """Alert if performance degrades"""
        recent = df[df["timestamp"] > datetime.now() - timedelta(hours=24)]
        avg_time = recent["generation_time"].mean()

        # If the model gets 'tired' (slow), we trigger an alarm
        if avg_time > 30:
            print(f"⚠️ ALERT: Avg generation time {avg_time:.2f}s > 30s SLA")
            return True
        return False

5. The Turbo Boost: Performance Optimization

Finally, we want our AI to be fast. In 2026, we have a few 'Cheat Codes' to speed up Stable Diffusion. By using torch.compile and xFormers, we can often double the speed of generation. We calculate our 'Speedup Ratio' as $S = t_{baseline} / t_{optimized}$, where $t$ is the time it takes to generate one image. If $S = 2.0$, your users are getting their art twice as fast, and you're paying half the price for GPU time. It's a win-win.

def optimize_pipeline(pipe):
    # torch.compile (PyTorch 2.0+) 'compiles' the math into a faster format
    pipe.unet = torch.compile(pipe.unet)
    # Reduces VRAM so we can run on cheaper hardware (at the cost of a little speed)
    pipe.enable_attention_slicing()
    # Memory-efficient attention from the xformers package; the gain depends on your GPU
    pipe.enable_xformers_memory_efficient_attention()
    return pipe
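To put a number on the speedup rather than trusting a feeling, you can time the pipeline before and after optimization. The helper below is a small sketch with hypothetical names (`time_call`, `speedup`); when you measure times, the ratio is baseline time divided by optimized time.

```python
import time
from statistics import median


def time_call(fn, repeats: int = 5, warmup: int = 1) -> float:
    """Median wall-clock seconds per call, after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()  # warmup absorbs one-off costs such as compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """S = baseline time / optimized time; S > 1 means the optimized path is faster."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds
```

The warmup call matters with torch.compile, because the first call includes compilation. On a GPU, the function you time should also call `torch.cuda.synchronize()` before it returns, since CUDA work is queued asynchronously and the timer would otherwise stop too early.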

We've finally completed the series; we’ve gone from a few images of Adire fabric to a production-ready AI system. It’s fast, it’s monitored, it’s automated, and most importantly, it’s ready for real users. You’re no longer just playing with AI; you’re an AI Engineer.

All the resources used here are available for free:
Huggingface Repository: https://huggingface.co/AfroLogicInsect/sd-lora-nigerian-adire
GitHub Repository: https://github.com/AkanimohOD19A/adire_mlops_poc
Gradio UI: https://huggingface.co/AfroLogicInsect/sd-adire-demo

Building Production AI: A Three-Part MLOps Journey - Pt.2

Akan — Sun, 18 Jan 2026 16:57:39 +0000

Part 2: Training & MLOps Pipeline

"From Data to Deployment: Building the Production Pipeline"

Now that we have the blueprint, it’s time to actually 'cook.' But here’s the thing: in production, you can’t just train a model once and hope for the best. You need a 'factory' that can do it over and over again perfectly. I spent my time setting up automated gates. If the AI creates something ugly, the system automatically 'fires' that version and refuses to deploy it. It’s like having a robot manager who never sleeps.

1. The Training Lab: Google Colab Setup

First things first: we need a place to work. Training AI is like running a marathon for a computer: it's exhausting. We use Google Colab because it gives us a free T4 GPU, which is the 'engine' we need to train our Adire model.

We start by gathering our tools. We're installing diffusers (the main engine), peft (our LoRA 'sticky note' tool), and bitsandbytes (a clever hack that lets us train big models on small GPUs).

# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
             accelerate==0.25.0 peft==0.7.1 bitsandbytes

# We need to make sure the GPU is actually awake and ready to work
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")

# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
# We don't need to write the training logic from scratch. 
# We're grabbing a proven script from the HuggingFace team.
# The script's tag should match the diffusers version pinned in Cell 1 (0.25.0)
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.25.0/examples/dreambooth/train_dreambooth_lora.py

# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
# This is where we tell the AI exactly what we want. 
# We're pointing it to our Adire images and telling it the "trigger word."
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4, # We 'save up' steps to act like a bigger batch
    "learning_rate": 1e-4, 
    "lr_scheduler": "constant",
    "max_train_steps": 800, # 800 iterations is usually the sweet spot
    "lora_rank": 4,
    "lora_alpha": 4,
    "seed": 42
}

# ========================================
# Cell 4: Ignition!
# ========================================
# We launch the training. This is where the magic happens.
!accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="{CONFIG['model']}" \
  --instance_data_dir="{CONFIG['instance_data_dir']}" \
  --output_dir="{CONFIG['output_dir']}" \
  --instance_prompt="{CONFIG['instance_prompt']}" \
  --resolution={CONFIG['resolution']} \
  --train_batch_size={CONFIG['train_batch_size']} \
  --gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
  --learning_rate={CONFIG['learning_rate']} \
  --lr_scheduler="{CONFIG['lr_scheduler']}" \
  --max_train_steps={CONFIG['max_train_steps']} \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
  --seed={CONFIG['seed']}

2. Tuning the Engine: Hyperparameter Analysis

You might wonder why I chose those specific numbers in the CONFIG. AI training is a bit like cooking: a pinch too much salt ruins the soup.

Learning Rate ($1e-4$): If this is too high, training becomes unstable, so the AI 'panics' and learns nothing. Too low, and it takes days to learn.
Effective Batch Size: We process one image at a time but accumulate gradients over four steps before each weight update, so the effective batch size is $1 \times 4 = 4$. It keeps the training stable without crashing the GPU memory.
LoRA Rank: A rank of 4 is lean and fast. If we went to 16, the file would be 4x bigger but wouldn't actually look much better. We're going for efficiency here.
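The batch-size arithmetic above is easy to sanity-check in a few lines. This sketch assumes each training step counted by `max_train_steps` is one optimizer update, which is how gradient accumulation is usually counted; the function names are made up for illustration.

```python
def effective_batch_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of images that contribute to each weight update."""
    return train_batch_size * gradient_accumulation_steps


def images_processed(max_train_steps: int, train_batch_size: int,
                     gradient_accumulation_steps: int) -> int:
    """Total images seen, assuming one optimizer update per training step."""
    return max_train_steps * effective_batch_size(train_batch_size, gradient_accumulation_steps)


# Values from the CONFIG in Cell 3
print(effective_batch_size(1, 4))      # images per weight update
print(images_processed(800, 1, 4))     # images seen over the whole run
```

With a small dataset, that total means every image is seen many times, which is one reason to keep an eye on the validation prompt for signs of overfitting.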

3. The Factory: Building the MLOps Pipeline

Now, we step away from the notebook and build a real software system. In a production environment, you don't want to manually copy-paste files. We use ZenML to build a conveyor belt.

Our pipeline has three main employees:

The Evaluator: Does the model actually create Adire patterns or is it just making noise?
The Promoter: The 'manager' who looks at the test scores and decides if this model is good enough for our customers.
The Deployer: The person who packs the model up and ships it to the cloud.

Step 1: The Evaluator (Quality Control)

This step loads our new model and asks it to draw a few pictures. We measure how fast it is and how well the images match our prompts. We log all these stats into MLflow so we have a permanent record of how this 'version' performed.

import time
from typing import Dict, List

import mlflow
from diffusers import StableDiffusionPipeline
from zenml import step

@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    # We load the brain (Stable Diffusion) and the 'notes' (our LoRA weights)
    pipe = StableDiffusionPipeline.from_pretrained(...)
    pipe.unet.load_attn_procs(model_path)

    times, scores = [], []
    for prompt in test_prompts:
        # We time the generation. In production, 'fast' is just as important as 'pretty.'
        start = time.time()
        image = pipe(prompt).images[0]
        times.append(time.time() - start)

        # We calculate a 'quality' score (using a tool called CLIP).
        # compute_clip_score is a helper that is not shown in this post.
        scores.append(compute_clip_score(image, prompt))

    # Average over all test prompts
    metrics = {"avg_time": sum(times) / len(times), "avg_quality": sum(scores) / len(scores)}
    mlflow.log_metrics(metrics) # Keep a receipt!
    return metrics

Step 2: The Promoter (The Decision Maker)

This is our automated 'Quality Gate.' We set strict rules: if the quality is below 0.75, or if it takes longer than 30 seconds to draw a picture, the model is 'fired.' If it passes, it gets promoted to 'Production' status.

from mlflow.tracking import MlflowClient

@step
def promote_model(metrics: Dict[str, float], thresholds: Dict[str, float],
                  model_name: str, version: str) -> bool:
    # Does it meet our standards?
    checks = {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["max_time"]
    }
    passed = all(checks.values())

    if passed:
        # If yes, we officially tag this version as 'Production' in our system
        client = MlflowClient()
        client.transition_model_version_stage(name=model_name, version=version, stage="Production")
        print("✓ Model promoted!")
    return passed

4. MLflow: The Project Diary

While the pipeline runs, MLflow is in the background taking notes on everything. Every loss value, every hyperparameter, and every test image is saved. If our model suddenly starts acting weird next week, we can look back at the 'diary' and see exactly what changed. It’s like having an infinite 'Undo' button for your entire AI project.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# We can literally ask MLflow: "Which version of the Adire model was the best?"
runs = client.search_runs(experiment_ids=["0"], order_by=["metrics.avg_quality DESC"])
print(f"Our champion model is: {runs[0].info.run_id}")

That's it! In one takeaway: We've moved from a single script mechanism to a factory. Our model is trained, tested, and vetted by an automated manager. We're not just building a model; we're building a system that can reliably produce many models.

NOTE/ASIDE: How practical this is depends on your compute budget: the smaller the model, the less compute each retraining run needs.

Building Production AI: A Three-Part MLOps Journey

Akan — Sun, 18 Jan 2026 16:56:35 +0000

Series Overview

A practical, code-heavy guide to building production machine learning systems using Stable Diffusion, LoRA fine-tuning, and open-source MLOps tools. We'll fine-tune on Nigerian adire patterns, but the architecture applies to any domain.

Tech Stack: Stable Diffusion 1.5, LoRA, Google Colab (T4 GPU), ZenML, MLflow, Gradio, HuggingFace Hub

The Gist is, I had this idea: what if I could teach an AI to 'understand' the intricate beauty of Nigerian Adire patterns? Normally, building an AI from scratch is like trying to build a car by smelting the steel yourself - it costs a fortune and takes forever. Then, there's the 'cheat code.' Instead of building the car, I took a world-class engine (Stable Diffusion) and added a custom 'tuning kit' (LoRA). That was the difference between spending $10,000 and spending $0.

Think of this in three stages. First, we have the Training Room (Google Colab), where the AI learns the Adire style. Then, the Assembly Line (ZenML/MLflow), which acts as our quality control to make sure the AI isn't just making digital soup. Finally, the Shop Front (Gradio), where people actually get to play with it.

So, in this 3-part series, I'd like to show you the blueprint of how I built a production-grade system without breaking the bank.

1. The Blueprint:

2. LoRA (Low-Rank Adaptation) Math:

# Standard fine-tuning: Update all parameters
W_new = W_original + ΔW  # ΔW is 2048×2048 = 4.2M params

# LoRA: Low-rank decomposition
W_new = W_original + A @ B
# A: 2048×4 = 8,192 params
# B: 4×2048 = 8,192 params
# Total: 16,384 params (0.4% of original!)

Why LoRA is a Game Changer - Normally, if you want to 'retrain' an AI, you have to move billions of tiny digital sliders, squeeze every ounce out of your GPU, and manage humongous storage. It’s exhausting for the computer. LoRA is like using a transparent sticky note. Instead of rewriting the whole book, we just write our Adire notes on the sticky note and slap it on top. In math terms, instead of updating the massive weight matrix $W$, we represent the change $\Delta W$ as the product of two much smaller matrices, $A$ and $B$:

$$W_{new} = W_{original} + (A \times B)$$

For a 2048×2048 layer at rank 4, this reduces our workload from 4.2 million trainable parameters down to about 16,000! That’s a 99.6% reduction in trainable parameters, with results that are usually close to full fine-tuning.

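import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank=4):
        super().__init__()
        # These are the 'small' matrices we actually train
        self.lora_A = nn.Parameter(torch.randn(in_dim, rank))
        # B starts at zero, so the update A @ B is zero before training begins
        self.lora_B = nn.Parameter(torch.zeros(rank, out_dim))
        self.scaling = 1.0 / rank  # same as alpha / rank with alpha = 1

    def forward(self, x):
        # Returns only the low-rank update; it gets added to the frozen layer's output
        return x @ (self.lora_A @ self.lora_B) * self.scaling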
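The parameter savings are simple arithmetic, so here is a quick check of the numbers quoted above. The helper names are made up for illustration; the 2048×2048 layer size and rank 4 come from the example in this post.

```python
def full_update_params(in_dim: int, out_dim: int) -> int:
    """Parameters touched when the whole weight matrix is updated."""
    return in_dim * out_dim


def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    """Parameters in A (in_dim x rank) plus B (rank x out_dim)."""
    return rank * (in_dim + out_dim)


full = full_update_params(2048, 2048)
lora = lora_params(2048, 2048, rank=4)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The LoRA count grows linearly with rank, which is why moving from rank 4 to rank 16 makes the adapter four times larger.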

3. The Economics:

"High Tech on a Low Budget" Here’s the part my 'business' friends love. If we did this the 'traditional' corporate way, we’d be burning $10k a year on servers. By using open-source tools and smart architecture, we brought that cost down to literally zero.

Now that we understand the complete system architecture, the mathematical foundations of LoRA, and why this approach costs a small fraction of traditional methods, we can go ahead and start building.

Let's Build a Voice RAG System That Actually Works 🎉

Akan — Thu, 28 Aug 2025 04:52:57 +0000

What We're Going to Build (And Why It's Pretty Cool)

Youtube Demo

You know how sometimes you wish you could just talk to your computer and have it actually understand what you're asking? Well, that's exactly what we're building today!

Picture this: You record yourself asking "Hey, what's machine learning all about?" and boom - your system transcribes what you said, searches through your documents AND the web, then gives you a smart answer back. Pretty neat, right?

The Magic Behind the Curtain ✨

Here's what our little system does:

Listens to your voice (using a fancy Whisper model)
Thinks about what you asked (searches your knowledge base)
Looks stuff up on the internet (because why not get fresh info?)
Puts it all together into a nice answer

And the best part? We're making it FAST by using your GPU properly. No more waiting around for 30 seconds while your model thinks!

Before We Dive In - Let's Get Ready! 🛠️

What You'll Need

Don't worry, this isn't going to break the bank:

A Google Colab account (the free one works fine!)
About 30 minutes of your time
A sense of curiosity (and maybe some coffee ☕)

Quick GPU Check

First things first - let's make sure we've got the good stuff:

import torch
print(f"🔥 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ Uh oh! No GPU detected. Go to Runtime → Change Runtime Type → T4 GPU")

If you see something like "Tesla T4" pop up, you're golden! 🎉

Step 1: Installing Our Toolbox 📦

Alright, time to grab all the cool libraries we need. Think of this as gathering ingredients before we start cooking:

# The core ML stuff (this is where the magic happens)
# Quote the version spec, otherwise the shell treats '>' as an output redirect
!pip install -q "transformers>=4.41.0" torch torchaudio --upgrade
!pip install -q accelerate bitsandbytes optimum

# For grabbing stuff from the web and making pretty interfaces
!pip install -q requests beautifulsoup4 gradio

# The smart search and audio processing bits
!pip install -q sentence-transformers datasets librosa soundfile

# Super-fast similarity search (the secret sauce!)
!pip install -q faiss-gpu

Pro tip: If faiss-gpu gives you trouble, just use faiss-cpu instead. It'll still work great!

Step 2: The Heart of Our System - The VoiceRAGT4 Class 💝

Now here's where things get interesting. We're building a class that's like a Swiss Army knife for voice processing:

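import faiss
import librosa
import numpy as np
import requests
import torch
from sentence_transformers import SentenceTransformer
from transformers import (AutoModelForSpeechSeq2Seq, AutoProcessor,
                          BitsAndBytesConfig, pipeline)

class VoiceRAGT4: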
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🚀 Using device: {self.device}")

        # These will hold our AI models
        self.speech_to_text = None
        self.embedder = None

        # Our knowledge base
        self.documents = []
        self.document_embeddings = None
        self.faiss_index = None

        # Let's set everything up!
        self.setup_models()

Think of this as setting up your workspace before you start a project. We're just getting everything organized!

Loading Our Models (The Fun Part!) 🤖

Here's where we load up our AI models. I've added some special sauce to make them run super fast on your T4:

def setup_models(self):
    print("🔧 Loading models with T4 optimizations...")

    # First, let's try to load your custom Whisper model
    try:
        model_id = "AfroLogicInsect/whisper-finetuned-float32"

        # This is the secret sauce - memory optimization!
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True  # 8-bit weights: roughly half the memory of float16
        )

        # Load the model
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id,
            torch_dtype=torch.float16,  # Faster inference
            low_cpu_mem_usage=True,     # Be nice to your RAM
            use_safetensors=True,       # Modern and safe
            quantization_config=bnb_config
        )

        # No model.to("cuda") here: a bitsandbytes-quantized model is already
        # placed on the GPU when it is loaded

        processor = AutoProcessor.from_pretrained(model_id)

        # Create our speech-to-text pipeline
        self.speech_to_text = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            max_new_tokens=128,
            chunk_length_s=30,  # Process in 30-second chunks
            batch_size=8,       # Sweet spot for T4
            torch_dtype=torch.float16,
            device=0 if self.device == "cuda" else -1,
        )

        print("✅ Your custom Whisper model loaded successfully!")

    except Exception as e:
        print(f"🤔 Hmm, couldn't load your custom model: {e}")
        print("🔄 No worries! Using the standard Whisper-small instead...")

        # Fallback option
        self.speech_to_text = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-small",
            torch_dtype=torch.float16,
            device=0 if self.device == "cuda" else -1,
            chunk_length_s=30,
            batch_size=8
        )

    # Now let's load our embedding model (this finds similar documents)
    print("🧠 Loading sentence transformer...")
    self.embedder = SentenceTransformer('all-MiniLM-L6-v2', device=self.device)

    # Make it even faster on GPU
    if self.device == "cuda":
        self.embedder.half()

    print("🎉 All models loaded and ready to rock!")

What's happening here? We're loading two main models:

Whisper - converts your speech to text
Sentence Transformer - understands the meaning of text (pretty cool, right?)

Step 3: Making Audio Sound Better 🎵

Before we feed audio to our model, let's clean it up a bit:

def preprocess_audio(self, audio_path):
    """Make audio sound nice for Whisper"""
    try:
        # Load audio at 16kHz (Whisper's favorite frequency)
        audio, sr = librosa.load(audio_path, sr=16000)

        # Normalize volume levels
        audio = librosa.util.normalize(audio)

        # Pre-emphasis: boost the high frequencies that carry consonants (this is not noise removal)
        audio = librosa.effects.preemphasis(audio)

        return audio
    except Exception as e:
        print(f"😅 Audio preprocessing hiccup: {e}")
        return None

This is like adjusting the microphone settings to make sure Whisper can hear you clearly!
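If you're curious what the pre-emphasis step does to the signal, it is a one-line filter: each sample has a fraction of the previous sample subtracted from it. Below is a simplified pure-Python sketch for illustration. It uses 0.97, the default coefficient in librosa, but it passes the first sample through unchanged, so the boundary handling is an assumption and the first value will not match librosa's output.

```python
def preemphasis(samples, coef=0.97):
    """y[n] = x[n] - coef * x[n-1]; the first sample is passed through unchanged."""
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coef * prev)
    return out


# A flat (low-frequency) signal is damped, while a rapidly alternating one is amplified
print(preemphasis([1.0, 1.0, 1.0], coef=0.5))
print(preemphasis([1.0, -1.0, 1.0], coef=0.5))
```

Slow-moving parts of the signal shrink and fast-changing parts grow, which is why this sharpens speech detail rather than removing background noise.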

The Speech-to-Text Magic ✨

Here's where we actually convert your voice to text:

def transcribe_audio(self, audio_path):
    """Turn your voice into text with GPU power!"""
    try:
        # Clean up the audio first
        audio = self.preprocess_audio(audio_path)
        if audio is None:
            return "Error: couldn't process that audio file"

        # Clear some GPU memory (being polite!)
        if self.device == "cuda":
            torch.cuda.empty_cache()

        # The actual transcription (with speed boost!)
        with torch.cuda.amp.autocast():  # Mixed precision = faster!
            result = self.speech_to_text(
                audio,
                generate_kwargs={
                    "max_new_tokens": 128,
                    "num_beams": 2,      # Good balance of speed vs quality
                    "do_sample": False,
                    "use_cache": True
                }
            )

        # Clean up after ourselves
        if self.device == "cuda":
            torch.cuda.empty_cache()

        return result["text"].strip()

    except Exception as e:
        print(f"😬 Transcription went sideways: {e}")
        return f"Error: {e}"

Step 4: Building Our Knowledge Base 📚

Now for the really cool part - teaching our system about stuff! We'll add documents and create a super-fast search index:

def add_documents_batch(self, documents, batch_size=32):
    """Add a bunch of documents and make them searchable"""
    print(f"📚 Processing {len(documents)} documents...")

    self.documents.extend(documents)

    # Process in batches to avoid memory issues
    all_embeddings = []

    for i in range(0, len(self.documents), batch_size):
        batch = self.documents[i:i+batch_size]
        print(f"🔄 Processing batch {i//batch_size + 1}...")

        # Convert text to numbers (embeddings) that capture meaning
        if self.device == "cuda":
            with torch.cuda.amp.autocast():
                batch_embeddings = self.embedder.encode(
                    batch,
                    batch_size=batch_size,
                    show_progress_bar=True,
                    normalize_embeddings=True
                )
        else:
            batch_embeddings = self.embedder.encode(
                batch,
                batch_size=batch_size,
                show_progress_bar=True,
                normalize_embeddings=True
            )

        all_embeddings.append(batch_embeddings)

        # Keep things tidy
        if self.device == "cuda":
            torch.cuda.empty_cache()

    # Combine all the embeddings
    self.document_embeddings = np.vstack(all_embeddings)

    # Create a super-fast search index
    try:
        dimension = self.document_embeddings.shape[1]

        if self.device == "cuda":
            # GPU-powered search! 🚀
            res = faiss.StandardGpuResources()
            self.faiss_index = faiss.GpuIndexFlatIP(res, dimension)
            print("🚀 Using GPU-accelerated FAISS - this is gonna be fast!")
        else:
            self.faiss_index = faiss.IndexFlatIP(dimension)
            print("🔧 Using CPU FAISS - still pretty quick!")

        # Add our embeddings to the index
        self.faiss_index.add(self.document_embeddings.astype(np.float32))

    except Exception as e:
        print(f"🤷‍♂️ FAISS setup hiccup: {e}")
        print("📝 No worries, we'll use a backup method!")
        self.faiss_index = None

    print(f"✅ Added {len(documents)} documents to the knowledge base!")

What's happening here? We're converting all your documents into "embeddings" - these are like fingerprints that capture the meaning of the text. Then we build a super-fast search index so we can find relevant documents in milliseconds!

Step 5: Web Search Integration 🌐

Sometimes we need fresh info from the internet. Let's add that capability:

def web_search(self, query, num_results=5):
    """Grab some fresh info from the web"""
    try:
        print(f"🌐 Searching the web for: {query}")

        # Using DuckDuckGo's Instant Answer API (it's free and doesn't track you!)
        # Passing params lets requests URL-encode the query safely
        response = requests.get(
            "https://api.duckduckgo.com/",
            params={"q": query, "format": "json", "no_html": 1, "skip_disambig": 1},
            timeout=5
        )
        data = response.json()

        results = []

        # Get the main abstract if available
        if data.get('Abstract'):
            results.append({
                'title': data.get('AbstractSource', 'DuckDuckGo')[:50],
                'content': data['Abstract'][:300],  # Keep it concise
                'url': data.get('AbstractURL', ''),
                'relevance': 1.0
            })

        # Get related topics
        for topic in data.get('RelatedTopics', [])[:num_results-1]:
            if isinstance(topic, dict) and topic.get('Text'):
                results.append({
                    'title': (topic.get('FirstURL', '').split('/')[-1] or 'Related')[:50],
                    'content': topic['Text'][:300],
                    'url': topic.get('FirstURL', ''),
                    'relevance': 0.8
                })

        print(f"📊 Found {len(results)} web results")
        return results

    except Exception as e:
        print(f"🤔 Web search didn't work out: {e}")
        return [{'title': 'Search Error', 'content': f'Search failed: {e}', 'url': '', 'relevance': 0}]
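One detail worth seeing in isolation: a spoken query can contain spaces, ampersands, or question marks, and those need to be URL-encoded before they go into a query string. Here is a small standard-library sketch (the function name `build_search_url` is made up) showing what a safe URL looks like for the same parameters used above.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Build the search URL with every parameter value percent-encoded."""
    params = {"q": query, "format": "json", "no_html": 1, "skip_disambig": 1}
    return "https://api.duckduckgo.com/?" + urlencode(params)


print(build_search_url("R&D budget"))
```

Without encoding, the `&` in "R&D budget" would cut the query off after "R" and turn the rest into a separate, meaningless parameter.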

Step 6: Lightning-Fast Document Search ⚡

Here's where the magic really happens - finding relevant documents super quickly:

def retrieve_documents_fast(self, query, k=5):
    """Find the most relevant documents lightning fast!"""
    if len(self.documents) == 0:
        print("📭 No documents in the knowledge base yet!")
        return []

    try:
        print(f"🔍 Searching for: {query}")

        # Clear GPU memory
        if self.device == "cuda":
            torch.cuda.empty_cache()

        # Convert the query to an embedding
        if self.device == "cuda":
            with torch.cuda.amp.autocast():
                query_embedding = self.embedder.encode([query], normalize_embeddings=True)
        else:
            query_embedding = self.embedder.encode([query], normalize_embeddings=True)

        results = []

        if self.faiss_index is not None:
            # Use our super-fast FAISS index!
            scores, indices = self.faiss_index.search(
                query_embedding.astype(np.float32),
                min(k, len(self.documents))
            )

            for i, score in zip(indices[0], scores[0]):
                if score > 0.25:  # Only keep relevant results
                    results.append({
                        'content': self.documents[i],
                        'score': float(score),
                        'index': int(i)
                    })
        else:
            # Fallback method (still pretty fast!)
            from sklearn.metrics.pairwise import cosine_similarity
            similarities = cosine_similarity(query_embedding, self.document_embeddings)[0]

            top_indices = np.argsort(similarities)[::-1][:k]

            for idx in top_indices:
                score = similarities[idx]
                if score > 0.25:
                    results.append({
                        'content': self.documents[idx],
                        'score': float(score),
                        'index': int(idx)
                    })

        print(f"📋 Found {len(results)} relevant documents")
        return results

    except Exception as e:
        print(f"😅 Document search hit a snag: {e}")
        return []
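To demystify what the index is doing, here is the same idea in plain Python with no FAISS and no GPU: normalize the vectors, take dot products, drop anything under the 0.25 cutoff, and keep the top k. It's a toy sketch with hypothetical names, and real embeddings have hundreds of dimensions instead of two.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, docs, k=5, min_score=0.25):
    """Return (index, score) pairs for the k most similar docs above min_score."""
    q = normalize(query)
    scored = []
    for idx, doc in enumerate(docs):
        score = sum(a * b for a, b in zip(q, normalize(doc)))
        if score > min_score:
            scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```

This is why the embeddings are normalized before they go into an inner-product index: on unit vectors, inner product and cosine similarity are the same number.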

Step 7: Putting It All Together 🎭

Now let's create the main function that orchestrates everything:

def process_voice_query_optimized(self, audio_file):
    """The main event - process a voice query end-to-end!"""
    from datetime import datetime

    start_time = datetime.now()
    print("🎬 Starting the voice RAG pipeline...")

    try:
        # Step 1: Speech to Text
        print("🎤 Converting speech to text...")
        stt_start = datetime.now()
        text_query = self.transcribe_audio(audio_file)
        stt_time = (datetime.now() - stt_start).total_seconds()
        print(f"📝 Got: '{text_query}' (took {stt_time:.2f}s)")

        if text_query.startswith("Error"):
            return text_query, "", f"Transcription failed in {stt_time:.2f}s"

        # Step 2: Search for relevant info (knowledge base first, then the web)
        print("🔍 Searching knowledge base and web...")
        search_start = datetime.now()

        # Find relevant documents
        retrieved_docs = self.retrieve_documents_fast(text_query, k=5)

        # Search the web too
        search_results = self.web_search(text_query, num_results=3)

        search_time = (datetime.now() - search_start).total_seconds()

        # Step 3: Generate a nice response
        print("💭 Crafting the perfect response...")
        response_start = datetime.now()
        response = self.generate_response_optimized(text_query, search_results, retrieved_docs)
        response_time = (datetime.now() - response_start).total_seconds()

        total_time = (datetime.now() - start_time).total_seconds()

        # Show off our performance! 
        perf_summary = f"""⚡ Performance Report:
• Speech Recognition: {stt_time:.2f}s
• Document Search: {search_time:.2f}s  
• Response Crafting: {response_time:.2f}s
• Total Time: {total_time:.2f}s
• Documents Found: {len(retrieved_docs)}
• Web Results: {len(search_results)}

🎯 That's {60/total_time:.1f} queries per minute!"""

        return text_query, response, perf_summary

    except Exception as e:
        error_time = (datetime.now() - start_time).total_seconds()
        print(f"💥 Something went wrong: {e}")
        return f"❌ System Error: {e}", "", f"Failed after {error_time:.2f}s"
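In the function above, the knowledge-base search and the web search run one after the other. Since the web call mostly waits on the network, the two can overlap. Here is a minimal sketch using a thread pool; `search_both` is a hypothetical helper, not part of the original class, and whether it helps in practice depends on how long each search takes.

```python
from concurrent.futures import ThreadPoolExecutor


def search_both(query, search_docs, search_web):
    """Run two search callables concurrently and return (doc_results, web_results)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        docs_future = pool.submit(search_docs, query)
        web_future = pool.submit(search_web, query)
        # result() blocks until each search finishes and re-raises any exception
        return docs_future.result(), web_future.result()
```

Inside the class, it could be called as `search_both(text_query, lambda q: self.retrieve_documents_fast(q, k=5), lambda q: self.web_search(q, num_results=3))`. The total wait then approaches the slower of the two searches instead of their sum.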

Creating Beautiful Responses ✨

Let's make our responses look really nice:

def generate_response_optimized(self, query, search_results, retrieved_docs):
    """Create a beautiful, informative response"""

    context_parts = []

    # Add our knowledge base results first (they're usually more reliable)
    if retrieved_docs:
        context_parts.append("📚 From Your Knowledge Base:")
        # Sort by relevance score
        retrieved_docs.sort(key=lambda x: x['score'], reverse=True)
        for doc in retrieved_docs[:3]:  # Top 3
            context_parts.append(f"• {doc['content'][:200]}... (confidence: {doc['score']:.2f})")

    # Add web search results
    if search_results:
        context_parts.append("\n🌐 Fresh from the Web:")
        for result in search_results[:3]:
            if result['content']:
                context_parts.append(f"• {result['title']}: {result['content'][:150]}...")

    context = "\n".join(context_parts)

    if not context.strip():
        return "🤷‍♂️ Hmm, I couldn't find much about that. Try asking something else or check if your knowledge base has relevant info!"

    # Put together a nice response
    response = f"""🎯 **You asked**: {query}

{context}

💡 **In a nutshell**: The information above covers the key aspects of your question. The knowledge base results are typically most reliable, while web results give you the latest info!"""

    return response

Step 8: Let's See It in Action! 🎮

Time to create our user interface and actually use this thing:

# Initialize our system
print("🚀 Starting up the Voice RAG system...")
voice_rag = VoiceRAGT4()

# Add some sample documents to get started
print("📚 Adding some AI knowledge to get started...")
ai_docs = [
    "Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems.",
    "Machine Learning is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.",
    "Deep Learning is a machine learning technique that teaches computers to learn by example, using neural networks with many layers.",
    "Natural Language Processing (NLP) helps computers understand, interpret and generate human language in a valuable way.",
    "Computer Vision enables machines to identify and analyze visual content in images and videos.",
    "Neural networks are computing systems inspired by biological neural networks that constitute animal brains.",
    "Large Language Models (LLMs) are AI models trained on vast amounts of text data to understand and generate human language.",
    "Transformer architecture is the foundation of modern language models, using attention mechanisms to process sequences.",
    "GPU acceleration significantly speeds up AI model training and inference through parallel processing capabilities.",
    "Fine-tuning allows pre-trained models to be adapted for specific tasks with smaller, domain-specific datasets."
]

voice_rag.add_documents_batch(ai_docs, batch_size=16)
print("✅ Knowledge base is ready!")

The User Interface 🎨

Now let's create a nice interface with Gradio:

def process_audio_interface(audio):
    """User-friendly wrapper for our voice processing"""
    if audio is None:
        return "Please record or upload an audio file! 🎤", "", "No audio provided"

    print("🎵 Processing your audio...")
    result = voice_rag.process_voice_query_optimized(audio)

    # Keep things tidy
    voice_rag.clear_gpu_memory()

    return result

# Create the interface
interface = gr.Interface(
    fn=process_audio_interface,
    inputs=gr.Audio(
        type="filepath",
        label="🎤 Record Your Question or Upload Audio",
        sources=["microphone", "upload"]
    ),
    outputs=[
        gr.Textbox(label="📝 What You Said", lines=3, max_lines=5),
        gr.Textbox(label="🤖 AI Response", lines=12, max_lines=20),
        gr.Textbox(label="⚡ Performance Stats", lines=8, max_lines=10)
    ],
    title="🎙️ Voice RAG System - Ask Me Anything!",
    description="""
    **Hey there! 👋** 

    This is your personal voice-powered AI assistant! Just record your voice or upload an audio file, 
    and I'll transcribe what you said, search through the knowledge base, grab fresh info from the web, 
    and give you a comprehensive answer.

    **Try asking about**:
    • Artificial Intelligence and Machine Learning
    • Technology concepts
    • General knowledge questions
    • Current events (I'll search the web!)

    **Pro tip**: Speak clearly and ask specific questions for the best results! 🎯
    """,
    theme=gr.themes.Soft(),
    allow_flagging="never"
)

print("🎉 Interface ready! Click the link to start chatting with your AI!")
interface.launch(share=True, debug=True)

Want to Test How Fast It Is? 🏃‍♂️

Let's add a fun benchmark to see how speedy our system really is:

def benchmark_system():
    """Let's see how fast this baby can go!"""
    test_queries = [
        "What is machine learning?",
        "How does deep learning work?", 
        "Explain artificial intelligence to me",
        "What are neural networks?",
        "How do transformers work in AI?"
    ]

    print("🏁 Starting the speed test!")
    total_times = []

    for i, query in enumerate(test_queries):
        print(f"\n🧪 Test {i+1}/{len(test_queries)}: '{query}'")
        start = datetime.now()

        # Run our pipeline (without audio since we're just testing speed)
        retrieved = voice_rag.retrieve_documents_fast(query, k=3)
        search_results = voice_rag.web_search(query, num_results=3)
        response = voice_rag.generate_response_optimized(query, search_results, retrieved)

        elapsed = (datetime.now() - start).total_seconds()
        total_times.append(elapsed)
        print(f"⏱️ Done in {elapsed:.2f} seconds!")

        voice_rag.clear_gpu_memory()  # Keep things clean

    avg_time = np.mean(total_times)
    print(f"\n📊 Speed Test Results:")
    print(f"🚀 Average query time: {avg_time:.2f} seconds")
    print(f"💨 Can handle {60/avg_time:.1f} queries per minute") 
    print(f"🎯 That's pretty darn fast for a full RAG system!")

# Run the benchmark!
benchmark_system()
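The benchmark measures elapsed time by subtracting `datetime.now()` values. For short intervals, the standard library's `time.perf_counter()` is the clock intended for this kind of measurement. Below is a small sketch of a reusable timer; `timed` is a hypothetical helper, and the `time.sleep` call is only a placeholder for the real pipeline calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the elapsed wall-clock seconds for a block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.setdefault(label, []).append(time.perf_counter() - start)


timings = {}
for query in ["What is machine learning?", "What are neural networks?"]:
    with timed("pipeline", timings):
        # Placeholder for the retrieve / search / respond calls
        time.sleep(0.01)

average = sum(timings["pipeline"]) / len(timings["pipeline"])
print(f"Average query time: {average:.3f}s over {len(timings['pipeline'])} runs")
```

Because the timing lives in a `finally` block, a query that raises still gets its duration recorded.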

Making It Your Own 🎨

Adding Your Own Documents

Want to teach your system about specific topics? Here's how:

def add_my_documents():
    """Add your own knowledge to the system"""

    # Replace these with your own content!
    my_docs = [
        "Your company's product information goes here",
        "Domain-specific knowledge for your field",
        "FAQ answers for common questions",
        "Technical documentation snippets",
        # Add as many as you want!
    ]

    # Only add if you've actually added content
    if "Your company's product information" not in my_docs[0]:
        voice_rag.add_documents_batch(my_docs, batch_size=16)
        print(f"🎉 Added {len(my_docs)} of your documents!")
    else:
        print("💡 Edit the my_docs list above to add your own content!")

# Uncomment this line when you've added your documents
# add_my_documents()

Loading Documents from Files

Want to load documents from PDFs or text files? Here's a helper:

def load_documents_from_files(file_paths):
    """Load documents from various file types"""
    documents = []

    for file_path in file_paths:
        try:
            if file_path.endswith('.txt'):
                with open(file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    # Split into chunks so they're not too long
                    chunks = [content[i:i+500] for i in range(0, len(content), 400)]
                    documents.extend(chunks)

            elif file_path.endswith('.pdf'):
                # You'd need to install PyPDF2: !pip install PyPDF2
                import PyPDF2
                with open(file_path, 'rb') as file:
                    reader = PyPDF2.PdfReader(file)
                    text = ""
                    for page in reader.pages:
                        # extract_text() can come back empty for image-only pages
                        text += page.extract_text() or ""
                    chunks = [text[i:i+500] for i in range(0, len(text), 400)]
                    documents.extend(chunks)

            print(f"📄 Loaded {file_path}")

        except Exception as e:
            print(f"😅 Couldn't load {file_path}: {e}")

    return documents

# Example usage:
# my_files = ["document1.txt", "manual.pdf", "faq.txt"]
# docs = load_documents_from_files(my_files)
# voice_rag.add_documents_batch(docs)
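The loader slices windows of 500 characters with a stride of 400, so consecutive chunks share 100 characters. A small sketch that makes that relationship explicit is shown here; `chunk_text` is a hypothetical helper and the sample string is made up for the demonstration.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # 500 - 100 = 400, the stride used in the loader
    return [text[i:i + size] for i in range(0, len(text), step)]


sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample)

print(len(chunks))                          # how many chunks were produced
print(chunks[0][-100:] == chunks[1][:100])  # do neighbours share the overlap?
print(len(chunks[-1]))                      # the last chunk may be shorter
```

Note that the final chunk can be much shorter than the rest, which may be worth filtering out before embedding.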

When Things Don't Go As Planned 🤔

Here are some common hiccups and how to fix them:

"Out of Memory" Errors

# If you get GPU memory errors, try these:

# 1. Reduce batch size
voice_rag.add_documents_batch(documents, batch_size=8)  # Instead of 32

# 2. Clear memory more often
torch.cuda.empty_cache()

# 3. Use CPU fallback
voice_rag.device = "cpu"  # Slower, but avoids GPU memory limits (models may need reloading to pick this up)
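Lowering the batch size by hand works, but the retry can also be automated. The sketch below halves the batch size whenever a batch fails with an out-of-memory error. `process_in_batches` and `fake_embed` are hypothetical names, and the exception type is a parameter: the default is Python's built-in `MemoryError`, while with PyTorch you would likely pass its CUDA out-of-memory exception instead (an assumption to check against your installed version).

```python
def process_in_batches(items, process_batch, batch_size=32, min_batch_size=1,
                       oom_error=MemoryError):
    """Process items in batches, halving the batch size whenever a batch runs out of memory."""
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(process_batch(batch))
            i += len(batch)
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, surface the error
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Out of memory, retrying with batch_size={batch_size}")
    return results


# Hypothetical worker that can only handle 8 items at a time
def fake_embed(batch):
    if len(batch) > 8:
        raise MemoryError("simulated OOM")
    return [len(text) for text in batch]


lengths = process_in_batches(["doc"] * 20, fake_embed, batch_size=32)
print(len(lengths))
```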

Audio Problems

# If audio processing fails:

def fix_audio_issues(audio_path):
    """Sometimes audio files need extra help"""
    try:
        # Try loading with different settings
        audio, sr = librosa.load(audio_path, sr=None)

        # Convert to 16kHz if needed
        if sr != 16000:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

        return audio
    except Exception as e:
        print(f"🎵 Audio trouble: {e}")
        return None

Model Loading Issues

# If models won't load:

def setup_simple_models(self):
    """Simpler model setup as fallback"""
    print("🔧 Using simpler model configuration...")

    # Use basic Whisper without fancy optimizations
    self.speech_to_text = pipeline(
        "automatic-speech-recognition", 
        "openai/whisper-tiny",  # Smallest, fastest model
        device=0 if torch.cuda.is_available() else -1
    )

    # Basic embeddings
    self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    print("✅ Basic setup complete!")

What's Next? 🚀

Congratulations! You've built a pretty awesome voice RAG system. Here are some fun ideas to make it even better:

1. Real-time Processing

Make it work with live microphone input so you can just talk to it continuously.

2. Multi-language Support

Add support for different languages by using multilingual Whisper and embedding models.

3. Better Document Processing

Add support for more file types (Word docs, PowerPoints, etc.) and better text chunking.

4. Conversation Memory

Make it remember what you talked about earlier in the conversation.

5. Custom Response Styles

Train it to respond in different styles (formal, casual, technical, etc.).

Wrapping Up 🎁

We've just built something pretty amazing! Your Voice RAG system can:

✅ Understand your speech
✅ Search through documents lightning-fast
✅ Grab fresh info from the web
✅ Give you intelligent, contextual answers
✅ Do it all really, really fast thanks to GPU optimization

The best part? This is just the beginning. You can customize it, add your own documents, integrate it into other systems, or just have fun asking it questions!

Remember: The more good documents we feed it, the smarter it gets. So start adding content that's relevant to what you want to ask about.

Now go forth and build something awesome! 🎉

P.S. - If you build something cool with this, I'd love to hear about it! And if you run into any weird issues, don't panic - that's just part of the fun of building AI systems. Happy coding!

Bonus Round: Cool Tricks and Advanced Features 🎪

Since you've made it this far, let me share some extra goodies that'll make your Voice RAG system even more impressive!

Memory Trick: Making It Remember Your Conversations 🧠

Want the system to remember what you talked about? Here's a simple way to add conversation memory:

class ConversationalVoiceRAG(VoiceRAGT4):
    def __init__(self):
        super().__init__()
        self.conversation_history = []  # Remember everything!
        self.max_history = 10  # Don't remember TOO much

    def process_with_memory(self, audio_file):
        """Process voice with conversation context"""
        # Get the current query
        text_query = self.transcribe_audio(audio_file)

        # Build context from conversation history
        context_query = self.build_contextual_query(text_query)

        # Process normally but with context
        retrieved_docs = self.retrieve_documents_fast(context_query, k=5)
        search_results = self.web_search(context_query, num_results=3)
        # Pass the original question so the response echoes what the user said, not the expanded query
        response = self.generate_response_optimized(text_query, search_results, retrieved_docs)

        # Remember this conversation
        self.conversation_history.append({
            'user': text_query,
            'assistant': response,
            'timestamp': datetime.now()
        })

        # Don't let memory get too long
        if len(self.conversation_history) > self.max_history:
            self.conversation_history.pop(0)

        return text_query, response

    def build_contextual_query(self, current_query):
        """Add conversation context to the query"""
        if not self.conversation_history:
            return current_query

        # Get the last few exchanges for context
        recent_context = self.conversation_history[-3:]  # Last 3 exchanges

        context_parts = []
        for exchange in recent_context:
            context_parts.append(f"Previously discussed: {exchange['user']}")

        contextual_query = f"""
        Current question: {current_query}

        Conversation context:
        {chr(10).join(context_parts)}

        Please answer considering this conversation history.
        """

        return contextual_query
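The class above trims its history with `pop(0)` once it grows past `max_history`. A `collections.deque` with `maxlen` does the same trimming automatically. This is a minimal sketch of that alternative; `ConversationMemory` is a hypothetical class, not part of the original code.

```python
from collections import deque
from datetime import datetime


class ConversationMemory:
    """Keep only the most recent exchanges; older ones are dropped automatically."""

    def __init__(self, max_history=10):
        self.history = deque(maxlen=max_history)

    def remember(self, user_text, assistant_text):
        self.history.append({
            "user": user_text,
            "assistant": assistant_text,
            "timestamp": datetime.now(),
        })

    def recent_questions(self, n=3):
        """Return the last n user questions, oldest first."""
        return [exchange["user"] for exchange in list(self.history)[-n:]]


memory = ConversationMemory(max_history=3)
for i in range(5):
    memory.remember(f"question {i}", f"answer {i}")

print(memory.recent_questions(2))
```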

Multi-Language Magic 🌍

Want to understand different languages? Here's how to make it multilingual:

def setup_multilingual_models(self):
    """Support multiple languages like a boss!"""
    print("🌍 Setting up multilingual support...")

    # Use multilingual Whisper
    self.speech_to_text = pipeline(
        "automatic-speech-recognition",
        "openai/whisper-large",  # Supports 99 languages!
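        # Half precision only makes sense on GPU; fall back to float32 on CPU
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,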
        device=0 if torch.cuda.is_available() else -1,
        return_timestamps=True
    )

    # Multilingual embeddings
    self.embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    print("✅ Now I can understand many languages!")

def detect_language(self, text):
    """Figure out what language someone is speaking"""
    # Simple language detection
    try:
        from langdetect import detect
        language = detect(text)
        print(f"🗣️ Detected language: {language}")
        return language
    except Exception:  # langdetect not installed, or text too short to classify
        return "unknown"

def respond_in_language(self, response, target_language):
    """Respond in the same language as the user"""
    if target_language == "en":
        return response

    # You could integrate with translation APIs here
    print(f"💬 Would translate response to: {target_language}")
    return response + f"\n\n(Response in {target_language} would go here)"

Real-Time Voice Processing 🎙️

Want to make it work with live audio? Here's a simple real-time version:

import pyaudio
import threading
import queue
import time

class RealTimeVoiceRAG(VoiceRAGT4):
    def __init__(self):
        super().__init__()
        self.audio_queue = queue.Queue()
        self.is_listening = False
        self.audio_buffer = []

    def start_listening(self):
        """Start listening for voice input"""
        print("🎤 Starting real-time listening...")

        self.is_listening = True

        # Start audio capture thread
        audio_thread = threading.Thread(target=self.capture_audio, daemon=True)
        audio_thread.start()

        # Start processing thread
        process_thread = threading.Thread(target=self.process_audio_stream, daemon=True)
        process_thread.start()

        print("👂 I'm listening! Say something...")

    def capture_audio(self):
        """Capture audio from microphone"""
        try:
            p = pyaudio.PyAudio()
            stream = p.open(
                format=pyaudio.paFloat32,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=1024
            )

            while self.is_listening:
                data = stream.read(1024, exception_on_overflow=False)
                self.audio_queue.put(data)

            stream.stop_stream()
            stream.close()
            p.terminate()

        except Exception as e:
            print(f"🎵 Audio capture error: {e}")

    def process_audio_stream(self):
        """Process audio in real-time"""
        while self.is_listening:
            try:
                # Collect ~3 seconds of audio (48 buffers x 1024 frames at 16 kHz)
                audio_chunk = []
                for _ in range(48):
                    try:
                        # Block briefly so the loop waits for audio instead of skipping ahead
                        audio_chunk.append(self.audio_queue.get(timeout=0.5))
                    except queue.Empty:
                        break

                if audio_chunk:
                    # Convert to numpy array
                    audio_data = np.frombuffer(b''.join(audio_chunk), dtype=np.float32)

                    # Simple voice activity detection
                    if np.max(np.abs(audio_data)) > 0.01:  # Adjust threshold as needed
                        print("🗣️ Voice detected, processing...")
                        # Process the audio chunk
                        # (You'd save this to a temp file and process it)

                time.sleep(0.1)  # Small delay

            except Exception as e:
                print(f"🤔 Processing error: {e}")

    def stop_listening(self):
        """Stop real-time processing"""
        self.is_listening = False
        print("🛑 Stopped listening")

# Usage:
# real_time_rag = RealTimeVoiceRAG()
# real_time_rag.start_listening()
# # Let it run for a while...
# real_time_rag.stop_listening()
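The processing loop gathers 48 buffers of 1024 frames, which is about 3 seconds at 16 kHz, and then checks the peak amplitude against a threshold. The sketch below spells out that arithmetic and swaps in RMS energy as the loudness measure, which tends to be less sensitive to a single loud click than the peak value. All function names here are hypothetical, and the threshold is an assumption you would need to tune for your microphone.

```python
import math
from array import array


def buffers_for_duration(seconds, rate=16000, frames_per_buffer=1024):
    """Number of buffers needed to cover at least `seconds` of audio."""
    return math.ceil(seconds * rate / frames_per_buffer)


def rms_energy(raw_bytes):
    """Root-mean-square amplitude of native-endian float32 PCM bytes."""
    samples = array("f")
    samples.frombytes(raw_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_voice(raw_bytes, threshold=0.01):
    """Very rough voice-activity check; the threshold is an assumption to tune."""
    return rms_energy(raw_bytes) > threshold


# Synthetic test signals: pure silence and a 440 Hz tone
silence = array("f", [0.0] * 1024).tobytes()
tone = array("f", [0.5 * math.sin(2 * math.pi * 440 * n / 16000)
                   for n in range(1024)]).tobytes()

print(buffers_for_duration(3))   # buffers needed for at least 3 s
print(48 * 1024 / 16000)         # seconds covered by 48 buffers
print(has_voice(silence), has_voice(tone))
```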

Smart Document Chunking 📄

Here's a smarter way to split your documents that preserves meaning:

def smart_chunk_documents(self, text, chunk_size=500, overlap=50):
    """Split text intelligently, keeping related sentences together"""
    import re

    # Split into sentences first, keeping the closing punctuation with each sentence
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # If adding this sentence would exceed chunk size
        if len(current_chunk) + len(sentence) > chunk_size:
            if current_chunk:
                chunks.append(current_chunk.strip())

                # Start new chunk with overlap
                words = current_chunk.split()
                overlap_text = " ".join(words[-overlap:]) if len(words) > overlap else current_chunk
                current_chunk = overlap_text + " " + sentence
            else:
                current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence

    # Don't forget the last chunk!
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def load_and_chunk_file(self, file_path):
    """Load a file and chunk it smartly"""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()

        chunks = self.smart_chunk_documents(content, chunk_size=400, overlap=30)
        print(f"📄 Split {file_path} into {len(chunks)} smart chunks")

        return chunks
    except Exception as e:
        print(f"😅 Couldn't process {file_path}: {e}")
        return []

Performance Dashboard 📊

Want to see detailed performance metrics? Here's a cool dashboard:

import matplotlib.pyplot as plt
from collections import defaultdict
import time

class PerformanceTracker:
    def __init__(self):
        self.metrics = defaultdict(list)
        self.query_history = []

    def track_query(self, query, transcription_time, search_time, response_time, total_docs, web_results):
        """Track performance for each query"""
        total_time = transcription_time + search_time + response_time

        self.metrics['transcription_times'].append(transcription_time)
        self.metrics['search_times'].append(search_time)
        self.metrics['response_times'].append(response_time)
        self.metrics['total_times'].append(total_time)
        self.metrics['docs_found'].append(total_docs)
        self.metrics['web_results'].append(web_results)

        self.query_history.append({
            'query': query,
            'total_time': total_time,
            'timestamp': time.time()
        })

    def show_performance_dashboard(self):
        """Create a cool performance visualization"""
        if not self.metrics['total_times']:
            print("📊 No performance data yet!")
            return

        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))

        # Response time distribution
        ax1.hist(self.metrics['total_times'], bins=20, alpha=0.7, color='skyblue')
        ax1.set_title('🚀 Response Time Distribution')
        ax1.set_xlabel('Time (seconds)')
        ax1.set_ylabel('Frequency')

        # Performance over time
        ax2.plot(self.metrics['total_times'], marker='o', color='orange')
        ax2.set_title('⏱️ Performance Over Time')
        ax2.set_xlabel('Query Number')
        ax2.set_ylabel('Time (seconds)')

        # Average time spent in each pipeline stage
        ax3.bar(['Transcription', 'Search', 'Response'], 
                [np.mean(self.metrics['transcription_times']),
                 np.mean(self.metrics['search_times']),
                 np.mean(self.metrics['response_times'])], 
                color=['lightcoral', 'lightgreen', 'lightblue'])
        ax3.set_title('⚡ Average Time by Component')
        ax3.set_ylabel('Time (seconds)')

        # Success rate
        success_rate = len([t for t in self.metrics['total_times'] if t < 5]) / len(self.metrics['total_times']) * 100
        ax4.pie([success_rate, 100-success_rate], labels=['Fast (<5s)', 'Slow (≥5s)'], 
                colors=['lightgreen', 'lightcoral'], autopct='%1.1f%%')
        ax4.set_title('🎯 Speed Success Rate')

        plt.tight_layout()
        plt.show()

        # Print summary stats
        print(f"""
📈 Performance Summary:
• Average total time: {np.mean(self.metrics['total_times']):.2f}s
• Fastest query: {min(self.metrics['total_times']):.2f}s
• Slowest query: {max(self.metrics['total_times']):.2f}s
• Success rate (<5s): {success_rate:.1f}%
• Total queries processed: {len(self.metrics['total_times'])}
        """)

# Add to your VoiceRAG class:
def __init__(self):
    # ... existing init code ...
    self.performance_tracker = PerformanceTracker()

# In your process_voice_query_optimized method, add:
# self.performance_tracker.track_query(text_query, stt_time, search_time, response_time, len(retrieved_docs), len(search_results))

# Then you can view stats:
# voice_rag.performance_tracker.show_performance_dashboard()
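The dashboard reports the mean, fastest and slowest query plus the share under 5 seconds. If you also want a median and a tail percentile without opening a plot, the standard `statistics` module covers it. This is a rough sketch; `summarize_latencies` is a hypothetical helper and the sample timings are made up for illustration, not measured results.

```python
import statistics


def summarize_latencies(total_times, fast_threshold=5.0):
    """Summarize per-query latencies (seconds) without plotting."""
    if not total_times:
        raise ValueError("no latencies recorded")
    fast = sum(1 for t in total_times if t < fast_threshold)
    summary = {
        "mean": statistics.mean(total_times),
        "median": statistics.median(total_times),
        "fastest": min(total_times),
        "slowest": max(total_times),
        "fast_share_pct": 100.0 * fast / len(total_times),
    }
    # quantiles() needs at least two data points
    if len(total_times) >= 2:
        summary["p95"] = statistics.quantiles(total_times, n=20)[-1]
    return summary


# Made-up sample values, for illustration only
sample_times = [1.2, 1.5, 0.9, 6.1, 2.0]
stats = summarize_latencies(sample_times)
print(f"median={stats['median']:.2f}s, under 5s: {stats['fast_share_pct']:.0f}%")
```

You could feed it `self.metrics['total_times']` from the tracker above.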

Adding Personality 🤖

Want to give your AI some personality? Here's how:

def generate_response_with_personality(self, query, search_results, retrieved_docs, personality="helpful"):
    """Generate responses with different personalities"""

    personalities = {
        "helpful": {
            "greeting": "🤔 Let me help you with that!",
            "tone": "friendly and informative",
            "emoji": "✨"
        },
        "enthusiastic": {
            "greeting": "🎉 Oh, that's a GREAT question!",
            "tone": "excited and energetic", 
            "emoji": "🚀"
        },
        "scholarly": {
            "greeting": "📚 An interesting inquiry indeed.",
            "tone": "academic and thorough",
            "emoji": "🎓"
        },
        "casual": {
            "greeting": "👋 Hey! So you want to know about",
            "tone": "relaxed and conversational",
            "emoji": "😊"
        }
    }

    style = personalities.get(personality, personalities["helpful"])

    # Build context as before...
    context_parts = []
    if retrieved_docs:
        context_parts.append("📚 From what I know:")
        for doc in retrieved_docs[:3]:
            context_parts.append(f"• {doc['content'][:200]}...")

    if search_results:
        context_parts.append("\n🌐 Fresh from the web:")
        for result in search_results[:3]:
            if result['content']:
                context_parts.append(f"• {result['content'][:150]}...")

    context = "\n".join(context_parts)

    if personality == "enthusiastic":
        response = f"""{style['greeting']} {query}

{context}

{style['emoji']} This is SO cool - there's tons of great info about this topic! Hope this helps fuel your curiosity!"""

    elif personality == "scholarly":
        response = f"""{style['greeting']}

Based on my analysis of the available sources:

{context}

{style['emoji']} In conclusion, the evidence suggests these are the key considerations regarding your inquiry."""

    elif personality == "casual":
        response = f"""{style['greeting']} {query.lower()}? 

Here's the deal:

{context}

{style['emoji']} Hope that clears things up! Let me know if you want me to dig deeper into any part of this."""

    else:  # helpful (default)
        response = f"""{style['greeting']}

{context}

{style['emoji']} I hope this information helps answer your question! Feel free to ask if you need clarification on anything."""

    return response

# Usage:
# response = voice_rag.generate_response_with_personality(query, search_results, retrieved_docs, personality="enthusiastic")
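The `if`/`elif` chain above needs a new branch for every personality. One alternative is to keep the whole response template in the dictionary and fill it with `str.format`. The sketch below shows the idea; `PERSONALITY_TEMPLATES` and `render_response` are hypothetical names, and the templates are shortened versions of the ones above.

```python
PERSONALITY_TEMPLATES = {
    "helpful": (
        "🤔 Let me help you with that!\n\n{context}\n\n"
        "✨ I hope this information helps answer your question!"
    ),
    "scholarly": (
        "📚 An interesting inquiry indeed.\n\n"
        "Based on my analysis of the available sources:\n\n{context}\n\n"
        "🎓 These are the key considerations regarding your inquiry."
    ),
    "casual": (
        "👋 Hey! So you want to know about {query_lower}?\n\n"
        "Here's the deal:\n\n{context}\n\n"
        "😊 Hope that clears things up!"
    ),
}


def render_response(query, context, personality="helpful"):
    """Fill the template for the chosen personality, defaulting to 'helpful'."""
    template = PERSONALITY_TEMPLATES.get(personality, PERSONALITY_TEMPLATES["helpful"])
    # str.format ignores unused keyword arguments, so each template picks what it needs
    return template.format(query=query, query_lower=query.lower(), context=context)


print(render_response("Neural Networks", "• Layers of connected units", personality="casual"))
```

Adding a personality then means adding one dictionary entry rather than another branch.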

Web Interface Upgrade 🌐

Want a fancier web interface? Here's an enhanced Gradio setup:

def create_advanced_interface():
    """Create a more sophisticated interface"""

    with gr.Blocks(title="🎙️ Voice RAG Pro", theme=gr.themes.Soft()) as interface:
        gr.Markdown("# 🎙️ Voice RAG System Pro")
        gr.Markdown("Ask me anything using your voice! I'll search my knowledge base and the web to give you comprehensive answers.")

        with gr.Row():
            with gr.Column(scale=2):
                audio_input = gr.Audio(
                    label="🎤 Your Question",
                    sources=["microphone", "upload"],
                    type="filepath"
                )

                # Settings panel
                with gr.Accordion("⚙️ Settings", open=False):
                    personality = gr.Dropdown(
                        choices=["helpful", "enthusiastic", "scholarly", "casual"],
                        value="helpful",
                        label="🤖 AI Personality"
                    )

                    search_web = gr.Checkbox(
                        value=True,
                        label="🌐 Search Web"
                    )

                    max_docs = gr.Slider(
                        minimum=1,
                        maximum=10,
                        value=5,
                        step=1,
                        label="📚 Max Documents to Retrieve"
                    )

                submit_btn = gr.Button("🚀 Process Voice", variant="primary")

            with gr.Column(scale=3):
                transcription_output = gr.Textbox(
                    label="📝 What You Said",
                    lines=3,
                    max_lines=5
                )

                response_output = gr.Textbox(
                    label="🤖 AI Response",
                    lines=15,
                    max_lines=25
                )

                with gr.Accordion("📊 Performance & Debug", open=False):
                    performance_output = gr.Textbox(
                        label="⚡ Performance Metrics",
                        lines=8
                    )

        # Example questions to try (shown as text, because the input component expects audio files)
        gr.Markdown("### 🎯 Try These Examples:")

        example_queries = [
            "What is machine learning?",
            "How do neural networks work?",
            "Explain artificial intelligence",
            "What's the latest in AI research?"
        ]

        gr.Markdown("\n".join(f"- {q}" for q in example_queries))

        def process_with_settings(audio, personality, search_web, max_docs):
            """Process audio with custom settings"""
            if audio is None:
                return "Please record or upload audio!", "", ""

            # Your existing processing code here, but with the settings
            # This is where you'd modify the pipeline based on user preferences

            result = voice_rag.process_voice_query_optimized(audio)
            return result

        submit_btn.click(
            process_with_settings,
            inputs=[audio_input, personality, search_web, max_docs],
            outputs=[transcription_output, response_output, performance_output]
        )

        # Auto-submit when audio is uploaded
        audio_input.change(
            process_with_settings,
            inputs=[audio_input, personality, search_web, max_docs],
            outputs=[transcription_output, response_output, performance_output]
        )

    return interface

# Launch the advanced interface
# advanced_interface = create_advanced_interface()
# advanced_interface.launch(share=True, debug=True)
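In the interface above, `process_with_settings` receives `personality`, `search_web` and `max_docs` but only forwards the audio. Here is a small sketch of how the last two settings could steer the pipeline. `answer_with_settings` and the stand-in functions are hypothetical, and the call shapes are assumed to match `retrieve_documents_fast(query, k=...)` and `web_search(query, num_results=...)` from earlier.

```python
def answer_with_settings(query, retrieve, web_search, search_web=True, max_docs=5):
    """Run retrieval and (optionally) web search according to the UI settings."""
    # The slider value may arrive as a float, so coerce it before using it as a count
    docs = retrieve(query, k=int(max_docs))
    web_results = web_search(query, num_results=3) if search_web else []
    return docs, web_results


# Hypothetical stand-ins so the sketch runs on its own
def fake_retrieve(query, k):
    return [{"content": f"doc {i} for {query}", "score": 1.0 - i / 10} for i in range(k)]


def fake_web_search(query, num_results):
    return [{"title": f"hit {i}", "content": query} for i in range(num_results)]


docs, web = answer_with_settings("what is RAG?", fake_retrieve, fake_web_search,
                                 search_web=False, max_docs=2.0)
print(len(docs), len(web))
```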

Final Pro Tips 🎯

Here are some insider secrets to make your system even better:

Batch Everything: Always process multiple items together when possible - it's way more efficient!
Cache Smart: Save frequently used embeddings and search results to avoid recomputing.
Monitor GPU Memory: Keep an eye on torch.cuda.memory_allocated() - clear cache when it gets too high.
Use Async: For web searches, use asyncio to make multiple requests simultaneously.
Quality Over Quantity: Better to have 100 high-quality documents than 1000 poor ones.
Test with Real Users: Your system might work perfectly for you right now but confuse others - test it!
Keep Learning: The AI field moves fast - stay updated with new models and techniques.
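To make the "Cache Smart" tip concrete, here is a minimal sketch using `functools.lru_cache` from the standard library. `expensive_embed` is a hypothetical stand-in for a real embedding model call; the point is only that repeated queries skip the model.

```python
from functools import lru_cache

calls = {"count": 0}


def expensive_embed(text):
    """Stand-in for a real embedding model call (hypothetical)."""
    calls["count"] += 1
    return tuple(float(ord(c)) for c in text[:4])


@lru_cache(maxsize=1024)
def cached_embed(text):
    # Returns an immutable tuple so cached values cannot be modified by callers
    return expensive_embed(text)


for q in ["what is ai", "what is ml", "what is ai"]:
    cached_embed(q)

print(calls["count"])             # model calls actually made
print(cached_embed.cache_info())  # hits and misses recorded by the cache
```

The cache key is the exact string, so normalizing queries (lowercasing, stripping whitespace) before the call would likely raise the hit rate.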

Full script available here: https://github.com/AkanimohOD19A/Voice-RAG-v1

The End... Or Is It? 🎬

You've now got a seriously impressive Voice RAG system that can:

🎤 Understand speech in multiple languages
🧠 Remember conversations
⚡ Process queries lightning-fast
📊 Track its own performance
🤖 Have different personalities
🌐 Search the web intelligently
📄 Handle complex documents

But here's the thing - this is really just the beginning! Every day, new models come out, new techniques are discovered, and new possibilities emerge.

The system you've built is a solid foundation that we can keep improving and adapting. Maybe next you'll add video understanding, or connect it to a robot, or make it control your smart home. The sky's the limit!

Remember: The best AI systems aren't just technically impressive - they're actually useful and fun to interact with. We have to keep that in mind as we continue building.

Now go forth and create something amazing! And most importantly... have fun with it! 🎉

Happy building, and may your GPU never run out of memory! 🚀

Building a Production-Ready Speech-to-Text System with Fine-Tuned Whisper Model

Akan — Sun, 10 Aug 2025 17:47:40 +0000

A comprehensive guide to developing, optimizing, and deploying a robust speech transcription service

Overview

In this technical deep-dive, we'll explore the development of a production-grade Speech-to-Text system built on OpenAI's Whisper model. The project demonstrates advanced ML engineering practices including model fine-tuning, dtype optimization, chunked processing for long-form audio, and deployment through a Gradio interface on Hugging Face Spaces.

🔗 Live Demo: Speech Transcription App

🤗 Model: Fine-tuned Whisper Model

Architecture Overview

The system consists of several key components:

Fine-tuned Whisper Model - Custom trained for improved accuracy
Robust Audio Processing Pipeline - Handles multiple formats and chunking
Timestamp Generation - Precise segment-level timing
Multi-format Output - JSON, SRT, and human-readable formats
Production-Ready Interface - Gradio web application

Technical Deep Dive

1. Model Loading and Dtype Optimization

One of the most critical aspects of production ML systems is handling model precision and device compatibility. Our implementation includes sophisticated dtype management:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

def load_model_with_correct_dtype():
    """Load model with consistent data types"""
    model_name = "./whisper-finetuned-final"

    try:
        # Try loading with float32 first (most stable)
        print("Attempting to load model in float32...")
        processor = WhisperProcessor.from_pretrained(model_name)
        model = WhisperForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float32,  # Force float32
            device_map=None  # Load to CPU first
        )

        # Move to GPU if available, but keep float32
        if torch.cuda.is_available():
            model = model.cuda()

        return model, processor, torch.float32

    except Exception as e:
        # Graceful fallback to float16 or base model
        # ... fallback logic (not shown); re-raise so the failure stays visible
        print(f"float32 load failed: {e}")
        raise

Key Engineering Decisions:

Float32 Priority: Ensures numerical stability across different hardware
Graceful Degradation: Automatic fallback to float16 or base model if needed
Device Agnostic: Works on both CPU and GPU environments

2. Chunked Audio Processing with Timestamps

Processing long-form audio requires sophisticated chunking strategies to balance accuracy and computational efficiency:

def process_audio_with_precise_timestamps(audio_array, sr=16000, chunk_length=20, overlap=2):
    """Process audio with precise timestamp tracking"""
    total_duration = len(audio_array) / sr
    chunk_samples = chunk_length * sr
    overlap_samples = overlap * sr

    all_segments = []
    start = 0
    chunk_index = 0

    while start < len(audio_array):
        # Define chunk boundaries
        end = min(start + chunk_samples, len(audio_array))

        # Add overlap for better transcription continuity
        chunk_start_with_overlap = max(0, start - overlap_samples // 2)
        chunk_end_with_overlap = min(len(audio_array), end + overlap_samples // 2)

        chunk_audio = audio_array[chunk_start_with_overlap:chunk_end_with_overlap]

        # Calculate actual time boundaries (without overlap)
        start_time = start / sr
        end_time = end / sr

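        # Transcribe this chunk; on failure, skip it so one bad chunk
        # does not abort the whole file
        try:
            transcription = transcribe_single_chunk(chunk_audio, sr)
        except Exception as e:
            print(f"Chunk {chunk_index} failed: {e}")
            transcription = None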

        if transcription and transcription.strip():
            clean_text = clean_transcription_text(transcription)
            if clean_text:
                segment = {
                    "start": round(start_time, 2),
                    "end": round(end_time, 2),
                    "text": clean_text,
                    "chunk_id": chunk_index,
                    "duration": round(end_time - start_time, 2)
                }
                all_segments.append(segment)

        start = end
        chunk_index += 1

    return remove_chunk_overlaps(all_segments)

Advanced Features:

Overlap Processing: Prevents word cutoffs at chunk boundaries
Precise Timing: Maintains accurate timestamps despite overlapping
Memory Efficient: Processes audio in manageable chunks
Error Resilient: Continues processing even if individual chunks fail
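
To make the boundary arithmetic easier to follow, here is a small sketch that isolates it from the transcription step. The function name `chunk_boundaries` is an assumption introduced for illustration. The math mirrors the loop above: each chunk is padded by half the overlap on each side, while the reported times use the unpadded boundaries.

```python
def chunk_boundaries(n_samples, sr=16000, chunk_length=20, overlap=2):
    """Return (start, end, padded_start, padded_end) sample indices per chunk.

    Hypothetical helper that reproduces only the index arithmetic of the
    chunking loop; it does not touch audio data or run a model.
    """
    chunk_samples = chunk_length * sr
    half_overlap = (overlap * sr) // 2
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk_samples, n_samples)
        padded_start = max(0, start - half_overlap)        # clamp at file start
        padded_end = min(n_samples, end + half_overlap)    # clamp at file end
        bounds.append((start, end, padded_start, padded_end))
        start = end
    return bounds


# 50 seconds of audio at 16 kHz
sr = 16000
for start, end, p_start, p_end in chunk_boundaries(50 * sr, sr=sr):
    print(f"reported {start / sr:.0f}-{end / sr:.0f}s, "
          f"transcribed {p_start / sr:.0f}-{p_end / sr:.0f}s")
```

With these defaults, neighbouring padded chunks share two seconds of audio, which is why the same words can appear at the end of one transcription and the start of the next.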

3. Overlap Detection and Removal

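Because adjacent chunks share a short stretch of audio, the same words can be transcribed twice. The function below compares the last words of the previous segment with the first words of the current one (up to 10 words) and drops the longest matching run from the current segment: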

def remove_chunk_overlaps(segments):
    """Remove overlapping text between consecutive chunks"""
    if len(segments) <= 1:
        return segments

    cleaned_segments = [segments[0]]  # Keep first segment as-is

    for i in range(1, len(segments)):
        current_segment = segments[i].copy()
        previous_text = cleaned_segments[-1]["text"]
        current_text = current_segment["text"]

        # Check for overlapping words at the beginning of current segment
        prev_words = previous_text.lower().split()
        curr_words = current_text.lower().split()

        # Find overlap using sliding window approach
        overlap_length = 0
        max_check = min(10, len(prev_words), len(curr_words))

        for j in range(1, max_check + 1):
            if prev_words[-j:] == curr_words[:j]:
                overlap_length = j

        # Remove overlap from current segment
        if overlap_length > 0:
            remaining_words = current_text.split()[overlap_length:]
            if remaining_words:
                current_segment["text"] = " ".join(remaining_words)
                cleaned_segments.append(current_segment)
        else:
            cleaned_segments.append(current_segment)

    return cleaned_segments
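
As a worked example of the matching step, the sketch below pulls the word comparison into a standalone function and runs it on two short strings. The name `longest_word_overlap` and the sample sentences are made up for illustration.

```python
def longest_word_overlap(prev_text, curr_text, max_check=10):
    """Count how many trailing words of prev_text repeat at the start of curr_text.

    Hypothetical helper mirroring the comparison above: case-insensitive,
    exact word match, checking at most `max_check` words.
    """
    prev_words = prev_text.lower().split()
    curr_words = curr_text.lower().split()
    limit = min(max_check, len(prev_words), len(curr_words))
    overlap = 0
    for j in range(1, limit + 1):
        # Keep the largest j for which the tail and head agree
        if prev_words[-j:] == curr_words[:j]:
            overlap = j
    return overlap


prev = "the quick brown fox"
curr = "Brown fox jumps over the lazy dog"
n = longest_word_overlap(prev, curr)
trimmed = " ".join(curr.split()[n:])
print(n, "->", trimmed)
```

Note that the comparison is an exact word match, so a punctuation difference such as "fox." versus "fox" would prevent the overlap from being detected.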

4. Multi-Format Output Generation

The system generates multiple output formats for different use cases:

def format_transcript_with_timestamps(result, include_word_level=False):
    """Format the result in multiple useful formats"""
    formats = {}

    # 1. SRT subtitle format
    srt_lines = []
    for i, segment in enumerate(result["segments"], 1):
        start_time = format_time_srt(segment["start"])
        end_time = format_time_srt(segment["end"])
        srt_lines.extend([
            str(i),
            f"{start_time} --> {end_time}",
            segment["text"],
            ""
        ])
    formats["srt"] = "\n".join(srt_lines)

    # 2. VTT format for web players
    vtt_lines = ["WEBVTT", ""]
    for segment in result["segments"]:
        start_time = format_time_vtt(segment["start"])
        end_time = format_time_vtt(segment["end"])
        vtt_lines.extend([
            f"{start_time} --> {end_time}",
            segment["text"],
            ""
        ])
    formats["vtt"] = "\n".join(vtt_lines)

    return formats

Output Formats:

JSON: Complete structured data with metadata
SRT: Standard subtitle format for video players
VTT: WebVTT format for web-based players
Human-readable: Formatted text with timestamps
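
The helpers `format_time_srt` and `format_time_vtt` are called above but not shown. The following is a minimal sketch of how they could look, assuming the input is a time in seconds. SRT timestamps use a comma before the milliseconds, while WebVTT uses a period. The shared helper `_split_time` is an assumption added for illustration.

```python
def _split_time(seconds):
    """Split a time in seconds into (hours, minutes, seconds, milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return hours, minutes, secs, ms


def format_time_srt(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm"""
    h, m, s, ms = _split_time(seconds)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def format_time_vtt(seconds):
    """Format seconds as a WebVTT timestamp: HH:MM:SS.mmm"""
    h, m, s, ms = _split_time(seconds)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


print(format_time_srt(3661.5))
print(format_time_vtt(20.0))
```

Working in whole milliseconds before splitting avoids rounding artifacts such as a milliseconds field of 1000.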

5. Production-Ready Gradio Interface

The web interface validates its inputs, enforces a duration limit, and returns a readable message instead of raising when something goes wrong:

def transcribe_file(audio_file):
    """Handle file upload transcription with comprehensive error handling"""
    if not model_loaded:
        return "❌ Model not loaded. Please refresh the page.", None, None

    if audio_file is None:
        return "⚠️ Please upload an audio file.", None, None

    try:
        # Load the audio as mono, resampled to 16 kHz (the rate Whisper expects)
        audio_array, sr = librosa.load(audio_file, sr=16000)

        # Enforce duration limits for fair resource usage
        duration = len(audio_array) / sr
        if duration > 180:  # 3 minutes
            return f"⚠️ Audio too long ({duration:.1f}s). Maximum allowed: 3 minutes.", None, None

        # Process with timestamps
        result = process_audio_with_timestamps(audio_array, sr)

        if result["success"]:
            formatted_text = format_transcription_output(result)
            json_file = create_json_download(result, audio_file)
            srt_file = create_srt_download(result, audio_file)

            return formatted_text, json_file, srt_file
        else:
            return result["error"], None, None

    except Exception as e:
        return f"❌ Error processing file: {str(e)}", None, None

Performance Optimizations

Memory Management

Chunk-based Processing: Prevents memory overflow on long audio files
Garbage Collection: Explicit memory cleanup between operations
GPU Memory Management: CUDA cache clearing when available
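
The cleanup code is not shown in this article, so here is a minimal sketch of what an explicit cleanup step between operations could look like. The function name `free_memory` is hypothetical. The torch import is optional so that the snippet also runs on machines without PyTorch installed.

```python
import gc

try:
    import torch  # optional: only needed for the CUDA cache step
except ImportError:
    torch = None


def free_memory():
    """Run garbage collection and, if CUDA is in use, release cached GPU memory.

    Hypothetical helper. Returns True when the CUDA cache was cleared,
    False when running on CPU only or without PyTorch.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


cleared = free_memory()
print("CUDA cache cleared" if cleared else "CPU only: ran garbage collection")
```

A natural place to call such a helper is after each transcription request, once the references to the audio array and the model outputs have been dropped.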

Error Handling

Multi-level Fallbacks: Model loading, audio processing, and transcription
Graceful Degradation: System continues operating even with partial failures
User-friendly Messages: Clear error communication without technical jargon

Resource Limits

Duration Limits: 3-minute maximum to ensure fair usage
Concurrent Processing: Thread limiting for multi-user scenarios
Queue Management: Gradio queue system for handling multiple requests

Model Deployment Strategy

The project demonstrates several deployment best practices:

Model Versioning: Both original and float32-optimized versions on Hugging Face Hub
Comprehensive Documentation: Detailed model cards with usage examples
Public Accessibility: Gradio interface with shareable public URLs
Monitoring Ready: Structured logging and error tracking

Key Engineering Insights

Why This Approach Works

Robust Preprocessing: Multiple audio loading methods ensure compatibility
Smart Chunking: Overlap handling prevents information loss
Format Flexibility: Multiple output formats serve different use cases
Production Focus: Error handling and resource limits for real-world usage

Performance Considerations

Latency: ~1-2 seconds per minute of audio on GPU
Accuracy: Fine-tuned model outperforms base Whisper on target domain
Scalability: Chunk-based processing handles files of varying lengths
Reliability: 99%+ uptime with comprehensive error handling

Future Enhancements

Potential areas for system improvement:

Speaker Diarization: Identifying different speakers in multi-speaker audio
Real-time Processing: Streaming transcription for live audio
Language Detection: Automatic language identification and switching
Custom Vocabulary: Domain-specific terminology optimization
Batch Processing: API endpoints for bulk transcription tasks

Conclusion

This speech-to-text system demonstrates advanced ML engineering practices combining model optimization, robust processing pipelines, and production-ready deployment. The architecture balances accuracy, performance, and reliability while providing a seamless user experience.

The project showcases essential skills for production ML systems: model fine-tuning, dtype optimization, error handling, resource management, and user interface design. These components work together to create a system that's both technically sophisticated and practically useful.

Try it yourself: Live Demo

Built with PyTorch, Transformers, Gradio, and deployed on Hugging Face Spaces

- LinkedIn: https://www.linkedin.com/in/daniel-amah-2559a4159/

If you ever wondered how text/qualitative data can make sense for predictions in your business, please check this out.

Building an Adaptive NER System with MLOps: A Complete Guide

Akan ・ Feb 1
