Forem: Pasquale Molinaro

Real-time video classification with PaliGemma: architecture patterns for low-latency VLM inference

Pasquale Molinaro — Sun, 24 May 2026 13:53:24 +0000

In a previous article, we benchmarked three open-source Vision-Language Models on zero-shot object detection and arrived at an uncomfortable conclusion: even the fastest contender, Phi-3.5-vision-instruct, takes 4.45 seconds per frame on an NVIDIA L4. LLaVA-v1.6 sits at 8.13 seconds. For any application that needs to process a live video stream, these numbers are disqualifying. But the conclusion that VLMs are fundamentally incompatible with real-time workloads deserves more scrutiny. That 8-second figure was measured on a general-purpose zero-shot detection task, asking the model to reason about arbitrary objects in arbitrary scenes. What happens when you constrain the problem? When you give the model a closed vocabulary, a fixed resolution, a deterministic decoding strategy, and a non-blocking inference pipeline?
This article answers that question. Using PaliGemma, Google’s compact vision-language model, we built a real-time video classification system running at approximately 0.8 to 1.2 seconds per frame on an NVIDIA RTX A4500. That is roughly seven to ten times faster than LLaVA’s 8.13 seconds on comparable professional hardware, achieved largely through architectural decisions rather than hardware upgrades. Here are the four patterns that made it possible.

Why PaliGemma

Before getting into architecture, the choice of model itself deserves an explanation, because PaliGemma is significantly underrepresented in the developer literature relative to its practical value. PaliGemma is a 3-billion parameter vision-language model built by Google, combining a SigLIP vision encoder with a Gemma language backbone. It is less than half the size of LLaVA-7B and noticeably smaller than Phi-3.5-vision (roughly 4.2 billion parameters), which translates directly to lower VRAM consumption and faster inference on the same hardware. More importantly for classification tasks, it was explicitly fine-tuned on a wide range of visual understanding benchmarks including image captioning, visual question answering, and object localization, which means it has strong priors for the kind of constrained, structured responses we are going to elicit.

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

MODEL_PATH = "./paligemma_offline"

model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    local_files_only=True
).eval()

processor = PaliGemmaProcessor.from_pretrained(
    MODEL_PATH,
    local_files_only=True
)

Two details here are worth noting. Loading in bfloat16 rather than float32 halves the VRAM footprint with negligible accuracy degradation on classification tasks. The local_files_only=True flag is not just a convenience for offline environments: in a production system, it eliminates the network round-trip on initialization and guarantees that your inference environment is fully reproducible.

Pattern 1: resolution as a latency knob

The single most impactful decision in a real-time VLM pipeline is input resolution. VLMs process images by dividing them into patches and encoding each patch as a sequence of tokens. A 1280×720 frame generates a dramatically larger token sequence than a 448×448 crop, and resolution is not a linear cost: the number of patches grows with the square of the image side, and transformer attention in turn scales quadratically with sequence length. For zero-shot object detection, where spatial precision matters, downsampling is a real trade-off. But for scene-level classification tasks, where you are asking “what is the dominant emotion in this frame?” rather than “give me the pixel coordinates of every object,” 448×448 preserves more than enough semantic information.

import cv2
import numpy as np
from PIL import Image

def preprocess_frame(frame_bgr: np.ndarray) -> Image.Image:
    # Resize to inference resolution before any VLM processing
    frame_small = cv2.resize(frame_bgr, (448, 448))
    # Convert BGR (OpenCV) to RGB (PIL/transformers)
    return Image.fromarray(cv2.cvtColor(frame_small, cv2.COLOR_BGR2RGB))

The key insight is that resolution should be chosen based on the granularity of information your task actually requires, not based on the resolution of your input stream. If your camera captures at 1080p but your classification task only needs to distinguish between five emotional states, you are paying a massive compute tax for information you will never use.
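To make the resolution cost concrete, here is a small back-of-the-envelope sketch. It assumes non-overlapping 14-pixel square patches and square inputs at 224, 448, and 896 pixels, which matches how PaliGemma checkpoints are published; the helper name `image_tokens` is purely illustrative, and the "attention cost" figure is a rough proxy that ignores text tokens and every other part of the forward pass.

```python
def image_tokens(side_px: int, patch_px: int = 14) -> int:
    """Patch tokens for a square image, assuming non-overlapping square patches."""
    per_side = side_px // patch_px
    return per_side * per_side

BASELINE = image_tokens(224)

for side in (224, 448, 896):
    tokens = image_tokens(side)
    # Self-attention cost grows with the square of the sequence length
    relative_attention = (tokens / BASELINE) ** 2
    print(f"{side}x{side}: {tokens} image tokens, "
          f"~{relative_attention:.0f}x attention cost vs 224x224")
```

Doubling the side length quadruples the token count and multiplies the attention term by sixteen, which is why resolution is worth tuning before anything else.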

Pattern 2: deterministic decoding with a closed vocabulary

Standard VLM usage treats the model as an open-ended text generator. You prompt it, it samples from a probability distribution, and you receive a natural language response that you then parse. This is the source of the fragility problem we discussed in the previous article, and it is also a significant source of latency: sampling with high max_new_tokens means the model runs the full autoregressive loop for every token it generates. For classification tasks, you can break this entirely. Instead of asking the model to describe what it sees, you constrain its output to a fixed vocabulary of valid labels and limit generation to the minimum number of tokens needed to express one of them.

# A generic set of states tailored to your specific domain
VALID_CLASSES = ['active', 'idle', 'error', 'offline', 'unknown']

def classify_frame(model: torch.nn.Module, processor, image: Image.Image) -> str:
    prompt = (
        f"<image> What is the current operational state shown in this frame? "
        f"You MUST choose ONLY ONE from: {VALID_CLASSES}."
    )

    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device, model.dtype)

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=10,   # A single class label needs at most 2-3 tokens
            do_sample=False      # Greedy decoding: deterministic, faster, no temperature needed
        )

    raw_output = processor.decode(
        output_ids[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True
    ).strip().lower()

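    # Sanitize to alphabetic characters only, then enforce the closed vocabulary:
    # anything that is not an exact label falls back to 'unknown'
    label = ''.join(filter(str.isalpha, raw_output))
    return label if label in VALID_CLASSES else 'unknown'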

Setting do_sample=False switches the model to greedy decoding, which always selects the highest-probability token at each step. This eliminates the sampling overhead and makes the output fully deterministic: identical inputs will always produce identical outputs, which is essential for debugging and for the temporal smoothing pattern we cover next. The max_new_tokens=10 cap means the model stops generating almost immediately after producing the label, rather than continuing to produce explanatory text nobody asked for.
The result is that you are using a 3B-parameter VLM as a highly capable, semantically-aware classifier rather than a generative model. You get the zero-shot flexibility of natural language prompting with inference characteristics that approach those of a dedicated classification head.

Pattern 3: temporal smoothing for prediction stability

Even with deterministic decoding, a VLM processing live video will produce noisy predictions. Lighting changes, motion blur, partial occlusion, and transient visual artifacts will cause the model to output inconsistent labels across consecutive frames. If you pipe raw per-frame predictions directly to a downstream system, you get jittery, unreliable output. The solution is temporal smoothing: instead of trusting any single prediction, you maintain a rolling window of recent predictions and emit the majority vote.

from collections import deque, Counter

class TemporalSmoother:
    def __init__(self, window_size: int = 5):
        self.history = deque(maxlen=window_size)

    def update(self, prediction: str) -> str:
        self.history.append(prediction)
        # Return the most common prediction in the window
        return Counter(self.history).most_common(1)[0][0]

A window of 5 frames at our inference rate translates to roughly 4 to 6 seconds of temporal context. This is enough to absorb transient noise while remaining responsive to genuine state changes. The window size is the primary tuning parameter: larger windows are more stable but slower to react; smaller windows are more responsive but noisier. For most classification tasks, 3 to 7 frames covers the practical range.
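A short simulation shows the effect of the window. The class is repeated so the snippet runs on its own; the label sequence is invented for illustration and is not taken from a real stream.

```python
from collections import Counter, deque

class TemporalSmoother:
    def __init__(self, window_size: int = 5):
        self.history = deque(maxlen=window_size)

    def update(self, prediction: str) -> str:
        self.history.append(prediction)
        # Majority vote over the most recent predictions
        return Counter(self.history).most_common(1)[0][0]

smoother = TemporalSmoother(window_size=5)

# Hypothetical raw predictions: one spurious "error", then a real change to "active"
raw = ["idle", "idle", "idle", "error", "idle",
       "active", "active", "active", "active"]
smoothed = [smoother.update(label) for label in raw]

for r, s in zip(raw, smoothed):
    print(f"raw={r:<7} smoothed={s}")
```

The single "error" frame never reaches the output, while the genuine switch to "active" is only emitted once it holds a clear majority of the window. When two labels are tied, the result depends on how `Counter` orders them, which in CPython tends to favor the label seen first; if that matters for your use case, an explicit tie-break rule is worth adding.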

Pattern 4: non-blocking inference with a decoupled shared state architecture

The previous three patterns optimize the inference call itself. This one addresses a more fundamental systemic issue: a VLM inference call that takes 0.8 to 1.2 seconds will block any thread it runs on for that entire duration. If your video capture and your inference run on the same thread, your stream will stutter at the inference rate rather than the camera rate.

The naive solution is to use a standard Python queue.Queue to pass frames between threads. However, this introduces consumer-competition: if a rendering thread and an AI thread both read from the same queue, they consume the frames, stealing data from one another and causing severe visual stuttering and skipped inference cycles. The production-grade solution is an Asynchronous Shared State Pattern with granular locking. The video capture thread acts as a producer, continuously overwriting a shared “latest frame” pointer. The rendering thread (running on the main thread, which is mandatory for OpenCV UI operations on macOS and Wayland) and the AI background thread act as independent consumers, copying the latest frame into local memory whenever they are ready for their next cycle.

import threading
import time
import numpy as np
import cv2
import torch
from typing import Optional, Any

class SharedState:
    """
    Thread-safe state container.
    The lock is strictly granular: it is only held for memory assignment/copying,
    never during expensive I/O or AI inference operations.
    """
    def __init__(self):
        self.latest_frame: Optional[np.ndarray] = None
        self.prediction: str = "WAITING"
        self.lock = threading.Lock()
        self.running: bool = True

shared = SharedState()

def video_capture_worker(source: int = 0) -> None:
    """Reads frames at hardware speed and updates the shared state."""
    cap = cv2.VideoCapture(source)

    while shared.running:
        ret, frame = cap.read()
        if not ret:
            time.sleep(0.01)
            continue

        with shared.lock:
            # Overwrite with the freshest data.
            # Pointer assignment is fast enough to barely hold the lock.
            shared.latest_frame = frame

    cap.release()

def inference_worker(model: torch.nn.Module, processor: Any) -> None:
    """Consumes the latest frame at the AI's maximum throughput rate."""
    smoother = TemporalSmoother(window_size=5)

    while shared.running:
        with shared.lock:
            # Deep copy to prevent OpenCV from mutating the array during inference
            frame = shared.latest_frame.copy() if shared.latest_frame is not None else None

        if frame is None:
            time.sleep(0.05)
            continue

        try:
            image = preprocess_frame(frame)
            raw_pred = classify_frame(model, processor, image)
            smoothed_pred = smoother.update(raw_pred)

            with shared.lock:
                shared.prediction = smoothed_pred

        except torch.cuda.OutOfMemoryError:
            # Handle temporary VRAM spikes gracefully without killing the thread
            with shared.lock:
                shared.prediction = "OOM_ERROR"
            time.sleep(1.0)

        except Exception as e:
            # Catch corrupt frames or tensor mismatches
            with shared.lock:
                shared.prediction = "ERROR"

# Initialize background workers as Daemon threads
threads = [
    threading.Thread(target=video_capture_worker, args=(0,), daemon=True),
    threading.Thread(target=inference_worker, args=(model, processor), daemon=True),
]

for t in threads:
    t.start()

# Main Thread UI Loop
# UI libraries (cv2.imshow) must run on the main thread to prevent OS-level crashes.
while shared.running:
    with shared.lock:
        frame = shared.latest_frame.copy() if shared.latest_frame is not None else None
        label = shared.prediction

    if frame is not None:
        cv2.putText(frame, label.upper(), (30, 50),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 3)
        cv2.imshow("Live Classification", frame)

    # 30 FPS rendering limit (33ms) + graceful shutdown
    if cv2.waitKey(33) & 0xFF == ord('q'):
        shared.running = False

cv2.destroyAllWindows()

The critical design principle here is granular locking: the lock is acquired, the numpy array is copied in memory (typically well under a millisecond at webcam resolutions), and the lock is released immediately. Holding the lock across a one-second VLM inference call would serialize all three components and defeat the entire purpose of the architecture. With this structure, your video capture thread runs at hardware framerate (e.g., 30 fps), your rendering loop displays frames at 30 fps, and your inference thread runs at its own asynchronous rate (roughly 1 fps). The three loops are temporally independent, each limited only by its own bottleneck.
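The "latest frame" idea can be isolated into a tiny single-slot container. The sketch below is one possible variation, not part of the pipeline above: the class name `LatestValue` and its version counter are hypothetical additions. The counter lets a consumer tell whether the slot changed since its last read, so a slow worker can avoid reprocessing the same frame if capture stalls.

```python
import threading
from typing import Any, Optional, Tuple

class LatestValue:
    """Single-slot mailbox: writers overwrite, readers never consume."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[Any] = None
        self._version = 0

    def set(self, value: Any) -> None:
        with self._lock:
            self._value = value
            self._version += 1

    def get(self) -> Tuple[int, Optional[Any]]:
        with self._lock:
            return self._version, self._value

slot = LatestValue()

# Stand-in for the capture thread: five frames arrive before the consumer wakes up
for frame_id in range(5):
    slot.set(f"frame-{frame_id}")

# Stand-in for a slow consumer: it only ever sees the freshest frame
last_seen = 0
version, value = slot.get()
if version != last_seen:
    print(f"processing {value} (skipped {version - last_seen - 1} older frames)")
    last_seen = version
```

Unlike a queue, reading the slot does not remove anything, so the rendering loop and the inference worker can both call `get()` without stealing frames from each other.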

Benchmark summary

Running the complete pipeline on an NVIDIA RTX A4500 (20GB GDDR6, Ampere architecture) with PaliGemma in bfloat16 across a three-stream live video scenario yields a highly stable performance profile. By restricting the input resolution to 448 × 448 and capping the output at a maximum of 10 new tokens via a greedy decoding strategy (do_sample=False), the system achieves an inference latency between 0.8 and 1.2 seconds per frame. Combined with a 5-frame temporal smoothing window, this configuration ensures reliable state classification while the decoupled architecture allows the video capture thread to maintain a steady 25 fps, completely independent of the inference bottleneck.

For comparison, LLaVA-v1.6-Mistral-7B running open-vocabulary zero-shot detection on an NVIDIA L4 clocks in at 8.13 seconds per frame. The hardware, the model, and the task all differ, so this is not a controlled comparison, but the magnitude of the gap suggests that architectural constraints, rather than raw compute, account for most of the difference.

When this architecture makes sense

This pattern is a strong fit when your task is classifiable into a fixed label set, you need continuous processing of a live stream rather than batch analysis of static images, data privacy requirements preclude sending frames to an external API, and you can tolerate latency on the order of one second rather than sub-100ms. It is not the right tool when you need genuine real-time response at conveyor-belt speeds, where sub-50ms latency is non-negotiable. In that regime, you are back in YOLO territory, or you use a pipeline like the one described in the previous article: leverage the VLM to auto-annotate a dataset overnight, then train a dedicated lightweight classifier for production deployment.

Conclusion

The gap between “VLMs are too slow for video” and “VLMs work in production video pipelines” is not primarily a hardware problem. It is an architectural one. Choosing a compact model like PaliGemma over a 7B-parameter alternative, constraining resolution to what your task actually requires, enforcing deterministic decoding over a closed vocabulary, smoothing predictions temporally, and decoupling inference from capture and rendering: none of these require a bigger GPU. They require thinking carefully about what you are actually asking the model to do, and building your pipeline around that constraint rather than against it.

The full pattern, from model loading to multi-threaded inference, fits in under 150 lines of Python. That is a reasonable price for zero-shot semantic classification on a live video stream.

Stop retraining YOLO: a developer’s guide to zero-shot object detection with generative VLMs

Pasquale Molinaro — Fri, 22 May 2026 20:20:53 +0000

If you have ever maintained a computer vision pipeline in a factory, warehouse, or construction site, you already know the drill. You spend weeks collecting images, annotating bounding boxes, and fine-tuning a YOLO or Faster R-CNN model just to detect safety helmets and high-visibility vests. Then, the safety department introduces a new type of protective glove, your model’s accuracy tanks, and you are thrust right back into the endless loop of data collection, labeling, and retraining.

Generative Vision-Language Models (VLMs) solve this by turning object detection into a zero-shot semantic prompt:

“Find all non-compliant protective equipment in this scene and return their coordinates.”

But for industrial engineering teams, implementing this introduces a new architectural headache. Do you self-host a heavy open-source model like LLaVA to ensure air-gapped data privacy? Or do you leverage managed APIs like GPT-4o, using Structured Outputs to guarantee type-safe JSON bounding boxes in seconds?

In this article, we will explore both paths. We will break down the hardware realities of the local edge approach across three open-source models, and then write a Pydantic-validated Python baseline to build a robust, zero-shot detection pipeline using GPT-4o.

The legacy trap: domain shift

If you are running visual inspections on an assembly line, you likely rely on models like YOLOv8. Optimized for edge deployment, a YOLO baseline can process a frame in approximately 0.03 seconds on an NVIDIA L4 GPU. For high-speed manufacturing, this is as close to perfection as inference gets.

But its operational Achilles’ heel is domain shift.

Traditional object detectors only know how to map specific pixel gradients to an integer class ID. If you train a model on yellow helmets, what happens when procurement switches to white helmets? The pipeline shatters. You are forced to halt operations, harvest failing frames, manually draw new boxes, and re-balance your dataset. In a dynamic industrial environment, this rigid cycle of constant fine-tuning destroys your time-to-market.

The semantic shift: prompting instead of predicting

The key difference between legacy detectors and VLMs is vocabulary. A VLM reasons about image content in natural language. You describe what you are looking for, and the model maps that semantic description to spatial coordinates. You no longer retrain to find a new object class; you just ask for it.

Scope Clarification: While VLM-generated bounding boxes are not yet a replacement for specialized, sub-millimeter real-time detectors in high-precision automation, they are highly effective for semantic inspection, auditing, and rapid dataset generation workflows. This zero-shot flexibility comes at a cost measured in compute budget and latency.

The “build” route: self-hosting at the edge

If your factory floor mandates strict data privacy, sending video frames to a cloud API is not an option. You must self-host an open-source VLM. Loading a 7-billion parameter model via Hugging Face transformers is deceptively simple:

import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

This elegant script hides a brutal hardware reality. Running a 7B model without heavy quantization requires at least 14 to 16 GB of VRAM. You cannot run this on a cheap edge device; it demands enterprise-grade silicon like an NVIDIA L4 (24 GB) or L40S (48 GB).
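The 14 GB floor follows directly from the parameter count. The sketch below covers model weights only; activations, the KV cache, and framework overhead come on top, which is why the practical requirement lands closer to 16 GB. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

BYTES = {"float32": 4, "bfloat16": 2}

for dtype, nbytes in BYTES.items():
    print(f"7B parameters in {dtype}: {weight_memory_gb(7e9, nbytes):.0f} GB of weights")
```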

The latency reality

(Methodology note: benchmarks were measured on a single NVIDIA L4 GPU using single-image inference on 1024x1024 inputs, with warm model loading and bfloat16 precision, and no aggressive quantization.)

Not all open-source VLMs are equal. While our legacy YOLOv8 flies at 0.03 seconds, Phi-3.5-vision-instruct yields an average processing time of 4.45 seconds per image (costing roughly €0.67 per hour in compute). Stepping up to LLaVA-v1.6-Mistral-7B pushes latency to 8.13 seconds (€1.23/hr), and Molmo-7B drags the pipeline down to 13.73 seconds (€2.07/hr).

The gap between YOLOv8 and Phi-3.5 is roughly 150x. Real-time conveyor belt inspection is not a use case for zero-shot VLMs today. However, Phi-3.5-vision-instruct emerges as the most operationally interesting option for on-premise deployments, cutting LLaVA’s latency in half at a fraction of the cost. Self-hosting gives you absolute data privacy and zero marginal API costs, but 4 to 8 seconds per image is a massive operational constraint.
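For reference, the same latency figures can be turned into throughput and slowdown factors. This is plain arithmetic on the numbers quoted above, assuming strictly sequential single-image inference with no batching.

```python
SECONDS_PER_IMAGE = {
    "YOLOv8": 0.03,
    "Phi-3.5-vision-instruct": 4.45,
    "LLaVA-v1.6-Mistral-7B": 8.13,
    "Molmo-7B": 13.73,
}

def slowdown_vs_yolo(model_name: str) -> float:
    """How many times slower a model is than the YOLOv8 baseline."""
    return SECONDS_PER_IMAGE[model_name] / SECONDS_PER_IMAGE["YOLOv8"]

for name, seconds in SECONDS_PER_IMAGE.items():
    images_per_hour = 3600 / seconds
    print(f"{name}: {images_per_hour:,.0f} images/hour, "
          f"{slowdown_vs_yolo(name):.0f}x YOLOv8 latency")
```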

The “buy” route: API-driven detection with GPT-4o

If multi-second latencies and VRAM limits are too steep for a proof of concept, managed APIs offer an immediate alternative. However, early adopters of LLMs for computer vision quickly hit an operational wall: parsing fragility.

Historically, asking a vision model for coordinates returned unstructured text. Engineering teams wrote brittle regex patterns to extract those numbers. If the model hallucinated a parenthesis, the pipeline crashed. The modern enterprise approach eliminates this fragility by enforcing Structured Outputs. Using OpenAI’s API, you define a strict data contract with Pydantic. The model is forced to return a perfectly typed JSON object mapped to a normalized 1,000x1,000 spatial grid.

Here is a robust, production-oriented baseline script:

import base64
from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

# Define the data contract
class BoundingBox(BaseModel):
    ymin: int = Field(description="Top-left Y coord on a 1000x1000 grid")
    xmin: int = Field(description="Top-left X coord on a 1000x1000 grid")
    ymax: int = Field(description="Bottom-right Y coord on a 1000x1000 grid")
    xmax: int = Field(description="Bottom-right X coord on a 1000x1000 grid")

class DetectedPPE(BaseModel):
    equipment_type: str = Field(description="Class of the item, e.g. 'helmet' or 'gloves'")
    is_compliant: bool = Field(description="True if properly worn, False otherwise")
    box: BoundingBox

class SceneAnalysis(BaseModel):
    detected_items: list[DetectedPPE]

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def detect_ppe(image_path: str) -> SceneAnalysis:
    base64_image = encode_image(image_path)

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an industrial safety inspector. Find all PPE items. "
                    "Return bounding box coordinates mapping the image to a 1000x1000 grid, "
                    "where [0,0] is the top-left corner."
                )
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Locate all helmets, vests, and gloves. Flag non-compliant items."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ],
        response_format=SceneAnalysis,
        temperature=0.0
    )

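    # The output is a validated Python object, zero regex required.
    # `parsed` is None when the model refuses to answer, so fail loudly in that case.
    parsed = response.choices[0].message.parsed
    if parsed is None:
        raise ValueError("Model returned no structured output")
    return parsed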

Several design decisions in this code are worth noting explicitly. Setting the temperature to zero minimizes sampling variance, which makes bounding box coordinates far more repeatable across identical frames, although the API does not guarantee bit-identical outputs between calls. Furthermore, mapping the output to a normalized 1,000x1,000 grid makes your coordinates resolution-agnostic: a single multiplication scales them to any output image size. Finally, by utilizing the parse method instead of the raw completions endpoint, the OpenAI SDK handles schema enforcement for you. This means a malformed response raises a standard Python validation error rather than silently corrupting your downstream pipeline.
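Converting from the normalized grid back to pixels is a one-line scale per coordinate. The sketch below uses a plain dataclass as a stand-in for the `BoundingBox` model so that it runs without Pydantic; the function name `to_pixels` and the clamping behavior are illustrative choices, not part of the OpenAI SDK.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Stand-in for BoundingBox: coordinates on a 1000x1000 grid."""
    ymin: int
    xmin: int
    ymax: int
    xmax: int

def to_pixels(box: Box, width: int, height: int, grid: int = 1000) -> tuple[int, int, int, int]:
    """Scale a normalized box to (x0, y0, x1, y1) pixel coordinates, clamped to the image."""
    def scale(value: int, size: int) -> int:
        return max(0, min(size, round(value * size / grid)))

    return (scale(box.xmin, width), scale(box.ymin, height),
            scale(box.xmax, width), scale(box.ymax, height))

box = Box(ymin=250, xmin=100, ymax=750, xmax=400)
print(to_pixels(box, width=1920, height=1080))  # (192, 270, 768, 810)
```

Clamping is a defensive choice: schema enforcement guarantees integers, not that they fall inside the 0 to 1000 range.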

In practice, VLM-based localization is still probabilistic. Small coordinate drift, missed objects under heavy occlusion, or inconsistent bounding boxes across consecutive frames remain common failure modes in cluttered industrial scenes. This is why the code above is a robust starting point, but true production systems will still require retry logic, confidence calibration, and temporal smoothing.

The economics: when does it make sense to migrate?

You now have two functioning zero-shot architectures. The decision of which to use is entirely about latency, scale, and budget.

Consider a safety inspection system processing 310 images per shift. Using the GPT-4o API, that single batch costs approximately €21.27. Dropping to GPT-4o mini reduces it to €4.29, at the cost of some accuracy on complex scenes. Multiplying that baseline across three shifts, seven days a week, yields roughly €1,900 per month with GPT-4o, or just under €400 with GPT-4o mini, for a single station. This is when on-premise starts making financial sense. A dedicated NVIDIA L4 instance at €1.23 per hour running Phi-3.5 covers unlimited inference for a fixed monthly cost. For most production-scale deployments, the API route stops being economical somewhere between three and six months of continuous operation.
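The monthly figures can be reproduced with a few lines of arithmetic. The per-shift API costs, the hourly L4 rate, and the Phi-3.5 latency come from the text; the 30-day month and the always-on GPU instance are assumptions of this sketch, and it leaves out engineering time, storage, and networking.

```python
API_COST_PER_SHIFT_EUR = {"gpt-4o": 21.27, "gpt-4o-mini": 4.29}  # from the text
IMAGES_PER_SHIFT = 310
SHIFTS_PER_DAY = 3
DAYS_PER_MONTH = 30            # assumption
L4_EUR_PER_HOUR = 1.23         # from the text
PHI_SECONDS_PER_IMAGE = 4.45   # from the latency benchmark

monthly_api = {
    model: cost * SHIFTS_PER_DAY * DAYS_PER_MONTH
    for model, cost in API_COST_PER_SHIFT_EUR.items()
}
monthly_gpu = L4_EUR_PER_HOUR * 24 * DAYS_PER_MONTH  # instance kept running all month

# How busy the GPU actually is at this volume
gpu_hours_per_day = IMAGES_PER_SHIFT * SHIFTS_PER_DAY * PHI_SECONDS_PER_IMAGE / 3600

for model, cost in monthly_api.items():
    print(f"{model}: EUR {cost:,.2f} per month")
print(f"Always-on L4: EUR {monthly_gpu:,.2f} per month")
print(f"GPU time needed at this volume: {gpu_hours_per_day:.2f} hours per day")
```

At this volume the GPU is busy for little more than an hour a day, so a fixed-cost instance only pays off as more stations or higher frame counts share it.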

The decision framework for a Tech Lead ultimately hinges on the maturity and speed of the project. When validating logic with unpredictable object classes, the API route is the undisputed starting point. It requires no infrastructure setup, ignores VRAM limits, and delivers a working pipeline in a single afternoon. However, once that logic is validated and daily inference volumes grow — or if the factory security team mandates a strict air-gap — migrating to a model like Phi-3.5 on-premise provides cost efficiency at scale while retaining semantic flexibility.

Finally, there is the real-time fallback. If a manufacturing process genuinely requires sub-100ms inference on a high-speed conveyor belt, neither local VLMs nor cloud APIs are the answer. The most pragmatic engineering path is to leverage GPT-4o overnight to auto-annotate a massive dataset of the newly introduced object classes. That auto-generated data can then be used to train a specialized YOLOv8 model for real-time deployment. In this scenario, the VLM serves as a highly intelligent labeling engine rather than an inference engine.

Conclusion

Industrial computer vision is shifting from fixed classifiers to semantic interfaces. The default response to a new object class no longer has to be six weeks of manual annotation. The benchmark data tells a clear story. GPT-4o and its API-driven structured outputs give you a working, type-safe detection pipeline in an afternoon. Open-source alternatives like Phi-3.5-vision-instruct offer a credible on-premise path for teams with privacy requirements. And for ultra-fast use cases, VLMs are best understood as intelligent labeling tools rather than inference engines.

The bottleneck is no longer annotation throughput. It is architectural choice.

Stop retraining. Start prompting.

Your Outlier Detection is Lying to You

Pasquale Molinaro — Tue, 19 May 2026 22:33:39 +0000

Why DBSCAN breaks in high dimensions and what to do instead

You tuned epsilon to 1.5 because it felt reasonable. Here is what that decision actually means. On a dataset with 16 features, shifting epsilon from 1.0 to 2.0 changes your outlier rate from 60.31% to 2.35%. Same data. Same algorithm. One parameter, moved by a single unit. These are not numbers from a toy dataset: they come from a decade of real Australian weather records, roughly 145,000 observations, 16 continuous meteorological variables.

If someone asked you to justify eps=1.5 in a production review, what would you say?

The Setup

The dataset is the Australian weather observations from the Bureau of Meteorology, publicly available on Kaggle. It contains daily measurements from 49 stations across the country: temperature, rainfall, wind speed, pressure, humidity. Real data, messy data, with missing values and a distribution that does not care about your assumptions.

The preprocessing is standard. Select numerical columns, impute missing values with the column median, and scale everything with StandardScaler. Sixteen features survive the selection.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

df = pd.read_csv("weatherAUS.csv")

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
imputer = SimpleImputer(strategy='median')
df_num_imputed = pd.DataFrame(
    imputer.fit_transform(df[num_cols]), columns=num_cols
)

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_num_imputed)

print(f"Total rows: {len(df_scaled)} | Dimensions: {len(num_cols)}")
# Total rows: 145460 | Dimensions: 16

Nothing unusual so far. This is the pipeline you have probably written a dozen times. The problem starts at the next step.

Why DBSCAN Cannot Handle This

DBSCAN labels a point as noise, which is what we treat as an outlier here, when it has fewer than min_samples neighbors within a radius of epsilon and does not lie within epsilon of any point that does. The logic is intuitive in two or three dimensions. In sixteen dimensions it stops making geometric sense.

The reason is the curse of dimensionality. As dimensions increase, Euclidean distances between points concentrate: under fairly broad conditions, the ratio between the maximum and minimum distance from a point to the rest of the data converges toward one. In practice this means that in a high-dimensional space all points start to look roughly equidistant from each other. The notion of a dense neighborhood that DBSCAN relies on becomes increasingly difficult to define, and the choice of epsilon loses its geometric interpretation.
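Distance concentration is easy to observe on synthetic data. The sketch below uses independent uniform random points rather than the weather dataset, so the exact ratios are illustrative only; the function name `distance_spread` is hypothetical.

```python
import math
import random

def distance_spread(n_dims: int, n_points: int = 500, seed: int = 0) -> float:
    """Ratio of the farthest to the nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return max(distances) / min(distances)

for n_dims in (2, 16, 200):
    print(f"{n_dims:>3} dims: max/min distance ratio = {distance_spread(n_dims):.2f}")
```

In two dimensions the nearest neighbor is many times closer than the farthest point. As the dimension grows the ratio shrinks toward one, which is exactly the regime where a fixed radius stops separating dense regions from sparse ones.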

from sklearn.cluster import DBSCAN

eps_values = [1.0, 1.5, 2.0]
outlier_counts = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=4, n_jobs=-1)
    labels = dbscan.fit_predict(df_scaled)
    n_outliers = np.sum(labels == -1)
    pct = (n_outliers / len(df_scaled)) * 100
    outlier_counts.append(pct)
    print(f"DBSCAN eps={eps}: {n_outliers} outlier ({pct:.2f}%)")

# Output:
# DBSCAN eps=1.0: 87720 outlier (60.31%)
# DBSCAN eps=1.5: 18166 outlier (12.49%)
# DBSCAN eps=2.0:  3423 outlier  (2.35%)

That is the structural problem. You are not making a calibration decision. You are making an arbitrary choice that determines whether your pipeline discards 87,000 rows or 3,400 rows and you have no principled way to defend either number.

The Paradigm Shift: Isolation Over Distance

Isolation Forest does not use distances. It builds an ensemble of random trees: at each node, a tree randomly selects a feature and a split value within that feature’s range. A point is considered anomalous if it gets isolated near the root of the trees, meaning very few splits were needed to separate it from the rest of the data.

This matters because anomalies are by definition rare and different. A truly anomalous point sits in a sparse region of the feature space and is easy to isolate with just a few random cuts. A normal point lives in a dense cluster and requires many cuts to separate. The algorithm exploits this structural property without ever computing a distance.

The practical consequence is that Isolation Forest does not suffer from the concentration of distances that kills DBSCAN in high dimensions. Each split operates on a single feature so the geometric complexity does not scale with the number of dimensions in the same catastrophic way.
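The isolation idea fits in a few lines of plain Python. This is a toy sketch on synthetic Gaussian data, not scikit-learn's implementation: it skips subsampling and score normalization, and the names `path_length` and `mean_path_length` are illustrative. It only demonstrates that a far-away point is separated in fewer random splits than a point in the middle of a cluster.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Follow only the side of the split that still contains the tracked point
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=100, seed=0):
    """Average isolation depth over several independent random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

rng = random.Random(42)
cluster = [tuple(rng.gauss(0.0, 1.0) for _ in range(4)) for _ in range(200)]
inlier = (0.0, 0.0, 0.0, 0.0)    # center of the cluster
outlier = (8.0, 8.0, 8.0, 8.0)   # far from everything else
data = cluster + [inlier, outlier]

print(f"mean splits to isolate the inlier:  {mean_path_length(inlier, data):.1f}")
print(f"mean splits to isolate the outlier: {mean_path_length(outlier, data):.1f}")
```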

from sklearn.ensemble import IsolationForest

# For meteorological data, ~5% of anomalous events is a reasonable estimate
# based on domain knowledge. This is not a magic number: it is a claim
# you can argue in front of a domain expert.
CONTAMINATION = 0.05

iso = IsolationForest(contamination=CONTAMINATION, random_state=42, n_jobs=-1)
iso.fit(df_scaled)

anomaly_scores = iso.decision_function(df_scaled)
predictions = iso.predict(df_scaled)

df['Anomaly_Score'] = anomaly_scores
df['Is_Anomaly'] = (predictions == -1)

Notice what changed conceptually. With DBSCAN you were choosing a geometric radius with no interpretable meaning in 16 dimensions. With Isolation Forest you are choosing a contamination rate, a domain assumption you can state explicitly. You can argue that you expect approximately 5 percent of these observations to be genuine meteorological anomalies. That is a claim you can bring to a domain expert or a code reviewer. An epsilon of 1.5 is not.

The Sensitivity Problem Has Not Disappeared

One thing deserves honesty here: Isolation Forest does not eliminate parameter sensitivity. It relocates it to a space where the sensitivity is at least interpretable.

print("--- Threshold sensitivity in Isolation Forest ---")
for threshold in [-0.10, -0.05, 0.00, 0.05]:
    n = np.sum(anomaly_scores < threshold)
    print(f"  Threshold {threshold:+.2f}: {n} outlier ({(n/len(df))*100:.2f}%)")

# Output:
#   Threshold -0.10:   123 outlier (0.08%)
#   Threshold -0.05:  1405 outlier (0.97%)
#   Threshold +0.00:  7273 outlier (5.00%)
#   Threshold +0.05: 28844 outlier (19.83%)

The range from 123 to 28,844 outliers is still dramatic. The difference from the DBSCAN case is that each of these thresholds maps to a falsifiable claim about the data. Cutting at a threshold of 0.00 corresponds to your 5 percent contamination assumption. Cutting at a lower threshold means you only want to remove the most extreme fractions of a percent. You can debate those percentages with domain knowledge. You cannot debate what a geometric radius means in a 16-dimensional standardized feature space because it does not mean anything you can explain to another human being.
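
The link between the 0.00 threshold and the contamination rate can be checked directly. The sketch below assumes synthetic Gaussian data as a stand-in for the weather features; it relies on scikit-learn's `score_samples`, `decision_function`, and the fitted `offset_` attribute of `IsolationForest`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))            # synthetic stand-in data

contamination = 0.05
iso = IsolationForest(contamination=contamination, random_state=42).fit(X)

raw_scores = iso.score_samples(X)         # lower = more anomalous
shifted = iso.decision_function(X)        # raw score minus the fitted offset

# The offset is the contamination-quantile of the raw training scores
quantile = np.percentile(raw_scores, 100 * contamination)
print(f"offset_ = {iso.offset_:.6f}, quantile = {quantile:.6f}")
print(f"flagged at threshold 0.00: {np.mean(shifted < 0):.2%}")
```

Moving the threshold away from zero is therefore equivalent to picking a different quantile of the score distribution, which is a statement about proportions rather than geometry.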

What the Algorithm Actually Found

The real test of an unsupervised method with no ground truth is whether its outputs make sense to a domain expert. Look at the top anomalies Isolation Forest flagged.

cols_to_show = ['Date', 'Location', 'Rainfall', 'MaxTemp', 'WindGustSpeed', 'Anomaly_Score']
top_5 = df.sort_values('Anomaly_Score').head(5)
print(top_5[cols_to_show].to_string(index=False))

# Output:
#       Date  Location  Rainfall  MaxTemp  WindGustSpeed  Anomaly_Score
# 2011-02-15    Darwin     132.6     24.8           98.0      -0.154950
# 2015-12-24    Darwin     122.8     27.0           80.0      -0.151782
# 2014-01-01   Woomera       0.0     46.8           74.0      -0.151524
# 2011-02-16    Darwin     367.6     25.6           83.0      -0.143283
# 2009-12-12    Darwin     141.2     26.1           94.0      -0.135914

Darwin in February 2011 is not a statistical artifact. That is Cyclone Carlos, which produced record-breaking precipitation across the Northern Territory, with Darwin International Airport recording its highest 24-hour rainfall total in history. Woomera with 46.8 degrees Celsius and wind gusts of 74 km/h is a documented extreme heat event in one of Australia's most arid regions.

The algorithm did not know any of this. It learned the typical joint distribution across 16 variables and flagged the points that were hardest to explain given that distribution. The fact that those points correspond to historically documented extreme events is as close to external validation as you can get without labeled ground truth.

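# Recompute the DBSCAN count at eps=1.5 so this comparison is self-contained
dbscan_labels = DBSCAN(eps=1.5, min_samples=4, n_jobs=-1).fit_predict(df_scaled)
dbscan_outliers = np.sum(dbscan_labels == -1)
iso_outliers = np.sum(df['Is_Anomaly'])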
print(f"DBSCAN (eps=1.5)          -> {dbscan_outliers} outlier ({(dbscan_outliers/len(df))*100:.2f}%)")
print(f"Isolation Forest (c=0.05) -> {iso_outliers} outlier ({(iso_outliers/len(df))*100:.2f}%)")

# Output:
# DBSCAN (eps=1.5)          -> 18166 outlier (12.49%)
# Isolation Forest (c=0.05) ->  7273 outlier  (5.00%)

The two methods disagree by roughly 11,000 rows on the same dataset. Without ground truth labels you cannot say with certainty which one is right. What you can say is which one gives you a number you can stand behind in front of another human being.

Every anomaly detection method requires a human to make a threshold decision. DBSCAN forces you to make that decision in a geometric space that loses interpretability as dimensions grow. Isolation Forest forces you to make it in the space of contamination rates, which is a domain question with a domain answer.

In production you will always be asked to justify your choices. The question is whether you want to justify a geometric radius in a 16-dimensional standardized space or whether you want to justify what proportion of your data you believe to be genuinely anomalous.

One of those conversations is possible. The other is not.

Leakage in ML Pipelines: How to build a bulletproof preprocessing architecture

Pasquale Molinaro — Thu, 14 May 2026 16:23:58 +0000

You build a model, run the evaluations, and hit 95% accuracy on your test set. You deploy it to production feeling like a genius, only to watch it fail miserably on real-world data. We’ve all been there. When a model explodes in production after perfect local testing, the culprit is rarely the algorithm itself. Most of the time, it’s a silent architecture flaw introduced during the very first steps of preprocessing: Data Leakage. In this article, we will see how one of the most common mistakes in handling missing values and oversampling silently corrupts your test data, and how to build a bulletproof, leak-free pipeline using Scikit-Learn.

The common error is preprocessing before splitting
Let’s look at a classic approach to data preparation. If you have a dataset with missing values and a mix of categorical and numerical columns, the most intuitive approach is to clean everything up before feeding it to the model.

You split the dataframes by type, apply a LabelEncoder to the text, and use an Imputer to fill the NaNs. The code usually looks something like this:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


# 1. Splitting categorical and numerical data
df_num = df_copy.select_dtypes(include=[np.number])
df_cat = df_copy.select_dtypes(include=['object'])

# 2. Encoding categorical data
for attr in df_cat.columns:
    df_cat[attr] = LabelEncoder().fit_transform(df_cat[attr])

# 3. Merging back and imputing missing values (omitted for brevity)
df_encoded = pd.concat([df_cat, df_num], axis=1)

# 4. Splitting into Train and Test ONLY at the end
X = df_encoded.drop(['TargetVariable'], axis=1)
y = df_encoded['TargetVariable']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

It looks logical, clean, and perfectly executed. But this code is a ticking time bomb. By applying .fit_transform() on the entire categorical dataframe before calling train_test_split, your encoder has just learned the global distribution of every single category, including the ones that will end up in your test set. If you apply an Imputer to fill missing values using the mean or median of the entire column, your training data is mathematically contaminated by the future testing data.
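
The imputation leak can be seen in a few lines. This is a minimal sketch on a synthetic single-column dataset (an assumption made for illustration); the variable names `leaky` and `clean` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=5.0, size=(1000, 1))
X[rng.random(1000) < 0.1] = np.nan        # knock out roughly 10% of the values

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

leaky = SimpleImputer(strategy="mean").fit(X)         # sees the test rows too
clean = SimpleImputer(strategy="mean").fit(X_train)   # sees training rows only

print(f"fill value learned from all rows:      {leaky.statistics_[0]:.4f}")
print(f"fill value learned from training rows: {clean.statistics_[0]:.4f}")

# Only the imputer fitted on the training rows is applied to the held-out rows
X_test_filled = clean.transform(X_test)
```

The two fill values differ because the first one encodes information from rows the model is never supposed to see before evaluation.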

Your model isn’t predicting, it’s cheating.

The Leak-Proof Architecture

Step 1: Isolate the Test Set Immediately

The golden rule of production-ready machine learning is simple: quarantine your test data before you do anything else. Do not impute, do not encode, do not even look at it. You cut the dataset raw and dirty.

from sklearn.model_selection import train_test_split

# Define your target column name
TARGET_COL = 'target_variable' # e.g., 'RainTomorrow'

# Separate features and target from your raw dataframe
X = df.drop([TARGET_COL], axis=1)
y = df[TARGET_COL]

# 1. Split the raw, dirty data FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To put this in perspective with an illustrative example: a model with data leakage might show a dazzling 98% accuracy during local testing. When you fix the pipeline and evaluate it cleanly, the true baseline accuracy drops to 82%. That gap of 16 percentage points is the illusion of data leakage.

Step 2: Define Independent Transformers
Now that our X_test is safely locked away, we need a way to clean X_train. But instead of manually hacking DataFrames and looping through columns, we define "Transformers". Think of them as blueprints for cleaning data. They don't do anything yet; they just hold the logic.

We will use Scikit-Learn’s ColumnTransformer to handle numerical and categorical columns independently, entirely avoiding messy pd.concat operations and index misalignments.

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Blueprint for numbers: fill missing with median
numeric_transformer = SimpleImputer(strategy='median')

# Blueprint for categories: fill missing with most frequent, then encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Bundle the blueprints together using dynamic selection
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include=object))
    ])

Step 3: The Leak-Proof Pipeline in Action
Now we assemble the final architecture. Notice a crucial detail: we are importing the Pipeline from imblearn.pipeline, not the standard sklearn.pipeline. This is the secret sauce, and it’s where most junior developers get stuck with a TypeError. Why can't we just use the standard Scikit-Learn pipeline?

Under the hood, a standard Scikit-Learn pipeline expects every step (except the final estimator) to have a .transform() method. It assumes you are just modifying existing rows (like scaling numbers or encoding text). However, SMOTE doesn't just transform data; it generates entirely new synthetic rows using a method called .fit_resample(). If you put SMOTE inside a standard Scikit-Learn pipeline, fitting fails with that TypeError. The imblearn pipeline is built to handle this structural difference. It calls .fit_resample() only during training and skips the SMOTE step entirely during .predict() or .score(). By using this architecture, you guarantee that your test set remains pristine and untouched by synthetic data generation.
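
The behavior described above can be sketched without any library. The classes below are hypothetical toys written for this illustration; they are not imblearn's implementation, and `DuplicateMinority` is a naive stand-in for SMOTE that only repeats existing rows.

```python
import numpy as np

class DuplicateMinority:
    """Toy sampler (hypothetical): repeats minority rows until classes balance."""
    def __init__(self):
        self.calls = 0

    def fit_resample(self, X, y):
        self.calls += 1
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        minority_rows = np.flatnonzero(y == minority)
        n_missing = counts.max() - counts.min()
        picks = np.resize(minority_rows, n_missing)   # cycle through minority rows
        return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])

class MajorityClassifier:
    """Toy estimator (hypothetical): always predicts the most common label."""
    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.label_ = classes[np.argmax(counts)]
        self.n_seen_ = len(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

class TinyResamplingPipeline:
    """Resample while fitting, skip the sampler while predicting."""
    def __init__(self, sampler, estimator):
        self.sampler, self.estimator = sampler, estimator

    def fit(self, X, y):
        X_res, y_res = self.sampler.fit_resample(X, y)   # training data only
        self.estimator.fit(X_res, y_res)
        return self

    def predict(self, X):
        return self.estimator.predict(X)                 # sampler is bypassed

X_train = np.arange(20, dtype=float).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)
X_test = np.zeros((3, 2))

pipe = TinyResamplingPipeline(DuplicateMinority(), MajorityClassifier())
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(pipe.estimator.n_seen_, pipe.sampler.calls, len(predictions))  # prints: 16 1 3
```

The estimator trains on 16 rows (10 original plus 6 duplicates), the sampler runs exactly once, and the three test rows pass through without any resampling.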

The Cross-Validation Trap
You might think you are safe if you use Cross-Validation (CV). But if you apply SMOTE or Standardization to your entire X_train before passing it to cross_val_score, you are leaking data across your folds! The beauty of the imblearn pipeline is that you can pass the entire pipeline object directly into Scikit-Learn’s cross_val_score. The framework will strictly apply SMOTE only to the training folds of each iteration, leaving the validation fold completely untouched.

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assemble the ultimate leak-proof pipeline
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42)) 
])

# --- THE CROSS-VALIDATION PROOF ---
# The pipeline guarantees SMOTE is applied strictly to the training folds 
# of each split, never touching the validation folds.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Cross-Validation Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# --- FINAL PRODUCTION EVALUATION ---
# Train everything in a single step on the Train Set ONLY
pipeline.fit(X_train, y_train)

# Evaluate on the Test Set safely. 
accuracy = pipeline.score(X_test, y_test)
print(f"True Production Accuracy: {accuracy:.4f}")

The Leakage Iceberg
While imputation and categorical encoding are the most frequent culprits for junior developers, remember that Data Leakage has many faces. Calculating target-based features (like historical averages), applying a StandardScaler, or running dimensionality reduction or feature selection (like PCA or statistical tests) on your entire dataset before splitting will also silently contaminate your test set. The Golden Rule applies to your entire pipeline: Split first, ask questions later.
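
Feature selection is a good way to see how large this effect can get. The sketch below assumes pure noise features and random labels, so any accuracy above chance is an artifact; the dataset shape and `k=20` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))      # pure noise features
y = rng.integers(0, 2, size=100)      # labels unrelated to X

# Leaky: pick the 20 "best" features using every row, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: the selector is refit inside each training fold
honest = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest, X, y, cv=5).mean()

print(f"selection before CV: {leaky_acc:.2f}")
print(f"selection inside CV: {honest_acc:.2f}")
```

There is no signal in this data, so the honest score hovers around chance while the leaky one looks like a working model.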

Conclusion
Data Leakage is the silent killer of machine learning models. Splitting your raw dataset immediately and wrapping all your preprocessing logic into a strict pipeline isn’t just “clean code”: it is what guarantees that no test-set information reaches training, so your local scores become an honest estimate of production performance. Stop hacking DataFrames with manual .fit_transform() loops, and start engineering bulletproof pipelines. Your future self (and your production server) will thank you.

Here is the full code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder 
from sklearn.pipeline import Pipeline 
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# --- 1. DATA LOADING & INITIAL SPLIT ---
df = pd.read_csv('your_dataset.csv')
TARGET_COL = 'target_column_name'

X = df.drop([TARGET_COL], axis=1)
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. PREPROCESSING LOGIC ---
numeric_transformer = SimpleImputer(strategy='median')

# We use standard sklearn Pipeline here, as no sampling happens at this stage
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include=np.number)),
        ('cat', categorical_transformer, selector(dtype_include=object))
    ])

# --- 3. FINAL PIPELINE ASSEMBLY ---
# Here we MUST use imblearn.pipeline to handle SMOTE safely
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# --- 4. EXECUTION ---
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Final Leak-Free Accuracy: {accuracy:.4f}")