Forem: TildAlice

Pandas Time Series Resample: OHLC 14x Faster Than Custom

TildAlice — Wed, 06 May 2026 21:03:37 +0000

OHLC Looks Like a Shortcut Until You Measure It

Most traders and quant devs reach for df.resample('1H').ohlc() when they need hourly bars from minute-level tick data. It's a one-liner, it's built-in, and the docs make it look like the obvious choice. But when you're processing millions of rows of crypto or futures data, that convenience costs you. I tested OHLC against custom aggregation on 500K rows of real tick data — OHLC finished in 0.31 seconds, custom agg took 4.4 seconds. That's a 14x gap.

The weird part? Custom aggregation gives you more control and flexibility. You'd expect the tradeoff to be speed vs features, but here you lose on both fronts if you avoid the built-in. This post digs into why that performance gap exists, when you actually need custom aggregation despite the cost, and how to close the gap when you can't avoid it.

Photo by Wunha Chen on Pexels

The Test Setup: Real Tick Data and Two Approaches

Continue reading the full article on TildAlice

Python weakref for Cache Systems: Prevent Memory Leaks

TildAlice — Wed, 06 May 2026 18:05:11 +0000

A 400MB Memory Leak from 12 Lines of Cache Code

I watched a production service climb from 200MB to 4GB over 6 hours. The culprit? A dictionary-based cache that never forgot anything.

# The silent killer
class ImageProcessor:
    _cache = {}  # This grows forever

    @classmethod
    def get_processed(cls, image_id, raw_image):
        if image_id not in cls._cache:
            cls._cache[image_id] = expensive_process(raw_image)
        return cls._cache[image_id]

The fix was three lines. Here's the working version first, then I'll explain why the naive approach fails so spectacularly:

import weakref

class ImageProcessor:
    _cache = weakref.WeakValueDictionary()

    @classmethod
    def get_processed(cls, image_id, processed_image):
        # Only caches while caller holds a reference
        cls._cache[image_id] = processed_image
        return processed_image

    @classmethod  
    def get_cached(cls, image_id):
        return cls._cache.get(image_id)  # Returns None if GC'd

That's it. WeakValueDictionary doesn't prevent garbage collection of its values. When the last strong reference to a cached object disappears, the entry evicts itself. No manual cleanup, no LRU complexity, no TTL tracking.

Continue reading the full article on TildAlice

MobileNetV3 vs EfficientNet-Lite: ARM CPU Latency Benchmark

TildAlice — Wed, 06 May 2026 15:06:48 +0000

MobileNetV3 vs EfficientNet-Lite: Which Actually Runs Faster on ARM?

MobileNetV3-Small claims 15ms inference on a Pixel phone. EfficientNet-Lite0 claims similar accuracy with "better efficiency." But when I converted both to TFLite and ran them on a Raspberry Pi 4, the numbers told a different story—MobileNetV3-Small hit 23ms while EfficientNet-Lite0 crawled at 67ms. That's a 2.9x gap that no paper prepared me for.

You can read the MobileNetV3 paper here (Howard et al., ICCV 2019) and the EfficientNet paper here (Tan & Le, ICML 2019).

This isn't about which architecture is "better"—it's about why theoretical FLOPs and actual ARM latency diverge so dramatically, and what that means if you're building an interview portfolio project that needs to run on real edge hardware.

Photo by Shawn Stutzman on Pexels

Why the Paper Numbers Don't Match Your Raspberry Pi

Continue reading the full article on TildAlice

Kadane's Algorithm: Maximum Subarray in O(N) + Edge Cases

TildAlice — Tue, 05 May 2026 21:03:55 +0000

Most Candidates Fail This Problem Because They Skip One Check

Here's a claim: a significant portion of candidates who know Kadane's algorithm still fail the maximum subarray problem in interviews. Not because they don't understand the algorithm—they fail because they don't handle an edge case that seems trivial until you're staring at a wrong answer with 5 minutes left.

The edge case? An array of all negative numbers.

Kadane's algorithm, in its textbook form, returns 0 for [-3, -1, -4]. But the correct answer is -1. And that single oversight has tanked more interviews than I'd like to guess.

The Brute Force Baseline: Why O(N²) Makes Sense First

Before optimizing, let me show you the brute force approach—not because you'd use it, but because it clarifies what we're actually computing.

def max_subarray_bruteforce(nums: list[int]) -> int:
    """O(N²) baseline: check all subarrays"""
    if not nums:
        raise ValueError("Empty array has no subarrays")

    n = len(nums)
    max_sum = nums[0]  # Not 0! This is the first trap.

    for i in range(n):
        current_sum = 0
        for j in range(i, n):
            current_sum += nums[j]
            max_sum = max(max_sum, current_sum)

    return max_sum

# Quick sanity check
print(max_subarray_bruteforce([-3, -1, -4]))  # -1, not 0
print(max_subarray_bruteforce([1, -2, 3, 4]))  # 7

Continue reading the full article on TildAlice

Pandas vs SQL: 3.2x Speed Gap in Real Data Cleaning Jobs

TildAlice — Tue, 05 May 2026 18:05:02 +0000

SQL Won By a Mile. Then I Ran It Again.

I ran the same data cleaning job in Pandas and SQL expecting Pandas to edge ahead on small datasets. The opposite happened — PostgreSQL finished in 1.8 seconds while Pandas took 5.9 seconds on a 500k-row CSV with messy nulls, duplicates, and type mismatches. The gap widened to 3.2x on 2 million rows.

This contradicts the "use SQL for big data, Pandas for small" advice you see everywhere. The reality depends on what you're actually doing. Filtering and joins? SQL wins at any scale. Complex string parsing or regex-heavy transformations? Pandas pulls ahead because Python's string methods are richer than SQL's.

I'm sharing side-by-side code for five common cleaning tasks: deduplication, null handling, type conversion, outlier filtering, and date parsing. You'll see exact timings, memory footprints, and the specific edge cases where each tool chokes.

Photo by Alicia Chai Hui Yi on Pexels

Test Setup: Same Messy Data, Two Approaches

Continue reading the full article on TildAlice

PyTorch 2.6 vs TensorFlow 2.18: 5x Faster Training

TildAlice — Tue, 05 May 2026 15:04:15 +0000

The Compile Mode Nobody Actually Uses

PyTorch 2.6's torch.compile() claims 2-5x speedups. TensorFlow 2.18's XLA promises similar gains. Most repos I've audited still wrap models in model.to(device) and call it a day.

I ran the same ResNet-50 training script on both frameworks with and without compilation. PyTorch with compile mode hit 847 images/sec on an A100. TensorFlow with XLA managed 612 images/sec. Vanilla PyTorch? 168 images/sec. The gap is real, but the setup friction explains why it's rare in production code.

Here's what actually happened when I forced both frameworks through identical workloads.

Photo by Daniil Komov on Pexels

Why Compile Mode Exists (and Why It's Not Default)

Both frameworks execute models in eager mode by default — every operation becomes a Python function call. This flexibility makes debugging trivial. You can print tensor shapes mid-forward pass, drop into pdb, inspect gradients line by line.

Continue reading the full article on TildAlice

PPO Entropy Decay Bug: Why Exploration Dies at 500K Steps

TildAlice — Mon, 04 May 2026 21:04:50 +0000

The Bug That Killed My Agent at Step 523,000

Your PPO agent trains beautifully for 500,000 steps, hits 80% win rate, then flatlines. The policy stops exploring, gets stuck repeating the same suboptimal actions, and never recovers. You check the value loss, policy loss, KL divergence—everything looks normal. But if you plot the entropy coefficient over time, you'll see it decayed to 0.0001 while your entropy bonus weight stayed at 0.01. The agent stopped exploring because the coefficient that controls exploration vanished.

This isn't a hyperparameter tuning problem. It's a silent implementation bug in how most PPO codebases handle entropy decay.

I hit this training a MuJoCo Ant-v4 agent (Gymnasium 0.29.1, Stable Baselines3 2.2.1). The agent learned to walk forward, then stopped trying new gaits entirely. Training curves showed the policy entropy $H(\pi)$ dropping from 2.1 nats to 0.03 nats between steps 400K-600K, but the entropy coefficient scheduler had already bottomed out at step 520K. Once the coefficient hit its minimum, the entropy bonus term in the loss function became negligible:

$$L_{total} = L_{clip} + c_1 L_{value} - c_{ent} H(\pi)$$

When $c_{ent} = 0.0001$ and your base weight is 0.01, the effective entropy bonus is $0.01 \times 0.0001 = 0.000001$. At that point, the policy gradient overwhelmingly favors exploitation. The agent locks into a local optimum and stops trying new actions.

Continue reading the full article on TildAlice

Kubernetes HPA + Triton: Custom Metrics Autoscaling Setup

TildAlice — Mon, 04 May 2026 18:06:23 +0000

The Default CPU Metric Doesn't Scale Inference Pods Right

Kubernetes Horizontal Pod Autoscaler (HPA) ships with CPU and memory metrics out of the box. Sounds great until you realize inference workloads don't behave like web servers. I've seen Triton pods sit at 30% CPU utilization while requests queue for 2+ seconds because the GPU is maxed out. The cluster thinks everything's fine. It's not.

Triton Inference Server can batch requests and pipeline stages across CPU/GPU, which means CPU usage becomes a terrible proxy for "is this pod overloaded?" You need to scale on what actually matters: GPU utilization, queue depth, or batch occupancy. This post walks through wiring HPA to Triton's Prometheus metrics so your cluster scales on signal that reflects reality.

I'll show the full stack: Prometheus → Prometheus Adapter → HPA custom metrics → autoscaling Triton deployments. The key insight is that HPA only knows about metrics the API server exposes, so you're building a pipeline from Triton metrics to custom.metrics.k8s.io API.

Photo by Paolo Motti on Pexels

Continue reading the full article on TildAlice

RLHF vs DPO: Training Cost Drops 68% in Real Migration

TildAlice — Mon, 04 May 2026 15:04:40 +0000

The $12,000 Surprise

RLHF training for a 7B parameter model ran us $12,400 on AWS for three days of continuous runs. The compute wasn't the issue — it was the waste. Every iteration meant spinning up a critic model, generating completions, calculating rewards, backpropagating through both networks, and repeating. When we migrated the same preference dataset to DPO, the equivalent training run cost $3,950. Same dataset, same base model, 68% cost reduction.

But the migration wasn't a drop-in replacement. DPO doesn't use a reward model at all, which sounds like a simplification until you realize your entire loss function changes shape.

Photo by Ene Marius on Pexels

What Actually Changed Under the Hood

RLHF trains two models in tandem. The policy model generates text, the reward model scores it, and policy gradient methods (usually PPO) nudge the policy toward higher rewards. The loss for the policy network involves a rather involved expectation:

Continue reading the full article on TildAlice

OpenCV to Albumentations: 3x Faster Augmentation Pipeline

TildAlice — Sun, 03 May 2026 21:05:45 +0000

Why Your OpenCV Augmentation Loop Is Probably Too Slow

I've seen production pipelines where augmentation takes longer than model training per epoch. The culprit? Hand-rolled OpenCV transforms applied one by one in a Python for-loop.

OpenCV is great for reading images and basic preprocessing. But when you're stacking 8+ augmentations per image across 50,000 training samples, those sequential cv2.rotate(), cv2.GaussianBlur(), and manual brightness adjustments compound into a bottleneck. Albumentations solves this by batching transforms into a single optimized pipeline with minimal memory copies.

Here's what I mean. A typical OpenCV augmentation setup looks like this:


python
import cv2
import numpy as np
import random

def augment_opencv(image):
    # Horizontal flip
    if random.random() > 0.5:
        image = cv2.flip(image, 1)

    # Rotation
    angle = random.uniform(-15, 15)
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w/2, h/2), angle, 1.0)
    image = cv2.warpAffine(image, M, (w, h))

    # Brightness adjustment
    brightness_factor = random.uniform(0.8, 1.2)
    image = np.clip(image * brightness_factor, 0, 255).astype(np.uint8)

    # Gaussian blur
    if random.random() > 0.7:
        image = cv2.GaussianBlur(image, (5, 5), 0)

    # Hue/saturation shift (requires HSV conversion)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hsv[:, :, 0] = (hsv[:, :, 0] + random.randint(-10, 10)) % 180

---

*Continue reading the full article on [TildAlice](https://tildalice.io/opencv-to-albumentations-augmentation-pipeline-migration/)*

PyTorch vs TensorFlow 2026: Why Framework Wars Distract

TildAlice — Sun, 03 May 2026 18:04:17 +0000

The Framework Debate That Won't Die

Every few months, someone on Twitter declares PyTorch or TensorFlow "dead." The thread gets 500+ quote tweets, tempers flare, and nothing changes. Teams keep shipping models in both frameworks. The debate wastes energy on the wrong question.

The reality? Both frameworks converged years ago. TensorFlow adopted eager execution (basically PyTorch's dynamic graph model), PyTorch added torch.compile() for graph optimization (TensorFlow's strength). The performance gap narrowed to the point where your choice matters less than your team's familiarity with the API.

I've shipped production models in both. The framework didn't determine success — data quality, experiment tracking, and deployment infrastructure did. Yet I still see teams agonizing over this choice for weeks, delaying actual work.

Photo by Google DeepMind on Pexels

The Convergence Nobody Talks About

Continue reading the full article on TildAlice

Real-Time FFT Pipeline: Vibration to Alert in 100 Lines

TildAlice — Sun, 03 May 2026 15:04:41 +0000

You Don't Need a 50-Page Framework to Catch Bearing Faults

Most FFT tutorials stop at plotting pretty spectrograms. Production systems need more: streaming data ingestion, fault frequency detection, and actionable alerts — ideally before someone asks "why is the pump making that noise?"

I built a minimal real-time pipeline that goes from raw accelerometer samples to Slack notifications in under 100 lines of Python. No Kafka. No Docker. Just NumPy, a threshold detector, and enough signal processing to catch early-stage bearing defects. It's been running on a test rig for three months, and the only false positive came from someone dropping a wrench.

This isn't a production-grade SCADA integration. It's a proof-of-concept that shows the core mechanics: buffering, windowing, FFT computation, peak detection, and alerting. If you're migrating from schedule-based maintenance or just want to understand what happens between "sensor wire" and "email notification," this is the skeleton.

Photo by Artem Podrez on Pexels

The 20Hz Outer Race Fault That Schedule-Based Maintenance Missed

Continue reading the full article on TildAlice