<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhinav Srivastav</title>
    <description>The latest articles on Forem by Abhinav Srivastav (@abhinav_srivastav_5a243fd).</description>
    <link>https://forem.com/abhinav_srivastav_5a243fd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3678495%2F63f66254-8a07-42c1-bd46-f0855b93a105.jpg</url>
      <title>Forem: Abhinav Srivastav</title>
      <link>https://forem.com/abhinav_srivastav_5a243fd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abhinav_srivastav_5a243fd"/>
    <language>en</language>
    <item>
      <title>What Actually Slows Down PyTorch Training? I Surveyed ML Engineers</title>
      <dc:creator>Abhinav Srivastav</dc:creator>
      <pubDate>Thu, 25 Dec 2025 15:50:22 +0000</pubDate>
      <link>https://forem.com/abhinav_srivastav_5a243fd/what-actually-slows-down-pytorch-training-i-surveyed-ml-engineers-2m8j</link>
      <guid>https://forem.com/abhinav_srivastav_5a243fd/what-actually-slows-down-pytorch-training-i-surveyed-ml-engineers-2m8j</guid>
      <description>&lt;p&gt;I surveyed ML engineers about their training bottlenecks. The results were eye-opening.&lt;/p&gt;

&lt;h2&gt;The Setup&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;69% work with NLP/Transformers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;84% have training runs lasting 1+ hours&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Most train on local GPUs (77%) or multi-GPU setups (46%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't quick experiments; performance matters.&lt;/p&gt;

&lt;h2&gt;The Problems&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Top 3 pain points:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GPU Out-of-Memory: 62%&lt;/strong&gt; - The nightmare scenario&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow dataloader: 39%&lt;/strong&gt; - Classic CPU bottleneck
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low GPU utilization: 31%&lt;/strong&gt; - Expensive GPU sitting idle&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;77% of engineers hit OOM errors at least occasionally.&lt;/strong&gt; This isn't rare; it's a regular frustration.&lt;/p&gt;

&lt;h2&gt;The Real Issue&lt;/h2&gt;

&lt;p&gt;When I asked what was slowing down their training:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;46% said "I don't know"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let that sink in. Nearly half of the engineers surveyed can't identify their own bottleneck.&lt;/p&gt;

&lt;p&gt;Another 31% pointed to the forward pass, the dataloader, or batch-size issues. Only 8% had it figured out.&lt;/p&gt;

&lt;h2&gt;Current Tools Aren't Enough&lt;/h2&gt;

&lt;p&gt;What people use today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch Profiler (55%)&lt;/li&gt;
&lt;li&gt;TensorBoard (45%)&lt;/li&gt;
&lt;li&gt;Custom print statements (36%)&lt;/li&gt;
&lt;li&gt;WandB (27%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem? These tools fall into two camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Heavy profilers&lt;/strong&gt; (PyTorch Profiler): great detail, but 10-50% overhead (a typical usage sketch follows below)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate monitoring&lt;/strong&gt; (TensorBoard, WandB): shows overall metrics, not layer-level bottlenecks&lt;/li&gt;
&lt;/ul&gt;
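
&lt;p&gt;For context, a typical PyTorch Profiler run looks roughly like the sketch below (the model and batch are placeholders for your own training step); instrumenting every op is what produces the detail, and also the overhead:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch; swap in your own training step.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
batch = torch.randn(64, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Record a handful of steps with shape information.
with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        loss = model(batch).sum()
        loss.backward()

# Aggregate table of the most expensive ops.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
&lt;/code&gt;&lt;/pre&gt;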

&lt;h2&gt;What Engineers Actually Want&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Most requested features:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slowest layers and timings: 58%&lt;/li&gt;
&lt;li&gt;Per-layer memory usage: 50%
&lt;/li&gt;
&lt;li&gt;CPU/GPU utilization: 50%&lt;/li&gt;
&lt;li&gt;Dataloader breakdown: 42%&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pattern is clear: &lt;strong&gt;people want layer-level visibility without killing performance.&lt;/strong&gt;&lt;/p&gt;
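
&lt;p&gt;To make "layer-level visibility" concrete, here is a rough hook-based sketch (my own illustration, not TraceML's internals) that accumulates forward-pass time per leaf module:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import torch
import torch.nn as nn

layer_times = {}

def attach_timers(model):
    # Record wall-clock time around each leaf module's forward pass.
    # Note: on GPU, call torch.cuda.synchronize() in the hooks for accurate numbers.
    def pre_hook(module, inputs):
        module._t0 = time.perf_counter()

    def post_hook(module, inputs, output):
        elapsed = time.perf_counter() - module._t0
        name = module.__class__.__name__
        layer_times[name] = layer_times.get(name, 0.0) + elapsed

    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_forward_pre_hook(pre_hook)
            module.register_forward_hook(post_hook)

# Placeholder model and input; replace with your own network.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
attach_timers(model)
model(torch.randn(32, 256))

for name, total in sorted(layer_times.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total * 1000:.2f} ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same pattern extends to the backward pass with backward hooks; the hard part is keeping the bookkeeping cheap enough to leave on for an entire run.&lt;/p&gt;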

&lt;h2&gt;Why I Built TraceML&lt;/h2&gt;

&lt;p&gt;The gap is obvious. We need lightweight, always-on profiling that shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which layers are slow (forward and backward)
&lt;/li&gt;
&lt;li&gt;Real-time updates during training
&lt;/li&gt;
&lt;li&gt;Minimal overhead (1-2% measured on NVIDIA T4)
&lt;/li&gt;
&lt;li&gt;Layer-level memory tracking (sketched below)&lt;/li&gt;
&lt;/ul&gt;
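
&lt;p&gt;The memory side can be sketched the same way. Assuming a CUDA device, leaf-module hooks can sample torch.cuda.memory_allocated() around each forward call (again an illustration with my own naming, not TraceML's actual mechanism):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

def attach_memory_hooks(model):
    # Record CUDA memory growth across each leaf module's forward pass.
    mem_deltas = {}

    def make_hooks(name):
        def pre_hook(module, inputs):
            module._mem_before = torch.cuda.memory_allocated()

        def post_hook(module, inputs, output):
            grown = torch.cuda.memory_allocated() - module._mem_before
            mem_deltas[name] = mem_deltas.get(name, 0) + grown

        return pre_hook, post_hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            pre, post = make_hooks(name)
            module.register_forward_pre_hook(pre)
            module.register_forward_hook(post)
    return mem_deltas

# Placeholder model and input; replace with your own.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
deltas = attach_memory_hooks(model)
model(torch.randn(128, 1024, device="cuda"))

for name, grown in deltas.items():
    print(f"{name}: {grown / 1e6:.1f} MB allocated during forward")
&lt;/code&gt;&lt;/pre&gt;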

&lt;p&gt;The dashboard shows you in real-time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which layer takes 40% of your training time&lt;/li&gt;
&lt;li&gt;Whether your dataloader is actually the bottleneck (a quick manual check is sketched below)&lt;/li&gt;
&lt;li&gt;Where to optimize first&lt;/li&gt;
&lt;/ul&gt;
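
&lt;p&gt;The dataloader question in particular has a quick manual check: time the batch fetch and the training step separately for a few iterations. A minimal sketch, assuming an ordinary DataLoader and a standard step (all names here are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset, model, and optimizer; substitute your own.
dataset = TensorDataset(torch.randn(10_000, 256), torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2)
model = torch.nn.Linear(256, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

fetch_time, compute_time = 0.0, 0.0
t0 = time.perf_counter()
for inputs, targets in loader:
    t1 = time.perf_counter()
    fetch_time += t1 - t0   # time spent waiting on the dataloader

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # On GPU, call torch.cuda.synchronize() before each timestamp for accuracy.
    t0 = time.perf_counter()
    compute_time += t0 - t1  # time spent in forward/backward/step

print(f"dataloader wait: {fetch_time:.1f}s, compute: {compute_time:.1f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the fetch side dominates, the dataloader really is your bottleneck; if the compute side dominates, it isn't.&lt;/p&gt;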

&lt;p&gt;No guessing. Just data.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/traceopt-ai/traceml/" rel="noopener noreferrer"&gt;https://github.com/traceopt-ai/traceml/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've ever wondered why your training is slow, or have hit mysterious OOM errors, give it a try. I'd love your feedback.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star on GitHub&lt;/strong&gt; if you find it useful&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
