<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Benjamin Pires</title>
    <description>The latest articles on Forem by Benjamin Pires (@benjamin_pires_59127eddff).</description>
    <link>https://forem.com/benjamin_pires_59127eddff</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920536%2F44c1c812-ebf6-4535-a45d-e5a76889e5e0.png</url>
      <title>Forem: Benjamin Pires</title>
      <link>https://forem.com/benjamin_pires_59127eddff</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/benjamin_pires_59127eddff"/>
    <language>en</language>
    <item>
      <title>On-Device Pose Estimation on iOS: What Actually Works in Production (Not Just Research Papers)</title>
      <dc:creator>Benjamin Pires</dc:creator>
      <pubDate>Sun, 10 May 2026 22:40:29 +0000</pubDate>
      <link>https://forem.com/benjamin_pires_59127eddff/on-device-pose-estimation-on-ios-what-actually-works-in-production-not-just-research-papers-48ma</link>
      <guid>https://forem.com/benjamin_pires_59127eddff/on-device-pose-estimation-on-ios-what-actually-works-in-production-not-just-research-papers-48ma</guid>
      <description>&lt;p&gt;Research papers on pose estimation show impressive accuracy numbers. Production apps on consumer devices tell a different story. Here's what I learned shipping real-time pose estimation to thousands of users across 22 sports.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://sportsreflector.com/" rel="noopener noreferrer"&gt;SportsReflector&lt;/a&gt;, an AI coaching app that analyzes athletic form using Apple's Vision framework on-device. The app runs pose estimation at 30fps during live sessions and frame-by-frame during video analysis. This article covers the gap between what the documentation promises and what actually works when real users point their iPhones at themselves in gyms, courts, and living rooms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Model Options on iOS&lt;/strong&gt;&lt;br&gt;
Apple gives you three paths for pose estimation on iOS:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VNDetectHumanBodyPoseRequest (Vision framework)&lt;/strong&gt;&lt;br&gt;
Extracts 19 body points. Runs on the Neural Engine. No custom model needed. This is what most developers should use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CreateML trained custom model&lt;/strong&gt;&lt;br&gt;
Train your own pose model with labeled data. More control over which points you detect. Requires training data which is expensive to create for sports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party CoreML models (MoveNet, BlazePose, PoseNet)&lt;/strong&gt;&lt;br&gt;
Converted from TensorFlow/PyTorch. More keypoints (33 with BlazePose vs 19 with Vision). Harder to optimize for Neural Engine. Often slower than Apple's native model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I chose and why&lt;/strong&gt;&lt;br&gt;
Apple's VNDetectHumanBodyPoseRequest. The 19 keypoints are sufficient for scoring form in every sport I've tested. The performance is dramatically better than converted third-party models because Apple optimized it specifically for their Neural Engine silicon. On an iPhone 13 or newer, single-frame inference is 8-12ms — fast enough for 60fps processing with headroom.&lt;/p&gt;

&lt;p&gt;The third-party models give you more keypoints but at 2-3x the inference cost. For a research project where accuracy matters more than speed, use BlazePose. For a consumer app where smooth real-time performance matters more than marginal accuracy, use Apple's native model.&lt;/p&gt;
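
&lt;p&gt;For reference, here's a minimal sketch of driving that request with the Vision framework; the &lt;code&gt;handlePose&lt;/code&gt; function and the confidence check are placeholders, not code from the app:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import Vision
import CoreVideo

// Minimal sketch: run Apple's body-pose request on one camera frame.
// `handlePose` is a placeholder for whatever the pipeline does next.
func detectPose(in pixelBuffer: CVPixelBuffer) {
    let request = VNDetectHumanBodyPoseRequest { request, _ in
        guard let observation = request.results?.first as? VNHumanBodyPoseObservation,
              let points = try? observation.recognizedPoints(.all) else { return }
        // Up to 19 joints, normalized coordinates, one confidence value per joint.
        handlePose(points)
    }
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
    try? handler.perform([request])
}

func handlePose(_ points: [VNHumanBodyPoseObservation.JointName: VNRecognizedPoint]) {
    if let wrist = points[.rightWrist], wrist.confidence &amp;gt; 0.5 {
        print("right wrist at \(wrist.location)")
    }
}
&lt;/code&gt;&lt;/pre&gt;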

&lt;p&gt;&lt;strong&gt;What the Documentation Doesn't Tell You&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Confidence scores vary wildly by body position&lt;/strong&gt;&lt;br&gt;
Apple's pose estimation returns a confidence score (0.0-1.0) for each keypoint. The documentation suggests filtering points below 0.5 confidence. In practice, this threshold is too aggressive for many athletic movements.&lt;/p&gt;

&lt;p&gt;During a squat at the bottom position, hip keypoints regularly drop to 0.3-0.4 confidence because the thighs occlude the hip joint from the camera's perspective. During a boxing combination, the rear hand drops below 0.3 when it's behind the torso. During a tennis serve at trophy position, the tossing arm's wrist confidence drops when it crosses behind the head.&lt;/p&gt;

&lt;p&gt;The fix was sport-specific confidence thresholds. For squats, I accept hip keypoints down to 0.2 and interpolate position from adjacent frames when confidence is low. For boxing, I track the rear hand's last known position and predict its current position when occluded. For tennis, I use temporal smoothing across 5-frame windows to maintain continuity through low-confidence phases.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: don't use a single confidence threshold globally. Calibrate per body part and per activity type. The wrist needs different thresholds than the shoulder, and both need different thresholds during a squat vs a sprint.&lt;/p&gt;
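
&lt;p&gt;A minimal sketch of what per-activity, per-joint thresholds can look like as a lookup table; the numbers are illustrative, not the app's calibrated values:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import Vision

// Sketch of per-activity, per-joint confidence thresholds as a lookup table.
// The numbers are illustrative, not the calibrated values from the app.
enum Activity { case squat, boxing, tennisServe }

struct ConfidenceThresholds {
    let fallback: Float
    let perJoint: [VNHumanBodyPoseObservation.JointName: Float]

    func threshold(for joint: VNHumanBodyPoseObservation.JointName) -&amp;gt; Float {
        perJoint[joint] ?? fallback
    }
}

let thresholds: [Activity: ConfidenceThresholds] = [
    .squat:       ConfidenceThresholds(fallback: 0.5, perJoint: [.leftHip: 0.2, .rightHip: 0.2]),
    .boxing:      ConfidenceThresholds(fallback: 0.5, perJoint: [.leftWrist: 0.3, .rightWrist: 0.3]),
    .tennisServe: ConfidenceThresholds(fallback: 0.5, perJoint: [.leftWrist: 0.3, .rightWrist: 0.3])
]

// Accept or reject a detected point for the current activity.
func isUsable(_ point: VNRecognizedPoint,
              joint: VNHumanBodyPoseObservation.JointName,
              during activity: Activity) -&amp;gt; Bool {
    guard let t = thresholds[activity] else { return false }
    return point.confidence &amp;gt;= t.threshold(for: joint)
}
&lt;/code&gt;&lt;/pre&gt;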

&lt;p&gt;&lt;strong&gt;Problem 2: Camera angle dramatically affects accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The documentation shows pose estimation working with a straight-on camera view. Users don't set up cameras at optimal angles. They prop their phone against a water bottle on the gym floor. They lean it against a wall at a 30-degree angle. Their training partner holds it at head height. Their tripod is behind them.&lt;/p&gt;

&lt;p&gt;Accuracy degrades significantly at angles beyond 30 degrees from perpendicular. Side views work well for sagittal-plane movements (squats, deadlifts, running gait). Front views work for frontal-plane movements (lateral raises, jumping jacks). But most users don't know which angle to use for which exercise.&lt;/p&gt;

&lt;p&gt;The fix was adding a camera setup guide that shows the user exactly where to place their phone for each exercise type, plus an automatic angle quality check that warns users when their camera angle is suboptimal before they start recording.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: never assume optimal camera placement. Your users will find every possible bad angle. Build angle detection into your pipeline and guide users toward good placement before processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problem 3: Lighting conditions in gyms are terrible&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research papers evaluate pose estimation under controlled lighting. Gyms have mixed lighting — overhead fluorescents, natural light from windows, mirror reflections creating secondary light sources, and dark corners near cable machines.&lt;/p&gt;

&lt;p&gt;Low-light conditions cause two problems. Frame noise reduces keypoint confidence. And auto-exposure adjustments cause momentary brightness shifts that confuse frame-to-frame tracking.&lt;/p&gt;

&lt;p&gt;The fix was pre-processing frames with adaptive histogram equalization before feeding them to the pose estimator, plus locking camera exposure after the initial setup phase to prevent mid-recording brightness shifts.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: test your pose estimation in the worst lighting you can find. If it works under a single dim bulb with mirror reflections, it'll work everywhere.&lt;/p&gt;
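
&lt;p&gt;The exposure lock itself is standard AVFoundation. A minimal sketch (the histogram-equalization pre-processing happens earlier in the frame pipeline and isn't shown here):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import AVFoundation

// Sketch of locking exposure once setup is done, so auto-exposure doesn't
// shift brightness mid-recording. The histogram-equalization step happens
// earlier in the frame pipeline and isn't shown here.
func lockExposure(on device: AVCaptureDevice) {
    guard device.isExposureModeSupported(.locked) else { return }
    do {
        try device.lockForConfiguration()
        device.exposureMode = .locked
        device.unlockForConfiguration()
    } catch {
        print("Could not lock exposure: \(error)")
    }
}
&lt;/code&gt;&lt;/pre&gt;
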
&lt;p&gt;&lt;strong&gt;Problem 4: Clothing and equipment cause occlusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Loose clothing (baggy gym shorts, hoodies, wide-leg joggers) hides joint positions. The pose estimator can't see the knee through baggy shorts, so it guesses — often incorrectly.&lt;/p&gt;

&lt;p&gt;Equipment compounds this. A barbell across the shoulders during a squat occludes the neck and upper back keypoints. A tennis racket in the hand confuses wrist detection. Boxing gloves change hand proportions that the model expects.&lt;/p&gt;

&lt;p&gt;There's no clean fix for this. The mitigations are temporal smoothing (use the trajectory of the joint over multiple frames to predict position during occlusion), anatomical constraints (the knee can't be above the hip during a squat, so cap estimates to physiologically possible ranges), and user guidance (suggest form-fitting clothing in the onboarding flow).&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: your pose estimation will fail on some percentage of users due to clothing and equipment. Build graceful degradation — lower confidence scores should trigger warnings to the user rather than wildly incorrect analysis.&lt;/p&gt;
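
&lt;p&gt;To make the temporal-smoothing mitigation concrete, here's a minimal sketch of exponential smoothing for a single joint; the smoothing factor and confidence cutoff are illustrative rather than the app's tuned values:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import Vision
import CoreGraphics

// Sketch of exponential smoothing for one joint, used to carry a plausible
// position through brief occlusions. Alpha and the cutoff are illustrative.
struct SmoothedJoint {
    private(set) var position: CGPoint = .zero
    private var hasValue = false
    let alpha: CGFloat = 0.6          // weight given to the newest measurement
    let minConfidence: Float = 0.2

    mutating func update(with point: VNRecognizedPoint) {
        // Below the cutoff, keep the last estimate instead of trusting a guess.
        guard point.confidence &amp;gt;= minConfidence else { return }
        if hasValue {
            position = CGPoint(x: alpha * point.location.x + (1 - alpha) * position.x,
                               y: alpha * point.location.y + (1 - alpha) * position.y)
        } else {
            position = point.location
            hasValue = true
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
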
&lt;p&gt;&lt;strong&gt;Problem 5: Multiple people in frame&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gym environments frequently have other people in the background. The pose estimator detects all of them. If you're not careful, your analysis might score the person walking behind your user instead of your user.&lt;/p&gt;

&lt;p&gt;The fix was implementing a "primary subject" tracking system. On the first frame, detect all bodies and select the largest (closest to camera) as the primary subject. Track that subject's position across frames using centroid tracking. Ignore all other detected bodies. If the primary subject disappears (walks out of frame), pause analysis and prompt the user to reposition.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: always implement subject isolation if your app runs in environments with multiple people. Single-person pose estimation in multi-person environments is a pipeline problem, not a model problem.&lt;/p&gt;
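
&lt;p&gt;A sketch of what that selection can look like; the bounding-box-area heuristic and the lack of a jump limit are simplifications, not the app's exact logic:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import Vision
import CoreGraphics

// Sketch of primary-subject selection: largest body on the first frame,
// then nearest centroid on later frames. Coordinates are Vision's
// normalized space; no jump limit or re-identification is shown.
final class PrimarySubjectTracker {
    private var lastCentroid: CGPoint?

    func primary(in observations: [VNHumanBodyPoseObservation]) -&amp;gt; VNHumanBodyPoseObservation? {
        var candidates: [(body: VNHumanBodyPoseObservation, centroid: CGPoint, area: CGFloat)] = []
        for body in observations {
            guard let points = try? body.recognizedPoints(.all), !points.isEmpty else { continue }
            let xs = points.values.map { $0.location.x }
            let ys = points.values.map { $0.location.y }
            let centroid = CGPoint(x: xs.reduce(0, +) / CGFloat(xs.count),
                                   y: ys.reduce(0, +) / CGFloat(ys.count))
            let area = (xs.max()! - xs.min()!) * (ys.max()! - ys.min()!)   // proxy for "closest to camera"
            candidates.append((body: body, centroid: centroid, area: area))
        }
        guard !candidates.isEmpty else { lastCentroid = nil; return nil }

        let chosen: (body: VNHumanBodyPoseObservation, centroid: CGPoint, area: CGFloat)
        if let last = lastCentroid {
            chosen = candidates.min { distance($0.centroid, last) &amp;lt; distance($1.centroid, last) }!
        } else {
            chosen = candidates.max { $0.area &amp;lt; $1.area }!
        }
        lastCentroid = chosen.centroid
        return chosen.body
    }

    private func distance(_ a: CGPoint, _ b: CGPoint) -&amp;gt; CGFloat {
        let dx = a.x - b.x
        let dy = a.y - b.y
        return (dx * dx + dy * dy).squareRoot()
    }
}
&lt;/code&gt;&lt;/pre&gt;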

&lt;p&gt;&lt;strong&gt;Performance Optimization That Actually Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frame skipping for battery life&lt;/strong&gt;&lt;br&gt;
Running pose estimation at 30fps continuously drains battery fast. For live AR feedback, you need every frame. For video analysis (deferred path), you don't.&lt;/p&gt;

&lt;p&gt;For deferred analysis, I process every 3rd frame (10fps effective) for an initial pass, then re-process key frames (phase transitions, the lowest position in a squat, peak extension in a serve) at full resolution. This reduces processing time by 60% with negligible accuracy loss for scoring.&lt;/p&gt;

&lt;p&gt;For live AR during workouts, I dynamically adjust processing frequency based on battery level. Above 50% battery, process every frame. Below 50%, process every 2nd frame. Below 20%, process every 3rd frame and show a battery warning.&lt;/p&gt;
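
&lt;p&gt;A sketch of that battery-aware stride; the thresholds mirror the ones above, and battery monitoring is assumed to be enabled once at session start:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import UIKit

// Sketch of the battery-aware stride described above. UIDevice battery
// monitoring must be enabled once (e.g. at session start) for
// batteryLevel to return a real value.
func frameStride() -&amp;gt; Int {
    let level = UIDevice.current.batteryLevel   // -1.0 when unknown (e.g. Simulator)
    if level &amp;lt; 0 { return 1 }      // unknown: don't throttle
    if level &amp;lt; 0.2 { return 3 }    // below 20%: every 3rd frame (plus a warning elsewhere)
    if level &amp;lt; 0.5 { return 2 }    // below 50%: every 2nd frame
    return 1                         // above 50%: every frame
}

// In the capture callback: process the frame only when
// frameIndex % frameStride() == 0.
&lt;/code&gt;&lt;/pre&gt;
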
&lt;p&gt;&lt;strong&gt;Memory management during long sessions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 60-minute workout session at 30fps generates 108,000 frames. Storing pose data for every frame would consume hundreds of megabytes. The fix is streaming analysis — process each frame, extract metrics, store only the aggregate metrics (per-rep scores, phase timestamps, anomaly flags), and discard the raw pose data.&lt;/p&gt;

&lt;p&gt;For video analysis where users want to replay with skeleton overlay, store keypoints for key frames only (phase transitions, anomalies) and interpolate between them during playback.&lt;/p&gt;
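
&lt;p&gt;A sketch of keyframe-only storage with linear interpolation at playback time; the types are illustrative, not the app's actual data model:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import CoreGraphics
import Foundation

// Sketch of keyframe-only storage with linear interpolation for playback.
// Joint names are plain strings here for brevity.
struct StoredKeyframe {
    let time: TimeInterval
    let joints: [String: CGPoint]
}

func interpolatedJoints(at t: TimeInterval,
                        between a: StoredKeyframe,
                        and b: StoredKeyframe) -&amp;gt; [String: CGPoint] {
    guard b.time &amp;gt; a.time else { return a.joints }
    let f = CGFloat((t - a.time) / (b.time - a.time))
    var result: [String: CGPoint] = [:]
    for (name, start) in a.joints {
        guard let end = b.joints[name] else { continue }
        result[name] = CGPoint(x: start.x + (end.x - start.x) * f,
                               y: start.y + (end.y - start.y) * f)
    }
    return result
}
&lt;/code&gt;&lt;/pre&gt;
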
&lt;p&gt;&lt;strong&gt;Neural Engine vs GPU vs CPU&lt;/strong&gt;&lt;br&gt;
Apple's Neural Engine is fastest for pose estimation but isn't always available — it's shared with other system processes. The Vision framework automatically falls back to GPU or CPU when the Neural Engine is busy.&lt;/p&gt;

&lt;p&gt;The problem: inference time varies 3-5x between Neural Engine (8ms) and CPU fallback (30-40ms). If your real-time pipeline assumes consistent 8ms inference, CPU fallback frames cause visible stuttering in the AR overlay.&lt;/p&gt;

&lt;p&gt;The fix was building a frame budget system that measures actual inference time per frame and dynamically adjusts overlay rendering complexity. Fast inference: full skeleton with joint angles and color-coded feedback. Slow inference: simplified skeleton with key joints only. The user sees smooth animation regardless of which processor handles inference.&lt;/p&gt;
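
&lt;p&gt;A sketch of the timing half of such a frame-budget check; the 20ms cutoff is illustrative, and the overlay switch itself is left to the renderer:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import QuartzCore

// Sketch of a per-frame budget check: time the inference and fall back to a
// simpler overlay when it ran slow (likely a CPU or GPU fallback frame).
// The 20ms cutoff is illustrative.
enum OverlayDetail { case full, simplified }

func timedInference(_ work: () throws -&amp;gt; Void) rethrows -&amp;gt; OverlayDetail {
    let start = CACurrentMediaTime()
    try work()                                   // perform the Vision request here
    let elapsedMs = (CACurrentMediaTime() - start) * 1000
    return elapsedMs &amp;lt;= 20 ? .full : .simplified
}
&lt;/code&gt;&lt;/pre&gt;
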
&lt;p&gt;&lt;strong&gt;The Results&lt;/strong&gt;&lt;br&gt;
After shipping these optimizations, the production performance profile looks like this:&lt;/p&gt;

&lt;p&gt;Real-time AR overlay: 30fps sustained on iPhone 12+, 60fps on iPhone 13 Pro+&lt;br&gt;
Video analysis: 10-15 seconds for a 30-second clip&lt;br&gt;
Battery consumption during 60-minute AR workout: approximately 15-20% battery drain&lt;br&gt;
Crash rate related to pose estimation: 0.0% over 30 days (error boundaries catch all edge cases)&lt;br&gt;
User-reported accuracy satisfaction: tracked through review sentiment&lt;/p&gt;

&lt;p&gt;The gap between research demos and production apps is significant. Research optimizes for accuracy on clean datasets. Production optimizes for resilience across terrible camera angles, bad lighting, occluded joints, multiple subjects, and variable processing budgets. Understanding this gap early would have saved me months.&lt;/p&gt;

&lt;p&gt;SportsReflector is available on the iOS App Store. Built with Apple's Vision framework, CoreML, and ARKit. If you're shipping pose estimation in production and dealing with similar issues, I'd love to compare notes.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>ai</category>
      <category>coreml</category>
      <category>computervision</category>
    </item>
    <item>
      <title>How I Built a Multi-Sport AI Coach on iOS as a Solo Developer — Architecture Decisions That Actually Mattered</title>
      <dc:creator>Benjamin Pires</dc:creator>
      <pubDate>Sat, 09 May 2026 01:47:25 +0000</pubDate>
      <link>https://forem.com/benjamin_pires_59127eddff/how-i-built-a-multi-sport-ai-coach-on-ios-as-a-solo-developer-architecture-decisions-that-22l9</link>
      <guid>https://forem.com/benjamin_pires_59127eddff/how-i-built-a-multi-sport-ai-coach-on-ios-as-a-solo-developer-architecture-decisions-that-22l9</guid>
      <description>&lt;p&gt;Most articles about building AI apps focus on the model. This one focuses on everything around the model — the architecture decisions that determined whether the product would actually ship, actually perform, and actually retain users.&lt;/p&gt;

&lt;p&gt;SportsReflector is an AI coaching app that analyzes athletic form across 22 sports and every common gym exercise. It uses on-device pose estimation to extract body landmarks from video, calculates biomechanical metrics against sport-specific benchmarks, and returns a 0-100 form score with corrective coaching feedback. I built it solo.&lt;/p&gt;

&lt;p&gt;Here are the architecture decisions that mattered most — and the ones I got wrong initially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: On-Device vs Cloud Inference&lt;/strong&gt;&lt;br&gt;
The first prototype sent video frames to a cloud GPU for pose estimation. It worked. It was also unusable.&lt;/p&gt;

&lt;p&gt;Round-trip latency for a single frame was 200-400ms depending on network conditions. For real-time AR overlay at 30fps, you need sub-33ms inference per frame. Cloud inference was roughly 10x too slow for the core feature.&lt;/p&gt;

&lt;p&gt;The fix was moving to Apple's Vision framework with VNDetectHumanBodyPoseRequest running entirely on-device via CoreML. On an iPhone 12 or newer, single-frame pose estimation runs in 8-15ms — fast enough for real-time AR overlay at 60fps on ProMotion devices.&lt;/p&gt;

&lt;p&gt;The business implications of this decision were massive: cloud inference at scale would have cost roughly $0.02-0.05 per analysis. At 10,000 daily active users doing 3 analyses each, that's $600-1,500/day in GPU costs before the business generates meaningful revenue. On-device inference costs exactly $0.00 per analysis regardless of user volume. Gross margin scales with subscriptions, not usage.&lt;/p&gt;

&lt;p&gt;The tradeoff: on-device models are smaller and less accurate than cloud models. Apple's Vision body-pose model extracts 19 keypoints per frame; research-grade models like MediaPipe BlazePose extract 33. For consumer coaching (not clinical biomechanics), 19 keypoints are sufficient to score form, detect asymmetries, and identify common technique errors. The accuracy ceiling matters less than the latency floor for user experience.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: default to on-device inference for any consumer AI product. Cloud inference is for batch processing, enterprise workflows, and use cases where latency tolerance is measured in seconds. Consumer apps need sub-100ms response times. On-device delivers that. Cloud doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: Sport-Specific vs Generic Analysis&lt;/strong&gt;&lt;br&gt;
The naive approach to multi-sport analysis is building one model that scores all movements generically. Detect joints, measure angles, score deviations from some universal "correct" standard.&lt;/p&gt;

&lt;p&gt;This doesn't work because biomechanics is sport-specific. A rock-bottom squat is correct form in Olympic weightlifting but deeper than a powerlifting squat needs to be, where the standard is just below parallel. A wide elbow flare is wrong for bench press but correct for a boxing hook. Knee valgus is a red flag in a squat but a natural movement pattern in certain tennis footwork transitions.&lt;/p&gt;

&lt;p&gt;The architecture that works is a shared pose estimation layer feeding into sport-specific analysis modules. The pose data (19 keypoints with x, y coordinates and confidence scores per frame) is identical regardless of sport. The interpretation layer is modular — each sport has its own:&lt;/p&gt;

&lt;p&gt;Biomechanical benchmark definitions&lt;br&gt;
Phase detection logic (setup → load → execute → follow-through)&lt;br&gt;
Failure mode catalog&lt;br&gt;
Corrective drill mappings&lt;br&gt;
Scoring weight distributions&lt;/p&gt;

&lt;p&gt;Adding a new sport means writing a new analysis module, not retraining the pose estimation model. The 23rd sport is incremental engineering. The first sport was the hard part — designing the module interface that all future sports plug into.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: if you're building any multi-category AI product (not just sports), invest heavily in the interface between your core ML layer and your domain-specific interpretation layer. That interface is your architecture. Get it right and scaling to new categories is additive. Get it wrong and every new category is a rewrite.&lt;/p&gt;
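
&lt;p&gt;To make the interface idea concrete, here's a sketch of what such a module protocol could look like; the names and shapes are illustrative, not SportsReflector's actual types:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import CoreGraphics
import Foundation

// Sketch of the interface idea: one shared pose layer, one protocol every
// sport implements. Names and shapes are illustrative.
struct PoseFrame {
    let timestamp: TimeInterval
    let joints: [String: CGPoint]
    let confidences: [String: Float]
}

enum MovementPhase { case setup, load, execute, followThrough }

protocol SportAnalysisModule {
    var sportName: String { get }
    func phase(at index: Int, in frames: [PoseFrame]) -&amp;gt; MovementPhase
    func formScore(for frames: [PoseFrame]) -&amp;gt; Int              // 0-100
    func failureModes(in frames: [PoseFrame]) -&amp;gt; [String]
    func correctiveDrills(for failureModes: [String]) -&amp;gt; [String]
}

// Sport #23 is one more conforming type; the pose layer never changes.
struct SquatModule: SportAnalysisModule {
    let sportName = "Back Squat"
    func phase(at index: Int, in frames: [PoseFrame]) -&amp;gt; MovementPhase { .execute }
    func formScore(for frames: [PoseFrame]) -&amp;gt; Int { 0 }
    func failureModes(in frames: [PoseFrame]) -&amp;gt; [String] { [] }
    func correctiveDrills(for failureModes: [String]) -&amp;gt; [String] { [] }
}
&lt;/code&gt;&lt;/pre&gt;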

&lt;p&gt;&lt;strong&gt;Decision 3: Synchronous vs Asynchronous Analysis&lt;/strong&gt;&lt;br&gt;
The first version analyzed video synchronously — user taps "analyze," the app freezes for 3-8 seconds while processing, then displays results. Users hated it. The perceived wait felt broken even though the actual processing time was reasonable.&lt;/p&gt;

&lt;p&gt;The fix was splitting analysis into two paths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time path&lt;/strong&gt; (AR workouts, training partner): pose estimation runs every frame at 30fps. No analysis latency — feedback is continuous. The tradeoff is shallower analysis since you can only compute what fits in a 33ms budget per frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deferred path&lt;/strong&gt; (video analysis, form scoring): video is recorded first, then analyzed frame-by-frame in the background. A progress indicator shows the user their analysis is cooking. Results appear as a notification or in-app card when ready. The user can do other things while waiting.&lt;/p&gt;

&lt;p&gt;The deferred path allowed much deeper analysis per frame because there's no real-time constraint. Phase detection, kinetic chain analysis, asymmetry measurement, and LLM-generated coaching feedback all run sequentially without blocking the UI.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: never freeze UI for ML inference. Either run inference continuously (real-time path) or run it in the background with progress feedback (deferred path). The perceived performance of your app is more important than the actual inference speed.&lt;/p&gt;
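
&lt;p&gt;A sketch of the deferred path using plain GCD, with heavy work off the main thread and progress posted back to the UI; &lt;code&gt;analyze(frame:)&lt;/code&gt; stands in for the real per-frame pipeline:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import Foundation
import CoreVideo

// Sketch of the deferred path with plain GCD: heavy work off the main
// thread, progress and the final score posted back to it.
final class DeferredAnalysisJob {
    var onProgress: ((Double) -&amp;gt; Void)?
    var onFinished: ((Int) -&amp;gt; Void)?

    func start(with frames: [CVPixelBuffer]) {
        DispatchQueue.global(qos: .userInitiated).async {
            var scores: [Int] = []
            for (index, frame) in frames.enumerated() {
                scores.append(self.analyze(frame: frame))
                let progress = Double(index + 1) / Double(frames.count)
                DispatchQueue.main.async { self.onProgress?(progress) }
            }
            let finalScore = scores.isEmpty ? 0 : scores.reduce(0, +) / scores.count
            DispatchQueue.main.async { self.onFinished?(finalScore) }
        }
    }

    // Placeholder for the real per-frame pipeline.
    private func analyze(frame: CVPixelBuffer) -&amp;gt; Int { 0 }
}
&lt;/code&gt;&lt;/pre&gt;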

&lt;p&gt;&lt;strong&gt;Decision 4: Monolithic vs Modular Feature Architecture&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://sportsreflector.com/" rel="noopener noreferrer"&gt;SportsReflector&lt;/a&gt; has a lot of features: video analysis, AR workouts, AI training partner, workout planner, sports planner, drills library, calorie tracker, coach dashboard. Building these as a monolith would have been faster initially but catastrophic for iteration speed.&lt;br&gt;
Each feature is a semi-independent module with defined interfaces:&lt;/p&gt;

&lt;p&gt;Video Analysis Module: accepts video frames, returns pose data + form scores&lt;br&gt;
AR Module: accepts real-time pose data, renders overlays&lt;br&gt;
Training Partner Module: accepts real-time pose data, manages rep counting + voice coaching&lt;br&gt;
Planner Module: accepts user preferences, returns workout plans, reads form scores for adaptation&lt;br&gt;
Calorie Module: accepts food photos, returns nutrition data, reads training data for macro recommendations&lt;br&gt;
Coach Module: accepts athlete data, returns dashboards + reports&lt;/p&gt;

&lt;p&gt;The modules share data through a central store but don't directly depend on each other. The workout planner reads form scores but doesn't import the video analysis module. The calorie tracker reads training load but doesn't import the planner.&lt;/p&gt;

&lt;p&gt;This modularity meant I could ship features independently. The workout planner shipped in version 1.3. Cardio training shipped in 1.4. The calorie tracker shipped in 1.4.1. Each feature was developed, tested, and released without touching the other modules.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: resist the temptation to build features that deeply intertwine. Define interfaces early. Ship modules independently. The velocity advantage of modular architecture compounds over months — you're not debugging the entire app every time you ship a new feature.&lt;/p&gt;
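
&lt;p&gt;A sketch of that decoupling, with modules depending on a small store protocol rather than on each other; the protocol and the adjustment rule are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;// Sketch of modules talking through a central store instead of importing
// each other. The protocol and the adjustment rule are illustrative.
protocol SessionStore {
    func recentFormScores(for exercise: String) -&amp;gt; [Int]
    func trainingLoad(lastDays days: Int) -&amp;gt; Double
}

struct WorkoutPlanner {
    let store: SessionStore

    // Reads form scores without ever importing the video-analysis module.
    func intensityMultiplier(for exercise: String) -&amp;gt; Double {
        let recent = store.recentFormScores(for: exercise).suffix(5)
        guard !recent.isEmpty else { return 1.0 }
        let average = Double(recent.reduce(0, +)) / Double(recent.count)
        return average &amp;lt; 70 ? 0.9 : 1.05
    }
}
&lt;/code&gt;&lt;/pre&gt;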

&lt;p&gt;&lt;strong&gt;Decision 5: Launch Time Optimization&lt;/strong&gt;&lt;br&gt;
The first production build launched in 2.1 seconds cold. For an app that users might open 3-5 times per day at the gym, that's an eternity of staring at a splash screen.&lt;br&gt;
The optimization was straightforward in concept, tedious in practice:&lt;/p&gt;

&lt;p&gt;Defer all network calls to after the first frame renders&lt;br&gt;
Lazy-initialize the ML model (don't load it until the user opens the analysis screen)&lt;br&gt;
Cache the home screen layout so the first render uses pre-computed dimensions&lt;br&gt;
Move analytics initialization to a background queue&lt;br&gt;
Pre-warm the camera session on a background thread after the home screen appears&lt;/p&gt;
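
&lt;p&gt;Two of those tactics sketched in Swift, assuming a SwiftUI entry point: a lazily created Vision request and non-critical work deferred until after the first frame. The names are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-swift"&gt;import SwiftUI
import Vision

// Sketch of two of the tactics above, assuming a SwiftUI entry point:
// build the Vision request lazily, and push non-critical startup work
// past the first rendered frame.
final class PoseEstimator {
    // Created on first use from the analysis screen, not at launch.
    lazy var bodyPoseRequest = VNDetectHumanBodyPoseRequest()
}

struct HomeView: View {
    var body: some View {
        Text("Home")
            .task {
                // Runs after the view appears: analytics setup, cache warm-up,
                // camera pre-warm, and similar work belong here, not in App init.
                await warmUpNonCriticalServices()
            }
    }
}

func warmUpNonCriticalServices() async { /* placeholder */ }
&lt;/code&gt;&lt;/pre&gt;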

&lt;p&gt;Cold launch dropped to under 500ms. Users perceive the app as "instant." The compound effect on retention is real — users who experience fast launches open the app more frequently, which drives engagement metrics, which drives App Store ranking.&lt;/p&gt;

&lt;p&gt;What I'd tell other developers: measure your cold launch time. If it's over 1 second, you're losing users to perceived sluggishness. The fix is almost always "stop doing things before the first frame renders."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Got Wrong&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Underestimating localization.&lt;/strong&gt; I launched English-only and added localization across 25 languages in version 1.4.2. Should have done it from day one. International markets (Japan, Korea, Germany, Brazil) have high willingness to pay for fitness apps. Every month without localization was revenue left on the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overengineering onboarding.&lt;/strong&gt; The first onboarding flow was six screens explaining features. Users bounced before reaching the first analysis. The current flow is two screens: pick your sport, start recording. Feature discovery happens through usage, not tutorials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not triggering review prompts early enough.&lt;/strong&gt; The app launched with zero App Store ratings for weeks. Should have triggered SKStoreReviewController.requestReview() after the first completed analysis from day one. Ratings velocity is the single biggest factor in App Store discoverability. Every day without ratings is a day your app is invisible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Stack&lt;/strong&gt;&lt;br&gt;
For developers curious about the technical stack:&lt;/p&gt;

&lt;p&gt;UI: SwiftUI with UIKit bridges for camera and AR views&lt;br&gt;
Pose Estimation: Apple Vision framework (VNDetectHumanBodyPoseRequest)&lt;br&gt;
AR: ARKit with body tracking, RealityKit for 3D overlays&lt;br&gt;
ML Runtime: CoreML for on-device inference&lt;br&gt;
AI Coaching: LLM API for generating sport-specific feedback&lt;br&gt;
Food Recognition: Vision-based photo analysis for calorie tracking&lt;br&gt;
Backend: Firebase (auth, database, cloud functions)&lt;br&gt;
Analytics: MetricKit for performance monitoring&lt;br&gt;
Haptics: CoreHaptics for custom feedback patterns&lt;br&gt;
Voice: AVSpeechSynthesizer for training partner voice coaching&lt;/p&gt;

&lt;p&gt;The app is built primarily with Rork, an AI-assisted iOS development platform, which significantly accelerated development for a solo developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;br&gt;
The model matters less than the architecture around it. Pose estimation is a solved problem — Apple gives you a production-ready model for free. The hard part is everything else: making inference fast enough for real-time use, building modular sport-specific analysis that scales to new sports, designing UI that doesn't block on ML processing, and shipping features independently without breaking existing ones.&lt;/p&gt;

&lt;p&gt;If you're building an AI-powered consumer app, focus less on model accuracy and more on perceived performance, architectural modularity, and launch speed. Those are the decisions that determine whether users come back tomorrow.&lt;/p&gt;

&lt;p&gt;SportsReflector is available on the iOS App Store. If you're working on pose estimation, CoreML, or ARKit and want to compare notes, reach out — I'm always happy to talk shop with other developers building in this space.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>machinelearning</category>
      <category>swift</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
