Forem: Sagar Joshi

We Solved the Recording Problem. The Playback Problem Is Still Broken.

Sagar Joshi — Sun, 22 Mar 2026 20:54:00 +0000

You joined a new company. Someone shared a link — a 2-hour product walkthrough, detailed, important. You opened it. You watched 20 minutes. You closed it and never went back.

Nobody told you which 15 minutes actually mattered.

We solved recording. We solved search. We solved summaries. But somewhere along the way, we optimized recordings for machines to understand — not for humans to consume.

That is the playback problem. And it is not a new problem — it is an unsolved one.

The Format Problem — Mid-2010s

In the mid-2010s, enterprise platforms stored recordings in a fragmented landscape of proprietary formats. Adobe Connect used Flash-based FLV. WebEx had ARF. GoToMeeting used G2M. For platforms built on VNC-based screen sharing, the format was FBS (FrameBuffer Stream). It was not a mainstream format but a niche protocol dump with no standard player, no indexing, and tooling its own developers described as old and barely maintained.

Fuze — a major enterprise UCaaS platform serving 400,000+ users — had accumulated thousands of recordings in this format without a viable conversion path.

There was no off-the-shelf solution. We built one from scratch using open source tools — no commercial SDK, no vendor API.

The approach had four parts: extend the RFB player code to dump one JPEG per frame, concatenate the frames with ffmpeg at 30 fps, correct the sync drift between screen share and audio, and handle resolution distortion with a Black Image Padding Technique.
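The frame concatenation and padding steps can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the function names, file names, target resolution, and codec choice are all assumptions, and "black image padding" is interpreted here as letterboxing each frame onto a black canvas of fixed size. The ffmpeg options used (`-framerate`, `-vf` with the `scale` and `pad` filters, `-shortest`) are standard ones.

```python
import shlex

def letterbox(src_w, src_h, dst_w, dst_h):
    """Fit a src-sized frame inside a dst-sized canvas without changing its
    aspect ratio. Returns (scaled_w, scaled_h, pad_x, pad_y)."""
    scale = min(dst_w / src_w, dst_h / src_h)
    # Round down to even sizes, which yuv420p encoders expect.
    new_w = int(src_w * scale) // 2 * 2
    new_h = int(src_h * scale) // 2 * 2
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2

def ffmpeg_command(frame_pattern, audio_path, out_path, src, dst, fps=30):
    """Build an ffmpeg argument list that joins numbered JPEG frames with an
    audio track and pads the picture with black to the target size."""
    w, h, x, y = letterbox(src[0], src[1], dst[0], dst[1])
    video_filter = f"scale={w}:{h},pad={dst[0]}:{dst[1]}:{x}:{y}:black"
    return ["ffmpeg", "-framerate", str(fps), "-i", frame_pattern,
            "-i", audio_path, "-vf", video_filter,
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-shortest", out_path]

# Hypothetical file names and resolutions, for illustration only.
cmd = ffmpeg_command("frame_%06d.jpg", "audio.wav", "meeting.mp4",
                     src=(1024, 768), dst=(1280, 720))
print(shlex.join(cmd))
```

A 4:3 screen share placed on a 16:9 canvas this way keeps its proportions and gains black bars on the sides instead of being stretched.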

Result: hundreds of recordings rescued from obsolescence. A format problem solved.

The Delivery Problem — 2017 to 2018

Even converted recordings remained tethered to the company network. Petabytes of content — meetings, training, product walkthroughs — locked behind a network connection.

Exporting the three components (audio, video, screen share) separately caused sync drift. The approach that worked: record the screen while the content played, using entirely open source tools.

Xvfb — headless virtual framebuffer, ran the platform URL on Chrome
ScreenCastify — captured Xvfb output
ffmpeg — converted WebM to MP4
chrome.runtime APIs — detected buffering, paused capture to prevent sync errors
RabbitMQ — async task queue so users weren't blocked waiting for long exports
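The buffering-aware capture in the list above can be illustrated with a small sketch. It assumes the player state has already been reduced to a sorted list of hypothetical `(timestamp, state)` events; how those events were actually obtained from the chrome.runtime APIs is not described here, so that part is left out.

```python
def capture_segments(events, end_time):
    """Return the (start, end) intervals during which capture should run.

    events:   list of (timestamp_seconds, state) sorted by time, where state
              is 'playing' or 'buffering' (hypothetical event names).
    end_time: timestamp at which playback finished.
    """
    segments = []
    started_at = None          # None means capture is currently paused
    for ts, state in events:
        if state == "playing" and started_at is None:
            started_at = ts    # resume capture
        elif state == "buffering" and started_at is not None:
            if ts > started_at:
                segments.append((started_at, ts))
            started_at = None  # pause capture while the player stalls
    if started_at is not None and end_time > started_at:
        segments.append((started_at, end_time))
    return segments

events = [(0, "playing"), (10, "buffering"), (12, "playing"),
          (30, "buffering"), (31, "playing")]
segments = capture_segments(events, end_time=40)
captured_seconds = sum(end - start for start, end in segments)
```

Because stalled intervals never reach the output file, the captured video stays aligned with the audio that plays alongside it.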

Result: a download button that actually worked. The delivery problem solved.

What the Industry Actually Solved — And What It Didn't

Tools like Otter.ai, Fireflies.ai, Read AI, and tl;dv transformed recordings into structured data. Teams Intelligent Recap added chapters. Panopto enabled dual-stream playback. Zoom added clips.

This progress is real and useful.

But it solved a different problem.

The industry optimized recordings for machines to understand — not for humans to consume.

If you missed a meeting today, you have three options: read the summary, search the transcript, or watch the full recording. What you still cannot do is watch the right version of the recording.

A 2-hour recording is still a 2-hour recording — just with better indexing.

The Four Gaps

1. No viewer-controlled dynamic view switching
Zoom's multi-view recording requires host pre-configuration. Panopto's dual-stream requires editor intervention — not a viewer action. No platform offers audio-only mode or true viewer-controlled layout at replay time.

2. No downloadable intelligent highlight packages
Online highlight reels exist (Read AI, tl;dv) — but they don't travel offline. You can download the full 2-hour recording. You cannot download the intelligent 29-minute version.

3. Personalization exists at the summary layer — not the playback layer
Every tool sends the new joiner and the senior engineer the same MP4. Role-aware playback packaging does not exist.

4. The root cause is invisible — recordings are permanently flattened at capture time
When a meeting ends, audio, video, and screen share are merged into a single MP4 and the streams are discarded. Automatically. Silently. Irreversibly. This single architectural decision forecloses every intelligent replay option downstream. Most users never know it happened.

A Missing Category: Intelligent Replay Systems

Intelligent Replay is the ability to generate a personalized, context-aware version of a recording at playback time.

What makes this achievable now is multimodal AI. Models like Gemini and GPT-4o can watch a video stream and understand it visually — detecting when a presenter shifts to demonstrating, when a slide changes, when a live demo begins. They can decide, second by second, which stream carries the most relevant signal.
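To make the "second by second" decision concrete, here is a minimal sketch. It assumes a model has already produced a relevance score per stream for every second (the scores, stream names, and the `min_dwell` parameter are hypothetical), and it adds a simple dwell rule so the view does not flicker on one-second blips.

```python
def choose_streams(scores, min_dwell=3):
    """Pick one stream per second from per-second relevance scores.

    scores:    list of dicts, one per second, mapping stream name -> score.
    min_dwell: a rival stream must lead for this many consecutive seconds
               before the view switches to it.
    """
    if not scores:
        return []
    current = max(scores[0], key=scores[0].get)
    rival, streak = None, 0
    chosen = []
    for second in scores:
        best = max(second, key=second.get)
        if best == current:
            rival, streak = None, 0
        else:
            if best == rival:
                streak += 1
            else:
                rival, streak = best, 1
            if streak >= min_dwell:
                current, rival, streak = best, None, 0
        chosen.append(current)
    return chosen

scores = [
    {"video": 0.9, "screen": 0.1},
    {"video": 0.8, "screen": 0.2},
    {"video": 0.4, "screen": 0.6},   # one-second blip, ignored
    {"video": 0.7, "screen": 0.3},
    {"video": 0.2, "screen": 0.8},   # a demo starts on the screen share
    {"video": 0.2, "screen": 0.9},
    {"video": 0.1, "screen": 0.9},
]
plan = choose_streams(scores)
```

This only works if the separate streams are still available at playback time, which is exactly what flattening at capture time prevents.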

The shift:

Recording → file → dataset → generated experience

This is not an extension of existing tools — it is orthogonal to them.

We would not accept a document that can only be read top to bottom with no ability to skip, restructure, or personalize. That is exactly how we still treat video.

Why Nobody Has Built This Yet

Three structural constraints:

Storage and processing cost — Multi-stream recording means 2-3x storage. Per-user packaging multiplies compute.

Architectural inertia — Most platforms flatten at capture time. Changing this requires product conviction most teams haven't developed.

Implicit demand — The signal was always there in unopened recording links, in "can you summarize this?" messages, in training videos unwatched for months. The industry prescribed a painkiller — summaries and transcripts. Nobody diagnosed the underlying condition: that playback itself was broken.

But those constraints are weakening. Storage is cheaper. AI is significantly better. The cost of wasted time is becoming measurable.

Conclusion

We solved recording. We solved summaries. But we never solved playback.

The infrastructure exists. The AI exists. Multimodal models can already watch a video and decide which stream matters at every second. What is missing is the decision to treat playback as a product — not a file.

To the startup ecosystem: the Big Three have optimized for the summary. The playback layer is uncontested.

That category is Intelligent Replay. It is waiting to be claimed.

Full article with diagrams on Medium:
https://medium.com/@jo.sagar/we-solved-the-recording-problem-the-playback-problem-is-still-broken-1768038911b3

The Missing Protocol: How BFCP Unlocked Dual-Monitor Conferencing for Enterprise Room Systems

Sagar Joshi — Sat, 14 Mar 2026 22:57:05 +0000

Enterprise room systems like Polycom and Lifesize support dual monitors — one for video, one for screen share. But when they joined a web conference through a SIP gateway, both monitors showed the same combined feed.

The hardware was capable, but OpalVoIP had no support for BFCP (Binary Floor Control Protocol), so the screen share was injected into the main video feed instead of being sent as a separate stream.

I added BFCP to OpalVoIP from scratch in 2018 — integrating libbfcp via a C++ wrapper, extending SDP offer/answer processing, and handling a dual-role design where the gateway acts as BFCP client on outbound calls and BFCP server on inbound calls.
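For readers unfamiliar with how BFCP shows up in SDP, the sketch below builds the relevant media sections for each gateway role. It is an illustration under assumptions, not the OpalVoIP patch: the function names, ports, IDs, and payload type are hypothetical, transport attributes such as `a=setup` and `a=connection` are omitted, and the attribute names come from RFC 4583 (BFCP in SDP) and RFC 4796 (`a=content`).

```python
def bfcp_section(direction, port, conf_id=None, user_id=None,
                 floor_id=1, content_label=10):
    """Return SDP lines for the BFCP m-line, by call direction.

    outbound: the gateway acts as floor control client (c-only).
    inbound:  the gateway acts as floor control server (s-only).
    """
    if direction == "outbound":
        role = "c-only"
    elif direction == "inbound":
        role = "s-only"
    else:
        raise ValueError("direction must be 'outbound' or 'inbound'")
    lines = [f"m=application {port} TCP/BFCP *", f"a=floorctrl:{role}"]
    if role == "s-only":
        if conf_id is None or user_id is None:
            raise ValueError("the server role must supply conf_id and user_id")
        # Only the floor control server announces conference, user and floor IDs.
        # 'mstrm' is the token in the RFC grammar; some stacks send 'm-stream'.
        lines += [f"a=confid:{conf_id}", f"a=userid:{user_id}",
                  f"a=floorid:{floor_id} mstrm:{content_label}"]
    return lines

def content_video_section(port, label=10):
    """Second video m-line for the screen share, tied to the floor by label."""
    return [f"m=video {port} RTP/AVP 96", "a=rtpmap:96 H264/90000",
            "a=content:slides", f"a=label:{label}"]

outbound = bfcp_section("outbound", 50000)
inbound = bfcp_section("inbound", 50000, conf_id=4321, user_id=1234)
slides = content_video_section(50004)
```

The `a=label` on the content video line and the `mstrm` value on the floor are what let the endpoint put the presentation on its second monitor.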

The full write-up covers:

The SDP negotiation details (with before/after examples)
The dual-role client/server architecture
Why pre-negotiation beats mid-call renegotiation
How WebRTC handles this differently today — and why SIP gateways still need BFCP
Lessons learned from production deployment with real hardware endpoints

The full write-up is available as a Medium article.

Open source contribution credited at open.gslab.com. The BFCP gap still exists in OpalVoIP's main codebase today — sharing in case it's useful to anyone hitting the same problem.

The UX Problem Isn’t Speed — It’s Intent

Sagar Joshi — Mon, 26 Jan 2026 16:24:13 +0000

Modern software isn’t slow — it’s cognitively heavy.

We’ve built fast systems, powerful features, and rich interfaces. Yet users still spend a surprising amount of effort translating what they want to do into clicks, menus, and workflows.

The bottleneck isn’t performance.
It’s that software lacks a shared, adaptive grammar for user intent.

The Translation Gap: Intent vs Interface

Users don’t think in UI elements.
They think in goals.

“Reply quickly.”
“Escalate this issue.”
“Share this with the team.”

But software forces users to perform a translation step:

Find the right menu
Recall where an action lives
Learn different interaction patterns for every product

This repeated translation creates cognitive overload — not because tools are complex, but because each tool speaks a different interaction language.

Humans adapt to computers.
Computers rarely adapt to humans.

Why Keyboard Shortcuts Don’t Fully Solve This

Keyboard shortcuts were an early attempt to compress intent into speed. They help — but they have structural limits:

Invisible: Users can’t discover what they don’t know
Inconsistent: Same shortcut, different behavior across apps
High memory cost: Each tool demands a new mental vocabulary

Most users don’t avoid shortcuts because they dislike efficiency — they avoid them because software rarely teaches shortcuts at the moment intent is expressed.

Adaptive UX: Learning at Runtime

Static onboarding tours don’t work. Users skip them.

What does work is just-in-time guidance — adapting while the user is already working.

With modern telemetry and AI, this is finally practical.

Examples:

Fade-in shortcuts

If a user manually clicks “Export to PDF” multiple times, subtly surface the shortcut only when the pattern appears.

Behavioral macros

If a user consistently mutes audio and disables video before a recurring meeting, the system can suggest automating it.

Gesture hints

When repeated multi-step actions are detected, lightweight visual cues can introduce a gesture or touchpad action to replace them.

This isn’t about adding features — it’s about teaching efficiency at the moment of intent.
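The fade-in shortcut example can be sketched in a few lines. This is a minimal illustration with hypothetical names (`ShortcutHinter`, the threshold of three, the `Ctrl+Shift+E` binding), not a description of any existing product.

```python
from collections import Counter

class ShortcutHinter:
    """Surface a shortcut hint only after repeated manual use of an action."""

    def __init__(self, shortcuts, threshold=3):
        self.shortcuts = shortcuts      # action name -> shortcut label
        self.threshold = threshold      # manual uses before hinting
        self.manual_uses = Counter()
        self.hinted = set()             # actions that need no further hints

    def record(self, action, via_shortcut=False):
        """Record one use; return a hint string once, otherwise None."""
        if via_shortcut:
            self.hinted.add(action)     # the user already knows the shortcut
            return None
        self.manual_uses[action] += 1
        if (action in self.shortcuts
                and action not in self.hinted
                and self.manual_uses[action] >= self.threshold):
            self.hinted.add(action)
            return f"Tip: press {self.shortcuts[action]} for '{action}'"
        return None

hinter = ShortcutHinter({"Export to PDF": "Ctrl+Shift+E"})
results = [hinter.record("Export to PDF") for _ in range(4)]
```

The hint appears on the third manual click and never again, which keeps the guidance tied to the moment the pattern shows up.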

Beyond the Keyboard: A Shared Interaction Vocabulary

Intent doesn’t have to be typed.

Modern devices already support:

multi-finger gestures
touchpad zones
pressure or context-aware inputs

The real opportunity isn’t inventing new gestures — it’s making them consistent.

If the same gesture meant “Send to team” across chat, email, and issue trackers, users would develop muscle memory for intent, not for menus.

That’s what a shared interaction grammar enables.

Why This Matters for Product Teams

From a product and UX perspective, intent-driven interaction offers clear business value:

Faster onboarding: Users become effective through usage, not training
Lower support load: Many “how do I” tickets are really “where is” problems
Higher flow state: Reduced friction keeps users focused on outcomes

When software understands what the user is trying to do, it can minimize the effort required to do it.

What Comes Next

The industry doesn’t need more features.

It needs:

shared patterns for expressing common user intents
adaptive interfaces that learn from real behavior
interaction models that prioritize goals over UI mechanics

We’ve already personalized content.
Now it’s time to personalize interaction.

Closing Thought

Good software doesn’t ask users to think harder.

It understands what they’re trying to achieve — and clears the shortest path to get there.

Reference

For an earlier exploration of structured, intent-driven interaction patterns:

Why Collaboration Still Breaks Context — And What Comes Next

Sagar Joshi — Mon, 05 Jan 2026 19:24:47 +0000

Even in 2026, collaboration still feels oddly fragile.

We jump between chat, calls, email, documents, and wikis all day — yet the context rarely follows. A discussion starts in chat, escalates to a call, spills into documents, and somehow ends up summarized (poorly) in email.

The tools are powerful.
The experience is fragmented.

This isn’t a UX nit — it’s a systems gap.

The Real Problem Isn’t Channels — It’s Context

Most modern collaboration platforms are optimized around mediums, not conversations.

What breaks during transitions:

participants change
intent gets diluted
decisions lose traceability
historical “why” disappears

Even with AI summaries, context is usually reconstructed after the fact, inside a single tool.

What’s missing is a continuous conversation model that survives medium changes.

What Context-Preserving Collaboration Should Feel Like

Imagine moving fluidly across:

Chat → Call → Doc → Email → Wiki → Chat

…and carrying forward:

participants
shared artifacts
conversation history
decisions
intent

No re-explaining.
No restarting.
No “what did we decide again?”

This is not a feature problem — it’s an architecture problem.

An Old Idea That’s Suddenly Relevant Again

Years ago, an IBM defensive disclosure (IPCOM/000193228) explored a simple but powerful idea:

Dynamically switch between collaboration mediums while preserving the full interaction context.

The concept introduced a collaboration switch that:

captured parameters from the current medium (participants, content, resources)

mapped them into the next medium

preserved continuity across transitions
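A rough sketch of what such a switch might look like in code is shown below. The data model, field names, and mapping table are assumptions made for illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Parameters captured from the current medium."""
    medium: str
    participants: list
    content: list                                   # messages or notes so far
    resources: list = field(default_factory=list)   # links to shared artifacts

# Hypothetical mapping: which field of the next medium receives each parameter.
FIELD_MAP = {
    "call":  {"participants": "invitees", "content": "agenda", "resources": "attachments"},
    "email": {"participants": "to", "content": "body", "resources": "attachments"},
    "doc":   {"participants": "editors", "content": "sections", "resources": "links"},
}

def switch_medium(ctx, target):
    """Map the captured context into the shape the target medium expects."""
    payload = {"medium": target, "previous_medium": ctx.medium}
    for source_field, target_field in FIELD_MAP[target].items():
        payload[target_field] = list(getattr(ctx, source_field))
    return payload

chat = Context("chat", ["ana", "raj"], ["Ship on Friday?"],
               ["https://example.com/spec"])
call = switch_medium(chat, "call")
```

The call is created with the chat participants as invitees and the chat history as its agenda, so nobody has to re-explain why the call exists.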

At the time, it felt ahead of the ecosystem.

Today, it feels like the missing layer.

Where Today’s Tools Still Fall Short

Microsoft Teams

Strength: Strong in-app context
What’s missing: Email ↔ Teams continuity still feels fragmented

Slack

Strength: Excellent chat threads
What’s missing: Context often breaks when switching to meetings

Zoom

Strength: Great meeting capture
What’s missing: Weak pre- and post-meeting continuity

Google Workspace

Strength: Smart Canvas shows promise
What’s missing: No unified memory across chat, mail, docs, and meetings

📌 Despite different strengths, none provides a single conversation graph across mediums.

There is still no unified conversation graph.

Three Gaps That Matter in 2026

Context Is Still Siloed

AI summaries live inside tools, not across them.

No Intelligent Medium Switching

Tools don’t suggest when to switch:

heated debate → call
long call → document
stalled thread → async summary
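Even a simple rule table could drive suggestions like these. The sketch below uses hypothetical signals and thresholds; a real system would presumably learn them from behavior rather than hard-code them.

```python
def suggest_medium(thread):
    """Suggest a medium switch from simple, hypothetical conversation signals."""
    medium = thread.get("medium")
    # Heated debate: many messages from several people in a short window.
    if (medium == "chat"
            and thread.get("messages_last_10_min", 0) >= 30
            and thread.get("participants", 0) >= 3):
        return "call"
    # Long call: capture the outcome in a document.
    if medium == "call" and thread.get("duration_min", 0) >= 45:
        return "document"
    # Stalled thread: nobody has replied for a day.
    if medium == "chat" and thread.get("hours_since_last_reply", 0) >= 24:
        return "async summary"
    return None

debate = {"medium": "chat", "messages_last_10_min": 40, "participants": 4}
long_call = {"medium": "call", "duration_min": 60}
stalled = {"medium": "chat", "hours_since_last_reply": 30}
```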

No Unified Timeline

There’s no single narrative stitching together:
chat + calls + docs + edits + decisions.
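Once events from each tool share a timestamp and a common shape, stitching them together is a merge of sorted streams. A minimal sketch, assuming each source already yields `(timestamp, tool, text)` tuples in time order (the tuple shape and sample events are hypothetical):

```python
import heapq
from datetime import datetime

def unified_timeline(*sources):
    """Merge per-tool event lists, each sorted by time, into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))

chat_events = [
    (datetime(2026, 1, 5, 9, 0), "chat", "Raised concern about rollout risk"),
    (datetime(2026, 1, 5, 9, 30), "chat", "Agreed to a staged rollout"),
]
call_events = [
    (datetime(2026, 1, 5, 9, 15), "call", "Reviewed rollback plan"),
]
doc_events = [
    (datetime(2026, 1, 5, 9, 45), "doc", "Decision recorded in design doc"),
]

timeline = unified_timeline(chat_events, call_events, doc_events)
order = [tool for _, tool, _ in timeline]
```

The hard part in practice is not the merge but getting every tool to expose its events with reliable timestamps and resolved identities.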

Why Contextual Search Changes Everything

Once context is unified, search becomes transformational.

Instead of keywords, you ask:

“What concerns came up before Feature X was approved?”
“Show the full incident timeline across tools.”
“Summarize discussions involving these people last month.”

Search becomes intent-aware, narrative-aware, and cross-channel.

Why This Is Finally Possible

The technology stack has caught up:

LLMs for semantic mapping
embeddings + graph databases for conversation modeling
identity resolution across tools
APIs that expose collaboration state

The pieces exist.
What’s missing is the unifying layer.

What Comes Next

The future likely includes:

persistent context profiles that follow work, not tools

AI-driven medium recommendations

unified conversation timelines

portable context across platforms

Collaboration shouldn’t belong to an app.
It should belong to the work itself.

Closing Thought

We’ve built excellent channels.

Now it’s time to build the intelligent connective tissue between them.

If you work on collaboration platforms, distributed systems, or developer productivity:

Where does context break most for you?
Would automatic medium switching help — or feel intrusive?
What would a unified timeline change in your day?

Curious to hear how others think about this problem.

Why Real-Time Communication Still Breaks — And What a 2014 Idea Got Right

Sagar Joshi — Fri, 05 Dec 2025 17:32:34 +0000

Every time a Teams, Zoom, or Webex call freezes during an important moment, we’re reminded of something uncomfortable:

Real-time communication is still too fragile.

Even in 2025, we continue to see:

Regional outages
Overloaded media servers
Capacity limits blocking new joiners
Users forced to rejoin calls

Years ago at IBM, we explored a different path — one that didn’t depend on running idle backup servers.

A Simple but Powerful Idea (US Patent 10,051,235)

Instead of keeping mirrored servers on standby, the system could:

Monitor performance using SDP thresholds
Dynamically elect an active participant as a temporary media relay
Switch over seamlessly when the primary server failed
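The monitor-then-elect flow above can be illustrated with a small sketch. It is not the patented method: the metrics, thresholds, and scoring rule are assumptions chosen only to show the shape of the decision.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    uplink_kbps: int
    cpu_load: float       # 0.0 to 1.0
    packet_loss: float    # 0.0 to 1.0

def server_degraded(rtt_ms, loss, max_rtt_ms=400, max_loss=0.05):
    """Hypothetical health check for the primary media server."""
    return rtt_ms > max_rtt_ms or loss > max_loss

def elect_relay(participants, streams, kbps_per_stream=500, max_cpu=0.7):
    """Pick the participant best placed to relay media, or None if nobody can."""
    needed_kbps = streams * kbps_per_stream
    eligible = [p for p in participants
                if p.uplink_kbps >= needed_kbps and p.cpu_load < max_cpu]
    if not eligible:
        return None
    # Prefer low loss, then low CPU load, then the largest uplink.
    return min(eligible, key=lambda p: (p.packet_loss, p.cpu_load, -p.uplink_kbps))

people = [
    Participant("ana", uplink_kbps=1500, cpu_load=0.30, packet_loss=0.01),
    Participant("bob", uplink_kbps=8000, cpu_load=0.40, packet_loss=0.00),
    Participant("cy", uplink_kbps=20000, cpu_load=0.90, packet_loss=0.00),
]
relay = elect_relay(people, streams=4) if server_degraded(650, 0.02) else None
```

Here "ana" lacks the uplink for four streams and "cy" is too busy, so "bob" is elected.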

No reconnects.

No cold backups.

No duplicated cloud cost.

Traditional Redundancy vs. Our 2014 Approach

Traditional Redundancy

Requires mirrored idle servers
High cloud/infra cost
Failover takes seconds
Participants must reconnect

Our Approach

Elect from active participants
Zero idle infrastructure
Instant failover
Session stays alive

A 2025 Application: Scaling When Servers Hit Capacity

One of today’s biggest challenges:

Live events and webinars hitting concurrency limits.

When a media server is full, instead of rejecting viewers:

Let an existing viewer temporarily proxy the stream.

This gives:

Graceful scaling during spikes
Fewer “room full” errors
Continuity during partial outages
Better reliability in poor networks

The Hard Part: Graceful Failback

A modern design would need to:

Sync media + session state
Avoid ping-pong failovers
Time the switch during quiet windows
Possibly split media vs signaling paths

With today’s telemetry (latency, jitter, CPU), this can even be predictive.
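Avoiding ping-pong failovers and timing the switch for a quiet window both come down to a small amount of state. The sketch below is one possible shape, with hypothetical parameter names and values: a hold-down timer after each switch, a required streak of healthy readings, and a check that nobody is speaking.

```python
class FailbackGuard:
    """Decide when it is safe to move back from the relay to the primary."""

    def __init__(self, healthy_needed=30, hold_down=120):
        self.healthy_needed = healthy_needed  # consecutive healthy checks required
        self.hold_down = hold_down            # minimum seconds between switches
        self.healthy_streak = 0
        self.last_switch = None

    def note_switch(self, now):
        """Call whenever a failover or failback happens."""
        self.last_switch = now
        self.healthy_streak = 0

    def should_fail_back(self, now, primary_healthy, someone_speaking):
        self.healthy_streak = self.healthy_streak + 1 if primary_healthy else 0
        if self.last_switch is not None and now - self.last_switch < self.hold_down:
            return False                      # too soon after the last switch
        if self.healthy_streak < self.healthy_needed:
            return False                      # primary not yet stable
        return not someone_speaking           # wait for a quiet window

guard = FailbackGuard(healthy_needed=3, hold_down=10)
guard.note_switch(now=0)
early = [guard.should_fail_back(t, True, False) for t in (1, 2, 3)]
while_speaking = guard.should_fail_back(10, True, True)
quiet = guard.should_fail_back(11, True, False)
```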

Quiet Validation Over the Years

This idea was later granted as U.S. Patent 10,051,235, now cited by:

Unify
Vonage
VMware

(Full citation list available on Google Patents.)

Closing Thought

Failure in real-time systems is inevitable.

But failure doesn’t need to reach the user.

If you're working on real-time media, cloud systems, or distributed comms, I’d love to hear your perspective:

What does “never drop the call” look like in your world today?

References

https://www.linkedin.com/pulse/from-cloud-outages-built-in-continuity-rethinking-sagar-joshi-8xogc
https://patents.google.com/patent/US10051235B2
https://medium.com/@jo.sagar/from-access-cards-to-ai-presence-the-evolution-of-intelligent-communication-routing-5ad39dea698e

A 2009 IBM Patent That Solved Indoor Location Without GPS — And Got Cited 64 Times by Cisco, Microsoft, Avaya…

Sagar Joshi — Mon, 24 Nov 2025 07:00:06 +0000

In 2009 most location patents were about GPS.
We took a different route: use the badge-swipe data enterprises already had, infer where someone probably is, and route calls accordingly — no tracking, no battery drain, no extra hardware.
Patent US8635366B2 was born.
Fifteen years and 64 forward citations later (including Cisco, Microsoft, Avaya, RingCentral, Mitel), it feels worth revisiting.


More than a decade ago, while leading a team at IBM, I co-authored and successfully prosecuted U.S. Patent 8,635,366 — “Communication Routing” — a foundational invention in context-aware communication systems.

The goal was simple yet visionary: connect people through the right device at the right time — without GPS or invasive tracking.

Back then, GPS was expensive, unreliable indoors, and privacy concerns were rising. Our patented method used existing enterprise access control data (badge swipes) to infer location and route calls — a low-cost, privacy-first alternative that predated modern AI presence systems by nearly a decade.

Key Innovation (Claim 1): “Routing communication to an individual by identifying current location from access control information.” This was a new paradigm for location inference without geolocation hardware.
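The core idea can be shown in a few lines. This is a simplified illustration, not the claimed method: the zone names, device mapping, freshness window, and function names are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from badge reader zone to the device nearest that zone.
ZONE_DEVICE = {
    "office-3F": "desk phone",
    "lab-1": "lab softphone",
    "cafeteria": "mobile",
}

def infer_zone(swipes, now, max_age=timedelta(hours=2)):
    """Infer the current zone from the most recent badge swipe.

    swipes: list of (timestamp, zone, direction), direction 'in' or 'out'.
    Returns None when the last swipe was an exit or is too old to trust.
    """
    if not swipes:
        return None
    ts, zone, direction = max(swipes, key=lambda swipe: swipe[0])
    if direction == "out" or now - ts > max_age:
        return None
    return zone

def route_call(swipes, now, default="mobile"):
    """Choose the device to ring first, falling back to a default."""
    return ZONE_DEVICE.get(infer_zone(swipes, now), default)

now = datetime(2025, 11, 24, 12, 0)
swipes = [
    (datetime(2025, 11, 24, 9, 0), "office-3F", "in"),
    (datetime(2025, 11, 24, 11, 30), "lab-1", "in"),
]
target = route_call(swipes, now)
```

No new sensor is involved: the only input is swipe data the access control system already records.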

Why It Still Matters in 2025
Despite billions invested in collaboration tools, the core problem remains unsolved: systems still don’t truly know where you are or how best to reach you.

Today, most platforms default to “ring everything” — creating noise, not clarity.

But what if communication could be predictive, adaptive, and human-centered?

With cloud, AI, and behavioral signals now mature, we can reimagine my original patent as an AI-driven presence engine — one that learns from patterns while preserving privacy.

Smarter Device Routing Through AI (Next-Gen Vision)
Context Signal → Intelligent Action

Active laptop on corporate Wi-Fi → Ring desktop app first
Commute pattern (mobile + Bluetooth) → Route to phone, suppress desktop
Calendar focus block → Auto-deflect or notify later
Badge swipe at cafeteria → Update IM status: “In cafeteria, back at 1 PM”

This isn’t science fiction — it’s the natural evolution of my 2009 patent, now supercharged with AI.

Proven Impact: Cited by 64 Patents
My invention has been cited 64 times — including by Cisco, Microsoft, Avaya, RingCentral, and Mitel — in systems for:

Location-based call routing (Cisco)
Presence aggregation (Microsoft)
Dynamic device selection (Avaya)

This sustained citation record suggests the patent has served as a building block in enterprise unified communications.

Looking Ahead: From Patent to Platform
What began as a GPS alternative is now a blueprint for AI-native communication.

Today, I’m exploring how generative AI + privacy-preserving inference can power the next leap:

“Presence that predicts, protects, and personalizes — without ever tracking.”

Let’s Build the Future of Communication
How would you design an intelligent routing system today?

Should AI infer availability from calendar + badge + typing patterns?
How do we balance prediction with privacy?
Can we make “do not disturb” truly intelligent?
Drop your thoughts below — I read every comment.

Originally shared on LinkedIn and Medium.

Patent: https://patents.google.com/patent/US8635366B2

64-citation CSV: https://patents.google.com/patent/US8635366B2#patentCitations

Curious how you’d extend this idea with today’s telemetry + ML? Let’s talk in the comments.
