<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Anthony</title>
    <description>The latest articles on Forem by Daniel Anthony (@fidget_dan).</description>
    <link>https://forem.com/fidget_dan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3254047%2Fe17934ee-2849-4d1e-b2a5-2bf281905c8f.png</url>
      <title>Forem: Daniel Anthony</title>
      <link>https://forem.com/fidget_dan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/fidget_dan"/>
    <language>en</language>
    <item>
      <title>Advanced Use Cases for AI Video Summaries</title>
      <dc:creator>Daniel Anthony</dc:creator>
      <pubDate>Thu, 26 Jun 2025 18:50:39 +0000</pubDate>
      <link>https://forem.com/fidget_dan/advanced-use-cases-for-ai-video-summaries-5fhi</link>
      <guid>https://forem.com/fidget_dan/advanced-use-cases-for-ai-video-summaries-5fhi</guid>
      <description>&lt;p&gt;Hey everyone! I’m the solo founder of &lt;a href="https://getfidget.pro/" rel="noopener noreferrer"&gt;Fidget&lt;/a&gt;. Today I published a deep-dive on "5 Advanced Ways Teams &amp;amp; Creators Use AI Video Summaries", covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated meeting minutes for remote teams&lt;/li&gt;
&lt;li&gt;Creator workflows from live stream to social clip&lt;/li&gt;
&lt;li&gt;Compliance &amp;amp; training summaries&lt;/li&gt;
&lt;li&gt;Market research &amp;amp; competitive analysis&lt;/li&gt;
&lt;li&gt;Accessibility &amp;amp; multilingual recaps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u5s3144eqjfc9jsajmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u5s3144eqjfc9jsajmv.png" alt="Infographic detailing the potential benefits of using multimodal AI for market research and analysis" width="800" height="2000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each section includes practical use cases (for example, teams could cut research cycles by 50%). Check it out and let me know which use case would make the biggest impact for you:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://medium.com/p/af375ac8e019" rel="noopener noreferrer"&gt;5 Advanced Ways Teams &amp;amp; Creators Use AI Video Summaries&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>multimodal</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>5 Ways AI Summaries Save You Time</title>
      <dc:creator>Daniel Anthony</dc:creator>
      <pubDate>Thu, 19 Jun 2025 13:35:56 +0000</pubDate>
      <link>https://forem.com/fidget_dan/5-ways-ai-summaries-save-you-time-571o</link>
      <guid>https://forem.com/fidget_dan/5-ways-ai-summaries-save-you-time-571o</guid>
      <description>&lt;p&gt;Hey! I’m the solo founder of Fidget, an AI tool that fuses audio, video and text into bullet‑point recaps of any long video. Below I’m sharing "5 Ways AI Summaries Save You Time," complete with beta feedback in which users estimated cutting study time from 6 hrs to 30 min per lecture.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Read on&lt;/strong&gt; for the full dive, let me know which time‑saving feature we should add next, and grab early access on our waitlist at &lt;a href="https://getfidget.pro/" rel="noopener noreferrer"&gt;https://getfidget.pro/&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;So, imagine spending five hours a week scrubbing through lectures, tutorials, or webinars and &lt;em&gt;still&lt;/em&gt; missing key points! With Fidget’s &lt;strong&gt;video summary AI&lt;/strong&gt;, you get &lt;strong&gt;laser-sharp, bullet-point recaps in seconds&lt;/strong&gt;, reclaiming your time for the work that matters. Below are five proven ways our &lt;strong&gt;multimodal&lt;/strong&gt; engine streamlines your workflow and frees you from endless video scrubbing.&lt;/p&gt;

&lt;h2&gt;1. Learn 10× Faster with Instant Bullet Points&lt;/h2&gt;

&lt;p&gt;Why watch an hour-long talk when you can read a 60-second summary? Fidget’s &lt;strong&gt;multimodal AI&lt;/strong&gt; fuses audio transcription, slide-change detection, and on-screen text extraction to pinpoint exactly what you need — no fluff, no filler.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; During tests, Fidget users reported they’d be able to cut their study time from 6 hrs to just 30 minutes per lecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagmr5uywd2bp0diuf7c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagmr5uywd2bp0diuf7c8.png" alt="Infographic comparing before and after using AI summaries for study" width="800" height="818"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;2. Reclaim 8+ Hours per Week from Long Videos&lt;/h2&gt;

&lt;p&gt;We know your calendar is stuffed with meetings, email and actual work. Don’t let long-form content eat into your schedule. Fidget identifies and extracts the most impactful takeaways, so you can skim summaries during your commute or coffee break.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Early testers saved an average of &lt;strong&gt;8.2 hours&lt;/strong&gt; weekly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;3. Stay Ahead of 99% of Viewers&lt;/h2&gt;

&lt;p&gt;In a world of infinite content, the first to act wins. Fidget’s &lt;strong&gt;video summary AI&lt;/strong&gt; surfaces trends and tactics from industry webinars in minutes, giving you a competitive edge before the crowd catches up.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Marketing teams can use Fidget to digest three competitor webinars in just 2 minutes, versus 180 minutes manually.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mvgtqcxvo6sjjhc42yj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mvgtqcxvo6sjjhc42yj.png" alt="Infographic showing the timeline of AI adoption compared to manual summary workflows" width="800" height="1147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;4. Multitask Effortlessly with Summaries at Your Fingertips&lt;/h2&gt;

&lt;p&gt;Cooking, commuting, or working out? Let Fidget do the heavy lifting. Our &lt;strong&gt;API-first&lt;/strong&gt; design means you can integrate on-demand summaries into whatever app or tool you already use. Hit “summarize” and keep living your life.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tech note:&lt;/strong&gt; One POST to &lt;code&gt;/v1/summarize&lt;/code&gt; returns time-stamped bullet points ready for Slack, Notion, or your LMS.&lt;/p&gt;
&lt;/blockquote&gt;
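&lt;p&gt;As a rough illustration, here’s how that call might look from Python using only the standard library. The endpoint path, payload fields and header shapes are assumptions based on this post, not published API docs:&lt;/p&gt;

```python
import json
import urllib.request

def build_summarize_request(video_url, api_key):
    # Build (but don't send) the POST request described above; pass the
    # result to urllib.request.urlopen when you're ready to call the API.
    body = json.dumps({"video_url": video_url, "language": "en"}).encode("utf-8")
    return urllib.request.Request(
        "https://api.getfidget.pro/v1/summarize",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```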

&lt;h2&gt;5. Skip the Scrubbing and Get Straight to the Key Moments&lt;/h2&gt;

&lt;p&gt;No more scrubbing through silence or irrelevant chatter. Fidget’s &lt;strong&gt;slide detection&lt;/strong&gt; flags every context shift — so your summary highlights only the core ideas, not the small talk.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Our multimodal engine achieved 99% accuracy in matching human-generated summaries in blind tests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Ready to Reclaim Your Time?&lt;/h2&gt;

&lt;p&gt;Stop wasting hours on long videos. Join the waitlist today and be among the first to experience Fidget’s &lt;strong&gt;video summary AI&lt;/strong&gt; when it launches!&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://getfidget.pro" rel="noopener noreferrer"&gt;Join Our Waitlist&lt;/a&gt;&lt;/strong&gt; and start saving hours every week!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How We Built Our API Multimodal Summary Engine</title>
      <dc:creator>Daniel Anthony</dc:creator>
      <pubDate>Thu, 12 Jun 2025 13:49:42 +0000</pubDate>
      <link>https://forem.com/fidget_dan/how-we-built-our-api-multimodal-summary-engine-1oj4</link>
      <guid>https://forem.com/fidget_dan/how-we-built-our-api-multimodal-summary-engine-1oj4</guid>
      <description>&lt;p&gt;I’m the founder of Fidget, an AI-powered video summarizer. Today’s post covers our multimodal engine’s architecture, complete with code examples.&lt;/p&gt;

&lt;p&gt;When we set out to build our &lt;strong&gt;Multimodal Summary Engine&lt;/strong&gt;, the idea was clear: ingest data from many sources (video, audio, metadata and more) and use it to produce a neat, human-readable summary. If you rely on off-the-shelf summarizers, you still end up manually parsing transcripts and missing slide cues. That’s why Fidget’s multimodal AI engine was built from day one to capture every visual and audio nuance. Instead of simply transcribing audio, Fidget listens for tonal emphasis, detects slide changes, and integrates on-screen text, all in real time.&lt;/p&gt;

&lt;h3&gt;Building the Architecture for the Multimodal Engine&lt;/h3&gt;

&lt;p&gt;First, we needed a home for our new system, so we spec’d out the Fidget API. We knew developers didn’t want extra complexity, so Fidget exposes a single endpoint to handle incoming requests. However, an API without guardrails is like a candy store without lockable cases, so rate limiting and user permissions became a top priority. From day one, we planned for &lt;strong&gt;every&lt;/strong&gt; request through the endpoint to be checked against per-user quotas, tokens and roles.&lt;/p&gt;

&lt;p&gt;A typical request might look like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -X POST https://api.getfidget.pro/v1/summarize \&lt;br&gt;
-H "Authorization: Bearer sk-YOUR_API_KEY" \&lt;br&gt;
-H "Content-Type: application/json" \&lt;br&gt;
-d '{"video_url": "https://example.com/video.mp4", "language": "en"}'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And a typical response:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{  &lt;br&gt;
 "success": true,  &lt;br&gt;
  "request_id": "fd7c9a1b-e8f2-4d3a-b8c5-2e7f3d8a9b1c",  &lt;br&gt;
  "processing_time": "0.87s",  &lt;br&gt;
  "video_metadata": {  &lt;br&gt;
    "title": "The Future of AI in Healthcare: Breakthroughs and Ethical Considerations",  &lt;br&gt;
    "duration": "15:42",  &lt;br&gt;
    "creator": "MedTech Insights",  &lt;br&gt;
    "language": "en",  &lt;br&gt;
    "topics": ["healthcare", "artificial intelligence", "ethics", "medical imaging", "drug discovery"]  &lt;br&gt;
  },  &lt;br&gt;
  "summary": {  &lt;br&gt;
    "executive_summary": "This comprehensive presentation explores how AI is transforming healthcare through advanced diagnostics, personalized treatment plans, and predictive analytics. The speaker appears optimistic and highlights recent breakthroughs in medical imaging analysis that have achieved 97.3% accuracy in early cancer detection, outperforming human radiologists by 11%. The discussion covers how machine learning has accelerated drug discovery timelines by 60% and how predictive analytics now forecast patient outcomes with 85% accuracy across multiple conditions.",  &lt;br&gt;
    "chapter_breakdown": [  &lt;br&gt;
      {  &lt;br&gt;
        "title": "Introduction to AI in Healthcare",  &lt;br&gt;
        "timestamp": "00:00 - 03:12",  &lt;br&gt;
        "summary": "Overview of current AI adoption in healthcare and historical context. The speaker appears happy and is standing against a whiteboard."  &lt;br&gt;
      },  &lt;br&gt;
      {  &lt;br&gt;
        "title": "Medical Imaging Breakthroughs",  &lt;br&gt;
        "timestamp": "03:13 - 07:45",  &lt;br&gt;
        "summary": "Detailed analysis of how AI systems detect patterns in medical images with 97.3% accuracy. Various x-ray images are shown to highlight the points being made by the speaker."  &lt;br&gt;
      },  &lt;br&gt;
      {  &lt;br&gt;
        "title": "Drug Discovery Revolution",  &lt;br&gt;
        "timestamp": "07:46 - 11:30",  &lt;br&gt;
        "summary": "Various scentists are shown working inside a lab performing medical tasks. The speaker is explaining the exploration of machine learning's role in accelerating pharmaceutical research"  &lt;br&gt;
      },  &lt;br&gt;
      {  &lt;br&gt;
        "title": "Ethical Considerations",  &lt;br&gt;
        "timestamp": "11:31 - 15:42",  &lt;br&gt;
        "summary": "The video takes a more serious tone while discussion of privacy concerns, algorithmic bias, and regulatory frameworks. The speaker is attempting to stay optimistic but they appear pensive."  &lt;br&gt;
      }  &lt;br&gt;
    ],  &lt;br&gt;
    "key_insights": [  &lt;br&gt;
      "AI systems can detect patterns in medical images that humans might miss, with 97.3% accuracy",  &lt;br&gt;
      "Machine learning has accelerated drug discovery timelines by 60%",  &lt;br&gt;
      "Predictive analytics can forecast patient outcomes with 85% accuracy",  &lt;br&gt;
      "Ethical frameworks must evolve alongside technological capabilities"  &lt;br&gt;
    ],  &lt;br&gt;
    "sentiment_analysis": {  &lt;br&gt;
      "overall": "positive",  &lt;br&gt;
      "confidence": 0.87,  &lt;br&gt;
      "segments": {  &lt;br&gt;
        "technological_advancements": "very positive",  &lt;br&gt;
        "ethical_considerations": "neutral",  &lt;br&gt;
        "future_outlook": "positive"  &lt;br&gt;
      }  &lt;br&gt;
    },  &lt;br&gt;
    "related_topics": [  &lt;br&gt;
      "precision medicine",  &lt;br&gt;
      "neural networks in diagnostics",  &lt;br&gt;
      "healthcare data privacy"  &lt;br&gt;
    ]  &lt;br&gt;
  },  &lt;br&gt;
  "model_version": "fidget-v2.3.1",  &lt;br&gt;
  "tokens_processed": 5842  &lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the input video is unavailable or otherwise unreadable, our API returns an HTTP 400 status with a machine-readable error code, so clients can correct the request and retry.&lt;/p&gt;
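&lt;p&gt;On the client side, a simple backoff schedule between retries is usually enough for transient failures. A minimal sketch; the error payload shape and field names here are assumptions for illustration, since the post only guarantees an HTTP 400 status with an error code:&lt;/p&gt;

```python
import json

# Hypothetical error payload; field names are assumptions, not the
# documented Fidget error format.
error_body = json.loads('{"success": false, "error_code": "video_unreadable"}')

def backoff_delays(retries, base=1.0):
    # Exponential backoff schedule (1s, 2s, 4s, ...) a client might sleep
    # through between retry attempts.
    return [base * (2 ** attempt) for attempt in range(retries)]
```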

&lt;p&gt;After the initial API design we sketched out our &lt;strong&gt;system flow&lt;/strong&gt;. Imagine a request arriving at &lt;code&gt;/v1/summarize&lt;/code&gt;: it first passes through an auth layer, then a rate limiter, and finally lands at a dispatcher that invokes the right downstream processes (we ended up calling them “modules”). These gates ensure that a rogue client can’t soak up everyone else’s resources or bypass business rules. This isn’t just about security; it also keeps performance predictable as more users discover the Fidget API, and lets us scale without degrading service.&lt;/p&gt;
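&lt;p&gt;In Python-ish pseudocode, that chain of gates looks something like this. It’s a simplified sketch of the flow described above, not production code:&lt;/p&gt;

```python
def auth_gate(request):
    # Reject requests that carry no bearer token (stand-in for real
    # token and role validation).
    if request.get("token") is None:
        raise PermissionError("missing token")
    return request

def rate_limit_gate(request, used, quota):
    # Quota check: `used not in range(quota)` is true once the caller
    # has consumed `quota` or more requests this window.
    if used not in range(quota):
        raise RuntimeError("rate limit exceeded")
    return request

def dispatch(request, modules):
    # Fan the validated request out to each registered modality module.
    return {name: module(request) for name, module in modules.items()}

def handle(request, used, quota, modules):
    # The /v1/summarize path: auth, then rate limiting, then dispatch.
    return dispatch(rate_limit_gate(auth_gate(request), used, quota), modules)
```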

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x0ovidlhlott6qs0cc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7x0ovidlhlott6qs0cc2.png" alt="System diagram of the Fidget Multimodal Summary Engine" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;System diagram of the Fidget Multimodal Summary Engine&lt;/p&gt;

&lt;p&gt;Underpinning all of this is a &lt;strong&gt;strict interface&lt;/strong&gt; between components, which is especially important because we anticipate adding new “modalities” down the road (more on that soon). Every module, whether it handles video frames or audio transcripts, exposes a stable set of input and output parameters. A clearly defined interface means modules talk to each other in a universal dialect: JSON objects with named fields, standardized error codes, and documented versioning. This interface can (and probably will) change with new major versions of the API (&lt;code&gt;/v1/summarize&lt;/code&gt;, &lt;code&gt;/v2/summarize&lt;/code&gt; and so on), but we plan to keep supporting older versions side by side by keeping the same modules around.&lt;/p&gt;

&lt;h3&gt;Defining Modal Sources (or Modalities) within the API&lt;/h3&gt;

&lt;p&gt;A “modality” is just a fancier word for “data type” or “context source.” But not every piece of data is created equal — so we asked ourselves: &lt;strong&gt;what makes a good context source?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Relevance:&lt;/strong&gt; If a video file’s metadata says it’s 2160p at 60 fps with a 10 Mbps bitrate, that’s interesting to our engine because it hints at production quality, for example.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Availability:&lt;/strong&gt; We prioritized sources that we could reliably extract at scale (standard container formats, well-defined audio codecs, and so on).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Signal-to-Noise Ratio:&lt;/strong&gt; A YouTube “tags” list might be partially user-generated and messy, while the actual audio waveform is unstructured but raw. We needed a sense of which fields tend to carry real, actionable value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once we identified our candidate sources, things like &lt;strong&gt;video metadata&lt;/strong&gt; (duration, resolution, codec, description text), &lt;strong&gt;audio tracks&lt;/strong&gt; (bits of speech or music) and &lt;strong&gt;key-frame snapshots&lt;/strong&gt; (image frames at specific intervals), we had to decide how to &lt;strong&gt;interpret the data&lt;/strong&gt;. Metadata often comes as JSON, so parsing fields like duration or bitrate is straightforward. But when we hit audio or visual data, things get messier: speech transcripts can be filled with filler words and images can be grainy or dark. That’s where our logic to &lt;strong&gt;handle noisy data&lt;/strong&gt; kicks in. For instance, silent parts of audio get flagged and skipped, low-confidence speech segments are marked “uncertain,” and blurred frames are discarded or given a low relevance score.&lt;/p&gt;
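&lt;p&gt;That noise handling can be pictured as a small filtering pass over each module’s output segments. A sketch, where the cutoff value and field names are illustrative rather than Fidget’s real configuration:&lt;/p&gt;

```python
import operator

def clean_segments(segments, cutoff=0.6):
    # Drop silent/empty spans outright and tag low-confidence speech as
    # "uncertain" so downstream stages can down-weight it.
    kept = []
    for seg in segments:
        if seg["text"].strip() == "":
            continue  # silence: skip entirely
        confident = operator.ge(seg["confidence"], cutoff)
        label = "ok" if confident else "uncertain"
        kept.append({**seg, "quality": label})
    return kept
```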

&lt;h3&gt;Extracting Data from Distinct Modalities (audio, video, metadata, YouTube)&lt;/h3&gt;

&lt;p&gt;With our modalities defined, we built &lt;strong&gt;unique modules&lt;/strong&gt; for each one. Each of these modules lives inside the API behind that &lt;strong&gt;strict interface&lt;/strong&gt; we mentioned earlier. We ended up with three core services:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Metadata Extractor:&lt;/strong&gt; Peels out raw JSON from tools like &lt;em&gt;ffprobe&lt;/em&gt; for video or &lt;em&gt;id3v2&lt;/em&gt; for audio.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Audio Transcriber:&lt;/strong&gt; Pulls audio tracks out of containers and sends them to our &lt;strong&gt;custom GPT-style omni model&lt;/strong&gt; for processing.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Frame Snapshotter:&lt;/strong&gt; Grabs “key frames” every few seconds, at an interval configurable by confidence scores.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these modules shares a &lt;strong&gt;common set of input/output parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example, every module accepts a payload like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{  &lt;br&gt;
  "resource_id": "abc123",  &lt;br&gt;
  "input_path": "/tmp/abc123/source.mp4",  &lt;br&gt;
  "settings": { /* e.g., sampling_interval: 10 */ }  &lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;…and produces something like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{  &lt;br&gt;
  "resource_id": "abc123",  &lt;br&gt;
  "output_path": "/tmp/abc123/frames/",  &lt;br&gt;
  "summary_path": "/tmp/abc123/frame_summaries.json"  &lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;magic&lt;/strong&gt; is that any new modality we create in future (OCR’d subtitles, social media comments, links in the description, and so on) just needs to implement the same interface in the Fidget engine.&lt;/p&gt;
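&lt;p&gt;One way to picture that contract is a small base class every module inherits, which validates the shared payload shape before doing its own work. This is a simplified sketch of the interface described above, not our actual internals:&lt;/p&gt;

```python
REQUIRED_INPUT = {"resource_id", "input_path", "settings"}

class ModalityModule:
    # Shared entry point: validate the common payload, then delegate to
    # the subclass-specific `process` implementation.
    def run(self, payload):
        missing = REQUIRED_INPUT - set(payload)
        if missing:
            raise ValueError(f"payload missing fields: {sorted(missing)}")
        return self.process(payload)

    def process(self, payload):
        raise NotImplementedError

class FrameSnapshotter(ModalityModule):
    def process(self, payload):
        # Toy implementation: report where frames and summaries would go.
        rid = payload["resource_id"]
        return {
            "resource_id": rid,
            "output_path": f"/tmp/{rid}/frames/",
            "summary_path": f"/tmp/{rid}/frame_summaries.json",
        }
```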

&lt;p&gt;From there, we needed to &lt;strong&gt;plumb each module&lt;/strong&gt; together by registering it in a central “pipeline orchestrator.” When a request for summarization arrives, the orchestrator fans out to each active modality module simultaneously, waits asynchronously for each of their individual responses, and moves to the next stage. This approach means we can add or remove a modality with minimal friction.&lt;/p&gt;
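&lt;p&gt;The fan-out step maps naturally onto async concurrency. A rough sketch of the orchestrator’s core loop, with module bodies reduced to stubs:&lt;/p&gt;

```python
import asyncio

async def run_module(name, payload):
    # Stand-in for one modality module; a real module would do file I/O
    # or model inference here.
    await asyncio.sleep(0)
    return name, {"resource_id": payload["resource_id"], "module": name}

async def orchestrate(payload, active_modules):
    # Fan out to every active modality module concurrently, then gather
    # all results before moving to the next pipeline stage.
    pairs = await asyncio.gather(
        *(run_module(name, payload) for name in active_modules)
    )
    return dict(pairs)

results = asyncio.run(
    orchestrate({"resource_id": "abc123"}, ["metadata", "audio", "frames"])
)
```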

&lt;h3&gt;The Video Summary AI Combinator&lt;/h3&gt;

&lt;p&gt;Once each module finishes its work, we collect everything into a staging area, which (for simplicity’s sake) ends up being a simple directory structure with JSON files and optional assets. To fuse these pieces, we built what we affectionately call &lt;strong&gt;“The Combinator.”&lt;/strong&gt; It’s kind of like a blender where each ingredient (modality) gets measured by a weight slider (relevance or confidence).&lt;/p&gt;

&lt;p&gt;First, we had to &lt;strong&gt;define modality weights&lt;/strong&gt;. Some data types are inherently more relevant for particular tasks. For a news clip, speech transcripts might matter most; for a “how-to” cooking video, key frames and on-screen text could carry more weight. We set up a configuration file where we can assign relative weights like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;audio_transcript: 0.4    &lt;br&gt;
key_frame_text:   0.3    &lt;br&gt;
metadata:         0.3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;…so we can quickly see how the different modalities affect the final output. Eventually, weighting will be automatic, driven by an initial scan and the confidence values measured from the actual content.&lt;/p&gt;
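&lt;p&gt;Conceptually, each weight is just a scalar multiplier applied when candidate highlights are ranked. An illustrative sketch, not the production scoring function:&lt;/p&gt;

```python
def rank_highlights(modalities, weights):
    # Score each candidate line by its own confidence times the weight of
    # the modality it came from, then return text in descending score order.
    scored = []
    for source, items in modalities.items():
        w = weights.get(source, 0.0)
        for item in items:
            scored.append((w * item["confidence"], item["text"]))
    scored.sort(reverse=True)
    return [text for _score, text in scored]
```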

&lt;p&gt;When the Combinator runs, it &lt;strong&gt;pulls data from all the modules&lt;/strong&gt; in a single step. Under the hood, it reads in &lt;em&gt;audio_transcript.json, frame_summaries.json&lt;/em&gt; and &lt;em&gt;metadata.json.&lt;/em&gt; It then normalizes fields (e.g. converting timestamps to a uniform “seconds since start” format) and constructs a consolidated in-memory representation like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{  &lt;br&gt;
  "resource_id": "abc123",  &lt;br&gt;
  "modalities": {  &lt;br&gt;
    "audio_transcript": [...],  &lt;br&gt;
    "frame_summaries": [...],  &lt;br&gt;
    "metadata": {...}  &lt;br&gt;
  },  &lt;br&gt;
  "weights": {  &lt;br&gt;
    "audio_transcript": 0.4,  &lt;br&gt;
    "frame_summaries": 0.3,  &lt;br&gt;
    "metadata": 0.3  &lt;br&gt;
  }  &lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Finally, the Combinator churns out a combined dataset ready for the next stage: either on-the-fly summarization or feeding into a training pipeline.&lt;/p&gt;
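&lt;p&gt;The timestamp normalization mentioned above is simple but worth getting right, since every modality reports time differently. A sketch handling both "MM:SS" and "HH:MM:SS" stamps:&lt;/p&gt;

```python
def to_seconds(stamp):
    # Convert "MM:SS" or "HH:MM:SS" into seconds since the start of the
    # video, the uniform format the Combinator works in.
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total
```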

&lt;h3&gt;Adding Modalities to AI Training Data&lt;/h3&gt;

&lt;p&gt;With the Combinator’s output in hand, we &lt;strong&gt;add modalities as context&lt;/strong&gt; for our &lt;strong&gt;custom GPT-style AI model&lt;/strong&gt;. The idea is that each modality module’s data becomes part of the &lt;strong&gt;training context&lt;/strong&gt;. For example, our LM sees:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[METADATA] Title: “How to Bake Bread”; Duration: 00:05:32    &lt;br&gt;
[AUDIO] 0:00–0:03: “Welcome to my bakery show...”    &lt;br&gt;
[FRAME] 0:05: Frame description: “Chef kneads dough.”    &lt;br&gt;
...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;By feeding the LM a structured, modality‐tagged dataset, we teach it how to correlate, say, a mention of “kneading” in audio with the corresponding visual frame. During model training, we employ techniques like &lt;strong&gt;contextual embedding&lt;/strong&gt; where each modality’s tokens get their own positional encoding. We also up-weight or down-weight entire modalities based on the Combinator’s weights in this step. This ensures the final LM doesn’t drown in a flood of irrelevant information — no one wants a summarizer that fixates on bitrates instead of human speech!&lt;/p&gt;
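&lt;p&gt;Assembling that tagged context is mostly string formatting. A sketch using the tag format shown above; the field names are illustrative:&lt;/p&gt;

```python
def build_context(metadata, transcript, frames):
    # Interleave each modality into one tagged text block the model can
    # consume, in the [METADATA]/[AUDIO]/[FRAME] format shown above.
    lines = [
        f"[METADATA] Title: {metadata['title']}; Duration: {metadata['duration']}"
    ]
    for seg in transcript:
        lines.append(f"[AUDIO] {seg['span']}: {seg['text']}")
    for frame in frames:
        lines.append(f"[FRAME] {frame['time']}: Frame description: {frame['desc']}")
    return "\n".join(lines)
```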

&lt;p&gt;&lt;strong&gt;In early tests, our prototype processed one hour of lecture video in under six minutes, with near 100% accuracy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the training data is prepared, we hit the familiar “train” button (submitting jobs to our internal ML instances). Over multiple cycles, the model learns to generate coherent summaries that weave together the information it’s been provided: metadata blurbs, spoken dialogue, and visual descriptions all get combined and correlated. At this stage we also monitor validation loss carefully, making sure the model doesn’t overfit to one modality at the expense of others. As mentioned previously, we’re hoping to automate this further in future and feed it back into the weighting system.&lt;/p&gt;

&lt;p&gt;Once all of this is done, we bundle everything into a neat JSON payload along with other data relevant to the task (tokens processed, model used, time taken, and so on) and return the response to the client with a lovely HTTP 200 status.&lt;/p&gt;

&lt;h3&gt;What’s Next for the API?&lt;/h3&gt;

&lt;p&gt;We’re currently in &lt;strong&gt;alpha&lt;/strong&gt; with Fidget’s Multimodal Summary Engine, rolling it out to a handful of pilot customers. We’re aiming for a Summer 2025 public launch, after which we’ll be watching the wider reception and community carefully.&lt;/p&gt;

&lt;p&gt;So far, our post-launch roadmap includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Feedback Loops:&lt;/strong&gt; We’d like to add surveys and usage telemetry to the UX so that users can flag wildly inaccurate summaries or suggest new modalities themselves (like on-screen text recognition).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;New Modalities on Deck:&lt;/strong&gt; Imagine live chat comments for livestreams, social sentiment scores from Reddit posts, or even links to real-world news stories mentioned inside a video.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Fine-Tuning &amp;amp; Iteration:&lt;/strong&gt; We’ll iteratively tweak modality weights, refine our noise filters, and periodically update the underlying language model to keep pace with slang, jargon and evolving content trends.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalability &amp;amp; Availability:&lt;/strong&gt; We’re working hard to make every part of the Fidget API scalable, both in terms of usage and performance. We’ll be making this a top priority post-launch so you’ll always have Fidget available 24/7.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short, we’ve laid a robust, extensible foundation: an API that enforces permissions and rate limits, a set of plug-and-play modality extractors, a clever Combinator to merge it all, and a training pipeline that teaches our models the &lt;em&gt;context&lt;/em&gt; behind the &lt;em&gt;content&lt;/em&gt;. The journey from a raw video link to a concise, readable summary is now as smooth as butter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getfidget.pro/" rel="noopener noreferrer"&gt;&lt;strong&gt;👉 Want to shape Fidget’s roadmap?&lt;/strong&gt; Join our API waitlist and receive early access, priority support and input into the development of Fidget.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can’t wait to see what you build using the Fidget API!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>api</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
