<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tasha</title>
    <description>The latest articles on Forem by Tasha (@mindchatter).</description>
    <link>https://forem.com/mindchatter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F362553%2F37c74277-36d5-4c93-8e03-da51601c4d29.jpeg</url>
      <title>Forem: Tasha</title>
      <link>https://forem.com/mindchatter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mindchatter"/>
    <language>en</language>
    <item>
      <title>AI-assisted removal of filler words from video recordings</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Wed, 01 Nov 2023 17:37:11 +0000</pubDate>
      <link>https://forem.com/trydaily/ai-assisted-removal-of-filler-words-from-video-recordings-2m4c</link>
      <guid>https://forem.com/trydaily/ai-assisted-removal-of-filler-words-from-video-recordings-2m4c</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/liza/" rel="noopener noreferrer"&gt;Liza Shulyayeva&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the ongoing evolution of LLM-powered workflows, the limits of what AI can do with real-time and recorded video are rapidly expanding. AI can now contribute to post-processing through contextualized parsing of video, audio, and transcription output. Some results are production-worthy while others are exploratory, benefiting from an additional human touch. In the end, it’s human intuition and ingenuity that enables LLM-powered applications to shine.&lt;/p&gt;

&lt;p&gt;In this post, I’ll explore one use case and implementation for AI-assisted post-processing that can make video presenters’ lives a little easier. We’ll go through &lt;a href="https://github.com/daily-demos/filler-word-removal" rel="noopener noreferrer"&gt;a small demo&lt;/a&gt; which lets you remove disfluencies, also known as &lt;em&gt;filler words&lt;/em&gt;, from any MP4 file. These can include words like “um”, “uh”, and similar. I will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the demo works from an end-user perspective&lt;/li&gt;
&lt;li&gt;A before and after example video&lt;/li&gt;
&lt;li&gt;The demo’s tech stack and architecture&lt;/li&gt;
&lt;li&gt;Running the demo locally&lt;/li&gt;
&lt;li&gt;What’s happening under the hood as filler words are being removed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How the demo works&lt;/h2&gt;

&lt;p&gt;When the user opens the filler removal web application, they’re faced with a page that lets them either upload their own MP4 file or fetch the cloud recordings from their Daily domain:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z07ijdu2q21q8jcgdvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z07ijdu2q21q8jcgdvx.png" alt="A webpage with a video upload form and a Daily recording-fetch button "&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The front-end landing page&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For this demo, I’ve stuck with Quart’s default request size limit of 16MB (but feel free to &lt;a href="https://pgjones.gitlab.io/quart/discussion/dos_mitigations.html#large-request-body" rel="noopener noreferrer"&gt;configure this&lt;/a&gt; in your local installation). Once the user uploads an MP4 file, the back-end component of the demo starts processing the file to remove filler words. At this point, the client shows the status of the project in the app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr0p34on901kylqla144.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr0p34on901kylqla144.png" alt="The web app page with a single filler word removal project row in the Upload section"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A disfluency removal project being processed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the user clicks the “Fetch Daily recordings” button, all the Daily recordings on the configured Daily domain are displayed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gxywh1cxsajn4v6rgdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gxywh1cxsajn4v6rgdy.png" alt="A table of Daily cloud recordings fetched for the configured domain"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A list of Daily cloud recordings for the configured domain&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The user can then click “Process” next to any of the recordings to begin removing filler words from that file. The status of the project will be displayed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr7y3fq5jw0aqclb0ck2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr7y3fq5jw0aqclb0ck2.png" alt="A list of Daily recordings, one of which is marked as "&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;One recording being processed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once a processing project is complete, a “Download Output” link is shown to the user, where they can retrieve their new, de-filler-ized video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6916c8qnjj7hn001voih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6916c8qnjj7hn001voih.png" alt="A project row with a "&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A successfully-processed video with an output download link&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s an example of a before and after result:&lt;/p&gt;

&lt;p&gt;Before&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/3t3gg7uRH7s"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;After&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/6-IQFpwYA5A"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;As you can see, the output is promising but not perfect. I’ll leave some final impressions of both Deepgram and Whisper results at the end of this post.&lt;/p&gt;

&lt;p&gt;Now that we’re familiar with the user flow, let’s look into the demo tech stack and architecture.&lt;/p&gt;

&lt;h2&gt;Tech stack and architecture&lt;/h2&gt;

&lt;p&gt;This demo is built using the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript for the client-side.&lt;/li&gt;
&lt;li&gt;Python for the server component.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://quart.palletsprojects.com/en/latest/%29" rel="noopener noreferrer"&gt;Quart&lt;/a&gt; for the processing server (similar to Flask, but designed to play nice with asynchronous programming.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/moviepy/" rel="noopener noreferrer"&gt;moviepy&lt;/a&gt; to extract audio from, split, and then re-concatenate our original video files.&lt;/li&gt;
&lt;li&gt;Deepgram and Whisper as two LLM transcription and filler word detection options:&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.deepgram.com/docs/python-sdk" rel="noopener noreferrer"&gt;Deepgram’s Python SDK&lt;/a&gt; to implement Deepgram transcription with their &lt;em&gt;Nova&lt;/em&gt;-tier model, which lets us get filler words in the transcription output. This transcriber relies on a Deepgram API key.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/linto-ai/whisper-timestamped" rel="noopener noreferrer"&gt;&lt;code&gt;whisper-timestamped&lt;/code&gt;&lt;/a&gt;, which is a layer on top of the &lt;a href="https://openai.com/research/whisper" rel="noopener noreferrer"&gt;Whisper&lt;/a&gt; set of models enabling us to get accurate word timestamps and include filler words in transcription output. This transcriber downloads the selected Whisper model to the machine running the demo and no third-party API keys are required.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/rest-api" rel="noopener noreferrer"&gt;Daily’s REST API&lt;/a&gt; to retrieve Daily recordings and recording access links. If a Daily API key is not specified, the demo can still be used by uploading your own MP4 file manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the server-side, the key concepts are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Projects&lt;/strong&gt;. The &lt;code&gt;Project&lt;/code&gt; class is defined in &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py" rel="noopener noreferrer"&gt;&lt;code&gt;server/project.py&lt;/code&gt;&lt;/a&gt;. Each instance of this class represents a single video for which filler words are being removed. When a project is instantiated, it takes an optional &lt;code&gt;transcriber&lt;/code&gt; parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcribers&lt;/strong&gt;. Transcribers are the transcription implementations that power filler word detection. As mentioned before, I’ve implemented &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/dg.py" rel="noopener noreferrer"&gt;Deepgram&lt;/a&gt; and &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/whisper.py" rel="noopener noreferrer"&gt;Whisper&lt;/a&gt; transcribers for this demo. You can also add your own by placing any transcriber you’d like into a new class within the &lt;code&gt;server/transcription/&lt;/code&gt; directory (I’ll talk a bit more about that later).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The steps an input video file goes through are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk86ybomboxapwj8a12s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk86ybomboxapwj8a12s2.png" alt="A diagram showing the source MP4 going through 5 steps: Stripping audio into a separate file, transcribing audio, deducing filler word split times, cutting up the original video, reconstituting the video clips"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Running the demo locally&lt;/h2&gt;

&lt;p&gt;To run the demo locally, be sure to have &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Python 3.11&lt;/a&gt; and &lt;a href="https://ffmpeg.org/" rel="noopener noreferrer"&gt;FFmpeg&lt;/a&gt; installed.&lt;/p&gt;

&lt;p&gt;Then, run the following commands (replacing the &lt;code&gt;python3&lt;/code&gt; and &lt;code&gt;pip3&lt;/code&gt; commands with your own aliases to Python and pip as needed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the git repository
git clone https://github.com/daily-demos/filler-word-removal.git
cd filler-word-removal
git checkout v1.0

# Configure and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip3 install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optionally, copy the &lt;code&gt;.env.sample&lt;/code&gt; file and assign your &lt;a href="https://console.deepgram.com/" rel="noopener noreferrer"&gt;Deepgram&lt;/a&gt; and &lt;a href="https://dashboard.daily.co/developers" rel="noopener noreferrer"&gt;Daily&lt;/a&gt; API keys. Both of these are optional, but in my experience Deepgram’s out-of-the-box results are usually superior to Whisper’s, so I’d really suggest trying it out.&lt;/p&gt;

&lt;p&gt;Now, run the following commands in two &lt;em&gt;separate&lt;/em&gt; terminals within your virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Start the processing server
quart --app server/index.py --debug run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Serve the front-end
python -m http.server --directory client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open your web browser of choice (I suggest Chrome) to the localhost address shown in the second terminal window above. By default, this will be &lt;code&gt;http://localhost:8000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we’ve got the app running, let’s see what’s happening under the hood.&lt;/p&gt;

&lt;h2&gt;Under the hood of AI-powered video filler word removal&lt;/h2&gt;

&lt;p&gt;I’m going to mostly focus on the &lt;em&gt;server&lt;/em&gt; side here, because that’s where all the magic happens. You can check out the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/client" rel="noopener noreferrer"&gt;source code for the client on GitHub&lt;/a&gt; to have a look at how it uses the server components below.&lt;/p&gt;

&lt;h3&gt;Server routes&lt;/h3&gt;

&lt;p&gt;All of the processing server routes are defined in &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/index.py" rel="noopener noreferrer"&gt;&lt;code&gt;server/index.py&lt;/code&gt;&lt;/a&gt;. They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /upload&lt;/code&gt;: Handles the manual upload of an MP4 file and begins processing the file to remove filler words.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /process_recording/&amp;lt;recording_id&amp;gt;&lt;/code&gt;: Downloads a Daily cloud recording by the provided ID and begins processing the file to remove disfluencies.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /projects/&amp;lt;project_id&amp;gt;&lt;/code&gt;: Reads the status file of the given filler-word-removal project and returns its contents. Enables the client to poll for status updates while processing is in progress.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /projects/&amp;lt;projct_id&amp;gt;/download&lt;/code&gt;: Downloads the output file for the given filler-word-removal project ID, if one exists.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /recordings&lt;/code&gt;: Retrieves a list of all Daily recordings for the configured Daily domain.&lt;/li&gt;
&lt;/ul&gt;
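&lt;p&gt;The &lt;code&gt;GET /projects&lt;/code&gt; status route works because processing writes its progress to a small status file that the client can poll. Here’s a rough, stand-alone illustration of that mechanism (the file name and JSON shape are my own, not necessarily what the demo uses):&lt;/p&gt;

```python
import json
import os
import tempfile

# Illustrative sketch: the background task writes JSON status updates
# to a per-project file, and the status route simply reads it back.
def update_status(path, state, message):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"state": state, "message": message}, f)

def read_status(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

status_path = os.path.join(tempfile.mkdtemp(), "status.json")
update_status(status_path, "IN_PROGRESS", "Transcribing audio")
print(read_status(status_path)["message"])  # Transcribing audio
```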

&lt;p&gt;Let’s go through the manual upload flow and see how processing happens.&lt;/p&gt;

&lt;h2&gt;Processing an MP4 file with the &lt;code&gt;/upload&lt;/code&gt; route&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/index.py#L24" rel="noopener noreferrer"&gt;&lt;code&gt;/upload&lt;/code&gt; route&lt;/a&gt; looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route('/upload', methods=['POST'])
async def upload_file():
    """Saves uploaded MP4 file and starts processing.
    Returns project ID"""
    files = await request.files
    file = files["file"]
    project = Project()
    file_name = f'{project.id}.mp4'
    file_path = os.path.join(get_upload_dir_path(), file_name)
    try:
        await file.save(file_path)
        if not os.path.exists(file_path):
            raise Exception("uploaded file not saved", file_path)
    except Exception as e:
        return process_error('failed to save uploaded file', e)

    return process(project, file_path, file_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I start by retrieving the file from the request. I then create an instance of &lt;code&gt;Project()&lt;/code&gt;, which will &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L51" rel="noopener noreferrer"&gt;generate a unique ID&lt;/a&gt; for itself when instantiated and decide which transcriber to use. I’ll cover the &lt;code&gt;Project&lt;/code&gt; instance setup shortly.&lt;/p&gt;

&lt;p&gt;Next, I retrieve the path to which I’ll save the uploaded file based on the newly-created project ID. This directory can be configured in the application’s environment variables; check out the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/config.py" rel="noopener noreferrer"&gt;&lt;code&gt;/server/config.py&lt;/code&gt; file&lt;/a&gt; for more information.&lt;/p&gt;

&lt;p&gt;Once I have the file and the path to save it to, I save the file. If something goes wrong during this step, I return an error to the client. If the file saved successfully, I begin processing. I’ll dive into the processing step shortly. First, let’s take a quick look at the &lt;code&gt;Project&lt;/code&gt; constructor I mentioned above:&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;Project&lt;/code&gt; setup&lt;/h3&gt;

&lt;p&gt;As mentioned above, the &lt;code&gt;Project&lt;/code&gt; class constructor configures a unique ID for the project. It also decides which transcriber (Deepgram or Whisper) will be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Project:
    """Class representing a single filler word removal project."""
    transcriber = None
    id = None

    def __init__(
            self,
            transcriber=None,
    ):
        if not transcriber:
            transcriber = Transcribers.WHISPER
            deepgram_api_key = os.getenv("DEEPGRAM_API_KEY")
            if deepgram_api_key:
                transcriber = Transcribers.DEEPGRAM
        self.transcriber = transcriber.value
        self.id = self.configure()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, if a &lt;code&gt;transcriber&lt;/code&gt; argument is not passed in, &lt;code&gt;Project&lt;/code&gt; will look for a &lt;code&gt;DEEPGRAM_API_KEY&lt;/code&gt; environment variable. If a Deepgram API key has been configured, Deepgram will be used as the transcriber. Otherwise, it’ll fall back to a locally-downloaded Whisper model.&lt;/p&gt;
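&lt;p&gt;That fallback boils down to a simple environment check. A minimal stand-alone sketch (the string names here are illustrative; the demo itself uses a &lt;code&gt;Transcribers&lt;/code&gt; enum):&lt;/p&gt;

```python
import os

# Illustrative sketch of the transcriber fallback: prefer Deepgram when
# an API key is configured, otherwise use the local Whisper model.
def pick_transcriber(explicit=None):
    if explicit:
        return explicit
    if os.getenv("DEEPGRAM_API_KEY"):
        return "deepgram"
    return "whisper"

os.environ.pop("DEEPGRAM_API_KEY", None)
print(pick_transcriber())           # whisper
os.environ["DEEPGRAM_API_KEY"] = "dg_test_key"
print(pick_transcriber())           # deepgram
print(pick_transcriber("whisper"))  # an explicit choice always wins
```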

&lt;p&gt;The project ID is a UUID generated in the &lt;code&gt;configure()&lt;/code&gt; method, which checks for conflicts with any existing projects and sets up the temporary directory for this project instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def configure(self):
    """Generates a unique ID for this project and creates its temp dir"""
    proj_id = uuid.uuid4()
    temp_dir = get_project_temp_dir_path(proj_id)
    if os.path.exists(temp_dir):
        # Directory already exists, which indicates a conflict.
        # Pick a new UUID and try again
        return self.configure()
    os.makedirs(temp_dir)
    return proj_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we know how a project is configured, let’s dig into processing.&lt;/p&gt;

&lt;h2&gt;Beginning processing&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/index.py#L65" rel="noopener noreferrer"&gt;&lt;code&gt;process()&lt;/code&gt; function&lt;/a&gt; in &lt;code&gt;server/index.py&lt;/code&gt; takes the &lt;code&gt;Project&lt;/code&gt; instance I created earlier, the path of the uploaded MP4 file, and the file name. It then processes the project in a Quart &lt;a href="https://quart.palletsprojects.com/en/latest/how_to_guides/background_tasks.html" rel="noopener noreferrer"&gt;background task&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process(project: Project, file_path: str, file_name: str) -&amp;gt; tuple[quart.Response, int]:
    """Runs filler-word-removal processing on given file."""
    try:
        app.add_background_task(project.process, file_path)

        response = {'project_id': project.id, 'name': file_name}
        return jsonify(response), 200
    except Exception as e:
        return process_error('failed to start processing file', e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, the client’s request does not need to wait until the whole filler-word-removal process is complete, which can take a couple of minutes. The user will know right away that processing has started and receive a project ID which they can use to poll for status updates.&lt;/p&gt;
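&lt;p&gt;The same fire-and-forget pattern can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This is only an analogy for what Quart’s &lt;code&gt;add_background_task()&lt;/code&gt; does for us, not the demo’s actual code:&lt;/p&gt;

```python
import asyncio

# Stand-in for minutes of filler-word-removal work.
async def long_job(status):
    status["state"] = "IN_PROGRESS"
    await asyncio.sleep(0.01)
    status["state"] = "SUCCEEDED"

async def handle_upload():
    status = {"state": "PENDING"}
    # Schedule the job without awaiting it, as add_background_task does...
    task = asyncio.create_task(long_job(status))
    # ...so the response can be produced immediately.
    response = {"project_id": "1234"}
    await task  # Quart keeps the task running after the response is sent
    return response, status

response, status = asyncio.run(handle_upload())
print(response, status)  # {'project_id': '1234'} {'state': 'SUCCEEDED'}
```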

&lt;p&gt;We’re now ready to dig into the critical part: What does &lt;code&gt;project.process()&lt;/code&gt; &lt;em&gt;do&lt;/em&gt;?&lt;/p&gt;

&lt;h2&gt;The processing step&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L64" rel="noopener noreferrer"&gt;&lt;code&gt;process()&lt;/code&gt; project instance method&lt;/a&gt; is responsible for all of the filler-word-removal operations and status updates on the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process(self, source_video_path: str):
    """Processes the source video to remove filler words"""
    self.update_status(Status.IN_PROGRESS, '')
    try:
        self.update_status(Status.IN_PROGRESS, 'Extracting audio')
        audio_file_path = self.extract_audio(source_video_path)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to extract audio file')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Transcribing audio')
        result = self.transcribe(audio_file_path)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to transcribe audio')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Splitting video file')
        split_times = self.get_splits(result)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to get split segments')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Reconstituting video file')
        self.resplice(source_video_path, split_times)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to resplice video')
        return

    self.update_status(Status.SUCCEEDED, 'Output file ready for download')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aside from basic error handling and status updates, the primary steps being performed above are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L105" rel="noopener noreferrer"&gt;&lt;code&gt;extract_audio()&lt;/code&gt;&lt;/a&gt;: Extracting the audio from the uploaded video file and saving it to a WAV file.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L118" rel="noopener noreferrer"&gt;&lt;code&gt;transcribe()&lt;/code&gt;&lt;/a&gt;: Transcribing the audio using the configured transcriber.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L122" rel="noopener noreferrer"&gt;&lt;code&gt;get_splits()&lt;/code&gt;&lt;/a&gt;: Getting the &lt;em&gt;split times&lt;/em&gt; we’ll use to split and reconstitute the video with filler words excluded. This also uses the configured transcriber, since the data format here may be different across different transcription models or services.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L126" rel="noopener noreferrer"&gt;&lt;code&gt;resplice()&lt;/code&gt;&lt;/a&gt;: Cuts up and then splices the video based on the transcriber’s specified split times.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ve linked to each function in GitHub above. Let’s take a look at a few of them in more detail. Specifically, let’s focus on our &lt;em&gt;transcribers&lt;/em&gt;, because this is where the LLM-powered magic happens.&lt;/p&gt;

&lt;h3&gt;Transcribing audio with filler words included using Deepgram&lt;/h3&gt;

&lt;p&gt;I’ll use Deepgram as the primary example for this post, but I encourage you to also check out the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/whisper.py#L18" rel="noopener noreferrer"&gt;Whisper implementation&lt;/a&gt; to see how it varies.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/dg.py" rel="noopener noreferrer"&gt;&lt;code&gt;server/transcription/dg.py&lt;/code&gt;&lt;/a&gt; module, I start by configuring some Deepgram transcription options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEEPGRAM_TRANSCRIPTION_OPTIONS = {
    "model": "general",
    "tier": "nova",
    "filler_words": True,
    "language": "en",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two most important settings above are &lt;code&gt;"tier"&lt;/code&gt; and &lt;code&gt;"filler_words"&lt;/code&gt;. By default, Deepgram omits filler words from the transcription result. To enable &lt;a href="https://developers.deepgram.com/docs/filler-words" rel="noopener noreferrer"&gt;inclusion of filler words&lt;/a&gt; in the output, a Nova-tier model must be used. Currently, this is only supported with the English Nova model.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the &lt;code&gt;dg&lt;/code&gt; module transcription step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def transcribe(audio_path: str):
    """Transcribes given audio file using Deepgram's Nova model"""
    deepgram_api_key = os.getenv("DEEPGRAM_API_KEY")
    if not deepgram_api_key:
        raise Exception("Deepgram API key is missing")
    if not os.path.exists(audio_path):
        raise Exception("Audio file could not be found", audio_path)
    try:
        deepgram = Deepgram(deepgram_api_key)
        with open(audio_path, 'rb') as audio_file:
            source = {'buffer': audio_file, 'mimetype': "audio/wav"}
            res = deepgram.transcription.sync_prerecorded(
                source, DEEPGRAM_TRANSCRIPTION_OPTIONS
            )
        return res
    except Exception as e:
        raise Exception("failed to transcribe with Deepgram") from e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I start by retrieving the Deepgram API key and raising an exception if it isn’t configured. I then confirm that the provided audio file actually exists and—you guessed it—raise an exception if not. Once we’re sure the basics are in place, we’re good to go with the transcription.&lt;/p&gt;

&lt;p&gt;I then instantiate Deepgram, open the audio file, transcribe it via Deepgram’s &lt;a href="https://github.com/deepgram/deepgram-python-sdk#local-files" rel="noopener noreferrer"&gt;&lt;code&gt;sync_prerecorded()&lt;/code&gt;&lt;/a&gt; SDK method, and return the result.&lt;/p&gt;

&lt;p&gt;Once the transcription is done, the result is returned back to the &lt;code&gt;Project&lt;/code&gt; instance. With Deepgram, the result will be a JSON object that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "metadata":{
      //… Metadata properties here, not relevant for our purposes
   },
   "results":{
      "channels":[
         {
            "alternatives":[
               {
                  "transcript":"hello",
                  "confidence":0.9951172,
                  "words":[
                     {
                        "word":"hello",
                        "start":0.79999995,
                        "end":1.3,
                        "confidence":0.796875
                     }
                  ]
               }
            ]
         }
      ]
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to process this output to find relevant &lt;em&gt;split points&lt;/em&gt; for our video.&lt;/p&gt;
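&lt;p&gt;Digging the word list out of that nested structure is straightforward; the demo’s &lt;code&gt;get_words()&lt;/code&gt; helper does roughly the following (the sample data is abbreviated from the JSON shown above):&lt;/p&gt;

```python
# A sketch of extracting the word list from a Deepgram-shaped result;
# the demo's get_words() helper does roughly this.
def get_words(result):
    alternatives = result["results"]["channels"][0]["alternatives"]
    return alternatives[0]["words"]

# Abbreviated sample shaped like the Deepgram response above.
sample = {
    "results": {"channels": [{"alternatives": [{
        "transcript": "hello",
        "confidence": 0.9951172,
        "words": [{"word": "hello", "start": 0.79999995, "end": 1.3,
                   "confidence": 0.796875}],
    }]}]},
}
print(get_words(sample)[0]["word"])  # hello
```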

&lt;h3&gt;Finding filler word split points in the transcription&lt;/h3&gt;

&lt;p&gt;After producing a transcription with filler words included, the same transcriber is also responsible for parsing the output and compiling all the split points we’ll need to remove the disfluencies. So, let’s take a look at &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/dg.py#L35" rel="noopener noreferrer"&gt;how I do this&lt;/a&gt; in the &lt;code&gt;dg&lt;/code&gt; module (I’ve left some guiding comments inline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_splits(transcription) -&amp;gt; timestamp.Timestamps:
"""Retrieves split points with detected filler words removed"""
filler_triggers = ["um", "uh", "eh", "mmhm", "mm-mm"]
words = get_words(result)
splits = timestamp.Timestamps()
first_split_start = 0
try:
   for text in words:
       word = text["word"]
       word_start = text["start"]
       word_end = text["end"]
       if word in filler_triggers:
           # If non-filler tail already exists, set the end time to the start of this filler word
           if splits.tail:
               splits.tail.end = word_start


               # If previous non-filler's start time is not the same as the start time of this filler,
               # add a new split.
               if splits.tail.start != word_start:
                   splits.add(word_end, -1)
           else:
               # If this is the very first word, be sure to start
               # the first split _after_ this one ends.
               first_split_start = word_end


       # If this is not a filler word and there are no other words
       # already registered, add the first split.
       elif splits.count == 0:
           splits.add(first_split_start, -1)
   splits.tail.end = words[-1]["end"]
   return splits
    except Exception as e:
        raise Exception("failed to split at filler words") from e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I retrieve all the words from Deepgram’s transcription output by parsing the transcription JSON (check out the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/dg.py#L70" rel="noopener noreferrer"&gt;&lt;code&gt;get_words()&lt;/code&gt; function&lt;/a&gt; source if you’re curious about that object structure).&lt;/p&gt;

&lt;p&gt;I then iterate over each word and retrieve its &lt;code&gt;"word"&lt;/code&gt;, &lt;code&gt;"start"&lt;/code&gt;, and &lt;code&gt;"end"&lt;/code&gt; properties. If the &lt;code&gt;"word"&lt;/code&gt; value is a filler word, I end the previous split at the beginning of the filler. I then add a new split at the end of the filler.&lt;/p&gt;
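&lt;p&gt;To make the split logic concrete, here is a minimal, self-contained sketch of the same idea, using plain &lt;code&gt;(start, end)&lt;/code&gt; tuples in place of the demo’s &lt;code&gt;Timestamps&lt;/code&gt; class (the word dictionaries only mimic the shape of Deepgram’s output):&lt;/p&gt;

```python
# Illustrative standalone sketch: compute the (start, end) spans of
# speech to keep, skipping over detected filler words.
FILLERS = {"um", "uh", "eh", "mmhm", "mm-mm"}

def keep_ranges(words):
    """Return the (start, end) spans of speech to keep."""
    ranges = []
    cursor = 0.0  # start of the current keep-range
    for w in words:
        if w["word"] in FILLERS:
            if w["start"] > cursor:
                ranges.append((cursor, w["start"]))
            cursor = w["end"]  # resume after the filler
    if words and words[-1]["end"] > cursor:
        ranges.append((cursor, words[-1]["end"]))
    return ranges

words = [
    {"word": "um", "start": 0.0, "end": 0.4},
    {"word": "hello", "start": 0.4, "end": 0.9},
    {"word": "uh", "start": 0.9, "end": 1.2},
    {"word": "world", "start": 1.2, "end": 1.7},
]
print(keep_ranges(words))  # [(0.4, 0.9), (1.2, 1.7)]
```

&lt;p&gt;Everything between two fillers (or between a filler and the start/end of the recording) becomes one keep-range, which maps directly onto the splits the demo later cuts the video along.&lt;/p&gt;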

&lt;p&gt;The resulting splits could be visualized as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2kuoa72dbiafgj9o0m0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2kuoa72dbiafgj9o0m0.png" alt="An image of the sentence "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The collection of split points is then returned to the &lt;code&gt;Project&lt;/code&gt; class instance, where the original video gets cut and spliced back together.&lt;/p&gt;
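&lt;p&gt;The &lt;code&gt;Timestamps&lt;/code&gt; collection that gets passed around is, in essence, a singly linked list of split points. A minimal stand-in with the same &lt;code&gt;head&lt;/code&gt;/&lt;code&gt;tail&lt;/code&gt;/&lt;code&gt;count&lt;/code&gt;/&lt;code&gt;add()&lt;/code&gt; surface might look like this (an illustration of the shape the code above relies on, not the demo’s actual &lt;code&gt;timestamp.py&lt;/code&gt;):&lt;/p&gt;

```python
# Minimal stand-in for the demo's Timestamps linked list, matching
# how get_splits() and resplice() use it: add() appends a node, and
# head/tail/count expose the list state.
class Timestamp:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.next = None

class Timestamps:
    def __init__(self):
        self.head = None
        self.tail = None
        self.count = 0

    def add(self, start, end):
        node = Timestamp(start, end)
        if self.tail:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node
        self.count += 1

splits = Timestamps()
splits.add(0.4, 0.9)
splits.add(1.2, 1.7)
print(splits.count)       # 2
print(splits.head.start)  # 0.4
```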

&lt;h2&gt;
  
  
  Cutting and reconstituting the original video
&lt;/h2&gt;

&lt;p&gt;The remainder of the work happens entirely in the &lt;code&gt;Project&lt;/code&gt; class, because none of it is specific to the chosen transcription API. Once we get the split points as a collection of &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/transcription/timestamp.py#L9" rel="noopener noreferrer"&gt;&lt;code&gt;Timestamp&lt;/code&gt; nodes&lt;/a&gt;, the project knows what to do with them in the &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/server/project.py#L126" rel="noopener noreferrer"&gt;&lt;code&gt;resplice()&lt;/code&gt; function&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resplice(self, source_video_path: str, splits: Timestamps):
    """Splits and then reconstitutes given video file at provided split points"""
    tmp = get_project_temp_dir_path(self.id)

    clips = []
    current_split = splits.head
    idx = 0

    # The rest of the function below...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I start by getting the temp directory path for the project based on its ID. This is where all the individual clips will be stored.&lt;/p&gt;

&lt;p&gt;I then initialize an array of clips and define a &lt;code&gt;current_split&lt;/code&gt; variable pointing to the head node of the timestamp collection.&lt;/p&gt;

&lt;p&gt;Finally, I define a starting index for our upcoming loop. The next step is to split up the video:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resplice(self, source_video_path: str, splits: Timestamps):
    # ...Previously-covered logic above...

    try:
        while current_split:
            start = current_split.start
            end = current_split.end
            # Overarching safeguard against 0-duration and nonsensical splits
            if start &amp;gt;= end:
                current_split = current_split.next
                continue
            clip_file_path = os.path.join(tmp, f"{str(idx)}.mp4")
            ffmpeg_extract_subclip(source_video_path, start, end,
                                   targetname=clip_file_path)
            clips.append(VideoFileClip(clip_file_path))
            current_split = current_split.next
            idx += 1
    except Exception as e:
        raise Exception('failed to split clips') from e

    # The rest of the function below...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I traverse every split timestamp we have. For each timestamp, I extract a subclip, save it to the project’s temp directory, and append the clip to the previously defined &lt;code&gt;clips&lt;/code&gt; collection. I then move on to the next split point and repeat until we reach the end of the list of timestamps.&lt;/p&gt;

&lt;p&gt;Now that we’ve got all the relevant subclips extracted, it’s time to put them back together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resplice(self, source_video_path: str, splits: Timestamps):
    # ...Previously-covered logic above...
    try:
        final_clip = concatenate_videoclips(clips)
        output_file_path = get_project_output_file_path(self.id)
        final_clip.write_videofile(
            output_file_path,
            codec='libx264',
            audio_codec='aac',
            fps=60,
        )
    except Exception as e:
        raise Exception('failed to reconcatenate clips') from e

    # Remove temp directory for this project
    shutil.rmtree(tmp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I concatenate all the clips stored during the splitting step and write the result to the final output path. Feel free to play around with the &lt;code&gt;codec&lt;/code&gt;, &lt;code&gt;audio_codec&lt;/code&gt;, and &lt;code&gt;fps&lt;/code&gt; parameters above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g8vlqn77zl7aqfcfegm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6g8vlqn77zl7aqfcfegm.gif" alt="Humpty Dumpty on a wall"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I remove the temp directory associated with this project to avoid clutter.&lt;/p&gt;

&lt;p&gt;And we’re done! We now have a shiny new video file with all detected filler words removed.&lt;/p&gt;

&lt;p&gt;The client can now use the routes we covered earlier to &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/client/index.js#L18" rel="noopener noreferrer"&gt;upload a new file&lt;/a&gt;, &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/client/index.js#L53" rel="noopener noreferrer"&gt;fetch Daily recordings&lt;/a&gt; and &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/client/index.js#L87" rel="noopener noreferrer"&gt;start processing them&lt;/a&gt;, and &lt;a href="https://github.com/daily-demos/filler-word-removal/blob/v1.0/client/index.js#L113" rel="noopener noreferrer"&gt;fetch the latest project status&lt;/a&gt; from the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Impressions of Deepgram and Whisper
&lt;/h3&gt;

&lt;p&gt;I found that Whisper was more aggressive than Deepgram in cutting out parts of &lt;em&gt;valid&lt;/em&gt; words that aren’t disfluencies. I’m confident that with some further tweaking, and perhaps a different Whisper sub-model, the output could be refined.&lt;/p&gt;

&lt;p&gt;Deepgram worked better out of the box in terms of not cutting out valid words, but also seemed to skip more filler words in the process. Both models ended up letting some disfluencies through.&lt;/p&gt;

&lt;p&gt;If you plan to use a model out of the box, I’d suggest starting with &lt;em&gt;Deepgram&lt;/em&gt;. If you want more configuration options or to try out models from Hugging Face, play around with Whisper instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plugging in another transcriber
&lt;/h3&gt;

&lt;p&gt;If you want to try another transcription method, you can do so by adding a new module to &lt;code&gt;server/transcription&lt;/code&gt;. Just make sure to implement two functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;transcribe()&lt;/code&gt;, which takes a path to an audio file.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_splits()&lt;/code&gt;, which takes the output from &lt;code&gt;transcribe()&lt;/code&gt; and returns an instance of &lt;code&gt;timestamp.Timestamps()&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With those two in place, the &lt;code&gt;Project&lt;/code&gt; class will know what to do! You can add your new transcriber to the &lt;code&gt;Transcribers&lt;/code&gt; enum and specify it when instantiating your project.&lt;/p&gt;
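&lt;p&gt;As a sketch, a hypothetical new module (say, &lt;code&gt;server/transcription/my_transcriber.py&lt;/code&gt; — the name and bodies below are illustrative stubs, not a real transcription backend) would only need this shape:&lt;/p&gt;

```python
# Hypothetical skeleton for a new transcriber module. The bodies are
# stubs; a real module would call an actual transcription API and
# build a timestamp.Timestamps from its output.

def transcribe(audio_path: str) -> dict:
    """Produce raw transcription output for the given audio file."""
    # Illustrative placeholder output shape
    return {"words": [{"word": "hello", "start": 0.0, "end": 0.5}]}

def get_splits(transcription: dict):
    """Turn transcribe() output into split points.

    The real module would return timestamp.Timestamps(); a list of
    (start, end) pairs stands in for it in this sketch.
    """
    words = transcription["words"]
    return [(words[0]["start"], words[-1]["end"])] if words else []

result = transcribe("some-audio.wav")
print(get_splits(result))  # [(0.0, 0.5)]
```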

&lt;h3&gt;
  
  
  Caveats for production use
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;br&gt;
This demo utilizes the file system to store uploads, temporary clip files, and output. No space monitoring or cleanup is implemented here (aside from removing temporary directories once a project is done). To use this in a production environment, be sure to implement appropriate monitoring measures and use a robust storage solution.&lt;/p&gt;
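&lt;p&gt;As one small illustration of such a safeguard (not part of the demo), a guard built on Python’s standard library can refuse new work when the volume backing the upload directory runs low on free space:&lt;/p&gt;

```python
import shutil

# Illustrative guard: refuse new uploads when the volume holding the
# given directory has less free space than required.
def has_free_space(path: str, required_bytes: int) -> bool:
    return shutil.disk_usage(path).free >= required_bytes

# e.g. require roughly 1 GiB of headroom before accepting an upload
print(has_free_space(".", 1024 ** 3))
```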

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;br&gt;
This demo contains no authentication features. Processed videos are placed into a public folder that anyone can reach, associated with a UUID. Should a malicious actor guess or brute-force a valid project UUID, they can download processed output associated with that ID. For a production use case, access to output files should be gated.&lt;/p&gt;
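&lt;p&gt;One common way to gate downloads (shown here only as a sketch, not something the demo implements) is to issue expiring HMAC-signed tokens, so that knowing a project UUID alone is not enough to fetch its output:&lt;/p&gt;

```python
import hashlib
import hmac
import time

# Illustrative only: in practice, load the secret from configuration.
SECRET = b"server-side-secret"

def sign_download(project_id: str, expires_at: int) -> str:
    """Create a token tied to a project ID and an expiry timestamp."""
    msg = f"{project_id}:{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_download(project_id: str, expires_at: int, token: str) -> bool:
    """Reject expired or forged tokens before serving the output file."""
    if time.time() > expires_at:
        return False
    expected = sign_download(project_id, expires_at)
    return hmac.compare_digest(expected, token)

expires = int(time.time()) + 3600
token = sign_download("some-project-id", expires)
print(verify_download("some-project-id", expires, token))  # True
print(verify_download("another-project", expires, token))  # False
```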

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing powerful post-processing effects with AI has never been easier. Coupled with Daily’s comprehensive REST API, developers can easily fetch their recordings for further refinement with the help of an LLM. Disfluency removal is just one example of what’s possible. Keep an eye out for more demos and blog posts featuring video and audio recording enhancements with the help of AI workflows.&lt;/p&gt;

&lt;p&gt;If you have any questions, don’t hesitate to &lt;a href="https://www.daily.co/company/contact/support/" rel="noopener noreferrer"&gt;reach out to our support team&lt;/a&gt;. Alternatively, hop over to &lt;a href="https://community.daily.co/" rel="noopener noreferrer"&gt;our WebRTC community, peerConnection&lt;/a&gt; to chat about this demo.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webrtc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Manage participants' media tracks in Angular (Part 3)</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Tue, 31 Oct 2023 17:21:09 +0000</pubDate>
      <link>https://forem.com/trydaily/manage-participants-media-tracks-in-angular-part-3-3pdf</link>
      <guid>https://forem.com/trydaily/manage-participants-media-tracks-in-angular-part-3-3pdf</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/jess/"&gt;Jess Mitchell&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this series, we’re building a video call app with a fully customized UI using &lt;a href="https://angular.io/"&gt;Angular&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the first two posts in this series, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.daily.co/blog/build-a-daily-video-call-app-with-angular-and-typescript-part-1/"&gt;Reviewed&lt;/a&gt; the app’s core features, as well as the general code structure and the role of each component. Instructions for setting up the demo app locally are also included.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.daily.co/blog/manage-daily-video-call-state-in-angular-part-2/"&gt;Built the join flow&lt;/a&gt; for users to submit an HTML form to join a Daily room. This included keeping track of the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L42"&gt;&lt;code&gt;participants&lt;/code&gt;&lt;/a&gt; list as people join or leave a call.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we’ll focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Updating the &lt;code&gt;participants&lt;/code&gt; list when participants toggle their device settings during a call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to render a &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component for each participant present in the call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;General recommendations for improving performance when rendering multiple &lt;code&gt;video&lt;/code&gt; elements.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re looking for more information on the app’s chat component, keep an eye out for the next post in this series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewing where we left off
&lt;/h2&gt;

&lt;p&gt;So far in this series, we have an &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt; component, which connects the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; components. It allows the information gathered in the &lt;code&gt;join-form&lt;/code&gt; to be passed along to the &lt;code&gt;app-call&lt;/code&gt; component.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IGOS72P_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9kjl1ucbp93v9tgl7rl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IGOS72P_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9kjl1ucbp93v9tgl7rl5.png" alt="Component structure in the Angular demo app" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Component structure in the Angular demo app&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the HTML form in &lt;code&gt;join-form&lt;/code&gt; is submitted, the &lt;code&gt;app-call&lt;/code&gt; component is shown instead, which immediately creates an instance of Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/daily-iframe-class"&gt;call object&lt;/a&gt; and attaches all of the event handlers related to managing the call and participants.&lt;/p&gt;

&lt;p&gt;In terms of tracking participants as they join and leave the call, &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; uses a class variable – the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L42"&gt;&lt;code&gt;participants&lt;/code&gt; object&lt;/a&gt; – which will get updated throughout the call. The key-value pair in &lt;code&gt;participants&lt;/code&gt; is the participant’s session ID and a &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L102"&gt;participant object&lt;/a&gt;, which contains the participant information we’ll need to update our app UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// call.component.ts

export type Participant = {
  videoTrack?: MediaStreamTrack | undefined;
  audioTrack?: MediaStreamTrack | undefined;
  videoReady: boolean;
  audioReady: boolean;
  userName: string;
  local: boolean;
  id: string;
};

type Participants = {
  [key: string]: Participant;
};

export class CallComponent {
  // … See source code
  participants: Participants = {};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now look at what happens when a participant updates their state after they join a call and are already in the &lt;code&gt;participants&lt;/code&gt; list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating &lt;code&gt;participants&lt;/code&gt; to reflect track updates
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L54"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt;, we previously looked at all the Daily event listeners added to the call object instance after it’s created. For this section, we’ll focus on the &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#track-started"&gt;&lt;code&gt;"track-started"&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#track-stopped"&gt;&lt;code&gt;"track-stopped"&lt;/code&gt;&lt;/a&gt; events, which are emitted when a video or audio track becomes available or unavailable for a participant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Add event listeners for Daily events
this.callObject
  .on("track-started", this.handleTrackStartedStopped)
  .on("track-stopped", this.handleTrackStartedStopped)
  //...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that both of the events attach &lt;code&gt;this.handleTrackStartedStopped()&lt;/code&gt; as the event handler. When emitted, &lt;code&gt;this.handleTrackStartedStopped()&lt;/code&gt; will then invoke &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L107"&gt;&lt;code&gt;this.updateTrack()&lt;/code&gt;&lt;/a&gt; and pass information from the Daily event payload, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handleTrackStartedStopped = (e: DailyEventObjectTrack | undefined): void =&amp;gt; {
  console.log("track started or stopped");
  if (!e || !e.participant || !this.joined) return;
  this.updateTrack(e.participant, e.type);
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal of &lt;code&gt;this.updateTrack()&lt;/code&gt; is two-fold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To update a specific participant in our &lt;code&gt;participants&lt;/code&gt; object when there’s a change that actually affects the UI. In short, this happens if the device is turned on or off, or if the media track itself has changed.&lt;/li&gt;
&lt;li&gt;To only change the specific value of the key that registered an update, which means &lt;em&gt;not&lt;/em&gt; updating or reassigning the whole participant object.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0iVwZnJW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qw3ecwvxegsqalaqiti.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0iVwZnJW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qw3ecwvxegsqalaqiti.gif" alt="Video call participant muting and unmuting their camera and microphone" width="600" height="359"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Muting video/audio causes track changes that impact the app UI (i.e., the state of the icons)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s see how &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L107"&gt;&lt;code&gt;this.updateTrack()&lt;/code&gt;&lt;/a&gt; determines which participant values to update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;updateTrack(participant: DailyParticipant, newTrackType: string): void {
  const existingParticipant = this.participants[participant.session_id];
  const currentParticipantCopy = this.formatParticipantObj(participant);

  if (newTrackType === "video") {
    // If videoReady has changed, the track’s state was toggled on or off
    if (existingParticipant.videoReady !== currentParticipantCopy.videoReady) {
      existingParticipant.videoReady = currentParticipantCopy.videoReady;
    }

    // If the id has changed, a new track is available and should be used
    if (currentParticipantCopy.videoReady &amp;amp;&amp;amp; existingParticipant.videoTrack?.id !== currentParticipantCopy.videoTrack?.id) {
      existingParticipant.videoTrack = currentParticipantCopy.videoTrack;
    }
    return;
  }

  if (newTrackType === "audio") {
    // If audioReady has changed, the track’s state was toggled on or off
    if (existingParticipant.audioReady !== currentParticipantCopy.audioReady) {
      existingParticipant.audioReady = currentParticipantCopy.audioReady;
    }

    // If the id has changed, a new track is available and should be used
    if (currentParticipantCopy.audioReady &amp;amp;&amp;amp; existingParticipant.audioTrack?.id !== currentParticipantCopy.audioTrack?.id) {
      existingParticipant.audioTrack = currentParticipantCopy.audioTrack;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;this.updateTrack()&lt;/code&gt; we compare the old and new values related to audio and video tracks to see if there was a change that affects our app UI – for example, if the track ID is different, we know a new track is &lt;a href="https://www.daily.co/blog/working-with-video-call-participants-media-tracks-for-fun-and-profit/"&gt;available&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;videoReady&lt;/code&gt; and &lt;code&gt;audioReady&lt;/code&gt; values represent whether the video/audio track can be played (i.e., if the participant has the device on). As a reminder, these values are set when the participant object is &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L85"&gt;reformatted&lt;/a&gt; before getting added to &lt;code&gt;participants&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const PLAYABLE_STATE = "playable";
const LOADING_STATE = "loading";
//… See source code

formatParticipantObj(p: DailyParticipant): Participant {
    const { video, audio } = p.tracks;
    const vt = video?.persistentTrack;
    const at = audio?.persistentTrack;
    return {
      videoReady:
        !!(vt &amp;amp;&amp;amp; (video.state === PLAYABLE_STATE || video.state === LOADING_STATE)),
      audioReady:
        !!(at &amp;amp;&amp;amp; (audio.state === PLAYABLE_STATE || audio.state === LOADING_STATE)),
      // … See source code for full object
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the &lt;code&gt;videoReady&lt;/code&gt; or &lt;code&gt;audioReady&lt;/code&gt; value has changed, then we know the participant has toggled their device on or off.&lt;/p&gt;

&lt;p&gt;The participant object is then updated as needed. We intentionally avoid reference changes as much as possible by updating the object instead of reassigning the &lt;code&gt;existingParticipant&lt;/code&gt; variable to a copy of the object. This helps to avoid unnecessary rerenders of the &lt;code&gt;video&lt;/code&gt; and &lt;code&gt;audio&lt;/code&gt; elements found in &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.html"&gt;&lt;code&gt;video-tile&lt;/code&gt; component&lt;/a&gt;, which will be a major factor in building a performant video call app.&lt;/p&gt;

&lt;p&gt;Now that we have our &lt;code&gt;participants&lt;/code&gt; list and can update it as needed, let’s focus on &lt;code&gt;video-tile&lt;/code&gt; to see how we turn &lt;code&gt;participants&lt;/code&gt; into actual video and audio elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;video-tile&lt;/code&gt;: rendering media tracks and device controls
&lt;/h2&gt;

&lt;p&gt;As the &lt;code&gt;participants&lt;/code&gt; variable is updated, &lt;code&gt;app-call&lt;/code&gt; renders a &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.html#L14"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component for each participant in &lt;code&gt;participants&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// call.component.html
&amp;lt;div *ngIf="!error" class="participants-container"&amp;gt;
  &amp;lt;video-tile
    *ngFor="let participant of Object.values(participants)"
    (leaveCallClick)="leaveCall()"
    (toggleVideoClick)="toggleLocalVideo()"
    (toggleAudioClick)="toggleLocalAudio()"
    [joined]="joined"
    [videoReady]="participant.videoReady"
    [audioReady]="participant.audioReady"
    [userName]="participant.userName"
    [local]="participant.local"
    [videoTrack]="participant.videoTrack"
    [audioTrack]="participant.audioTrack"&amp;gt;&amp;lt;/video-tile&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We pass several &lt;a href="https://www.daily.co/blog/manage-daily-video-call-state-in-angular-part-2/#understanding-input-and-output-props-in-angular"&gt;input and output properties&lt;/a&gt;, including every value in the participant object.&lt;/p&gt;

&lt;p&gt;One extremely important detail to remember with Angular is that components only register reference changes for props, which means we can’t just pass each &lt;code&gt;participant&lt;/code&gt; object as a single prop. This is because the &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/Object_reference"&gt;object reference&lt;/a&gt; doesn’t change when the object values get updated in &lt;code&gt;this.updateTrack()&lt;/code&gt;. To ensure prop changes are registered, we instead pass each object value as a separate prop (e.g., &lt;code&gt;videoTrack&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Note: This also creates an interesting balance for our &lt;code&gt;video-tile&lt;/code&gt; component because – as mentioned in the last section – we want to avoid reference changes that trigger unnecessary rerenders as much as possible; however, we also need to ensure relevant state changes reliably trigger UI updates.&lt;/p&gt;
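&lt;p&gt;The reference-equality behavior can be illustrated outside Angular with the shallow check change detection effectively performs on inputs (a simplified model, not Angular’s actual implementation):&lt;/p&gt;

```typescript
// Simplified model of reference-based change detection:
// inputs are compared with ===, so in-place mutation goes unnoticed.
type Participant = { videoReady: boolean };

function inputChanged<T>(previous: T, current: T): boolean {
  return previous !== current;
}

const participant: Participant = { videoReady: false };
const previousRef = participant;

participant.videoReady = true; // in-place mutation
console.log(inputChanged(previousRef, participant)); // false: same reference

const previousValue = false;
console.log(inputChanged(previousValue, participant.videoReady)); // true: primitive changed
```

&lt;p&gt;This is exactly why the template passes &lt;code&gt;videoReady&lt;/code&gt;, &lt;code&gt;videoTrack&lt;/code&gt;, and the other values as individual props: each primitive or track reference changes on its own, so the relevant updates are reliably detected.&lt;/p&gt;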

&lt;p&gt;There are three main aspects to be aware of with the &lt;code&gt;video-tile&lt;/code&gt; component. It needs to render:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;video&lt;/code&gt; and &lt;code&gt;audio&lt;/code&gt; HTML elements for a specific participant. These elements need to update whenever track changes occur for the participant.&lt;/li&gt;
&lt;li&gt;Participant information, including their name and icons to represent if their video and audio tracks are on or off. If the video is turned off, we’ll show a placeholder UI that covers the whole tile.&lt;/li&gt;
&lt;li&gt;If the participant is local (if it’s you!), we’ll show a control panel with buttons to turn the video/audio on or off, as well as a button to leave the call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KK2wLuLR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gc1efmsfp7o55n9pufeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KK2wLuLR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gc1efmsfp7o55n9pufeh.png" alt="Two video tiles: The local participant with the control panel and a remote participant with their video off" width="800" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Two video tiles: The local participant with the control panel and a remote participant with their video off&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s start by seeing how these features relate to the props declared in the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts"&gt;&lt;code&gt;VideoTileComponent&lt;/code&gt;&lt;/a&gt; class definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export class VideoTileComponent {
  @Input() joined: boolean;
  @Input() videoReady: boolean;
  @Input() audioReady: boolean;
  @Input() local: boolean;
  @Input() userName: string;
  @Input() videoTrack: MediaStreamTrack | undefined;
  @Input() audioTrack: MediaStreamTrack | undefined;
  videoStream: MediaStream | undefined;
  audioStream: MediaStream | undefined;

  @Output() leaveCallClick: EventEmitter&amp;lt;null&amp;gt; = new EventEmitter();
  @Output() toggleVideoClick: EventEmitter&amp;lt;null&amp;gt; = new EventEmitter();
  @Output() toggleAudioClick: EventEmitter&amp;lt;null&amp;gt; = new EventEmitter();
 // …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned, there are several input and output properties passed through the &lt;code&gt;video-tile&lt;/code&gt; component. The input properties are the participant values passed from the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call"&gt;&lt;code&gt;app-call&lt;/code&gt; parent component&lt;/a&gt;. The output properties are the events that will be emitted back to &lt;code&gt;app-call&lt;/code&gt;. (These are all the events that will be triggered by the local participant’s control panel buttons.)&lt;/p&gt;

&lt;p&gt;You’ll also notice there are two class variables: &lt;code&gt;videoStream&lt;/code&gt; and &lt;code&gt;audioStream&lt;/code&gt;. Each instance of &lt;code&gt;video-tile&lt;/code&gt; receives the video and audio &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStreamTrack"&gt;tracks&lt;/a&gt; as input props, but that’s not what we’ll use in our video and audio HTML elements. Rather, we’ll create a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStream"&gt;&lt;code&gt;MediaStream&lt;/code&gt;&lt;/a&gt; for each and swap out the track any time it changes. (More on this &lt;a href="https://www.daily.co/blog/manage-participants-media-tracks-in-angular-part-3/#updating-media-streams-during-the-call"&gt;below&lt;/a&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating media streams on init
&lt;/h3&gt;

&lt;p&gt;When the &lt;code&gt;VideoTileComponent&lt;/code&gt; class instance is &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts#L31"&gt;initialized&lt;/a&gt;, we check if playable video and audio tracks exist for the participant and create media streams for them if so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export class VideoTileComponent {
  // See source code for full class definition

  ngOnInit(): void {
    if (this.videoTrack) {
      this.addVideoStream(this.videoTrack);
    }
    if (this.audioTrack) {
      this.addAudioStream(this.audioTrack);
    }
  }

  // … See source code
  addVideoStream(track: MediaStreamTrack) {
    this.videoStream = new MediaStream([track]);
  }

  addAudioStream(track: MediaStreamTrack) {
    this.audioStream = new MediaStream([track]);
  }

  // … See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Updating media streams during the call
&lt;/h3&gt;

&lt;p&gt;When these tracks are updated during the call, we are alerted to the change in the &lt;a href="https://angular.io/api/core/OnChanges"&gt;&lt;code&gt;ngOnChanges&lt;/code&gt; lifecycle method&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngOnChanges(changes: SimpleChanges): void {
  // Note: Only the props that have changed are included in changes.
  // If it's not included, we need to use the existing version of the prop (e.g. this.videoTrack)
  const { videoTrack, audioTrack } = changes;

  // If the video stream hasn't been created and the track can be set, create a new stream.
  if (videoTrack?.currentValue &amp;amp;&amp;amp; !this.videoStream) {
    // Use the new track and create a stream for it.
    this.addVideoStream(videoTrack.currentValue);
  }

  // If the audio stream hasn't been created and the track can be set, create a new stream.
  if (audioTrack?.currentValue &amp;amp;&amp;amp; !this.audioStream) {
    // Use the new track and create a stream for it.
    this.addAudioStream(audioTrack.currentValue);
  }

  // If the video stream exists and a track change occurred, replace the track only.
  if (videoTrack?.currentValue &amp;amp;&amp;amp; this.videoStream) {
    this.updateVideoTrack(videoTrack.previousValue, videoTrack.currentValue);
  }

  // If the audio stream exists and a track change occurred, replace the track only.
  if (audioTrack?.currentValue &amp;amp;&amp;amp; this.audioStream) {
    this.updateAudioTrack(audioTrack.previousValue, audioTrack.currentValue);
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ngOnChanges&lt;/code&gt; will only include the props that have triggered change detection, so if a prop is present we already know its track changed. From there, we just need to decide whether to create a media stream (shown above in &lt;code&gt;this.addVideoStream()&lt;/code&gt; and &lt;code&gt;this.addAudioStream()&lt;/code&gt;) or swap out the track in the existing media stream.&lt;/p&gt;
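&lt;p&gt;To make the branching easier to reason about, here’s the same create-or-swap decision as a standalone function. This is a sketch, not code from the demo: &lt;code&gt;Track&lt;/code&gt; and &lt;code&gt;Stream&lt;/code&gt; are hypothetical stand-ins for &lt;code&gt;MediaStreamTrack&lt;/code&gt; and &lt;code&gt;MediaStream&lt;/code&gt; so the logic can run outside a browser.&lt;/p&gt;

```typescript
// Hypothetical stand-ins for MediaStreamTrack and MediaStream so the
// create-or-swap branching can run outside a browser.
interface Track { id: string }

class Stream {
  tracks: Track[];
  constructor(tracks: Track[]) { this.tracks = tracks.slice(); }
  addTrack(t: Track) { this.tracks.push(t); }
  removeTrack(t: Track) { this.tracks = this.tracks.filter((x) => x.id !== t.id); }
}

// Mirrors the ngOnChanges branching: create a stream the first time a
// track arrives; on later changes, swap the track in the same stream.
function applyTrackChange(
  stream: Stream | undefined,
  previous: Track | undefined,
  current: Track | undefined,
): Stream | undefined {
  if (!current) return stream;               // no new track in this change
  if (!stream) return new Stream([current]); // first track: create a stream
  if (previous) stream.removeTrack(previous);
  stream.addTrack(current);                  // swap: reuse the same stream
  return stream;
}
```

&lt;p&gt;Passing no existing stream creates one; a later call with the same stream and a new track swaps the track in place rather than allocating a new stream.&lt;/p&gt;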

&lt;p&gt;If we’re &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts#L83"&gt;updating an existing stream&lt;/a&gt;, we &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStream/removeTrack"&gt;remove&lt;/a&gt; the old track and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStream/addTrack"&gt;add&lt;/a&gt; the new. (We do not create a new media stream.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;updateVideoTrack(oldTrack: MediaStreamTrack, track: MediaStreamTrack) {
  // This should be true since it's a track change, but check just in case.
  if (oldTrack) {
    this.videoStream?.removeTrack(oldTrack);
  }
  this.videoStream?.addTrack(track);
}

updateAudioTrack(oldTrack: MediaStreamTrack, track: MediaStreamTrack) {
  // This should be true since it's a track change, but check just in case.
  if (oldTrack) {
    this.audioStream?.removeTrack(oldTrack);
  }
  this.audioStream?.addTrack(track);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By doing so, &lt;code&gt;videoStream&lt;/code&gt; and &lt;code&gt;audioStream&lt;/code&gt; will stay up-to-date with any changes to the participant’s media tracks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;video-tile&lt;/code&gt; HTML elements
&lt;/h3&gt;

&lt;p&gt;Now that all the track state management is set up, we can render the HTML for the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.html"&gt;&lt;code&gt;video-tile&lt;/code&gt; component&lt;/a&gt;, starting with the &lt;code&gt;video&lt;/code&gt; and audio elements and participant information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;video
  *ngIf="videoStream"
  autoPlay
  muted
  playsInline
  [srcObject]="videoStream"&amp;gt;&amp;lt;/video&amp;gt;

&amp;lt;div class="video-placeholder" *ngIf="!videoReady"&amp;gt;
  &amp;lt;span&amp;gt;
    &amp;lt;img
      *ngIf="!videoReady &amp;amp;&amp;amp; !local"
      src="../../assets/vid_off.svg"
      alt="Camera is off" /&amp;gt;
  &amp;lt;/span&amp;gt;
&amp;lt;/div&amp;gt;

&amp;lt;audio
  *ngIf="audioStream &amp;amp;&amp;amp; !local"
  autoPlay
  playsInline
  [srcObject]="audioStream"&amp;gt;
  &amp;lt;track kind="captions" /&amp;gt;
&amp;lt;/audio&amp;gt;

&amp;lt;div class="participant-info"&amp;gt;
  &amp;lt;p class="name"&amp;gt;
    {{ userName }}
  &amp;lt;/p&amp;gt;
  &amp;lt;img
    *ngIf="!audioReady &amp;amp;&amp;amp; !local"
    src="../../assets/mic_off.svg"
    alt="Mic is off" /&amp;gt;
  &amp;lt;img
    *ngIf="audioReady &amp;amp;&amp;amp; !local"
    src="../../assets/mic_on.svg"
    alt="Mic is on" /&amp;gt;
&amp;lt;/div&amp;gt;
// …
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there’s a video stream, we render the &lt;code&gt;video&lt;/code&gt; element and pass &lt;code&gt;videoStream&lt;/code&gt; to the &lt;code&gt;srcObject&lt;/code&gt; attribute. If the video isn’t ready to play (e.g., it’s turned off), we render a placeholder display instead. If there’s an audio stream and it’s not the local participant, we render the audio element using the &lt;code&gt;audioStream&lt;/code&gt; variable as the &lt;code&gt;srcObject&lt;/code&gt;. (Note: The local participant (you) doesn’t have an &lt;code&gt;audio&lt;/code&gt; element because you don’t need to hear the playback of your own voice.) And, finally, we display the user’s name and whether they’re muted.&lt;/p&gt;
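&lt;p&gt;These display rules can be summarized as a small function. The sketch below is illustrative only: the field names loosely mirror the component’s state, and the logic restates the &lt;code&gt;*ngIf&lt;/code&gt; conditions above.&lt;/p&gt;

```typescript
// The template's *ngIf conditions, restated as a plain function.
// Field names are illustrative stand-ins for the component's state.
interface TileState {
  hasVideoStream: boolean;
  hasAudioStream: boolean;
  videoReady: boolean;
  audioReady: boolean;
  local: boolean;
}

function tileView(s: TileState) {
  return {
    showVideo: s.hasVideoStream,
    showPlaceholder: !s.videoReady,
    // The local participant never renders an audio element:
    // you don't need to hear your own playback.
    showAudio: s.hasAudioStream && !s.local,
    // Mic state icons are only shown for remote participants.
    micIcon: s.local ? null : s.audioReady ? 'mic_on' : 'mic_off',
  };
}
```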

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HX1U1ymh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1r09hwhv6iutvbalcgfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HX1U1ymh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1r09hwhv6iutvbalcgfq.png" alt="A remote participant’s tile with their name and an icon for their audio state in the top right corner" width="800" height="462"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A remote participant’s tile with their name and an icon for their audio state in the top right corner&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With these elements added, we now have a functional video/audio component that responds whenever the participant toggles their media devices.&lt;/p&gt;

&lt;p&gt;Next, we need to add a control panel for the local participant to turn their media devices on and off.&lt;/p&gt;
&lt;h3&gt;
  
  
  Building a control panel to manage local devices
&lt;/h3&gt;

&lt;p&gt;The control panel elements are also defined in the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.html#L36"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//…
&amp;lt;div *ngIf="local &amp;amp;&amp;amp; this.joined" id="controls"&amp;gt;

  &amp;lt;button class="media-control" (click)="toggleVideo()"&amp;gt;
    &amp;lt;img
      *ngIf="!videoReady"
      src="../../assets/vid_off.svg"
      alt="Turn video on" /&amp;gt;
    &amp;lt;img
      *ngIf="videoReady"
      src="../../assets/vid_on.svg"
      alt="Turn video off" /&amp;gt;
  &amp;lt;/button&amp;gt;

  &amp;lt;button class="media-control" (click)="toggleAudio()"&amp;gt;
    &amp;lt;img *ngIf="!audioReady" src="../../assets/mic_off.svg" alt="Turn mic on" /&amp;gt;
    &amp;lt;img *ngIf="audioReady" src="../../assets/mic_on.svg" alt="Turn mic off" /&amp;gt;
  &amp;lt;/button&amp;gt;

&amp;lt;/div&amp;gt;

&amp;lt;button *ngIf="local" id="leaveCallButton" (click)="handleLeaveCallClick()"&amp;gt;
  &amp;lt;img src="../../assets/leave_call.svg" alt="Leave call" /&amp;gt;
&amp;lt;/button&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three buttons in the control panel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One to toggle the local participant’s video.&lt;/li&gt;
&lt;li&gt;One to toggle the local participant’s audio.&lt;/li&gt;
&lt;li&gt;One to leave the call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zc1_jojp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wpzcxiyjkg1vzibzt22j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zc1_jojp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wpzcxiyjkg1vzibzt22j.png" alt="Control panel buttons to toggle local video/audio or leave the call" width="800" height="146"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Control panel buttons to toggle local video/audio or leave the call&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each button has its associated click handler attached to it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts#L99"&gt;&lt;code&gt;handleToggleVideoClick()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts#L103"&gt;&lt;code&gt;handleToggleAudioClick()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts#L107"&gt;&lt;code&gt;handleLeaveCallClick()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When invoked, each of these will then emit an event that &lt;code&gt;app-call&lt;/code&gt; is already listening for (the output properties mentioned before):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toggleVideo(): void {
  this.toggleVideoClick.emit();
}

toggleAudio(): void {
  this.toggleAudioClick.emit();
}

handleLeaveCallClick(): void {
  this.leaveCallClick.emit();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start with the device-related events (&lt;code&gt;toggleVideoClick&lt;/code&gt; and &lt;code&gt;toggleAudioClick&lt;/code&gt;) and see what happens when the &lt;code&gt;app-call&lt;/code&gt; component receives the event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// in call.component.ts

  toggleLocalVideo() {
    // Event is emitted from VideoTileComponent

    // Confirm they're in the call before updating media
    if (!this.joined) return;
    // Toggle current video state
    const videoReady = this.callObject.localVideo();
    this.callObject.setLocalVideo(!videoReady);
  }

  toggleLocalAudio() {
    // Event is emitted from VideoTileComponent

    // Confirm they're in the call before updating media
    if (!this.joined) return;
    // Toggle current audio state
    const audioReady = this.callObject.localAudio();
    this.callObject.setLocalAudio(!audioReady);
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each one will use the appropriate Daily call instance method (&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/set-local-video"&gt;&lt;code&gt;setLocalVideo()&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/set-local-audio"&gt;&lt;code&gt;setLocalAudio()&lt;/code&gt;&lt;/a&gt;) to toggle the device’s state. Invoking these instance methods changes the device state, which in turn causes either the &lt;code&gt;"track-started"&lt;/code&gt; or &lt;code&gt;"track-stopped"&lt;/code&gt; event to be emitted, depending on the device’s final state.&lt;/p&gt;
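&lt;p&gt;The toggle logic boils down to a read-then-invert pattern. Here’s a sketch of that pattern in isolation; &lt;code&gt;FakeCall&lt;/code&gt; is a hypothetical stand-in modeling only the two local-video methods of the Daily call object, not part of daily-js.&lt;/p&gt;

```typescript
// Minimal stand-in for the slice of the Daily call object used by
// toggleLocalVideo(). FakeCall is hypothetical, not part of daily-js.
interface LocalVideoApi {
  localVideo(): boolean;
  setLocalVideo(on: boolean): void;
}

class FakeCall implements LocalVideoApi {
  private videoOn = true;
  localVideo() { return this.videoOn; }
  setLocalVideo(on: boolean) { this.videoOn = on; }
}

// Same read-then-invert pattern as toggleLocalVideo() above:
// read the current state, then set its negation.
function toggleVideo(call: LocalVideoApi): void {
  call.setLocalVideo(!call.localVideo());
}
```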

&lt;p&gt;The other event emitter – &lt;code&gt;this.leaveCallClick&lt;/code&gt; – will invoke &lt;code&gt;app-call&lt;/code&gt;’s &lt;code&gt;this.leaveCall()&lt;/code&gt; method, which will then invoke Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/leave"&gt;&lt;code&gt;leave()&lt;/code&gt;&lt;/a&gt; instance method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// in call.component.ts

leaveCall(): void {
  this.error = "";
  if (!this.callObject) return;

  // Leave call
  this.callObject.leave();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling Daily’s &lt;code&gt;leave()&lt;/code&gt; method will result in the &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-left"&gt;&lt;code&gt;”participant-left”&lt;/code&gt;&lt;/a&gt; event being emitted for remote participants and &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#left-meeting"&gt;&lt;code&gt;”left-meeting”&lt;/code&gt;&lt;/a&gt; being emitted if it’s the local participant leaving a call.&lt;/p&gt;

&lt;p&gt;How these Daily events are handled has already been covered in this or the previous post, so we’ve come full circle!&lt;/p&gt;

&lt;p&gt;With that, we have a functional call panel that allows each participant to update their devices, as well as to leave the call to reset the app’s state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LPeDbbVh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tuailxa7u13h5k9mxf8s.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LPeDbbVh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tuailxa7u13h5k9mxf8s.gif" alt="Joining a call, toggling media devices, and leaving a call" width="600" height="291"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Joining a call, toggling media devices, and leaving a call&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In today’s post, we learned how to update tracks for call participants in an &lt;a href="https://angular.io/"&gt;Angular&lt;/a&gt; app, as well as render video tiles for them, and toggle the state of their devices.&lt;/p&gt;

&lt;p&gt;In our next post, we’ll look at how to add a chat component to the call so participants can message each other. (Spoiler: the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/chat"&gt;chat feature&lt;/a&gt; is already in the source code.)&lt;/p&gt;

&lt;p&gt;If you have any questions or thoughts about implementing your Daily-powered video app with Angular, &lt;a href="https://www.daily.co/company/contact/support/"&gt;reach out to our support team&lt;/a&gt; or head over to &lt;a href="https://community.daily.co/"&gt;our WebRTC community, peerConnection&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>angular</category>
      <category>tutorial</category>
      <category>webrtc</category>
      <category>programming</category>
    </item>
    <item>
      <title>Tracking connection quality with Daily's new connectivity test methods</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Thu, 26 Oct 2023 17:18:20 +0000</pubDate>
      <link>https://forem.com/trydaily/tracking-connection-quality-with-dailys-new-connectivity-test-methods-15hg</link>
      <guid>https://forem.com/trydaily/tracking-connection-quality-with-dailys-new-connectivity-test-methods-15hg</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/liza/" rel="noopener noreferrer"&gt;Liza Shulyayeva&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We’ve recently released three new &lt;a href="https://docs.daily.co/reference/daily-js/daily-iframe-class" rel="noopener noreferrer"&gt;call object&lt;/a&gt; instance methods to test connection and network quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-connection-quality" rel="noopener noreferrer"&gt;&lt;code&gt;testConnectionQuality()&lt;/code&gt;&lt;/a&gt; - Assesses the quality of a WebRTC connection using a given video track&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-network-connectivity" rel="noopener noreferrer"&gt;&lt;code&gt;testNetworkConnectivity()&lt;/code&gt;&lt;/a&gt; - Checks whether a stable connection can be established with Daily’s TURN server&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-websocket-connectivity" rel="noopener noreferrer"&gt;&lt;code&gt;testWebsocketConnectivity()&lt;/code&gt;&lt;/a&gt; - Determines whether your internet connection and network conditions support traffic over WebSockets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one can also be aborted mid-test with the associated method:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/stop-test-connection-quality#main" rel="noopener noreferrer"&gt;stopTestConnectionQuality()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/abort-test-network-connectivity#main" rel="noopener noreferrer"&gt;abortTestNetworkConnectivity()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/abort-test-websocket-connectivity#main" rel="noopener noreferrer"&gt;abortTestWebsocketConnectivity()&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I’ll go through a small code sample showing you how to use each of these methods.&lt;/p&gt;

&lt;p&gt;But first, let’s go through some common use cases for these connection test features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video call connection test use cases
&lt;/h2&gt;

&lt;p&gt;We envision three common use cases for our new test methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Facilitating local troubleshooting&lt;/li&gt;
&lt;li&gt;Tracking quality metrics internally&lt;/li&gt;
&lt;li&gt;Informing ideal send setting configuration in the call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s cover each of these in a little more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Facilitating local troubleshooting
&lt;/h3&gt;

&lt;p&gt;With the new connection test methods, developers can implement a pre-join UI in which the user’s connection quality and connectivity can be gauged. If shortcomings are detected, the user can be prompted with relevant suggestions to improve their experience.&lt;/p&gt;

&lt;p&gt;For example, if &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-network-connectivity#main" rel="noopener noreferrer"&gt;&lt;code&gt;testNetworkConnectivity()&lt;/code&gt;&lt;/a&gt; indicates a problem connecting to Daily’s TURN servers, the user can be prompted to check their firewall setup (especially if they’re on a corporate network).&lt;/p&gt;

&lt;p&gt;Likewise, if users are experiencing degraded connection quality in the call itself, showing an indicator of connection quality with &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-connection-quality#main" rel="noopener noreferrer"&gt;&lt;code&gt;testConnectionQuality()&lt;/code&gt;&lt;/a&gt; can help participants narrow down which of them is experiencing connection difficulties at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tracking quality metrics internally
&lt;/h3&gt;

&lt;p&gt;Helping users troubleshoot their local setups is great, but tracking performance metrics or flagging spikes in problematic sessions can help detect potential issues in advance. You can forward output from the test methods above to your existing metrics and telemetry pipeline and respond to any suspicious spikes in network problems accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Informing ideal send setting configuration
&lt;/h3&gt;

&lt;p&gt;Daily provides reasonable media input and output configuration by default, but some highly customized implementations may benefit from more granular settings. Daily enables developers to fine-tune their &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-send-settings#sendsettings" rel="noopener noreferrer"&gt;send settings&lt;/a&gt; by defining their own &lt;a href="https://docs.daily.co/guides/scaling-calls/best-practices-to-scale-large-experiences#simulcast-layer-control" rel="noopener noreferrer"&gt;simulcast layers&lt;/a&gt; or picking from a range of convenient &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-send-settings#presets" rel="noopener noreferrer"&gt;presets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By polling connection quality at regular intervals during the call, your application can respond to network fluctuations by updating the simulcast preset being used via our &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-send-settings#sendsettings" rel="noopener noreferrer"&gt;&lt;code&gt;updateSendSettings()&lt;/code&gt;&lt;/a&gt; call instance method.&lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;You can also use our network quality events to inform send settings.&lt;/em&gt;&lt;/p&gt;
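&lt;p&gt;For instance, a poll loop could map each verdict from &lt;code&gt;testConnectionQuality()&lt;/code&gt; to a preset and pass it to &lt;code&gt;updateSendSettings()&lt;/code&gt;. In the sketch below, the verdict strings match the test’s documented results, but the preset names are placeholders, not actual Daily presets; check the send settings reference for the real options.&lt;/p&gt;

```typescript
// Illustrative only: map a connection-quality verdict to a send-settings
// preset name. The verdict strings match testConnectionQuality() results;
// the preset names are placeholders, not taken from the Daily docs.
type Verdict = 'good' | 'warning' | 'bad' | 'failed' | 'aborted';

function presetFor(verdict: Verdict): string | null {
  switch (verdict) {
    case 'good':
      return 'high-quality-preset';  // placeholder name
    case 'warning':
      return 'balanced-preset';      // placeholder name
    case 'bad':
      return 'low-bandwidth-preset'; // placeholder name
    default:
      return null; // 'failed' / 'aborted': leave settings unchanged
  }
}
```

&lt;p&gt;A poll loop would only call &lt;code&gt;updateSendSettings()&lt;/code&gt; when &lt;code&gt;presetFor&lt;/code&gt; returns a non-null value, so failed or aborted tests never downgrade the call.&lt;/p&gt;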

&lt;p&gt;Now that we have some primary use cases established, let’s take a look at some code showing the usage of these methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we’re building
&lt;/h2&gt;

&lt;p&gt;This code sample shows a pre-call flow in which a local participant can test their connection quality, TURN server connectivity, and WebSockets connectivity. You, the local participant, will see your own camera feed when you start the tests, but you will not have joined a Daily room yet. When you click “Run Tests”, the three connectivity test methods mentioned above will be invoked and the results will be shown in the app UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwhwaxvw5kyugvx9bvfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwhwaxvw5kyugvx9bvfi.png" alt="Network and connectivity tests running with the user's local video track being shown"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the sample locally
&lt;/h2&gt;

&lt;p&gt;You can find the sample &lt;a href="https://github.com/daily-demos/daily-samples-js/tree/main/samples/client-sdk/connectivity-tests" rel="noopener noreferrer"&gt;in our JavaScript samples repository&lt;/a&gt;. You do &lt;em&gt;not&lt;/em&gt; need a Daily account or a Daily room to run this locally. Perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository: &lt;code&gt;git clone https://github.com/daily-demos/daily-samples-js.git&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Navigate to the relevant sample directory: &lt;code&gt;cd daily-samples-js/samples/client-sdk/connectivity-tests&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check out the relevant tag: &lt;code&gt;git checkout v1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;npm i &amp;amp;&amp;amp; npm run dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Navigate to the port shown in your terminal in your preferred web browser. This will likely be &lt;code&gt;localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have the demo app open in your browser, click the “Run Tests” button:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcxw00q0bfcb8cofmvgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcxw00q0bfcb8cofmvgb.png" alt="Run Tests button"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should now see the in-progress test labels and your local camera feed appear. (Be sure to grant camera permissions in your browser if prompted.)&lt;/p&gt;

&lt;p&gt;Now that you’ve got the code running locally, let’s walk through how I implemented usage of the new connectivity test methods. I’ve isolated all the logic relevant to this in the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js" rel="noopener noreferrer"&gt;&lt;code&gt;index.js&lt;/code&gt;&lt;/a&gt; file, so we’re going to focus on that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the test button and Daily call object
&lt;/h2&gt;

&lt;p&gt;When the DOM loads, the first thing I do is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js#L20" rel="noopener noreferrer"&gt;set up the “Run Tests”&lt;/a&gt; button:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window.addEventListener('DOMContentLoaded', () =&amp;gt; {
  // Set up the primary test button and provide the handler to run when it is clicked.
  setupTestBtn(() =&amp;gt; {
    const callObject = setupCallObject();
    setupLeaveBtn(() =&amp;gt; {
      leave(callObject);
    });
    startTests(callObject);
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the test button is clicked, three things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A new Daily call object is instantiated (&lt;code&gt;setupCallObject()&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Leave&lt;/em&gt; button is set up (&lt;code&gt;setupLeaveBtn()&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The tests are run (&lt;code&gt;startTests()&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s take a closer look at the most important part: call object setup. I’ve left some guiding comments in-line, but I’ll also provide an overview of what’s happening right underneath the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function setupCallObject() {
  // If call instance already exists, return it.
  let callObject = window.DailyIframe.getCallInstance();
  if (callObject) return callObject;

  // If call instance does not yet exist, create it.
  callObject = window.DailyIframe.createCallObject();
  const participantParentEle = document.getElementById('participants');

  // Set up relevant event handlers
  callObject
    .on('track-started', (e) =&amp;gt; {
      enableControls();
      showTestResults();
      const p = e.participant;
      if (!p.local) return;

      const newTrack = e.track;

      addParticipantEle(p, participantParentEle);
      updateMediaTrack(p.session_id, newTrack);
      if (e.type === 'video') {
        runAllTests(callObject, newTrack);
      }
    })
    .on('error', (e) =&amp;gt; {
      // If an unrecoverable error is received,
      // allow user to try to re-join the call.
      console.error('An unrecoverable error occurred: ', e);
      enableTestBtn();
    });

  return callObject;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below are the highlights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I check if a &lt;em&gt;call instance&lt;/em&gt; (which can be a call object or a &lt;a href="https://www.daily.co/products/prebuilt-video-call-app/" rel="noopener noreferrer"&gt;Daily Prebuilt&lt;/a&gt; call frame) already exists. If so, I return it as there’s no need to create a new one.&lt;/li&gt;
&lt;li&gt;If a call instance doesn’t already exist, I create one using Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-call-object" rel="noopener noreferrer"&gt;&lt;code&gt;createCallObject()&lt;/code&gt;&lt;/a&gt; factory method.&lt;/li&gt;
&lt;li&gt;I then set up handlers for the two Daily events this demo uses: &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#track-started" rel="noopener noreferrer"&gt;&lt;code&gt;"track-started"&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#error" rel="noopener noreferrer"&gt;&lt;code&gt;"error"&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Finally, I return the new call object instance to the caller.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;"track-started"&lt;/code&gt; event is where all the magic happens. Here, I add a DOM element for the participant whose track has just started, and update it with the available tracks. For now, this will only be the local participant since we’re not actually joining any Daily video call room here.&lt;/p&gt;

&lt;p&gt;Then, I check if the started track is a video track. Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-connection-quality#main" rel="noopener noreferrer"&gt;connection quality test&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-network-connectivity#main" rel="noopener noreferrer"&gt;network connectivity test&lt;/a&gt; both take a video track as a parameter. So as soon as we get a video track, we’re ready to run the tests.&lt;/p&gt;

&lt;p&gt;However, the local participant’s video won’t just start without us explicitly asking them to start their camera. Let’s take a look at where that’s done once the call object instance is created and set up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting the camera to run video call connection tests
&lt;/h2&gt;

&lt;p&gt;Now that the call object instance is set up with all the handlers we’ll need, it’s time to actually start the tests. This is done in the &lt;code&gt;"DOMContentLoaded"&lt;/code&gt; event handler we covered above, by invoking the demo’s &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js#L76" rel="noopener noreferrer"&gt;&lt;code&gt;startTests()&lt;/code&gt;&lt;/a&gt; function and passing the new call object instance to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function startTests(callObject) {
  disableTestBtn();

  // Reset results, in case this is a re-run.
  resetTestResults();

  // If local participant already exists and has a track,
  // just run the tests.
  const localParticipant = callObject.participants().local;
  const videoTrack = localParticipant?.tracks?.video?.persistentTrack;
  if (videoTrack) {
    runAllTests(callObject, videoTrack);
    return;
  }

  // If there is not yet a local participant or a video track,
  // start the camera.
  try {
    callObject.startCamera();
  } catch (e) {
    console.error('Failed to start camera', e);
    enableTestBtn();
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what’s happening above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The “Run Tests” button is disabled.&lt;/li&gt;
&lt;li&gt;The results displayed in the DOM are reset.&lt;/li&gt;
&lt;li&gt;I check if a local participant already exists and has a video track. If so, I go ahead and call &lt;code&gt;runAllTests()&lt;/code&gt; with the participant’s existing video track.&lt;/li&gt;
&lt;li&gt;If a local participant does not yet exist, or doesn’t have a video track, I call Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/start-camera#main" rel="noopener noreferrer"&gt;&lt;code&gt;startCamera()&lt;/code&gt;&lt;/a&gt; call object instance method. This will prompt the user for permission to start their webcam, which in turn triggers the &lt;code&gt;"track-started"&lt;/code&gt; event I covered above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With all the setup done and handlers hooked up, we’re ready to look at the &lt;code&gt;runAllTests()&lt;/code&gt; function:&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the video call connection tests
&lt;/h2&gt;

&lt;p&gt;I run all three of Daily’s connection tests in parallel as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function runAllTests(callObject, videoTrack) {
  Promise.all([
    testConnectionQuality(callObject, videoTrack),
    testNetworkConnectivity(callObject, videoTrack),
    testWebSocketConnectivity(callObject),
  ]).then(() =&amp;gt; {
    enableTestBtn();
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once all the tests are done, I re-enable the “Run Tests” button to let the local user run the tests again. Let’s take a look at each of the tests above.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;testConnectionQuality()&lt;/code&gt;: Test quality of the WebRTC connection
&lt;/h3&gt;

&lt;p&gt;This Daily connection quality test method returns a string reflecting the participant’s current connection quality. The method takes a video track and a &lt;em&gt;duration&lt;/em&gt; value (how long you want the test to run for, in seconds). For the purposes of this example, I’ve set the duration to 5 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function testConnectionQuality(callObject, videoTrack) {
  return callObject
    .testConnectionQuality({
      videoTrack,
      duration: 5, // In seconds
    })
    .then((res) =&amp;gt; {
      const testResult = res.result;
      let resultMsg = '';
      switch (testResult) {
        case 'aborted':
          resultMsg = 'Test aborted before any data was gathered.';
          break;
        case 'failed':
          resultMsg = 'Unable to run test.';
          break;
        case 'bad':
          resultMsg =
            'Your internet connection is bad. Try a different network.';
          break;
        case 'warning':
          resultMsg = 'Video and audio might be choppy.';
          break;
        case 'good':
          resultMsg = 'Your internet connection is good.';
          break;
        default:
          resultMsg = `Unexpected connection test result: ${testResult}`;
      }
      updateConnectionTestResult(resultMsg);
    })
    .catch((e) =&amp;gt; {
      console.error('Failed to test connection quality:', e);
    });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I invoke Daily’s &lt;code&gt;testConnectionQuality()&lt;/code&gt; call object instance method. Once a result is returned, I interpret it via a switch statement and update the DOM with a relevant result message. The result object actually contains more data, like packet loss and max round trip time, which I’m not using here in order to keep the sample code simple. Refer to &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/test-connection-quality#main" rel="noopener noreferrer"&gt;our &lt;code&gt;testConnectionQuality()&lt;/code&gt; documentation&lt;/a&gt; to learn more about the data being returned.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;testNetworkConnectivity()&lt;/code&gt;: Check connectivity to Daily’s TURN server
&lt;/h3&gt;

&lt;p&gt;This test method returns a string with an &lt;code&gt;"aborted"&lt;/code&gt;, &lt;code&gt;"passed"&lt;/code&gt;, or &lt;code&gt;"failed"&lt;/code&gt; value. Like the connection quality test above, I interpret the result in a switch statement and update the DOM with a user-friendly message. Since the logic is very similar to the test we looked at above, I won’t paste the code here, but you can &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js#L164" rel="noopener noreferrer"&gt;check it out on GitHub&lt;/a&gt;.&lt;/p&gt;
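&lt;p&gt;For reference, the handling looks roughly like the following. This is a sketch rather than the demo’s exact code: the result messages and the &lt;code&gt;updateNetworkTestResult()&lt;/code&gt; DOM helper are illustrative stand-ins.&lt;/p&gt;

```javascript
// Sketch of interpreting Daily's testNetworkConnectivity() result.
// The message strings and the updateNetworkTestResult() helper are
// illustrative stand-ins, not part of daily-js or the demo.
function interpretNetworkTestResult(result) {
  switch (result) {
    case 'passed':
      return 'Able to establish a network connection to Daily.';
    case 'failed':
      return 'Unable to connect. Try a different network.';
    case 'aborted':
      return 'Test aborted before any data was gathered.';
    default:
      return `Unexpected network test result: ${result}`;
  }
}

function testNetworkConnectivity(callObject, videoTrack) {
  return callObject
    .testNetworkConnectivity(videoTrack)
    .then((res) => {
      updateNetworkTestResult(interpretNetworkTestResult(res.result));
    })
    .catch((e) => {
      console.error('Failed to test network connectivity:', e);
    });
}
```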

&lt;h3&gt;
  
  
  &lt;code&gt;testWebsocketConnectivity()&lt;/code&gt;: Check WebSocket traffic support
&lt;/h3&gt;

&lt;p&gt;This Daily instance method returns an object with a result property of four potential string values: &lt;code&gt;"passed"&lt;/code&gt;, &lt;code&gt;"failed"&lt;/code&gt;, &lt;code&gt;"warning"&lt;/code&gt;, or &lt;code&gt;"aborted"&lt;/code&gt;. The object also contains an array of passed, failed, or aborted regions.&lt;/p&gt;

&lt;p&gt;You can check out my &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js#L197" rel="noopener noreferrer"&gt;implementation of this test on GitHub&lt;/a&gt;, since its handling is once again very similar to the two tests above.&lt;/p&gt;
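&lt;p&gt;One detail worth sketching is how the region data can be folded into the user-facing message. The message strings below are my own, and the &lt;code&gt;failedRegions&lt;/code&gt; field name is an assumption; refer to the linked GitHub code for the demo’s actual handling.&lt;/p&gt;

```javascript
// Sketch of summarizing the testWebsocketConnectivity() result object.
// Message strings are illustrative, and the failedRegions field name is
// an assumption; see the demo's GitHub source for the real handling.
function summarizeWebSocketResult(res) {
  const messages = {
    passed: 'WebSocket connections established in all regions.',
    failed: 'Unable to establish WebSocket connections.',
    warning: 'WebSocket connections failed in some regions.',
    aborted: 'WebSocket connectivity test aborted.',
  };
  const msg = messages[res.result] || `Unexpected test result: ${res.result}`;
  const failed = res.failedRegions && res.failedRegions.length
    ? ` Failed regions: ${res.failedRegions.join(', ')}.`
    : '';
  return msg + failed;
}
```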

&lt;p&gt;Now that we’ve looked at running the tests and consuming the results, let’s take a look at the final piece: ending tests early.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aborting Daily’s connection tests
&lt;/h2&gt;

&lt;p&gt;If a user presses “Leave” while tests are running, I want to abort the remaining tests. I do that in my &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/v1.0/samples/client-sdk/connectivity-tests/src/index.js#L235" rel="noopener noreferrer"&gt;handler for the “Leave” button&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function leave(callObject) {
  callObject.setLocalVideo(false);

  // Abort any tests that may currently be running.
  callObject.abortTestNetworkConnectivity();
  callObject.abortTestWebsocketConnectivity();
  callObject.stopTestConnectionQuality();

  // Only enable test button again once call object is destroyed
  callObject.destroy().then(() =&amp;gt; {
    enableTestBtn();
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, I turn off the user’s local video. I then stop all three tests (if a test isn’t currently running, this will be a no-op) and destroy the call object. Once the call object is destroyed, I re-enable the “Run Tests” button.&lt;/p&gt;

&lt;p&gt;With that, our pre-call connection test implementation is complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Do you have any questions about using Daily’s new connection and connectivity tests? &lt;a href="https://www.daily.co/company/contact/support/" rel="noopener noreferrer"&gt;Reach out to our support team&lt;/a&gt; or head over to &lt;a href="https://community.daily.co/" rel="noopener noreferrer"&gt;peerConnection&lt;/a&gt;, our WebRTC community.&lt;/p&gt;

</description>
      <category>webrtc</category>
      <category>api</category>
      <category>tutorial</category>
      <category>testing</category>
    </item>
    <item>
      <title>Share admin privileges with participants during a real-time video call</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Tue, 24 Oct 2023 18:31:49 +0000</pubDate>
      <link>https://forem.com/trydaily/share-admin-privileges-with-participants-during-a-real-time-video-call-3fob</link>
      <guid>https://forem.com/trydaily/share-admin-privileges-with-participants-during-a-real-time-video-call-3fob</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/jess/"&gt;Jess Mitchell&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Having someone to oversee a video call meeting is important for numerous reasons, including being able to manage which participants are allowed in the call. With &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;, participants can optionally join as meeting owners, which gives them special privileges like being able to remove participants from a call.&lt;/p&gt;

&lt;p&gt;In some cases, it can be useful for meeting owners to share their administrator responsibilities with other participants, too.&lt;/p&gt;

&lt;p&gt;In today’s post, we’ll look at sample code that demonstrates how to let meeting owners promote participants to admins, as well as how to let admins or owners remove other participants.&lt;/p&gt;

&lt;h2&gt;
  
  
  Daily meeting owners vs. admins
&lt;/h2&gt;

&lt;p&gt;To start, let’s look at how meeting owners and admins differ.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meeting owner privileges
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://docs.daily.co/guides/configurations-and-settings/controlling-who-joins-a-meeting#meeting-owner-privileges"&gt;meeting owner&lt;/a&gt; is a participant with the most privileges related to managing room access.&lt;/p&gt;

&lt;p&gt;Participants can join as a meeting owner by using a meeting owner &lt;a href="https://www.daily.co/blog/comparing-domain-room-and-meeting-token-rest-api-configurations-for-daily-video-calls/"&gt;token&lt;/a&gt;: a token that has the &lt;a href="https://docs.daily.co/reference/rest-api/meeting-tokens/config#is_owner"&gt;&lt;code&gt;is_owner&lt;/code&gt;&lt;/a&gt; property set to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
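&lt;p&gt;Concretely, an owner token is minted with a POST to Daily’s &lt;code&gt;/meeting-tokens&lt;/code&gt; REST endpoint. The sketch below only builds the request; the &lt;code&gt;buildOwnerTokenRequest()&lt;/code&gt; helper and room name are hypothetical, and the actual fetch must run server-side so your API key is never exposed to clients.&lt;/p&gt;

```javascript
// Sketch: building a request that creates an owner meeting token via
// Daily's REST API. buildOwnerTokenRequest() is an illustrative helper;
// run the fetch server-side so the API key stays secret.
function buildOwnerTokenRequest(roomName, apiKey) {
  return {
    url: 'https://api.daily.co/v1/meeting-tokens',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        properties: { room_name: roomName, is_owner: true },
      }),
    },
  };
}

// Usage (server-side):
// const { url, options } = buildOwnerTokenRequest('my-room', process.env.DAILY_API_KEY);
// const { token } = await (await fetch(url, options)).json();
```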

&lt;p&gt;Being a meeting owner allows the participant to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Share media (audio/video/screen) in &lt;a href="https://www.daily.co/blog/daily-prebuilt-broadcast-call-deep-dive/"&gt;owner-only broadcast mode&lt;/a&gt;, if these features are enabled in the Daily room.&lt;/li&gt;
&lt;li&gt;Start or stop a &lt;a href="https://www.daily.co/blog/live-stream-daily-calls-with-only-3-second-latency/"&gt;live stream&lt;/a&gt;, if live streaming is enabled in the Daily room.&lt;/li&gt;
&lt;li&gt;Start or stop call &lt;a href="https://www.daily.co/blog/add-live-transcription-to-a-daily-call-with-our-newest-api/"&gt;transcription&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Allow participants to join a private room by responding to &lt;a href="https://docs.daily.co/guides/configurations-and-settings/controlling-who-joins-a-meeting#knocking"&gt;knocking&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant"&gt;Update&lt;/a&gt; other participants, such as adding or revoking admin permissions and muting their devices.&lt;/li&gt;
&lt;li&gt;(See an example of creating a token using a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch"&gt;fetch request&lt;/a&gt; in today’s &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/daily.js#L24"&gt;sample app&lt;/a&gt;.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Admins
&lt;/h3&gt;

&lt;p&gt;Meeting admins are similar to meeting owners, with a few key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A meeting &lt;em&gt;owner&lt;/em&gt; has to join the call with an owner meeting token. An admin can join with an admin token or be promoted to an admin mid-call.&lt;/li&gt;
&lt;li&gt;Unlike an owner, an admin can be demoted and lose admin privileges.&lt;/li&gt;
&lt;li&gt;An owner has all relevant privileges within the call, whereas an admin can be given more granular permissions. Depending on the specific &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant#permissions"&gt;permissions&lt;/a&gt; provided, admins can potentially do any of the actions listed above for meeting owners; however, they cannot remove a meeting owner from a call or remove their meeting owner privileges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wgYvHZqW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1f6dj8liqi2tddckx69z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wgYvHZqW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1f6dj8liqi2tddckx69z.png" alt="Table showing difference between owners and admins" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, onto the code!&lt;/p&gt;

&lt;h2&gt;
  
  
  Today’s goals
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we’ll look at sample code for how to build an admin panel to let meeting owners upgrade regular participants to admins or remove them from the call. We’ll use a &lt;a href="https://github.com/daily-demos/daily-samples-js/tree/main/samples/daily-prebuilt/permissions-can-admin"&gt;sample app&lt;/a&gt; built with &lt;a href="https://nextjs.org/"&gt;Next.js&lt;/a&gt; and &lt;a href="https://www.daily.co/products/prebuilt-video-call-app/"&gt;Daily Prebuilt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nu6xRE8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yv2yf4k7crz6fsev0hjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nu6xRE8Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yv2yf4k7crz6fsev0hjo.png" alt="The meeting owner’s view of the demo, with buttons to remove a participant or make them an admin." width="800" height="445"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The meeting owner’s view of the demo, with buttons to remove a participant or make them an admin.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This tutorial will focus on two features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A button that converts a participant to an admin.&lt;/li&gt;
&lt;li&gt;A button that removes the participant (or admin) from the call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We won’t cover some of the more general Daily-related code, like rendering Daily Prebuilt in your app, but we’ll include some related blog posts at the end!&lt;/p&gt;

&lt;p&gt;To test the demo app yourself, follow the instructions included in its &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/README.md"&gt;README&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uvzjuM1r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhbnf18qkrkgpte3yj52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uvzjuM1r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhbnf18qkrkgpte3yj52.png" alt="The  raw `Home` endraw  component’s default view, which renders the  raw `Header` endraw  and  raw `JoinForm` endraw  components." width="800" height="289"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The &lt;code&gt;Home&lt;/code&gt; component’s default view, which renders the &lt;code&gt;Header&lt;/code&gt; and &lt;code&gt;JoinForm&lt;/code&gt; components.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When the main route of our app (e.g., &lt;code&gt;localhost:3000/&lt;/code&gt;) is visited, the component in the &lt;a href="https://nextjs.org/docs/app/api-reference/file-conventions/page"&gt;&lt;code&gt;page.js&lt;/code&gt; file&lt;/a&gt; found at the top-level of the app’s &lt;code&gt;/app&lt;/code&gt; directory is rendered – in this case, the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/main/samples/daily-prebuilt/permissions-can-admin/src/app/page.js"&gt;&lt;code&gt;Home&lt;/code&gt; component&lt;/a&gt;, which parents all other components in the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KDb6CuBs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wuyxp6hu4yw21w2xpbra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KDb6CuBs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wuyxp6hu4yw21w2xpbra.png" alt="Illustration depicting  raw `Home` endraw ,  raw `Header` endraw ,  raw `DailyContainer` endraw ,  raw `JoinForm` endraw , Admin Panel, and  raw `containerRef` endraw  components" width="800" height="564"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Component structure for this demo app.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;DailyContainer&lt;/code&gt; component is the home of our video call feature and has a number of conditionally-rendered components/elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A form (&lt;code&gt;JoinForm&lt;/code&gt;) to join a call.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;AdminPanel&lt;/code&gt; component to upgrade regular participants to admins or remove them from the call.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;containerRef&lt;/code&gt;, a &lt;code&gt;div&lt;/code&gt; which will contain the Daily Prebuilt UI once the &lt;code&gt;JoinForm&lt;/code&gt; is submitted and the call is created.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R1kE2-yu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8om8i1xa8secfsxi79tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R1kE2-yu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8om8i1xa8secfsxi79tb.png" alt="The admin panel is highlighted in red and the Daily Prebuilt container is highlighted in blue." width="800" height="409"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The admin panel is highlighted in red and the Daily Prebuilt container is highlighted in blue.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating a Daily room and joining the call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LHNGLHMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zfmqk5jzm3316wmpecq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LHNGLHMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zfmqk5jzm3316wmpecq.gif" alt="Submitting the  raw `JoinForm` endraw  form to create and join a call." width="600" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Submitting the &lt;code&gt;JoinForm&lt;/code&gt; form to create and join a call.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s start by quickly familiarizing ourselves with what happens after the &lt;code&gt;JoinForm&lt;/code&gt; is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L119"&gt;submitted&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;JoinForm&lt;/code&gt; itself is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L243"&gt;conditionally rendered&lt;/a&gt; and shown by default. Once it’s submitted, it’s destroyed and the call-related components are displayed instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nv6dHtcw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5pplvof0lx3sibedju7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nv6dHtcw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j5pplvof0lx3sibedju7.png" alt="Default view of the  raw `JoinForm` endraw ." width="800" height="289"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Default view of the &lt;code&gt;JoinForm&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are also two conditions to be aware of when the &lt;code&gt;JoinForm&lt;/code&gt; is rendered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;(Default) The user is creating a new Daily room to join and will join as a meeting owner with an owner meeting token. They can then share a link to that specific room.&lt;/li&gt;
&lt;li&gt;If the participant is using a shared link with a &lt;code&gt;url&lt;/code&gt; query param (e.g., &lt;code&gt;http://localhost:3000/?url=https://domain.daily.co/[room-name]&lt;/code&gt;), a “Daily room URL” form input will be rendered to indicate which room is being joined and no new room or meeting token will be created for them.&lt;/li&gt;
&lt;/ol&gt;
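&lt;p&gt;The &lt;code&gt;url&lt;/code&gt; query param check in case 2 can be sketched with the standard URL API (the helper name here is mine, not the demo’s):&lt;/p&gt;

```javascript
// Sketch: reading the optional ?url= query param described above.
// Returns the shared Daily room URL, or null when the user should
// create a new room instead. getSharedRoomUrl() is an illustrative name.
function getSharedRoomUrl(href) {
  return new URL(href).searchParams.get('url');
}
```

&lt;p&gt;In the browser, this would typically be called with &lt;code&gt;window.location.href&lt;/code&gt;.&lt;/p&gt;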

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gQUTR_uz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1n7q9japlsit4rd5l18z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gQUTR_uz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1n7q9japlsit4rd5l18z.png" alt="Form showing name and Daily room URL fields" width="800" height="327"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Joining a specific Daily room.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means there are two types of participants on join: meeting owners and regular participants. (No one joins as an admin!)&lt;/p&gt;

&lt;p&gt;When the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/JoinForm/JoinForm.js#L6"&gt;&lt;code&gt;JoinForm&lt;/code&gt;&lt;/a&gt; is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L168"&gt;submitted&lt;/a&gt;, a few things happen.&lt;/p&gt;

&lt;p&gt;If the participant is joining as a meeting owner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new room is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L184"&gt;created&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;An owner &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L191"&gt;meeting token is created&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;An instance of a call frame (or &lt;code&gt;DailyIframe&lt;/code&gt; class) is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L122"&gt;created&lt;/a&gt;. (Note: This is called &lt;code&gt;callFrame&lt;/code&gt; in the code samples below.)&lt;/li&gt;
&lt;li&gt;The call is &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L140"&gt;joined&lt;/a&gt; using the meeting token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the participant is joining an existing call, the call frame is created and joined without a meeting token.&lt;/p&gt;
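&lt;p&gt;The two join paths can be sketched as follows. The &lt;code&gt;buildJoinOptions()&lt;/code&gt; helper is illustrative, not demo code:&lt;/p&gt;

```javascript
// Sketch: the two join paths described above. Owners join with their
// meeting token; participants using a shared link join with the room
// URL alone. buildJoinOptions() is an illustrative helper.
function buildJoinOptions(roomUrl, token) {
  return token ? { url: roomUrl, token } : { url: roomUrl };
}

// Usage with a call frame (browser only):
// const callFrame = DailyIframe.createFrame(containerRef.current);
// await callFrame.join(buildJoinOptions(roomUrl, token));
```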

&lt;p&gt;Now that we know how our participants can join the video call, let’s see how the &lt;code&gt;AdminPanel&lt;/code&gt; component works.&lt;/p&gt;
&lt;h2&gt;
  
  
  Rendering the &lt;code&gt;AdminPanel&lt;/code&gt; component
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/AdminPanel/AdminPanel.js"&gt;&lt;code&gt;AdminPanel&lt;/code&gt; component&lt;/a&gt; will render a list of all call participants. If the local participant has owner or admin privileges, each participant in the list will have buttons to remove that participant or promote them to be an admin. (Admins can’t remove owners, though!)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Havmt_d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e4fjmc71l0p9n63u4863.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Havmt_d---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e4fjmc71l0p9n63u4863.png" alt="Owner view of a call with two participants." width="800" height="445"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Owner view of a call with two participants.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before getting into how we remove/promote participants, let’s first look at how we keep track of them in app state. In &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js"&gt;&lt;code&gt;DailyContainer&lt;/code&gt;&lt;/a&gt;, there’s a &lt;code&gt;participants&lt;/code&gt; object saved in the component’s state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const [participants, setParticipants] = useState({});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any time someone joins the call, we &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L42"&gt;update&lt;/a&gt; the &lt;code&gt;participants&lt;/code&gt; object by adding the participant’s session ID as the key and the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/participants#participant-properties"&gt;participant’s object&lt;/a&gt; as the value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;setParticipants((p) =&amp;gt; ({
   ...p,

 }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: When someone &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L15"&gt;joins as an owner&lt;/a&gt; or anyone &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L66"&gt;is promoted to an admin&lt;/a&gt;, their status is tracked in &lt;code&gt;DailyContainer&lt;/code&gt;’s state. These values are passed as props to &lt;code&gt;AdminPanel&lt;/code&gt;, too.&lt;/p&gt;
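&lt;p&gt;Deriving that status from a participant object can be sketched like this (the helper name is mine; &lt;code&gt;owner&lt;/code&gt; and &lt;code&gt;permissions.canAdmin&lt;/code&gt; are real participant properties used later in this post):&lt;/p&gt;

```javascript
// Sketch: deriving a participant's role from a daily-js participant
// object, as tracked in DailyContainer's state. canAdmin may be a
// boolean or a set of capabilities, so a truthy check is used here.
function getParticipantRole(participant) {
  if (participant.owner) return 'owner';
  if (participant.permissions && participant.permissions.canAdmin) return 'admin';
  return 'participant';
}
```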

&lt;p&gt;If someone leaves the call, we &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L73"&gt;delete&lt;/a&gt; their item from the object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;setParticipants((p) =&amp;gt; {
    const currentParticipants = { ...p };
    delete currentParticipants[e.participant.session_id];
    return currentParticipants;
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;participants&lt;/code&gt;, we can render all call participants as a list. First we render the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L2690"&gt;&lt;code&gt;AdminPanel&lt;/code&gt; component&lt;/a&gt; itself, including a number of props:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DailyContainer.js
{callFrame &amp;amp;&amp;amp; (
        &amp;lt;&amp;gt;
          &amp;lt;AdminPanel
            participants={participants}
            localIsOwner={isOwner}
            localIsAdmin={isAdmin}
            makeAdmin={makeAdmin}
            removeFromCall={removeFromCall}
          /&amp;gt;
          // … See source code
        &amp;lt;/&amp;gt;
      )}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;AdminPanel&lt;/code&gt; component is rendered for all participants, but only owners and admins will see buttons to update other participants:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gDxGR-dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mcg696aado7tl9w5r1h0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gDxGR-dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mcg696aado7tl9w5r1h0.png" alt="An owner’s view of a two-person call where the second person is a regular participant." width="800" height="147"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An owner’s view of a two-person call where the second person is a regular participant.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A regular participant will instead see only the participant information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RwFdOVO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lgke482f16f90df7gmrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RwFdOVO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lgke482f16f90df7gmrj.png" alt="Participant list where no participant can be promoted." width="800" height="144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A regular participant’s view of the &lt;code&gt;AdminPanel&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The main functionality of &lt;code&gt;AdminPanel&lt;/code&gt; is actually defined in its child component, &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/AdminPanel/AdminPanel.js#L3"&gt;&lt;code&gt;ParticipantListItem&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export default function AdminPanel({
  participants,
  makeAdmin,
  removeFromCall,
  localIsOwner,
  localIsAdmin,
}) {
  return (
    &amp;lt;div className='admin-panel'&amp;gt;
      // … See source code

      &amp;lt;ul&amp;gt;
        {Object.values(participants).map((p, i) =&amp;gt; {
          const handleMakeAdmin = () =&amp;gt; makeAdmin(p.session_id);
          const handleRemoveFromCall = () =&amp;gt; removeFromCall(p.session_id);
          return (
            &amp;lt;ParticipantListItem
              count={i + 1} // for numbered list
              key={p.session_id}
              p={p}
              localIsOwner={localIsOwner}
              localIsAdmin={localIsAdmin}
              makeAdmin={handleMakeAdmin}
              removeFromCall={handleRemoveFromCall}
            /&amp;gt;
          );
        })}
      &amp;lt;/ul&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L18"&gt;&lt;code&gt;participants&lt;/code&gt;&lt;/a&gt; prop, a &lt;code&gt;ParticipantListItem&lt;/code&gt; component is rendered for each participant in an unordered list (&lt;code&gt;ul&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w2CgMyAz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6bmffmpbj5gyvsw00z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w2CgMyAz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6bmffmpbj5gyvsw00z1.png" alt="A list of two  raw `ParticipantListItem` endraw  components with the second one highlighted." width="800" height="148"&gt;&lt;/a&gt;&lt;br&gt;
&lt;code&gt;ParticipantListItem&lt;/code&gt; is passed most of the props &lt;code&gt;AdminPanel&lt;/code&gt; received and then actually uses them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const ParticipantListItem = ({
  p,
  makeAdmin,
  removeFromCall,
  localIsOwner,
  localIsAdmin,
  count,
}) =&amp;gt; (
  &amp;lt;li&amp;gt;
    &amp;lt;span&amp;gt;
      {`${count}. `}
      {p.permissions.canAdmin &amp;amp;&amp;amp; &amp;lt;b&amp;gt;{p.owner ? 'Owner | ' : 'Admin | '}&amp;lt;/b&amp;gt;}
      &amp;lt;b&amp;gt;{p.local &amp;amp;&amp;amp; '(You) '}&amp;lt;/b&amp;gt;
      {p.user_name}: {p.session_id}
    &amp;lt;/span&amp;gt;{' '}
    {!p.local &amp;amp;&amp;amp; !p.owner &amp;amp;&amp;amp; (localIsAdmin || localIsOwner) &amp;amp;&amp;amp; (
      &amp;lt;span className='buttons'&amp;gt;
        {(!p.permissions.canAdmin || localIsOwner) &amp;amp;&amp;amp; (
          &amp;lt;button className='red-button-secondary' onClick={removeFromCall}&amp;gt;
            Remove from call
          &amp;lt;/button&amp;gt;
        )}
        {!p.permissions.canAdmin &amp;amp;&amp;amp; (
          &amp;lt;button onClick={makeAdmin}&amp;gt;Make admin&amp;lt;/button&amp;gt;
        )}
      &amp;lt;/span&amp;gt;
    )}
  &amp;lt;/li&amp;gt;
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the participant’s number to their line item.&lt;/li&gt;
&lt;li&gt;Indicate if they’re an owner or admin and if it’s the local participant (you!).&lt;/li&gt;
&lt;li&gt;Render their username and session ID to confirm they’re unique participants.&lt;/li&gt;
&lt;li&gt;Conditionally render two buttons to either remove the participant or make them an admin.&lt;/li&gt;
&lt;/ul&gt;
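&lt;p&gt;The display rules above can be distilled into small pure functions. The following is a hypothetical sketch (these helpers are not part of the demo; &lt;code&gt;p&lt;/code&gt; mirrors the shape of a &lt;code&gt;daily-js&lt;/code&gt; participant object) expressing the same conditions the JSX encodes:&lt;/p&gt;

```javascript
// Hypothetical sketch of the ParticipantListItem display rules above.
// `p` mirrors the shape of a daily-js participant object.
function participantLabel(p, count) {
  let label = `${count}. `;
  if (p.permissions.canAdmin) {
    label += p.owner ? 'Owner | ' : 'Admin | ';
  }
  if (p.local) {
    label += '(You) ';
  }
  return `${label}${p.user_name}: ${p.session_id}`;
}

// Which moderation buttons the local user should see for participant `p`.
function visibleButtons(p, localIsOwner, localIsAdmin) {
  // No buttons for yourself, for owners, or if you lack privileges.
  if (p.local || p.owner || !(localIsOwner || localIsAdmin)) return [];
  const buttons = [];
  if (!p.permissions.canAdmin || localIsOwner) buttons.push('remove');
  if (!p.permissions.canAdmin) buttons.push('makeAdmin');
  return buttons;
}
```

&lt;p&gt;Keeping the conditions in plain functions like these makes them easy to unit test independently of the rendering layer.&lt;/p&gt;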

&lt;p&gt;Next, let’s focus on the button to make a participant an admin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting non-admins to admins
&lt;/h2&gt;

&lt;p&gt;As we saw above, the &lt;code&gt;ParticipantListItem&lt;/code&gt; component conditionally renders a button to let owners or admins make a regular participant an admin.&lt;/p&gt;

&lt;p&gt;The “Make admin” button click handler uses the &lt;code&gt;makeAdmin&lt;/code&gt; prop passed down from &lt;code&gt;DailyContainer&lt;/code&gt; and includes the participant’s &lt;code&gt;session_id&lt;/code&gt; so we know which participant is being upgraded to an admin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// AdminPanel.js

return (
        // … See source code for the full code block
        {Object.values(participants).map((p, i) =&amp;gt; {
          const handleMakeAdmin = () =&amp;gt; makeAdmin(p.session_id);
          const handleRemoveFromCall = () =&amp;gt; removeFromCall(p.session_id);

          return (
            &amp;lt;ParticipantListItem
makeAdmin={handleMakeAdmin}
    //… See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On click, the button invokes the &lt;code&gt;makeAdmin()&lt;/code&gt; function declared in &lt;code&gt;DailyContainer&lt;/code&gt;, which will upgrade the participant to an admin via Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant"&gt;&lt;code&gt;updateParticipant()&lt;/code&gt; instance method&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const makeAdmin = useCallback(
  (participantId) =&amp;gt; {
    // https://docs.daily.co/reference/daily-js/instance-methods/update-participant#permissions
    callFrame.updateParticipant(participantId, {
      updatePermissions: {
        canAdmin: true,
      },
    });
  },
  [callFrame]
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant"&gt;&lt;code&gt;updateParticipant()&lt;/code&gt;&lt;/a&gt; can be used for various participant updates and will act differently depending on the options passed in the second parameter. In this case, we want to make the participant an admin so we set the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant#permissions"&gt;&lt;code&gt;canAdmin&lt;/code&gt;&lt;/a&gt; property to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;updateParticipant()&lt;/code&gt; this way will in fact make the participant an admin, but it won’t update our UI. To do this, we need to listen for the &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-updated"&gt;&lt;code&gt;”participant-updated”&lt;/code&gt;&lt;/a&gt; Daily event, which is emitted any time a participant is updated in any way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating the participant list UI for new admins
&lt;/h3&gt;

&lt;p&gt;To see how we update the UI for new admins, we need to backtrack for a second.&lt;/p&gt;

&lt;p&gt;When the call frame was first created, a &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L108C19-L108C19"&gt;series of Daily event listeners&lt;/a&gt; were &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L129"&gt;attached&lt;/a&gt; to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const addDailyEvents = (dailyCallFrame) =&amp;gt; {
  // https://docs.daily.co/reference/daily-js/instance-methods/on
  dailyCallFrame
    .on('joined-meeting', handleJoinedMeeting)
    .on('participant-joined', handleParticipantJoined)
    .on('participant-updated', handleParticipantUpdate)
    // … See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L48"&gt;&lt;code&gt;handleParticipantUpdate()&lt;/code&gt;&lt;/a&gt; method is attached as the handler for &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-updated"&gt;&lt;code&gt;”participant-updated”&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const handleParticipantUpdate = (e) =&amp;gt; {
  const { participant } = e;
  const id = participant.session_id;
  if (!prevParticipants.current[id]) return;

  // Only update the participants list if the permission has changed.
  if (
    prevParticipants.current[id].permissions.canAdmin !==
    participant.permissions.canAdmin
  ) {
    setParticipants((p) =&amp;gt; ({
      ...p,
      [id]: participant,
    }));
    if (participant.local) {
      setIsAdmin(participant.permissions.canAdmin);
    }
  }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the purposes of this app, we are only looking for changes to the &lt;code&gt;participant.permissions.canAdmin&lt;/code&gt; status. If the previously saved value is different from the value included in the event’s payload for the participant, we update the participants list.&lt;/p&gt;

&lt;p&gt;Note: &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L19"&gt;&lt;code&gt;prevParticipants&lt;/code&gt;&lt;/a&gt; is a &lt;a href="https://react.dev/learn/referencing-values-with-refs"&gt;ref&lt;/a&gt; that keeps a copy of the participants list so that we can compare the current list with possible updates.&lt;/p&gt;
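&lt;p&gt;The comparison in &lt;code&gt;handleParticipantUpdate()&lt;/code&gt; boils down to a single predicate. Here is a hypothetical standalone version (the function name is illustrative, not from the demo):&lt;/p&gt;

```javascript
// Hypothetical predicate: should a "participant-updated" event trigger a
// state update? Only when the canAdmin permission actually changed.
function adminPermissionChanged(prev, next) {
  // Unknown participant: ignore, matching the early return in the handler.
  if (!prev) return false;
  return prev.permissions.canAdmin !== next.permissions.canAdmin;
}
```

&lt;p&gt;Filtering on the one field we care about avoids needless re-renders, since &lt;code&gt;”participant-updated”&lt;/code&gt; fires for many kinds of changes.&lt;/p&gt;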

&lt;p&gt;Once the participant change is updated in state, the &lt;code&gt;AdminPanel&lt;/code&gt; component will automatically receive the updates through the &lt;code&gt;participants&lt;/code&gt; prop and update how the panel is rendered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zAXOTznU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0p5k3uvnwq4v038i6lly.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zAXOTznU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0p5k3uvnwq4v038i6lly.gif" alt="A participant being made an admin and the  raw `admin panel` endraw  updating its appearance in response." width="600" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A participant being made an admin.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Ejecting a participant from the call
&lt;/h2&gt;

&lt;p&gt;Ejecting (or removing) a participant from the call works almost identically from a code perspective.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;AdminPanel&lt;/code&gt; when the &lt;code&gt;ParticipantListItems&lt;/code&gt; are rendered, the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L202"&gt;&lt;code&gt;removeFromCall()&lt;/code&gt;&lt;/a&gt; prop is passed down, as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// AdminPanel

return (
        // … See source code
        {Object.values(participants).map((p, i) =&amp;gt; {
          const handleMakeAdmin = () =&amp;gt; makeAdmin(p.session_id);
          const handleRemoveFromCall = () =&amp;gt; removeFromCall(p.session_id);

          return (
            &amp;lt;ParticipantListItem
removeFromCall={handleRemoveFromCall}
    //… See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is then used by the &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/AdminPanel/AdminPanel.js#L21"&gt;&lt;code&gt;button&lt;/code&gt; element in &lt;code&gt;ParticipantListItem&lt;/code&gt;&lt;/a&gt; to remove a participant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{(!p.permissions.canAdmin || localIsOwner) &amp;amp;&amp;amp; (
  &amp;lt;button className='red-button-secondary' onClick={removeFromCall}&amp;gt;
    Remove from call
  &amp;lt;/button&amp;gt;
)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual action it triggers is defined in &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L202"&gt;&lt;code&gt;DailyContainer&lt;/code&gt;&lt;/a&gt;, which will invoke the &lt;code&gt;updateParticipant()&lt;/code&gt; instance method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const removeFromCall = useCallback(
  (participantId) =&amp;gt; {
    // https://docs.daily.co/reference/daily-js/instance-methods/update-participant#setaudio-setvideo-and-eject
    callFrame.updateParticipant(participantId, {
      eject: true,
    });
  },
  [callFrame]
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main difference here is the properties passed to &lt;code&gt;updateParticipant()&lt;/code&gt;; this time we set the &lt;code&gt;eject&lt;/code&gt; property to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once they’ve been ejected, the &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-left"&gt;&lt;code&gt;”participant-left”&lt;/code&gt;&lt;/a&gt; event is emitted, another event that we’re already listening for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const addDailyEvents = (dailyCallFrame) =&amp;gt; {
  dailyCallFrame
    .on('participant-left', handleParticipantLeft)
    // … See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this event is received, the &lt;code&gt;participants&lt;/code&gt; object is updated in &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js#L71"&gt;&lt;code&gt;handleParticipantLeft()&lt;/code&gt;&lt;/a&gt; and the UI updates to reflect the change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const handleParticipantLeft = (e) =&amp;gt; {
  console.log(e.action);
  setParticipants((p) =&amp;gt; {
    const currentParticipants = { ...p };
    delete currentParticipants[e.participant.session_id];
    return currentParticipants;
  });
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
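&lt;p&gt;Note that &lt;code&gt;handleParticipantLeft()&lt;/code&gt; copies the previous state object before deleting the entry, which is what React’s state setter expects. The same pattern in isolation, as a hypothetical helper:&lt;/p&gt;

```javascript
// Hypothetical helper: return a new participants map without the given
// session ID. Copying first leaves the previous state object untouched,
// so React can detect the change by reference.
function withoutParticipant(participants, sessionId) {
  const next = { ...participants };
  delete next[sessionId];
  return next;
}
```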



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---bOwcrtj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6a810oznzjwhejmkeam9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---bOwcrtj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6a810oznzjwhejmkeam9.gif" alt="An admin’s view of being ejected from the call by a call owner." width="600" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An admin’s view of being ejected from the call by a call owner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And, like that, we can promote participants to admins or remove them from the call!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In today’s post we looked at &lt;a href="https://github.com/daily-demos/daily-samples-js/blob/0ae06dda730c95691ceb743dd0f2db8b44b974e3/samples/daily-prebuilt/permissions-can-admin/src/components/DailyContainer/DailyContainer.js"&gt;sample code&lt;/a&gt; that demonstrates how to use the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant"&gt;&lt;code&gt;updateParticipant()&lt;/code&gt;&lt;/a&gt; instance method with the &lt;code&gt;canAdmin&lt;/code&gt; property to share admin privileges with other call participants. We also looked at ejecting call participants with the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant"&gt;&lt;code&gt;updateParticipant()&lt;/code&gt;&lt;/a&gt; instance method.&lt;/p&gt;

&lt;p&gt;To learn more about admin privileges and some of the topics we couldn’t cover in detail today, check out these related blog posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.daily.co/blog/manage-call-permissions-with-dailys-knocking-feature/"&gt;Add a knocking feature for private rooms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.daily.co/blog/use-next-api-routes-to-create-daily-rooms-dynamically/"&gt;Create and join a Daily room&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.daily.co/blog/comparing-domain-room-and-meeting-token-rest-api-configurations-for-daily-video-calls/"&gt;Configuring domains, rooms, and meeting tokens&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webrtc</category>
      <category>javascript</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to properly destroy a Daily video call instance</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Thu, 19 Oct 2023 17:31:31 +0000</pubDate>
      <link>https://forem.com/trydaily/how-to-properly-destroy-a-daily-video-call-instance-54ec</link>
      <guid>https://forem.com/trydaily/how-to-properly-destroy-a-daily-video-call-instance-54ec</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/liza/"&gt;Liza Shulyayeva&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll show you how to appropriately &lt;em&gt;destroy&lt;/em&gt; a Daily call object or call frame in your JavaScript application. How and when to destroy the call instance is a common question we get from developers building video apps with Daily, and by the end of this post you’ll know exactly what to do.&lt;/p&gt;

&lt;p&gt;Developers often want to recreate or reconfigure the &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-call-object"&gt;call object&lt;/a&gt; or &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-frame"&gt;call frame&lt;/a&gt; at some point in their application lifecycle. As only one instance of these can be active at a time, this means the previous instance needs to be either reused or destroyed. If you’re curious to learn more, check out my previous post about &lt;a href="https://www.daily.co/blog/why-you-probably-dont-need-multiple-daily-call-objects/"&gt;why you don’t need multiple call objects&lt;/a&gt; in your Daily-powered video application.&lt;/p&gt;

&lt;p&gt;First, let’s cover some basics: What is a Daily call object/call frame?&lt;/p&gt;

&lt;h2&gt;
  
  
  A brief overview of Daily’s call object and call frame
&lt;/h2&gt;

&lt;p&gt;When it comes to JavaScript implementations, we have two primary entry points to work with Daily:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="http://daily.co/prebuilt"&gt;Daily Prebuilt&lt;/a&gt;: A full-featured video call UI that you can embed into your own web app.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;: An SDK that enables you to build video into your applications with granular control and flexibility over the UI, media handling, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entry point to both of these approaches is &lt;a href="https://docs.daily.co/reference/daily-js"&gt;&lt;code&gt;daily-js&lt;/code&gt;&lt;/a&gt;, our JavaScript library. For Daily Prebuilt, you would use the &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-frame"&gt;&lt;code&gt;DailyIframe.createFrame()&lt;/code&gt;&lt;/a&gt; factory method to get started. For a custom implementation with the client SDK, you’d use the &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-call-object"&gt;&lt;code&gt;DailyIframe.createCallObject()&lt;/code&gt;&lt;/a&gt; factory method.&lt;/p&gt;

&lt;p&gt;In reality, both of these will return an instance of the &lt;em&gt;same type&lt;/em&gt; (&lt;a href="https://github.com/daily-co/daily-js/blob/daily-js-releases/index.d.ts#L1663"&gt;&lt;code&gt;DailyCall&lt;/code&gt;&lt;/a&gt;), just configured appropriately for either Daily Prebuilt or a custom usage. For the remainder of this post, I’ll refer to both of these as the &lt;em&gt;Daily call instance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The Daily call instance will be your main interface with Daily. You’ll use it to set listeners for relevant &lt;a href="https://docs.daily.co/reference/daily-js/events"&gt;events&lt;/a&gt;, &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participants"&gt;update participants&lt;/a&gt;, &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-participant#actions"&gt;toggle video and audio&lt;/a&gt;, and more.&lt;/p&gt;

&lt;p&gt;Because the call object and call frame are actually two different configurations of the &lt;em&gt;same underlying type&lt;/em&gt;, the principles of destroying them are the same.&lt;/p&gt;

&lt;p&gt;Now that we’ve got a handle on what the Daily call instance is, let’s look at some basic video call lifecycle guidelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Daily call instance lifecycle
&lt;/h2&gt;

&lt;p&gt;As I mentioned above, a Daily call instance is created by using the &lt;code&gt;createCallObject()&lt;/code&gt; or &lt;code&gt;createFrame()&lt;/code&gt; factory methods of &lt;code&gt;daily-js&lt;/code&gt;. Only &lt;em&gt;one&lt;/em&gt; call instance can exist per window or iframe context in your web app. If you create more than one instance, you’ll see the following error in your console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PUw5N-0M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8boutouxcdlxi5mr4i3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PUw5N-0M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8boutouxcdlxi5mr4i3p.png" alt="Error message referencing duplicate call object instances in a Daily video call app" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to check whether a call instance already exists, use the &lt;a href="https://docs.daily.co/reference/daily-js/static-methods/get-call-instance"&gt;&lt;code&gt;getCallInstance()&lt;/code&gt;&lt;/a&gt; static method. If an instance is returned, you can either &lt;em&gt;reuse&lt;/em&gt; it or &lt;em&gt;destroy&lt;/em&gt; it and create a new one, depending on your needs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reusing a call instance:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let dailyCall = DailyIframe.getCallInstance();
if (!dailyCall) {
    dailyCall = DailyIframe.createCallObject();
}
// Proceed to use `dailyCall`...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating a new call instance:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let dailyCall = DailyIframe.getCallInstance();
if (dailyCall) {
  await dailyCall.destroy();
}
dailyCall = DailyIframe.createCallObject();

// Proceed to use `dailyCall`...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An instance that is returned from &lt;code&gt;getCallInstance()&lt;/code&gt; is guaranteed to not have been destroyed. If you already have your own reference to a Daily call instance but aren’t sure if it’s been destroyed, you can use the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/is-destroyed"&gt;&lt;code&gt;isDestroyed()&lt;/code&gt;&lt;/a&gt; instance method to check.&lt;/p&gt;
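&lt;p&gt;Taken together, these checks form a get-or-create pattern. Below is a hypothetical, framework-agnostic sketch in which &lt;code&gt;getInstance&lt;/code&gt; and &lt;code&gt;create&lt;/code&gt; stand in for &lt;code&gt;DailyIframe.getCallInstance()&lt;/code&gt; and &lt;code&gt;DailyIframe.createCallObject()&lt;/code&gt;:&lt;/p&gt;

```javascript
// Hypothetical get-or-create helper mirroring the reuse/recreate flows above.
// `getInstance` and `create` stand in for the DailyIframe static methods.
async function getOrCreateCall(getInstance, create, forceNew = false) {
  let call = getInstance();
  if (call) {
    if (!forceNew) return call; // reuse the existing, not-yet-destroyed instance
    await call.destroy();       // wait for teardown before recreating
  }
  return create();
}
```

&lt;p&gt;Centralizing this logic in one helper keeps the “only one instance at a time” rule enforced in a single place.&lt;/p&gt;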

&lt;h2&gt;
  
  
  Leaving a Daily video call
&lt;/h2&gt;

&lt;p&gt;In most cases, reusing the call instance between calls is the most intuitive flow. In this case, you would call the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/leave"&gt;&lt;code&gt;leave()&lt;/code&gt;&lt;/a&gt; instance method and then simply call &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/join"&gt;&lt;code&gt;join()&lt;/code&gt;&lt;/a&gt; again to join (or rejoin) a video call room.&lt;/p&gt;

&lt;p&gt;Keep in mind that calling &lt;code&gt;leave()&lt;/code&gt; will keep all existing handlers in place. This means they can be reused (if relevant) without any re-initialization. But if you need to modify the handlers in any way for the next call session, be sure to &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/off"&gt;turn them off when you leave the call&lt;/a&gt;.&lt;/p&gt;
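&lt;p&gt;One hypothetical way to make that cleanup reliable is to register handlers in a single place and return a matching cleanup function (this helper is illustrative, using the generic &lt;code&gt;on&lt;/code&gt;/&lt;code&gt;off&lt;/code&gt; methods that &lt;code&gt;daily-js&lt;/code&gt; exposes):&lt;/p&gt;

```javascript
// Hypothetical handler bookkeeping: attach handlers in one place so they
// can all be detached before reconfiguring for the next call session.
function attachHandlers(call, handlers) {
  for (const [event, handler] of Object.entries(handlers)) {
    call.on(event, handler);
  }
  // Return a cleanup function that removes exactly what was added.
  return () => {
    for (const [event, handler] of Object.entries(handlers)) {
      call.off(event, handler);
    }
  };
}
```

&lt;p&gt;Calling the returned function when you leave guarantees the next session starts with a clean slate of handlers.&lt;/p&gt;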

&lt;p&gt;In some cases, developers want a &lt;em&gt;total reset&lt;/em&gt; of call object state at some point in their application. This is where &lt;em&gt;destruction&lt;/em&gt; of the call object comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Destroying a Daily video call instance
&lt;/h2&gt;

&lt;p&gt;You should destroy a Daily call instance when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want a complete reset of the call state.&lt;/li&gt;
&lt;li&gt;You’re done with using Daily and want to free up all related resources.&lt;/li&gt;
&lt;li&gt;You want to create a new call instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To destroy a call object, you can leave the video call first or just call &lt;code&gt;destroy()&lt;/code&gt; directly (which will leave the call as well). If you opt to leave first, you can await the return of the &lt;code&gt;leave()&lt;/code&gt; method before destroying, or listen for the &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#left-meeting"&gt;&lt;code&gt;”left-meeting”&lt;/code&gt;&lt;/a&gt; Daily event to be emitted on the call instance. The latter is slightly more idiomatic, and if you already have other event handlers in place, it results in a more consistent implementation across the board (this becomes just another event you handle). For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;callObject.on("left-meeting", () =&amp;gt; {
    callObject.destroy();
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the call has been left, invoke the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/destroy"&gt;&lt;code&gt;destroy()&lt;/code&gt;&lt;/a&gt; instance method. The &lt;code&gt;destroy()&lt;/code&gt; method &lt;em&gt;also&lt;/em&gt; returns a Promise; you must wait for it to resolve before creating another call instance.&lt;/p&gt;
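&lt;p&gt;A hypothetical teardown sequence that respects both Promises might look like this (&lt;code&gt;callObject&lt;/code&gt; is any Daily call instance; &lt;code&gt;createNew&lt;/code&gt; stands in for whatever factory call you use next):&lt;/p&gt;

```javascript
// Hypothetical full-reset sketch: leave, destroy, then (and only then)
// create the replacement instance.
async function resetCall(callObject, createNew) {
  await callObject.leave();   // optional: destroy() also leaves the call
  await callObject.destroy(); // must resolve before a new instance is created
  return createNew();
}
```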

&lt;p&gt;The above recommendations also apply to pre-join UI scenarios (such as when you have used &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/start-camera"&gt;&lt;code&gt;startCamera()&lt;/code&gt;&lt;/a&gt; for a pre-join flow, but have not yet joined a specific video call room).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we’ve covered how to best reset your video call state and destroy a Daily call instance.&lt;/p&gt;

&lt;p&gt;If you have any questions about Daily call state or the call instance lifecycle, &lt;a href="https://www.daily.co/company/contact/support/"&gt;get in touch with our support team&lt;/a&gt;, or head on over to &lt;a href="https://community.daily.co/"&gt;peerConnection, our WebRTC community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webrtc</category>
      <category>javascript</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Manage Daily video call state in Angular (Part 2)</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Tue, 17 Oct 2023 22:01:25 +0000</pubDate>
      <link>https://forem.com/trydaily/manage-daily-video-call-state-in-angular-part-2-2pfi</link>
      <guid>https://forem.com/trydaily/manage-daily-video-call-state-in-angular-part-2-2pfi</guid>
      <description>&lt;p&gt;&lt;a href="https://www.daily.co/blog/author/jess/"&gt;&lt;em&gt;By Jess Mitchell&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this series, we’re building a video call app with a fully customized UI using &lt;a href="https://angular.io/"&gt;Angular&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;. In the &lt;a href="https://www.daily.co/blog/p/d0d9bc4f-06b7-4c50-b57b-3d48e983a6ce/"&gt;first post&lt;/a&gt;, we reviewed the app’s core features, as well as the general code structure and the role of each component. (Instructions for setting up the demo app locally are also included there.)&lt;/p&gt;

&lt;p&gt;In today’s post, we’ll start digging into the &lt;a href="https://github.com/daily-demos/daily-angular"&gt;Angular demo app’s source code&lt;/a&gt; available on GitHub. More specifically, we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form/join-form.component.ts"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt; component to allow users to join a Daily room.&lt;/li&gt;
&lt;li&gt;Show how the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.ts"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt; component works, including how it responds to the &lt;code&gt;join-form&lt;/code&gt; being submitted.&lt;/li&gt;
&lt;li&gt;See how the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; component sets up an instance of the Daily Client SDK’s &lt;a href="https://docs.daily.co/reference/daily-js/daily-iframe-class"&gt;call object&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Review how joining a Daily call works, and how to handle the events related to &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#joined-meeting"&gt;joining&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#left-meeting"&gt;leaving&lt;/a&gt; a call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next post in this series will cover rendering the actual video tiles for each participant, so stay tuned for that.&lt;/p&gt;

&lt;p&gt;Now that we have a plan, let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;app-daily-container:&lt;/code&gt; The parent component
&lt;/h2&gt;

&lt;p&gt;As a reminder from our first post, &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.ts"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt; is the parent component for the video call feature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b7a0GBbZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ebz1eeutn2x99h0hfmf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b7a0GBbZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ebz1eeutn2x99h0hfmf9.png" alt="Component structure in the Angular demo app." width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Component structure in the Angular demo app.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It includes the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form/join-form.component.ts"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt; component, which allows users to submit an HTML form to join a Daily call. It also includes the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; component, which represents the video call UI shown after the form is submitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EHIuUcWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w00sdwn78txdkaauyvyu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EHIuUcWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w00sdwn78txdkaauyvyu.gif" alt="Gif of call participant submitting the join form and joining the video call" width="600" height="260"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Submitting the form in &lt;code&gt;join-form&lt;/code&gt; will cause &lt;code&gt;app-daily-container&lt;/code&gt; to render &lt;code&gt;app-call&lt;/code&gt; instead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this sense, &lt;code&gt;app-daily-container&lt;/code&gt; is the bridge between the two views because we need to share the form values obtained via &lt;code&gt;join-form&lt;/code&gt; with the in-call components (&lt;code&gt;app-call&lt;/code&gt; and its children).&lt;/p&gt;

&lt;p&gt;💡 &lt;em&gt;&lt;code&gt;app-chat&lt;/code&gt; is also a child of &lt;code&gt;app-daily-container&lt;/code&gt;, but we’ll cover that in a future post.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Looking at the class definition for &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.ts"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt;, we see there are two class variables (&lt;code&gt;dailyRoomUrl&lt;/code&gt; and &lt;code&gt;userName&lt;/code&gt;), as well as methods to set or reset those variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Component } from "@angular/core";

@Component({
  selector: "app-daily-container",
  templateUrl: "./daily-container.component.html",
  styleUrls: ["./daily-container.component.css"],
})

export class DailyContainerComponent {
  // Store callObject in this parent container.
  // Most callObject logic in CallComponent.
  userName: string;
  dailyRoomUrl: string;

  setUserName(name: string): void {
    // Event is emitted from JoinForm
    this.userName = name;
  }

  setUrl(url: string): void {
    // Event is emitted from JoinForm
    this.dailyRoomUrl = url;
  }

  callEnded(): void {
    // Truthy value will show the CallComponent; otherwise, the JoinFormComponent is shown.
    this.dailyRoomUrl = "";
    this.userName = "";
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.html"&gt;HTML&lt;/a&gt; for &lt;code&gt;app-daily-container&lt;/code&gt; shows how these variables and methods get shared with the child components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;join-form
  *ngIf="!dailyRoomUrl"
  (setUserName)="setUserName($event)"
  (setUrl)="setUrl($event)"&amp;gt;&amp;lt;/join-form&amp;gt;
&amp;lt;app-call
  *ngIf="dailyRoomUrl"
  [userName]="userName"
  [dailyRoomUrl]="dailyRoomUrl"
  (callEnded)="callEnded()"&amp;gt;&amp;lt;/app-call&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the &lt;code&gt;join-form&lt;/code&gt; and &lt;code&gt;app-call&lt;/code&gt; components use the &lt;a href="https://angular.io/api/common/NgIf"&gt;&lt;code&gt;*ngIf&lt;/code&gt;&lt;/a&gt; directive to determine when they should be rendered. Their conditions are opposites, so the two components are never rendered at the same time. If &lt;code&gt;dailyRoomUrl&lt;/code&gt; is not set yet, the &lt;code&gt;join-form&lt;/code&gt; (the default view) is rendered; otherwise, the &lt;code&gt;app-call&lt;/code&gt; component is. This works because &lt;code&gt;dailyRoomUrl&lt;/code&gt; is only set when the join form is submitted, so an empty value means the form hasn’t been used yet.&lt;/p&gt;
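
&lt;p&gt;The toggle boils down to a single condition. Here’s an illustrative plain-TypeScript sketch (the &lt;code&gt;viewFor&lt;/code&gt; helper and the example room URL are hypothetical, not part of the demo) showing which component renders for a given &lt;code&gt;dailyRoomUrl&lt;/code&gt; value:&lt;/p&gt;

```typescript
// Hypothetical helper mirroring the two *ngIf conditions:
// join-form renders while dailyRoomUrl is empty; app-call renders once it is set.
function viewFor(dailyRoomUrl: string): "join-form" | "app-call" {
  return dailyRoomUrl ? "app-call" : "join-form";
}

// Before the form is submitted, the join form is the default view.
const before = viewFor("");

// After submission sets the room URL, the call UI takes over.
const after = viewFor("https://example.daily.co/my-room");
```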

&lt;p&gt;&lt;code&gt;join-form&lt;/code&gt; has two output properties: the &lt;code&gt;setUserName()&lt;/code&gt; and &lt;code&gt;setUrl()&lt;/code&gt; event emitters, which fire when the join form is submitted.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;app-call&lt;/code&gt; receives the &lt;code&gt;userName&lt;/code&gt; and &lt;code&gt;dailyRoomUrl&lt;/code&gt; variables as input props, which automatically update when Angular detects a change in their values. It also has &lt;code&gt;callEnded()&lt;/code&gt; as an output prop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding input and output props in Angular
&lt;/h3&gt;

&lt;p&gt;In Angular, props can either be an &lt;a href="https://angular.io/api/core/Input"&gt;input property&lt;/a&gt; – a value passed down to the child – or an &lt;a href="https://angular.io/api/core/Output"&gt;output property&lt;/a&gt; – a method used to emit a value from the child to the parent component. (Data sharing is bidirectional between the parent and child when both types are used.)&lt;/p&gt;

&lt;p&gt;You can tell which type of property it is based on whether the child component declares the prop as an input or output. For example, in &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L38"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; we see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Input() userName: string;
@Output() callEnded: EventEmitter&amp;lt;null&amp;gt; = new EventEmitter();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;userName&lt;/code&gt; prop is an input property, meaning the value will automatically update in &lt;code&gt;app-call&lt;/code&gt; whenever the parent component updates it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;callEnded&lt;/code&gt; prop is an output property. Whenever &lt;code&gt;app-call&lt;/code&gt; wants to emit an update to the parent component (&lt;code&gt;app-daily-container&lt;/code&gt;), it can call &lt;code&gt;this.callEnded.emit(value)&lt;/code&gt; and &lt;code&gt;app-daily-container&lt;/code&gt; will receive the event and its payload.&lt;/p&gt;
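
&lt;p&gt;To make the child-to-parent flow concrete, here is a minimal plain-TypeScript sketch of the output-property pattern. &lt;code&gt;SimpleEmitter&lt;/code&gt; is a hypothetical stand-in for Angular’s &lt;code&gt;EventEmitter&lt;/code&gt;, so the sketch runs outside a real Angular app:&lt;/p&gt;

```typescript
// SimpleEmitter is a hypothetical stand-in for Angular's EventEmitter,
// just enough to show the child-to-parent event flow.
class SimpleEmitter<T> {
  private listeners: Array<(value: T) => void> = [];
  subscribe(listener: (value: T) => void): void {
    this.listeners.push(listener);
  }
  emit(value: T): void {
    for (const listener of this.listeners) listener(value);
  }
}

// Child side: the @Output() prop is an emitter the child fires.
const callEnded = new SimpleEmitter<null>();

// Parent side: the template binding (callEnded)="callEnded()" effectively
// subscribes the parent's handler to the child's emitter.
let callIsOver = false;
callEnded.subscribe(() => {
  callIsOver = true;
});

// When the child emits, the parent's handler runs.
callEnded.emit(null);
```

In the real app, Angular performs the subscription for you when it parses the `(callEnded)="callEnded()"` binding in the parent’s template.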

&lt;h2&gt;
  
  
  &lt;code&gt;join-form&lt;/code&gt;: Submitting user data for the call
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RhaMaoWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9dchutzfd6nus2j3zdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RhaMaoWU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9dchutzfd6nus2j3zdl.png" alt="Join form UI" width="800" height="358"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The demo app’s &lt;code&gt;join-form&lt;/code&gt; component.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form/join-form.component.ts"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt; component renders an HTML form element. It includes two inputs for the user to provide the Daily room they want to join, as well as their name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;form [formGroup]="joinForm" (ngSubmit)="onSubmit()" class="join-form"&amp;gt;
  &amp;lt;label for="name"&amp;gt;Your name&amp;lt;/label&amp;gt;
  &amp;lt;input type="text" id="name" formControlName="name" required /&amp;gt;
  &amp;lt;label for="url"&amp;gt;Daily room URL&amp;lt;/label&amp;gt;
  &amp;lt;input type="text" id="url" formControlName="url" required /&amp;gt;
  &amp;lt;input type="submit" value="Join call" [disabled]="!joinForm.valid" /&amp;gt;
&amp;lt;/form&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 &lt;em&gt;You can create a Daily room and get its URL through the &lt;a href="https://dashboard.daily.co/rooms/create"&gt;Daily dashboard&lt;/a&gt; or &lt;a href="https://docs.daily.co/reference/rest-api/rooms/create-room"&gt;REST API&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form/join-form.component.ts#L23"&gt;&lt;code&gt;onSubmit()&lt;/code&gt;&lt;/a&gt; class method is invoked when the form is submitted (via the &lt;a href="https://angular.io/api/forms/NgForm#properties"&gt;&lt;code&gt;ngSubmit&lt;/code&gt;&lt;/a&gt; event).&lt;/p&gt;

&lt;p&gt;The input values are bound to the component’s &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form/join-form.component.ts#L16"&gt;&lt;code&gt;joinForm&lt;/code&gt;&lt;/a&gt; group, created with Angular’s &lt;a href="https://angular.io/guide/reactive-forms"&gt;&lt;code&gt;FormBuilder&lt;/code&gt;&lt;/a&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// … See the source code for the complete component

export class JoinFormComponent {
  @Output() setUserName: EventEmitter&amp;lt;string&amp;gt; = new EventEmitter();
  @Output() setUrl: EventEmitter&amp;lt;string&amp;gt; = new EventEmitter();

  joinForm = this.formBuilder.group({
    name: "",
    url: "",
  });

  constructor(private formBuilder: FormBuilder) {}

  onSubmit(): void {
    const { name, url } = this.joinForm.value;
    if (!name || !url) return;

    // Clear form inputs
    this.joinForm.reset();
    // Emit event to update userName var in parent component
    this.setUserName.emit(name);
    // Emit event to update URL var in parent component
    this.setUrl.emit(url);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;onSubmit()&lt;/code&gt; is invoked, the form input values are retrieved through &lt;code&gt;this.joinForm.value&lt;/code&gt;. The &lt;code&gt;this.joinForm&lt;/code&gt; values are then reset, which also resets the inputs in the form UI to visually indicate to the user that the form was successfully submitted.&lt;/p&gt;

&lt;p&gt;Next, the &lt;code&gt;setUserName()&lt;/code&gt; and &lt;code&gt;setUrl()&lt;/code&gt; output properties are used to emit the form input values to &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.ts"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt;. (Remember: &lt;code&gt;app-daily-container&lt;/code&gt; is already listening for these events.)&lt;/p&gt;

&lt;p&gt;Once these events are received by &lt;code&gt;app-daily-container&lt;/code&gt;, the form values are assigned to the &lt;code&gt;userName&lt;/code&gt; and &lt;code&gt;dailyRoomUrl&lt;/code&gt; variables in &lt;code&gt;app-daily-container&lt;/code&gt;, the latter of which causes the &lt;code&gt;app-call&lt;/code&gt; component to be rendered instead of the &lt;code&gt;join-form&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It’s a mouthful, but now we can move on to creating the actual video call!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;app-call&lt;/code&gt;: The brains of this video call operation
&lt;/h2&gt;

&lt;p&gt;Let’s start with &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.html"&gt;&lt;code&gt;app-call&lt;/code&gt;’s HTML&lt;/a&gt; so we know what’s going to be rendered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;error-message
  *ngIf="error"
  [error]="error"
  (reset)="leaveCall()"&amp;gt;&amp;lt;/error-message&amp;gt;

&amp;lt;p *ngIf="dailyRoomUrl"&amp;gt;Daily room URL: {{ dailyRoomUrl }}&amp;lt;/p&amp;gt;

&amp;lt;p *ngIf="!isPublic"&amp;gt;
  This room is private and you are now in the lobby. Please use a public room to
  join a call.
&amp;lt;/p&amp;gt;

&amp;lt;div *ngIf="!error" class="participants-container"&amp;gt;
  &amp;lt;video-tile
    *ngFor="let participant of Object.values(participants)"
    (leaveCallClick)="leaveCall()"
    (toggleVideoClick)="toggleLocalVideo()"
    (toggleAudioClick)="toggleLocalAudio()"
    [joined]="joined"
    [videoReady]="participant.videoReady"
    [audioReady]="participant.audioReady"
    [userName]="participant.userName"
    [local]="participant.local"
    [videoTrack]="participant.videoTrack"
    [audioTrack]="participant.audioTrack"&amp;gt;&amp;lt;/video-tile&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is an &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.html#L4"&gt;error message&lt;/a&gt; that gets shown only if the &lt;code&gt;error&lt;/code&gt; variable is set. (This happens when the Daily Client SDK emits its &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#error"&gt;&lt;code&gt;”error”&lt;/code&gt;&lt;/a&gt; event.)&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dailyRoomUrl&lt;/code&gt; value is rendered in the UI so the local participant can see which room they joined. We also let the participant know if the room is private since we haven’t added a feature yet to join a private call. (You could, for example, add a &lt;a href="https://www.daily.co/blog/manage-call-permissions-with-dailys-knocking-feature/"&gt;knocking&lt;/a&gt; feature).&lt;/p&gt;

&lt;p&gt;Finally, the most important part: the video tiles. In Angular, you can use the &lt;a href="https://angular.io/api/common/NgFor"&gt;&lt;code&gt;*ngFor&lt;/code&gt;&lt;/a&gt; directive like a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of"&gt;&lt;code&gt;for...of&lt;/code&gt;&lt;/a&gt; statement. For each value in the &lt;code&gt;participants&lt;/code&gt; object (described in more detail below), a &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component will be rendered. We will look at how the &lt;code&gt;video-tile&lt;/code&gt; component works in the next post; for now, know that there’s one for each participant.&lt;/p&gt;
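
&lt;p&gt;Conceptually, that &lt;code&gt;*ngFor&lt;/code&gt; loop behaves like iterating &lt;code&gt;Object.values(participants)&lt;/code&gt; and producing one tile per entry. A small sketch, using a trimmed, hypothetical &lt;code&gt;Participant&lt;/code&gt; shape and made-up session IDs:&lt;/p&gt;

```typescript
// Trimmed Participant shape for illustration; the demo's type has more fields.
type Participant = { userName: string; local: boolean };

// participants is keyed by session_id, as in the demo (IDs here are made up).
const participants: { [sessionId: string]: Participant } = {
  "abc-123": { userName: "Tasha", local: true },
  "def-456": { userName: "Jess", local: false },
};

// Roughly what *ngFor="let participant of Object.values(participants)" does:
// render one video-tile per participant value.
const tiles: string[] = [];
for (const participant of Object.values(participants)) {
  tiles.push(`video-tile for ${participant.userName}`);
}
```

Angular re-runs this iteration whenever `participants` changes, so adding or deleting a key adds or removes a tile.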

&lt;p&gt;Next, let’s look at how we build our &lt;code&gt;participants&lt;/code&gt; object in &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts"&gt;&lt;code&gt;app-call&lt;/code&gt;’s class definition&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining the &lt;code&gt;CallComponent&lt;/code&gt; class
&lt;/h3&gt;

&lt;p&gt;Most of the logic related to interacting with &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt; is included in the &lt;code&gt;app-call&lt;/code&gt; component. The &lt;a href="https://docs.daily.co/reference/daily-js"&gt;&lt;code&gt;daily-js&lt;/code&gt; library&lt;/a&gt; is imported at the top of the file (&lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L11"&gt;&lt;code&gt;import DailyIframe from "@daily-co/daily-js";&lt;/code&gt;&lt;/a&gt;) which allows us to create the call for the local participant.&lt;/p&gt;

&lt;p&gt;To build our call, we will need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an instance of the call object (i.e., the &lt;a href="https://docs.daily.co/reference/daily-js/daily-iframe-class"&gt;&lt;code&gt;DailyIframe&lt;/code&gt; class&lt;/a&gt;) using the &lt;a href="https://docs.daily.co/reference/daily-js/factory-methods/create-call-object"&gt;&lt;code&gt;createCallObject()&lt;/code&gt;&lt;/a&gt; factory method. (The &lt;a href="https://docs.daily.co/reference/daily-js/static-methods/get-call-instance#main"&gt;&lt;code&gt;getCallInstance()&lt;/code&gt; static method&lt;/a&gt; can be used to see if a call object already exists, too.)&lt;/li&gt;
&lt;li&gt;Attach &lt;a href="https://docs.daily.co/reference/daily-js/events"&gt;event listeners&lt;/a&gt; to the call object. &lt;code&gt;daily-js&lt;/code&gt; will emit events for any changes in the call so we’ll listen for the ones relevant to our demo use case.&lt;/li&gt;
&lt;li&gt;Join the call using the &lt;code&gt;dailyRoomUrl&lt;/code&gt; and the &lt;code&gt;userName&lt;/code&gt;, which were provided in the &lt;code&gt;join-form&lt;/code&gt;. This is done via the &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/join"&gt;&lt;code&gt;join()&lt;/code&gt;&lt;/a&gt; instance method.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since the call component is rendered as soon as the join form is submitted, we need to set up the call logic as soon as the component is initialized. In Angular, we use the &lt;a href="https://angular.io/api/core/OnInit"&gt;&lt;code&gt;ngOnInit()&lt;/code&gt; lifecycle method&lt;/a&gt; to capture this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// … See source code for more detail

export class CallComponent {
  Object = Object; // Allows us to use Object.values() in the HTML file
  @Input() dailyRoomUrl: string;
  @Input() userName: string;
  // .. See source code

  ngOnInit(): void {
    // Retrieve or create the call object
    this.callObject = DailyIframe.getCallInstance();
    if (!this.callObject) {
      this.callObject = DailyIframe.createCallObject();
    }

    // Add event listeners for Daily events
    this.callObject
      .on("joined-meeting", this.handleJoinedMeeting)
      .on("participant-joined", this.participantJoined)
      .on("track-started", this.handleTrackStartedStopped)
      .on("track-stopped", this.handleTrackStartedStopped)
      .on("participant-left", this.handleParticipantLeft)
      .on("left-meeting", this.handleLeftMeeting)
      .on("error", this.handleError);

    // Join Daily call
    this.callObject.join({
      userName: this.userName,
      url: this.dailyRoomUrl,
    });
  }

 // … See source code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three of the steps mentioned above happen as soon as the component is initialized. The second step – attaching Daily event listeners – is where most of the heavy lifting happens for managing the call. For each event shown in the code block above, an event handler is attached that will be invoked when the associated event is received.&lt;/p&gt;
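
&lt;p&gt;One detail worth noting: the &lt;code&gt;.on()&lt;/code&gt; calls can be chained because the method returns the call object itself. The sketch below uses a hypothetical &lt;code&gt;MockCallObject&lt;/code&gt; (not the real &lt;code&gt;daily-js&lt;/code&gt; class) to illustrate the pattern:&lt;/p&gt;

```typescript
type Handler = (payload?: unknown) => void;

// MockCallObject is a hypothetical sketch, not daily-js itself.
class MockCallObject {
  private handlers: { [event: string]: Handler[] } = {};

  on(event: string, handler: Handler): this {
    if (!this.handlers[event]) this.handlers[event] = [];
    this.handlers[event].push(handler);
    return this; // returning the instance is what makes chaining work
  }

  // Test helper that stands in for the SDK emitting an event.
  fire(event: string, payload?: unknown): void {
    for (const handler of this.handlers[event] || []) handler(payload);
  }
}

let joined = false;
const callObject = new MockCallObject()
  .on("joined-meeting", () => { joined = true; })
  .on("error", () => {});

// Simulate the SDK emitting "joined-meeting".
callObject.fire("joined-meeting");
```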

&lt;p&gt;As an overview, each event used above represents the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#joined-meeting"&gt;&lt;code&gt;”joined-meeting”&lt;/code&gt;&lt;/a&gt;: The local participant joined the call.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-joined"&gt;&lt;code&gt;”participant-joined”&lt;/code&gt;&lt;/a&gt;: A remote participant joined the call.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#track-started"&gt;&lt;code&gt;”track-started”&lt;/code&gt;&lt;/a&gt;: A participant's audio or video track began.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#track-stopped"&gt;&lt;code&gt;”track-stopped”&lt;/code&gt;&lt;/a&gt;: A participant's audio or video track ended.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-left"&gt;&lt;code&gt;”participant-left”&lt;/code&gt;&lt;/a&gt;: A remote participant left the call.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#left-meeting"&gt;&lt;code&gt;”left-meeting”&lt;/code&gt;&lt;/a&gt;: The local participant left the call.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#error"&gt;&lt;code&gt;”error”&lt;/code&gt;&lt;/a&gt;: Something went wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 &lt;em&gt;The next post in this series will cover the track-related events.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the call object is initialized and &lt;code&gt;join()&lt;/code&gt;-ed, we need to manage the call state related to which participants are in the call. Once we know who is in the call, we can render &lt;code&gt;video-tile&lt;/code&gt; components for them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a new participant&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;app-call&lt;/code&gt; component uses the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L42"&gt;&lt;code&gt;participants&lt;/code&gt;&lt;/a&gt; variable (an empty object to start) to track all participants currently in the call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export type Participant = {
  videoTrack?: MediaStreamTrack | undefined;
  audioTrack?: MediaStreamTrack | undefined;
  videoReady: boolean;
  audioReady: boolean;
  userName: string;
  local: boolean;
  id: string;
};

type Participants = {
  [sessionId: string]: Participant;
};

export class CallComponent {
  // … See source code
  participants: Participants = {};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the local participant or a new remote participant joins, we’ll respond the same way: by updating &lt;code&gt;participants&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In both &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L133"&gt;&lt;code&gt;this.handleJoinedMeeting()&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L147"&gt;&lt;code&gt;this.participantJoined()&lt;/code&gt;&lt;/a&gt; – the callbacks for &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#joined-meeting"&gt;&lt;code&gt;”joined-meeting”&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-joined"&gt;&lt;code&gt;”participant-joined”&lt;/code&gt;&lt;/a&gt; – the &lt;code&gt;this.addParticipant()&lt;/code&gt; method gets called to update &lt;code&gt;participants&lt;/code&gt;. Let’s see how this works using &lt;code&gt;this.participantJoined()&lt;/code&gt; as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;participantJoined = (e: DailyEventObjectParticipant | undefined) =&amp;gt; {
  if (!e) return;
  // Add remote participants to participants list used to display video tiles
  this.addParticipant(e.participant);
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With any Daily event, the event payload is available in the callback method (&lt;code&gt;e&lt;/code&gt; in this case). The payload includes the &lt;code&gt;participant&lt;/code&gt; information, which then gets passed to &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L102"&gt;&lt;code&gt;this.addParticipant()&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addParticipant(participant: DailyParticipant) {
  const p = this.formatParticipantObj(participant);
  this.participants[participant.session_id] = p;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two steps happen here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A reformatted copy of the participant object is made.&lt;/li&gt;
&lt;li&gt;The reformatted participant object gets added to the &lt;code&gt;participants&lt;/code&gt; object.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The participant object doesn’t need to get reformatted for this demo to work, but we do this to make the object easier to work with. The participant object that gets returned in the event payload contains a lot of nested information – much of which doesn’t affect our current feature set. To extract the information we do care about, we use &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L85"&gt;&lt;code&gt;this.formatParticipantObj()&lt;/code&gt;&lt;/a&gt; to format a copy of the participant object, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;formatParticipantObj(p: DailyParticipant): Participant {
  const { video, audio } = p.tracks;
  const vt = video?.persistentTrack;
  const at = audio?.persistentTrack;
  return {
    videoTrack: vt,
    audioTrack: at,
    videoReady:
      !!(vt &amp;amp;&amp;amp; (video.state === PLAYABLE_STATE || video.state === LOADING_STATE)),
    audioReady:
      !!(at &amp;amp;&amp;amp; (audio.state === PLAYABLE_STATE || audio.state === LOADING_STATE)),
    userName: p.user_name,
    local: p.local,
    id: p.session_id,
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we’re interested in here is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The video and audio tracks.&lt;/li&gt;
&lt;li&gt;Whether the tracks can be played in the demo’s UI, which is represented by &lt;code&gt;videoReady&lt;/code&gt; and &lt;code&gt;audioReady&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The user name of the participant, which we’ll display in the UI.&lt;/li&gt;
&lt;li&gt;Whether they’re local, since local participants will have device controls in their &lt;code&gt;video-tile&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The participant ID, so we can have a unique way of identifying them.
(Refer to our docs for an example of &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/participants"&gt;all the other participant information&lt;/a&gt; returned in the event.)&lt;/li&gt;
&lt;/ol&gt;
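
&lt;p&gt;The readiness flags in point 2 boil down to one check, isolated here as a sketch. &lt;code&gt;PLAYABLE_STATE&lt;/code&gt; and &lt;code&gt;LOADING_STATE&lt;/code&gt; mirror the string constants used in the demo source, and &lt;code&gt;isReady&lt;/code&gt; is a hypothetical helper (the track is typed loosely so the sketch runs outside a browser):&lt;/p&gt;

```typescript
// Assumed values mirroring the demo's constants: a daily-js track state can be
// "loading" (track is arriving) or "playable" (ready to render), among others.
const PLAYABLE_STATE = "playable";
const LOADING_STATE = "loading";

// Hypothetical helper isolating the videoReady/audioReady logic:
// a track is "ready" when it exists and is playable or still loading.
// (Typed as object rather than MediaStreamTrack so it runs outside the DOM.)
function isReady(track: object | undefined, state: string): boolean {
  return !!(track && (state === PLAYABLE_STATE || state === LOADING_STATE));
}
```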

&lt;p&gt;Once the participant has been added to the &lt;code&gt;participants&lt;/code&gt; object, a &lt;code&gt;video-tile&lt;/code&gt; can be rendered for them, but more on that in our next post!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Removing a remote participant from the call&lt;/strong&gt;&lt;br&gt;
A call participant can leave the call in different ways. For example, they can use the “Leave” button in the &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile/video-tile.component.ts"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component, or they can simply close the browser tab running the app. Either way, we need to know when a participant has left so we can properly update the app UI.&lt;/p&gt;

&lt;p&gt;When any remote participant leaves the call, we need to update &lt;code&gt;participants&lt;/code&gt; to remove that participant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handleParticipantLeft = (
  e: DailyEventObjectParticipantLeft | undefined
): void =&amp;gt; {
  if (!e) return;
  console.log(e.action);

  delete this.participants[e.participant.session_id];
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call/call.component.ts#L160"&gt;&lt;code&gt;handleParticipantLeft()&lt;/code&gt;&lt;/a&gt; (the handler for the &lt;a href="https://docs.daily.co/reference/daily-js/events/participant-events#participant-left"&gt;&lt;code&gt;”participant-left”&lt;/code&gt;&lt;/a&gt; event), we simply delete the entry for that participant in the participants object. This will in turn remove their video tile from the call UI.&lt;/p&gt;
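
&lt;p&gt;The add/remove bookkeeping can be sketched in isolation. The shapes and session ID below are trimmed, hypothetical versions of the demo’s types:&lt;/p&gt;

```typescript
// Trimmed shapes for illustration only; the demo's Participant has more fields.
type Participant = { userName: string; id: string };
type Participants = { [sessionId: string]: Participant };

const participants: Participants = {};

// "participant-joined": store the (formatted) participant under session_id.
participants["abc-123"] = { userName: "Jess", id: "abc-123" };

// "participant-left": deleting the key removes the participant, so the
// *ngFor over Object.values(participants) stops rendering their tile.
delete participants["abc-123"];
```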

&lt;p&gt;&lt;strong&gt;Removing a local participant from the call&lt;/strong&gt;&lt;br&gt;
When the local participant leaves, we need to reset the UI by unmounting the &lt;code&gt;app-call&lt;/code&gt; component and going back to rendering the &lt;code&gt;join-form&lt;/code&gt; view instead, since the participant is no longer in the call. Instead of updating &lt;code&gt;participants&lt;/code&gt;, we can just &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/destroy"&gt;&lt;code&gt;destroy()&lt;/code&gt;&lt;/a&gt; the call object and emit &lt;code&gt;callEnded()&lt;/code&gt;, the output property passed from &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container/daily-container.component.ts"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handleLeftMeeting = (e: DailyEventObjectNoPayload | undefined): void =&amp;gt; {
  if (!e) return;
  console.log(e);
  this.joined = false; // this wasn’t mentioned above but is used in the UI
  this.callObject.destroy();
  this.callEnded.emit();
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;callEnded()&lt;/code&gt; is emitted, &lt;code&gt;app-daily-container&lt;/code&gt; will reset the &lt;code&gt;dailyRoomUrl&lt;/code&gt; and &lt;code&gt;userName&lt;/code&gt; class variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// In DailyContainerComponent:

callEnded(): void {
  // Truthy value will show the CallComponent; otherwise, the JoinFormComponent is shown.
  this.dailyRoomUrl = "";
  this.userName = "";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a reminder, when &lt;code&gt;dailyRoomUrl&lt;/code&gt; is falsy, the &lt;code&gt;join-form&lt;/code&gt; component gets rendered instead of the &lt;code&gt;app-call&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And with that, we have a Daily call that can be joined and left!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In today’s post, we covered how participants can submit an HTML form in an Angular app to join a Daily video call, as well as how to manage participant state when participants join or leave the call.&lt;/p&gt;

&lt;p&gt;In the next post, we’ll discuss how to render video tiles for each participant while they’re in the call, including our recommendations for optimizing video performance in an app like this.&lt;/p&gt;

&lt;p&gt;If you have any questions or thoughts about implementing your Daily-powered video app with Angular, &lt;a href="https://www.daily.co/company/contact/support/"&gt;reach out to our support team&lt;/a&gt; or head over to our &lt;a href="https://community.daily.co/"&gt;WebRTC community, peerConnection&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webrtc</category>
      <category>angular</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build a Daily video call app with Angular and TypeScript (Part 1)</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Thu, 12 Oct 2023 17:23:49 +0000</pubDate>
      <link>https://forem.com/trydaily/build-a-daily-video-call-app-with-angular-and-typescript-part-1-1lf9</link>
      <guid>https://forem.com/trydaily/build-a-daily-video-call-app-with-angular-and-typescript-part-1-1lf9</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/jess/"&gt;Jess Mitchell&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’ve read Daily’s blog before, you’ll know we’re happy to show developers all the different ways you can incorporate Daily’s SDKs into your apps. One of the best aspects of &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily's Client SDK for JavaScript&lt;/a&gt; is that it’s front-end framework agnostic. If you love &lt;a href="https://www.daily.co/blog/tag/react/"&gt;React&lt;/a&gt; and &lt;a href="https://www.daily.co/blog/tag/next-js/"&gt;Next.js&lt;/a&gt;, we have several tutorials and demo apps for you! If you prefer &lt;a href="https://www.daily.co/blog/tag/svelte/"&gt;Svelte&lt;/a&gt;, &lt;a href="https://www.daily.co/blog/tag/vue/"&gt;Vue&lt;/a&gt;, or &lt;a href="https://www.daily.co/blog/tag/social-gaming/"&gt;plain JavaScript&lt;/a&gt;, we’ve got you covered there, too.&lt;/p&gt;

&lt;p&gt;In this series, we’ll be expanding our front-end framework resources even further with our newest series on building an &lt;a href="https://angular.io/"&gt;Angular&lt;/a&gt; app with &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This will start as a four-part series with the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An overview of the Angular demo app’s features and component structure.&lt;/li&gt;
&lt;li&gt;Building the user flow of joining and leaving a Daily call.&lt;/li&gt;
&lt;li&gt;How to render video tiles for each participant and manage any track changes.&lt;/li&gt;
&lt;li&gt;Adding a custom chat component to the video call app so participants can send text-based messages to each other in the call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g2dH2U_3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8dozzcbj8nyo9nj84cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g2dH2U_3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8dozzcbj8nyo9nj84cd.png" alt="Three call participants in a video call, two of them with their cameras on" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In-call demo app view&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Today’s post will focus on the first topic: an overview of the demo app and its components. To follow along with this series, we recommend having existing familiarity with JavaScript/TypeScript, HTML, and Angular. (We won’t go into the CSS styling much since it’s not a key factor in the functionality of the app.)&lt;/p&gt;

&lt;p&gt;We’ve got a lot to get through, so let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the demo app locally
&lt;/h2&gt;

&lt;p&gt;To follow along with this series, we recommend testing the demo app yourself and reviewing the &lt;a href="https://github.com/daily-demos/daily-angular"&gt;source code&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Daily account and a Daily room
&lt;/h3&gt;

&lt;p&gt;To use the demo app, you will first need to &lt;a href="https://dashboard.daily.co/signup"&gt;create a free Daily account&lt;/a&gt;. Once you have an account, you can &lt;a href="https://dashboard.daily.co/rooms/create"&gt;create a Daily room&lt;/a&gt;. We recommend using the default settings when creating a room, but if you do update the settings, leave the room &lt;code&gt;privacy&lt;/code&gt; configuration set to &lt;a href="https://docs.daily.co/reference/rest-api/rooms/config#privacy"&gt;&lt;code&gt;public&lt;/code&gt;&lt;/a&gt;. (The app isn’t currently set up to handle access requests, like knocking, but you could add that feature.)&lt;/p&gt;

&lt;p&gt;Make note of the room’s URL, which will be required to use the demo app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running the demo app
&lt;/h3&gt;

&lt;p&gt;To use the demo app locally, start by installing the Angular CLI if you haven’t already:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install -g @angular/cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, clone the demo &lt;a href="https://github.com/daily-demos/daily-angular"&gt;repository&lt;/a&gt; and navigate into the demo’s root directory on your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/daily-demos/daily-angular.git
cd daily-angular
Bash
Install its dependencies, start the local server, and visit the app in your browser of choice at localhost:4200.

npm install
npm run start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We encourage developers to add to this demo app as needed to try out different features available through &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;. If you want to add a new feature, you can create new components with &lt;code&gt;ng generate component &amp;lt;component-name&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we’re set up, let’s dig into the feature set.&lt;/p&gt;

&lt;h2&gt;
  
  
  App features overview
&lt;/h2&gt;

&lt;p&gt;As mentioned, the demo app will be built with &lt;a href="https://angular.io/"&gt;Angular&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;. It will support multi-participant calls, but will focus on the core functionality and not contain &lt;a href="https://www.daily.co/blog/tips-to-improve-performance/"&gt;performance optimizations&lt;/a&gt; for larger calls, like conferences. (Check out our &lt;a href="https://docs.daily.co/guides/scaling-calls/large-real-time-calls"&gt;large calls guide&lt;/a&gt; for more information about building apps for up to 100,000 real-time participants.)&lt;/p&gt;

&lt;p&gt;The app UI will be fully customized, but you could also embed &lt;a href="https://docs.daily.co/guides/products/prebuilt"&gt;Daily Prebuilt&lt;/a&gt; if you prefer a ready-made solution.&lt;/p&gt;

&lt;p&gt;The actual features we’ll have built by the end of this series include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Two views: a home page with a form to join the call and an in-call view.&lt;/li&gt;
&lt;li&gt;The join form will accept two pieces of information: your name and the Daily room you’d like to join.&lt;/li&gt;
&lt;li&gt;The in-call view will have a video tile for each participant. Participants can turn their video and audio on/off, and can click a button to leave the call. Video tiles will automatically update when a remote participant toggles their media or their track changes (e.g., if they change their input device).&lt;/li&gt;
&lt;li&gt;An error screen is shown if a call-related error occurs, and a “Not Found” page is shown if you navigate away from the main route.&lt;/li&gt;
&lt;/ol&gt;
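
The toggle and leave controls in the feature list boil down to flipping a flag and forwarding it to the call object. Here is a minimal TypeScript sketch, assuming daily-js's `setLocalVideo`, `setLocalAudio`, and `leave` call object methods behind a stub interface; the demo's actual wiring may differ:

```typescript
// Minimal interface mirroring the daily-js call object methods used here.
interface CallClient {
  setLocalVideo(on: boolean): void;
  setLocalAudio(on: boolean): void;
  leave(): void;
}

// Illustrative control-panel state: each toggle flips local state and
// forwards the new value to the call client.
class CallControls {
  videoOn = true;
  audioOn = true;

  constructor(private client: CallClient) {}

  toggleVideo(): void {
    this.videoOn = !this.videoOn;
    this.client.setLocalVideo(this.videoOn);
  }

  toggleAudio(): void {
    this.audioOn = !this.audioOn;
    this.client.setLocalAudio(this.audioOn);
  }

  leaveCall(): void {
    this.client.leave();
  }
}
```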

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9sgOHnTf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gt9ca50h57wv9dvjnco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9sgOHnTf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3gt9ca50h57wv9dvjnco.png" alt="Entry form to join a Daily video call room" width="800" height="358"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The demo app's join form in the home page view&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  App component structure
&lt;/h2&gt;

&lt;p&gt;Let’s now see how our Angular components are structured relative to each other:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lDrqM8ex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjbt590hoo2d2h16jiho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lDrqM8ex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjbt590hoo2d2h16jiho.png" alt="A schematic showing app-root, app-call, and other video call Angular application components" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Component structure in the Angular demo app&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each of the components in the flow chart above is only rendered once, except for the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component. A &lt;code&gt;video-tile&lt;/code&gt; component is rendered for each participant present.&lt;/p&gt;
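
To make the one-tile-per-participant idea concrete, here is a hypothetical helper (not from the demo) that turns a participants object, keyed by session id the way daily-js reports participants, into the list a template could iterate over with one tile per entry:

```typescript
// Simplified participant shape; the real daily-js participant object
// carries much more (tracks, permissions, etc.).
interface Participant {
  session_id: string;
  user_name: string;
  local: boolean;
}

// Derive an ordered tile list: local participant first, then remotes.
function toTileList(participants: Record<string, Participant>): Participant[] {
  return Object.values(participants).sort(
    (a, b) => Number(b.local) - Number(a.local)
  );
}
```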

&lt;p&gt;The names used in the chart refer to each component’s selector, which can be found in the component’s TypeScript file. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Component({
  selector: "app-call",
  templateUrl: "./call.component.html",
  styleUrls: ["./call.component.css"],
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In general, we’ll refer to a component by its selector in this series, but we may also use its class name if we’re specifically referring to the class definition.&lt;/p&gt;

&lt;p&gt;If you’re newer to Angular, another detail to be aware of is that each component has three files associated with it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A TypeScript file that defines the component class.&lt;/li&gt;
&lt;li&gt;An HTML file defining its DOM elements.&lt;/li&gt;
&lt;li&gt;A CSS file to style the DOM elements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two are the files we’ll focus on in this series.&lt;/p&gt;

&lt;p&gt;And, lastly, if you’re specifically looking for example code for the video aspects of building a video app in Angular, the components with the most Daily-related code are &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/chat"&gt;&lt;code&gt;app-chat&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, let’s see what each component does and when it's rendered.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;app-root&lt;/code&gt; and routing
&lt;/h3&gt;

&lt;p&gt;Every Angular app will have a root component that parents the entire app. It includes any HTML elements that are rendered on every page, regardless of the route (in our case, the &lt;code&gt;header&lt;/code&gt; and &lt;code&gt;main&lt;/code&gt; elements). It also manages app routing and will conditionally render a component based on how the routes are defined.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/app.component.html"&gt;&lt;code&gt;app.component.html&lt;/code&gt;&lt;/a&gt;, we see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;main class="content"&amp;gt;
  &amp;lt;!-- App router --&amp;gt;
  &amp;lt;router-outlet&amp;gt;&amp;lt;/router-outlet&amp;gt;
&amp;lt;/main&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;router-outlet&lt;/code&gt; will render the component that matches the route currently being visited.&lt;/p&gt;

&lt;p&gt;The logic for which component should be rendered is defined in &lt;a href="https://github.com/daily-demos/daily-angular/blob/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/app-routing.module.ts"&gt;&lt;code&gt;app-routing.module.ts&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { NgModule } from "@angular/core";
import { RouterModule, Routes } from "@angular/router";
import { PageNotFoundComponent } from "./page-not-found/page-not-found.component";
import { DailyContainerComponent } from "./daily-container/daily-container.component";

const routes: Routes = [
  { path: "", component: DailyContainerComponent },
  { path: "**", component: PageNotFoundComponent },
];

@NgModule({
  imports: [RouterModule.forRoot(routes)],
  exports: [RouterModule],
})
export class AppRoutingModule {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;routes&lt;/code&gt; array, each path is mapped to a different component. Since our app routing is quite simple, we only have two routes: the base path and literally anything else. If you are visiting the base path (e.g., &lt;code&gt;localhost:4200/&lt;/code&gt;), the &lt;code&gt;DailyContainerComponent&lt;/code&gt; is rendered (a.k.a. &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt;). Otherwise, a message is shown that the page isn’t found (the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/page-not-found"&gt;&lt;code&gt;app-page-not-found&lt;/code&gt;&lt;/a&gt; component).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j-o25_9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64qfnfb9h89epu3lsbbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j-o25_9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64qfnfb9h89epu3lsbbb.png" alt="&amp;quot;Uh oh. This page doesn't exist.&amp;quot;" width="800" height="172"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;code&gt;app-page-not-found&lt;/code&gt; view&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To add more routes to the app, update the &lt;code&gt;routes&lt;/code&gt; array, keeping the wildcard option as the last item so it catches any routes that aren’t specifically included.&lt;/p&gt;
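
As an illustration of that ordering rule, a hypothetical helper (not part of the demo) could insert new routes just before the `"**"` wildcard so the catch-all keeps matching last. Component names are plain strings here purely for the sketch; in a real Angular app they'd be component classes:

```typescript
// Simplified route shape for illustration; Angular's Routes entries
// reference component classes rather than strings.
interface Route {
  path: string;
  component: string;
}

// Insert a route before the "**" wildcard so the catch-all stays last.
function addRoute(routes: Route[], route: Route): Route[] {
  const wildcardIdx = routes.findIndex((r) => r.path === "**");
  if (wildcardIdx === -1) return [...routes, route];
  return [
    ...routes.slice(0, wildcardIdx),
    route,
    ...routes.slice(wildcardIdx),
  ];
}
```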

&lt;h3&gt;
  
  
  &lt;code&gt;app-daily-container&lt;/code&gt;, &lt;code&gt;join-form&lt;/code&gt;, and &lt;code&gt;app-call&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Moving further down the flow chart, we have &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt; next and two of its child components: &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The role of &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/daily-container"&gt;&lt;code&gt;app-daily-container&lt;/code&gt;&lt;/a&gt; is to determine whether &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/join-form"&gt;&lt;code&gt;join-form&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/call"&gt;&lt;code&gt;app-call&lt;/code&gt;&lt;/a&gt; should be rendered. They will never both be rendered at the same time. It does this by keeping track of whether the HTML form in &lt;code&gt;join-form&lt;/code&gt; has been submitted and, if so, rendering the &lt;code&gt;app-call&lt;/code&gt; component instead. (In the next post in this series, we’ll look more closely at the code that handles this.)&lt;/p&gt;
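
Stripped of Angular decorators and templates, the container's view-switching logic amounts to something like this sketch (illustrative only, not the demo's actual code):

```typescript
type View = "join-form" | "call";

// The container tracks whether the join form has been submitted and
// renders app-call only after that; the two views are never shown together.
class DailyContainer {
  private joined = false;

  // Called when join-form emits its submit event.
  onJoinFormSubmit(): void {
    this.joined = true;
  }

  // Called when the user leaves the call.
  onLeave(): void {
    this.joined = false;
  }

  // Mirrors the template's conditional rendering: exactly one child view.
  currentView(): View {
    return this.joined ? "call" : "join-form";
  }
}
```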

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SDqOK-8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgy3ktmztglhujg7wuzm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SDqOK-8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgy3ktmztglhujg7wuzm.gif" alt="Animation showing user clicking Join to enter the call" width="600" height="260"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Submitting the form in &lt;code&gt;join-form&lt;/code&gt; will cause &lt;code&gt;app-daily-container&lt;/code&gt; to render &lt;code&gt;app-call&lt;/code&gt; instead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;join-form&lt;/code&gt; accepts two pieces of information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your user name, which will be displayed in-call.&lt;/li&gt;
&lt;li&gt;The URL for the Daily room you would like to join. (This is the URL for the room created above.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;app-call&lt;/code&gt; is the actual video call UI where you can see video tiles for any participants who have joined the same Daily room.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;video-tile&lt;/code&gt; and &lt;code&gt;error-message&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;As mentioned, each participant will have a &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/video-tile"&gt;&lt;code&gt;video-tile&lt;/code&gt;&lt;/a&gt; component rendered for them to indicate they are present in the call. The &lt;code&gt;video-tile&lt;/code&gt; component includes the participant’s name and icons to indicate if their audio is currently on. It also has an HTML &lt;code&gt;video&lt;/code&gt; element for all participants and an audio element for remote participants. (An HTML &lt;code&gt;audio&lt;/code&gt; element is not rendered for the local participant – you! – because you can already hear yourself speak, meaning you don’t need to have your audio played back to you.) If the participant’s video is turned off, a placeholder tile is rendered over the video.&lt;/p&gt;
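
The rendering rules above can be summarized in a small illustrative function (not the demo's code): every tile gets a `video` element, only remote tiles get an `audio` element, and a placeholder covers the video when it's off.

```typescript
interface TileState {
  local: boolean;
  videoOn: boolean;
}

// Decide which media elements a tile needs, per the rules described above.
function elementsFor(tile: TileState): string[] {
  const els = ["video"];
  if (!tile.local) els.push("audio"); // you never hear your own audio back
  if (!tile.videoOn) els.push("placeholder"); // covers the video when off
  return els;
}
```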

&lt;p&gt;You, the local participant, will also see a control panel over your tile with buttons to toggle your audio and video on/off, as well as a button to leave the call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1yEv3LKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eu3s5zajczvjmu0ahh6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1yEv3LKq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eu3s5zajczvjmu0ahh6l.png" alt="Two video tiles: The local participant with the control panel and a remote participant with their video off." width="800" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Two video tiles: The local participant with the control panel and a remote participant with their video off.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If there is an error related to the call, the &lt;a href="https://github.com/daily-demos/daily-angular/tree/8583a8688c1fd1773bb79175edaa6b6848861e06/src/app/error-message"&gt;&lt;code&gt;error-message&lt;/code&gt;&lt;/a&gt; component is shown instead of the list of &lt;code&gt;video-tile&lt;/code&gt; components. It is rendered when the &lt;a href="https://docs.daily.co/reference/daily-js/events/meeting-events#error"&gt;&lt;code&gt;error&lt;/code&gt; event&lt;/a&gt; is emitted by &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt;. It displays the error message included in the error event’s payload and provides a button to leave the call.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;app-chat&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;app-chat&lt;/code&gt; is also a child component of &lt;code&gt;app-daily-container&lt;/code&gt;. It is not related to the video part of the call; rather, it represents the text-based chat feature. This component can actually be used in either a custom UI – like this demo app – or in an app that embeds Daily Prebuilt.&lt;/p&gt;

&lt;p&gt;Like &lt;code&gt;app-call&lt;/code&gt;, the chat component requires that the call has been joined before it will work, so it is rendered once the form in &lt;code&gt;join-form&lt;/code&gt; has been submitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mxaL4l1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8q2qrwik4fs66p5cz61g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mxaL4l1l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8q2qrwik4fs66p5cz61g.png" alt="Slide-out chat panel on the right-hand side of the Angular video app" width="800" height="513"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The &lt;code&gt;app-chat&lt;/code&gt; component highlighted in the app’s UI.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Today, we reviewed the structure and features of Daily’s new Angular demo app. In the next post, we’ll look at how the &lt;code&gt;app-daily-container&lt;/code&gt; component responds to submitting the form in the &lt;code&gt;join-form&lt;/code&gt; component. We’ll also review how to use &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily’s Client SDK for JavaScript&lt;/a&gt; to let the local participant join or leave a call.&lt;/p&gt;

&lt;p&gt;Check out the complete &lt;a href="https://github.com/daily-demos/daily-angular"&gt;source code&lt;/a&gt; for this demo app to get a head start and keep an eye on our &lt;a href="https://www.daily.co/blog/tag/angular"&gt;blog&lt;/a&gt; for future posts in this series!&lt;/p&gt;

</description>
      <category>angular</category>
      <category>typescript</category>
      <category>webrtc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Measuring performance impact of pagination in video apps</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Tue, 10 Oct 2023 16:58:29 +0000</pubDate>
      <link>https://forem.com/trydaily/measuring-performance-impact-of-pagination-in-video-apps-4nci</link>
      <guid>https://forem.com/trydaily/measuring-performance-impact-of-pagination-in-video-apps-4nci</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/olabisi/"&gt;Olabisi Oduola&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our previous posts, we discussed how pagination can be used in a &lt;a href="https://www.daily.co/blog/add-pagination-to-a-custom-daily-video-chat-app-to-support-larger-meetings/"&gt;custom Daily video chat app to support larger meetings&lt;/a&gt; and how to &lt;a href="https://www.daily.co/blog/optimize-call-quality-in-larger-calls-by-manually-managing-media-tracks-in-a-paginated-video-call-ui/"&gt;manually manage media tracks in a paginated video call&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, I will show you how to benchmark your pagination implementation and see for yourself how much impact it has on your app’s performance and user experience.&lt;/p&gt;

&lt;p&gt;When building a custom video app with &lt;a href="https://docs.daily.co/reference/daily-js"&gt;Daily's Client SDK for JavaScript&lt;/a&gt;, developers can tailor their implementation to meet their exact product needs. This level of customization means that implementation decisions, such as how you approach video tile pagination, can affect your app’s performance both positively and negatively. This is an important consideration for any type of video call app, and one we can help you navigate.&lt;/p&gt;

&lt;p&gt;💡&lt;em&gt;If you are looking for a video call solution that is already optimized for various platforms and network conditions, check out &lt;a href="https://www.daily.co/products/prebuilt-video-call-app/"&gt;Daily Prebuilt&lt;/a&gt;–our fully-featured video call embed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One common question with building any custom video app is: How many video tiles should we show per page? This decision can have a significant impact on the application’s performance and user experience. &lt;/p&gt;

&lt;p&gt;If we show too &lt;em&gt;few&lt;/em&gt; videos per page, we may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frustrate users: It can get annoying to click through many pages to find the participant video you are looking for, or miss out on some relevant or interesting videos.&lt;/li&gt;
&lt;li&gt;Compromise retention: Users may lose patience if they have to wait too long for the next page of videos to load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, if we show too many videos per page, we may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decrease performance: Loading and rendering many videos at once comes with extra bandwidth and CPU consumption, which can slow down the app and degrade the quality of the videos over time.&lt;/li&gt;
&lt;li&gt;Clutter the visual field: Showing too many videos per page may create a crowded and overwhelming visual layout, making it tricky for users to find and focus on the content they are interested in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, finding the optimal number of video tiles per page is a matter of balancing these factors. We need to find the “sweet spot” that maximizes both performance and user satisfaction.&lt;/p&gt;

&lt;p&gt;That sweet spot is different for everyone. Different video apps have their own goals, constraints, and user flows. Therefore, we need to consider some factors that may influence this decision, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The nature of the app: Calls intended for smaller audiences, like small group classrooms or webinars, may be designed to support fewer, high-resolution video streams. Therefore, they may not benefit from pagination. On the other hand, if your app involves having many users interacting with each other, you will likely want to enable pagination. You’ll probably also experiment with bandwidth, output, and receive settings to sustain lively conversation in a larger group.&lt;/li&gt;
&lt;li&gt;The device and network conditions: The optimal number of video tiles per page may vary depending on the device and network capabilities of the users. For example, mobile devices have smaller screens, lower CPU power, and lower bandwidth than desktop devices. This may affect how many videos can be loaded and displayed effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How can you test and iterate on different pagination options?
&lt;/h2&gt;

&lt;p&gt;We will use our existing &lt;a href="https://github.com/daily-demos/track-subscriptions"&gt;pagination demo app&lt;/a&gt; introduced in a &lt;a href="https://www.daily.co/blog/optimize-call-quality-in-larger-calls-by-manually-managing-media-tracks-in-a-paginated-video-call-ui/"&gt;previous post&lt;/a&gt;, which allows you to control the number of video tiles shown in your app. We will utilize Chrome developer tools to compare the performance of different pagination scenarios. We’ll then provide some tips on how to find the optimal balance between performance and user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;To follow along with this post, prepare your environment as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://www.google.com/chrome/"&gt;Google Chrome&lt;/a&gt; if you don’t already have it.&lt;/li&gt;
&lt;li&gt;Create a free &lt;a href="https://dashboard.daily.co/signup"&gt;Daily account&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Retrieve your API key from the &lt;a href="https://dashboard.daily.co/developers"&gt;developer dashboard&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Clone the &lt;a href="https://github.com/daily-demos/track-subscriptions"&gt;&lt;code&gt;daily-demos/track-subscriptions&lt;/code&gt;&lt;/a&gt; repo from GitHub.&lt;/li&gt;
&lt;li&gt;Navigate to the project directory and check out my &lt;code&gt;custom-tile-pagination&lt;/code&gt; branch.&lt;/li&gt;
&lt;li&gt;Install the necessary dependencies by running &lt;code&gt;yarn&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create an &lt;code&gt;.env.local&lt;/code&gt; file in the root directory of the app and add your own &lt;code&gt;DAILY_DOMAIN&lt;/code&gt; and &lt;code&gt;DAILY_API_KEY&lt;/code&gt; from your Daily account dashboard. Remember to never commit this file to source control! Your Daily API key is secret.&lt;/li&gt;
&lt;/ol&gt;
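
As a sketch, the `.env.local` file is just two variables. The values below are placeholders; use the domain and API key from your own Daily dashboard:

```shell
# .env.local — placeholder values, never commit this file
DAILY_DOMAIN=your-daily-domain
DAILY_API_KEY=your-daily-api-key
```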

&lt;p&gt;Finally, you can start the development server by running &lt;code&gt;yarn dev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Open the app in Google Chrome by navigating to &lt;strong&gt;&lt;a href="http://localhost:3000/"&gt;http://localhost:3000/&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can now create a new video call by clicking on the “Create room” button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nc8iMUkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5w5vr13rhhu9a2xllqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nc8iMUkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5w5vr13rhhu9a2xllqc.png" alt="Entering a Daily video call room" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Entering a Daily video call room&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;(Incognito Mode is preferable if you have a lot of extensions installed; it ensures that Chrome runs in a clean state, since those extensions might create noise in your performance measurements.)&lt;/p&gt;

&lt;p&gt;For the purpose of this article, I modified the &lt;a href="https://github.com/daily-demos/track-subscriptions/compare/custom-tile-pagination#diff-d0a1cd77dca25318be3c71a4b2f56c91cdca61b3997872a88339902808338439R17"&gt;&lt;code&gt;hooks/useAspectGrid.js&lt;/code&gt;&lt;/a&gt; file in the cloned app so we can pass in custom values for &lt;code&gt;customMaxTilesPerPage&lt;/code&gt;. This determines the maximum number of video tiles displayed on each page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export const useAspectGrid = (
  gridRef,
  numTiles,
  customMaxTilesPerPage = 20
) =&amp;gt; {

    // ...Existing logic

    const MIN_TILE_WIDTH = useMemo(() =&amp;gt; {
        const { width } = dimensions;
        const maxColumnsForCustomMax = Math.max(1, Math.floor(width / customMaxTilesPerPage));
        return width / maxColumnsForCustomMax;
      }, [dimensions, customMaxTilesPerPage]);

    // The rest of the hook…
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this modification, the tile width will reduce as the &lt;code&gt;customMaxTilesPerPage&lt;/code&gt; value increases.&lt;/p&gt;
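
To see what the modified calculation yields for concrete inputs, the same math can be extracted into a standalone function (mirroring the hook above, purely for illustration):

```typescript
// Same arithmetic as the modified MIN_TILE_WIDTH computation in
// useAspectGrid: derive a column count from the grid width and the
// per-page tile cap, then divide the width across those columns.
function minTileWidth(gridWidth: number, customMaxTilesPerPage: number): number {
  const maxColumns = Math.max(1, Math.floor(gridWidth / customMaxTilesPerPage));
  return gridWidth / maxColumns;
}
```

For example, with a 1280px-wide grid and `customMaxTilesPerPage = 20`, `Math.floor(1280 / 20)` gives 64 columns, so the function returns 1280 / 64 = 20.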

&lt;p&gt;When you increase the value of &lt;code&gt;customMaxTilesPerPage&lt;/code&gt; above, you are telling the &lt;code&gt;useAspectGrid()&lt;/code&gt; hook (which creates the grid of video tiles to be displayed on a page) to display more tiles on each page. This means each tile will need to be narrower in order to fit into the available space. In the code above, I set the maximum tiles to display per page to 20. This means each participant will subscribe to the video and audio tracks of up to 20 other video call users.&lt;/p&gt;

&lt;p&gt;Now, we can test the effect this will have on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the runtime performance of your pagination implementation using Chrome DevTools
&lt;/h2&gt;

&lt;p&gt;I will be using the Chrome DevTools performance monitor to visualize how our app performs during runtime and identify bottlenecks or other inefficiencies. The performance monitor provides a real-time view of the runtime performance of the app by displaying graphs of various performance metrics that update in real time.&lt;/p&gt;

&lt;p&gt;There are many metrics that can be used to measure the runtime performance of an app, but not all of them are relevant for our use case. For this Daily video demo app, we are interested in metrics that reflect how well the app handles multiple video renders, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: The CPU metric shows how much processing power your app consumes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JS heap memory&lt;/strong&gt;: This metric shows how much memory your app allocates and deallocates. (Check out my earlier post about &lt;a href="https://www.daily.co/blog/introduction-to-memory-management-in-node-js-applications/"&gt;memory management&lt;/a&gt; if you want to learn more about how heap memory works.) &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPS&lt;/strong&gt;: The frames per second (FPS) metric shows how smoothly your app renders on the screen. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt;: The network metric shows how much data your app transfers over the network. Daily also &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/get-network-stats"&gt;monitors this&lt;/a&gt;, dynamically adjusting call participants' receive settings to provide them with the &lt;a href="https://docs.daily.co/guides/architecture-and-monitoring/intro-to-video-arch"&gt;best possible experience&lt;/a&gt; for their network conditions.&lt;/li&gt;
&lt;/ul&gt;
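&lt;p&gt;For intuition about where the FPS metric comes from: it can be derived by recording frame timestamps (such as those passed to &lt;code&gt;requestAnimationFrame&lt;/code&gt; callbacks) and counting frames over a sliding window. The sketch below is a hypothetical sampler for illustration, not how DevTools computes its figure:&lt;/p&gt;

```javascript
// Hypothetical FPS sampler: feed it frame timestamps (in ms), such as those
// passed to requestAnimationFrame callbacks, and read the frame rate over
// the most recent one-second window.
class FpsMeter {
  constructor(windowMs = 1000) {
    this.windowMs = windowMs;
    this.timestamps = [];
  }
  addFrame(t) {
    this.timestamps.push(t);
    // Drop frames that have fallen out of the sliding window.
    while (this.timestamps.length && t - this.timestamps[0] > this.windowMs) {
      this.timestamps.shift();
    }
  }
  fps() {
    if (this.timestamps.length < 2) return 0;
    const span =
      this.timestamps[this.timestamps.length - 1] - this.timestamps[0];
    return ((this.timestamps.length - 1) / span) * 1000;
  }
}

// 60 frames spaced ~16.7ms apart corresponds to roughly 60 FPS:
const meter = new FpsMeter();
for (let i = 0; i < 60; i++) meter.addFrame(i * (1000 / 60));
```

&lt;p&gt;In a browser you would call &lt;code&gt;addFrame(timestamp)&lt;/code&gt; inside a &lt;code&gt;requestAnimationFrame&lt;/code&gt; loop; dropped frames widen the gaps between timestamps and pull the computed rate down.&lt;/p&gt;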

&lt;h2&gt;
  
  
  Opening the performance monitor
&lt;/h2&gt;

&lt;p&gt;First, open Chrome DevTools by using the shortcut Ctrl+Shift+I (Windows, Linux) or Command+Option+I (macOS).&lt;/p&gt;

&lt;p&gt;In DevTools, on the main toolbar, select the “Performance monitor” tab. If that tab isn't visible, click the More tabs button, or find it under the More tools menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R8gPWmSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k6coom4w9fr67vydo50u.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R8gPWmSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k6coom4w9fr67vydo50u.gif" alt="Opening the DevTools performance monitor" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Opening the DevTools performance monitor&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To get the FPS readings, select the Rendering tool, and turn on Frame Rendering Stats. A new overlay will appear in the top-left corner of your webpage. The Frame Rendering Stats overlay provides real-time estimates for FPS as your app runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x8tURBPk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2uh5vmcs0za3o62hdgps.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x8tURBPk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2uh5vmcs0za3o62hdgps.gif" alt="20-tile in-app view with performance stats shown on the top left and right-hand sides" width="800" height="463"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;20-tile in-app view with performance stats shown on the top left and right-hand sides&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Based on the screenshot above, we can see that showing 20 video tiles results in a noticeable performance hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CPU usage reaches levels of 80-100%, indicating a considerable load that might result in overheating or battery depletion on the device.&lt;/li&gt;
&lt;li&gt;The memory consumption hovers between 80-100MB, displaying frequent fluctuations within the JavaScript heap graph.&lt;/li&gt;
&lt;li&gt;The frames per second (FPS) drops below 20, which is quite low and can lead to rendering issues such as stuttering or lagging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b4IWfys4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5nlhvisqxg8bnwbwlrx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b4IWfys4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5nlhvisqxg8bnwbwlrx.gif" alt="12-tile in-app view" width="800" height="460"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;12-tile in-app view&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we switch to showing 12 tiles (which is the demo app’s default), we can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CPU usage is around 10-20%, which is not too high for most devices.&lt;/li&gt;
&lt;li&gt;The application is using a moderate amount of memory (40-50MB). This means the app is not creating and deleting too many objects.&lt;/li&gt;
&lt;li&gt;The FPS is around 50-60, which is ideal for smooth rendering and a smooth user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, we can conclude that &lt;em&gt;with no other changes to optimize output and receive bandwidth&lt;/em&gt;, showing 12 video tiles is optimal for our app, while showing 20 video tiles concurrently, in this configuration, is suboptimal. Seeing the impact first-hand helps us tune our pagination settings deliberately.&lt;/p&gt;

&lt;p&gt;Of course, the number of video tracks being played is just one of several factors that affect the app’s performance; consider it a starting point. If your use case calls for playback of more (or fewer!) simultaneous video and audio tracks, you can then delve into Daily’s &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-send-settings#main"&gt;send settings&lt;/a&gt; and &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/update-receive-settings#main"&gt;receive settings&lt;/a&gt; options to fine-tune your configuration further.&lt;/p&gt;

&lt;p&gt;Let’s dig a little deeper into the different factors we’re using to judge the performance impact of our pagination configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU
&lt;/h2&gt;

&lt;p&gt;CPU usage numbers may vary depending on the device, browser, network, and other factors, but we can use some general guidelines to gauge how our pagination or other settings affect performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We want to keep the CPU usage as low as possible to avoid overheating or draining the battery of the device.&lt;/li&gt;
&lt;li&gt;Ideally, keep your CPU usage below 50% for a fast and responsive app.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that mobile devices have much less CPU power than desktops and laptops. Whenever you profile a page, use CPU throttling to simulate how your page performs on mobile devices. To simulate a mobile CPU, you can use a 4x slowdown for mid-tier devices or a 6x slowdown for low-end mobile devices. You may find that a mobile app benefits from subscribing to and displaying fewer video tracks than a desktop app. This can also be beneficial from a UI perspective, considering that on mobile there’s less display space to work with.&lt;/p&gt;
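&lt;p&gt;One practical response to that profiling data is to choose the page size per device class. The heuristic below is purely illustrative — the thresholds are my assumptions, not a Daily recommendation — so profile your own app (with CPU throttling enabled) to find the right numbers:&lt;/p&gt;

```javascript
// Hypothetical heuristic: choose how many tiles to show per page based on
// rough device signals. Thresholds are illustrative only.
function chooseMaxTilesPerPage({ cpuCores, screenWidthPx }) {
  const isSmallScreen = screenWidthPx < 768; // a typical mobile breakpoint
  if (isSmallScreen || cpuCores <= 4) return 6; // low-end or mobile
  if (cpuCores <= 8) return 12; // mid-tier
  return 20; // high-end desktop
}

// In the browser these signals could come from
// navigator.hardwareConcurrency and window.screen.width.
const mobile = chooseMaxTilesPerPage({ cpuCores: 4, screenWidthPx: 390 });
const desktop = chooseMaxTilesPerPage({ cpuCores: 12, screenWidthPx: 2560 });
```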

&lt;h2&gt;
  
  
  Heap memory
&lt;/h2&gt;

&lt;p&gt;If your app progressively uses more and more memory without it ever being garbage collected, you have a &lt;a href="https://www.daily.co/blog/introduction-to-memory-management-in-node-js-applications/"&gt;leak&lt;/a&gt;. But leaks aside, how much memory usage is "too much"? There are no definitive numbers here, because different devices and browsers have different capabilities. JavaScript engines help us along with their own garbage collection processes; however, garbage collection itself can be an expensive operation.&lt;/p&gt;
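&lt;p&gt;A common source of a steadily climbing heap is an unbounded cache. Below is a minimal sketch (all names hypothetical) of the leaky pattern alongside a bounded alternative that lets old entries be garbage collected:&lt;/p&gt;

```javascript
// Leaky pattern: an unbounded cache retains every entry forever, so the
// JS heap climbs and the entries are never reclaimed by garbage collection.
const leakyCache = new Map();
function rememberLeaky(key, value) {
  leakyCache.set(key, value); // never evicted
}

// Bounded fix: cap the cache and evict the oldest entry (simple LRU-ish).
function makeBoundedCache(maxEntries) {
  const cache = new Map();
  return {
    set(key, value) {
      if (cache.has(key)) cache.delete(key); // refresh insertion order
      cache.set(key, value);
      if (cache.size > maxEntries) {
        // A Map iterates in insertion order, so the first key is the oldest.
        cache.delete(cache.keys().next().value);
      }
    },
    get: (key) => cache.get(key),
    size: () => cache.size,
  };
}

const cache = makeBoundedCache(100);
for (let i = 0; i < 1000; i++) cache.set(`frame-${i}`, { index: i });
```

&lt;p&gt;With the bounded version, evicted values become unreachable and can be collected, so the heap graph flattens out instead of climbing.&lt;/p&gt;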

&lt;p&gt;In performance recordings, frequent changes (rising and falling) to the JS heap graphs indicate frequent garbage collection. This may be worth investigating, especially if you notice degraded performance on certain devices. &lt;/p&gt;

&lt;h2&gt;
  
  
  Frames per second (FPS)
&lt;/h2&gt;

&lt;p&gt;A high FPS value means that your app is responsive and fluid, while a low FPS can result in sluggish and choppy rendering. Ideally, you want to maintain an FPS of 50 or higher for a smooth user experience. The FPS is shown in the top-left corner of your web app when the Frame Rendering Stats overlay is on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To achieve optimal performance and a great user experience in your video call application, focus on finding the right balance between performance and UX:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/guides/architecture-and-monitoring/intro-to-video-arch#call-constraints-and-cameras"&gt;Consider device constraints&lt;/a&gt;: Account for CPU and bandwidth limitations on mobile and older devices. Limit active video cameras to maintain call quality.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/guides/configurations-and-settings/setting-up-calls#building-calls-with-daily-prebuilt"&gt;Customisation&lt;/a&gt;: Tailor room settings to suit different use cases. Consider &lt;a href="https://docs.daily.co/guides/scaling-calls/interactive-live-streaming-rtmp-output#main"&gt;Interactive Live Streaming&lt;/a&gt;: For large audiences, consider live streaming with minimal delay to enhance performance. Daily’s ILS supports 100,000 real-time participants. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.daily.co/guides/configurations-and-settings/setting-up-calls#building-a-custom-video-call-ui-using-the-daily-call-object"&gt;Find your sweet spot&lt;/a&gt;: Find your ideal balance of pagination settings and media resolution to suit your specific use case.&lt;/li&gt;
&lt;li&gt;Look at your app holistically: Video is likely just one part of your application. Be aware of other components that may impact the performance of your application in conjunction with the video implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By taking these performance considerations into account, you can create a video call app that delivers both top-notch performance and an enjoyable user experience.&lt;/p&gt;

</description>
      <category>webrtc</category>
      <category>ux</category>
      <category>tutorial</category>
      <category>performance</category>
    </item>
    <item>
      <title>Automatic short-form video: Highlight reels at scale with AI and VCSRender</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Fri, 06 Oct 2023 16:41:31 +0000</pubDate>
      <link>https://forem.com/trydaily/automatic-short-form-video-highlight-reels-at-scale-with-ai-and-vcsrender-2h71</link>
      <guid>https://forem.com/trydaily/automatic-short-form-video-highlight-reels-at-scale-with-ai-and-vcsrender-2h71</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/pauli/"&gt;Pauli Olavi Ojala&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Daily’s developer platform powers audio and video experiences for millions of people all over the world. Our customers are developers who use our APIs and client SDKs to build audio and video features into applications and websites.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Today, as part of &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/"&gt;AI Week&lt;/a&gt;, Pauli Olavi Ojala discusses automation of video editing. Pauli is a senior engineer and architect of Daily's compositing framework VCS (Video Component System).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our kickoff post for AI Week goes into more detail about what topics we’re diving into and how we think about the potential of combining WebRTC, video and audio, and AI. Feel free to click over and &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/"&gt;read that intro&lt;/a&gt; before (or after) reading this post.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;💡 This post introduces the problem of mining short-form “video gold” out of a mountain of raw data, and why current AI models can’t solve this problem directly. We take a look at an experimental approach being built at Daily and how it was applied to a Cloud Poker Night session, to create automatic highlight reels. Finally we note a new open source project that provides some missing infrastructure pieces for rendering short-form videos at the large scale enabled by AI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These days people love watching bite-sized video clips, typically 15-60 seconds long, on platforms like TikTok, Instagram, and YouTube. For many user groups, this kind of short-form video has displaced more traditional social media content formats such as Twitter-style short text posts, Instagram-style static photos, and the classic extended YouTube video.&lt;/p&gt;

&lt;p&gt;Businesses and content creators could reach their audiences in this popular new format, but they face a massive problem in how to produce the content. Creating simple tweets and picture posts is well understood and often automated by various tools. While YouTube-style long-form video can be laborious and also requires frequent posting, newer formats up the ante.&lt;/p&gt;

&lt;p&gt;Short-form video manages to combine the most difficult aspects of both. As with tweets, you need a high volume of content because the clips are so short and viewers are flooded with options. On the other hand, producing a great 45-second video can be as much work as making a great ten-minute video. Often it’s even harder — in the classic rambling YouTube format, a creator can often keep people watching simply by talking well, but a short reel must have tight editing, cool graphics, flashy transitions, and so on.&lt;/p&gt;

&lt;p&gt;Although short-form video seems like a pure social media phenomenon today, it’s worth noting that social modes of expression and habits of content consumption tend to get integrated into other digital products over time. Consider Slack: a very popular business product, but the user experience has much more in common with Discord and Twitter than old-school business software like Outlook. Fifteen years ago it would have been unimaginable to use emojis and meme pictures in a business context. As a generation of users grows up with short-form video, its expressive features may similarly be adopted by great products in all kinds of verticals. Even plain old business meetings could benefit from tightly edited video summaries, especially if the editor could incorporate useful context from outside the meeting recording.&lt;/p&gt;

&lt;p&gt;Meanwhile, we see at Daily that our customers often have a problem of abundance. They have the raw materials for interesting video content, but at massive volume. They can record down to the level of individual camera streams in their Daily-based applications, but nobody has the time or attention span to watch all those raw recordings. In the mining business they would call this a “low-grade ore” — there’s gold somewhere in this mountain of video, but extracting it isn’t trivial!&lt;/p&gt;

&lt;p&gt;Could supply meet demand, and the mass of raw video be refined into striking short-form content? Clearly this problem can’t be solved by throwing more human effort at it. A modern gold mine can’t operate with the manual pans and sluice boxes from the days of the California gold rush. Likewise, the amount of video content that flows through real-time applications on Daily can’t realistically be edited by professionals sitting at Adobe Premiere workstations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mine, machine minds
&lt;/h2&gt;

&lt;p&gt;Enter AI. Could we send a video editor robot down the audiovisual mine shaft to extract the short-form gold?&lt;/p&gt;

&lt;p&gt;As we’ve seen over the past few years, large AI models have reached truly impressive capabilities in language processing as well as image recognition and generation. Creating high-quality textual summaries and relevant illustrations is practically a solved problem, which is why AI already has so many applications for traditional social media content.&lt;/p&gt;

&lt;p&gt;But short-form video is a bit different because it tends to use multiple media formats in a dense time-based arrangement. The vocabulary available to a video editor is quite rich: there are cuts, text overlays, infographics, transitions, sound effects, and so on. Ideally these all work in unison to create a cohesive, compelling experience less than a minute long.&lt;/p&gt;

&lt;p&gt;A large language model (LLM) like &lt;a href="https://openai.com/chatgpt"&gt;ChatGPT&lt;/a&gt; operates on language tokens (fragments of words). An image diffusion model like &lt;a href="https://openai.com/dall-e-3"&gt;Dall-E&lt;/a&gt; or &lt;a href="https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F"&gt;Midjourney&lt;/a&gt; operates on pixels. An audio model like &lt;a href="https://openai.com/research/whisper"&gt;Whisper&lt;/a&gt; operates on voice samples. We have some bridges that map between these distinct worlds so that images turn into words and vice versa, but a true multimedia model that could natively understand time-based audiovisual streams and also the underlying creative vocabulary needed to produce editing structures remains out of reach for now.&lt;/p&gt;

&lt;p&gt;Waiting another decade for more generalized AI might be an option if you’re very patient. But if you’re looking to get a product advantage now, we have to come up with something else.&lt;/p&gt;

&lt;h2&gt;
  
  
  CutBot
&lt;/h2&gt;

&lt;p&gt;The solution we’ve been working on at Daily is to break down the problem of video editing into a conversation of models with distinct and opposing capabilities.&lt;/p&gt;

&lt;p&gt;We’re developing an experimental system called CutBot. It’s similar in design to a traditional kind of expert system, but it commands the vocabulary of video editing. (“Expert system” is an older form of AI modeling where a workflow is manually codified into a program. In many ways it’s the opposite approach to most present-day AI that uses tons of training data to create very large and opaque models.)&lt;/p&gt;

&lt;p&gt;CutBot knows enough about the context of the video that it can apply a sequence of generally mechanical steps to produce a short-form reel whose superficial features tick the right boxes—rapid cuts, graphics overlays with appropriate progressive variation over time, and so on. What this system lacks is any kind of creative agency.&lt;/p&gt;

&lt;p&gt;For that, CutBot has access to a high-powered LLM which can make creative decisions when provided enough knowledge about the source footage in a text format. We can use voice recognition and image recognition to create those textual representations. CutBot can also share further context derived from application-specific data. (In the next section we’ll see a practical example of what this can be.)&lt;/p&gt;

&lt;p&gt;In this partnership of systems, the LLM is like the director of the reel, bringing its experience and commercial taste so it can answer opinion-based questions in the vein of: “What’s relevant here?” and “What’s fun about this?”&lt;/p&gt;

&lt;p&gt;On the other side of the table, CutBot plays the complementary role of an editing technician who knows just enough so it can pose the right questions to the director, then turn those opinions into specific data that will produce the output videos. Note the plural—since we’re doing fully automated editing, it becomes easy to render many variations based on the same core of creative decisions, for example different form factors such as portrait or landscape video, short and extended cuts like 30 or 60 seconds, branded and non-branded versions, and so on.&lt;/p&gt;

&lt;p&gt;A bit further down we’re going to show some early open source tooling we have developed for this purpose. But first, let’s take a look at a real example from a Daily customer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights from Cloud Poker Night
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloudpokernight.com/"&gt;Cloud Poker Night&lt;/a&gt; is a fun and striking new way to play poker. The platform is aimed at a business audience, and it uses Daily to provide a rich social experience. You can see and hear all the other players just like at a real poker table, and a professional dealer keeps the game flowing.&lt;/p&gt;

&lt;p&gt;Short highlight reels would be a great way for players to share the experiences they had on Cloud Poker Night. If these could be generated automatically, they could even be made player-specific, so that you can share your own best moments with friends on social media. What makes this quite difficult is that the game is so rich and carries so many simultaneous data streams — it wouldn’t be an easy task even for a human video editor.&lt;/p&gt;

&lt;p&gt;In the image below, on the left-hand side is the Cloud Poker Night game interface as it appears to a player. On the right-hand side, we have a glimpse into the underlying raw data that is used to build up the game’s visuals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mO29SmKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/he8gvtqmhpk4hubioid3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mO29SmKL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/he8gvtqmhpk4hubioid3.png" alt="Screenshot of Cloud Poker Night player interface on the left and underlying video and JSON data components on the right" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a great example of an application that has very interesting raw materials for short-form video, but the effort of producing reels manually is too great to even consider. A game session can last for an hour. There can be a dozen or more players across two tables. That would be a lot of video tracks for a human to watch.&lt;/p&gt;

&lt;p&gt;In addition to these raw AV streams, there are also game events like winning hands, emoji reactions sent by players, and so on. A human editor would probably just watch the video tracks and try to figure out what was interesting in the game, but a computer can use the game event data and players’ emoji reactions to home in on the interesting moments more directly.&lt;/p&gt;

&lt;p&gt;So we take these inputs and use a first stage of AI processing to create textual representations. Then the “partnership of systems” described in the previous section — CutBot and LLM — makes a series of editing decisions to produce a cut for the highlight reel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8xqodgMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mc81y6eaog6i6j7vwzcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8xqodgMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mc81y6eaog6i6j7vwzcb.png" alt="Raw data components being transcribed on the left. Short-form video output on the right." width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have the cut, it’s not yet set in stone — or rather, fixed in pixels. The output from the AI is an intermediate representation that remains “responsive” in the sense that we can render it into multiple configurations. For example, having both portrait and landscape versions of the same reel can be useful for better reach across social media platforms:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YI8eK0Fp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7exorpkrpfx4hleqs026.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YI8eK0Fp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7exorpkrpfx4hleqs026.png" alt="Side by side view of a short-form video screenshots in portrait and landscape modes." width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll discuss the specifics that enable this kind of responsive visual design in a minute. Before we jump to the rendering side, let’s pull together the threads of AI processing. What we’ve shown here with reel generation from Cloud Poker Night’s raw data is a prototype. How could this be deployed for a real-world application, and how would users access this functionality?&lt;/p&gt;

&lt;h2&gt;
  
  
  User interface considerations
&lt;/h2&gt;

&lt;p&gt;CutBot isn’t a fully generalized AI. It’s a combination of the “dumb but technically qualified” editing-specific expert system and the “smart but loose” generic LLM. To some extent, the heuristics used by the expert system need to be tuned for each application. We’re still in the early stages of making this process more accessible to developers.&lt;/p&gt;

&lt;p&gt;For the Cloud Poker Night example shown above, the tuning was done manually. The demo was programmed with some details of the poker application and the desired visual vocabulary for the video. Partially this was done using what could be called “low-code” methods. For example, the visual templates for the overlay graphics in the Cloud Poker Night clip were designed in the &lt;a href="https://www.daily.co/tools/vcs-simulator/daily_baseline.html"&gt;VCS Simulator&lt;/a&gt; using its GUI. It has a feature that outputs code blocks ready for use in Daily’s compositing system. Those code blocks can simply be copied over into the CutBot’s repertoire of graphics.&lt;/p&gt;

&lt;p&gt;For some applications, the video generation process can be fully automated. This is a valid approach when you want to deliver the short-form videos as soon as possible, as would be the case for Cloud Poker Night. You’d usually want to enable players to share their highlight reels immediately after the game, and it doesn’t matter if the reels are 100% polished.&lt;/p&gt;

&lt;p&gt;The other possible approach is to have a human in the loop providing input to the process. This means building a user interface that presents options from CutBot and lets the user make choices. For most users, this wouldn’t need to resemble the traditional video editing interface—the kind where you’re manually slipping tracks and trimming clips with sub-second precision. The paradigm here can be inverted because the cut is primarily created in conversation with the LLM-as-director. The end-user UI can reflect that reality and operate on higher-level content blocks and associated guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  VCSRender
&lt;/h2&gt;

&lt;p&gt;What happens when CutBot has produced a cut — how does it turn into the actual AV artifacts, one or more output videos?&lt;/p&gt;

&lt;p&gt;We mentioned before that producing multiple form factors and durations from the same AI-designed cut is a common requirement. For this reason, it’s not desirable to have the output from CutBot be precisely nailed down so that every visual layer is expressed in absolute display coordinates and animation keyframe timings. Instead, we’d like to have an intermediate format which represents the high-level rendering intent, but leaves the details of producing each frame to a smart compositor that gets executed for each output format.&lt;/p&gt;

&lt;p&gt;As it happens, we already have such a thing at Daily — and it’s open source to boot. Our smart compositor engine is called &lt;a href="https://www.daily.co/blog/new-beta-dailys-video-component-system/"&gt;VCS&lt;/a&gt;, the Video Component System.&lt;/p&gt;

&lt;p&gt;VCS is based on &lt;a href="https://react.dev/"&gt;React&lt;/a&gt;, so you can express dynamic and stateful rendering logic with the same tools that are familiar to front-end developers everywhere. This makes it a great fit for the compositor we need to generate videos from CutBot.&lt;/p&gt;

&lt;p&gt;Until now we’ve provided two renderers for VCS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A server-side engine for live streaming and recording. It runs on Daily’s cloud and operates on live video inputs. You can access it via &lt;a href="https://docs.daily.co/guides/products/live-streaming-recording/vcs"&gt;Daily’s API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A client-side engine that runs directly in the web browser. It operates on video inputs from a Daily room, or &lt;code&gt;MediaStream&lt;/code&gt; instances provided by your application. This library is open source. (You can &lt;a href="https://docs.daily.co/reference/vcs/tools/build-a-js-package"&gt;create your own JS bundles directly&lt;/a&gt;, and we’re also working on a renderer package that will make it much easier to use the VCS engine in your Daily-using app.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll be adding a third renderer called VCSRender that’s specifically designed for batch rendering situations like these short-form videos.&lt;/p&gt;

&lt;p&gt;VCSRender is an open-source CLI tool and library that produces videos from inputs like video clips, graphics, and VCS events that drive the composition. It’s a lightweight program with few dependencies, and it’s designed to adhere to the “Unix way” where possible. This means all non-essential tasks like encoding and decoding media files are left to existing programs like &lt;a href="https://ffmpeg.org/"&gt;FFmpeg&lt;/a&gt; which already perform these admirably. This fine-grained division of labor makes it easier to deploy VCSRender in modern “serverless” style cloud architectures where swarms of small workers can be rapidly summoned when scale is needed, but no resources are consumed when usage is low.&lt;/p&gt;

&lt;h2&gt;
  
  
  VCSCut
&lt;/h2&gt;

&lt;p&gt;VCSRender, discussed above, is fundamentally a low-level tool. To provide a more manageable interface to this system, we’re also releasing a program called VCSCut. It defines the “cut” file format produced by our CutBot AI. These cut files are expressed in a simple JSON format, so it’s easy to create them with other tooling or even by hand.&lt;/p&gt;

&lt;p&gt;VCSCut is really a script that orchestrates the following open source components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FFmpeg for decoding and encoding&lt;/li&gt;
&lt;li&gt;VCS batch runner for executing and capturing the VCS React state&lt;/li&gt;
&lt;li&gt;VCSRender for final compositing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together these provide a rendering environment that’s well suited for short-form video. In a cloud environment, you could even split these tasks onto separate cloud functions orchestrated by the VCSCut script, with something like AWS S3 providing intermediate storage. This kind of design can be cost-effective for variable workloads because you don’t need to keep a pool of workers running so there’s no expense when the system isn’t used, but it still can scale up very rapidly.&lt;/p&gt;

&lt;p&gt;It’s worth emphasizing that the VCSCut+VCSRender combo is in early alpha. It’s tuned for short-form video and is not a solution for every video rendering need. For long videos and real-time streaming sessions, we internally use &lt;a href="https://gstreamer.freedesktop.org/"&gt;GStreamer&lt;/a&gt;, an extremely mature solution with a different set of trade-offs. (There’s an interesting story to tell about how VCS works with GStreamer, but that will have to wait for another day!)&lt;/p&gt;

&lt;p&gt;The alpha release will be part of the &lt;a href="https://github.com/daily-co/daily-vcs"&gt;&lt;code&gt;daily-vcs&lt;/code&gt; repo on Github&lt;/a&gt;. You can watch and star the repo if you want to stay up to date on the code directly! We’ll be talking about it on our blog and social media channels too. (And you can refer to &lt;a href="https://docs.daily.co/"&gt;our reference documentation&lt;/a&gt; for more information about VCS and Daily’s recording features.)&lt;/p&gt;

&lt;p&gt;We are always excited to discuss these topics — you can find us on social media; join us on the &lt;a href="https://community.daily.co/"&gt;peerConnection&lt;/a&gt; WebRTC forum; or find us online or IRL at one of the &lt;a href="https://www.daily.co/resources/"&gt;events&lt;/a&gt; we host.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>react</category>
      <category>socialmedia</category>
    </item>
    <item>
      <title>How to talk to an LLM (with your voice)</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Thu, 05 Oct 2023 18:46:59 +0000</pubDate>
      <link>https://forem.com/trydaily/how-to-talk-to-an-llm-with-your-voice-533l</link>
      <guid>https://forem.com/trydaily/how-to-talk-to-an-llm-with-your-voice-533l</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/kwindla-hultman-kramer/" rel="noopener noreferrer"&gt;Kwindla Hultman Kramer&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Daily’s developer platform powers audio and video experiences for millions of people all over the world. Our customers are developers who use our APIs and client SDKs to build audio and video features into applications and websites.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Today, as part of &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/" rel="noopener noreferrer"&gt;AI Week&lt;/a&gt;, we’re talking about building voice-driven AI applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our kickoff post for AI Week goes into more detail about what topics we’re diving into and how we think about the potential of combining WebRTC, video and audio, and AI. Feel free to click over and &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/" rel="noopener noreferrer"&gt;read that intro&lt;/a&gt; before (or after) reading this post.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Lately, we’ve been doing a lot of experimenting and development with the newest AI large language models (LLMs). We’ve shipped &lt;a href="https://www.daily.co/blog/the-technology-behind-ai-powered-clinical-notes/" rel="noopener noreferrer"&gt;products that use GPT-4&lt;/a&gt;. We’ve helped customers build features that leverage several kinds of smaller, specialized models. We’ve developed strong opinions about which architectures work best for which use cases. And we have strong opinions about how to wire up real-time audio and video streams bidirectionally to AI tools and services.&lt;/p&gt;

&lt;p&gt;We built a little demo of an LLM that tells you a choose-your-own-adventure story, alongside DALL-E generative art. The demo is really, really fun. Feel free to &lt;a href="https://www.daily.co/talk-to-llm" rel="noopener noreferrer"&gt;go play with it now&lt;/a&gt; and talk with the LLM to generate a story!&lt;/p&gt;

&lt;p&gt;If you have such a good time that you don’t make it back here, just remember these three things for when you start building your own voice-driven LLM app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run everything in the cloud (if you can afford to)&lt;/li&gt;
&lt;li&gt;Don’t use web sockets for audio or video transport (if you can avoid them)&lt;/li&gt;
&lt;li&gt;Squeeze every bit of latency you can out of your data flow (because users don’t like to wait)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo is built on top of our &lt;a href="https://github.com/daily-co/daily-python" rel="noopener noreferrer"&gt;&lt;code&gt;daily-python&lt;/code&gt; SDK&lt;/a&gt;. Using the demo as a starting point, it should be easy to stand up a voice-driven LLM app on any cloud provider. If you’re just here for sample code (you know who you are), scroll to the bottom of this post.&lt;/p&gt;

&lt;p&gt;If you’re more in a lean-back mode right now, you can also watch this video that Annie made, walking through the demo. (In her adventure, the LLM tells a story about a brave girl who embarks on a quest — aided by a powerful night griffin — to free her cursed kingdom.)&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/j41Z3ZnatAI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this post, we'll take a look at the architecture and work behind the &lt;a href="https://www.daily.co/talk-to-llm" rel="noopener noreferrer"&gt;demo&lt;/a&gt;, covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 components to build an LLM application&lt;/li&gt;
&lt;li&gt;data flow&lt;/li&gt;
&lt;li&gt;speech-to-text: client-side or in the cloud?&lt;/li&gt;
&lt;li&gt;web sockets or WebRTC&lt;/li&gt;
&lt;li&gt;where to run your app's speech-to-LLM-to-speech logic&lt;/li&gt;
&lt;li&gt;phrase detection and endpointing&lt;/li&gt;
&lt;li&gt;LLM-prompting APIs, and streaming the response data&lt;/li&gt;
&lt;li&gt;natural sounding speech synthesis&lt;/li&gt;
&lt;li&gt;how hard can audio buffer management be?&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;llm-talk&lt;/code&gt; to build voice-driven LLM apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontier LLMs are good conversationalists
&lt;/h2&gt;

&lt;p&gt;There’s a lot of really cool stuff that today’s newest, most powerful, “frontier” large language models can do. They are good at taking unstructured text and extracting, summarizing, and structuring it. They’re good at answering questions, especially when augmented with a knowledge base that’s relevant to the question domain. And they’re surprisingly good conversationalists!&lt;/p&gt;

&lt;p&gt;At Daily, we’re seeing a lot of interesting use cases for conversational AI: teaching and tutoring, language learning, speech-to-speech translation, interview prep, and a whole host of experiments with creative chatbots and interactive games.&lt;/p&gt;

&lt;p&gt;You need three basic components to get started building an LLM application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Speech-to-text (abbreviated as STT, and also called transcription or automatic speech recognition)&lt;/li&gt;
&lt;li&gt;Text-to-speech (abbreviated as TTS, and also called voice synthesis)&lt;/li&gt;
&lt;li&gt;The LLM itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Interestingly, all of these are “AI.” The underlying technologies for state-of-the-art speech-to-text models, text-to-speech models, and large language models are quite similar.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data flow
&lt;/h3&gt;

&lt;p&gt;Here are the processing and networking steps common to every speech-to-speech AI app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ctitxuvsmssuyqgj943.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ctitxuvsmssuyqgj943.png" alt="data flow diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Voice capture&lt;/li&gt;
&lt;li&gt;Speech-to-text&lt;/li&gt;
&lt;li&gt;Text input to LLMs (and possibly other AI tools and services)&lt;/li&gt;
&lt;li&gt;Text-to-speech&lt;/li&gt;
&lt;li&gt;Audio playout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are the low-level, generic building blocks that you’ll need. All the specific design and logic that makes an application unique — how the UX is designed, how you prompt the LLM, the workflow, unique data sets, etc — are built on top of these basic components.  &lt;/p&gt;
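&lt;p&gt;As a sketch, the five steps map naturally onto a chain of functions. The stubs below are placeholders for real services (a transcription API, an LLM endpoint, a synthesis API), not actual Daily code:&lt;/p&gt;

```python
# A minimal sketch of the five-step speech-to-speech pipeline.
# Each stage is a stand-in: in a real app, speech_to_text would call a
# service like Deepgram, query_llm would call a model API, and
# text_to_speech would call a synthesis service.

def capture_voice():
    # Stage 1: capture audio from the user's microphone (stubbed here).
    return b"raw-audio-bytes"

def speech_to_text(audio):
    # Stage 2: transcribe the captured audio to text.
    return "tell me a story about talking animals"

def query_llm(prompt):
    # Stage 3: send the transcribed text to the LLM.
    return f"Once upon a time... (responding to: {prompt})"

def text_to_speech(text):
    # Stage 4: synthesize the LLM's reply as audio.
    return b"synthesized-" + text.encode()

def play_audio(audio):
    # Stage 5: play the synthesized audio back to the user.
    return len(audio)

def run_pipeline():
    audio_in = capture_voice()
    transcript = speech_to_text(audio_in)
    reply = query_llm(transcript)
    audio_out = text_to_speech(reply)
    return play_audio(audio_out)
```

&lt;p&gt;Everything an application adds (prompting strategy, UX, interruption handling) slots in between these stages.&lt;/p&gt;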

&lt;h3&gt;
  
  
  Some considerations (things that can be tricky)
&lt;/h3&gt;

&lt;p&gt;You can hack together a speech-to-speech app prototype pretty quickly these days. There are lots of easy-to-use libraries for speech recognition, networking, and voice synthesis.&lt;/p&gt;

&lt;p&gt;If you’re building a production application, though, there are a few things that can take a lot of time to debug and optimize. There are also some easy-to-get-started-with approaches that are likely to be dead-ends for production apps at scale.&lt;/p&gt;

&lt;p&gt;Let’s talk about a few choices and tradeoffs, and a few things that we’ve seen trip people up as they move from prototyping into production and scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speech-to-text – client-side or in the cloud?
&lt;/h3&gt;

&lt;p&gt;Your first technology decision is whether to run speech-to-text locally on each user’s device, or whether to run it in the cloud. Local speech recognition has gotten pretty good. Running locally doesn’t cost any money, whereas using a cloud STT service does.&lt;/p&gt;

&lt;p&gt;But local speech recognition today is not nearly as fast or as accurate as the best STT models running on big, optimized, cloud infrastructure. For many applications, speech-to-text that is both as fast as possible and as accurate as possible is make-or-break. One way to think about this is that the output of speech recognition is the input to the LLM. The better the input to the LLM, the better the LLM can perform.&lt;/p&gt;

&lt;p&gt;For both speed and accuracy, we recommend &lt;a href="https://deepgram.com/" rel="noopener noreferrer"&gt;Deepgram&lt;/a&gt;. We benchmark all of the major speech services regularly. Deepgram consistently wins on both speed and accuracy. Their models are also tunable if you have a specific vocabulary or use case that you need to hyper-optimize for. Because Deepgram is so good, we’ve built a tightly coupled Deepgram integration into our &lt;a href="https://www.daily.co/blog/global-mesh-network/" rel="noopener noreferrer"&gt;SFU codebase&lt;/a&gt;, which helps just a little bit more in reducing latency.&lt;/p&gt;

&lt;p&gt;Azure, GCP, and AWS have all invested recently in improving their ASR (automatic speech recognition) offerings. So there are several other good options if for some reason you can’t use Deepgram.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web sockets or WebRTC?
&lt;/h3&gt;

&lt;p&gt;Getting audio from a user’s device to the cloud means compressing it, sending it over a network channel to your speech-to-text service of choice, and then receiving the text somewhere you can process it.&lt;/p&gt;

&lt;p&gt;One option for sending audio over the internet is to use web sockets. Web sockets are widely supported by both front-end and back-end development frameworks and generally work fine for audio streaming when network conditions are ideal.&lt;/p&gt;

&lt;p&gt;But web socket connections will run into problems streaming audio under less-than-ideal network conditions. In production, a large percentage of users have less-than-ideal networks!&lt;/p&gt;

&lt;p&gt;Web sockets are an abstraction built on top of TCP, which is a relatively complex networking layer designed to deliver every single data packet as reliably as possible. Which sounds like a good thing, but turns out not to be when real-time performance is the priority. If you need to stream packets at very low latency, it’s better to use a faster, simpler network protocol called UDP.&lt;/p&gt;
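&lt;p&gt;To make the contrast concrete, here's what bare UDP looks like with Python's standard &lt;code&gt;socket&lt;/code&gt; module. There is no handshake, no acknowledgment, and no automatic retransmission; a real-time protocol built on top of it, like WebRTC, gets to decide for itself which packets are still worth delivering (this is an illustration, not Daily's transport code):&lt;/p&gt;

```python
import socket

# UDP is fire-and-forget: no connection setup, no acks, no automatic
# resends. The sender just hands a datagram to the network.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))   # bind to any free local port
addr = receiver.getsockname()

sender.sendto(b"20ms-audio-frame", addr)
data, _ = receiver.recvfrom(2048)
print(data)                        # b'20ms-audio-frame'

sender.close()
receiver.close()
```

&lt;p&gt;If that datagram is lost, nothing in the protocol tries to recover it, which is exactly the property a latency-sensitive media stack wants: it can resend, skip, or conceal the loss on its own schedule.&lt;/p&gt;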

&lt;p&gt;These days, the best choice for real-time audio is a protocol called WebRTC. WebRTC is built on top of UDP and was designed from the ground up for real-time media streaming. It works great in web browsers, in native desktop apps, and on mobile; and its standout feature is that it’s good at delivering audio and video at real-time latency across a very wide range of real-world network connections.&lt;/p&gt;

&lt;p&gt;Some things that you get with WebRTC that are difficult or impossible to implement on top of web sockets include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selectively resending dropped packets based on whether they’re likely to arrive in time to be inserted into the playout stream&lt;/li&gt;
&lt;li&gt;Dynamic feedback from the network layer to the encoding layer of the media stack, so the encoding bitrate can adapt on the fly to changing network conditions&lt;/li&gt;
&lt;li&gt;Hooks for the &lt;a href="https://www.w3.org/TR/webrtc-stats/" rel="noopener noreferrer"&gt;monitoring and observability&lt;/a&gt; that you need in order to understand the overall performance of media delivery to your end users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxa7dygr9p0u3u7ku0j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxa7dygr9p0u3u7ku0j4.png" alt="web socket and WebRTC diagrams"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There’s a lot of really cool stuff to talk about in this domain. If you are interested in reading more about the trade-offs involved in designing protocols for real-time media streaming, see our in-depth &lt;a href="https://www.daily.co/blog/video-live-streaming/" rel="noopener noreferrer"&gt;post about WebRTC, HLS, and RTMP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;WebRTC is a significant improvement over web sockets for audio. Using WebRTC is a &lt;em&gt;necessity&lt;/em&gt; for video.&lt;/p&gt;

&lt;p&gt;Video uses much more network bandwidth than audio, so dropped and delayed packets are more frequent. Video codecs also can’t mitigate packet loss as well as audio codecs can. (Modern audio codecs use &lt;a href="https://www.rfc-editor.org/rfc/rfc8627.html" rel="noopener noreferrer"&gt;nifty math&lt;/a&gt; and can be quite robust in the face of partial data loss). But WebRTC’s bandwidth adaptation features enable reliable video delivery on almost any network. &lt;/p&gt;

&lt;p&gt;WebRTC also makes it easy to include &lt;a href="https://www.daily.co/blog/simulcast/" rel="noopener noreferrer"&gt;multiple users&lt;/a&gt; in a real-time session. So, for example, you can invite our storybot demo to join a video call between a child and her grandparents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuwt9jneapqmkg516dx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuwt9jneapqmkg516dx1.png" alt="three clients connected to a WebRTC server"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to run your app’s speech-to-LLM-to-speech logic
&lt;/h3&gt;

&lt;p&gt;Should you run the part of your application logic that glues together speech-to-text, an LLM, and text-to-speech locally or in the cloud?&lt;/p&gt;

&lt;p&gt;Running everything locally is nice. You’re just writing one program (the code running on the user’s device). The downside of running everything locally is that for each piece of the conversation between the user and the LLM, you are making three network connections from the user’s device to the cloud. &lt;/p&gt;

&lt;p&gt;The alternative is to move some of your application logic into the cloud.&lt;/p&gt;

&lt;p&gt;Here are two diagrams showing the difference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfka4h9m6qo0ex50m3c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfka4h9m6qo0ex50m3c7.png" alt="running locally vs. in the cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In general, moving part of your application code into the cloud will improve reliability and lower latency.&lt;/p&gt;

&lt;p&gt;This is especially true for users with not-great network connections. And it’s &lt;em&gt;especially&lt;/em&gt;, especially true if the user is in a different geographic region from the infrastructure that your services are running on.&lt;/p&gt;

&lt;p&gt;We have a lot of data on this kind of thing at Daily, and the performance improvements that come from doing as few “first-mile” round trips as possible vary a great deal between different user cohorts and geographies. But for an average user in the US, the six network connection arrows in the left diagram would typically add up to a total latency of about 90ms. For the same average user, the eight network connection arrows in the right diagram would typically add up to about 55ms.&lt;/p&gt;

&lt;p&gt;Moving some of your application logic into the cloud has other benefits in addition to reducing total latency. You control the compute and network available to your app in the cloud. So you can do things in the cloud that you simply can’t do locally. For example, you can run your own private LLM – say, the very capable &lt;a href="https://docs.cerebrium.ai/cerebrium/prebuilt-models/language-models/llamav2" rel="noopener noreferrer"&gt;Llama2 70B&lt;/a&gt; – inside the app logic box in the right diagram!&lt;/p&gt;

&lt;h3&gt;
  
  
  Phrase detection and endpointing
&lt;/h3&gt;

&lt;p&gt;Once voice data is flowing through your speech-to-text engine, you’ll have to figure out when to collect the output text and prompt your LLM. In the academic literature, various aspects of this problem are referred to as “phrase detection,” “speech segmentation,” and “endpointing.” (The fact that there is academic literature about this is a clue that it’s a non-trivial problem.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk47uo1baew653ns1d52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk47uo1baew653ns1d52.png" alt="phrase detection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most production speech recognition systems use pauses (the technical term is “voice activity detection”), plus some semantic analysis, to identify phrases.&lt;/p&gt;

&lt;p&gt;Sometimes humans pause … and then keep talking after the app has already started responding. (Talking over the system’s output like this is called "barge-in.") To handle that case, it’s useful to build logic into your app that allows you to interrupt the LLM query or the text-to-speech output and restart the data flow.&lt;/p&gt;

&lt;p&gt;Note that in the demo app, we have a hook for implementing barge-in, but it's not actually wired up yet. We're using a combination of client- and server-side logic to manage conversation flow. On the client side, we're using JavaScript audio APIs to monitor the input level of the microphone to determine when the user has started and stopped talking. On the server side, Deepgram's transcription includes endpointing information in the form of end-of-sentence punctuation. We use both of those signals to determine when the user is done talking, so we can play the chime sound and continue the story.&lt;/p&gt;
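&lt;p&gt;The punctuation-based half of that logic can be sketched in a few lines. This is an illustration of the idea, not the demo's actual implementation (which also watches client-side microphone levels):&lt;/p&gt;

```python
# Sketch of punctuation-based endpointing: collect transcription
# fragments and flush a complete phrase once the accumulated text ends
# with sentence-final punctuation (the signal Deepgram-style services
# provide). A production system would combine this with a silence
# (voice activity) timeout.

class Endpointer:
    def __init__(self):
        self.buffer = ""

    def add_fragment(self, text):
        self.buffer = (self.buffer + " " + text).strip()
        if self.buffer.endswith((".", "!", "?")):
            phrase, self.buffer = self.buffer, ""
            return phrase    # complete phrase, ready to send to the LLM
        return None          # still waiting for the endpoint

ep = Endpointer()
print(ep.add_fragment("tell me a story"))        # None
print(ep.add_fragment("about a brave dragon!"))  # 'tell me a story about a brave dragon!'
```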

&lt;h3&gt;
  
  
  LLM prompting APIs, and streaming the response data
&lt;/h3&gt;

&lt;p&gt;A subtle, but important, part of using today’s large language models is that the APIs and best practices for usage differ for different services and even for different model versions released by the same company.&lt;/p&gt;

&lt;p&gt;For example, GPT-4’s &lt;a href="https://platform.openai.com/docs/api-reference/chat" rel="noopener noreferrer"&gt;chat completion API&lt;/a&gt; expects a list of &lt;code&gt;messages&lt;/code&gt;, starting with a system prompt and then alternating between “user” and “assistant” (language model) messages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

messages=[
        {"role": "system", "content": "You are a storyteller who loves to make up fantastic, fun, and educational stories for children between the ages of 5 and 10 years old. Your stories are full of friendly, magical creatures. Your stories are never scary. Stop every few sentences and give the child a choice to make that will influence the next part of the story. Begin by asking what a child wants you to tell a story about. Then stop and wait for the answer."},
        {"role": "assistant", "content": "What would you like me to tell a story about? Would you like a fascinating story about brave fairies, a happy tale about a group of talking animals, or an adventurous journey with a mischievous young dragon?"},
        {"role": "user", "content": "Talking animals!"}
    ]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We’ve often seen developers who are using the GPT-4 API for the first time put their entire prompt into a single “user” message. But it’s frequently possible to get better results using a “system” prompt and a series of &lt;code&gt;messages&lt;/code&gt; entries, especially for multi-turn interactions.&lt;/p&gt;

&lt;p&gt;Using a “system” prompt also helps maintain the predictability, tone, and safety of the LLM’s output. This technology is so new that we don’t yet know how to make sure that LLMs stay on task (as defined by the application developer). For some background, see this &lt;a href="https://simonwillison.net/2023/Aug/3/weird-world-of-llms/#prompt-injection" rel="noopener noreferrer"&gt;excellent overview&lt;/a&gt; by Simon Willison.&lt;/p&gt;

&lt;p&gt;To keep latency low, it’s important to use any &lt;a href="https://platform.openai.com/docs/api-reference/chat/streaming" rel="noopener noreferrer"&gt;streaming mode&lt;/a&gt; configurations and optimizations that your service supports. You’ll want to write your response handling code to process the data as individual, small chunks, so that you can start streaming to your text-to-speech service as quickly as possible.&lt;/p&gt;

&lt;p&gt;It's also worth thinking about ways you can actually use the capabilities of the LLM to your advantage. LLMs are good at outputting structured and semi-structured text. In our demo code, we're instructing the LLM to insert a break word between the "pages" of the story. When we see that break word as we're processing the streaming response from the LLM, we know we can send that portion of the story to TTS.&lt;/p&gt;
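&lt;p&gt;Here's a sketch of that break-word technique as a generator. &lt;code&gt;PAGE_BREAK&lt;/code&gt; is a hypothetical marker standing in for whatever break word your prompt asks the LLM to emit:&lt;/p&gt;

```python
# Sketch: scan a streaming LLM response for a break word and yield each
# completed "page" as soon as it is available, so it can be sent to
# text-to-speech without waiting for the full response.
BREAK = "PAGE_BREAK"  # hypothetical marker the prompt instructs the LLM to emit

def pages(chunks):
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # The marker can arrive split across chunks, so check the
        # accumulated buffer rather than each chunk individually.
        while BREAK in buffer:
            page, buffer = buffer.split(BREAK, 1)
            yield page.strip()
    if buffer.strip():
        yield buffer.strip()  # whatever remains when the stream ends

stream = ["Once upon a time, a fox ", "found a map. PAGE_", "BREAK She followed it east."]
print(list(pages(stream)))
# ['Once upon a time, a fox found a map.', 'She followed it east.']
```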

&lt;h3&gt;
  
  
  Natural-sounding speech synthesis
&lt;/h3&gt;

&lt;p&gt;As is the case with speech-to-text, you &lt;em&gt;can&lt;/em&gt; run text-to-speech locally on a user’s computer or phone, but you probably don’t want to. State-of-the-art cloud TTS services produce output that sounds significantly more natural and less robotic than the best local models today.&lt;/p&gt;

&lt;p&gt;There are several good options for cloud TTS. We usually recommend &lt;a href="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview" rel="noopener noreferrer"&gt;Azure Speech&lt;/a&gt; or &lt;a href="https://elevenlabs.io/" rel="noopener noreferrer"&gt;Elevenlabs&lt;/a&gt;. The landscape is evolving rapidly, though, and there are frequent and exciting updates and new contenders.&lt;/p&gt;

&lt;p&gt;GPT-4 and other top-performing LLMs are good at streaming their completion data at a consistent rate, without big delays in the middle of a response. (And if you’re running your processing code in the cloud, network-related delays will be very rare.) So for many applications, you can just feed the streaming data directly from your LLM into your text-to-speech API.&lt;/p&gt;

&lt;p&gt;But if you are filtering or processing the text from the LLM, before sending it to TTS, you should think about whether your processing can create temporal gaps in the data stream. Any gaps longer than a couple hundred milliseconds will be audible as gaps in the speech output.&lt;/p&gt;

&lt;h3&gt;
  
  
  How hard can audio buffer management be?
&lt;/h3&gt;

&lt;p&gt;Finally, once you have an audio stream coming back from your text-to-speech service, you’ll need to send that audio over the network. This involves taking the incoming samples, possibly stripping out header metadata, then shuffling the samples into memory buffers you can hand over to your low-level audio encoding and network library.&lt;/p&gt;

&lt;p&gt;Here, I’ll pause to note that I’ve written code that copies audio samples from one data structure in memory to some other data structure in memory at least a couple of hundred times during my career and in at least half a dozen different programming languages.&lt;/p&gt;

&lt;p&gt;Yet still, when I write code like this from scratch, there’s usually at least one head-scratching bug I have to spend time tracking down. I get the length of the header that I’m stripping out wrong. My code works on macOS but initially fails in a Linux VM because I didn’t think about endianness. I don’t know the Python threading internals well enough to avoid buffer underruns. Sample rate mismatches. Etc etc.&lt;/p&gt;
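&lt;p&gt;To give a flavor of the bookkeeping involved, here's a simplified sketch that strips a canonical 44-byte WAV header and re-chunks 16-bit mono PCM into 20 ms frames at 16 kHz. It assumes the simplest possible WAV layout, which is exactly the kind of assumption that causes the bugs described above:&lt;/p&gt;

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
FRAME_MS = 20
# 20 ms of 16 kHz, 16-bit mono audio is 640 bytes per frame.
FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000

def wav_to_frames(wav_bytes):
    # Strip the canonical 44-byte WAV header. Real files can carry
    # extra chunks before the sample data, so production code should
    # actually parse the RIFF structure instead of hard-coding 44.
    pcm = wav_bytes[44:]
    frames = []
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if len(frame) == FRAME_BYTES:  # drop a trailing partial frame
            frames.append(frame)
    return frames

# Fake input: a zeroed 44-byte header plus 50 ms of silent samples.
silence = bytes(SAMPLE_RATE * 50 // 1000 * BYTES_PER_SAMPLE)
frames = wav_to_frames(b"\x00" * 44 + silence)
print(len(frames))  # 2 full 20 ms frames; the trailing 10 ms is dropped
```

&lt;p&gt;Even this toy version silently drops the last partial frame; whether you should instead pad it, and with what, depends on the consumer downstream. That is the kind of decision that hides the bugs.&lt;/p&gt;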

&lt;p&gt;Sadly, GPT-4 code interpreter and GitHub Copilot don’t yet just write bug-free audio buffer management functions for you. (I know, because I tried with both tools last week. Both are very impressive – they can provide guidance and code snippets in a way that I would not have believed possible a few months ago. But neither produced production-ready code, even with some iterative prompting.)&lt;/p&gt;

&lt;p&gt;Which brings us to the framework code we wrote to help make all of the tricky things described above a little bit easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using our &lt;code&gt;llm-talk&lt;/code&gt; sample code to build voice-driven LLM apps
&lt;/h2&gt;

&lt;p&gt;We’ve posted &lt;a href="https://github.com/daily-demos/llm-talk" rel="noopener noreferrer"&gt;the source code for this demo&lt;/a&gt; in a GitHub repo called &lt;code&gt;llm-talk&lt;/code&gt;. The repo includes both a bunch of useful sample code for building a voice-driven app, and orchestrator framework code that tries to abstract away a lot of the low-level functionality common to most voice-driven and speech-to-speech LLM apps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All this code is written on top of our &lt;a href="https://github.com/daily-co/daily-python" rel="noopener noreferrer"&gt;&lt;code&gt;daily-python&lt;/code&gt; SDK&lt;/a&gt; and leverages Daily’s &lt;a href="https://www.daily.co/blog/global-mesh-network/" rel="noopener noreferrer"&gt;global WebRTC infrastructure&lt;/a&gt; to give you very low-latency connectivity most places around the world&lt;/li&gt;
&lt;li&gt;Deepgram speech to text capabilities are integrated into Daily’s infrastructure&lt;/li&gt;
&lt;li&gt;Probable phrase endpoints are marked in the speech-to-text event stream&lt;/li&gt;
&lt;li&gt;Orchestrator methods include support for restarting both LLM inference and speech synthesis&lt;/li&gt;
&lt;li&gt;The library handles audio buffer management (and threading) for you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s some sample code from &lt;code&gt;daily-llm.py&lt;/code&gt;, showing how we're joining a Daily call and listening for transcription events.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def configure_daily(self):
    Daily.init()
    self.client = CallClient(event_handler = self)

    self.mic = Daily.create_microphone_device("mic", sample_rate = 16000, channels = 1)
    self.speaker = Daily.create_speaker_device("speaker", sample_rate = 16000, channels = 1)
    self.camera = Daily.create_camera_device("camera", width = 512, height = 512, color_format="RGB")

    self.client.set_user_name(self.bot_name)
    self.client.join(self.room_url, self.token)

    self.my_participant_id = self.client.participants()['local']['id']
    self.client.start_transcription()

def on_participant_joined(self, participant):
    self.client.send_app_message({ "event": "story-id", "storyID": self.story_id})
    self.wave()
    time.sleep(2)

    # don't run intro question for newcomers
    if not self.story_started:
        #self.orchestrator.request_intro()
        self.orchestrator.action()
        self.story_started = True

def on_transcription_message(self, message):
    if message['session_id'] != self.my_participant_id:
        if self.orchestrator.started_listening_at:
            self.transcription += f" {message['text']}"
            if re.search(r'[\.\!\?]$', self.transcription):
                print(f"✏️ Sending reply: {self.transcription}")
                self.orchestrator.handle_user_speech(self.transcription)
                self.transcription = ""
            else:
                print(f"✏️ Got a transcription fragment, waiting for complete sentence: \"{self.transcription}\"")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here’s what the flow of data in the storybot demo looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns1z7w236j7wb5kqx0pd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fns1z7w236j7wb5kqx0pd.png" alt="storybot demo data flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our goal is for the &lt;code&gt;llm-talk&lt;/code&gt; code to be a complete starting point for your apps, and for the orchestrator layer to take care of all the low-level glue, so that you can focus on writing the actual application that you’re imagining! As usually happens with a project like this, we've accumulated a pretty good backlog of improvements we want to make and extensions we want to add. And we're committed to supporting voice-driven and speech-to-speech apps. So expect regular updates to this code.&lt;/p&gt;

&lt;p&gt;AI progress is happening very fast. We expect to see a lot of experimentation and innovation around voice interfaces, because the new capabilities of today’s speech recognition, large language model, and speech synthesis technologies are so impressive and complement each other so well.&lt;/p&gt;

&lt;p&gt;Here’s a voice-driven LLM application that I’d like to use: I want to talk – literally – to The New York Times or The Wall Street Journal. I grew up in a household that subscribed to multiple daily newspapers. I love newspapers! But these days I rarely read the paper the old-fashioned way. (The San Francisco Chronicle won’t even deliver a physical paper to my house. And I live in a residential neighborhood in San Francisco.)&lt;/p&gt;

&lt;p&gt;Increasingly, I think that individual news stories posted to the web or embedded in an app feel like they’re an old format, incompletely ported to a new platform. Last month, I wanted to ask the New York Times about the ARM IPO and get a short, current summary about what had happened on the first day of trading, then follow up with my own questions.&lt;/p&gt;

&lt;p&gt;The collective knowledge of the Times beat reporters and the paper’s decades-spanning archive are both truly amazing. The talking heads shows on TV kind of do what I want, but with the show hosts as a proxy for me and a Times reporter as a proxy for the New York Times. I don’t want anyone to proxy for me; I want to ask questions myself. An LLM could be a gateway providing nonlinear access to the treasure trove that is &lt;a href="https://chat.openai.com/share/48f30da7-9a81-41e2-98dc-78c30a806f18" rel="noopener noreferrer"&gt;The Grey Lady’s&lt;/a&gt; institutional memory. (Also, I want to navigate through this treasure trove conversationally, while cooking dinner or doing laundry.)&lt;/p&gt;

&lt;p&gt;What new, voice-driven applications are you excited about? Our favorite thing at Daily is that we get to see all sorts of amazing things that developers create with the tools we’ve built. If you’ve got an app that uses real-time speech in a new way, or ideas you’re excited about, or questions, please ping us on social media, join us on the &lt;a href="https://community.daily.co/" rel="noopener noreferrer"&gt;peerConnection&lt;/a&gt; WebRTC forum, or find us online or IRL at one of the &lt;a href="https://www.daily.co/resources/events" rel="noopener noreferrer"&gt;events&lt;/a&gt; we host.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webrtc</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Cerebrium + Daily: Simplifying deployments for your AI-powered voice and video apps</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Tue, 03 Oct 2023 17:29:09 +0000</pubDate>
      <link>https://forem.com/trydaily/cerebrium-daily-simplifying-deployments-for-your-ai-powered-voice-and-video-apps-32ac</link>
      <guid>https://forem.com/trydaily/cerebrium-daily-simplifying-deployments-for-your-ai-powered-voice-and-video-apps-32ac</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/varun/"&gt;Varun Singh&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Daily’s developer platform powers audio and video experiences for millions of people all over the world. Our customers are developers who use our APIs and &lt;a href="https://www.daily.co/products/video-sdk/"&gt;client SDKs&lt;/a&gt; to build audio and video features into applications and websites.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our &lt;a href="https://www.daily.co/blog/tag/ai-week-2023/"&gt;AI Week series&lt;/a&gt; looks at how developers can combine real-time video with AI as they build with our platform. Today we're announcing a partnership with &lt;a href="https://www.cerebrium.ai/"&gt;Cerebrium&lt;/a&gt;, a serverless infrastructure platform for training, deploying, and monitoring machine learning models. You can now run &lt;a href="https://docs.daily.co/guides/products/ai-toolkit"&gt;&lt;em&gt;daily-python&lt;/em&gt;&lt;/a&gt; seamlessly as part of a &lt;a href="https://docs.cerebrium.ai/introduction"&gt;Cerebrium application&lt;/a&gt;. You can read more about the &lt;a href="https://www.daily.co/blog/introducing-daily-python-an-sdk-for-ai-powered-interactive-video-and-audio/"&gt;Daily Python SDK&lt;/a&gt; here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Learn more about the topics we’re covering in this AI Week series, and how we think about the potential of combining WebRTC, video and audio, and AI, in our kickoff post. Feel free to click over and &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/"&gt;read that intro&lt;/a&gt; before (or after) reading this post.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;As part of our ongoing AI Week series, we are thrilled to unveil our latest partnership with &lt;a href="https://www.cerebrium.ai/"&gt;Cerebrium&lt;/a&gt; – a serverless deployment option for the &lt;code&gt;daily-python&lt;/code&gt; SDK. Cerebrium makes it easy to run &lt;code&gt;daily-python&lt;/code&gt; alongside hosted AI models in the same container. This eliminates the need to manage the underlying infrastructure, giving engineers easy access to hosted ML capabilities along with voice and video streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elevate your apps with AI: running &lt;code&gt;daily-python&lt;/code&gt; on serverless infrastructure
&lt;/h2&gt;

&lt;p&gt;Bringing AI models into the world of voice and video applications at scale can feel daunting, particularly when these AI models require intensive computational resources. Tasks such as object tracking, object segmentation, video analysis, or speech transcription demand the right mix of I/O, memory, CPU, and GPU resources to ensure real-time performance.&lt;/p&gt;

&lt;p&gt;With Cerebrium’s serverless deployment, you can sidestep the intricacies of scaling CPU/GPU and Kubernetes resources. You can also easily deploy off-the-shelf Prebuilt models, or models fine-tuned on your own data. Cerebrium offers an extensive library of over 20 &lt;a href="https://docs.cerebrium.ai/cerebrium/prebuilt-models/introduction"&gt;Prebuilt models&lt;/a&gt;, including a broad range of &lt;a href="https://docs.cerebrium.ai/cerebrium/prebuilt-models/language-models/gpt4all"&gt;LLMs&lt;/a&gt; and generative &lt;a href="https://docs.cerebrium.ai/cerebrium/prebuilt-models/language-models/whisper"&gt;voice&lt;/a&gt; and &lt;a href="https://docs.cerebrium.ai/cerebrium/prebuilt-models/image-models/stable-diffusion-2"&gt;video&lt;/a&gt; models, such as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 2&lt;/li&gt;
&lt;li&gt;GPT4All&lt;/li&gt;
&lt;li&gt;OpenAI’s Whisper&lt;/li&gt;
&lt;li&gt;Meta Seamless&lt;/li&gt;
&lt;li&gt;ControlNet&lt;/li&gt;
&lt;li&gt;Stable Diffusion&lt;/li&gt;
&lt;li&gt;Meta’s Segment Anything&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_c2ndfCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qrsr43q7jrhwazw87a9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_c2ndfCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qrsr43q7jrhwazw87a9n.png" alt="Cerebrium's Serveless Architecture with daily-python" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cerebrium's Serverless Architecture with &lt;code&gt;daily-python&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is what &lt;code&gt;daily-python&lt;/code&gt; within Cerebrium brings to developers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real-time media processing&lt;/strong&gt;: the ability to send media from the call to an ML model at up to 15 FPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick model inferences&lt;/strong&gt;: receive inferences from the model in 100s of milliseconds, since the model is co-located&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant action&lt;/strong&gt;: as soon as the inference is received from the model, send the modified audio and video or a notification message into the call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For those eager to dive into practical examples, the Cerebrium engineering team has prepared an &lt;a href="https://github.com/CerebriumAI/daily-demos"&gt;example repository of daily-demos&lt;/a&gt; for you to play with. There are currently three demos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/CerebriumAI/daily-demos/blob/master/content-moderation/README.md"&gt;&lt;code&gt;content-moderation&lt;/code&gt;&lt;/a&gt; uses  OpenAI's &lt;a href="https://openai.com/research/clip"&gt;CLIP model&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/CerebriumAI/daily-demos/blob/master/pet-object-detection/README.md"&gt;&lt;code&gt;pet-object-detection&lt;/code&gt;&lt;/a&gt; uses a Ultralytics’ &lt;a href="https://docs.ultralytics.com/"&gt;YOLO v8 model&lt;/a&gt; for object detection.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/CerebriumAI/daily-demos/blob/master/transcription/README.md"&gt;&lt;code&gt;transcription&lt;/code&gt;&lt;/a&gt; uses OpenAI's &lt;a href="https://openai.com/research/whisper"&gt;Whisper Model&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My favorite is the &lt;a href="https://github.com/CerebriumAI/daily-demos/tree/master/pet-object-detection"&gt;pet detection demo&lt;/a&gt;. The server-side does the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join the call with &lt;a href="https://reference-python.daily.co/index.html"&gt;&lt;code&gt;daily-python&lt;/code&gt;&lt;/a&gt;, subscribing to video streams from each participant&lt;/li&gt;
&lt;li&gt;Detect pets by passing video frames from each participant to the &lt;a href="https://docs.ultralytics.com/"&gt;YOLO v8 model&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;When a pet is detected, the model returns the frame with a bounding box&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;daily-python&lt;/code&gt; sends the frame with the bounding box into the call, allowing everyone to enjoy a dedicated feed of pets!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can achieve all of this in roughly 15 lines of code, without having to fret about latency or scaling!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j28LDAAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csymvbrg9w83yfyiigqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j28LDAAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csymvbrg9w83yfyiigqv.png" alt="daily-python detects a cat in one of the frames, sends it as active speaker" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;code&gt;daily-python&lt;/code&gt; detects a cat in one of the frames, sends it as active speaker&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py
def predict(item, run_id, logger):
    item = Item(**item)

    # initialize daily-python
    Daily.init()

    # initialize the AI model
    pet_detector = PetDetection()

    #On startup, connect to room with username Pet Detector
    bot_name = "Pet Detector"
    client = pet_detector.client
    client.set_user_name(bot_name)

    # join daily call with room URL
    pet_detector.join(item.room)

    # iterate on video frame from each participant or user
    for participant in client.participants():
        # ignore local frames, these are self-created pet-detected frames
        if participant != "local":
            # send participant frames to the pet detector
            client.set_video_renderer(participant, callback = pet_detector.on_video_frame)

# pet_detection.py

# Load the model weights
pet_detection = YOLO("weights.pt")

class PetDetection(EventHandler):

    def on_video_frame(self, participant, frame):
        self.frame_count += 1
        if self.frame_count &amp;gt;= self.frame_cadence:
          self.frame_count = 0
          self.queue.put(frame.buffer)
          worker_thread = threading.Thread(target=self.process_frame, daemon=True)
          worker_thread.start()

    def process_frame(self):
        buffer = self.queue.get()
        IMAGE_WIDTH = 1280
        IMAGE_HEIGHT = 720

        image = Image.frombytes('RGBA', (IMAGE_WIDTH, IMAGE_HEIGHT), buffer)
        image = cv2.cvtColor(np.array(image), cv2.COLOR_RGBA2BGR)

        detections = pet_detection(image)
        if len(detections[0].boxes) &amp;gt; 0:
          plotted_image = plot_bboxes(image, detections[0].boxes, score=False)
          plotted_image = cv2.cvtColor(plotted_image, cv2.COLOR_BGR2RGB)
          is_success, buffer = cv2.imencode(".png", plotted_image)
          image_stream = io.BytesIO(buffer)
          self.camera.write_frame(Image.open(image_stream).tobytes())

        # Indicate that a formerly enqueued task is complete
        self.queue.task_done()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ready to get started with Cerebrium? Here's how:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dashboard.daily.co/u/signup"&gt;Sign up for a Daily account&lt;/a&gt;. You will need to create a Daily room through the &lt;a href="https://dashboard.daily.co/rooms/create"&gt;dashboard&lt;/a&gt; or &lt;a href="https://docs.daily.co/reference/rest-api/rooms/create-room"&gt;programmatically&lt;/a&gt; using your Daily API key.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dashboard.cerebrium.ai/"&gt;Sign up for an Cerebrium account&lt;/a&gt;, since we will need to get our API keys to deploy this example.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/CerebriumAI/daily-demos"&gt;Git Clone the repository&lt;/a&gt; and install the necessary packages by running these commands in the terminal
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade cerebrium
cerebrium login &amp;lt;private_api_key&amp;gt; 
cerebrium deploy --config-file ./config.yaml petdetection 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deployment already bundles &lt;code&gt;daily-python&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manually invite the pet bot to the call using the REST API.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location --request POST 'https://run.cerebrium.ai/v3/p-xxxx/pet-detection/predict' \
--header 'Authorization: &amp;lt;JWT_TOKEN&amp;gt;' \
--header 'Content-Type: application/json' \
--data '{"room": "Your Daily Room URL"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This manual step can be later automated by listening to the &lt;a href="https://www.daily.co/blog/configure-a-webhook-to-send-notifications-when-someone-joins-your-video-calls/"&gt;&lt;code&gt;participant-joined&lt;/code&gt; webhook&lt;/a&gt;.&lt;/p&gt;
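&lt;p&gt;As a sketch of that automation (the webhook payload field names and the helper function are hypothetical, not part of Daily's or Cerebrium's APIs), a handler for the &lt;code&gt;participant-joined&lt;/code&gt; event could build the same request the curl example above sends:&lt;/p&gt;

```python
import json

# Hypothetical sketch: when Daily's participant-joined webhook fires,
# build the POST request that invites the pet bot via Cerebrium.
# The endpoint URL mirrors the curl example; payload field names are assumed.
CEREBRIUM_URL = "https://run.cerebrium.ai/v3/p-xxxx/pet-detection/predict"

def build_invite_request(webhook_event, jwt_token):
    """Turn a participant-joined webhook event into a predict request."""
    room_url = webhook_event["payload"]["room"]  # field name assumed
    return {
        "url": CEREBRIUM_URL,
        "headers": {
            "Authorization": jwt_token,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"room": room_url}),
    }

# An HTTP client of your choice (e.g. requests.post) would then send it.
event = {"type": "participant-joined",
         "payload": {"room": "https://example.daily.co/pets"}}
req = build_invite_request(event, "JWT_TOKEN")
```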

&lt;ul&gt;
&lt;li&gt;Invite people to join the created Daily room URL, especially those who have pets on hand!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of our north stars is developer time to value: in this case, how long it takes a developer with a cool idea for an ML-powered application to build, deploy, and start field testing it. With our announcement today, what used to take developers weeks can now take minutes. You can get started with Cerebrium’s free tier and Daily’s 10,000 free monthly minutes.&lt;/p&gt;

&lt;p&gt;Let us know how we can help support you as you build. Reach out to our &lt;a href="https://daily.co/contact/support"&gt;developer support&lt;/a&gt; or head over to &lt;a href="https://community.daily.co/"&gt;peerConnection&lt;/a&gt;, our WebRTC forum. You can always find us online or IRL at one of our regularly-hosted &lt;a href="https://www.daily.co/resources/events/"&gt;events&lt;/a&gt;. And check back for more this week in our &lt;a href="https://www.daily.co/blog/tag/ai-week-2023/"&gt;AI Week series&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>webrtc</category>
      <category>news</category>
    </item>
    <item>
      <title>The technology behind AI-powered Clinical Notes API for Telehealth</title>
      <dc:creator>Tasha</dc:creator>
      <pubDate>Mon, 02 Oct 2023 21:52:25 +0000</pubDate>
      <link>https://forem.com/trydaily/the-technology-behind-ai-powered-clinical-notes-api-for-telehealth-1amd</link>
      <guid>https://forem.com/trydaily/the-technology-behind-ai-powered-clinical-notes-api-for-telehealth-1amd</guid>
      <description>&lt;p&gt;&lt;em&gt;By &lt;a href="https://www.daily.co/blog/author/kwindla-hultman-kramer/"&gt;Kwindla Hultman Kramer&lt;/a&gt;, &lt;a href="https://www.daily.co/blog/author/nina-kuruvilla/"&gt;Nina Kuruvilla&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Daily’s developer platform powers audio and video experiences for millions of people all over the world. Our customers are developers who use our APIs and &lt;a href="https://www.daily.co/products/video-sdk/"&gt;client SDKs&lt;/a&gt; to build audio and video features into applications and websites.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In our &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/"&gt;AI Week&lt;/a&gt; series, we’re introducing two new toolkits, several new components of our &lt;a href="https://www.daily.co/blog/global-mesh-network/"&gt;global infrastructure&lt;/a&gt;, and a series of AI-focused partnerships.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our kickoff post this week goes into more detail about what topics we’re diving into and how we think about the potential of combining WebRTC, video and audio, and AI. Feel free to click over and &lt;a href="https://www.daily.co/blog/ai-week-at-daily-2023-edition/"&gt;read that intro&lt;/a&gt; before (or after) reading this post.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  AI that gives healthcare providers time back, every day
&lt;/h2&gt;

&lt;p&gt;Telehealth usage grew rapidly during the COVID-19 pandemic, accelerating regulatory, billing, and technology changes that had already started. It's now embedded in healthcare delivery, and can help expand access to care and improve patient outcomes.&lt;/p&gt;

&lt;p&gt;From a technology point of view, one powerful benefit of telehealth interactions is that all audio is captured digitally, making it ready to be transcribed and summarized.&lt;/p&gt;

&lt;p&gt;Earlier this year, several of our telehealth customers started asking us whether we could help them understand the landscape of HIPAA-compliant AI tools, and how to use those tools as part of their telehealth workflows.&lt;/p&gt;

&lt;p&gt;Healthcare providers, our customers told us, spend 10 to 15 hours each week writing clinical care notes. These notes summarize a patient visit, along with the provider’s assessments and recommended next steps. Writing clinical notes is time-consuming, yet it doesn’t make good use of a provider’s expertise and isn’t regarded as interesting or creative work. It’s a necessary task that humans could do and computers, until now, couldn’t.&lt;/p&gt;

&lt;p&gt;Our telehealth customers were seeing examples of &lt;a href="https://openai.com/gpt-4"&gt;GPT-4&lt;/a&gt; producing summaries of things like sales calls, customer support interactions, podcasts, and YouTube videos — and asking if an AI Large Language Model (LLM) could produce good first drafts of clinical notes. &lt;/p&gt;

&lt;h2&gt;
  
  
  A general approach to summarization and post-processing
&lt;/h2&gt;

&lt;p&gt;Today’s most advanced “frontier” Large Language Models have an impressive range of use cases. But perhaps the &lt;em&gt;most&lt;/em&gt; impressive thing about them, to a computer programmer, is that they are good at turning &lt;strong&gt;unstructured input data&lt;/strong&gt; into &lt;strong&gt;structured output&lt;/strong&gt;. This is a genuinely new capability, and is perhaps the biggest reason so many engineers are excited about these new tools.&lt;/p&gt;

&lt;p&gt;For most use cases that are complex enough to be interesting, the transformation from unstructured data to structured output needs to happen at multiple levels. At the level of content, the Large Language Model processes the input text and summarizes or prioritizes the parts of the text that are most important. At the level of format, the LLM organizes the output into sequences or sections that make sense for the specific use case.&lt;/p&gt;

&lt;p&gt;Generating clinical notes is a good example of this. Audio from a telehealth session is transcribed, sent to one or more AI models, and turned into output that has consistent content characteristics and a consistent structure.&lt;/p&gt;
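&lt;p&gt;As a minimal illustration of the format level (this is not Daily's actual pipeline code; the prompt wording is only illustrative of the SOAP structure), the LLM step can be given the output structure explicitly:&lt;/p&gt;

```python
# Illustrative sketch: wrap an unstructured transcript in a prompt that
# asks an LLM for output organized into the standard SOAP sections.
# The prompt wording is an assumption, not Daily's production prompt.
SOAP_SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def build_soap_prompt(transcript: str) -> str:
    """Build a prompt that requests consistent content and format."""
    sections = "\n".join(f"## {name}" for name in SOAP_SECTIONS)
    return (
        "You are drafting clinical notes from a telehealth transcript.\n"
        "Summarize the visit under exactly these headings:\n"
        f"{sections}\n\n"
        f"Transcript:\n{transcript}\n"
    )

prompt = build_soap_prompt("Patient reports mild headache for two days...")
```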

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7_C2Ln8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy74uosp043ik02m3c00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7_C2Ln8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy74uosp043ik02m3c00.png" alt="Producing a clinical note in the widely used SOAP format (Subjective, Objective, Assessment, and Plan)" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Producing a clinical note in the widely used SOAP format (Subjective, Objective, Assessment, and Plan)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ylod-BN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0gxwofi2psqyox4sdew.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ylod-BN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0gxwofi2psqyox4sdew.gif" alt="GIF of SOAP notes draft generation" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An example of SOAP notes draft generation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a data pipeline with four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture audio&lt;/li&gt;
&lt;li&gt;Transcribe the audio&lt;/li&gt;
&lt;li&gt;Process and transform the transcription&lt;/li&gt;
&lt;li&gt;Validate and store the output&lt;/li&gt;
&lt;/ol&gt;
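&lt;p&gt;The four steps compose into a simple linear pipeline. Here is a minimal sketch with every step stubbed out; the real pipeline uses Daily recordings, Deepgram, and GPT-4 rather than these placeholder functions:&lt;/p&gt;

```python
# Stub pipeline sketch: the step names mirror the list above, but every
# implementation here is a placeholder rather than a real service call.
def capture_audio(session_id):
    return f"audio-bytes-for-{session_id}"      # step 1: capture audio

def transcribe(audio):
    return f"transcript of {audio}"             # step 2: transcribe

def summarize(transcript):
    return {"note": f"summary: {transcript}"}   # step 3: process/transform

def validate_and_store(note, store):
    # step 4: validate the structured output before persisting it
    assert "note" in note
    store.append(note)
    return note

store = []
result = validate_and_store(summarize(transcribe(capture_audio("visit-42"))), store)
```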

&lt;p&gt;Next week we'll be writing more about new APIs that support building AI-powered data pipelines like this one. These pipelines take audio, video, and metadata from real-time video sessions as input, and provide hooks for processing this data with Large Language Models and other AI tools and services.&lt;/p&gt;

&lt;p&gt;To extend Daily’s “building blocks” into this new world of generative AI, we’re leveraging our deep experience with video, audio, and transcription, along with our &lt;a href="https://www.daily.co/blog/global-mesh-network/"&gt;global infrastructure&lt;/a&gt; that was built from the ground up to route, manage, process, and store video and audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  What matters most: data privacy, accuracy, reliability, flexibility
&lt;/h2&gt;

&lt;p&gt;The clinical notes use case is a demanding test for AI workflow APIs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All data processing and storage must protect patient privacy and be HIPAA-compliant.&lt;/li&gt;
&lt;li&gt;The quality of the output is highly sensitive to the accuracy of the transcription.&lt;/li&gt;
&lt;li&gt;Clinical notes are a critical part of the healthcare workflow, so the APIs that power them have to work at scale, with predictable latency.&lt;/li&gt;
&lt;li&gt;There are several common output formats for clinical notes, so the pipeline needs to allow LLM prompts and other pipeline steps to be configurable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ve extended Daily’s existing HIPAA-compliant infrastructure to include support for these new workflow APIs. We’ve also signed HIPAA Business Associate Agreements with three new partners. (More on that below.)&lt;/p&gt;

&lt;p&gt;Our global WebRTC infrastructure and extensively tuned SDKs are key to delivering the best possible audio as input to the transcription step in the pipeline. Higher-quality audio makes more accurate speech-to-text transcription possible. Daily’s bandwidth management and very low average first-hop latency everywhere in the world (13ms) help ensure that as many audio packets as possible are successfully transmitted and recorded.&lt;/p&gt;

&lt;p&gt;To perform the transcription step, we are working with our long-time partner &lt;a href="https://deepgram.com/"&gt;Deepgram&lt;/a&gt;. Deepgram is an industry leader in both overall accuracy and the flexibility of its transcription models. Daily has offered &lt;a href="https://docs.daily.co/reference/daily-js/instance-methods/start-transcription"&gt;direct access to Deepgram’s real-time transcription service&lt;/a&gt; for several years. We’re now wrapping Deepgram’s batch-mode transcription APIs to make it easy to build transcription-driven post-processing workflows. We’ve also signed a HIPAA BAA with Deepgram.&lt;/p&gt;
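&lt;p&gt;As a rough sketch of what batch-mode transcription looks like (this targets Deepgram's public prerecorded &lt;code&gt;/v1/listen&lt;/code&gt; endpoint directly, not Daily's wrapper; the API key and recording URL are placeholders), a request can be built like this:&lt;/p&gt;

```python
import json

# Sketch of a batch (prerecorded) transcription request against Deepgram's
# public /v1/listen endpoint. DEEPGRAM_API_KEY is a placeholder, and this
# only builds the request; an HTTP client would actually send it.
DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"

def build_transcription_request(recording_url):
    """Build a batch-mode transcription request for a hosted recording."""
    return {
        "url": "https://api.deepgram.com/v1/listen",
        "headers": {
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"url": recording_url}),
    }

req = build_transcription_request("https://example.com/session-audio.mp4")
# The JSON response places the transcript at
# results.channels[0].alternatives[0].transcript.
```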

&lt;p&gt;For the Large Language Model step, we’re working with &lt;a href="https://azure.microsoft.com/en-us/products/ai-services/openai-service"&gt;Microsoft Azure OpenAI&lt;/a&gt; and the &lt;a href="https://www.microsoft.com/en-us/industry/digital-transformation"&gt;Microsoft Software &amp;amp; Digital Platforms Group&lt;/a&gt;. Microsoft has given Daily early access to the new Azure HIPAA-compliant &lt;a href="https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/"&gt;OpenAI GPT-4&lt;/a&gt; service.&lt;/p&gt;

&lt;p&gt;We’ve tested clinical notes generation extensively with GPT-3.5, GPT-4, and the Llama-2 family of models. GPT-4 produces the highest-quality output. For the clinical notes use case, the benefits of using GPT-4 outweigh the higher cost of GPT-4 compared to less powerful models.&lt;/p&gt;

&lt;p&gt;To deliver even better results beyond what GPT-4 does by itself, we’ve also partnered with &lt;a href="https://www.science.io/"&gt;ScienceIO&lt;/a&gt;, a company that pioneered using Large Language Models to enrich and structure medical data. We use ScienceIO’s medical structured data APIs in combination with GPT-4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LOq0rNhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8lopyy9q66s77l94mbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LOq0rNhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u8lopyy9q66s77l94mbn.png" alt="diagram of Clinical Notes architecture" width="800" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deepgram, ScienceIO, and GPT-4 together form a state-of-the-art technology stack for processing audio and generating high-quality output for healthcare use cases.&lt;/p&gt;

&lt;p&gt;As AI technology evolves, we expect this pipeline to evolve, too. We’re optimistic, for example, that fine-tuning Llama-2 has the potential to open up additional possibilities for patient and provider workflows beyond clinical notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing these new APIs, and what’s next
&lt;/h2&gt;

&lt;p&gt;Our Clinical Notes API for Telehealth is available today. If you have a use case you’re particularly excited about, or any questions, please &lt;a href="https://www.daily.co/company/contact/"&gt;contact us&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’re excited about our roadmap for post-processing APIs! We are releasing new features throughout the next few weeks. In addition to audio- and transcription-centric use cases, we have a full set of features in development for video analytics, composition, and editing. We’ll have a bit more to say about that next week.&lt;/p&gt;

&lt;p&gt;As always, we love to write and talk about video and audio, real-time networking, and now AI. So if you’re interested in these topics, too, please check out the rest of our &lt;a href="https://www.daily.co/blog/tag/ai-week-2023/"&gt;AI Week posts&lt;/a&gt;, join us on the &lt;a href="https://community.daily.co/"&gt;peerConnection&lt;/a&gt; WebRTC forum, or find us online or IRL at one of the &lt;a href="https://www.daily.co/resources/events/"&gt;events&lt;/a&gt; we host.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webrtc</category>
      <category>programming</category>
      <category>api</category>
    </item>
  </channel>
</rss>
