Forem: Simli

How to Create Real-time AI Video Avatar in 7 Minutes

Dalu46 — Sat, 07 Dec 2024 11:10:15 +0000

What if you could turn dull, static text and audio-based content into exciting videos with the help of AI? With AI video avatar generators, you can easily create high-quality videos that grab your audience's attention starting from simple text or audio.

These AI video generators serve several purposes. They can be deployed as customer support agents in your application to give more engaging support and improve customer satisfaction, or used as educational tools that engage students in interactive learning environments. They can also power virtual assistants that guide users through getting started with a product or tool, so users don't have to read the documentation first.

This tutorial will guide you in setting up and implementing real-time AI video avatars using Simli, an AI video avatar generator. Simli provides developers with a speech-to-video API to create lip-synced AI avatars with lifelike, animated characters, realistic head movements, and synchronized speech.

Following this guide, you will learn how to quickly create a video avatar from voice inputs, ready to be deployed in interactive projects. So, let's get started right away!

The complete source code for the project is available on GitHub.

Prerequisites

You should have:

A basic understanding of JavaScript and React.
Node and Node Package Manager (NPM) installed on your computer.

Before setting up the API environment and then moving on to creating a real-time AI video avatar using the Simli API, let's briefly look at the steps needed to create an AI video avatar with Simli.

Steps to Create an AI Video Avatar with Simli:

Obtain the API key
Choose a face ID
Initialize the Simli client
Call the simliClient.start() function to set up the WebRTC connection
Stream audio using sendAudioData()

Set Up Your API Environment in Minutes

Start by signing up on Simli to retrieve your API key. For a quick sign-in, you can choose Google.
Once you’ve successfully created an account, you will be redirected to the user profile dashboard, where you can generate your API key and track your API usage.

Click the icon above to copy your API key and store it securely. After retrieving your API key, select an avatar to display on the frontend.

Choose Your AI Avatar

Simli provides sample AI avatars that can be accessed through its available faces, with new avatars being added constantly.

Here are a few of the available faces:

To get the ID for each face, copy the random text after the name. For example, the ID for Jenna will be tmp9i8bbq7c.

If you don’t want to use any available avatars, Simli has a create avatar tool that lets you create custom avatars simply by uploading images. However, this tutorial will use an existing avatar.

Now that you have the face ID and the Simli API key, let’s create a Next.js app.

Create a Next.js App

To bootstrap a Next.js application, open your terminal, cd into the directory where you would like to create the application, and run this command:

npx create-next-app@latest simli-demo

This command will prompt a few questions about configuring the Next.js application. Here’s what you should respond to each question:

Select the response for each question as shown above by pressing enter.

Installing Dependencies

Next, install the simli-client and standardized-audio-context packages by running this command:

npm install simli-client standardized-audio-context

The SimliClient, also known as Simli’s WebRTC frontend client, is a tool to integrate real-time video and audio streaming capabilities into web applications using WebRTC. This will enable you to avoid the manual WebRTC setup.

The AudioContext is used to downsample the audio and convert it into chunks that the SimliClient can process.

Initialize the SimliClient in Your Project

In your Next.js application, navigate to the page.js file and paste the following code:

// src/app/page.js
"use client"; // required in the App Router so hooks like useRef can run in the browser
...
// Declare video and audio refs
...

import { useRef, useEffect } from 'react';

function Home() {
  const videoRef = useRef(null);
  const audioRef = useRef(null);

return (
  <div>
    <video ref={videoRef} autoPlay playsInline></video>
    <audio ref={audioRef} autoPlay></audio>
  </div>
);
...

In the code above, videoRef and audioRef were created using the useRef hook to access the <video> and <audio> HTML elements in the component. The SimliClient SDK uses videoRef and audioRef to attach live WebRTC video and audio streams to these HTML elements. The <video> and <audio> elements then render the video and audio data from the remote streams on the client side.

The next step is to configure SimliClient and pass in the video and audio refs. To do so, paste the following code inside page.js:

// src/app/page.js
...
// configure the simli client
...  

import { SimliClient } from 'simli-client';

const simliClient = new SimliClient();

const simliConfig = {
  apiKey: "your api key",
  faceID: "tmp9i8bbq7c",
  handleSilence: true,
  maxSessionLength: 3600,
  maxIdleTime: 600,
  videoRef: videoRef,
  audioRef: audioRef,
};
...

This block of code creates a new instance of the SimliClient and a simliConfig object. Let’s break down each part of the simliConfig object:

apiKey: The unique key generated for you when you create a Simli account.
faceID: Represents the avatar face ID that will be rendered in the video stream. Simli provides different avatars; you can choose one using its face ID.
handleSilence: This boolean indicates whether the client should handle silent moments in the audio stream (e.g., muting or pausing the video if no audio is detected).
maxSessionLength: Sets the maximum session length (in seconds). Here, it's set to 1 hour (3600 seconds), limiting the duration of any single connection session.
maxIdleTime: Sets the maximum idle time (in seconds). The session will disconnect after 600 seconds (10 minutes) without activity.
videoRef and audioRef: These are references to the video and audio elements where the media streams will be displayed in the browser. SimliClient can connect the WebRTC streams directly to these elements by passing these refs.

Start Real-time Streaming with AI Video Avatar

Once you have successfully configured SimliClient, the next step is establishing the WebRTC connection.

But before that, you need to create a function that downsamples the audio to 16 kHz and breaks it into smaller pulse-code modulation (PCM) chunks. This guide uses a prerecorded MP3 file that will be sent to the SimliClient. You can download and use any audio of your choice.

Paste the following code inside the page.js file to create the downsampleAndChunkAudio function:

// src/app/page.js
...
// Downsample the audio to PCM chunks
... 

const downsampleAndChunkAudio = async (audioUrl, chunkSizeInMs = 100) => {
  // Create an AudioContext with a target sample rate of 16kHz
  const audioContext = new AudioContext({ sampleRate: 16000 });
  // Fetch and decode audio file
  const response = await fetch(audioUrl);
  const arrayBuffer = await response.arrayBuffer();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
  // Extract PCM data from audio buffer
  const rawPCM = audioBuffer.getChannelData(0); // assuming mono audio for simplicity
  // Calculate chunk size in samples: 100 ms at 16 kHz is 1,600 samples
  const chunkSizeInSamples = (chunkSizeInMs / 1000) * 16000;
  const pcmChunks = [];
  // Loop through the raw PCM data and create chunks
  for (let i = 0; i < rawPCM.length; i += chunkSizeInSamples) {
    const chunk = rawPCM.subarray(i, i + chunkSizeInSamples);
    // Convert each chunk to Int16Array PCM data
    const int16Chunk = new Int16Array(chunk.length);
    for (let j = 0; j < chunk.length; j++) {
      int16Chunk[j] = Math.max(-32768, Math.min(32767, chunk[j] * 32768));
    }
    pcmChunks.push(int16Chunk);
  }
  return pcmChunks;
};

The downsampleAndChunkAudio function takes an audio URL and an optional chunk duration in milliseconds. It downsamples the audio to 16 kHz and breaks it into smaller 16-bit PCM chunks, which is the format the SimliClient requires.
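
To make the chunk arithmetic concrete, here is a minimal sketch of the same conversion written as a pure function over samples that are already decoded. The helper name floatToInt16Chunks is hypothetical and not part of the Simli SDK; it assumes mono Float32 samples in the range -1 to 1.

```typescript
// Hypothetical helper (not part of simli-client): split decoded mono samples
// into fixed-duration chunks and convert each one to 16-bit PCM.
function floatToInt16Chunks(
  samples: Float32Array,
  sampleRate: number = 16000,
  chunkSizeInMs: number = 100
): Int16Array[] {
  // 100 ms at 16,000 samples per second is 1,600 samples per chunk.
  const chunkSizeInSamples = Math.round((chunkSizeInMs * sampleRate) / 1000);
  if (chunkSizeInSamples <= 0) {
    throw new Error("chunk size must cover at least one sample");
  }
  const chunks: Int16Array[] = [];
  for (let i = 0; i < samples.length; i += chunkSizeInSamples) {
    const slice = samples.subarray(i, i + chunkSizeInSamples);
    const pcm = new Int16Array(slice.length);
    for (let j = 0; j < slice.length; j++) {
      // Scale [-1, 1] to the signed 16-bit range and clamp the extremes.
      pcm[j] = Math.max(-32768, Math.min(32767, Math.round(slice[j] * 32768)));
    }
    chunks.push(pcm);
  }
  return chunks;
}

// 16,800 samples of silence: ten full chunks plus a shorter tail.
const demoChunks = floatToInt16Chunks(new Float32Array(16800));
console.log(demoChunks.length, demoChunks[0].length, demoChunks[10].length); // 11 1600 800
```

The last chunk is simply shorter than the others; whether the receiver needs it padded to a full 100 ms is not covered by this guide.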

Next, you have to initialize SimliClient and establish the WebRTC connection. To do so, paste the following code inside the page.js file:

// src/app/page.js
...
// Initialize simli client
... 

async function initializeClient() {
  try {
    simliClient.Initialize(simliConfig);
    await simliClient.start();
    // audioUrl is not defined in this snippet; it should hold the path or URL of your audio file
    const pcmChunks = await downsampleAndChunkAudio(audioUrl);
    // Send one chunk per tick until the list is empty
    const interval = setInterval(() => {
      const chunk = pcmChunks.shift();
      if (chunk) simliClient.sendAudioData(chunk);
      if (!pcmChunks.length) clearInterval(interval);
      console.log("PCM ", chunk);
    }, 120);
  } catch (error) {
    alert(error);
  }
}

The initializeClient function initializes the SimliClient with the simliConfig object that was earlier declared. It then calls the downsampleAndChunkAudio function to break the audio into chunks of type PCM16 before sending it to the Simli client.
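
One detail worth checking in the loop above is pacing: each chunk holds 100 ms of audio, but the timer fires every 120 ms. The sketch below is an illustrative calculation only, not Simli's documented behavior, and the helper name sendTimeMs is hypothetical. It compares how long a clip plays with how long the loop takes to send it.

```typescript
// Hypothetical helper: how long does a setInterval loop need to send a clip
// when it sends exactly one chunk per tick?
function sendTimeMs(clipMs: number, chunkMs: number, intervalMs: number): number {
  const chunkCount = Math.ceil(clipMs / chunkMs);
  // setInterval fires its first callback after one full interval.
  return chunkCount * intervalMs;
}

const clipDurationMs = 10000; // a 10-second clip
console.log(sendTimeMs(clipDurationMs, 100, 120)); // 12000
console.log(sendTimeMs(clipDurationMs, 100, 100)); // 10000
```

Under these assumptions, a 120 ms interval delivers audio about 20% slower than real time. If playback stutters, an interval at or slightly below the chunk duration may be worth trying.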

Note: The audio data should be of PCM16 type and have a sample rate of 16 kHz.

PCM16 is a standard audio format ideal for voice processing. When you send this audio format to Simli's API, it helps maintain synchronization between the audio and the avatar's lip movements. This enhances the viewer experience, as it mimics natural speaking in real-time.

Render and Integrate the AI Avatar on the Frontend

Now that you have finished building the application, let’s run it in the browser. To do so, open your terminal and run this command:

npm run dev

This command starts a local development server at http://localhost:3000.

Watch the application in action through this video.

Check out this GitHub repository for a hands-on example of how to integrate Simli's API for building interactive AI avatars.

Conclusion

This quick guide showed how to create a real-time AI video avatar using the Simli API. While this article covered only the basics—such as sending prerecorded audio to the Simli API—Simli offers capabilities that extend far beyond this scope.

To unlock Simli's full potential, you can enhance your AI video avatars by integrating additional tools like OpenAI for language models and Deepgram or ElevenLabs for converting text to speech. These tools work with Simli to create more engaging and interactive video experiences.
Check this tutorial for a more advanced use case of Simli. Sign up on Simli today to get started!

5 Tips for Using Digital Avatars in Recruitment Videos

Ezekiel Lawson — Wed, 04 Dec 2024 09:51:39 +0000

Recruiting good-fit talent these days can be challenging and time-consuming. Traditional hiring methods often involve lengthy interview rounds, each with high costs and logistical hurdles. To streamline this process, companies are increasingly turning to video-based recruitment—and even taking it a step further with digital avatars.

Digital avatars offer a fresh, engaging approach to recruitment videos, enabling companies to create tailored, responsive interactions with potential hires. These AI avatars can handle initial candidate interactions, answer common questions, and give candidates a glimpse into company culture—without requiring live interactions at every stage. This frees up valuable time for hiring teams and offers candidates a better experience.

Many organizations, such as Dropbox, Starbucks, and GE Aerospace, have embraced video recruitment to showcase their work environments and core values. Integrating digital avatars can amplify these benefits, making recruitment faster, more accessible, and even more appealing to today’s top talent.

Since AI avatars are a relatively new technology, adopting them effectively requires strategy. This article will explore the top five tips to help you make the most of digital avatars in recruitment videos.

What is a Digital Avatar?

A digital avatar is a virtual character generated through AI that interacts with users in real time. Designed to resemble a realistic person or a stylized character, digital avatars are embedded with conversational abilities that make them suitable for interactive roles, like guiding candidates in recruitment videos.

They can welcome applicants, provide information about the company, and answer common questions, giving candidates a feel for the company without requiring live interaction.

The popularity of digital avatars is rising because they offer a flexible, scalable solution for candidate engagement. Instead of scheduling numerous initial calls, hiring teams can use avatars to interact with many candidates simultaneously, saving time and resources.

Additionally, digital avatars add a layer of personalization, giving candidates an engaging experience that stands out from traditional recruitment methods. This approach appeals to today’s tech-savvy candidates who value innovation and convenience in their job search.

Steps In Using Digital Avatars In Recruitment Videos

Our tutorial, "Turn Job Descriptions Into a Recruitment Video With AI," goes into transforming text-based job descriptions into interactive video interviews. By leveraging Simli’s API, you can set up a virtual interviewer that delivers pre-generated questions based on the job description, making it possible to handle multiple applicants simultaneously and reducing recruitment time from weeks to just a few days.

The tutorial covers all the necessary steps, including:

Setting up Simli and OpenAI accounts.
Configuring API keys to connect to these services.
Generating interview questions and audio responses using OpenAI’s chat completions and text-to-speech APIs.
Integrating Simli's API to generate a lip-synced AI avatar that presents questions to applicants.

Head to the complete tutorial for code samples and setup instructions to create your AI-driven recruitment video setup.

5 Tips for Using Digital Avatars in Recruitment Videos

Here are five practical steps to adopt this technology in your hiring videos.

1. Set Clear Objectives

Are you looking to create an engaging pre-interview, introduce candidates to your company culture, or evaluate specific competencies? Defining the purpose of your avatar-based recruitment videos is critical. For instance, if you’re aiming to screen for technical skills, outline what the avatar will communicate about the role and what information it should gather from candidates.

Specific objectives help shape the interview flow and question prompts, which can be tailored into concise, captivating segments delivered by the avatar. Without clear goals, the video may lack focus, affecting candidate engagement and data collection quality.

2. Choose the Right Avatar Personality

Selecting an avatar that aligns with your brand and the job role is key. A straightforward and professional avatar can convey the right tone for technical roles, whereas a more relaxed persona may be better suited for creative roles. Platforms like Simli offer a variety of avatars, and many tools allow for customization down to the voice.

Opt for avatars that fit the technical expectations of the role, ensuring they set the appropriate atmosphere and leave a positive impression on candidates. A misaligned avatar can disrupt the candidate experience and dilute your brand’s message.

3. Use AI to Generate Tailored Interview Questions

Automate interview questions with an AI tool like OpenAI's chat completions API. This approach lets you dynamically create questions based on a job description, making the interaction more personalized.

Feed the avatar prompts that match the technical and soft skills you're seeking, making certain the questions align with the real-world challenges of the role. This step can save time in interview preparation while providing candidates with an authentic glimpse into their potential responsibilities.

4. Integrate Lip-Synced Audio for Realistic Interactions

To create a natural feel in your videos, convert text-based questions to audio, then sync the audio with the avatar's lip movements using a platform like Simli. Simli expects 16 kHz PCM audio, so the audio may need to be downsampled first, a step many tools automate.

Using lip-synced avatars allows candidates to experience a smoother, more lifelike conversation, making the interaction feel less automated and more engaging.

5. Test and Iterate Your Videos

Before launching avatar-led interviews, conduct tests with a diverse group to gather feedback. Watch for awkward pauses, audio misalignments, or unclear prompts, and refine the script based on real interactions. Regularly update the avatar content to reflect current job demands, company culture, and candidate feedback. This iterative process will make your video content relevant and polished, maintaining a professional edge that resonates with your ideal candidates.

Scale Recruitment Efforts With Cost-efficient Digital Avatars

Digital avatars are an accessible, budget-friendly way to scale recruitment. With tools like Simli, which offers pay-as-you-go pricing, you can keep costs low by paying only for what you use.

Simli’s AI Video Avatar platform enables developers to create avatars that mimic human movement and speech convincingly, adding a personal touch to recruitment videos that resonates with candidates.

Beyond hiring, Simli’s avatars can be adapted for employee interview practice, sales support, language lessons, and coaching, making it a versatile tool for various business needs.

Simli provides a demo page to explore character options and offers an open-source project that integrates with OpenAI, providing hands-on guidance for developers. Its starter project lets developers clone and experiment with a digital avatar setup quickly, with 60 free minutes of API use and no credit card requirement. This trial period offers a low-risk way to experiment with the technology before committing to further usage.

Conclusion

Digital avatars reduce the time and stress recruiters spend on traditional interview methods and lower the cost of hiring additional recruiters. Tools like Simli make it easy for companies to add AI avatars to their hiring process in just a few clicks.

Simli has a Discord channel where you can connect with other members, ask questions about issues you encounter with the tool, and learn more about maximizing its features.

How to Build Low-latency AI Video Avatars for Real-time Interactions

Tiioluwani — Wed, 27 Nov 2024 09:48:45 +0000

AI avatars are changing how engaging videos are created, whether for training content, internal team communication, or customer support. Advanced AI tools and video generators now make it possible to create studio-quality, customized videos featuring avatars that can speak multiple languages. An AI video generator enables you to create videos with human-like digital avatars that can connect with a global audience.

In this guide, you'll explore how to build a low-latency AI video avatar using Simli's API and WebRTC. By the end, you’ll be able to create and upload AI-generated videos that sync perfectly with real-time speech, ideal for training, interactive simulations, and presentations. You’ll also be able to generate stunning AI avatars and animated videos in minutes, reaching audiences at scale with human-like, multilingual content.

Prerequisites

To follow this tutorial, you'll need:

A basic understanding of JavaScript and Next.js.
A Simli account. To get started, create a free Simli account.
An OpenAI account (paid version).

The complete source code is available on GitHub.

With these essentials in place, let's now explore the core elements driving low-latency AI avatars and why low latency is required to provide a fluid, realistic user experience.

Understanding the Key Components for Low-Latency AI Avatars

To demonstrate the importance of low latency, imagine a virtual assistant that takes several seconds to respond to a user’s question. Frustrating, right? Another example is an AI instructor in a training video whose lip movements lag behind the audio. Such mismatches undermine the effectiveness of AI interactions, which makes low latency a necessity.

Latency largely determines how interactive and realistic an AI-powered avatar feels. In scenarios like training sessions, live presentations, or customer support, an avatar needs to respond as naturally as a human would. Once latency exceeds roughly 200 milliseconds, responses start to feel delayed and unnatural. Low latency therefore enables smooth, real-time engagement, which leads to higher satisfaction and a better user experience.

Let's look closer at the core technologies and tools behind low-latency AI video avatars and see how each adds to creating an ideal, real-time interaction.

Core Technologies for Building Low-Latency AI Video Avatars

To build a low-latency setup, you’ll use the following technologies and formats:

Simli's API: Simli's API allows you to create lip-synced AI avatars that convert audio into video in real time. This API provides the foundation for transforming audio into synchronized visual avatars quickly.
WebRTC: Handles media stream capture and carries the audio and video over a peer connection.
PCM and Audio Downsampling: To minimize latency in audio transmission, pulse code modulation (PCM) is used for audio data, which WebRTC can handle efficiently. PCM digitizes audio signals by sampling analog waveforms and converting them into uncompressed binary data. Downsampling to 16kHz PCM ensures seamless compatibility with WebRTC and the Simli client, supporting real-time, synchronized audio playback.
OpenAI API: Powers real-time language understanding and response generation, enabling the avatar to engage in dynamic, intelligent interactions.
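
Because PCM is uncompressed, its size follows directly from the sample rate, bit depth, and channel count. The sketch below works through that arithmetic for the 16 kHz, 16-bit mono format used in this guide. The helper name pcmByteLength is hypothetical, and the 24 kHz line assumes 16-bit mono input at the rate that the downsampleAudio call later in this tutorial converts from.

```typescript
// Size of uncompressed PCM audio:
// samples per second x bytes per sample x channels x duration.
function pcmByteLength(
  durationMs: number,
  sampleRate: number = 16000,
  bitsPerSample: number = 16,
  channels: number = 1
): number {
  const bytesPerSample = bitsPerSample / 8;
  return Math.round((durationMs * sampleRate * bytesPerSample * channels) / 1000);
}

console.log(pcmByteLength(1000)); // 32000 bytes for one second of 16 kHz mono PCM16
console.log(pcmByteLength(100)); // 3200 bytes for a 100 ms chunk
console.log(pcmByteLength(1000, 24000)); // 48000 bytes per second at 24 kHz
```

Downsampling from 24 kHz to 16 kHz therefore cuts the amount of audio data sent by a third.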

Having covered the technologies and tools, let's dive into how you can set up your API environment. For the following steps, you will be configuring access to both Simli and OpenAI so you can have smooth integration and interaction within your application.

Set Up Your API Environment

You'll need API keys for both Simli and OpenAI to set up the API environment. To get a Simli API key, go to Simli's official website, click "Get API Access" and create an account.

After logging in, you should see your developer dashboard with your API keys as shown below.

Copy and keep the API key somewhere safe because you will need it to authenticate requests within your application.

With your API key ready, let’s move on to customizing the appearance of your AI avatar. Navigate to the ‘Available Faces’ section in Simli’s API documentation to find listed Avatar face IDs as seen below:

You can also create your own avatar by clicking on the “Create Avatar” Icon on your dashboard. Once you follow the steps highlighted on the page, you’ll be able to create your own Avatar.

To get the ID for each face, copy the random text after the name. For example, the ID for Jenna will be tmp9i8bbq7c. If you created your own avatar, it will come with its own ID. Now that you have the face ID and the Simli API key, let’s create a Next.js app.

Create the Project

Create a Next.js app by running this command in your terminal:

npx create-next-app@latest simli-demo

Answer the following prompts:

What is your project named? my-app
Would you like to use TypeScript? No / Yes
Would you like to use ESLint? No / Yes
Would you like to use Tailwind CSS? No / Yes
Would you like your code inside a `src/` directory? No / Yes
Would you like to use App Router? (recommended) No / Yes
Would you like to use Turbopack for `next dev`?  No / Yes
Would you like to customize the import alias (`@/*` by default)? No / Yes
What import alias would you like configured? @/*

TypeScript and Tailwind are essential for this tutorial, as they simplify styling and provide type safety, making it easier to create a responsive and maintainable avatar interface. Once that is done, install the dependencies for this project. They include:

simli-client: Simli's WebRTC frontend client, used to stream audio to Simli and render the lip-synced avatar.
openai: The main OpenAI SDK, used for accessing OpenAI's services.
@openai/realtime-api-beta: OpenAI's reference client for the Realtime API, installed from GitHub and used for voice conversations.
ws: A WebSocket client and server implementation for real-time communication.

Open your terminal once more and run the command below to install all the dependencies needed for the project.

npm install simli-client openai github:openai/openai-realtime-api-beta ws

Add Environment Variables

Create a .env file in the project root and add your Simli and OpenAI API keys as shown below:

NEXT_PUBLIC_SIMLI_API_KEY="SIMLI-API-KEY"
NEXT_PUBLIC_OPENAI_API_KEY="OPENAI-API-KEY"
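
A missing or empty key usually surfaces later as a confusing connection error, so it can help to check the values up front. The sketch below is one possible guard, not something the Simli or OpenAI SDKs provide; the Env type and the requireEnv helper are hypothetical names.

```typescript
// Hypothetical helper: fail fast when a required variable is missing or empty.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// In the app, build the object from literal process.env.NEXT_PUBLIC_... reads,
// because Next.js only inlines variables that are referenced by their full name.
const demoEnv: Env = { NEXT_PUBLIC_SIMLI_API_KEY: "SIMLI-API-KEY" };
console.log(requireEnv(demoEnv, "NEXT_PUBLIC_SIMLI_API_KEY")); // SIMLI-API-KEY
```

Variables prefixed with NEXT_PUBLIC_ are bundled into browser code, so keys stored this way are visible to anyone using the app. That makes this setup better suited to local experiments than to production.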

Create the VideoBox Components

You'll need to create a component to handle the video feed display and the audio playing. Inside the app folder, create a folder called components, and inside it, create a file called VideoBox.tsx and add the following code:

// app/components/VideoBox.tsx

    // Render a styled video and audio container with auto-play functionality.

    export default function VideoBox(props: any) {
       return (
           <div className="aspect-video flex rounded-sm overflow-hidden items-center h-[350px] w-[350px] justify-center bg-simligray">
               <video ref={props.video} autoPlay playsInline></video>
               <audio ref={props.audio} autoPlay ></audio>
           </div>
       );
    }

The code defines a VideoBox component that displays a video and audio player within a styled container. This component expects two props, video and audio, which are refs passed to the <video> and <audio> elements. Both elements have autoPlay enabled, starting playback immediately, and playsInline is set on the video to prevent it from going fullscreen on mobile devices. The refs (props.video and props.audio) give the parent component direct access to the media elements, so it can attach streams and control playback.

Build the SimliOpenAIPushToTalk Component

Inside the app folder, create a file called SimliOpenAIPushToTalk.tsx. This file will contain much of the core functionality. Start by adding the following code:

// app/SimliOpenAIPushToTalk.tsx

    // import the necessary packages
    ...

    import React, { useCallback, useEffect, useRef, useState } from "react";
    import { RealtimeClient } from "@openai/realtime-api-beta";
    import { SimliClient } from "simli-client";
    import VideoBox from "./components/VideoBox";
    import IconExit from "@/media/IconExit";
    interface SimliOpenAIPushToTalkProps {
      simli_faceid: string;
      openai_voice: "echo" | "alloy" | "shimmer";
      initialPrompt: string;
      onStart: () => void;
      onClose: () => void;
    }
    const simliClient = new SimliClient();
    const SimliOpenAIPushToTalk: React.FC<SimliOpenAIPushToTalkProps> = ({
      simli_faceid,
      openai_voice,
      initialPrompt,
      onStart,
      onClose,
    }) => {
      const [isLoading, setIsLoading] = useState(false);
      const [isAvatarVisible, setIsAvatarVisible] = useState(false);
      const [error, setError] = useState("");
      const [isRecording, setIsRecording] = useState(false);
      const [userMessage, setUserMessage] = useState("...");
      const [isButtonDisabled, setIsButtonDisabled] = useState(false);
      const videoRef = useRef<HTMLVideoElement>(null);
      const audioRef = useRef<HTMLAudioElement>(null);
      const openAIClientRef = useRef<RealtimeClient | null>(null);
      const audioContextRef = useRef<AudioContext | null>(null);
      const streamRef = useRef<MediaStream | null>(null);
      const processorRef = useRef<ScriptProcessorNode | null>(null);
      const audioChunkQueueRef = useRef<Int16Array[]>([]);
      const isProcessingChunkRef = useRef(false);
    ...

In the code above, the necessary packages and hooks are imported, the SimliOpenAIPushToTalkProps interface is defined to describe the props the component requires, and an instance of SimliClient is created to manage multimedia interactions.

References are also set up for video and audio elements, the OpenAI client, audio context, media stream, and audio processor, allowing direct control over multimedia resources within the component.

Next, add the following code:

// app/SimliOpenAIPushToTalk.tsx
    ...
    // initializing the SimliClient
    ...

    const initializeSimliClient = useCallback(() => {
        if (videoRef.current && audioRef.current) {
          const SimliConfig = {
            apiKey: process.env.NEXT_PUBLIC_SIMLI_API_KEY,
            faceID: simli_faceid,
            handleSilence: true,
            videoRef: videoRef,
            audioRef: audioRef,
          };
          simliClient.Initialize(SimliConfig as any);
          console.log("Simli Client initialized");
        }
      }, [simli_faceid]);
      const initializeOpenAIClient = useCallback(async () => {
        try {
          console.log("Initializing OpenAI client...");
          openAIClientRef.current = new RealtimeClient({
            apiKey: process.env.NEXT_PUBLIC_OPENAI_API_KEY,
            dangerouslyAllowAPIKeyInBrowser: true,
          });
          await openAIClientRef.current.updateSession({
            instructions: initialPrompt,
            voice: openai_voice,
            turn_detection: { type: "server_vad" },
            input_audio_transcription: { model: "whisper-1" },
          });
          openAIClientRef.current.on(
            "conversation.updated",
            handleConversationUpdate
          );
          openAIClientRef.current.on(
            "input_audio_buffer.speech_stopped",
            handleSpeechStopped
          );
          await openAIClientRef.current.connect();
          console.log("OpenAI Client connected successfully");
          setIsAvatarVisible(true);
        } catch (error: any) {
          console.error("Error initializing OpenAI client:", error);
          setError(`Failed to initialize OpenAI client: ${error.message}`);
        }
      }, [initialPrompt]);
      const handleConversationUpdate = useCallback((event: any) => {
        console.log("Conversation updated:", event);
        const { item, delta } = event;
        if (item.type === "message" && item.role === "assistant") {
          console.log("Assistant message detected");
          if (delta && delta.audio) {
            const downsampledAudio = downsampleAudio(delta.audio, 24000, 16000);
            audioChunkQueueRef.current.push(downsampledAudio);
            if (!isProcessingChunkRef.current) {
              processNextAudioChunk();
            }
          }
        } else if (item.type === "message" && item.role === "user") {
          setUserMessage(item.content[0].transcript);
        }
      }, []);
    ...

The code above does the following:

Initializes the SimliClient with configuration parameters, including the API key, the face ID (simli_faceid) that selects which avatar to render, and the video and audio refs. With this in place, Simli can lip-sync the avatar's video to the audio it receives in real time.
Sets up and connects the OpenAI client. The initializeOpenAIClient function creates the RealtimeClient with the provided API key and session settings, then registers event listeners that handle conversation updates and detect when speech stops.
Handles real-time updates from the conversation, processing both assistant messages and user inputs. For audio responses from the assistant, the code downsamples each chunk from 24 kHz to 16 kHz and adds it to a queue. For user messages, it stores the transcript in state.
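
The queue-and-flag pattern in handleConversationUpdate can be sketched in isolation. This is a simplified illustration rather than the component's code: the ChunkQueue class and its send callback are hypothetical stand-ins for audioChunkQueueRef, isProcessingChunkRef, and the call that forwards audio to Simli.

```typescript
// Hypothetical stand-in for the refs used in the component.
class ChunkQueue {
  private queue: Int16Array[] = [];
  private processing = false;
  private send: (chunk: Int16Array) => void;

  constructor(send: (chunk: Int16Array) => void) {
    this.send = send;
  }

  // Called for every incoming audio delta.
  push(chunk: Int16Array): void {
    this.queue.push(chunk);
    if (!this.processing) {
      this.drain();
    }
  }

  // Sends queued chunks in arrival order; the flag prevents re-entrant drains.
  private drain(): void {
    this.processing = true;
    try {
      let next = this.queue.shift();
      while (next !== undefined) {
        this.send(next);
        next = this.queue.shift();
      }
    } finally {
      this.processing = false;
    }
  }
}

const sentLengths: number[] = [];
const demoQueue = new ChunkQueue((chunk) => {
  sentLengths.push(chunk.length);
});
demoQueue.push(new Int16Array(160));
demoQueue.push(new Int16Array(320));
console.log(sentLengths.join(",")); // 160,320
```

The sketch drains with a loop instead of recursion, which keeps the call stack flat when many chunks arrive at once.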

Directly after handleConversationUpdate, add the following code:

// app/SimliOpenAIPushToTalk.tsx
    ...
    // process audio chunks and send to Simli
    ...

    const processNextAudioChunk = useCallback(() => {
        if (
          audioChunkQueueRef.current.length > 0 &&
          !isProcessingChunkRef.current
        ) {
          isProcessingChunkRef.current = true;
          const audioChunk = audioChunkQueueRef.current.shift();
          if (audioChunk) {
            // Duration of this chunk at a 16 kHz sample rate
            const chunkDurationMs = (audioChunk.length / 16000) * 1000;

            simliClient?.sendAudioData(audioChunk as any);
            console.log(
              "Sent audio chunk to Simli:",
              audioChunk.length,
              "samples. Duration:",
              chunkDurationMs.toFixed(2),
              "ms"
            );
            isProcessingChunkRef.current = false;
            processNextAudioChunk();
          }
        }
      }, []);

      const handleSpeechStopped = useCallback((event: any) => {
        console.log("Speech stopped event received", event);
      }, []);

      const downsampleAudio = (
        audioData: Int16Array,
        inputSampleRate: number,
        outputSampleRate: number
      ): Int16Array => {
        if (inputSampleRate === outputSampleRate) {
          return audioData;
        }
        const ratio = inputSampleRate / outputSampleRate;
        const newLength = Math.round(audioData.length / ratio);
        const result = new Int16Array(newLength);
        for (let i = 0; i < newLength; i++) {
          const index = Math.round(i * ratio);
          result[i] = audioData[index];
        }
        return result;
      };
    ...

Here:

The processNextAudioChunk function takes the assistant's audio chunks off the queue and sends them to Simli immediately.
The downsampleAudio function downsamples 16-bit PCM audio from 24 kHz to 16 kHz. PCM is a raw, uncompressed format, which helps keep latency low because the audio can be sent to SimliClient without an encoding step.
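
To make the resampling step concrete, here is a minimal standalone sketch of nearest-sample downsampling in the same spirit as downsampleAudio. The name downsampleNearest and the index clamp are illustrative additions rather than part of the tutorial code; the clamp only guards against reading past the end of the input when other sample-rate ratios are used.

```typescript
// Illustrative sketch: nearest-sample downsampling of 16-bit PCM.
// `downsampleNearest` is a hypothetical name; the clamp is an added safeguard.
function downsampleNearest(
  input: Int16Array,
  inputRate: number,
  outputRate: number
): Int16Array {
  if (inputRate === outputRate) {
    return input;
  }
  const ratio = inputRate / outputRate; // 24000 / 16000 = 1.5
  const outLength = Math.round(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Pick the closest source sample, never reading past the last one.
    const index = Math.min(input.length - 1, Math.round(i * ratio));
    out[i] = input[index];
  }
  return out;
}

const sixSamples = Int16Array.from([0, 10, 20, 30, 40, 50]);
const fourSamples = downsampleNearest(sixSamples, 24000, 16000);
// fourSamples holds [0, 20, 30, 50]: six samples at 24 kHz become four at 16 kHz.
```

Picking the nearest sample is cheap, but it applies no low-pass filter, so some aliasing is possible. For a speech demo that trade-off is usually acceptable.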

After the downsampleAudio function, add the following code:

// app/SimliOpenAIPushToTalk.tsx
    ...

    // Manages audio recording setup and cleanup for real-time processing.
    ...

    const startRecording = useCallback(async () => {
        if (!audioContextRef.current) {
          audioContextRef.current = new AudioContext({ sampleRate: 24000 });
        }
        try {
          console.log("Starting audio recording...");
          streamRef.current = await navigator.mediaDevices.getUserMedia({
            audio: true,
          });
          const source = audioContextRef.current.createMediaStreamSource(
            streamRef.current
          );
          processorRef.current = audioContextRef.current.createScriptProcessor(
            2048,
            1,
            1
          );
          processorRef.current.onaudioprocess = (e) => {
            const inputData = e.inputBuffer.getChannelData(0);
            const audioData = new Int16Array(inputData.length);
            for (let i = 0; i < inputData.length; i++) {
              audioData[i] = Math.max(
                -32768,
                Math.min(32767, Math.floor(inputData[i] * 32768))
              );
            }
            openAIClientRef.current?.appendInputAudio(audioData);
          };
          source.connect(processorRef.current);
          processorRef.current.connect(audioContextRef.current.destination);
          setIsRecording(true);
          console.log("Audio recording started");
        } catch (err) {
          console.error("Error accessing microphone:", err);
          setError("Error accessing microphone. Please check your permissions.");
        }
      }, []);

      const stopRecording = useCallback(() => {
        if (processorRef.current) {
          processorRef.current.disconnect();
          processorRef.current = null;
        }
        if (streamRef.current) {
          streamRef.current.getTracks().forEach((track) => track.stop());
          streamRef.current = null;
        }
        setIsRecording(false);
        console.log("Audio recording stopped");
      }, []);
    ...

The startRecording function accesses the microphone in real time using getUserMedia, captures audio with AudioContext, processes it as 16-bit PCM, and sends it to the OpenAI client. The stopRecording function stops the stream and releases resources. This setup uses AudioContext for real-time audio, while WebRTC configuration is managed by the Simli client.
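
The float-to-PCM conversion inside onaudioprocess can be looked at in isolation. The sketch below mirrors that loop as a pure function; the name floatToPcm16 is hypothetical and used only for illustration.

```typescript
// Illustrative sketch: convert Web Audio float samples (-1..1) to 16-bit PCM.
// `floatToPcm16` is a hypothetical helper name, not part of the tutorial code.
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Scale to the 16-bit range, then clamp so +1.0 does not overflow to 32768.
    const scaled = Math.floor(samples[i] * 32768);
    pcm[i] = Math.max(-32768, Math.min(32767, scaled));
  }
  return pcm;
}

const examplePcm = floatToPcm16(Float32Array.from([0, 0.5, 1, -1]));
// examplePcm holds [0, 16384, 32767, -32768]
```

The clamp matters because a full-scale sample of 1.0 scales to 32768, which is one above the largest value an Int16Array element can hold.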

After the stopRecording function, paste the following code:


// app/SimliOpenAIPushToTalk.tsx
...
// Control interaction, push-to-talk, and audio visualization.

...

const handleStart = useCallback(async () => {
    setIsLoading(true);
    setError("");
    onStart();
    try {
      await simliClient?.start();
      await initializeOpenAIClient();
    } catch (error: any) {
      console.error("Error starting interaction:", error);
      setError(`Error starting interaction: ${error.message}`);
    } finally {
      setIsAvatarVisible(true);
      setIsLoading(false);
    }
  }, [initializeOpenAIClient, onStart]);
  const handleStop = useCallback(() => {
    console.log("Stopping interaction...");
    setIsLoading(false);
    setError("");
    setIsAvatarVisible(false);
    simliClient?.close();
    openAIClientRef.current?.disconnect();
    if (audioContextRef.current) {
      audioContextRef.current.close();
    }
    stopRecording();
    onClose();
    console.log("Interaction stopped");

    window.location.reload();
  }, [stopRecording]);

  const handlePushToTalkStart = useCallback(() => {
    if (!isButtonDisabled) {
      startRecording();

      simliClient?.ClearBuffer();
      openAIClientRef.current?.cancelResponse("");
    }
  }, [startRecording, isButtonDisabled]);
  const handlePushToTalkEnd = useCallback(() => {
    setTimeout(() => {
      stopRecording();
    }, 500);
  }, [stopRecording]);

  const AudioVisualizer = () => {
    const [volume, setVolume] = useState(0);
    useEffect(() => {
      const interval = setInterval(() => {
        setVolume(Math.random() * 100);
      }, 100);
      return () => clearInterval(interval);
    }, []);
    return (
      <div className="flex items-end justify-center space-x-1 h-5">
        {[...Array(5)].map((_, i) => (
          <div
            key={i}
            className="w-2 bg-black transition-all duration-300 ease-in-out"
            style={{
              height: `${Math.min(100, volume + Math.random() * 20)}%`,
            }}
          />
        ))}
      </div>
    );
  };

The code above does the following:

Starts the Simli and OpenAI clients through the handleStart function, which also resets the error state and sets isLoading to true while the clients start.
Stops recording, disconnects both clients, and resets the component state through the handleStop function.
Manages push-to-talk through handlePushToTalkStart and handlePushToTalkEnd, which start and stop recording as the user presses and releases the button.
Displays an animated audio visualizer through the AudioVisualizer component. In this demo, the bars are driven by random values rather than the actual microphone level.
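
The AudioVisualizer in the code above animates its bars with Math.random. If you wanted the bars to follow the real input level instead, one possible approach is to compute an RMS level from each block of microphone samples. The sketch below is an assumption about how that could look; rmsLevelPercent is a hypothetical helper and is not wired into the tutorial component.

```typescript
// Illustrative sketch: map a block of float samples (-1..1) to a 0-100 level.
// `rmsLevelPercent` is a hypothetical helper name.
function rmsLevelPercent(samples: Float32Array): number {
  if (samples.length === 0) {
    return 0;
  }
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  // Root mean square of the block, scaled to a percentage and capped at 100.
  const rms = Math.sqrt(sumOfSquares / samples.length);
  return Math.min(100, rms * 100);
}

const quietBlock = new Float32Array(128); // all zeros
const loudBlock = Float32Array.from([0.5, -0.5, 0.5, -0.5]);
const quietLevel = rmsLevelPercent(quietBlock); // 0
const loudLevel = rmsLevelPercent(loudBlock); // 50
```

In the component, a value like this could be computed inside onaudioprocess from inputData and stored in state for the visualizer to read.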

Following the AudioVisualizer component, add the following code:

// app/SimliOpenAIPushToTalk.tsx
    ...
    // Initialize Simli client and render video and push-to-talk UI.

    useEffect(() => {
        initializeSimliClient();
        if (simliClient) {
          simliClient?.on("connected", () => {
            console.log("SimliClient connected");
            const audioData = new Uint8Array(6000).fill(0);
            simliClient?.sendAudioData(audioData);
            console.log("Sent initial audio data");
          });
        }
        return () => {
          try {
            simliClient?.close();
            openAIClientRef.current?.disconnect();
            if (audioContextRef.current) {
              audioContextRef.current.close();
            }
          } catch {}
        };
      }, [initializeSimliClient]);
      return (
        <>
          <div className="transition-all duration-300 ">
            <VideoBox video={videoRef} audio={audioRef} />
          </div>
          <div className="flex flex-col items-center">
            {!isAvatarVisible ? (
              <button
                onClick={handleStart}
                disabled={isLoading}
                className="w-full h-[52px] mt-4 disabled:bg-gray-600 disabled:text-white disabled:hover:rounded-full bg-green-500 text-white py-3 px-6 rounded-full transition-all duration-300 hover:text-black hover:bg-white hover:rounded flex justify-center items-center"
              >
                {isLoading ? (
                  <span>Loading...</span>
                ) : (
                  <span className="font-abc-repro-mono font-bold w-[164px]">
                    Test Interaction
                  </span>
                )}
              </button>
            ) : (
              <>
                <div className="flex items-center gap-4 w-full">
                  <button
                    onMouseDown={handlePushToTalkStart}
                    onTouchStart={handlePushToTalkStart}
                    onMouseUp={handlePushToTalkEnd}
                    onTouchEnd={handlePushToTalkEnd}
                    onMouseLeave={handlePushToTalkEnd}
                    disabled={isButtonDisabled}
                    className={`mt-4 text-white flex-grow bg-green-500 hover:rounded hover:bg-opacity-70 h-[52px] px-6 rounded-full transition-all duration-300 ${
                      isRecording ? "bg-gray-900 rounded hover:bg-opacity-100" : ""
                    }`}
                  >
                    <span className="font-abc-repro-mono font-bold w-[164px]">
                      {isRecording ? "Release to Stop" : "Push & hold to talk"}
                    </span>
                  </button>
                  <button
                    onClick={handleStop}
                    className=" group w-[52px] h-[52px] flex items-center mt-4 bg-red text-white justify-center rounded-[100px] backdrop-blur transition-all duration-300 hover:bg-white hover:text-black hover:rounded-sm"
                  >
                    <IconExit className="group-hover:invert-0 group-hover:brightness-0 transition-all duration-300" />
                  </button>
                </div>
              </>
            )}
          </div>
        </>
      );
    };
    export default SimliOpenAIPushToTalk;

In the code above:

The useEffect hook runs when the component mounts to initialize simliClient, and sets up cleanup that closes simliClient, disconnects the OpenAI client, and closes audioContextRef when the component unmounts.
The rendered UI:
- Displays the VideoBox component, which receives the video and audio refs used for media playback.
- Shows a Test Interaction button that triggers handleStart. Once the avatar is visible, it shows a push-to-talk button that starts recording on press and stops it on release, along with an exit button that triggers handleStop.
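
As a quick sanity check on the initial buffer sent in the connected handler above: assuming it is interpreted as mono PCM16 at 16 kHz (an assumption made here for illustration), 6000 zero bytes correspond to a short stretch of silence. The helper name pcm16DurationMs is hypothetical.

```typescript
// Illustrative sketch: duration of a mono PCM16 buffer, given its size in bytes.
// Assumes 2 bytes per sample (16-bit) and a single channel.
function pcm16DurationMs(byteLength: number, sampleRate: number): number {
  const bytesPerSample = 2;
  const sampleCount = byteLength / bytesPerSample;
  return (sampleCount / sampleRate) * 1000;
}

const initialSilenceMs = pcm16DurationMs(6000, 16000);
// 6000 bytes -> 3000 samples -> 187.5 ms of silence at 16 kHz
```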

Now that the core components and UI elements are set up, let’s integrate them into the main demo component.

Set Up Demo Component
Navigate to page.tsx and replace it with the following code:

    // app/page.tsx

    // render the AI avatar with push-to-talk functionality
    "use client";
    import React, { useEffect, useState } from "react";
    import SimliOpenAIPushToTalk from "./SimliOpenAIPushToTalk";
    interface avatarSettings {
      name: string;
      openai_voice: "echo" | "alloy" | "shimmer";
      simli_faceid: string;
      initialPrompt: string;
    }
    const avatar: avatarSettings = {
      name: "Frank",
      openai_voice: "echo",
      simli_faceid: "5514e24d-6086-46a3-ace4-6a7264e5cb7c",
      initialPrompt:
        "You are a helpful AI assistant named Frank. You are friendly and concise in your responses. Your task is to help users with any questions they might have. Your answers are short and to the point, don't give long answers be brief and straightforward.",
    };
    const Demo: React.FC = () => {
      const [interactionMode, setInteractionMode] = useState<
        "pushToTalk" | undefined
      >(undefined);
      useEffect(() => {
        const storedInteractionMode = localStorage.getItem("interactionMode");
        if (storedInteractionMode === "pushToTalk") {
          setInteractionMode("pushToTalk");
        }
      }, []);
      const onStart = () => {
        console.log("Setting setshowDottedface to false...");
      };
      const onClose = () => {
        console.log("Setting setshowDottedface to true...");
      };
      return (
        <div className="bg-black min-h-screen flex flex-col items-center font-abc-repro font-normal text-sm text-white p-8">
          <div className="flex flex-col items-center gap-6 bg-effect15White p-6 pb-[40px] rounded-xl w-full">
            <div>
              <SimliOpenAIPushToTalk
                openai_voice={avatar.openai_voice}
                simli_faceid={avatar.simli_faceid}
                initialPrompt={avatar.initialPrompt}
                onStart={onStart}
                onClose={onClose}
              />
            </div>
          </div>
        </div>
      );
    };
    export default Demo;

In the above code, you created a component that renders the AI avatar with push-to-talk functionality and defines the avatar's settings. You can customize your avatar here by changing simli_faceid to any face you want for interaction. Simli supports a range of custom avatar faces that you can apply to various use cases.

Set Up Layout
The last thing you’ll need to do is edit the layout. In layout.tsx, paste the following code:

 // app/layout.tsx

    // edit the layout
    import type { Metadata } from "next";
    import { Inter } from "next/font/google";
    import { abcRepro, abcReproMono } from './fonts/fonts';
    import "./globals.css";

    const inter = Inter({ subsets: ["latin"] });
    export const metadata: Metadata = {
     title: "Simli App",
     description: "create-simli-app (OpenAI)",
    };
    export default function RootLayout({
     children,
    }: Readonly<{
     children: React.ReactNode;
    }>) {
     return (
       <html lang="en" className={`${abcReproMono.variable} ${abcRepro.variable}`}>
         <body className={inter.className}>{children}</body>
       </html>
     );
    }

Test the application by running the following command on your terminal:

npm run dev

Check the video demonstration of the project to see it in action. You can also explore the Simli repository to build avatars and work on more projects using the Simli API.

Conclusion

In this project, you have implemented a low-latency AI video avatar for realistic and dynamic interactions. Simli's API with WebRTC allows developers to deploy and manage low-latency avatars with synchronized audio and video, which are well suited to a wide range of applications, such as training simulations, interactive presentations, customer support, and real-time virtual assistants.

This setup opens the door to engaging, human-like interactions across various settings.

Explore the documentation to learn more about the features and tools Simli offers.

Turn Job Descriptions Into Recruitment Video With AI

Dalu46 — Tue, 19 Nov 2024 13:28:16 +0000

According to Statistics by the Genius, 60% of job seekers quit because the interview process is too long or complicated, and the average process takes about 23 days.

The traditional recruitment process is time-consuming and prone to bias. Instead of interviewing each candidate individually, you can use an AI avatar to simultaneously conduct interviews with all candidates. This way, you no longer need to wait as long as 23 days as the recruitment process can be completed in two to three days.

This article will guide you through building a recruitment AI video avatar that will interview prospective candidates using Simli’s API. It will show how to build an application that can turn job descriptions into interactive interviews, streamlining your hiring process and improving the candidate experience.

Prerequisites

To follow along with this tutorial, make sure you have the following:

An understanding of JavaScript and React.
Node and the Node Package Manager installed.
A Simli account. To get started, create a free Simli account.
An OpenAI account.

You can find the complete source code on GitHub.

Generate Interview Questions With OpenAI

OpenAI provides APIs that use a large language model trained on large quantities of data to generate text from a prompt. One of these APIs is the chat completions endpoint. By using this endpoint and providing a job description as a prompt, you can generate customized interview questions.

Once the questions are generated, the next step is to convert them to audio so they can be sent to the SimliClient API, which will generate a lip-synced AI avatar to interview the applicant. Simli is an AI video generator that provides you with a speech-to-video API to create video AI avatars.

Also, OpenAI provides an audio API with a speech endpoint based on its TTS (text-to-speech) model. This speech endpoint requires three key inputs: the model, the text to be converted into audio, and the voice for audio generation. This functionality will be used to convert the generated questions to audio.
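
The two requests described above can be sketched as plain payload builders. This is only an illustration of the request shapes: the helper names, the prompt wording, and the model and voice values ("gpt-4o-mini", "tts-1", "alloy") are assumptions chosen for the example, and no network call is made here.

```typescript
// Illustrative sketch: request bodies for the chat completions and speech endpoints.
// Helper names, prompt text, and model/voice choices are assumptions.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
}

interface SpeechRequest {
  model: string; // the TTS model
  voice: string; // the voice used for audio generation
  input: string; // the text to convert into audio
}

function buildQuestionRequest(
  jobDescription: string,
  questionCount: number
): ChatRequest {
  return {
    model: "gpt-4o-mini", // assumed model name
    messages: [
      {
        role: "system",
        content: "You are a recruiter who writes concise interview questions.",
      },
      {
        role: "user",
        content: `Write ${questionCount} interview questions for this role:\n${jobDescription}`,
      },
    ],
  };
}

function buildSpeechRequest(text: string): SpeechRequest {
  return { model: "tts-1", voice: "alloy", input: text };
}

const questionRequest = buildQuestionRequest("Senior React developer", 5);
const speechRequest = buildSpeechRequest("Tell me about your last project.");
```

In a real application, these objects would be passed to the OpenAI client's chat completions and speech methods, ideally from server-side code so that the API key stays private.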

Here’s How The Application Will Work

The recruiter enters a job description that is sent to OpenAI’s chat completions API.
The API returns a list of interview questions as text.
These questions are then sent to the speech endpoint, where they are converted into audio.
Finally, the audio is sent to Simli’s API, which generates a realistic, lip-synced video avatar to deliver the questions, creating an interactive and engaging interview experience for candidates.

Set Up Your API Environment

Simli’s endpoint requires an API key, which you can get by creating a Simli account. Once you’ve successfully created an account, you will be redirected to the user profile dashboard, where you can generate your API key and track your API usage.
Click the copy button and save it.

Next, create an account on OpenAI to receive the API key. Once you create an account, you’ll be redirected to the profile dashboard. On the dashboard, navigate to the API keys section.

In the API keys section, click Create new secret key, give the key a name, and click the Create secret key button.

Your secret key will now be generated. Be sure to copy and store it in a secure location to retrieve it later easily.

Note: The OpenAI real-time API is in beta and only available for paying users.

With your API keys ready, you can build your AI-driven interview experience. The next step is selecting an AI avatar that aligns with your brand and hiring role.

Choosing Your AI Avatar for Recruitment

When selecting an avatar for your AI recruiter, consider the desired brand image and the role you're hiring for. This tutorial will use the 'Franco' avatar, which was randomly chosen. You can explore Simli's library of avatars to find the perfect fit for your needs.

Simli has a create avatar tool that allows users to create custom avatars by uploading images. Consider using this feature if none of the available faces suit your needs.

Here’s a picture of Franco in the red box:

Once you’ve selected your avatar, let’s bring it to life by building a Next.js application.

Setting Up the Next.js Project

To get started, create a Next.js app by running the following command:

npx create-next-app@latest interview-simli

This command will prompt a few questions about configuring the Next.js application. Here's the response to each question:

Next, navigate to the application and install dependencies. Run the following command:

cd interview-simli
npm install simli-client openai github:openai/openai-realtime-api-beta

The SimliClient is a tool for integrating real-time audio and video streaming capabilities into your web applications using WebRTC.

Once the project is set up, run the development server:

npm run dev

Your Next.js project should now be running at http://localhost:3000.

In your project's root directory, create a .env file and store the Simli and OpenAI API key credentials as shown below.

NEXT_PUBLIC_SIMLI_API_KEY="your simli api key"
NEXT_PUBLIC_OPENAI_API_KEY="your openai api key"
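
A missing key tends to surface later as a confusing connection error, so it can help to fail early. The sketch below shows one way to do that; requireSetting is a hypothetical helper, not part of the Simli or OpenAI libraries. Keep in mind that Next.js inlines variables prefixed with NEXT_PUBLIC_ into the browser bundle, so these keys are visible to anyone using the app.

```typescript
// Illustrative sketch: fail fast when a required setting is missing or empty.
// `requireSetting` is a hypothetical helper name.
function requireSetting(value: string | undefined, name: string): string {
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required setting: ${name}`);
  }
  return value;
}

// Usage inside the app (commented out so the sketch runs anywhere):
// const simliKey = requireSetting(
//   process.env.NEXT_PUBLIC_SIMLI_API_KEY,
//   "NEXT_PUBLIC_SIMLI_API_KEY"
// );
```

The environment variable is passed in as a value rather than looked up by name because Next.js only replaces statically written process.env.NEXT_PUBLIC_... references in client-side code.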

Create Real-time Video Interactions With Applicants

In your project, navigate to the src/app folder, create a components folder, and create an Interview.js file inside it. This component will set up the interactive interview interface where users can initiate and respond to interview questions generated by an AI avatar.

First, you need to declare state variables and references that help control and monitor different aspects of the component. To do so, paste the following code snippet:

// src/app/components/Interview.js
// State management
//...
const [isLoading, setIsLoading] = useState(false);
const [isAvatarVisible, setIsAvatarVisible] = useState(false);
const [error, setError] = useState("");
const [isRecording, setIsRecording] = useState(false);
const [userMessage, setUserMessage] = useState("...");
// Refs for various components and states
const videoRef = useRef(null);
const audioRef = useRef(null);
const openAIClientRef = useRef(null);
const audioContextRef = useRef(null);
const streamRef = useRef(null);
const processorRef = useRef(null);
// New refs for managing audio chunk delay
const audioChunkQueueRef = useRef([]);
const isProcessingChunkRef = useRef(false);
//...

This code block declares state variables to track the isLoading, isAvatarVisible, error, isRecording, and userMessage states.
It also creates multiple refs. Let’s look at what each one does:

videoRef and audioRef provide direct access to the video and audio elements. They are required for configuring the SimliClient.
openAIClientRef holds the OpenAI client instance, and streamRef holds the microphone stream.
audioContextRef and processorRef manage audio processing and encoding, which is critical for capturing and sending audio data.
audioChunkQueueRef is a buffer for audio chunks, queuing them until they’re ready to be sent so that playback stays seamless, and isProcessingChunkRef tracks whether a chunk is currently being sent. The SimliClient requires the audio to be sent in chunks in PCM16 format, with a 16 kHz sample rate.
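
To illustrate what sending audio "in chunks" can look like, here is a small sketch that splits a PCM16 buffer into fixed-size pieces. The helper name splitIntoChunks and the chunk size of 3200 samples (200 ms at 16 kHz) are assumptions for the example, not values required by Simli.

```typescript
// Illustrative sketch: split PCM16 samples into fixed-size chunks.
// `splitIntoChunks` and the chunk size used below are assumptions.
function splitIntoChunks(
  samples: Int16Array,
  chunkSamples: number
): Int16Array[] {
  if (chunkSamples <= 0) {
    throw new Error("chunkSamples must be positive");
  }
  const chunks: Int16Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSamples) {
    // subarray creates a view without copying; the final chunk may be shorter.
    chunks.push(samples.subarray(start, start + chunkSamples));
  }
  return chunks;
}

const oneSecondOfAudio = new Int16Array(16000); // 1 s at 16 kHz
const audioChunks = splitIntoChunks(oneSecondOfAudio, 3200);
// audioChunks.length is 5, each chunk covering 200 ms
```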

Simli Client Initialization

Now, let’s initialize the Simli client. Paste the following code snippet inside the Interview.js file:

// src/app/components/Interview.js
...
// Initializes the Simli client with the provided configuration
...
const initializeSimliClient = useCallback(() => {
   if (videoRef.current && audioRef.current) {
     const SimliConfig = {
       apiKey: process.env.NEXT_PUBLIC_SIMLI_API_KEY,
       faceID: simli_faceid,
       handleSilence: true,
       maxSessionLength: 60, // in seconds
       maxIdleTime: 60, // in seconds
       videoRef: videoRef,
       audioRef: audioRef,
     };
     simliClient.Initialize(SimliConfig);
     console.log("Simli Client initialized");
   }
 }, [simli_faceid]);
//...

The code block above configures and initializes the SimliClient. Let’s break down each property of the SimliConfig object:

apiKey: Your Simli API key.
faceID: Represents the avatar face ID that will be rendered in the video stream.
handleSilence: This boolean indicates whether the client should handle silent moments in the audio stream (e.g., muting or pausing the video if no audio is detected).
maxSessionLength: Sets the maximum session length (in seconds).
maxIdleTime: Sets the maximum idle time (in seconds). With the value used above, the session disconnects after 60 seconds without activity.
videoRef and audioRef: These are references to the video and audio elements where the media streams will be displayed in the browser.

OpenAI Client Initialization

The next step is to initialize the OpenAI client. To do so, paste the following code inside the Interview.js file:

// src/app/components/Interview.js
...
// Initializes the OpenAI client, sets up event listeners, and connects to the API
...
const initializeOpenAIClient = useCallback(async () => {
  try {
    console.log("Initializing OpenAI client...");
    openAIClientRef.current = new RealtimeClient({
      apiKey: process.env.NEXT_PUBLIC_OPENAI_API_KEY,
      dangerouslyAllowAPIKeyInBrowser: true,
    });
    await openAIClientRef.current.updateSession({
      instructions: initialPrompt,
      voice: openai_voice,
      turn_detection: { type: "server_vad" },
      input_audio_transcription: { model: "whisper-1" },
    });
    // Set up event listeners
    openAIClientRef.current.on(
      "conversation.updated",
      handleConversationUpdate
    );
    openAIClientRef.current.on("conversation.interrupted", () => {
      interruptConversation();
    });
    openAIClientRef.current.on(
      "input_audio_buffer.speech_stopped",
      handleSpeechStopped
    );
    // openAIClientRef.current.on('response.canceled', handleResponseCanceled);
    await openAIClientRef.current.connect();
    console.log("OpenAI Client connected successfully");
    setIsAvatarVisible(true);
  } catch (error) {
    console.error("Error initializing OpenAI client:", error);
    setError(`Failed to initialize OpenAI client: ${error.message}`);
  }
}, [initialPrompt]);
//...

The initializeOpenAIClient function initializes the OpenAI client, which will handle real-time conversations with the applicant. The client is created with an API key, and its session is configured with the initial prompt (passed as instructions), the voice, server-side voice activity detection, and Whisper transcription of the input audio. After that, event listeners are registered to handle conversation updates, interruptions, and the end of user speech. Once the client connects, isAvatarVisible is set to true, which makes the avatar appear in the user interface.

Note: Setting dangerouslyAllowAPIKeyInBrowser to true is meant for development or prototyping only, because it exposes your OpenAI API key to client-side code. In a production environment, API calls are better handled on the server side, for example through a Next.js API route, which keeps your key hidden.

Audio Processing and Sending

You need a function that downsamples the assistant's PCM audio from 24 kHz to the 16 kHz that Simli expects. To create it, paste the following code inside the Interview.js file:

// src/app/components/Interview.js
...
// Downsamples audio data from one sample rate to another
...
const downsampleAudio = (
  audioData,
  inputSampleRate,
  outputSampleRate
) => {
  if (inputSampleRate === outputSampleRate) {
    return audioData;
  }
  const ratio = inputSampleRate / outputSampleRate;
  const newLength = Math.round(audioData.length / ratio);
  const result = new Int16Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const index = Math.round(i * ratio);
    result[i] = audioData[index];
  }
  return result;
 };
 //...

Sending Audio Data To Simli

Next, create a function that sends the queued audio data to SimliClient for playback. Paste the following code snippet:

// src/app/components/Interview.js
...
// Processes the next audio chunk in the queue.
...

const processNextAudioChunk = useCallback(() => {
  if (
    audioChunkQueueRef.current.length > 0 &&
    !isProcessingChunkRef.current
  ) {
    isProcessingChunkRef.current = true;
    const audioChunk = audioChunkQueueRef.current.shift();
    if (audioChunk) {
      const chunkDurationMs = (audioChunk.length / 16000) * 1000; // Calculate chunk duration in milliseconds
      // Send audio chunks to Simli immediately
      simliClient?.sendAudioData(audioChunk);
      console.log(
        "Sent audio chunk to Simli:",
        audioChunk.length,
        "samples. Duration:",
        chunkDurationMs.toFixed(2),
        "ms"
      );
      isProcessingChunkRef.current = false;
      processNextAudioChunk();
    }
  }
 }, []);
//...

The processNextAudioChunk function checks whether there are any chunks in audioChunkQueueRef. If so, it removes the next chunk from the queue and sends it to Simli for playback. The isProcessingChunkRef flag ensures that only one chunk is sent at a time, providing a smooth playback experience for the user without overlapping audio. The function then calls itself recursively to process the next chunk in the queue.
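
The same queue-and-flag pattern can be shown outside React. The sketch below is a simplified, iterative variant of the idea, with a stand-in send callback in place of simliClient.sendAudioData; createChunkQueue is a hypothetical name used only for this illustration.

```typescript
// Illustrative sketch: a chunk queue that sends one chunk at a time, in order.
// `createChunkQueue` is hypothetical; `send` stands in for the real client call.
function createChunkQueue(send: (chunk: Int16Array) => void) {
  const queue: Int16Array[] = [];
  let draining = false;

  function drain(): void {
    if (draining) {
      return; // a drain is already running, it will pick up the new chunk
    }
    draining = true;
    while (queue.length > 0) {
      const next = queue.shift();
      if (next) {
        send(next);
      }
    }
    draining = false;
  }

  return {
    enqueue(chunk: Int16Array): void {
      queue.push(chunk);
      drain();
    },
    pending(): number {
      return queue.length;
    },
  };
}

const sentDurationsMs: number[] = [];
const chunkQueue = createChunkQueue((chunk) => {
  // Record the duration of each sent chunk at 16 kHz.
  sentDurationsMs.push((chunk.length / 16000) * 1000);
});
chunkQueue.enqueue(new Int16Array(1600)); // 100 ms
chunkQueue.enqueue(new Int16Array(3200)); // 200 ms
```

A loop avoids deep recursion when many chunks are queued at once, while the draining flag plays the same role as isProcessingChunkRef.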

Handle OpenAI Responses

Next, create the functions that manage responses from the OpenAI API:

// src/app/components/Interview.js
...
// Handles conversation updates, including user and assistant messages
...
const handleConversationUpdate = useCallback((event) => {
  console.log("Conversation updated:", event);
  const { item, delta } = event;
  if (item.type === "message" && item.role === "assistant") {
    console.log("Assistant message detected");
    if (delta && delta.audio) {
      const downsampledAudio = downsampleAudio(delta.audio, 24000, 16000);
      audioChunkQueueRef.current.push(downsampledAudio);
      if (!isProcessingChunkRef.current) {
        processNextAudioChunk();
      }
    }
  } else if (item.type === "message" && item.role === "user") {
    setUserMessage(item.content[0].transcript);
  }
 }, []);
//Handles interruptions in the conversation flow.
const interruptConversation = () => {
  console.warn("User interrupted the conversation");
  simliClient?.ClearBuffer();
  openAIClientRef.current?.cancelResponse("");
};
//...

The code above defines two functions: handleConversationUpdate and interruptConversation. The handleConversationUpdate function first checks if the message is from the assistant. Then, it checks if the assistant’s message includes audio data; if so, it uses the downsampleAudio function to convert the audio to a lower sample rate (24,000 Hz to 16,000 Hz).

This downsampled audio is added to audioChunkQueueRef.current, a reference for storing audio chunks. If no chunk is being processed, the processNextAudioChunk() function (which we have previously declared) is called to start processing the audio chunks.

The interruptConversation function handles conversation interruption. If the user interrupts the interviewer, the function clears the SimliClient buffer and cancels the ongoing response from the OpenAI API.

Audio Recording

Next, let’s create functions to handle audio recording when the prospective candidate is talking. Paste the following code:

// src/app/components/Interview.js
...
// Starts audio recording from the user's microphone
...

const startRecording = useCallback(async () => {
   if (!audioContextRef.current) {
     audioContextRef.current = new AudioContext({ sampleRate: 24000 });
   }
   try {
     console.log("Starting audio recording...");
     streamRef.current = await navigator.mediaDevices.getUserMedia({
       audio: true,
     });
     const source = audioContextRef.current.createMediaStreamSource(
       streamRef.current
     );
     processorRef.current = audioContextRef.current.createScriptProcessor(
       2048,
       1,
       1
     );
     processorRef.current.onaudioprocess = (e) => {
       const inputData = e.inputBuffer.getChannelData(0);
       const audioData = new Int16Array(inputData.length);
       for (let i = 0; i < inputData.length; i++) {
         // Clamp each float sample to [-1, 1] and scale it to 16-bit PCM
         const sample = Math.max(-1, Math.min(1, inputData[i]));
         audioData[i] = Math.floor(sample * 32767);
       }
       openAIClientRef.current?.appendInputAudio(audioData);
     };
     source.connect(processorRef.current);
     processorRef.current.connect(audioContextRef.current.destination);
     setIsRecording(true);
     console.log("Audio recording started");
   } catch (err) {
     console.error("Error accessing microphone:", err);
     setError("Error accessing microphone. Please check your permissions.");
   }
 }, []);

// Stops audio recording from the user's microphone

const stopRecording = useCallback(() => {
   if (processorRef.current) {
     processorRef.current.disconnect();
     processorRef.current = null;
   }
   if (streamRef.current) {
     streamRef.current.getTracks().forEach((track) => track.stop());
     streamRef.current = null;
   }
   setIsRecording(false);
   console.log("Audio recording stopped");
 }, []);
//...

Here’s what each function does:

startRecording:
- Requests microphone access, creates an audio context (if one does not already exist), and streams audio data.
- Captures the audio, converts it to 16-bit PCM for compatibility, and sends it to the OpenAI client for processing.
stopRecording:
- Disconnects the audio processor and stops the microphone tracks, which releases the microphone.

Interaction Start and Stop

// src/app/components/Interview.js
...
// Handles starting the interaction
...
const handleStart = useCallback(async () => {
  setIsLoading(true);
  setError("");
  try {
    await simliClient?.start();
    await initializeOpenAIClient();
  } catch (error) {
    console.error("Error starting interaction:", error);
    setError(`Error starting interaction: ${error.message}`);
  } finally {
    setIsAvatarVisible(true);
    setIsLoading(false);
  }
}, [initializeOpenAIClient]);

// Handles stopping the interaction, cleaning up resources and resetting states.

const handleStop = useCallback(() => {
  console.log("Stopping interaction...");
  setIsLoading(false);
  setError("");
  stopRecording();
  setIsAvatarVisible(false);
  simliClient?.close();
  openAIClientRef.current?.disconnect();
  if (audioContextRef.current) {
    audioContextRef.current.close();
  }
  onClose();
  console.log("Interaction stopped");
 }, [stopRecording]);

In the code above, the handleStart function initializes the interaction by starting the necessary clients and preparing the interface for recording. By calling the simliClient?.start() method, the SimliClient initiates a WebRTC handshake to negotiate a connection between the client and Simli's server.

The handleStop function stops the interaction by calling the simliClient?.close() method, cleaning up used resources, like client connections, and updating the loading and avatar visibility states.

Component Mount and Cleanup

Finally, for this component, we need to initialize the simliClient when the component mounts. Paste the following code:

// src/app/components/Interview.js
...
// Effect to initialize Simli client once the component mounts and clean up resources on unmount
... 

useEffect(() => {
  initializeSimliClient();
  if (simliClient) {
    simliClient?.on("connected", () => {
      console.log("SimliClient connected");
      const audioData = new Uint8Array(6000).fill(0);
      simliClient?.sendAudioData(audioData);
      console.log("Sent initial audio data");
      startRecording();
    });
    simliClient?.on("disconnected", () => {
      console.log("SimliClient disconnected");
    });
  }
  return () => {
    try {
      simliClient?.close();
      openAIClientRef.current?.disconnect();
      if (audioContextRef.current) {
        audioContextRef.current.close();
      }
    } catch {}
  };
}, [initializeSimliClient]);

This useEffect hook initializes the simliClient when the component mounts and sets up event listeners for when it connects or disconnects. On connection, it sends a silent audio signal to keep the connection alive and starts recording audio. The cleanup function, triggered on component unmount, closes simliClient, disconnects the OpenAI client, and closes the audio context.

Next, navigate to the src/app/page.js file and paste the following code:

// src/app/page.js
// Configure the avatar and display the home page
"use client";
import React, { useState } from "react";
import Interview from "./components/Interview";

const Demo = () => {
  const [jobDescription, setJobDescription] = useState("");
  const avatar = {
    name: "Frank",
    openai_voice: "alloy",
    simli_faceid: "5514e24d-6086-46a3-ace4-6a7264e5cb7c",
    initialPrompt: `Your name is Frank, an interviewer hiring for a specific role. You are looking for a candidate whose expertise aligns closely with the following job description: ${jobDescription}. Please generate three interview questions that assess key qualifications and relevant experience. Begin by introducing yourself and asking the interviewee to share a bit about their background.`,
  };
  return (
    <div className="bg-black min-h-screen flex flex-col items-center font-abc-repro font-normal text-sm text-white p-8">
      <div className="flex flex-col items-center mt-4">
        <label htmlFor="job-description" className="font-bold mb-2">
          Add Job Description
        </label>
        <textarea
          id="job-description"
          placeholder="Enter job description, e.g., Responsibilities, Requirements"
          value={jobDescription}
          onChange={(e) => setJobDescription(e.target.value)}
          className="p-2 border border-gray-300 rounded-md w-80 h-24 resize-none mb-4 text-black"
        />
      </div>
      <div className="flex flex-col items-center gap-6 bg-effect15White p-6 pb-[40px] rounded-xl w-full">
        <div>
          <Interview
            openai_voice={avatar.openai_voice}
            simli_faceid={avatar.simli_faceid}
            initialPrompt={avatar.initialPrompt}
          />
        </div>
      </div>
    </div>
  );
};
export default Demo;

This code takes the job description entered by the recruiter and interpolates it into the initial prompt. The initial prompt, the Simli face ID, and the OpenAI voice are then passed as props to the Interview component.

The Final Result

To test the app, the user enters the job description in the text area. The description becomes part of the prompt sent to the OpenAI API, which uses it to generate three questions for the avatar to ask the applicant.
Here's the video link to see how the application works:

Conclusion

By combining SimliClient and OpenAI, this guide built an application that turns a static job description into a dynamic, interactive video interview. Together, these tools can automate the initial candidate screening and ease some of the stress of the traditional recruitment process.

Simli’s API comes with a free plan. Sign up on Simli today to get started.

Supersonic GPU MelSpectrogram for your real-time applications

Antonyesk601 — Wed, 16 Oct 2024 08:21:28 +0000

Here at Simli, we care the most about latency. That's what we're all about, after all: low-latency video. On the other hand, some of the most used algorithms in audio machine learning have really, really slow implementations. To be clear, these implementations are usually fine for creating the models themselves or for batched inference. But for us at Simli, a couple of milliseconds could mean the difference between a stuttering mess and a smooth video.
Luckily for me (and, by proxy, you the reader), this guide does not require much math knowledge. Much smarter people have already figured out how to get the correct answer; we're just making the computation more efficient. If you need more info to understand what the MelSpectrogram even is, you can read this article. There are multiple ways to calculate the spectrogram, and the right one depends heavily on your application. So, for the author's convenience, we're focusing on the mels required for running our internal models.

The common solution: Librosa

You are most likely here after encountering a repo that’s using Librosa. It’s a pretty handy library, to be honest. There are a ton of utilities, easy ways to read audio from disk, and quick access to commonly required functionality such as audio resampling, channel downmixing, and others. In our case, we’re interested in one particular piece of functionality: melspectrogram calculation. In librosa, getting the melspectrogram is straightforward.

import librosa

# load in any audio to test
sampleAudio, sr = librosa.load("sample.mp3", sr=None) # sr=None means the original sampling rate
spectrogram = librosa.feature.melspectrogram(
    y=sampleAudio,
    sr=sr,
    n_fft=int(0.05 * sr),  # 50ms
    hop_length=int(0.0125 * sr),  # 12.5ms
    win_length=int(0.05 * sr),
)

Straightforward, and it takes around 2ms on average on a GCP g2 VM. Still, there are two main issues:

Usually, when working with DL models, you need to run the model on a GPU. This means that part of your chain runs on the CPU, and then you copy the results over to the GPU. For batched inference, this is mostly fine, since you should collect as much data as fits on the GPU per transfer. However, in our case, we often work with one frame at a time to reduce waiting and processing time.
Our total time budget is roughly 33ms/frame. This includes transfer latency from the API server to the ML inference server, the CPU-to-GPU copy, and preprocessing and postprocessing for the models, including the melspectrogram. Every millisecond matters when you’re working with such a tight budget. These two milliseconds actually contributed towards having a working live-rendered video stream for Simli (well, it was many optimizations, each worth a millisecond or two).
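
To put those numbers in perspective, here is a minimal sketch of how one might time a preprocessing step and express it as a share of the frame budget. The helper names (mean_latency_ms, budget_share) and the stand-in workload are hypothetical and not taken from Simli's pipeline; only the 33ms budget and the 2ms figure come from the text above. For GPU work you would also have to synchronize the device (for example with torch.cuda.synchronize()) before reading the clock, which this CPU-only sketch leaves out.

```python
import time

FRAME_BUDGET_MS = 33.0  # per-frame budget quoted in the text


def mean_latency_ms(fn, warmup=3, runs=50):
    """Average wall-clock time of fn() in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: caches, lazy initialization, compilation
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


def budget_share(step_ms, budget_ms=FRAME_BUDGET_MS):
    """Fraction of the per-frame budget consumed by a single step."""
    return step_ms / budget_ms


# Stand-in workload (hypothetical); swap in the real preprocessing call.
measured = mean_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"stand-in step: {measured:.3f} ms, {budget_share(measured):.1%} of the budget")
print(f"a 2 ms step uses {budget_share(2.0):.1%} of the budget")
```

Seen this way, a single 2ms step takes about 6% of a 33ms frame, which is why shaving it down is worth the effort.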

Looking online for solutions

While looking at how other people have done it (luckily, this is not a problem unique to us), I found this article, which both explained how melspectrograms work and provided a reference implementation that, for some reason, took only 1ms (a 50% improvement). That's a good start, but there's still the first problem: not everything was on the GPU. We're using PyTorch and have been relying on torch.compile with mode="reduce-overhead" for maximum speed improvements. However, data transfer like this may tank the performance, as the PyTorch compiler will not be able to optimize the function as well. The solution is a bit tedious but relatively easy: rewrite it in torch. The PyTorch team has made sure a lot of their syntax and functionality is as close to NumPy as possible (with some edge cases that are usually well documented, apart from one that lost me a couple of days, but that's a story for a different blog).

The PyTorch rewrite

So there are a couple of steps we need to take to successfully rewrite everything in PyTorch. Computing a melspectrogram can be split into three steps:

Computing the short-time Fourier transform (STFT)
Generating the mel-scale filter banks
Generating the spectrogram by applying the filter banks to the STFT magnitudes
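
The three steps can be sketched in plain Python to show how they fit together. This is an illustration under simplifying assumptions, not the implementation discussed in this post: it uses a naive DFT instead of an FFT, no padding, the HTK mel formula, and unnormalized triangular filters, so its numbers will not match librosa or torchaudio. All function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, n_fft, hop):
    """Step 1: magnitude STFT via a naive DFT (no padding, for clarity)."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * t / n_fft)
                    for t, x in enumerate(frame)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames  # n_frames rows of n_fft // 2 + 1 magnitudes


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max):
    """Step 2: triangular filters evenly spaced on the (HTK) mel scale."""
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    lo_mel, hi_mel = hz_to_mel(f_min), hz_to_mel(f_max)
    edges = [mel_to_hz(lo_mel + (hi_mel - lo_mel) * i / (n_mels + 1))
             for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((f - lo) / (center - lo), (hi - f) / (hi - center)))
            for f in bin_hz
        ])
    return bank  # n_mels rows of n_fft // 2 + 1 weights


def mel_spectrogram(signal, sample_rate, n_fft, hop, n_mels, f_min, f_max):
    """Step 3: project every STFT frame onto the mel filters."""
    frames = stft_magnitude(signal, n_fft, hop)
    bank = mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max)
    return [[sum(w * x for w, x in zip(row, frame)) for frame in frames]
            for row in bank]  # n_mels rows of n_frames values


# Toy input: a 1 kHz sine sampled at 8 kHz
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(256)]
mels = mel_spectrogram(tone, sr, n_fft=64, hop=16, n_mels=8, f_min=0.0, f_max=4000.0)
print(len(mels), "mel bands x", len(mels[0]), "frames")
```

Note that only step 1 depends on the audio itself. The window and the filter bank depend only on the configuration, which is what makes them good candidates for precomputation.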

There’s good and bad news. The good news is that all the required functionality is readily available in PyTorch or torchaudio. The bad news is that the default behavior is quite different from librosa’s, so it takes a lot of configuration and trial and error to get it right. I’ve been through that, and I’m sharing the info because I wouldn’t wish that hell upon my worst enemy. One thing we need to understand is that this code relies heavily on caching some of our results for later use. This is done in an initialization function that pregenerates all of the static arrays (the mel frequency banks, for example, depend on the sampling rate and the number of mels you need). Here’s our optimized version of the melspectrogram function using PyTorch:


import torch
import torchaudio


def melspectrogram_torch(wav: torch.Tensor, sample_rate: int, hann_window: torch.Tensor, mel_basis: torch.Tensor):
    # Magnitude STFT with a 50ms window and a 12.5ms hop
    stftWav = torch.stft(
        wav,
        n_fft=int(sample_rate * 0.05),
        win_length=int(sample_rate * 0.05),
        hop_length=int(sample_rate * 0.0125),
        window=hann_window,
        pad_mode="constant",
        return_complex=True,
    ).abs()
    stftWav = stftWav.squeeze()
    # Project the (n_freqs, n_frames) magnitudes onto the (n_mels, n_freqs) filter bank
    mel_stftWav = torch.mm(mel_basis, stftWav)
    return mel_stftWav


# A decorator cannot be picked with if/else, so wrap the function after defining it
if torch.cuda.is_available():
    melspectrogram_torch = torch.compile(melspectrogram_torch, mode="reduce-overhead")
else:
    melspectrogram_torch = torch.compile(melspectrogram_torch)

device = "cuda" if torch.cuda.is_available() else "cpu"

melspectrogram_torch(
    # librosa.load returns a NumPy array, so convert it and move it to the device
    torch.from_numpy(sampleAudio).to(device),
    sr,
    torch.hann_window(int(sr * 0.05), device=device, dtype=torch.float32),
    torchaudio.functional.melscale_fbanks(
        sample_rate=sr,
        n_freqs=(int(sr * 0.05) // 2 + 1),
        norm="slaney",  # this is the normalization algorithm used by librosa
        # this is an example that's related to our own pipeline, check what you need for yours
        n_mels=80,
        f_min=55,
        f_max=7600,
    )
    .T.to(device),
)

After the initial compilation run, we measured this function at 350 microseconds on an Nvidia L4 GPU (with the hann_window and melscale_fbanks cached). The adjusted call looks like this:

# Compute the static tensors once and reuse them for every call
hann = torch.hann_window(int(sr * 0.05), device=device, dtype=torch.float32)
melscale = torchaudio.functional.melscale_fbanks(
    sample_rate=sr,
    n_freqs=(int(sr * 0.05) // 2 + 1),
    norm="slaney",  # this is the normalization algorithm used by librosa
    # this is an example that's related to our own pipeline, check what you need for yours
    n_mels=80,
    f_min=55,
    f_max=7600,
).T.to(device)
melspectrogram_torch(
    torch.from_numpy(sampleAudio).to(device),
    sr,
    hann,
    melscale,
)
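
The caching idea itself can be sketched without any GPU dependency. The snippet below is a hypothetical helper (MelConfig and get_mel_config are illustrative names, not part of PyTorch, torchaudio, or Simli's code) that memoizes the values that depend only on the sampling rate, so the window is built on the first call and reused afterwards. In the real pipeline, the cached values would be the hann and melscale tensors.

```python
import math
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class MelConfig:
    """Static values that only depend on the sampling rate."""
    sample_rate: int
    n_fft: int       # 50 ms window
    hop_length: int  # 12.5 ms hop
    window: tuple    # periodic Hann window, stored as an immutable tuple


@lru_cache(maxsize=None)
def get_mel_config(sample_rate: int) -> MelConfig:
    """Build the static arrays once per sampling rate; later calls hit the cache."""
    n_fft = int(sample_rate * 0.05)
    hop_length = int(sample_rate * 0.0125)
    window = tuple(
        0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)
    )
    return MelConfig(sample_rate, n_fft, hop_length, window)


cfg = get_mel_config(16000)   # first call computes the window
same = get_mel_config(16000)  # second call returns the cached object
print(cfg.n_fft, cfg.hop_length, same is cfg)
print(get_mel_config.cache_info())
```

Keying the cache on the sampling rate keeps the hot path free of allocations while still handling audio that arrives at a different rate.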

This is one part of a series of articles about how we optimized our deployed pretrained models, focusing on the preprocessing and postprocessing steps. You can check https://www.simli.com/demo to see the deployed models and the lowest-latency avatars we provide.