1. The Use Case: Your Cloud Mentor, On-Demand
Navigating the vast and sometimes complex world of AWS documentation and account management can be overwhelming—especially when you've reached your 50th browser tab. Imagine having a voice assistant that understands your questions, visually analyzes your AWS Console screen, quickly fetches documentation, and provides precise, context-aware answers—almost like a senior peer or mentor sitting right next to you. This is the vision that inspired me to build this tool.
My goal was straightforward yet ambitious: to build a cloud-powered, real-time, speech-to-speech assistant leveraging AWS's powerful Nova models. The result? See the demo below for yourself:
2. Choosing the Right Models
Aside from its catchy, hedgehog-inspired name, the Amazon Nova Sonic model (available via Amazon Bedrock) is specifically engineered for real-time conversational exchanges. Optimized for minimal latency and designed for speech-to-speech interaction, it provides a seamless conversational experience. Complementing Nova Sonic is Amazon Nova Lite, a powerful multimodal model capable of visually analyzing shared screens, adding another dimension to the assistant's capabilities.
Key reasons for choosing Nova Sonic and Nova Lite:
- Real-time conversational streaming: Immediate responses without awkward pauses. Because the assistant combines real-time streaming with screen-sharing analysis, models optimized for speed are a must.
- Natural speech interactions: Fluid and human-like conversation.
- Multimodal Screen Analysis: Nova Lite effortlessly interprets screenshots of the AWS Console to provide context-aware advice.
- Tool augmentation: Nova models can seamlessly invoke backend functions for specific tasks, such as profile lookups and knowledge-base queries.
- Cost-effective: Nova models are priced competitively, especially compared to third-party models.
Additionally, I integrated an Amazon Bedrock Knowledge Base, pre-populated with official user guides for EC2 and ECS. This ensures deeper, accurate insights on these services, while other general inquiries leverage Nova Sonic's built-in knowledge.
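For illustration, a backend query against such a Knowledge Base can go through the Bedrock Agent Runtime Retrieve API. This is a minimal sketch; the knowledge base ID, function name, and result count are placeholders:

import boto3

# Bedrock Agent Runtime exposes the Retrieve API used to query Knowledge Bases
kb_client = boto3.client("bedrock-agent-runtime")

def query_knowledge_base(question: str, kb_id: str = "YOUR_KB_ID", top_k: int = 3) -> list:
    """Return the most relevant documentation chunks for a user question."""
    response = kb_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    # Each result carries the retrieved text plus its source location and relevance score
    return [r["content"]["text"] for r in response["retrievalResults"]]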
3. How I Built It: From Zero to Sonic
This tool is built upon the great foundation of the AWS Sample Sonic CDK Agent repository, which includes:
- Frontend: Amazon CloudFront and S3 provide swift static asset delivery.
- Backend: A Python application deployed on ECS, leveraging the flexibility of containers.
- Authentication: Amazon Cognito for secure, controlled access.
- Real-time communication: WebSockets via Network Load Balancer enable real-time conversational interactions.
The Speech-to-Speech Workflow is as follows:
- Sign In: Secure authentication via Amazon Cognito through a CloudFront-hosted frontend.
- Session Start: Users initiate a secure WebSocket connection with JWT validation for added security (a validation sketch follows this list).
- Interactive Conversation: Users speak naturally, optionally sharing AWS Console screenshots.
- Intelligent Assistance: Nova Sonic transcribes speech, Nova Lite visually analyzes screens, and responses are streamed in real-time. When deeper insights are needed, the Amazon Bedrock Knowledge Base is queried.
- Contextual Answers: Backend tools execute necessary queries, returning comprehensive answers instantly.
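To illustrate the JWT validation step above, here is a minimal sketch of how the backend might verify a Cognito-issued token when a WebSocket session starts, assuming the PyJWT library; the pool and client IDs are placeholders:

import jwt
from jwt import PyJWKClient

# Placeholder Cognito settings; replace with your user pool's values
COGNITO_REGION = "us-east-1"
USER_POOL_ID = "us-east-1_example"
APP_CLIENT_ID = "example-client-id"
JWKS_URL = f"https://cognito-idp.{COGNITO_REGION}.amazonaws.com/{USER_POOL_ID}/.well-known/jwks.json"

jwks_client = PyJWKClient(JWKS_URL)

def validate_token(token: str) -> dict:
    """Verify the token's signature and claims; raises a jwt exception if invalid."""
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=APP_CLIENT_ID,  # Cognito ID tokens carry the app client ID as "aud"
    )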
Integrating Nova Sonic with Backend Tools
Amazon Nova Sonic extends its capabilities beyond pre-trained knowledge by supporting tool use, also known as function calling. This feature enables the model to interact with external functions, APIs, and data sources during a conversation, allowing for dynamic and context-aware responses. How this works (a tool-definition sketch follows the list below):
- Tool Definition: Developers define tools by providing a JSON schema that describes each tool's functionality and input requirements.
- Tool Invocation: When a user query is received, Nova Sonic analyzes it to determine if a tool is necessary to generate a response. If so, it returns the name of the tool and the parameters to use.
- Execution and Response: The backend executes the specified tool with the provided parameters and returns the results to Nova Sonic. The model then incorporates this information into its response to the user.
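As a sketch of what a tool definition might look like for the screen-analysis tool, here is one in the Bedrock toolSpec style. The names and schema are assumptions, and Nova Sonic's bidirectional streaming API may expect the schema serialized as a JSON string:

# A tool definition: a name, a description the model reads to decide when to call
# the tool, and a JSON schema describing the expected input.
screen_analysis_tool = {
    "toolSpec": {
        "name": "screenAnalysis",
        "description": "Analyze the user's most recent AWS Console screenshot "
                       "and describe what is visible on screen.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "question": {
                        "type": "string",
                        "description": "What the user wants to know about the screen.",
                    }
                },
                "required": [],
            }
        },
    }
}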
Visual Screen Analysis with Nova Lite
For visual analysis, Nova Lite interprets the user screen. My approach was to provide screen stills every 3 seconds:
// Ask the browser for a screen-share stream (the user picks a window or tab)
mediaStream = await navigator.mediaDevices.getDisplayMedia({ video: true });
// Feed the stream into an off-DOM <video> element so frames can be read from it
videoElement = document.createElement('video');
videoElement.srcObject = mediaStream;
await videoElement.play();
// Wait for video metadata so videoWidth/videoHeight are populated, then set up a canvas for frame capture
canvas = document.createElement('canvas');
canvas.width = videoElement.videoWidth * scale;   // scale < 1 downsizes the captured frame
canvas.height = videoElement.videoHeight * scale;
ctx = canvas.getContext('2d');
// Capture a JPEG still (quality 0.7) every `interval` ms and ship it to the backend
captureIntervalId = setInterval(() => captureFrame(0.7), interval);
Pre-analysis is done asynchronously with Nova Lite, so the insights are already available for Nova Sonic to retrieve when the user asks:
async def processToolUse(self, toolName, toolUseContent):
    tool = toolName.lower()
    results = {}
    if tool == "screenanalysis":
        image_b64 = self.latest_image_snapshot
        if image_b64:
            # This can be awaited or run in an executor for true async behavior
            results = screen_analysis.main(image_b64)
            self.latest_image_snapshot = None
        else:
            results = {"status": "error", "message": "No image snapshot available"}
    return results
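The screen_analysis module itself is not shown in the snippet above, but a minimal sketch of how it might call Nova Lite through the Bedrock Converse API could look like this; the model ID, prompt, and return shape are assumptions:

import base64
import boto3

bedrock = boto3.client("bedrock-runtime")

def analyze_screenshot(image_b64: str) -> dict:
    """Send a base64-encoded screenshot to Nova Lite and return its description."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed model ID; check what is available in your region
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": base64.b64decode(image_b64)}}},
                {"text": "Describe what is shown on this AWS Console screen, "
                         "including the service, the page, and any visible errors."},
            ],
        }],
    )
    # The Converse API returns the assistant message under output.message.content
    text = response["output"]["message"]["content"][0]["text"]
    return {"status": "success", "analysis": text}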
4. Lessons Learned
One of the first realizations was the critical importance of minimizing latency: real-time conversational AI needs sub-200ms round-trip audio processing to feel “natural”, and choosing Nova Sonic for streaming speech-to-speech paid off by delivering responses faster than competing models. However, even with Nova Sonic’s low-latency foundation, any additional overhead—such as external API calls for tool use—could introduce perceptible delays.
The unified architecture of Nova Sonic, which handles speech understanding and generation in one model, simplified development by eliminating the need for separate ASR and TTS pipelines, yet it also meant less room for customizing each stage independently. When integrating tools (function calling), I learned that careful schema design is the key to success: Nova Sonic will happily generate a "tool invocation" even when none exists, so validating each request and sanitizing inputs became a non-negotiable safety check to prevent hallucinations or misrouted calls.
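As a rough sketch of that safety check, each tool request coming back from the model can be validated against the tool's own JSON schema before anything executes; the jsonschema package and the tool registry below are assumptions:

from jsonschema import ValidationError, validate

# Hypothetical registry mapping tool names to their input schemas
TOOL_SCHEMAS = {
    "screenanalysis": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "additionalProperties": False,
    },
}

def validate_tool_request(tool_name: str, tool_input: dict) -> dict:
    """Reject calls to unknown tools or with malformed inputs before executing anything."""
    schema = TOOL_SCHEMAS.get(tool_name.lower())
    if schema is None:
        return {"status": "error", "message": f"Unknown tool: {tool_name}"}
    try:
        validate(instance=tool_input, schema=schema)
    except ValidationError as exc:
        return {"status": "error", "message": f"Invalid tool input: {exc.message}"}
    return {"status": "ok"}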
On the cost front, real-time voice applications can become expensive if not architected properly. Nova Sonic’s pricing—around $0.0034 per 1,000 tokens for input and $0.0136 per 1,000 tokens for output—scales to roughly $7 per day for ten hours of continuous speech. While seemingly modest, costs can spike if you neglect to tackle hot-word detection or silence trimming, as idle streams continue to consume tokens. Implementing VAD (voice activity detection) can reduce unnecessary token usage.
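As a minimal sketch of that idea, a voice activity detector can gate the audio stream so only frames that actually contain speech are forwarded to Nova Sonic. The webrtcvad package used here is one option, not necessarily what the sample repository uses:

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher values filter out more non-speech
SAMPLE_RATE = 16000     # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz, in 10/20/30 ms frames

def speech_frames(pcm_frames):
    """Yield only the frames classified as speech, so silence never consumes tokens."""
    for frame in pcm_frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame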
Working with Nova Lite for screen analysis added another layer of complexity. Streaming a screenshot every three seconds, then asynchronously pre-analyzing it with Nova Lite, ensured that most visual insights were ready by the time the user asked context-specific questions. However, sending high-resolution images at that frequency stressed the Lambda functions responsible for resizing and encoding. The lesson here was to pre-process screenshots client-side (downscaling to 720p) before sending, which cut payload sizes in half without noticeably degrading visual recognition accuracy.
Conclusion
For me, this project isn't just another cool Gen AI demo; it's a practical step towards bringing interactive, conversational AI mentorship closer to reality, augmented with precise knowledge from Amazon Bedrock Knowledge Bases. Without realizing it, we are closer than ever to bringing Jarvis to life!
Take some time to experience the conversational power of Nova Sonic and the visual intelligence of Nova Lite for yourself, and let's keep building AWSome things!
About Me
My name is Gustavo Alejandro Romero Sanchez and I'm a Principal Cloud Solutions Architect working at Encora. You can find me on LinkedIn.
Top comments (7)
Really impressive how you combined real-time voice and screen understanding into an actual AWS workflow assistant. Have you tried it with real ops teams yet, and what was their feedback?
Thanks for the kind words!
I've run initial tests with tech-savvy folks in my organization, and early feedback has been very good. I'm looking forward to more extensive testing with actual ops teams soon, as their insights will definitely help refine it further.
Pretty cool seeing someone actually build this out instead of just talking about it. Makes me wanna try hacking on something similar.
I'm glad you liked it David - that's what I strive for, inspiring others to build, happy hacking!
honestly this is super lit - been thinking a lot about real-time support and never saw it done like this. you think these kind of voice assistants could ever really replace support teams or nah?
Thanks, glad you liked it 🔥
While the concept is powerful and can handle many routine questions and even complex scenarios, I don't think they'll fully replace support teams; they'll enhance their current capabilities!
Very nice Gustavo!