This repository serves as a comprehensive developer guide for Google's Gemini Multimodal Live API. Through a structured, hands-on approach, you'll learn how to build sophisticated real-time applications that can see, hear, and interact naturally using Gemini's multimodal capabilities.
By following this guide, you'll be able to:
- Build real-time audio chat applications with Gemini
- Implement live video interactions through webcam and screen sharing
- Create multimodal experiences combining audio and video
- Deploy production-ready AI assistants
- Choose between the Gemini Developer API and Vertex AI implementations
The guide progresses from basic concepts to advanced implementations, culminating in a Project Astra-inspired AI assistant that demonstrates the full potential of the Gemini Multimodal Live API.
Real-time Communication:
- WebSocket-based streaming
- Bidirectional audio chat
- Live video processing
- Turn-taking and interruption handling
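To make the streaming model above concrete, here is a minimal Python sketch of the bidirectional WebSocket flow. The endpoint path, model name, and message fields shown are illustrative placeholders; Part 2 walks through the exact protocol step by step.

```python
# Minimal sketch of the bidirectional WebSocket flow (endpoint, model name, and
# message shapes are illustrative -- Part 2 covers the exact protocol).
import asyncio
import json

import websockets  # pip install websockets

HOST = "generativelanguage.googleapis.com"
API_KEY = "YOUR_API_KEY"  # placeholder
URI = f"wss://{HOST}/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key={API_KEY}"

async def main():
    async with websockets.connect(URI) as ws:
        # 1. Send a setup message describing the model and desired output modality.
        await ws.send(json.dumps({
            "setup": {
                "model": "models/gemini-2.0-flash-exp",
                "generation_config": {"response_modalities": ["TEXT"]},
            }
        }))
        await ws.recv()  # wait for the setup acknowledgement

        # 2. Send a client turn, then stream server messages until the turn completes.
        await ws.send(json.dumps({
            "client_content": {
                "turns": [{"role": "user", "parts": [{"text": "Hello!"}]}],
                "turn_complete": True,
            }
        }))
        async for raw in ws:
            msg = json.loads(raw)
            print(msg)
            if msg.get("serverContent", {}).get("turnComplete"):
                break

asyncio.run(main())
```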
Audio Processing:
- Microphone input capture
- Audio chunking and streaming
- Voice Activity Detection (VAD)
- Real-time audio playback
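Microphone capture and chunking look roughly like the sketch below, using PyAudio and the 16-bit PCM, 16 kHz input format the Live API expects. The chunk size and message wrapping are assumptions here; Chapter 5 covers the real implementation.

```python
# Sketch of microphone capture and chunking (PyAudio; chunk size is an assumption).
import base64

import pyaudio  # pip install pyaudio

RATE = 16_000            # 16 kHz mono PCM input
CHUNK = 512              # frames per chunk; smaller chunks mean lower latency
FORMAT = pyaudio.paInt16

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

try:
    for _ in range(50):  # capture a short burst for illustration
        pcm = stream.read(CHUNK, exception_on_overflow=False)
        # Each chunk is base64-encoded and wrapped in a realtime_input message
        # before being sent over the open WebSocket (see Part 2, Chapter 5).
        payload = {
            "realtime_input": {
                "media_chunks": [{
                    "mime_type": "audio/pcm",
                    "data": base64.b64encode(pcm).decode(),
                }]
            }
        }
        # ws.send(json.dumps(payload))  # sent on the WebSocket from the earlier sketch
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
```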
Video Integration:
- Webcam and screen capture
- Frame processing and encoding
- Simultaneous audio-video streaming
- Efficient media handling
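Webcam or screen frames are handled the same way: capture, downscale, JPEG-encode, base64, and send alongside the audio stream. A rough OpenCV sketch follows; the frame rate and resolution are illustrative choices, not values from the guide.

```python
# Sketch of webcam frame capture and encoding (OpenCV; rate and size are assumptions).
import base64
import time

import cv2  # pip install opencv-python

cap = cv2.VideoCapture(0)        # default webcam
try:
    for _ in range(5):           # grab a handful of frames for illustration
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (640, 480))   # keep payloads small
        ok, jpeg = cv2.imencode(".jpg", frame)
        chunk = {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(jpeg.tobytes()).decode(),
        }
        # The chunk is sent in the same realtime_input stream as the audio chunks.
        time.sleep(1)            # roughly one frame per second is enough context
finally:
    cap.release()
```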
Production Features:
- Function calling capabilities
- System instructions
- Mobile-first UI design
- Cloud deployment
- Enterprise security
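In practice, function calling and system instructions attach to the setup message sent when the WebSocket opens. A hedged sketch using a weather tool as the example; the field values and exact declaration are placeholders, covered properly in Chapter 7.

```python
# Sketch of tools and a system instruction attached to the setup message
# (the weather tool declaration shown here is a placeholder example).
setup_with_tools = {
    "setup": {
        "model": "models/gemini-2.0-flash-exp",
        "system_instruction": {
            "parts": [{"text": "You are a helpful, concise voice assistant."}]
        },
        "tools": [{
            "function_declarations": [{
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "OBJECT",
                    "properties": {"city": {"type": "STRING"}},
                    "required": ["city"],
                },
            }]
        }],
    }
}
# When the model emits a tool call, the client runs the function locally
# (e.g. calls the OpenWeather API) and sends the result back as a tool response.
```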
Part 1: Introduction to Gemini's Multimodal Live API
Basic concepts and SDK usage:
- SDK setup and authentication
- Text and audio interactions
- Real-time audio chat implementation
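A minimal sketch of the Part 1 style of interaction, assuming the google-genai SDK (`pip install google-genai`). Method names can shift between SDK versions, so treat this as orientation rather than reference.

```python
# Part 1-style text interaction over the Live API via the google-genai SDK
# (method names reflect the SDK at the time of writing and may change).
import asyncio

from google import genai

client = genai.Client(api_key="YOUR_AI_STUDIO_API_KEY")  # placeholder key

async def main():
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model="gemini-2.0-flash-exp",
                                       config=config) as session:
        await session.send(input="Explain what the Multimodal Live API does.",
                           end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```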
Part 2: WebSocket Development with Gemini Developer API
Direct WebSocket implementation, building towards Project Pastra - a production-ready multimodal AI assistant inspired by Project Astra:
- Low-level WebSocket communication
- Audio and video streaming
- Function calling and system instructions
- Mobile-first deployment
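On the receive side, server messages carry base64-encoded PCM audio (24 kHz output) plus control signals that drive turn-taking. Below is a sketch of a message handler; the field names follow the Live API message schema as used in this part, but treat the exact keys as assumptions until you reach the relevant chapters.

```python
# Sketch of handling server messages: queue audio for playback, honour interruptions.
import base64
import json

def handle_server_message(raw: str, playback_queue: list) -> None:
    msg = json.loads(raw)
    content = msg.get("serverContent", {})

    if content.get("interrupted"):
        # The user started speaking over the model: drop any queued audio.
        playback_queue.clear()
        return

    for part in content.get("modelTurn", {}).get("parts", []):
        inline = part.get("inlineData")
        if inline and inline.get("mimeType", "").startswith("audio/pcm"):
            playback_queue.append(base64.b64decode(inline["data"]))

    if content.get("turnComplete"):
        print("model finished its turn")
```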
Part 3: WebSocket Development with Vertex AI API
Enterprise-grade implementation using Vertex AI, mirroring Part 2's journey with production-focused architecture:
- Proxy-based authentication
- Service account integration
- Cloud deployment architecture
- Enterprise security considerations
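The key difference from Part 2 is that the browser never holds credentials: a small proxy mints short-lived access tokens from a service-account key and forwards traffic to Vertex AI. A sketch of the token step, assuming the standard google-auth library; the file path and scope shown are placeholders.

```python
# Sketch of the proxy side of service-account authentication: exchange a key file
# for a short-lived OAuth2 access token and attach it to the upstream connection.
from google.auth.transport.requests import Request
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]

credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path, mounted into the proxy container
    scopes=SCOPES,
)
credentials.refresh(Request())  # mint the access token

# The proxy sends this header when it opens the upstream WebSocket to Vertex AI,
# so the browser client never sees the service-account key or the token.
headers = {"Authorization": f"Bearer {credentials.token}"}
```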
Below is an overview of where each feature is implemented across the Gemini Developer API and Vertex AI versions:
| Feature | Part 2 - Dev API Chapter | Part 3 - Vertex AI Chapter |
|---|---|---|
| Basic WebSocket Setup | Chapter 3 | - |
| Text-to-Speech | Chapter 4 | - |
| Real-time Audio Chat | Chapter 5 | Chapter 9 |
| Multimodal (Audio + Video) | Chapter 6 | Chapter 10 |
| Function Calling & Instructions | Chapter 7 | Chapter 11 |
| Production Deployment (Project Pastra) | Chapter 8 | Chapter 12 |
Note: The Vertex AI implementation starts directly with the advanced features, skipping the basic WebSocket and text-to-speech examples.
Prerequisites:
- Google Cloud Project (for Vertex AI)
- AI Studio API key (for Gemini Developer API)
- OpenWeather API key (if you want to use the weather tool)
- Python 3.9 or higher
- Modern web browser
- Basic understanding of:
  - JavaScript and HTML
  - WebSocket communication
  - Audio/video processing concepts
Gemini Developer API:
- Simple API key authentication
- Direct WebSocket connection
- All tools available simultaneously
- Single-service deployment
- Ideal for prototyping and learning
Vertex AI API:
- Service account authentication
- Proxy-based architecture
- Single tool limitation
- Two-service deployment (app + proxy)
- Suited for enterprise deployment
- Start with Part 1 to understand basic SDK concepts
- Choose your implementation path:
  - For quick prototyping: Follow Part 2 (Dev API)
  - For enterprise deployment: Skip to Part 3 (Vertex AI)
This project is licensed under the Apache License.