Voice AI for Builders
Voice is the fastest-growing AI interface, and most builders have no idea how to build one. This course teaches you the full voice AI stack: transcribing speech with Whisper, generating natural voices with ElevenLabs, streaming audio responses, and connecting voice to LLM reasoning. You will build a working voice assistant by lesson 4 and learn the production patterns in the remaining lessons.
What you'll learn
Course outline
Free — no account needed
The Voice AI Stack — STT, TTS, and Streaming
How voice AI actually works end-to-end — the components, the latency, and the trade-offs
ElevenLabs API — Voices, Models, and Streaming
Generate natural-sounding speech programmatically and stream it to the browser
Whisper — Transcribing Speech to Text
Capture audio from the browser, send it to Whisper, and get an accurate transcript
Full course — $69 one-time
Voice + LLM — Building a Complete Voice Assistant
Connect STT → Claude → TTS into a working voice assistant with a push-to-talk interface
Telephony Integration — AI on Phone Calls
Use Twilio to build an AI phone assistant that handles inbound calls with natural voice
Voice UX — Designing Conversations That Feel Natural
The principles that separate voice interfaces that feel like talking to a person from ones that feel broken
Production Voice AI — Reliability, Fallbacks, and Cost Control
The changes needed to go from demo to deployed voice AI that handles real users at real scale
Voice AI Use Cases — What to Build and Where It Wins
The verticals and patterns where voice AI delivers 10x the value of text interfaces
Get the full course
8 lessons — from ElevenLabs streaming TTS to production telephony with Twilio.
About this course
Voice AI — combining speech-to-text, language model reasoning, and text-to-speech into real-time conversational experiences — has reached production quality and is being deployed in customer service, healthcare, education, and accessibility tools. Building voice AI applications means understanding real-time audio streaming, the latency constraints that make voice feel natural or robotic, and the integration patterns that connect speech, language models, and audio output into a coherent pipeline.
This course is for developers who want to build voice interfaces and conversational agents. After completing it, you will be able to build a voice AI application using Whisper or Deepgram for transcription, the Claude or OpenAI API for reasoning, and ElevenLabs or Cartesia for voice synthesis, with the streaming architecture that makes conversations feel natural rather than laggy.
Frequently asked questions
What is the voice AI technology stack?
A typical voice AI pipeline has three stages: speech-to-text (Deepgram, Whisper, or AssemblyAI converts audio to text), language model reasoning (Claude, GPT-4o, or Gemini generates a response), and text-to-speech (ElevenLabs, Cartesia, or Play.ai converts the text back to audio). The stages run in sequence, so their latencies add up. Getting the end-to-end delay below roughly 800 ms is what makes voice conversations feel natural.
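To make the latency arithmetic concrete, here is a minimal Python sketch of a latency budget for the three stages. The per-stage numbers are illustrative placeholders, not measurements of any provider, and the `Stage` type and `end_to_end_ms` helper are names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int  # time until this stage emits its first usable output


def end_to_end_ms(stages: List[Stage]) -> int:
    """Stages run one after another, so their first-output latencies add up."""
    return sum(stage.latency_ms for stage in stages)


# Placeholder figures for illustration only (assumed, not benchmarked).
PIPELINE = [
    Stage("speech-to-text (final transcript)", 250),
    Stage("language model (first token)", 350),
    Stage("text-to-speech (first audio chunk)", 150),
]

BUDGET_MS = 800  # the target mentioned in the answer above

total = end_to_end_ms(PIPELINE)
for stage in PIPELINE:
    print(f"{stage.name:38s} {stage.latency_ms:4d} ms")
print(f"{'total':38s} {total:4d} ms (headroom: {BUDGET_MS - total} ms)")
```

Measuring each stage separately like this shows which one to optimise first: a slow stage cannot be compensated for unless another stage gives back the same number of milliseconds.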
What is the hardest part of building voice AI?
Latency is the dominant challenge. A text chatbot can take 2–3 seconds to respond and feel acceptable. A voice reply that takes 3 seconds feels broken, because the gaps between turns in human conversation are typically only a few hundred milliseconds. Achieving low latency requires streaming at every stage, choosing models optimised for speed, and pre-warming connections. This course covers the architecture decisions that get end-to-end latency below 1 second.
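One common way to stream between the language model and text-to-speech is to forward each sentence as soon as it is complete, instead of waiting for the whole reply. The sketch below assumes the model's output arrives as an iterable of text fragments. The function name `sentences_from_tokens` is hypothetical, and the sentence splitter is deliberately naive (it would also split after abbreviations such as "Dr.").

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (naive on purpose).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next token
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


# Simulated fragments standing in for a streaming LLM response.
fragments = ["Hel", "lo there. ", "How can ", "I help", " you today?"]
for sentence in sentences_from_tokens(fragments):
    print(sentence)  # each sentence could be handed to TTS at this point
```

With this pattern, speech synthesis can begin on the first sentence while the model is still generating the rest.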
How do I handle interruptions in voice conversations?
Natural conversation involves interruptions: the user starts speaking before the AI has finished. Handling this requires detecting when the user is speaking (voice activity detection, or VAD), stopping the AI audio output immediately, sending the new audio to the transcription service, and treating what the user said as the next turn. Most production voice applications use WebSockets or WebRTC for real-time audio and implement VAD on the client side.
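The interruption steps above can be sketched as a small state machine. This is an illustration under stated assumptions, not a production design: the energy threshold is a toy stand-in for a real VAD, and `BargeInController`, `stop_playback`, and `send_to_stt` are hypothetical names for callbacks your application would supply.

```python
from enum import Enum, auto
from typing import Callable, List, Sequence

Frame = Sequence[int]  # one chunk of 16-bit PCM samples


class State(Enum):
    LISTENING = auto()  # user has the floor; mic audio goes to transcription
    SPEAKING = auto()   # assistant audio is playing


def is_speech(frame: Frame, threshold: float = 500.0) -> bool:
    """Toy VAD: a frame counts as speech if its RMS energy exceeds a threshold."""
    if not frame:
        return False
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms > threshold


class BargeInController:
    def __init__(
        self,
        stop_playback: Callable[[], None],
        send_to_stt: Callable[[Frame], None],
        min_speech_frames: int = 3,  # consecutive speech frames before interrupting
    ) -> None:
        self.stop_playback = stop_playback
        self.send_to_stt = send_to_stt
        self.min_speech_frames = min_speech_frames
        self.state = State.LISTENING
        self._pending: List[Frame] = []

    def start_speaking(self) -> None:
        """Call when assistant audio starts playing."""
        self.state = State.SPEAKING
        self._pending.clear()

    def on_mic_frame(self, frame: Frame) -> None:
        if self.state is State.LISTENING:
            self.send_to_stt(frame)
            return
        if not is_speech(frame):
            self._pending.clear()  # a short noise burst is not an interruption
            return
        self._pending.append(frame)
        if len(self._pending) >= self.min_speech_frames:
            self.stop_playback()  # cut the assistant off immediately
            self.state = State.LISTENING
            for buffered in self._pending:
                self.send_to_stt(buffered)  # keep the start of the utterance
            self._pending.clear()
```

Requiring a few consecutive speech frames trades a little reaction time for fewer false interruptions from coughs or background noise. Buffering those frames means the first word of the interruption still reaches the transcription service.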
What is the best text-to-speech voice for a professional product?
ElevenLabs and Cartesia both produce highly natural-sounding voices. ElevenLabs has a larger voice library and better emotional range; Cartesia has lower latency, which matters more for real-time conversation. OpenAI also provides TTS with good quality at a low price. The best voice depends on your use case — this course covers how to evaluate voices for your specific tone, language, and latency requirements.
Can I build a voice AI that speaks multiple languages?
Yes. Most speech-to-text providers support 30+ languages, and most LLMs handle multilingual text natively. Text-to-speech providers have varying language coverage: the ElevenLabs Multilingual v2 model, for example, supports 29 languages. The main challenge is detecting which language the user is speaking and switching to the appropriate TTS voice automatically. This course covers multilingual pipeline design.
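The voice-switching step can be as simple as a lookup with a fallback. The sketch below assumes the transcription service returns a language code and a confidence score. The voice identifiers, the `pick_voice` helper, and the 0.6 confidence cut-off are all placeholders chosen for illustration.

```python
from typing import Optional

# Placeholder voice identifiers; real IDs come from your TTS provider.
VOICE_BY_LANGUAGE = {
    "en": "voice-en-placeholder",
    "es": "voice-es-placeholder",
    "fr": "voice-fr-placeholder",
}
DEFAULT_LANGUAGE = "en"


def pick_voice(
    detected_language: Optional[str],
    confidence: float,
    min_confidence: float = 0.6,  # assumed cut-off; tune against real traffic
) -> str:
    """Map a detected language (e.g. 'es' or 'es-MX') to a TTS voice."""
    # Reduce regional variants such as 'es-MX' to the base code 'es'.
    code = (detected_language or "").lower().split("-")[0]
    if confidence < min_confidence or code not in VOICE_BY_LANGUAGE:
        code = DEFAULT_LANGUAGE  # fall back instead of guessing
    return VOICE_BY_LANGUAGE[code]


print(pick_voice("es-MX", 0.92))
print(pick_voice("fr", 0.30))
```

Falling back to a default voice on low confidence avoids switching languages mid-conversation because of one misdetected short utterance.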