Realtime Audio¶
Build voice agents with axio-audio and RealtimeAgent. The audio package
provides microphone capture and speaker playback; RealtimeAgent (in the axio
core) drives a RealtimeTransport session, dispatching tool calls concurrently
with streaming audio output.
Install¶
pip install axio-audio axio-transport-google
axio-audio depends on sounddevice and numpy.
Architecture¶
┌──────────────┐ PCM16 audio ┌────────────────────────┐
│ Microphone │──────────────────▶│ │
└──────────────┘ │ RealtimeAgent │
│ (axio core) │
┌──────────────┐ PCM16 audio │ │
│ Speaker │◀──────────────────│ │
└──────────────┘ └────────────┬───────────┘
│ WebSocket
┌────────────▼───────────┐
│ GeminiLiveTransport │
│ (Gemini Live API) │
└────────────────────────┘
The session is full-duplex: you send raw PCM16 chunks from the microphone and
receive AudioOutputDelta events to feed to the speaker.
Minimal example¶
import asyncio
from axio_audio import Microphone, Speaker
from axio_transport_google.realtime import GeminiLiveTransport
from axio.realtime import RealtimeAgent
from axio.events import AudioOutputDelta, TurnComplete
async def main() -> None:
transport = GeminiLiveTransport()
async with (
RealtimeAgent(system="You are a helpful assistant.", transport=transport) as agent,
Microphone() as mic,
Speaker() as spk,
):
async def send_mic() -> None:
async for chunk in mic:
await agent.send(chunk)
async def play_output() -> None:
async for ev in agent.events():
if isinstance(ev, AudioOutputDelta):
await spk.feed(ev.data)
await asyncio.gather(send_mic(), play_output())
asyncio.run(main())
Microphone¶
Microphone is an async-iterable that yields AudioBlock chunks of PCM16 mono
audio at a configurable sample rate.
from axio_audio import Microphone
async with Microphone(sample_rate=24000, chunk_ms=50) as mic:
async for chunk in mic:
# chunk is an AudioBlock with PCM16 bytes
await agent.send(chunk)
Parameter |
Default |
Description |
|---|---|---|
|
|
Sample rate in Hz |
|
|
Chunk duration in milliseconds |
|
|
sounddevice device index or name |
|
|
Internal queue size (chunks) |
Speaker¶
Speaker is an async-friendly PCM16 playback buffer. Feed it raw bytes; the
audio callback drains the buffer as the device requests samples.
from axio_audio import Speaker
async with Speaker(sample_rate=24000) as spk:
async for ev in agent.events():
if isinstance(ev, AudioOutputDelta):
await spk.feed(ev.data)
Parameter |
Default |
Description |
|---|---|---|
|
|
Sample rate in Hz |
|
|
sounddevice device index or name |
|
|
Optional |
Call spk.stop() to immediately clear the playback buffer - use this to honour
user interruptions so the assistant goes silent right away.
DuplexAudio¶
Independent Microphone and Speaker open separate PortAudio streams on
different host clocks. On consumer audio stacks (PipeWire, PulseAudio) those
clocks drift by ~10-25 ms/sec, which destroys echo canceller quality.
DuplexAudio opens a single sd.RawStream so mic and speaker share one
PortAudio clock. Use it when echo cancellation matters.
from axio_audio import DuplexAudio
from axio.events import AudioOutputDelta
async with DuplexAudio(sample_rate=48000, chunk_ms=20) as duplex:
async def consume_mic() -> None:
async for chunk in duplex.mic_chunks():
await agent.send(chunk)
async def play_output() -> None:
async for ev in agent.events():
if isinstance(ev, AudioOutputDelta):
await duplex.feed_speaker(ev.data)
await asyncio.gather(consume_mic(), play_output())
DuplexAudio also exposes .mic and .speaker properties that satisfy the
same async context-manager interface as Microphone / Speaker, so you can
swap them with minimal changes to existing code.
Parameter |
Default |
Description |
|---|---|---|
|
|
Sample rate in Hz |
|
|
Chunk duration in milliseconds |
|
|
Device index, name, or |
|
|
Expose a mono API even on multi-channel devices |
|
|
Mic chunk queue size |
|
|
Optional |
RealtimeAgent¶
RealtimeAgent (in the axio core) is the duplex counterpart to Agent. It
drives a RealtimeSession and dispatches tool calls as background tasks so
audio output is never blocked by slow tools.
from axio.realtime import RealtimeAgent
from axio import Tool
agent = RealtimeAgent(
system="You are a helpful assistant.",
transport=transport,
tools=[Tool(name="my_tool", handler=my_handler)],
voice="Aoede",
input_audio_format="audio/pcm;rate=16000",
output_audio_format="audio/pcm;rate=24000",
)
Parameter |
Default |
Description |
|---|---|---|
|
required |
System prompt |
|
required |
A |
|
|
Tools available to the model |
|
|
Voice name (transport-specific) |
|
|
Audio format sent to the model |
|
|
Audio format received from the model |
|
|
Re-raise exceptions from |
Handling interruptions¶
When the user starts speaking while the assistant is talking, the model signals
a SpeechStarted event. Call agent.interrupt() to cancel in-flight tool tasks
and tell the session to stop generating:
from axio.events import AudioOutputDelta, SpeechStarted
async for ev in agent.events():
match ev:
case SpeechStarted():
spk.stop() # clear speaker buffer
await agent.interrupt() # cancel tools, stop generation
case AudioOutputDelta(data=pcm):
spk.feed(pcm)
Audio format notes¶
PCM16 = signed 16-bit integer, little-endian, mono
Gemini Live default input:
audio/pcm;rate=16000Gemini Live default output:
audio/pcm;rate=24000DuplexAudiodefaults to 48 kHz (higher quality, resampled by sounddevice)
GeminiLiveTransport parameters¶
Parameter |
Default |
Description |
|---|---|---|
|
|
Developer API key |
|
|
Model for live sessions |
|
|
BCP-47 language code (e.g. |
30 languages are supported, including Arabic, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, and Vietnamese.
For Vertex AI, use VertexLiveTransport and optionally call
probe_nearest_live_region() to pick the fastest available region:
from axio_transport_google.realtime import VertexLiveTransport, probe_nearest_live_region
region = await probe_nearest_live_region()
transport = VertexLiveTransport(location=region, auto_region=True)
Examples¶
The repository includes two complete examples:
examples/realtime_smoke/- minimal smoke test: sends a text message, receives audio, plays itexamples/realtime_chat/- full-featured voice chat with echo cancellation, volume metering, interruption handling, and Playwright-based screenshot tool