Realtime Audio

Build voice agents with axio-audio and RealtimeAgent. The audio package provides microphone capture and speaker playback; RealtimeAgent (in the axio core) drives a RealtimeTransport session, dispatching tool calls concurrently with streaming audio output.

Install

pip install axio-audio axio-transport-google

axio-audio depends on sounddevice and numpy.

Architecture

┌──────────────┐    PCM16 audio    ┌────────────────────────┐
│  Microphone  │──────────────────▶│                        │
└──────────────┘                   │     RealtimeAgent      │
                                   │  (axio core)           │
┌──────────────┐    PCM16 audio    │                        │
│   Speaker    │◀──────────────────│                        │
└──────────────┘                   └────────────┬───────────┘
                                                │  WebSocket
                                   ┌────────────▼───────────┐
                                   │  GeminiLiveTransport   │
                                   │  (Gemini Live API)     │
                                   └────────────────────────┘

The session is full-duplex: you send raw PCM16 chunks from the microphone and receive AudioOutputDelta events to feed to the speaker.

Minimal example

import asyncio
from axio_audio import Microphone, Speaker
from axio_transport_google.realtime import GeminiLiveTransport
from axio.realtime import RealtimeAgent
from axio.events import AudioOutputDelta, TurnComplete


async def main() -> None:
    transport = GeminiLiveTransport()

    async with (
        RealtimeAgent(system="You are a helpful assistant.", transport=transport) as agent,
        Microphone() as mic,
        Speaker() as spk,
    ):
        async def send_mic() -> None:
            async for chunk in mic:
                await agent.send(chunk)

        async def play_output() -> None:
            async for ev in agent.events():
                if isinstance(ev, AudioOutputDelta):
                    await spk.feed(ev.data)

        await asyncio.gather(send_mic(), play_output())


asyncio.run(main())

Microphone

Microphone is an async-iterable that yields AudioBlock chunks of PCM16 mono audio at a configurable sample rate.

from axio_audio import Microphone

async with Microphone(sample_rate=24000, chunk_ms=50) as mic:
    async for chunk in mic:
        # chunk is an AudioBlock with PCM16 bytes
        await agent.send(chunk)

Parameter

Default

Description

sample_rate

24000

Sample rate in Hz

chunk_ms

50

Chunk duration in milliseconds

device

None

sounddevice device index or name

queue_maxsize

100

Internal queue size (chunks)

Speaker

Speaker is an async-friendly PCM16 playback buffer. Feed it raw bytes; the audio callback drains the buffer as the device requests samples.

from axio_audio import Speaker

async with Speaker(sample_rate=24000) as spk:
    async for ev in agent.events():
        if isinstance(ev, AudioOutputDelta):
            await spk.feed(ev.data)

Parameter

Default

Description

sample_rate

24000

Sample rate in Hz

device

None

sounddevice device index or name

playback_tap

None

Optional Callable[[bytes], None] for echo-cancel reference

Call spk.stop() to immediately clear the playback buffer - use this to honour user interruptions so the assistant goes silent right away.

DuplexAudio

Independent Microphone and Speaker open separate PortAudio streams on different host clocks. On consumer audio stacks (PipeWire, PulseAudio) those clocks drift by ~10-25 ms/sec, which destroys echo canceller quality.

DuplexAudio opens a single sd.RawStream so mic and speaker share one PortAudio clock. Use it when echo cancellation matters.

from axio_audio import DuplexAudio
from axio.events import AudioOutputDelta

async with DuplexAudio(sample_rate=48000, chunk_ms=20) as duplex:
    async def consume_mic() -> None:
        async for chunk in duplex.mic_chunks():
            await agent.send(chunk)

    async def play_output() -> None:
        async for ev in agent.events():
            if isinstance(ev, AudioOutputDelta):
                await duplex.feed_speaker(ev.data)

    await asyncio.gather(consume_mic(), play_output())

DuplexAudio also exposes .mic and .speaker properties that satisfy the same async context-manager interface as Microphone / Speaker, so you can swap them with minimal changes to existing code.

Parameter

Default

Description

sample_rate

48000

Sample rate in Hz

chunk_ms

20

Chunk duration in milliseconds

device

None

Device index, name, or (input, output) tuple

mono_io

True

Expose a mono API even on multi-channel devices

queue_maxsize

100

Mic chunk queue size

playback_tap

None

Optional Callable[[bytes], None] for echo-cancel reference

RealtimeAgent

RealtimeAgent (in the axio core) is the duplex counterpart to Agent. It drives a RealtimeSession and dispatches tool calls as background tasks so audio output is never blocked by slow tools.

from axio.realtime import RealtimeAgent
from axio import Tool

agent = RealtimeAgent(
    system="You are a helpful assistant.",
    transport=transport,
    tools=[Tool(name="my_tool", handler=my_handler)],
    voice="Aoede",
    input_audio_format="audio/pcm;rate=16000",
    output_audio_format="audio/pcm;rate=24000",
)

Parameter

Default

Description

system

required

System prompt

transport

required

A RealtimeTransport (e.g. GeminiLiveTransport)

tools

[]

Tools available to the model

voice

None

Voice name (transport-specific)

input_audio_format

"audio/pcm;rate=16000"

Audio format sent to the model

output_audio_format

"audio/pcm;rate=24000"

Audio format received from the model

raise_on_error

True

Re-raise exceptions from Error events; set False to handle them in-loop

Handling interruptions

When the user starts speaking while the assistant is talking, the model signals a SpeechStarted event. Call agent.interrupt() to cancel in-flight tool tasks and tell the session to stop generating:

from axio.events import AudioOutputDelta, SpeechStarted

async for ev in agent.events():
    match ev:
        case SpeechStarted():
            spk.stop()                 # clear speaker buffer
            await agent.interrupt()    # cancel tools, stop generation
        case AudioOutputDelta(data=pcm):
            spk.feed(pcm)

Audio format notes

  • PCM16 = signed 16-bit integer, little-endian, mono

  • Gemini Live default input: audio/pcm;rate=16000

  • Gemini Live default output: audio/pcm;rate=24000

  • DuplexAudio defaults to 48 kHz (higher quality, resampled by sounddevice)

GeminiLiveTransport parameters

Parameter

Default

Description

api_key

GEMINI_API_KEY

Developer API key

model

gemini-3-flash-preview

Model for live sessions

language_code

None

BCP-47 language code (e.g. "en-US")

30 languages are supported, including Arabic, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, and Vietnamese.

For Vertex AI, use VertexLiveTransport and optionally call probe_nearest_live_region() to pick the fastest available region:

from axio_transport_google.realtime import VertexLiveTransport, probe_nearest_live_region

region = await probe_nearest_live_region()
transport = VertexLiveTransport(location=region, auto_region=True)

Examples

The repository includes two complete examples:

  • examples/realtime_smoke/ - minimal smoke test: sends a text message, receives audio, plays it

  • examples/realtime_chat/ - full-featured voice chat with echo cancellation, volume metering, interruption handling, and Playwright-based screenshot tool