Multimodal¶
Axio agents can send and receive images, audio, and video alongside text. Content is represented as typed blocks throughout the stack - from tool results to conversation history to LLM responses.
Content blocks¶
All content is a block. The core package defines four block types:
| Block | Media types |
|---|---|
| `TextBlock` | - |
| `ImageBlock` | `image/*` (e.g. `image/png`, `image/jpeg`) |
| `AudioBlock` | `audio/*` |
| `VideoBlock` | `video/*` |
```python
from axio.blocks import TextBlock, ImageBlock, AudioBlock, VideoBlock
```
All block types are frozen dataclasses. `ImageBlock`, `AudioBlock`, and `VideoBlock` carry two fields, `media_type: str` and `data: bytes`; `TextBlock` carries a single `text: str` field.
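Conceptually, a media block is just an immutable value object. A minimal stand-in (a sketch mirroring the shape described above, not the real axio class) looks like this:

```python
import dataclasses
from dataclasses import dataclass


# Hypothetical stand-in for axio's ImageBlock: a frozen dataclass
# carrying a MIME type and the raw bytes of the media.
@dataclass(frozen=True)
class FakeImageBlock:
    media_type: str
    data: bytes


block = FakeImageBlock(media_type="image/png", data=b"\x89PNG")
print(block.media_type)  # "image/png"

# frozen=True makes the block immutable: assignment raises an error.
try:
    block.data = b""
except dataclasses.FrozenInstanceError:
    print("blocks are immutable")
```

Immutability means blocks can be safely shared between the context store, the transport, and tool results without defensive copying.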
Sending images to the agent¶
`agent.run()` accepts a plain string as the user message. To include images or
other media, append a `Message` with the appropriate blocks to the context
before calling `run()`:
```python
import asyncio

from axio import Agent, MemoryContextStore
from axio.messages import Message
from axio.blocks import TextBlock, ImageBlock
from axio.testing import StubTransport, make_text_response


async def main() -> None:
    with open("screenshot.png", "rb") as f:
        image_data = f.read()

    context = MemoryContextStore()
    await context.append(Message(
        role="user",
        content=[
            TextBlock(text="What is shown in this screenshot?"),
            ImageBlock(media_type="image/png", data=image_data),
        ],
    ))

    transport = StubTransport([make_text_response("A terminal window.")])
    agent = Agent(system="You are a helpful visual assistant.", tools=[], transport=transport)

    reply = await agent.run("Describe it in detail.", context)
    assert reply == "A terminal window."


asyncio.run(main())
```
The `run()` call appends its string argument as an additional user message.
The LLM therefore sees the image message followed by the "Describe it"
message, just as if two separate turns had happened.
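The resulting ordering can be sketched as plain dicts (illustrative only; the actual payload shape is transport-specific):

```python
# What the transport conceptually receives: the pre-appended multimodal
# message, then the string passed to run() as a second user message.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this screenshot?"},
            {"type": "image", "media_type": "image/png"},
        ],
    },
    {"role": "user", "content": "Describe it in detail."},
]

print(len(conversation))  # 2
```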
Reading media files as tools¶
`read_file` from `axio-tools-local` detects file extensions and returns
multimodal blocks automatically - no extra configuration needed:

| Extension | Returned block |
|---|---|
| Image extensions (e.g. `.png`, `.jpg`) | `ImageBlock` |
| Audio extensions (e.g. `.mp3`, `.wav`) | `AudioBlock` |
| Video extensions (e.g. `.mp4`) | `VideoBlock` |
For a text file, `read_file` returns a plain `str`. For a media file it
returns `[TextBlock(text="...metadata..."), <MediaBlock>]`. The transport
forwards the blocks to the LLM as native multimodal content.
A vision-capable model (Gemini, Claude, GPT-4o) receiving read_file results
for images can describe, compare, and reason about the pixel content directly.
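The extension dispatch can be pictured with a small `mimetypes`-based sketch (an approximation of the behavior described above, not the actual `read_file` implementation):

```python
import mimetypes


def classify(path: str) -> str:
    """Map a file path to the kind of block a media-aware reader might return."""
    mime, _ = mimetypes.guess_type(path)
    if mime is not None:
        for kind in ("image", "audio", "video"):
            if mime.startswith(kind + "/"):
                return kind
    return "text"  # everything else falls back to plain text


print(classify("screenshot.png"))  # image
print(classify("notes.txt"))      # text
```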
Tools returning multimodal content¶
A tool handler can return a list of blocks to pass rich content back to the
model. This is how `read_file` works internally:
```python
import asyncio

from axio import Agent, MemoryContextStore, Tool
from axio.blocks import TextBlock, ImageBlock
from axio.testing import StubTransport, make_tool_use_response, make_text_response


async def capture_chart() -> list[TextBlock | ImageBlock]:
    """Capture the current chart as an image."""
    chart_bytes = b"\x89PNG..."  # real implementation would render a chart
    return [
        TextBlock(text="Chart captured."),
        ImageBlock(media_type="image/png", data=chart_bytes),
    ]


async def main() -> None:
    transport = StubTransport([
        make_tool_use_response("capture_chart", "t1", {}),
        make_text_response("The chart shows an upward trend."),
    ])
    agent = Agent(
        system="You are a data analyst.",
        tools=[Tool(name="capture_chart", handler=capture_chart)],
        transport=transport,
    )
    reply = await agent.run("Describe the current chart.", MemoryContextStore())
    assert reply == "The chart shows an upward trend."


asyncio.run(main())
```
Realtime audio¶
For low-latency voice agents that stream raw PCM audio in both directions, use
`RealtimeAgent` with `axio-audio`. See the Realtime Audio guide.
Transport support¶
Not every transport supports every modality. Check the model’s declared capabilities before sending:
A model's capability set declares which modalities it accepts as input (images, audio, and video) and which it can produce as output (images and videos). `Capability.vision`, for example, marks image-input support:
```python
from axio.models import Capability

model = getattr(transport, "model", None)
caps = getattr(model, "capabilities", frozenset())

if Capability.vision in caps:
    print("This model can see images")
```
See the Model Registry reference for the full capabilities list.