Spatial Computing · Real-time · Proximity Audio

Towards Real-Time Spatial Agent Communication

Agents that exist in space, not just in a thread.


Latency: <300ms

Audio: 48kHz Stereo

Model Size: <100M params

Inference: CPU

Current AI agents live in threads — text boxes, chat windows, API calls. Spatial Agent Communication moves agents into the real world. Agents have positions. Voice volume fades with distance. Room boundaries define context. Multiple agents coexist in one space with distinct voices and behaviors. This isn't a chatbot with a 3D avatar — it's an agent that exists in space the same way you do.

The entire stack runs on consumer hardware — sub-100M parameter models, CPU inference, no cloud dependency. Position tracking feeds into proximity models that modulate spatial TTS output. SpeechBus serializes all spoken output through a single queue to prevent overlap. The result: agents that feel present, not remote.

Features

Proximity-Aware Voice

Distance Fades

Voice volume fades naturally with distance. Spatial audio is anchored at the agent's position: volume is computed from the agent-to-listener distance and stereo panning from the agent's position relative to the listener, both in real time. Walk closer, hear clearly. Walk away, it fades.
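A minimal sketch of how distance and position could turn into left/right gains. The inverse-distance falloff, the constant-power pan law, the 2D coordinates, the assumption that the listener faces +y, and the names `spatial_gains` and `ref_dist` are all illustrative choices, not the project's actual implementation.

```python
import math

def spatial_gains(agent_xy, listener_xy, ref_dist=1.0):
    """Return (left_gain, right_gain) for a listener assumed to face +y."""
    dx = agent_xy[0] - listener_xy[0]
    dy = agent_xy[1] - listener_xy[1]
    dist = math.hypot(dx, dy)  # Euclidean agent-to-listener distance

    # Inverse-distance attenuation, clamped to full volume inside ref_dist.
    volume = min(1.0, ref_dist / max(dist, 1e-9))

    # Pan in [-1, 1]: -1 is hard left, +1 is hard right.
    pan = 0.0 if dist == 0.0 else dx / dist

    # Constant-power pan law keeps perceived loudness steady across the arc.
    angle = (pan + 1.0) * math.pi / 4.0
    return volume * math.cos(angle), volume * math.sin(angle)
```

With these assumptions, an agent straight ahead at the reference distance gets equal gain in both channels, and an agent two units to the right is heard only in the right channel at half volume.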

Room-Bound Context

Boundary-Aware

Agents know room boundaries and switch behavior when a player enters or leaves their space. Room-level context means the agent's knowledge, tools, and personality are tied to a physical location — not a global namespace.
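One way to picture boundary detection and context switching, as a sketch only: rooms are assumed here to be axis-aligned rectangles, and `Room`, `room_at`, and `RoomTracker` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Room:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def room_at(rooms, x, y):
    """Return the first room containing the point, or None if outside all rooms."""
    for room in rooms:
        if room.contains(x, y):
            return room
    return None

class RoomTracker:
    """Emits enter/leave events when the player crosses a room boundary."""

    def __init__(self, rooms):
        self.rooms = rooms
        self.current = None

    def update(self, x, y):
        new = room_at(self.rooms, x, y)
        events = []
        if new != self.current:
            if self.current is not None:
                events.append(("leave", self.current.name))
            if new is not None:
                events.append(("enter", new.name))
            self.current = new
        return events
```

An agent bound to a room would load its knowledge, tools, and personality on an `enter` event for that room and release them on `leave`.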

Multi-Voice

Distinct Per Agent

Each agent gets a distinct voice via zero-shot voice cloning. Multiple agents coexist in one space, each with a unique voice fingerprint. MOSS-TTS-Nano handles multi-voice output with no quality degradation across voices.

Edge Deployment

No Cloud

All models run locally on consumer hardware. Sub-100M parameters for real-time CPU inference. No cloud dependency, no data center, no latency spikes from network round-trips. Everything runs on the machine in front of you.

Architecture

Position tracking → Proximity model → Spatial TTS → SVC spatial audio. MOSS-TTS-Nano for voice cloning. SpeechBus serializes all spoken output (one queue, no overlap).
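The "one queue, no overlap" idea can be sketched as a single worker thread draining a FIFO queue. SpeechBus internals are not described here, so the class name `SpeechBusSketch`, the `play` callback, and the `say`/`wait_idle` methods are assumptions made for illustration.

```python
import queue
import threading

class SpeechBusSketch:
    """Serializes utterances: one queue, one worker, one voice at a time."""

    def __init__(self, play):
        # `play(agent_id, text)` is a hypothetical blocking playback callback.
        self._play = play
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def say(self, agent_id, text):
        """Enqueue an utterance; returns immediately."""
        self._q.put((agent_id, text))

    def _run(self):
        while True:
            agent_id, text = self._q.get()
            try:
                # Blocks until playback ends, so the next item cannot overlap.
                self._play(agent_id, text)
            except Exception as exc:
                print(f"playback failed: {exc}")  # keep the worker alive
            finally:
                self._q.task_done()

    def wait_idle(self):
        """Block until every queued utterance has been played."""
        self._q.join()
```

Because only the worker thread ever calls `play`, two agents that speak at the same moment are heard one after the other, in the order their requests arrived.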

Position Layer

Tracking + Context

Real-time position tracking for both player and agents. Room boundary detection and context switching. Proximity computed as Euclidean distance — feeds directly into the audio pipeline for volume and panning.

Position Tracking · Room Context · Proximity Calc

Proximity Model

Distance → Volume

Maps agent-to-player distance to audio parameters: volume attenuation, stereo panning, and reverb. Smooth transitions — no abrupt cuts. The model runs at frame rate for continuous spatial audio updates.
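Smooth, cut-free transitions at frame rate are often achieved with exponential smoothing toward a target value. The sketch below assumes that approach; `SmoothedParam` and its `time_constant` are hypothetical, and the source does not say which smoothing method the proximity model uses.

```python
import math

class SmoothedParam:
    """Eases an audio parameter (volume, pan, reverb mix) toward a target."""

    def __init__(self, value=0.0, time_constant=0.1):
        self.value = value
        self.time_constant = time_constant  # seconds to close ~63% of the gap

    def update(self, target, dt):
        # Deriving alpha from dt keeps the response independent of frame rate.
        alpha = 1.0 - math.exp(-dt / self.time_constant)
        self.value += alpha * (target - self.value)
        return self.value
```

Calling `update(target, dt)` once per frame with the frame's elapsed time moves the parameter a fraction of the way to its target, so a sudden jump in distance becomes a short fade instead of a click.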

Volume Attenuation · Stereo Panning · Frame-Rate

Spatial Audio Output

TTS + SVC Stereo

MOSS-TTS-Nano generates speech with zero-shot voice cloning. SVC applies spatial audio effects — stereo panning, distance attenuation, environmental reverb. SpeechBus queue prevents overlapping speech from multiple agents.
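As a rough illustration of the last step, per-channel gains can be applied to mono TTS samples to produce interleaved stereo. This is a sketch under assumptions: float samples in [-1, 1], no reverb, and a hypothetical helper named `spatialize`. It does not reproduce the actual SVC stage.

```python
def spatialize(mono, left_gain, right_gain):
    """Turn mono float samples into interleaved stereo [L0, R0, L1, R1, ...]."""
    stereo = []
    for sample in mono:
        # Clamp to [-1, 1] so gains above 1.0 cannot push samples out of range.
        stereo.append(max(-1.0, min(1.0, sample * left_gain)))
        stereo.append(max(-1.0, min(1.0, sample * right_gain)))
    return stereo
```

For example, `spatialize([0.5, -0.5], 1.0, 0.5)` yields `[0.5, 0.25, -0.5, -0.25]`: full level on the left, half level on the right.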

MOSS-TTS-Nano · SVC Spatial · SpeechBus

<300ms latency · 48kHz Stereo · <100M params · CPU Inference · Open Source


Papers & Reports

Technical reports and papers. All work is open access with accompanying code.

Towards Real-Time Spatial Agent Communication

Preprint · 2025


Momo: A Voice-First Holographic AI Command Center

Technical Report · 2025
