Towards Real-Time Spatial Agent Communication
Agents that exist in space, not just in a thread.
Current AI agents live in threads — text boxes, chat windows, API calls. Spatial Agent Communication moves agents into the real world. Agents have positions. Voice volume fades with distance. Room boundaries define context. Multiple agents coexist in one space with distinct voices and behaviors. This isn't a chatbot with a 3D avatar — it's an agent that exists in space the same way you do.
The entire stack runs on consumer hardware — sub-100M parameter models, CPU inference, no cloud dependency. Position tracking feeds into proximity models that modulate spatial TTS output. SpeechBus serializes all spoken output through a single queue to prevent overlap. The result: agents that feel present, not remote.
Features
Proximity-Aware Voice
Distance Fades
Voice volume fades naturally with distance. Spatial audio is anchored at the agent's position — stereo panning and volume computed from agent-to-listener distance in real time. Walk closer, hear clearly. Walk away, it fades.
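As a rough illustration of how volume and stereo panning could be derived from agent-to-listener geometry, here is a minimal sketch. The function name `spatial_gains`, the linear rolloff between `ref_dist` and `max_dist`, and the constant-power pan law are assumptions made for this example; the project's actual attenuation curve is not specified here.

```python
import math

def spatial_gains(agent_xy, listener_xy, listener_heading=0.0,
                  ref_dist=1.0, max_dist=20.0):
    """Return (left_gain, right_gain) for a mono voice placed at agent_xy.

    listener_heading is the listener's facing direction in radians,
    measured counter-clockwise from the +x axis (an assumed convention).
    """
    dx = agent_xy[0] - listener_xy[0]
    dy = agent_xy[1] - listener_xy[1]
    dist = math.hypot(dx, dy)

    # Assumed linear rolloff: full volume inside ref_dist, silent at max_dist.
    t = (dist - ref_dist) / (max_dist - ref_dist)
    volume = 1.0 - min(1.0, max(0.0, t))

    # Angle of the agent relative to where the listener is facing.
    # Positive angles are to the listener's left, so negate the sine
    # to get pan in [-1, 1] with -1 = hard left and +1 = hard right.
    relative = math.atan2(dy, dx) - listener_heading
    pan = -math.sin(relative)

    # Constant-power pan law keeps perceived loudness steady across the arc.
    theta = (pan + 1.0) * math.pi / 4.0
    return volume * math.cos(theta), volume * math.sin(theta)
```

Calling `spatial_gains((1.0, 0.0), (0.0, 0.0), listener_heading=math.pi / 2)` places the agent directly to the right of a listener facing +y, so nearly all of the signal lands in the right channel.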
Room-Bound Context
Boundary-Aware
Agents know room boundaries and switch behavior when a player enters or leaves their space. Room-level context means the agent's knowledge, tools, and personality are tied to a physical location — not a global namespace.
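One way to picture room-bound context is a tracker that reports when the player crosses a boundary, so the agent tied to that room can load or drop its context. The sketch below assumes axis-aligned rectangular rooms; the names `Room`, `find_room`, and `RoomTracker` and the `context` dictionary are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field

@dataclass
class Room:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    # Hypothetical per-room payload: knowledge, tools, personality.
    context: dict = field(default_factory=dict)

    def contains(self, x, y):
        # Half-open bounds so adjacent rooms never both claim a shared wall.
        return self.x_min <= x < self.x_max and self.y_min <= y < self.y_max

def find_room(rooms, x, y):
    """Return the first room containing (x, y), or None if outside all rooms."""
    for room in rooms:
        if room.contains(x, y):
            return room
    return None

class RoomTracker:
    """Remembers the player's current room and reports boundary crossings."""

    def __init__(self, rooms):
        self.rooms = rooms
        self.current = None

    def update(self, x, y):
        """Return (room_left, room_entered) on a crossing, otherwise None."""
        new_room = find_room(self.rooms, x, y)
        if new_room is self.current:
            return None
        old_room, self.current = self.current, new_room
        return old_room, new_room

kitchen = Room("kitchen", 0, 0, 5, 5, {"persona": "cook"})
lab = Room("lab", 5, 0, 10, 5, {"persona": "engineer"})
tracker = RoomTracker([kitchen, lab])
```

Each call to `tracker.update(x, y)` returns `None` while the player stays in the same room, and a `(room_left, room_entered)` pair when a boundary is crossed, which is the point at which an agent would switch behavior.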
Multi-Voice
Distinct Per Agent
Each agent gets its own voice through zero-shot voice cloning, so multiple agents can share one space and still be told apart by ear. MOSS-TTS-Nano handles the multi-voice output with no quality degradation across voices.
Edge Deployment
No Cloud
All models run locally on consumer hardware. Sub-100M parameters for real-time CPU inference. No cloud dependency, no data center, no latency spikes from network round-trips. Everything runs on the machine in front of you.
Architecture
Position tracking → Proximity model → Spatial TTS → SVC spatial audio. MOSS-TTS-Nano for voice cloning. SpeechBus serializes all spoken output (one queue, no overlap).
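The "one queue, no overlap" idea can be sketched with a single worker thread that drains a FIFO queue, so only one utterance is ever being spoken at a time. This is an illustrative stand-in, not the project's SpeechBus implementation: the class shape, the `say` and `close` methods, and the `speak_fn` callback (which would wrap TTS and playback) are all assumptions.

```python
import queue
import threading

class SpeechBus:
    """Serializes utterances from all agents through one FIFO queue."""

    def __init__(self, speak_fn):
        # speak_fn(agent_id, text) is a hypothetical blocking callback
        # that synthesizes and plays one utterance to completion.
        self._queue = queue.Queue()
        self._speak_fn = speak_fn
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def say(self, agent_id, text):
        """Enqueue an utterance; returns immediately."""
        self._queue.put((agent_id, text))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel pushed by close()
                    return
                # Blocks until this utterance finishes, so nothing overlaps.
                self._speak_fn(*item)
            finally:
                self._queue.task_done()

    def close(self):
        """Finish everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()

spoken = []
bus = SpeechBus(lambda agent_id, text: spoken.append((agent_id, text)))
bus.say("guide", "Welcome to the lab.")
bus.say("smith", "Mind the forge.")
bus.close()
```

Because a single worker consumes the queue, utterances play in the order they were submitted regardless of which agent produced them.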
Position Layer
Tracking + Context
Real-time position tracking for both player and agents, with room boundary detection and context switching. Proximity is computed as the Euclidean distance between agent and player and feeds directly into the audio pipeline for volume and panning.
Proximity Model
Distance → Volume
Maps agent-to-player distance to audio parameters: volume attenuation, stereo panning, and reverb. Smooth transitions — no abrupt cuts. The model runs at frame rate for continuous spatial audio updates.
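Smooth transitions at frame rate are commonly achieved by easing each audio parameter toward its target instead of setting it directly. The sketch below uses exponential smoothing with a time constant so the result does not depend on the frame rate; the class name `SmoothedParam` and the 0.1 s default are assumptions for illustration, not values from the project.

```python
import math

class SmoothedParam:
    """Eases a value (volume, pan, reverb mix) toward a moving target."""

    def __init__(self, value=0.0, time_constant=0.1):
        self.value = value
        # Seconds needed to close about 63% of the gap to the target.
        self.time_constant = time_constant

    def update(self, target, dt):
        """Advance by dt seconds toward target and return the new value."""
        # Deriving alpha from dt makes two 50 ms steps equal one 100 ms step.
        alpha = 1.0 - math.exp(-dt / self.time_constant)
        self.value += alpha * (target - self.value)
        return self.value

volume = SmoothedParam(value=0.0)
for _ in range(60):  # one second of updates at an assumed 60 fps
    volume.update(1.0, 1.0 / 60.0)
```

If the target jumps, for example when a player steps through a doorway, the parameter ramps toward the new value over a few frames instead of cutting abruptly.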
Spatial Audio Output
TTS + SVC Stereo
MOSS-TTS-Nano generates speech with zero-shot voice cloning. SVC then applies the spatial audio effects: stereo panning, distance attenuation, and environmental reverb. The SpeechBus queue prevents speech from multiple agents from overlapping.
Demos
Papers & Reports
Technical reports and papers. All work is open access with accompanying code.