Harry Edwards

Omo Research

FINAL — Omo Research Technical Report · June 2025

Omo Space: A Spatial Layer for AI Agent Workforces

We introduce Omo Space, a spatial computing layer that represents AI agent teams as walkable, inhabitable places inside Minecraft. Every room is an agent. Hiring means walking up to an empty plot and issuing a natural-language request. Rooms construct themselves block by block as agents come online, and live reasoning streams appear on in-world screens behind each agent. We present the three-ore architecture, comprising Gemini (brain), the Agent Development Kit (hierarchy), and the Model Context Protocol (hands), and demonstrate through system-level evaluation that spatial presence yields a qualitatively different monitoring paradigm from tab-based interfaces: walking replaces window-switching, and proximity serves as an observable attention signal. Human-factors validation of these spatial interaction claims requires user studies planned for future work (Section 8.1); the current evaluation focuses on architectural properties.

Venue: Google AI Agents Challenge

Platform: Minecraft 1.21.4 / Java 21

Setup Time: 9.8 min (mean, n=5)

LLM: Gemini Flash

Executive Summary

Omo Space turns managing AI agents from staring at text windows into walking through a Minecraft world where every room is an agent. Instead of typing commands in a chat interface and waiting for asynchronous results, users navigate a live campus in which each agent has a physical room with screens showing its reasoning in real time. Hiring a new agent means walking to an empty plot and describing what you need; the system builds a unique room block by block, and the agent walks in ready to work. The architecture separates reasoning (Gemini), organization (Google's Agent Development Kit), and tool use (Model Context Protocol with spatial gating) into independently replaceable modules. All code is open source and runs on consumer hardware.

Abstract

Background. AI agent workforces are predominantly accessed through text-based interfaces (chat windows, dashboards, and logs) that leave agent activity opaque. Users cannot observe agent reasoning in real time, cannot intervene mid-task, and lack spatial intuition about agent organization, work distribution, or current state.

Problem. This transparency gap limits trust in autonomous agent systems and prevents users from forming accurate mental models of multi-agent coordination. Existing multi-agent frameworks (AutoGen, CrewAI, LangGraph) represent agents as abstract message-passing entities lacking spatial embodiment, navigable presence, or observable real-time reasoning.

Approach. We introduce Omo Space, a spatial computing layer that represents AI agent teams as inhabitable places inside Minecraft. The system maps agents to physical rooms, attention to spatial proximity, and tool access to room location, a paradigm we term spatial agent habitats. Omo Space rests on a modular three-component architecture (the "three ores"): Gemini for reasoning, the Agent Development Kit (ADK) for organizational hierarchy and inter-agent delegation, and the Model Context Protocol (MCP) for tool use with spatial gating. A self-building world-architect agent generates unique rooms on demand using Gemini's spatial reasoning; position updates stream at 20 Hz with < 50 ms latency via WebSocket; and the approval gate intercepts sensitive actions with < 300 ms overhead, defaulting to denied.

Results. The working prototype supports a founding crew of three agents (Chief of Staff plus two specialists) with on-demand agent hiring. We measure setup reproducibility at 9.8 minutes mean from a clean environment (n = 5), room-gating correctness at 100% tool-to-room mapping accuracy, delegation reliability at 85% (Chief of Staff → specialist), and end-to-end chat-to-screen latency at 0.8–2.1 s (median 1.3 s). The world-architect produced structurally sound, unique room designs across 20 test hires; 3 of the 20 showed minor interior-layout issues correctable by manual adjustment.

Implications. Spatial agent habitats represent a new interaction paradigm for AI workforces, replacing tab-based monitoring with spatial navigation. We release all code as open source and argue that embodied, walkable agent environments will grow increasingly important as AI workforces expand in capability and autonomy.

Keywords: spatial computing, AI agents, Minecraft, multi-agent systems, spatial agent habitats, human-agent interaction

1. Introduction

The prevailing interaction model for AI agent workforces is text-centric and temporally opaque. Users issue commands through chat interfaces and receive results asynchronously, often without observing the intermediate reasoning, delegation, or tool use that produced those results. When an agent errs, the user discovers the error post hoc. When work completes, the user polls for completion rather than observing it. This interaction pattern creates what we term a transparency gap: a structural barrier between the user's mental model of the agent system and its actual operation. Closing this gap requires an interaction paradigm where agent activity is continuously visible, spatially organized, and directly inspectable, properties that text-based interfaces cannot provide.

Spatial computing offers an alternative paradigm. If agents occupy locations in a shared three-dimensional environment the user can navigate, the workforce becomes a place rather than an abstract process. Physical proximity maps to attention. Room boundaries define scope of access. Walking becomes the navigation primitive. The user's embodied presence in the space becomes the primary interface mechanism, with no dashboards, tabs, or polling. We formalize this paradigm through six spatial primitives (Section 4.1) and a formal agent model (Section 2.1) that together define how spatial properties replace abstract coordination mechanisms.

We implement this paradigm in Minecraft for several practical reasons. Minecraft provides an effectively unbounded, procedurally generatable spatial substrate with a mature modding ecosystem (Paper API, Java 21). Its block-based world enables programmatic construction via deterministic coordinates at 1-meter granularity. The game's massive installed base (over 300 million copies sold as of 2023) eliminates the spatial navigation learning curve for a substantial fraction of potential users. Its established use as an AI research platform [5, 21] provides methodological precedent, while its creative-mode capabilities allow us to instrument agent behavior with in-world screens, signs, and spatial audio without client modification.

1.1 Related Work

Multi-agent frameworks. AutoGen [22] enables conversational multi-agent orchestration where agents communicate via message-passing in a shared thread. CrewAI and LangGraph provide similar abstractions with role-based agent design. These systems represent agents as text entities with no spatial embodiment, location, or navigable presence. Omo Space extends this paradigm by assigning each agent a physical room whose location in the Minecraft world maps to the agent's organizational role and tool access, a spatial permission model we formalize in Section 2.4.

Minecraft as an AI platform. Project Malmo [5] established Minecraft as a research platform for AI experimentation, focusing on reinforcement learning agents solving navigation and construction tasks. Voyager [21] demonstrated lifelong learning in Minecraft through LLM-generated skill libraries, and MineDojo [4] provided a large-scale benchmark suite. These systems treat Minecraft as a training environment for agents. Omo Space inverts this relationship: Minecraft serves as the spatial interface through which humans interact with pre-built AI agents. Our approach relates to work on virtual reality interfaces for robot teleoperation [16, 23] but targets agent workforce management rather than physical robot control.

Spatial computing. Greenwold [6] defined spatial computing as "human interaction with a machine in which the machine retains and manipulates referents to real objects and spaces." Subsequent work in augmented reality [1] and ubiquitous computing has explored how physical space can organize digital information. In the VR domain, tools like Spatial, Horizon Workrooms, and Immersed have demonstrated that virtual environments can serve as shared workspaces for remote collaboration [8], while research on spatial operating systems [11, 18] has explored how spatial metaphors can structure file systems and application windows. Omo Space applies spatial computing principles to AI agent workforces, treating virtual space as the organizing metaphor for multi-agent coordination, and thereby extends spatial computing from physical-digital augmentation and human-to-human VR collaboration into the architecture of autonomous software systems. However, prior VR workspace tools focus on human-to-human interaction, not human-to-agent workforces, and we have not yet conducted comparative user studies against these platforms (Section 8.1).

Agent habitats and generative agents. Park et al. [13] demonstrated that LLM-powered generative agents in a simulated town produce believable social behaviors. Their work focused on agent-to-agent interaction within a simulation observed by researchers. Omo Space extends this concept to human-inhabited spaces where the user walks among the agents and interacts with them directly. The transition from observer to inhabitant represents a qualitative shift in the human-agent relationship: from studying agent behavior to participating in agent workspaces.

Tool-use protocols and security. The Model Context Protocol [2] provides a standardized interface for LLM tool use. Function calling in Gemini [3] and other LLMs enables structured tool invocation. Omo Space introduces spatial gating on top of MCP: tool access is determined by the room the agent occupies, and sensitive tool invocations surface as in-world approval gates rather than console prompts. This approach draws on mandatory access control principles from operating systems security [17] but replaces policy files with spatial layout: the floor plan encodes the access control list.

1.2 Contributions

This paper makes the following contributions:

1. The spatial agent habitat paradigm. We formalize the concept of spatial agent habitats (virtual environments where AI agents are rooms that users walk through) and define six core primitives (Section 4.1) translating abstract multi-agent coordination into embodied spatial interactions. We distinguish this paradigm from prior work on virtual workspaces, VR office environments, and spatial computing by its focus on agent workforce habitation, treating space as the primary interaction model rather than an overlay on existing interfaces.

2. The three-ore architectural pattern. We present a modular architecture, the three-ore pattern, in which reasoning (Gemini), organizational hierarchy (ADK), and tool use (MCP with spatial gating) are separated into independently replaceable modules with clean interfaces. The formal model (Section 2.1) specifies agent tuples, room-gating functions, WebSocket message types, and approval-gate semantics. We verify modularity by substituting both the LLM backend (Gemini → DeepSeek) and the orchestration framework without touching adjacent layers.

3. Self-building worlds via world-architect agent. We describe an LLM-powered world-architect agent that designs and builds unique rooms on demand using Gemini's spatial reasoning, translated to block placements by the world_build MCP tool and streamed to the Minecraft world as live construction completing in 3–8 s (median 4.5 s). The world-architect operates under formal design constraints (Section 4.2) that prevent navigational obstruction and ensure visual campus coherence.

4. System-level evaluation of a working prototype. We evaluate spatial coherence, delegation reliability, approval gating, latency, and setup reproducibility across 20+ test trials, establishing empirical baselines for each architectural dimension. All code is released as open source and runs on consumer hardware. We distinguish system-level properties (verified through instrumented testing) from human-factors properties (planned for future user studies, Section 8.1).

2. System Architecture

Omo Space runs as three cooperating processes. The Minecraft plugin and the Node runtime communicate over a single WebSocket bridge on port 8765, and the ADK service is reached over HTTP on port 8000 (Figure 1). The Minecraft plugin is the only code that touches the game world. Data flow: Player → Paper Plugin → WebSocket → Node Runtime → ADK Service → MCP Tools → World.

2.1 Formal Model

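Let A = {a₁, …, a_n} be the set of active agents, and let T be the universe of MCP tools. Each agent a_i is defined by a tuple:

\[
a_i = (\mathrm{id}_i,\ \mathrm{role}_i,\ \mathrm{room}_i,\ \mathrm{tools}_i,\ \mathrm{state}_i,\ \mathrm{screen}_i)
\]

where

\[
\begin{aligned}
\mathrm{id}_i &\in \mathbb{N} \\
\mathrm{role}_i &\in \{\mathrm{ChiefOfStaff},\ \mathrm{Growth},\ \mathrm{Comms},\ \mathrm{Custom}\} \\
\mathrm{room}_i &= (x_i, y_i, z_i) \in \mathbb{Z}^3 \\
\mathrm{tools}_i &\subseteq T \\
\mathrm{state}_i &\in \{\mathrm{idle},\ \mathrm{thinking},\ \mathrm{acting},\ \mathrm{waiting\_approval},\ \mathrm{done}\} \\
\mathrm{screen}_i &\in \{\mathrm{active},\ \mathrm{inactive}\} \times \mathbb{R}_{\ge 0} \quad \text{(status and last-update timestamp)}
\end{aligned}
\]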
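
The tuple can be mirrored in code as a small record type. The following is a minimal Python sketch rather than the project's actual data model: the class names (`Agent`, `Role`, `State`, `Screen`), the default values, and the example anchor coordinates are assumptions chosen to match the symbols in the definition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, Tuple


class Role(Enum):
    CHIEF_OF_STAFF = "ChiefOfStaff"
    GROWTH = "Growth"
    COMMS = "Comms"
    CUSTOM = "Custom"


class State(Enum):
    IDLE = "idle"
    THINKING = "thinking"
    ACTING = "acting"
    WAITING_APPROVAL = "waiting_approval"
    DONE = "done"


@dataclass(frozen=True)
class Screen:
    active: bool        # status component: active / inactive
    last_update: float  # last-update timestamp, constrained to be >= 0

    def __post_init__(self) -> None:
        if self.last_update < 0:
            raise ValueError("last_update must be non-negative")


@dataclass
class Agent:
    id: int                     # id_i, a natural number
    role: Role                  # role_i
    room: Tuple[int, int, int]  # room_i = (x, y, z), integer block coordinates
    tools: FrozenSet[str]       # tools_i, a subset of the tool universe T
    state: State = State.IDLE   # assumed initial state
    screen: Screen = field(default_factory=lambda: Screen(False, 0.0))


# Hypothetical example: a Chief of Staff anchored at an arbitrary position.
chief = Agent(id=1, role=Role.CHIEF_OF_STAFF, room=(0, 64, 0), tools=frozenset())
```

Using enums for `role` and `state` keeps the finite sets from the definition explicit, so an unknown role or state fails at construction time instead of propagating as a free-form string.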

The room-gating function gate: ℤ³ → 𝒫(T) maps an agent's spatial anchor to its permitted tool set via room name prefix matching:

\[
\mathrm{gate}(\mathrm{room}_i) = \{\, t \in T \mid \mathrm{prefix}(\mathrm{roomname}(\mathrm{room}_i)) \in \mathrm{tool\_prefixes}(t) \,\}
\]
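
To make the set-builder definition concrete, here is a minimal Python sketch of prefix-based gating. It is an illustration under stated assumptions, not the AgentManager's implementation. The tool names `gmail_read` and `code_edit`, the room names, the coordinates, and the rule that an unknown anchor receives no tools are all hypothetical; only `gmail_send`, `meta_ads_pause`, `meta_ads_update_budget`, and the `mail`/`ads`/`code`/`workshop` prefixes come from the report.

```python
from typing import Dict, FrozenSet, Set, Tuple

Coord = Tuple[int, int, int]

# tool_prefixes(t): the room-name prefixes that may use each tool.
# gmail_read and code_edit are hypothetical placeholders.
TOOL_PREFIXES: Dict[str, FrozenSet[str]] = {
    "gmail_read": frozenset({"mail"}),
    "gmail_send": frozenset({"mail"}),
    "meta_ads_pause": frozenset({"ads"}),
    "meta_ads_update_budget": frozenset({"ads"}),
    "code_edit": frozenset({"code", "workshop"}),
}

# roomname(room): hypothetical registry from anchor coordinates to room names.
ROOM_NAMES: Dict[Coord, str] = {
    (0, 64, 0): "mail-outreach",
    (32, 64, 0): "ads-growth",
    (64, 64, 0): "code-lab",
}


def prefix(room_name: str) -> str:
    """Return the part of the room name before the first hyphen."""
    return room_name.split("-", 1)[0]


def gate(room: Coord) -> Set[str]:
    """gate(room) = { t in T | prefix(roomname(room)) in tool_prefixes(t) }."""
    name = ROOM_NAMES.get(room)
    if name is None:
        return set()  # assumption: an unregistered anchor grants no tools
    p = prefix(name)
    return {tool for tool, allowed in TOOL_PREFIXES.items() if p in allowed}
```

With this table, `gate((0, 64, 0))` yields the two Gmail tools and nothing else, which is the property the room-gating correctness measurement checks: every tool an agent holds is justified by the room it occupies.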

Communication between layers proceeds via a WebSocket protocol on port :8765. The message type set M is partitioned into world-affecting messages (M_world), reasoning-stream messages (M_think), and control messages (M_ctrl):

\[
\begin{aligned}
M &= M_{\mathrm{world}} \cup M_{\mathrm{think}} \cup M_{\mathrm{ctrl}} \\
M_{\mathrm{world}} &= \{\texttt{AGENT\_SPAWN},\ \texttt{AGENT\_DESPAWN},\ \texttt{WORLD\_BUILD},\ \texttt{POSITION\_UPDATE},\ \texttt{SCREEN\_UPDATE}\} \\
M_{\mathrm{think}} &= \{\texttt{AGENT\_THINK},\ \texttt{AGENT\_SAY}\} \\
M_{\mathrm{ctrl}} &= \{\texttt{TOOL\_REQUEST},\ \texttt{TOOL\_RESULT},\ \texttt{APPROVAL\_REQUEST},\ \texttt{APPROVAL\_RESPONSE},\ \texttt{HQ\_PLACE}\}
\end{aligned}
\]

Each message m = (type, agent_id, payload, timestamp) is serialized as JSON and transmitted over a persistent WebSocket connection. Position updates carrying (x, y, z, yaw, pitch) are emitted at a fixed rate of 20 Hz (one update every 50 ms) and delivered with < 50 ms latency. Reasoning streams are batched and forwarded as AGENT_THINK messages for in-world screen rendering at < 200 ms token-to-screen latency.
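
As an illustration of the message envelope, the sketch below encodes and validates a message in Python. The JSON key names simply mirror the tuple fields and are an assumption, as are the `encode` and `decode` helper names and the example payload values; the report does not specify the wire-level schema.

```python
import json
import time
from typing import Any, Dict, Optional

# The three message classes from the formal model.
M_WORLD = {"AGENT_SPAWN", "AGENT_DESPAWN", "WORLD_BUILD", "POSITION_UPDATE", "SCREEN_UPDATE"}
M_THINK = {"AGENT_THINK", "AGENT_SAY"}
M_CTRL = {"TOOL_REQUEST", "TOOL_RESULT", "APPROVAL_REQUEST", "APPROVAL_RESPONSE", "HQ_PLACE"}
M = M_WORLD | M_THINK | M_CTRL


def encode(msg_type: str, agent_id: int, payload: Dict[str, Any],
           timestamp: Optional[float] = None) -> str:
    """Serialize m = (type, agent_id, payload, timestamp) as a JSON string."""
    if msg_type not in M:
        raise ValueError(f"unknown message type: {msg_type}")
    ts = time.time() if timestamp is None else timestamp
    return json.dumps(
        {"type": msg_type, "agent_id": agent_id, "payload": payload, "timestamp": ts}
    )


def decode(raw: str) -> Dict[str, Any]:
    """Parse a JSON message and reject types outside M."""
    msg = json.loads(raw)
    if msg.get("type") not in M:
        raise ValueError(f"unknown message type: {msg.get('type')}")
    return msg


# Hypothetical position update for agent 1.
frame = encode(
    "POSITION_UPDATE",
    1,
    {"x": 10.5, "y": 64.0, "z": -3.25, "yaw": 90.0, "pitch": 0.0},
    timestamp=0.0,
)
```

Rejecting unknown types on both ends keeps the three sets a true partition of the traffic: every frame on the bridge belongs to exactly one class.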

The approval gate is modeled as a guarded transition over the set S ⊆ T of sensitive tools. For any tool invocation t by agent a_i:

\[
\mathrm{approve}(t, \mathrm{id}_i) \to \{\mathrm{granted},\ \mathrm{denied}\}, \qquad \text{default} = \mathrm{denied}
\]

\[
\mathrm{exec}(t, a_i) =
\begin{cases}
\mathrm{dispatch}(t) & \text{if } t \notin S \\
\mathrm{dispatch}(t) & \text{if } t \in S \land \mathrm{approve}(t, \mathrm{id}_i) = \mathrm{granted} \\
\mathrm{abort}(t) \land \mathrm{notify}(a_i) & \text{if } t \in S \land \mathrm{approve}(t, \mathrm{id}_i) = \mathrm{denied}
\end{cases}
\]
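
The three cases translate directly into a guard around tool dispatch. The Python sketch below is illustrative only: the `decisions` dictionary standing in for the in-world approval UI, and the `dispatch` and `notify` callbacks, are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# S: the sensitive tools named in Section 2.4.
SENSITIVE = frozenset({"gmail_send", "meta_ads_pause", "meta_ads_update_budget"})

# Hypothetical store of user decisions keyed by (tool, agent_id).
Decisions = Dict[Tuple[str, int], str]


def approve(decisions: Decisions, tool: str, agent_id: int) -> str:
    """Return 'granted' only on an explicit grant; every other case is 'denied'."""
    return "granted" if decisions.get((tool, agent_id)) == "granted" else "denied"


def exec_tool(
    tool: str,
    agent_id: int,
    decisions: Decisions,
    dispatch: Callable[[str], str],
    notify: Callable[[int, str], None],
) -> Optional[str]:
    """Apply the three cases of exec(t, a_i)."""
    if tool not in SENSITIVE:
        return dispatch(tool)  # t not in S: run without approval
    if approve(decisions, tool, agent_id) == "granted":
        return dispatch(tool)  # t in S and granted
    notify(agent_id, tool)     # t in S and denied: abort and notify the agent
    return None
```

Because `approve` treats a missing entry the same as an explicit denial, a lost or malformed approval response can never cause a sensitive tool to run, which is the point of the default-denied rule.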

The overall system latency decomposes as L_total = L_ws + L_gemini + L_screen, where L_ws < 10 ms (WebSocket framing), L_gemini ≈ 600–1800 ms (LLM inference), and L_screen < 50 ms (Minecraft block update). When a sensitive tool is invoked, the approval gate adds L_approval < 300 ms for in-world UI rendering, which is small relative to LLM inference time.
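
Plugging the stated bounds into the decomposition gives a quick sanity check on the latency budget. The short Python sketch below only sums the figures quoted in this section; the function name `total_bounds` and the choice to treat the upper-bounded terms as contributing zero at the low end are assumptions made for the illustration.

```python
# Component bounds in milliseconds, taken from the decomposition in this section.
L_WS_MAX = 10            # WebSocket framing, upper bound
L_GEMINI = (600, 1800)   # LLM inference, approximate range
L_SCREEN_MAX = 50        # Minecraft block update, upper bound
L_APPROVAL_MAX = 300     # approval gate UI, upper bound


def total_bounds(with_approval: bool = False) -> tuple:
    """Return (low, high) bounds on L_total in milliseconds."""
    # Terms that are only upper-bounded are counted as 0 in the low estimate.
    low = L_GEMINI[0]
    high = L_WS_MAX + L_GEMINI[1] + L_SCREEN_MAX
    if with_approval:
        high += L_APPROVAL_MAX
    return low, high


plain = total_bounds()                     # (600, 1860)
gated = total_bounds(with_approval=True)   # (600, 2160)
```

The reported median end-to-end latency of 1.3 s falls inside the 600 to 1860 ms window, and inference dominates every other term, so a faster model is the main lever for reducing perceived latency.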

Figure 1. Omo Space architecture. Three process layers (Minecraft, Runtime, Brain) communicate over WebSocket :8765 and HTTP :8000. The Chief of Staff delegates to specialists via ADK's transfer_to_agent. MCP tools gate sensitive actions behind in-game approval. Live reasoning streams render on in-world screens.

2.2 Three-Process Design

Paper Server + Plugin (Java 21)

Minecraft Layer

The only code with direct access to the Minecraft world state. Handles block placement via the world_build protocol, spawns villager agents at anchor points, drives in-world screens with live agent reasoning data, and manages the HQ island. Players connect with a vanilla Minecraft 1.21.4 client; no client-side modifications are required.

Paper 1.21.4 · Java 21 · World Builder

Node.js Runtime (TypeScript)

Orchestration Layer

Single WebSocket server on :8765 that multiplexes all traffic between the Minecraft plugin and the ADK crew. Hosts the AgentManager (room-to-tool mapping), the AdkAgent (request relay to Python), the mcpServer (tool registration and gating), the dashboardServer (live metrics), and the Gemini-powered world-architect that designs unique rooms per request.

WebSocket :8765 · Agent Loop · World Architect

ADK Crew (Python, Gemini)

Brain Layer

Chief of Staff plus two specialists (Growth, Comms), all running on Gemini Flash via Google's Agent Development Kit. The Chief of Staff receives natural-language requests and delegates via transfer_to_agent. Sensitive tool invocations always pause for in-game approval before execution. The ADK provides the organizational hierarchy that transforms individual agents into a coordinated workforce.

Gemini Flash · ADK · MCP Tools

2.3 Design Rationale

The three-process architecture is motivated by two principles: separation of concerns and independent scalability. The Minecraft layer is the only code with world-write access, which prevents LLM-generated content from writing to world state directly and thereby guards against a class of failure we term world hallucination. The Runtime layer multiplexes all communication through a single WebSocket, providing a centralized point for logging, monitoring, and debugging. The Brain layer can be swapped (replace Gemini with Claude or GPT-4o, replace ADK with CrewAI or a custom orchestration framework) without touching the Minecraft or Runtime layers. We verified this modularity by substituting DeepSeek for Gemini Flash as a drop-in replacement with no code changes to the remaining layers.

The choice of WebSocket over HTTP/2 or gRPC is deliberate: WebSocket provides persistent bidirectional connections with only a few bytes of framing overhead per message (2–14 bytes under RFC 6455, and 4–8 bytes for payloads of the size used here), making it suitable for the continuous position update stream (20 Hz, ~160 bytes per frame) and the reasoning text streaming required by the spatial interface. HTTP :8000 is retained for the ADK service because ADK's FastAPI-based API server speaks HTTP natively and the request-response pattern (one user command → one agent response) maps naturally to HTTP semantics for the delegation path.
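
The cost of the position stream can be estimated from the frame layout in RFC 6455: a header is 2 bytes, plus 2 bytes of extended length for payloads of 126 to 65,535 bytes (or 8 bytes beyond that), plus a 4-byte masking key on client-to-server frames. The Python sketch below applies that rule to the 20 Hz, roughly 160-byte stream described here. The helper names are hypothetical, and the calculation ignores TCP/IP overhead.

```python
def ws_header_size(payload_len: int, masked: bool) -> int:
    """WebSocket frame header size in bytes for a payload length (RFC 6455, section 5.2)."""
    size = 2                # FIN/opcode byte plus mask-bit/7-bit-length byte
    if payload_len > 0xFFFF:
        size += 8           # 64-bit extended payload length
    elif payload_len >= 126:
        size += 2           # 16-bit extended payload length
    if masked:
        size += 4           # masking key, required on client-to-server frames
    return size


def stream_bytes_per_second(rate_hz: int, payload_len: int, masked: bool) -> int:
    """Bytes per second for a fixed-rate stream, including WebSocket headers."""
    return rate_hz * (payload_len + ws_header_size(payload_len, masked))


RATE_HZ = 20   # position update rate from the report
PAYLOAD = 160  # approximate bytes per position frame from the report

unmasked = stream_bytes_per_second(RATE_HZ, PAYLOAD, masked=False)  # 20 * (160 + 4)
masked = stream_bytes_per_second(RATE_HZ, PAYLOAD, masked=True)     # 20 * (160 + 8)
```

Either way the stream stays a little above 3 KB/s per tracked entity, with framing contributing under 5% of the bytes, which supports the choice of a persistent socket over per-update HTTP requests.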

We selected Gemini Flash over Gemini Pro for the agent reasoning layer because the latency requirements of real-time agent interaction (median 1.3 s end-to-end) favor the smaller, faster model. Gemini Flash produces sub-second responses for delegation decisions with sufficient accuracy for the current 3-agent crew. The world-architect, which performs more complex spatial reasoning over 20 × 20 × 10 block bounding volumes, benefits from Gemini Pro's stronger spatial reasoning for the design phase and uses Gemini Flash for the block-by-block construction stream to maintain < 8 s total construction time.

Figure 2. Component interaction diagram showing the agent lifecycle state machine and inter-component data flows. The Agent Manager manages per-agent state machines (idle → thinking → acting → waiting_approval → done), routing requests through the WebSocket bridge to the ADK and MCP layers. The WAITING_APPROVAL state (dashed) represents the user-gated transition that blocks sensitive tool execution until explicit approval.
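
The lifecycle in the caption can be sketched as a small transition table. The following Python example is a hedged illustration, not the Agent Manager's code: the caption gives only the linear chain, so the direct `acting → done` edge (for tools that need no approval) and the absence of any edge out of `done` are assumptions.

```python
from typing import Dict, Iterable, Set

# Allowed transitions. The chain idle -> thinking -> acting -> waiting_approval -> done
# comes from the caption; acting -> done is an assumed shortcut for non-sensitive tools.
TRANSITIONS: Dict[str, Set[str]] = {
    "idle": {"thinking"},
    "thinking": {"acting"},
    "acting": {"waiting_approval", "done"},
    "waiting_approval": {"done"},
    "done": set(),
}


def step(state: str, nxt: str) -> str:
    """Move to nxt if the transition is allowed, otherwise raise ValueError."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {nxt}")
    return nxt


def run(path: Iterable[str], start: str = "idle") -> str:
    """Apply a sequence of transitions and return the final state."""
    state = start
    for nxt in path:
        state = step(state, nxt)
    return state
```

Encoding the table explicitly means an agent cannot reach `done` from `waiting_approval` by any route other than the gated transition, and cannot skip `thinking` or `acting` entirely.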

2.4 Room Gating and Approval

Room Gating

Tool access is determined by the name prefix of the room an agent occupies: mail-* grants Gmail tools, ads-* grants Meta Ads tools, and workshop/code rooms receive coding tools. This spatial permission model maps physical location to operational capability and eliminates the need for a separate access control configuration: the world layout encodes the permission structure. Formally, tools_i = gate(room_i) as defined in Section 2.1.

mail-* → Gmail
ads-* → Meta Ads
code-* → IDE Tools

Approval Gate

Sensitive tools (gmail_send, meta_ads_pause, meta_ads_update_budget) always pause for in-game approval. The approval request appears as an in-world gate that the user resolves with /omo approve or /omo deny. The default state is denied: no sensitive action executes without explicit user authorization. Approval adds < 300 ms latency to the tool execution path, and the gate never blocks non-sensitive (read-only) operations.

/omo approve