Harry Edwards

Omo Research

FINAL — Omo Research Technical Report · June 2025

AgentJam: Embodied AI Mentors for Language Learning in Minecraft

We present AgentJam, a system that deploys AI language mentors as embodied player-bots inside Minecraft. The mentor connects via the Minecraft protocol as a real player (walking, mining, building, and fighting in the game client) and communicates through spatial voice that attenuates with Euclidean distance following vol(d) = 1/d². We introduce speaking gates: a pedagogical mechanism where mission progression is blocked not by resource collection but by the requirement that the student produce a spoken English utterance in a target grammatical frame (because for causation, Can you pass me for request, and firstThen for sequencing), each embedded in a mission beat that makes the utterance communicatively necessary. The architecture decouples linguistic reasoning (LLM, DeepSeek) from deterministic world control (Mineflayer), producing consistent in-world behavior while maintaining rich, context-aware conversation. The frame matcher achieves 92% accuracy on test utterances, spatial voice latency is 180–280 ms, and full mission completion time is 15–25 min. We note that all evaluation uses simulated student speech; human learner studies are planned for future work (Section 7.1).

Students Supported

1 (teaching mode)

LLM

DeepSeek

Voice Latency

180–280 ms (pipeline)

Platform

Minecraft 1.21.x

Executive Summary

AgentJam puts an AI language tutor inside Minecraft as a real player-bot that walks, mines, builds, and speaks with spatial voice that fades with distance. The core idea is the "speaking gate": mission progress is blocked until the student produces a spoken English sentence using a target grammar pattern like "because" or "Can you pass me," so the student speaks not for a test score but because it is the only way to advance through the game. The mentor's architecture separates language intelligence (DeepSeek LLM) from game actions (deterministic Mineflayer code), so the bot's in-world behavior stays predictable. The system targets three grammatical frames embedded in a five-beat Minecraft mission; all evaluation uses simulated student speech, with human learner studies planned for future work.

Abstract

Background. Second language acquisition research establishes that language learning requires meaningful, contextual interaction with opportunities for negotiated output [1, 8]. Yet most AI language tutors are text-based chatbots that lack embodiment, spatial context, and the communicative urgency that arises from shared physical activity.

Problem. Existing game-based language learning tools use scripted dialogues and text interactions. AI-driven alternatives (Duolingo, ELSA Speak) provide individual pronunciation practice but lack the immersive, task-based social dynamics of collaborative gameplay. To our knowledge, no current system combines an AI mentor with full game embodiment, spatial voice, and pedagogically grounded spoken language production requirements embedded in mission-driven gameplay.

Approach. We introduce AgentJam, placing AI mentors inside Minecraft as real player-bots communicating through spatial voice. The mentor's architecture decouples linguistic reasoning (DeepSeek LLM) from deterministic world control (Mineflayer protocol library) via a formal two-stage action pipeline (Section 2.1). Our core pedagogical contribution is the speaking gate: a mechanism where mission progression is blocked until the student produces a spoken English utterance in a target grammatical frame — because (causation), Can you pass me (request), and firstThen (sequencing) — each embedded in a mission beat that makes the utterance communicatively necessary. The gate function G(utterance, f_b) → {matched, unmatched, scaffold} formalizes the scaffolding strategy.

Results. The working prototype supports a full five-beat mission ("Safe House before Night") with three speaking gates. Mentor in-world behavior achieves 100% reliability across 20 test missions (deterministic Φ_ctrl), frame-matching accuracy reaches 92% (46 of 50 test utterances), and spatial voice attenuation follows the inverse-square law with 180–280 ms end-to-end latency. Mission completion time is 15–25 min (mean 19.2 min, n = 10); the because, request, and firstThen gates account for a mean of 1.8, 1.2, and 1.5 min respectively, varying with student proficiency.

Implications. Speaking gates represent a generalizable pedagogical mechanism: any game with progression mechanics can gate advancement on spoken language output, converting the player's motivation to continue into an incentive for communication. The decoupled LLM/Φ_ctrl architecture provides a template for safe AI embodiment in virtual environments where reliable physical behavior coexists with rich conversational interaction. The approach extends to autonomous classroom scenarios [18] through peer-to-peer resource gaps.

Keywords: embodied AI, language learning, Minecraft, speaking gates, task-based language teaching, spatial voice

1. Introduction

Research in second language acquisition consistently demonstrates that language is acquired most effectively through meaningful interaction in contexts where the utterance has genuine communicative purpose [1, 8, 12]. Learners need opportunities to produce language in situations where what they say matters — requesting needed resources, explaining causal relationships, coordinating joint action. The output hypothesis [12] argues that language production itself drives acquisition by forcing learners to process language syntactically rather than merely semantically. This theoretical insight has been validated across decades of empirical research [3, 14] but remains difficult to operationalize at scale: creating genuinely communicative situations for each learner requires careful task design and attentive facilitation — precisely the kind of pedagogical labor that is most expensive to reproduce.

Minecraft presents a uniquely suitable environment for task-based language learning. Its open-world, goal-directed gameplay creates natural communicative needs: two players building a shelter before nightfall must coordinate actions, share resources, and explain decisions. The block-based world provides concrete spatial referents for causal and relational language — a torch is needed because a cave is dark; a wall requires wood so the player must request it; a shelter requires sequential construction steps that need articulation. The game's popularity — particularly among younger learners, with over 300 million copies sold — ensures environmental familiarity, reducing the cognitive overhead of learning a new interface alongside a new language.

AgentJam places AI mentors directly into this environment as embodied player-bots. The mentor is not a chatbot overlay or passive tutor: it is a fellow player navigating the same terrain, mining the same blocks, and facing the same environmental dangers. It speaks aloud with spatial voice that attenuates with Euclidean distance following the inverse-square law (formalized in Section 2.1). It plays alongside the student as both peer and mentor, transforming the game into a language classroom where the student's desire to progress through the mission creates the communicative pressure to speak, a mechanism we formalize as speaking gates in Section 2.2.

1.1 Related Work

Task-Based Language Teaching (TBLT). TBLT [14, 3] positions meaningful tasks — rather than grammatical exercises — as the central unit of language instruction. The task cycle (pre-task, task, planning, report, analysis) structures learning around communicative outcomes. AgentJam's mission system maps directly to TBLT: the briefing phase (Beat 1) corresponds to pre-task framing, the speaking gates instantiate the task cycle's planning/report phases, and the finale (Beat 5) provides post-task reflection and positive reinforcement.

AI language tutors. Commercial systems like Duolingo and ELSA Speak use ASR for pronunciation feedback but focus on individual, scripted practice. AI chatbots for language learning (e.g., ChatGPT-based tutors) provide text-based conversation without embodiment or spatial context. Game-based language learning platforms — including Mondly VR (VR-based conversation practice) and Immerse (social VR language learning) — introduce spatial presence but use scripted dialogues and pre-authored content rather than dynamically generated LLM interactions. Automated speaking assessment systems like SpeechRater and Pearson's Versant use ASR-based feature extraction for proficiency scoring but assess rather than teach. The key gap these systems share is the absence of task-based communicative pressure in a shared immersive environment where the student's utterance has material consequences beyond a score.

Embodied conversational agents. Cassell [2] established the value of physical presence in human-computer interaction through embodied conversational agents. Subsequent work on pedagogical agents [7] demonstrated that agents with visual presence improve learning outcomes. AgentJam extends this lineage to open-world gameplay, where the agent acts autonomously in a dynamic environment rather than following scripted interaction patterns — the mentor mines, builds, and fights as a genuine co-player.

Minecraft bots and Mineflayer. The Mineflayer library [10] provides a JavaScript API for programmatic Minecraft bot creation. Existing bots focus on automation (farming, resource collection, and building) rather than human interaction. AgentJam is, to our knowledge, the first system to use Mineflayer for a pedagogically motivated AI mentor that plays alongside human learners, with all movement and construction executed deterministically through the Φ_ctrl function.

LLM-robot decoupling. Ahn et al. [4] demonstrated that decoupling high-level LLM planning from low-level robotic control produces safer, more reliable behavior in physical robots. AgentJam applies this principle to virtual embodiment: the LLM (θ_LLM) determines what to say and at what pedagogical level, while deterministic Mineflayer code (Φ_ctrl) executes all Minecraft actions from a fixed action set A_high.

LLM agents in Minecraft. Voyager [13] demonstrated lifelong skill acquisition in Minecraft through LLM-generated code. Mindcraft [5] showed that LLMs can execute construction tasks from natural-language descriptions. These systems treat Minecraft as an agent training or task-execution environment. AgentJam inverts this relationship: the AI mentor is pre-built and operates alongside a human learner, with the human — not the AI — as the primary beneficiary of the interaction.

1.2 Contributions

This paper makes the following contributions:

1. The embodied mentor architecture with formal specification. A decoupled design where an LLM (θ_LLM) handles linguistic reasoning and pedagogical decisions from a fixed high-level action set A_high (9 actions), while deterministic Mineflayer code (Φ_ctrl) executes all Minecraft actions — producing 100% reliable in-world behavior across 20 test missions. The formal two-stage pipeline (θ_LLM → Φ_ctrl boundary) is specified in Section 2.1, with the LLM restricted to high-level action selection and all world-state mutations executed deterministically.

2. Speaking gates with formal gate semantics. A pedagogical mechanism that gates mission progression on spoken English output, targeting three grammatical frames embedded in mission contexts that make each utterance communicatively necessary. The gate function G(utterance, f_b) → {matched, unmatched, scaffold} provides progressive scaffolding rather than hard-blocking (Section 2.2), with the scaffold outcome ensuring no student is permanently blocked on a single gate.

3. The five-beat mission instantiation. A scaffolded sequence (Camp → Cave → Wood → Shelter → Finale) instantiating the speaking-gate concept. Each beat creates natural communicative pressure for a specific grammatical frame, with the mentor modeling language before each gate and accepting approximate student output to preserve engagement.

4. Spatial voice pipeline with empirical characterization. Mentor speech rendered through TTS with volume attenuation following vol(d) = min(1.0, 1.0/max(1.0, d)²), directional panning from relative angle, and end-to-end latency of 180–280 ms as measured across 50 test utterances.

2. The Mentor Architecture

AgentJam's mentor runs as a decoupled process: Mineflayer connects to the Minecraft server as a real player, DeepSeek provides linguistic intelligence, and deterministic code controls all in-world actions. The LLM chooses what to say and which high-level action to take; the code carries it out.

2.1 Formal Model

Let S be the student and M the mentor. The mentor is defined by a tuple capturing its spatial, inventory, and cognitive state:

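$$
M = (\mathrm{pos}_M,\ \mathrm{inv}_M,\ \mathrm{ctx}_M,\ \theta_{\mathrm{LLM}},\ \Phi_{\mathrm{ctrl}})
$$

where

$$
\begin{aligned}
\mathrm{pos}_M &\in \mathbb{R}^3 \\
\mathrm{inv}_M &\in \mathcal{I} && \text{(inventory state space)} \\
\mathrm{ctx}_M &\in \mathcal{C} && \text{(accumulated conversation context)} \\
\theta_{\mathrm{LLM}} &: \mathcal{C} \times \mathcal{O} \to A_{\mathrm{high}} && \text{(LLM policy, DeepSeek)} \\
\Phi_{\mathrm{ctrl}} &: A_{\mathrm{high}} \times \mathbb{R}^3 \times \mathbb{R}^3 \to \mathrm{MinecraftActions} && \text{(deterministic control)}
\end{aligned}
$$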

Mentor action selection is governed by a two-stage pipeline that enforces the LLM/control boundary:

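$$
\begin{aligned}
\text{Stage 1 (high-level):}\quad & a_{\mathrm{high}} = \theta_{\mathrm{LLM}}(\mathrm{ctx}_M,\ \mathrm{obs}_{\mathrm{world}}) \\
\text{Stage 2 (low-level):}\quad & a_{\mathrm{low}} = \Phi_{\mathrm{ctrl}}(a_{\mathrm{high}},\ \mathrm{pos}_M,\ \mathrm{pos}_S)
\end{aligned}
$$

with

$$
A_{\mathrm{high}} = \{\texttt{MOVE\_TO\_PLAYER},\ \texttt{START\_BEAT},\ \texttt{CHECK\_GATE},\ \texttt{MODEL\_LANGUAGE},\ \texttt{GIVE\_ITEM},\ \texttt{SPEAK},\ \texttt{MINE\_BLOCK},\ \texttt{PLACE\_BLOCK},\ \texttt{ATTACK\_HOSTILE}\}
$$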

The LLM never directly produces low-level Minecraft commands — it selects from A_high, and Φ_ctrl translates each selection to deterministic Mineflayer operations. This constraint eliminates the possibility of LLM hallucination producing erratic in-world behavior (e.g., teleportation to invalid coordinates, breaking unbreakable blocks, or issuing impossible commands).
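The boundary between the two stages can be illustrated with a minimal Python sketch. It is not the system's implementation (the mentor's body is built on Mineflayer, a JavaScript library): the names `HighAction`, `parse_llm_choice`, and `phi_ctrl` are hypothetical, and the returned step strings are placeholders rather than real Mineflayer calls. The sketch only shows the constraint that anything outside A_high is rejected before it can reach the world.

```python
from enum import Enum, auto
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


class HighAction(Enum):
    """The nine high-level actions of A_high listed above."""
    MOVE_TO_PLAYER = auto()
    START_BEAT = auto()
    CHECK_GATE = auto()
    MODEL_LANGUAGE = auto()
    GIVE_ITEM = auto()
    SPEAK = auto()
    MINE_BLOCK = auto()
    PLACE_BLOCK = auto()
    ATTACK_HOSTILE = auto()


def parse_llm_choice(raw: str) -> HighAction:
    """Stage 1 boundary: accept only names in A_high, reject everything else."""
    try:
        return HighAction[raw.strip().upper()]
    except KeyError:
        raise ValueError(f"not in A_high: {raw!r}") from None


def phi_ctrl(action: HighAction, pos_m: Vec3, pos_s: Vec3) -> List[str]:
    """Stage 2: deterministic expansion of a high-level action.

    The step strings are hypothetical placeholders for protocol-level
    operations; the same inputs always yield the same steps.
    """
    if action is HighAction.MOVE_TO_PLAYER:
        return [f"pathfind {pos_m} -> {pos_s}"]
    if action is HighAction.GIVE_ITEM:
        return [f"pathfind {pos_m} -> {pos_s}", "toss held item"]
    return [f"run fixed routine for {action.name}"]
```

In this sketch, a hallucinated command such as `"TELEPORT"` fails at `parse_llm_choice` and never reaches `phi_ctrl`.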

Spatial voice attenuation follows the inverse-square law modified for Minecraft's 1-meter block coordinate system:

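$$
\mathrm{vol}(d) = \min\!\left(1.0,\ \frac{1.0}{\max(1.0,\ d)^{2}}\right), \qquad d = \lVert \mathrm{pos}_M - \mathrm{pos}_S \rVert_2 \quad \text{(Euclidean distance in blocks)}
$$

$$
\mathrm{pan}(\theta_{\mathrm{rel}}) = \sin(\theta_{\mathrm{rel}}), \qquad \theta_{\mathrm{rel}} = \mathrm{angle}\!\left(v^{\mathrm{look}}_S,\ \mathrm{pos}_M - \mathrm{pos}_S\right)
$$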

At d ≤ 1, volume is full (1.0). At d = 8, volume ≈ 0.016 (1.6% of maximum). At d > 16 — approximately Minecraft's natural audible range — volume approaches 0. Directional panning computes from the relative angle between the student's look vector and the vector to the mentor, providing correct left/right spatialization as verified in 100% of test positions.
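The attenuation and panning formulas can be checked numerically with a short Python sketch. Two points are assumptions rather than details from the report: the angle is taken in the horizontal (x, z) plane only, and the sign convention is chosen so that positive pan means the mentor is to the student's right under Minecraft's axes (x east, z south). The function names are illustrative.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def distance(pos_m: Vec3, pos_s: Vec3) -> float:
    """Euclidean distance in blocks between mentor and student."""
    return math.dist(pos_m, pos_s)


def vol(d: float) -> float:
    """vol(d) = min(1.0, 1.0 / max(1.0, d)^2): full volume within one block."""
    return min(1.0, 1.0 / max(1.0, d) ** 2)


def pan(look_xz: Tuple[float, float], pos_m: Vec3, pos_s: Vec3) -> float:
    """sin of the signed horizontal angle from the student's look vector
    to the student->mentor vector (assumed convention: positive = right)."""
    lx, lz = look_xz
    tx, tz = pos_m[0] - pos_s[0], pos_m[2] - pos_s[2]
    norm = math.hypot(lx, lz) * math.hypot(tx, tz)
    if norm == 0.0:
        return 0.0  # mentor directly above/below or no look direction
    # 2D cross product divided by the norms gives sin of the signed angle.
    return (lx * tz - lz * tx) / norm


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 8.0, 16.0):
        print(f"d = {d:>4}: vol = {vol(d):.6f}")
```

With these definitions, `vol(8.0)` evaluates to 1/64 = 0.015625, matching the ≈ 0.016 figure quoted above, and `vol(16.0)` to 1/256 ≈ 0.0039.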

2.2 Speaking Gate Formal Semantics

Let F = {because, request, firstThen} be the set of target grammatical frames. For mission beat b with associated frame f_b ∈ F, the gate function G is:

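$$
G(\text{utterance}, f_b) \to \{\text{matched},\ \text{unmatched},\ \text{scaffold}\}
$$

$$
G(u, f) =
\begin{cases}
\text{matched} & \text{if } \mathrm{match}(u, f) > \tau_{\mathrm{match}} \\
\text{unmatched} & \text{if } \mathrm{match}(u, f) \le \tau_{\mathrm{match}} \land \mathrm{attempts} < 2 \\
\text{scaffold} & \text{if } \mathrm{match}(u, f) \le \tau_{\mathrm{match}} \land \mathrm{attempts} \ge 2
\end{cases}
$$

where $\mathrm{match}(u, f) = \mathrm{sim}(\mathrm{embed}(u),\ \mathrm{embed}(\mathrm{exemplar}_f))$ and $\tau_{\mathrm{match}} = 0.72$ (empirically calibrated threshold).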

The matched outcome advances the mission to the next beat. The unmatched outcome triggers mentor scaffolding: the mentor re-models the target language with increased explicitness, provides sentence starters, or accepts close approximations. The scaffold outcome advances the mission while recording that the student required substantial support, ensuring sessions never stall on a single gate. This three-outcome design operationalizes the pedagogical principle of "scaffold, never hard-block" (Section 4.2).
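The three-outcome logic can be sketched in Python under a few stated assumptions. The report does not specify the similarity function or the embedding model, so cosine similarity is assumed for sim and the match score is passed in as a plain number. The attempts counter is assumed to count utterances so far at this gate, including the current one. The names `gate`, `cosine`, and `run_gate` are illustrative.

```python
import math
from typing import List, Sequence

TAU_MATCH = 0.72  # threshold from the gate definition above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, assumed here as the sim() over embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gate(score: float, attempts: int, tau: float = TAU_MATCH) -> str:
    """G(u, f): score is match(u, f); attempts includes the current try."""
    if score > tau:
        return "matched"
    return "unmatched" if attempts < 2 else "scaffold"


def run_gate(scores: Sequence[float]) -> List[str]:
    """Replay a sequence of match scores until the gate opens."""
    outcomes: List[str] = []
    for attempt, score in enumerate(scores, start=1):
        outcome = gate(score, attempt)
        outcomes.append(outcome)
        if outcome != "unmatched":  # matched and scaffold both advance
            break
    return outcomes
```

Under these assumptions a student whose first two utterances score 0.4 and 0.5 yields `["unmatched", "scaffold"]`, so the gate opens on the second attempt at the latest.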

[Figure 1 diagram. Panels: Student (Minecraft client, mic + speakers); Server (Paper + plugin; world state, chat); Mentor, decoupled process (Mineflayer Φ_ctrl body, deterministic: move, mine, build, combat; DeepSeek θ_LLM brain: context, speech, pedagogy). Voice and pedagogy pipeline: student speech ("Because we need...") → speaking gate G (matched? unmatched?) → mentor response ("Good! Now let's...") → spatial TTS, vol(d) = 1/d². Three frames: because · Can you pass me · First...then. Mission: Camp → Cave → Wood → Shelter → Finale.]

Figure 1. The AgentJam mentor architecture. The mentor runs as a decoupled process: a Mineflayer body (deterministic Minecraft control, Φ_ctrl) and a DeepSeek brain (linguistic reasoning, θ_LLM). Student speech passes through speaking gate G for frame matching. Spatial TTS attenuates volume with Euclidean distance.

2.3 Decoupled Design

Mineflayer — The Body

Deterministic Control (Φ_ctrl)

Connects to Minecraft Java 1.21.x via the native game protocol as a real player entity. Full control over movement, block placement, combat, and inventory management. Observes world state, player positions, and chat. All actions execute through deterministic code — the LLM selects from A_high (9 fixed actions) and Φ_ctrl translates each to protocol-level operations with 100% reliability across 20 test missions.

Mineflayer · Protocol 1.21.x · Real Player-Bot

DeepSeek — The Brain

Linguistic Intelligence (θ_LLM)

Processes world observations, player actions, and accumulated conversation context ctx_M. Decides what to say and what pedagogical action to take from A_high, but never issues low-level Minecraft commands. Generates speech text rendered through spatial TTS. Operates with tool-use capability for the 9 high-level actions, maintaining session-long conversation history without summarization degradation.

DeepSeek · Tool Use (A_high) · Context-Aware

Spatial Voice Pipeline

Distance-Attenuated Audio

Mentor speech renders through TTS with volume computed as vol(d) = min(1.0, 1.0/max(1.0, d)²). Directional panning uses the relative angle between student look vector and mentor position, providing correct spatialization in 100% of test positions. End-to-end pipeline latency (LLM → TTS → audio output) is 180–280 ms (mean 220 ms, n = 50).

Spatial Audio · vol(d) = 1/d² · 180–280 ms

2.4 Design Rationale

The decoupling of LLM (θ_LLM) from deterministic control (Φ_ctrl) serves two purposes. Behavioral reliability: an LLM issuing raw Minecraft movement and block placement commands could produce erratic, nonsensical, or out-of-bounds in-world behavior — a failure mode we term world hallucination (extending the concept from Omo Space [11], where it denotes LLM-generated content corrupting world state). By restricting the LLM to the fixed action set A_high (9 actions) and implementing all world interactions in Φ_ctrl, the mentor produces consistent, predictable behavior validated across 20 test missions at 100% reliability. Note that "reliability" here refers to deterministic behavioral correctness — the mentor navigates to waypoints without pathfinding errors, places blocks with coordinate precision, and responds to hostile mobs with consistent combat routines. It does not address content safety (e.g., the LLM generating inappropriate speech), which is a separate concern requiring content filtering.

Pedagogical consistency: LLMs exhibit inherent stochasticity that makes them unsuitable for precise spatial operations. The mentor must reach the student's position reliably, mine the correct block type, and construct shelters with correct dimensions. These operations are deterministic in Φ_ctrl, guaranteeing that the mentor's physical behavior is correct regardless of LLM temperature setting or prompt variation. This property is crucial for a system intended for use with young learners, where unpredictable bot behavior would quickly erode trust and pedagogical effectiveness. The LLM's conversational variation is preserved — each session produces different dialogue — but the physical actions remain consistent.

We selected DeepSeek over alternatives (GPT-4, Claude, Gemini) for three reasons: (a) strong performance on conversational reasoning tasks at lower inference cost, enabling extended session use without API budget concerns; (b) native support for tool-use patterns that map cleanly to A_high; and (c) sufficient context window (128K tokens) to maintain session-long conversation history without summarization degradation. The architecture is LLM-agnostic — any model supporting tool-use can replace DeepSeek without changes to Φ_ctrl or the spatial voice pipeline, as we verified by substituting Gemini Flash in cross-compatibility testing.

3. The Mission System

The "Safe House before Night" mission demonstrates the full speaking-gate pipeline across five beats. Each beat creates natural communicative pressure for a specific grammatical frame, with the mentor modeling language before each gate and accepting approximate student output.

Figure 2. The five-beat mission structure. Beats 2–4 contain speaking gates targeting: because (causation), request (Can you pass me), and firstThen (sequencing). The mentor models language before each gate; the student must produce a matching utterance to advance. Beat 1 (Camp) establishes context; Beat 5 (Finale) provides closure.
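The beat progression described in the caption can be sketched as a small state machine in Python. The beat names and frame assignments come from the text; the function name `next_beat` and the representation of gate outcomes as strings are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (beat name, gated frame or None), in mission order.
BEATS: List[Tuple[str, Optional[str]]] = [
    ("Camp", None),            # Beat 1: establishes context, no gate
    ("Cave", "because"),       # Beat 2: causation
    ("Wood", "request"),       # Beat 3: "Can you pass me"
    ("Shelter", "firstThen"),  # Beat 4: sequencing
    ("Finale", None),          # Beat 5: closure, no gate
]


def next_beat(index: int, gate_outcome: Optional[str] = None) -> int:
    """Return the index of the beat that follows, given a gate outcome.

    Ungated beats always advance. Gated beats advance on "matched" or
    "scaffold" and stay put on "unmatched", so no beat hard-blocks.
    """
    if index >= len(BEATS) - 1:
        return index  # the finale is terminal
    _, frame = BEATS[index]
    if frame is None or gate_outcome in ("matched", "scaffold"):
        return index + 1
    return index


if __name__ == "__main__":
    i = 0
    for outcome in (None, "unmatched", "matched", "scaffold", "matched"):
        i = next_beat(i, outcome)
        print(outcome, "->", BEATS[i][0])
```

The usage loop walks one possible session: the student misses the because-gate once, then passes it, is scaffolded through the request-gate, and matches the sequencing gate to reach the finale.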

4. Pedagogical Design

AgentJam's pedagogy centers on four principles: speaking gates progress (not resource grinding), scaffold never hard-block, learn when it matters (language in communicative context), and dynamic not scripted (LLM adapts to the student).

4.1 Three English Frames

AgentJam targets three grammatical frames, each embedded in a mission context where the utterance serves a genuine communicative function. Below we enumerate each frame, its mission context, target utterance, and the communicative purpose that makes the utterance necessary rather than artificial.

Frame 1

Because (Causation)

Explaining why an action is necessary. Embedded in Beat 2 (Cave): the mentor leads the student to a dark cave entrance and asks why preparation is needed before entering. Target: "We need torches because the cave is dark." The utterance has genuine communicative purpose — the student justifies a safety decision that affects game survival.

Frame 2

Can you pass me (Request)

Requesting resources held by the mentor. Embedded in Beat 3 (Wood): the mentor collects wood but withholds it. The student needs it to build the shelter. Target: "Can you pass me the wood?" The mentor models polite request forms (please, would you mind) before the gate.

Frame 3

First...then (Sequencing)

Describing a plan with ordered steps. Embedded in Beat 4 (Shelter): before construction begins, the mentor asks the student to articulate the build sequence. Target: "First we build the walls, then we add the roof." This frame practices temporal connectives in a concrete, spatially grounded planning context.

4.2 Pedagogical Principles

Speaking gates progress. Game advancement requires spoken English output. The incentive structure aligns game motivation with language production: the student wants to see what happens next in the mission, and the only way forward is to speak. This creates natural communicative pressure — distinct from the artificial pressure of tests or quizzes, and grounded in the intrinsic motivation to continue an engaging experience. The gate function G (Section 2.2) formalizes the transition from motivation to action.

Scaffold, never hard-block. The system provides progressively more explicit support (modeling language, providing sentence starters, accepting approximations) so that the gate opens before frustration sets in. If the student produces an approximate utterance (e.g., "Because it dark" for the because-gate), the gate accepts it and the mentor provides a natural recast ("Yes! Because it IS dark, we need torches"). If the student is still unmatched after two attempts, G returns the scaffold outcome (Section 2.2): the mentor models the target language with increased explicitness and the mission advances, with the support level recorded for post-session analysis.

Learn when it matters. Every target utterance is embedded in a mission context where that utterance has genuine communicative purpose. The student explains, requests, and plans in situations where those speech acts naturally occur during collaborative gameplay. This contrasts with decontextualized drill practice where utterances serve only the function of being evaluated.

Dynamic, not scripted. The LLM adapts to the student's proficiency level, interests, and spontaneous choices. No two sessions follow the same conversational path — even with identical simulated student input, DeepSeek's non-deterministic generation produces different dialogue. The mentor responds to what the student says, not what a fixed curriculum expects, while the deterministic Φ_ctrl ensures consistent in-world behavior regardless of conversational variation.

5. Evaluation

AgentJam is a working prototype tested on macOS (Apple Silicon M2 Pro, 16 GB RAM) with real Minecraft Java clients. We evaluate the system across four dimensions (mentor reliability, speaking gate accuracy, spatial voice fidelity, and mission completion dynamics) and report results from 20+ test missions with simulated student speech of varying proficiency levels.

5.1 Comparison with Prior Language Learning Systems

Property	AgentJam	Duolingo	ELSA Speak	ChatGPT Tutor	Scripted Game
Embodied mentor in 3D world	✓ Player-bot	✗	✗	✗	◐ NPC
Spatial voice (distance attenuation)	✓ 1/d²	✗	✗	✗	✗
Speaking gates (progress on speech)	✓ G(u,f) formal	✗	◐ Pronunciation	✗	✗
Task-based communicative need	✓ Mission beats	✗	✗	◐ Conversation	◐ Scripted
Dynamic (not scripted) interaction	✓ LLM-driven	✗	✗	✓	✗
Deterministic world behavior	✓ Φ_ctrl	✓ N/A	✓ N/A	✓ N/A	✓
Scaffolded (never hard-block)	✓ 3-outcome G	✓	◐	✗	✗
Pedagogical grounding	✓ TBLT + output	◐ Spaced rep	◐ Phonetics	✗	✗

Table 1. Comparison of AgentJam with prior language learning systems across eight properties. AgentJam uniquely combines embodied presence, spatial voice, speaking gates with formal semantics, task-based communicative need, and dynamic LLM-driven interaction — while maintaining deterministic world behavior through the Φ_ctrl decoupling.

5.2 Performance Benchmarks

Metric	Value	n	Notes
Mentor in-world reliability	100%	20	No pathfinding errors, block misplacements, or erratic actions
Frame-matching accuracy	92% (46/50)	50	Across three frames; 4% false negative, 4% false positive
Voice pipeline latency	180–280 ms (mean 220 ms)	50	LLM token → TTS → audio output
Spatial panning correctness	100%	20	Left/right orientation verified at all test positions
Mission completion time	15–25 min (mean 19.2 min)	10	Full five-beat mission; varies with student proficiency
Because-gate duration	1.8 min (mean)	10	Includes mentor modeling + student production
Request-gate duration	1.2 min (mean)	10	Fastest gate; direct request frame
FirstThen-gate duration	1.5 min (mean)	10	Sequencing requires planning articulation
Scaffolding reliability	100%	15	Low-proficiency simulations: mentor scaffolded within 2 prompts
ASR robustness (native)	~92% word accuracy	30	Native English; frame detection unaffected by minor WER
ASR robustness (accented)	~78% recall	20	Non-native accents reduce frame detection; known limitation

Table 2. Performance benchmarks for AgentJam across eleven metrics evaluated over 20+ test missions. The mentor achieves 100% in-world reliability through deterministic Φ_ctrl. Frame matching reaches 92% accuracy. ASR robustness with accented speech (78%) represents the primary failure mode and area for improvement.
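As a quick consistency check on Table 2, the short Python sketch below recomputes the frame-matching accuracy and estimates how much of the mean mission time the three gates occupy. The gate share is a ratio of reported means (n = 10) and should be read as a rough estimate, not a measured quantity; the variable names are illustrative.

```python
# Values copied from Table 2.
FRAME_CORRECT, FRAME_TOTAL = 46, 50
GATE_MEAN_MIN = {"because": 1.8, "request": 1.2, "firstThen": 1.5}
MISSION_MEAN_MIN = 19.2

accuracy = FRAME_CORRECT / FRAME_TOTAL      # 46/50 = 0.92
errors = FRAME_TOTAL - FRAME_CORRECT        # 4 utterances misclassified
gate_total = sum(GATE_MEAN_MIN.values())    # 4.5 min across the three gates
gate_share = gate_total / MISSION_MEAN_MIN  # fraction of mean mission time

if __name__ == "__main__":
    print(f"accuracy   = {accuracy:.2%}")
    print(f"errors     = {errors} of {FRAME_TOTAL}")
    print(f"gate total = {gate_total:.1f} min")
    print(f"gate share = {gate_share:.1%}")
```

The four errors agree with the reported 4% false negatives and 4% false positives (two utterances each out of 50), and the three gates together take about 4.5 min, a little under a quarter of the 19.2 min mean mission time.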

[Figure diagram: two-stage action pipeline and beat state machine. Stage 1, θ_LLM (DeepSeek, 9 actions): context + obs → a_high ∈ A_high. Stage 2, Φ_ctrl (Mineflayer, deterministic): a_high + positions → Minecraft action. Minecraft world: movement, blocks, combat. Beat state machine (per mission): Briefing (no gate, set context) · Cave (gate: because) · Wood (gate: request) · Shelter