AgentJam: Embodied AI Mentors for Language Learning in Minecraft · Omo Research
Embodied Agents · Language Learning · Minecraft · Voice AI · Education

Harry Edwards

Omo Research

FINAL: Omo Research Technical Report · June 2025

AgentJam: Embodied AI Mentors for Language Learning in Minecraft

We present AgentJam, a system that deploys AI language mentors as embodied player-bots inside Minecraft. The mentor connects via the Minecraft protocol as a real player (walking, mining, building, and fighting in the game world) and communicates through spatial voice that attenuates with Euclidean distance following vol(d) = 1/d², clamped to full volume within one block. We introduce speaking gates: a pedagogical mechanism where mission progression is blocked not by resource collection but by the requirement that the student produce a spoken English utterance in a target grammatical frame: because (causation), Can you pass me (request), or firstThen (sequencing). Each frame is embedded in a mission beat that makes the utterance communicatively necessary. The architecture decouples linguistic reasoning (LLM, DeepSeek) from deterministic world control (Mineflayer), producing consistent in-world behavior while maintaining rich, context-aware conversation. The frame matcher achieves 92% accuracy on test utterances, spatial voice latency is 180–280 ms, and full mission completion time is 15–25 min. We note that all evaluation uses simulated student speech; human learner studies are planned for future work (Section 7.1).

Students Supported: 1 (teaching mode)
LLM: DeepSeek
Voice Latency: 180–280 ms (pipeline)
Platform: Minecraft 1.21.x

Executive Summary

AgentJam puts an AI language tutor inside Minecraft as a real player-bot that walks, mines, builds, and speaks with spatial voice that fades with distance. The core idea is the "speaking gate": mission progress is blocked until the student produces a spoken English sentence using a target grammar pattern like "because" or "Can you pass me," so the student speaks not for a test score but because it is the only way to advance through the game. The mentor's architecture separates language intelligence (DeepSeek LLM) from game actions (deterministic Mineflayer code), ensuring the bot never behaves erratically. The system targets three grammatical frames embedded in a five-beat Minecraft mission; all evaluation uses simulated student speech, with human learner studies planned for future work.

Abstract

Background. Second language acquisition research establishes that language learning requires meaningful, contextual interaction with opportunities for negotiated output [1, 8]. Yet most AI language tutors are text-based chatbots that lack embodiment, spatial context, and the communicative urgency that arises from shared physical activity.

Problem. Existing game-based language learning tools use scripted dialogues and text interactions. AI-driven alternatives (Duolingo, ELSA Speak) provide individual pronunciation practice but lack the immersive, task-based social dynamics of collaborative gameplay. No current system combines an AI mentor with full game embodiment, spatial voice, and pedagogically-grounded spoken language production requirements embedded in mission-driven gameplay.

Approach. We introduce AgentJam, placing AI mentors inside Minecraft as real player-bots communicating through spatial voice. The mentor's architecture decouples linguistic reasoning (DeepSeek LLM) from deterministic world control (Mineflayer protocol library) via a formal two-stage action pipeline (Section 2.1). Our core pedagogical contribution is the speaking gate: a mechanism where mission progression is blocked until the student produces a spoken English utterance in a target grammatical frame — because (causation), Can you pass me (request), and firstThen (sequencing) — each embedded in a mission beat that makes the utterance communicatively necessary. The gate function G(utterance, fb) → {matched, unmatched, scaffold} formalizes the scaffolding strategy.

Results. The working prototype supports a full five-beat mission ("Safe House before Night") with three speaking gates. Mentor in-world behavior achieves 100% reliability across 20 test missions (deterministic Φctrl), frame-matching accuracy reaches 92% (46 of 50 test utterances), and spatial voice attenuation follows the clamped inverse-square law with 180–280 ms end-to-end latency. Mission completion time is 15–25 min (mean 19.2 min, n = 10); the because, request, and firstThen gates add a mean of 1.8, 1.2, and 1.5 min respectively, varying with student proficiency.

Implications. Speaking gates represent a generalizable pedagogical mechanism: any game with progression mechanics can gate advancement on spoken language output, converting the player's motivation to continue into an incentive for communication. The decoupled LLM/Φctrl architecture provides a template for safe AI embodiment in virtual environments where reliable physical behavior coexists with rich conversational interaction. The approach extends to autonomous classroom scenarios [18] through peer-to-peer resource gaps.

Keywords: embodied AI, language learning, Minecraft, speaking gates, task-based language teaching, spatial voice

1. Introduction

Research in second language acquisition consistently demonstrates that language is acquired most effectively through meaningful interaction in contexts where the utterance has genuine communicative purpose [1, 8, 12]. Learners need opportunities to produce language in situations where what they say matters — requesting needed resources, explaining causal relationships, coordinating joint action. The output hypothesis [12] argues that language production itself drives acquisition by forcing learners to process language syntactically rather than merely semantically. This theoretical insight has been validated across decades of empirical research [3, 14] but remains difficult to operationalize at scale: creating genuinely communicative situations for each learner requires careful task design and attentive facilitation — precisely the kind of pedagogical labor that is most expensive to reproduce.

Minecraft presents a uniquely suitable environment for task-based language learning. Its open-world, goal-directed gameplay creates natural communicative needs: two players building a shelter before nightfall must coordinate actions, share resources, and explain decisions. The block-based world provides concrete spatial referents for causal and relational language — a torch is needed because a cave is dark; a wall requires wood so the player must request it; a shelter requires sequential construction steps that need articulation. The game's popularity — particularly among younger learners, with over 300 million copies sold — ensures environmental familiarity, reducing the cognitive overhead of learning a new interface alongside a new language.

AgentJam places AI mentors directly into this environment as embodied player-bots. The mentor is not a chatbot overlay or passive tutor; it is a fellow player navigating the same terrain, mining the same blocks, and facing the same environmental dangers. It speaks aloud with spatial voice that attenuates with Euclidean distance following the inverse-square law (formalized in Section 2.1). It plays alongside the student as both peer and mentor, transforming the game into a language classroom where the student's desire to progress through the mission creates the communicative pressure to speak, a mechanism we formalize as speaking gates in Section 2.2.

1.1 Related Work

Task-Based Language Teaching (TBLT). TBLT [14, 3] positions meaningful tasks — rather than grammatical exercises — as the central unit of language instruction. The task cycle (pre-task, task, planning, report, analysis) structures learning around communicative outcomes. AgentJam's mission system maps directly to TBLT: the briefing phase (Beat 1) corresponds to pre-task framing, the speaking gates instantiate the task cycle's planning/report phases, and the finale (Beat 5) provides post-task reflection and positive reinforcement.

AI language tutors. Commercial systems like Duolingo and ELSA Speak use ASR for pronunciation feedback but focus on individual, scripted practice. AI chatbots for language learning (e.g., ChatGPT-based tutors) provide text-based conversation without embodiment or spatial context. Game-based language learning platforms — including Mondly VR (VR-based conversation practice) and Immerse (social VR language learning) — introduce spatial presence but use scripted dialogues and pre-authored content rather than dynamically generated LLM interactions. Automated speaking assessment systems like SpeechRater and Pearson's Versant use ASR-based feature extraction for proficiency scoring but assess rather than teach. The key gap these systems share is the absence of task-based communicative pressure in a shared immersive environment where the student's utterance has material consequences beyond a score.

Embodied conversational agents. Cassell [2] established the value of physical presence in human-computer interaction through embodied conversational agents. Subsequent work on pedagogical agents [7] demonstrated that agents with visual presence improve learning outcomes. AgentJam extends this lineage to open-world gameplay, where the agent acts autonomously in a dynamic environment rather than following scripted interaction patterns — the mentor mines, builds, and fights as a genuine co-player.

Minecraft bots and Mineflayer. The Mineflayer library [10] provides a JavaScript API for programmatic Minecraft bot creation. Existing bots focus on automation — farming, resource collection, and building — rather than human interaction. AgentJam is, to our knowledge, the first system to use Mineflayer for a pedagogically-motivated AI mentor that plays alongside human learners, with all movement and construction executed deterministically through the Φctrl function.

LLM-robot decoupling. Ahn et al. [4] demonstrated that decoupling high-level LLM planning from low-level robotic control produces safer, more reliable behavior in physical robots. AgentJam applies this principle to virtual embodiment: the LLM (θLLM) determines what to say and at what pedagogical level, while deterministic Mineflayer code (Φctrl) executes all Minecraft actions from a fixed action set Ahigh.

LLM agents in Minecraft. Voyager [13] demonstrated lifelong skill acquisition in Minecraft through LLM-generated code. Mindcraft [5] showed that LLMs can execute construction tasks from natural-language descriptions. These systems treat Minecraft as an agent training or task-execution environment. AgentJam inverts this relationship: the AI mentor is pre-built and operates alongside a human learner, with the human — not the AI — as the primary beneficiary of the interaction.

1.2 Contributions

This paper makes the following contributions:

1. The embodied mentor architecture with formal specification. A decoupled design where an LLM (θLLM) handles linguistic reasoning and pedagogical decisions from a fixed high-level action set Ahigh (9 actions), while deterministic Mineflayer code (Φctrl) executes all Minecraft actions — producing 100% reliable in-world behavior across 20 test missions. The formal two-stage pipeline (θLLM → Φctrl boundary) is specified in Section 2.1, with the LLM restricted to high-level action selection and all world-state mutations executed deterministically.

2. Speaking gates with formal gate semantics. A pedagogical mechanism that gates mission progression on spoken English output, targeting three grammatical frames embedded in mission contexts that make each utterance communicatively necessary. The gate function G(utterance, fb) → {matched, unmatched, scaffold} provides progressive scaffolding rather than hard-blocking (Section 2.2), with the scaffold outcome ensuring no student is permanently blocked on a single gate.

3. The five-beat mission instantiation. A scaffolded sequence (Camp → Cave → Wood → Shelter → Finale) instantiating the speaking-gate concept. Each beat creates natural communicative pressure for a specific grammatical frame, with the mentor modeling language before each gate and accepting approximate student output to preserve engagement.

4. Spatial voice pipeline with empirical characterization. Mentor speech rendered through TTS with volume attenuation following vol(d) = min(1.0, 1.0/max(1.0, d)²), directional panning from relative angle, and end-to-end latency of 180–280 ms as measured across 50 test utterances.

2. The Mentor Architecture

AgentJam's mentor runs as a decoupled process: Mineflayer connects to the Minecraft server as a real player, DeepSeek provides linguistic intelligence, and deterministic code controls all in-world actions. The LLM chooses what to say and which high-level action to take; the code carries every action out.

2.1 Formal Model

Let S be the student and M the mentor. The mentor is defined by a tuple capturing its spatial, inventory, and cognitive state:

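$$
M = (\mathrm{pos}_M,\ \mathrm{inv}_M,\ \mathrm{ctx}_M,\ \theta_{\mathrm{LLM}},\ \Phi_{\mathrm{ctrl}})
$$

where

$$
\begin{aligned}
\mathrm{pos}_M &\in \mathbb{R}^3, \qquad \mathrm{inv}_M \in \mathcal{I} \quad \text{(inventory state space)} \\
\mathrm{ctx}_M &\in \mathcal{C} \quad \text{(accumulated conversation context)} \\
\theta_{\mathrm{LLM}} &: \mathcal{C} \times \mathcal{O} \to A_{\mathrm{high}} \quad \text{(LLM policy, DeepSeek)} \\
\Phi_{\mathrm{ctrl}} &: A_{\mathrm{high}} \times \mathbb{R}^3 \times \mathbb{R}^3 \to \mathrm{MinecraftActions} \quad \text{(deterministic control)}
\end{aligned}
$$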

Mentor action selection is governed by a two-stage pipeline that enforces the LLM/control boundary:

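$$
\begin{aligned}
\text{Stage 1 (High-level):} \quad a_{\mathrm{high}} &= \theta_{\mathrm{LLM}}(\mathrm{ctx}_M,\ \mathrm{obs}_{\mathrm{world}}) \\
\text{Stage 2 (Low-level):} \quad a_{\mathrm{low}} &= \Phi_{\mathrm{ctrl}}(a_{\mathrm{high}},\ \mathrm{pos}_M,\ \mathrm{pos}_S)
\end{aligned}
$$

with

$$
A_{\mathrm{high}} = \{\texttt{MOVE\_TO\_PLAYER},\ \texttt{START\_BEAT},\ \texttt{CHECK\_GATE},\ \texttt{MODEL\_LANGUAGE},\ \texttt{GIVE\_ITEM},\ \texttt{SPEAK},\ \texttt{MINE\_BLOCK},\ \texttt{PLACE\_BLOCK},\ \texttt{ATTACK\_HOSTILE}\}
$$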

The LLM never directly produces low-level Minecraft commands — it selects from Ahigh, and Φctrl translates each selection to deterministic Mineflayer operations. This constraint eliminates the possibility of LLM hallucination producing erratic in-world behavior (e.g., teleportation to invalid coordinates, breaking unbreakable blocks, or issuing impossible commands).
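The two-stage boundary can be illustrated with a minimal Python sketch. It is not the AgentJam implementation (which drives Mineflayer from JavaScript): the `LowLevelOp` record, the operation names such as `pathfind_to`, and the fallback to `SPEAK` for out-of-set choices are all assumptions made for illustration. Only the nine action names and the Φctrl signature come from the formal model above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


class HighLevelAction(Enum):
    """The nine actions in A_high, as listed in the formal model."""
    MOVE_TO_PLAYER = "MOVE_TO_PLAYER"
    START_BEAT = "START_BEAT"
    CHECK_GATE = "CHECK_GATE"
    MODEL_LANGUAGE = "MODEL_LANGUAGE"
    GIVE_ITEM = "GIVE_ITEM"
    SPEAK = "SPEAK"
    MINE_BLOCK = "MINE_BLOCK"
    PLACE_BLOCK = "PLACE_BLOCK"
    ATTACK_HOSTILE = "ATTACK_HOSTILE"


@dataclass(frozen=True)
class LowLevelOp:
    """Hypothetical stand-in for one deterministic controller operation."""
    name: str
    target: Vec3


def parse_llm_choice(raw: str) -> HighLevelAction:
    """Stage 1 boundary: accept only members of A_high."""
    try:
        return HighLevelAction(raw.strip().upper())
    except ValueError:
        # Assumed fallback: an out-of-set choice degrades to speech only.
        return HighLevelAction.SPEAK


def phi_ctrl(action: HighLevelAction, pos_m: Vec3, pos_s: Vec3) -> List[LowLevelOp]:
    """Stage 2: a pure function of (a_high, pos_M, pos_S), with no randomness."""
    if action is HighLevelAction.MOVE_TO_PLAYER:
        return [LowLevelOp("pathfind_to", pos_s)]
    if action is HighLevelAction.GIVE_ITEM:
        return [LowLevelOp("pathfind_to", pos_s), LowLevelOp("toss_item", pos_s)]
    if action in (HighLevelAction.MINE_BLOCK, HighLevelAction.PLACE_BLOCK,
                  HighLevelAction.ATTACK_HOSTILE):
        # Target resolution is elided; the mentor position is a placeholder.
        return [LowLevelOp(action.value.lower(), pos_m)]
    # START_BEAT, CHECK_GATE, MODEL_LANGUAGE, and SPEAK do not mutate the world.
    return []


def step(theta_llm: Callable[[str, str], str], ctx: str, obs: str,
         pos_m: Vec3, pos_s: Vec3) -> List[LowLevelOp]:
    """Run one pass through both stages."""
    a_high = parse_llm_choice(theta_llm(ctx, obs))
    return phi_ctrl(a_high, pos_m, pos_s)
```

In this sketch an LLM reply such as "teleport 1e9 0 0" never reaches the controller as a command: it fails the set-membership check and is mapped to a world-neutral action.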

Spatial voice attenuation follows the inverse-square law modified for Minecraft's 1-meter block coordinate system:

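$$
\mathrm{vol}(d) = \min\!\left(1.0,\ \frac{1.0}{\max(1.0,\ d)^2}\right)
$$

where $d = \lVert \mathrm{pos}_M - \mathrm{pos}_S \rVert_2$ is the Euclidean distance in blocks, and

$$
\mathrm{pan}(\theta_{\mathrm{rel}}) = \sin(\theta_{\mathrm{rel}}), \qquad \theta_{\mathrm{rel}} = \mathrm{angle}\!\left(\mathbf{v}^{\mathrm{look}}_S,\ \mathrm{pos}_M - \mathrm{pos}_S\right)
$$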

At d ≤ 1, volume is full (1.0). At d = 8, volume ≈ 0.016 (1.6% of maximum). At d > 16 — approximately Minecraft's natural audible range — volume approaches 0. Directional panning computes from the relative angle between the student's look vector and the vector to the mentor, providing correct left/right spatialization as verified in 100% of test positions.
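The attenuation and panning formulas can be checked numerically with a short Python sketch. The volume function follows the formula above directly. The panning function rests on assumptions the text does not state: the angle is treated as a signed angle in the horizontal x-z plane, and positive values mean "mentor to the student's right" under Minecraft's convention of +x east and +z south.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def vol(d: float) -> float:
    """Clamped inverse-square attenuation; d is the distance in blocks."""
    return min(1.0, 1.0 / max(1.0, d) ** 2)


def distance(pos_m: Vec3, pos_s: Vec3) -> float:
    """Euclidean distance between mentor and student positions."""
    return math.dist(pos_m, pos_s)


def pan(look_s: Vec3, pos_m: Vec3, pos_s: Vec3) -> float:
    """sin of the signed horizontal angle from the look vector to the mentor.

    Assumption: only x and z are used, and +1.0 means fully to the right.
    Returns 0.0 when either horizontal vector has zero length.
    """
    lx, lz = look_s[0], look_s[2]
    tx, tz = pos_m[0] - pos_s[0], pos_m[2] - pos_s[2]
    norm = math.hypot(lx, lz) * math.hypot(tx, tz)
    if norm == 0.0:
        return 0.0
    # The 2D cross product divided by the norms equals sin of the signed angle.
    return (lx * tz - lz * tx) / norm


if __name__ == "__main__":
    for d in (0.5, 1.0, 8.0, 16.0):
        print(f"d={d:>4}: vol={vol(d):.4f}")
```

At d = 8 the function returns 1/64 ≈ 0.016, matching the figure quoted above; at d = 16 it returns 1/256 ≈ 0.004.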

2.2 Speaking Gate Formal Semantics

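Let F = {because, request, firstThen} be the set of target grammatical frames. For mission beat b with associated frame f_b ∈ F, the gate function G is:

$$
G(\mathrm{utterance},\ f_b) \to \{\text{matched},\ \text{unmatched},\ \text{scaffold}\}
$$

$$
G(u, f) =
\begin{cases}
\text{matched} & \text{if } \mathrm{match}(u, f) > \tau_{\mathrm{match}} \\
\text{unmatched} & \text{if } \mathrm{match}(u, f) \le \tau_{\mathrm{match}} \ \wedge\ \mathrm{attempts} < 2 \\
\text{scaffold} & \text{if } \mathrm{match}(u, f) \le \tau_{\mathrm{match}} \ \wedge\ \mathrm{attempts} \ge 2
\end{cases}
$$

where $\mathrm{match}(u, f) = \mathrm{sim}(\mathrm{embed}(u),\ \mathrm{embed}(\mathrm{exemplar}_f))$ and $\tau_{\mathrm{match}} = 0.72$ (empirically calibrated threshold).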

The matched outcome advances the mission to the next beat. The unmatched outcome triggers mentor scaffolding: the mentor re-models the target language with increased explicitness, provides sentence starters, or accepts close approximations. The scaffold outcome advances the mission while recording that the student required substantial support, so sessions never stall on a single gate. This three-outcome design operationalizes the pedagogical principle of "scaffold, never hard-block" (Section 4.2).
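The three-outcome logic can be sketched in Python as follows. The threshold 0.72 and the attempt cutoff of 2 come from the definition above; everything else is an assumption for illustration. In particular, `toy_embed` is a bag-of-words stand-in rather than the embedding model the system uses, the exemplars are simply the target utterances quoted elsewhere in this report, and `attempts` is taken to mean the number of unmatched attempts already made at the current gate.

```python
import math
import re
from collections import Counter
from enum import Enum
from typing import Callable

TAU_MATCH = 0.72  # threshold quoted in the gate definition


class GateOutcome(Enum):
    MATCHED = "matched"
    UNMATCHED = "unmatched"
    SCAFFOLD = "scaffold"


# Assumed exemplars: the target utterances given for each frame.
EXEMPLARS = {
    "because": "We need torches because the cave is dark.",
    "request": "Can you pass me the wood?",
    "firstThen": "First we build the walls, then we add the roof.",
}


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real sentence embedding (assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate(utterance: str, frame: str, attempts: int,
         embed: Callable[[str], Counter] = toy_embed,
         tau: float = TAU_MATCH) -> GateOutcome:
    """Return matched, unmatched, or scaffold for one student utterance."""
    score = cosine(embed(utterance), embed(EXEMPLARS[frame]))
    if score > tau:
        return GateOutcome.MATCHED
    return GateOutcome.UNMATCHED if attempts < 2 else GateOutcome.SCAFFOLD
```

Swapping `toy_embed` for a sentence encoder changes only the `embed` argument; the branching on the threshold and the attempt count stays the same.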

[Figure 1: Mentor stack (LLM for words, deterministic Φctrl for actions). Panels: Student (Minecraft client, mic + speakers); Server (Paper + plugin, world state, chat); Mentor as a decoupled process with a Mineflayer Φctrl body (move, mine, build, combat) and a DeepSeek θLLM brain (context, speech, pedagogy); voice and pedagogy pipeline: student speech → speaking gate G → mentor response → spatial TTS with vol(d) = 1/d².]

Figure 1. The AgentJam mentor architecture. The mentor runs as a decoupled process: a Mineflayer body (deterministic Minecraft control, Φctrl) and a DeepSeek brain (linguistic reasoning, θLLM). Student speech passes through speaking gate G for frame matching. Spatial TTS attenuates volume with Euclidean distance.

2.3 Decoupled Design

Mineflayer — The Body

Deterministic Control (Φctrl)

Connects to Minecraft Java 1.21.x via the native game protocol as a real player entity. Full control over movement, block placement, combat, and inventory management. Observes world state, player positions, and chat. All actions execute through deterministic code — the LLM selects from Ahigh (9 fixed actions) and Φctrl translates each to protocol-level operations with 100% reliability across 20 test missions.

Mineflayer · Protocol 1.21.x · Real Player-Bot

DeepSeek — The Brain

Linguistic Intelligence (θLLM)

Processes world observations, player actions, and accumulated conversation context ctxM. Decides what to say and what pedagogical action to take from Ahigh, but never issues low-level Minecraft commands. Generates speech text rendered through spatial TTS. Operates with tool-use capability for the 9 high-level actions, maintaining session-long conversation history without summarization degradation.

DeepSeek · Tool Use (Ahigh) · Context-Aware

Spatial Voice Pipeline

Distance-Attenuated Audio

Mentor speech renders through TTS with volume computed as vol(d) = min(1.0, 1.0/max(1.0, d)²). Directional panning uses the relative angle between student look vector and mentor position, providing correct spatialization in 100% of test positions. End-to-end pipeline latency (LLM → TTS → audio output) is 180–280 ms (mean 220 ms, n = 50).

Spatial Audio · vol(d) = 1/d² · 180–280 ms

2.4 Design Rationale

The decoupling of LLM (θLLM) from deterministic control (Φctrl) serves two purposes. Behavioral reliability: an LLM issuing raw Minecraft movement and block placement commands could produce erratic, nonsensical, or out-of-bounds in-world behavior — a failure mode we term world hallucination (extending the concept from Omo Space [11], where it denotes LLM-generated content corrupting world state). By restricting the LLM to the fixed action set Ahigh (9 actions) and implementing all world interactions in Φctrl, the mentor produces consistent, predictable behavior validated across 20 test missions at 100% reliability. Note that "reliability" here refers to deterministic behavioral correctness — the mentor navigates to waypoints without pathfinding errors, places blocks with coordinate precision, and responds to hostile mobs with consistent combat routines. It does not address content safety (e.g., the LLM generating inappropriate speech), which is a separate concern requiring content filtering.

Pedagogical consistency: LLMs exhibit inherent stochasticity that makes them unsuitable for precise spatial operations. The mentor must reach the student's position reliably, mine the correct block type, and construct shelters with correct dimensions. These operations are deterministic in Φctrl, guaranteeing that the mentor's physical behavior is correct regardless of LLM temperature setting or prompt variation. This property is crucial for a system intended for use with young learners, where unpredictable bot behavior would quickly erode trust and pedagogical effectiveness. The LLM's conversational variation is preserved — each session produces different dialogue — but the physical actions remain consistent.

We selected DeepSeek over alternatives (GPT-4, Claude, Gemini) for three reasons: (a) strong performance on conversational reasoning tasks at lower inference cost, enabling extended session use without API budget concerns; (b) native support for tool-use patterns that map cleanly to Ahigh; and (c) sufficient context window (128K tokens) to maintain session-long conversation history without summarization degradation. The architecture is LLM-agnostic — any model supporting tool-use can replace DeepSeek without changes to Φctrl or the spatial voice pipeline, as we verified by substituting Gemini Flash in cross-compatibility testing.
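One way to picture the LLM-agnostic claim is a narrow policy interface that any backend must satisfy. The Python sketch below is hypothetical: `MentorPolicy`, `RuleStubA`, `RuleStubB`, and `choose` are illustrative names, and the stub backends use fixed rules instead of calling any real model API. It only shows that the validation step is independent of which backend produced the choice.

```python
from typing import Protocol, Sequence

A_HIGH = frozenset({
    "MOVE_TO_PLAYER", "START_BEAT", "CHECK_GATE", "MODEL_LANGUAGE",
    "GIVE_ITEM", "SPEAK", "MINE_BLOCK", "PLACE_BLOCK", "ATTACK_HOSTILE",
})


class MentorPolicy(Protocol):
    """Hypothetical interface: any backend that returns an action name."""

    def select_action(self, context: Sequence[str], observation: str) -> str:
        ...


class RuleStubA:
    """Stand-in backend with a fixed rule; makes no network calls."""

    def select_action(self, context: Sequence[str], observation: str) -> str:
        return "MODEL_LANGUAGE" if "gate" in observation else "SPEAK"


class RuleStubB:
    """A second stand-in backend with a different rule."""

    def select_action(self, context: Sequence[str], observation: str) -> str:
        return "CHECK_GATE" if context else "START_BEAT"


def choose(policy: MentorPolicy, context: Sequence[str], observation: str) -> str:
    """Validate the backend's choice against A_high before it reaches control."""
    choice = policy.select_action(context, observation)
    if choice not in A_HIGH:
        raise ValueError(f"{choice!r} is not in A_high")
    return choice
```

Because `choose` depends only on the interface, replacing one backend with another leaves the control side untouched in this sketch.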

3. The Mission System

The "Safe House before Night" mission demonstrates the full speaking-gate pipeline across five beats. Each beat creates natural communicative pressure for a specific grammatical frame, with the mentor modeling language before each gate and accepting approximate student output.

[Figure 2: Mission "Safe House before Night" (five beats, three speaking gates). Beat 1: Camp (briefing, set goal). Beat 2: Cave (gate, frame: because). Beat 3: Wood (gate, frame: request). Beat 4: Shelter (gate, frame: firstThen). Beat 5: Finale (celebrate, review). At each gate the student must produce the target English frame to advance: mentor models → student produces → gate opens → mission continues.]

Figure 2. The five-beat mission structure. Beats 2–4 contain speaking gates targeting: because (causation), request (Can you pass me), and firstThen (sequencing). The mentor models language before each gate; the student must produce a matching utterance to advance. Beat 1 (Camp) establishes context; Beat 5 (Finale) provides closure.
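The beat sequence behaves like a small state machine, which the following Python sketch makes concrete. The beat names and frames are taken from the figure; the class layout, the per-gate attempt counter, and the `scaffolded` log are assumptions about how such a tracker might be organised, not a description of the AgentJam code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Beat:
    name: str
    gate_frame: Optional[str] = None  # None means the beat has no speaking gate


MISSION = (
    Beat("Camp"),
    Beat("Cave", "because"),
    Beat("Wood", "request"),
    Beat("Shelter", "firstThen"),
    Beat("Finale"),
)


@dataclass
class MissionState:
    index: int = 0
    attempts: int = 0  # unmatched attempts at the current gate
    scaffolded: List[str] = field(default_factory=list)

    @property
    def beat(self) -> Beat:
        return MISSION[self.index]

    def _advance(self) -> None:
        self.index = min(self.index + 1, len(MISSION) - 1)
        self.attempts = 0

    def complete_ungated(self) -> None:
        """Leave a beat that has no speaking gate (Camp or Finale)."""
        if self.beat.gate_frame is not None:
            raise ValueError("this beat has a speaking gate")
        self._advance()

    def on_gate(self, outcome: str) -> None:
        """Apply one gate outcome: matched, unmatched, or scaffold."""
        if self.beat.gate_frame is None:
            raise ValueError("this beat has no speaking gate")
        if outcome == "unmatched":
            self.attempts += 1  # stay on the beat; the mentor re-models
        elif outcome in ("matched", "scaffold"):
            if outcome == "scaffold":
                self.scaffolded.append(self.beat.name)  # kept for later review
            self._advance()
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
```

Both matched and scaffold move the mission forward; only unmatched keeps the student on the same beat, which is what prevents a session from stalling.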

4. Pedagogical Design

AgentJam's pedagogy centers on four principles: speaking gates progress (not resource grinding), scaffold never hard-block, learn when it matters (language in communicative context), and dynamic not scripted (LLM adapts to the student).

4.1 Three English Frames

AgentJam targets three grammatical frames, each embedded in a mission context where the utterance serves a genuine communicative function. Below we enumerate each frame, its mission context, target utterance, and the communicative purpose that makes the utterance necessary rather than artificial.

Frame 1

Because (Causation)

Explaining why an action is necessary. Embedded in Beat 2 (Cave): the mentor leads the student to a dark cave entrance and asks why preparation is needed before entering. Target: "We need torches because the cave is dark." The utterance has genuine communicative purpose — the student justifies a safety decision that affects game survival.

Frame 2

Can you pass me (Request)

Requesting resources held by the mentor. Embedded in Beat 3 (Wood): the mentor collects wood but withholds it. The student needs it to build the shelter. Target: "Can you pass me the wood?" The mentor models polite request forms (please, would you mind) before the gate.

Frame 3

First...then (Sequencing)

Describing a plan with ordered steps. Embedded in Beat 4 (Shelter): before construction begins, the mentor asks the student to articulate the build sequence. Target: "First we build the walls, then we add the roof." This frame practices temporal connectives in a concrete, spatially-grounded planning context.

4.2 Pedagogical Principles

Speaking gates progress. Game advancement requires spoken English output. The incentive structure aligns game motivation with language production: the student wants to see what happens next in the mission, and the only way forward is to speak. This creates natural communicative pressure — distinct from the artificial pressure of tests or quizzes, and grounded in the intrinsic motivation to continue an engaging experience. The gate function G (Section 2.2) formalizes the transition from motivation to action.

Scaffold, never hard-block. The system provides progressively more explicit support (modeling language, providing sentence starters, accepting approximations) so that the gate opens before frustration sets in. The scaffold outcome of G records the support level for post-session analysis without blocking progress. If a student is still struggling after two attempts, the mentor models the target language with increased explicitness and the scaffold outcome lets the mission continue. If the student produces an approximate utterance (e.g., "Because it dark" for the because-gate), the gate accepts it and the mentor provides a natural recast ("Yes! Because it IS dark, we need torches").

Learn when it matters. Every target utterance is embedded in a mission context where that utterance has genuine communicative purpose. The student explains, requests, and plans in situations where those speech acts naturally occur during collaborative gameplay. This contrasts with decontextualized drill practice where utterances serve only the function of being evaluated.

Dynamic, not scripted. The LLM adapts to the student's proficiency level, interests, and spontaneous choices. No two sessions follow the same conversational path — even with identical simulated student input, DeepSeek's non-deterministic generation produces different dialogue. The mentor responds to what the student says, not what a fixed curriculum expects, while the deterministic Φctrl ensures consistent in-world behavior regardless of conversational variation.

5. Evaluation

AgentJam is a working prototype tested on macOS (Apple Silicon M2 Pro, 16 GB RAM) with real Minecraft Java clients. We evaluate the system across four dimensions (mentor reliability, speaking gate accuracy, spatial voice fidelity, and mission completion dynamics) and report results from 20+ test missions with simulated student speech of varying proficiency levels.

5.1 Comparison with Prior Language Learning Systems

Property | AgentJam | Other systems (Duolingo, ELSA Speak, ChatGPT Tutor, Scripted Game)
Embodied mentor in 3D world | ✓ Player-bot | ◐ NPC
Spatial voice (distance attenuation) | ✓ 1/d² |
Speaking gates (progress on speech) | ✓ G(u,f) formal | ◐ Pronunciation
Task-based communicative need | ✓ Mission beats | ◐ Conversation; ◐ Scripted
Dynamic (not scripted) interaction | ✓ LLM-driven |
Deterministic world behavior | ✓ Φctrl | ✓ N/A; ✓ N/A; ✓ N/A
Scaffolded (never hard-block) | ✓ 3-outcome G |
Pedagogical grounding | ✓ TBLT + output | ◐ Spaced rep; ◐ Phonetics

Table 1. Comparison of AgentJam with prior language learning systems across eight properties. AgentJam uniquely combines embodied presence, spatial voice, speaking gates with formal semantics, task-based communicative need, and dynamic LLM-driven interaction — while maintaining deterministic world behavior through the Φctrl decoupling.

5.2 Performance Benchmarks

Metric | Value | n | Notes
Mentor in-world reliability | 100% | 20 | No pathfinding errors, block misplacements, or erratic actions
Frame-matching accuracy | 92% (46/50) | 50 | Across three frames; 4% false negative, 4% false positive
Voice pipeline latency | 180–280 ms (mean 220 ms) | 50 | LLM token → TTS → audio output
Spatial panning correctness | 100% | 20 | Left/right orientation verified at all test positions
Mission completion time | 15–25 min (mean 19.2 min) | 10 | Full five-beat mission; varies with student proficiency
Because-gate duration | 1.8 min (mean) | 10 | Includes mentor modeling + student production
Request-gate duration | 1.2 min (mean) | 10 | Fastest gate; direct request frame
FirstThen-gate duration | 1.5 min (mean) | 10 | Sequencing requires planning articulation
Scaffolding reliability | 100% | 15 | Low-proficiency simulations: mentor scaffolded within 2 prompts
ASR robustness (native) | ~92% WER | 30 | Native English; frame detection unaffected by minor WER
ASR robustness (accented) | ~78% recall | 20 | Non-native accents reduce frame detection; known limitation

Table 2. Performance benchmarks for AgentJam across eleven metrics evaluated over 20+ test missions. The mentor achieves 100% in-world reliability through deterministic Φctrl. Frame matching reaches 92% accuracy. ASR robustness with accented speech (78%) represents the primary failure mode and area for improvement.
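A few of the figures in Table 2 can be cross-checked with simple arithmetic. The Python sketch below uses only numbers quoted in the table. The share of mission time spent at gates is a ratio of reported means, computed here for orientation; it is a derived figure that the report itself does not state.

```python
# Frame-matching accuracy: 46 correct out of 50 utterances.
total_utterances = 50
correct = 46
accuracy = correct / total_utterances          # 0.92

# 4% false negatives and 4% false positives of 50 utterances.
false_negatives = round(0.04 * total_utterances)
false_positives = round(0.04 * total_utterances)
errors = total_utterances - correct

# Mean time added by each speaking gate, in minutes.
gate_means = {"because": 1.8, "request": 1.2, "firstThen": 1.5}
gate_total = sum(gate_means.values())

# Ratio of summed gate means to the mean mission time (derived figure).
mean_mission = 19.2
gate_share = gate_total / mean_mission

if __name__ == "__main__":
    print(f"accuracy = {accuracy:.0%}")
    print(f"errors = {errors} (FN {false_negatives} + FP {false_positives})")
    print(f"gate time = {gate_total:.1f} min, share = {gate_share:.1%}")
```

The error counts are consistent (2 + 2 = 4 misclassified utterances), and the three gates together account for about 4.5 of the 19.2 mean minutes, a little under a quarter of a session.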

[Figure 3: Mentor state machine (beat transitions and two-stage action pipeline). Stage 1: θLLM (context + obs → ahigh ∈ Ahigh; DeepSeek, 9 actions). Stage 2: Φctrl (ahigh + positions → Minecraft action; Mineflayer, deterministic). Output: Minecraft world (movement, blocks, combat). Beat state machine (per mission): Briefing (no gate, set context); Cave (gate: because); Wood (gate: request); Shelter]