Multi-Agent Systems · Language Education · Minecraft · Voice AI · Autonomous Classrooms

Harry Edwards


Omo Research


FINAL — Omo Research Technical Report · June 2025


Spoken English Sessions: Autonomous Multi-Agent Language Classrooms in Minecraft

We present Spoken English Sessions, a system for fully autonomous 50–60 minute spoken-English classes in Minecraft supporting 4–6 students. An AI captain, the Captain Dyad (Game Agent ⊕ Teaching Agent), directs the session; peer AI bots model target language; and resource gaps engineered into the game world require student-to-student English communication to resolve. The architecture combines a deterministic Session Manager for orchestration, a text-only LLM Captain Dyad for dynamic direction with formal action spaces, and a voice bridge with streaming ASR (faster-whisper) and multi-voice TTS (MOSS-TTS-Nano, < 100M parameters). Per-student exponent scoring provides live, transparent assessment across three weighted dimensions. No human teacher is required during the session. End-to-end voice latency is 1.5–2.5 s; the SpeechBus serializes multi-agent audio output; and the Session Manager achieves 100% structural reliability across 10 test runs. All evaluation uses simulated student scenarios; human learner studies are planned for future work (Section 7.1).

Class Size: 4–6 students + AI peers

Duration: 50–60 minutes

Voice Latency: 1.5–2.5 s (end-to-end)

TTS Model: MOSS-TTS-Nano (< 100M)

Executive Summary

Spoken English Sessions is a fully autonomous AI-directed classroom that runs 50–60 minute spoken-English lessons for 4–6 students in Minecraft, with no human teacher required. An AI "Captain Dyad" (two specialized LLMs handling game actions and teaching respectively) orchestrates the session while AI peer bots hold critical items that real students need, creating "resource gaps" that force students to communicate with each other in English to progress. The system tracks each student's performance live through a weighted scoring formula that counts exchanges, frame accuracy, and task completion. Voice processing uses streaming speech recognition (faster-whisper) and multi-voice text-to-speech (MOSS-TTS-Nano), with a SpeechBus ensuring AI voices don't overlap. All evaluation currently uses simulated student scenarios; classroom studies with real learners are planned for future work.

Abstract


Background. Autonomous AI-directed classrooms represent a frontier in educational technology: can an AI system orchestrate a full language class — including briefing, task phases, peer interaction, scoring, and debrief — without a human teacher present? Existing AI tutoring systems provide one-on-one practice but strip away the social dynamics (peer interaction, group tasks, multi-party communication) that give classrooms their distinctive pedagogical value.


Problem. Scaling language speaking practice to classrooms is fundamentally limited by teacher availability. Human teachers cannot simultaneously monitor, assess, and provide feedback to 4–6 students engaged in spoken interaction. AI tutors address individual practice but cannot reproduce the communicative pressure that arises when a peer holds a resource another student needs — a dynamic we formalize as the resource gap mechanism (Section 3.1).

Approach. We introduce Spoken English Sessions, a working prototype that orchestrates autonomous 50–60 minute spoken-English classes for 4–6 students inside Minecraft. The architecture rests on three pillars: (1) a deterministic Session Manager serving as the single source of truth for session state, scores, and beat progression; (2) the Captain Dyad, a Game Agent ($A_{\text{game}}$, controlling Minecraft actions) coupled with a Teaching Agent ($A_{\text{teach}}$, managing pedagogy and scoring), both text-only LLMs communicating over a Captain Bridge at port :8768; and (3) a voice bridge with streaming ASR (faster-whisper) and multi-voice TTS (MOSS-TTS-Nano) in dedicated sidecars, with a SpeechBus serializer preventing audio overlap. Our core pedagogical innovation is peer-to-peer speaking gates: AI peer bots hold items that real students need, creating resource gaps that require student-to-student English communication to resolve. This extends the speaking-gate concept [10] from one-on-one mentoring to multi-student classrooms. Per-student exponent scoring, formalized as $S(p_i) = 0.4 \cdot X_i + 0.35 \cdot F_i + 0.25 \cdot C_i$, provides live, transparent assessment.

Results. The Session Manager reliably orchestrates beat transitions and scoring updates across full 50–60 minute sessions with 100% structural reliability (n = 10). End-to-end voice latency (student speech → ASR → LLM → TTS → audio) is 1.5–2.5 s (mean 1.9 s), decomposed as VAD (~100 ms) + ASR (~400 ms) + LLM (~800 ms) + TTS (~400 ms) + SpeechBus (~200 ms). Exchange detection matches ground truth in 94% of test cases; frame accuracy scoring agrees with human annotators at 88%. The SpeechBus successfully prevents audio overlap in all test scenarios.


Implications. Autonomous multi-agent classrooms represent a new category of educational technology — one where AI systems orchestrate peer interaction rather than replacing it. The Captain Dyad architecture and resource gap mechanism are generalizable to any domain where collaborative task completion can be engineered to require peer communication. All code is released as open source.

Keywords: autonomous classrooms, multi-agent systems, language education, Minecraft, voice AI, peer-to-peer learning

1. Introduction


Language classrooms face an irreducible scaling constraint: providing each student with sufficient speaking practice — the activity most strongly associated with language acquisition [12] — requires teacher attention that cannot be simultaneously distributed across students. A teacher can listen to one student at a time. In a class of 25, each student receives at most 2–3 minutes of individual speaking attention per hour. AI tutors can provide unlimited one-on-one practice [17], but they strip away the social dynamics that give classrooms their distinctive pedagogical value: peer interaction, negotiated meaning, group problem-solving, and the communicative urgency that arises when another person holds a resource you need.


Spoken English Sessions addresses both problems simultaneously — scaling and peer dynamics — through AI-orchestrated resource engineering. The system orchestrates a complete classroom of 4–6 human students alongside AI peer bots, entirely autonomously. The AI captain (the Captain Dyad) directs the session through briefing, task beats, a finale, and a debrief. AI peer bots model target language at an appropriate proficiency level and, critically, hold items that real students need. This engineered resource scarcity creates a gap that demands student-to-student English communication to resolve. The AI orchestrates the conditions for peer interaction; it does not participate in the communicative transaction itself — a design principle we formalize through the resource gap predicate in Section 3.1.


1.1 Related Work


Intelligent tutoring systems. VanLehn [14] demonstrated that well-designed ITS can approach human tutoring effectiveness, but these systems operate through screen-based interfaces with individual learners. Spoken English Sessions extends the ITS paradigm to multi-student classroom orchestration in an immersive 3D environment — shifting the unit of analysis from individual learner to classroom ecosystem.


Computer-supported collaborative learning (CSCL). Stahl et al. [11] established that peer interaction — not just individual task completion — drives learning in collaborative settings. Key CSCL findings relevant to our work include: (a) the importance of task structures that require genuine interdependence rather than optional collaboration [3]; (b) the role of shared artifacts in grounding communication [11]; and (c) the finding that engineered resource interdependence (jigsaw-style task designs) produces more equitable participation than unstructured group work. CSCL platforms (Knowledge Forum, CoFFEE, Collage) provide structured environments for peer knowledge construction but operate through text-based interfaces. Our system instantiates CSCL principles in a voice-driven, 3D game environment where communication produces immediate material consequences — obtaining items needed to continue playing. The resource gap mechanism directly operationalizes engineered interdependence from CSCL theory, but with AI-orchestrated rather than teacher-orchestrated gap assignment.


Multi-agent systems for education. Baylor and Kim [1] demonstrated that pedagogical agents playing distinct roles (expert, motivator, mentor) improve learning outcomes, but their agents were animated characters delivering scripted content. Subsequent work on multi-agent learning environments has explored agent teams for collaborative problem-solving and peer modeling, though primarily in screen-based, non-immersive settings. Our system extends this with LLM-driven agents that adapt dynamically to student behavior, combined with a resource-gap mechanism that shifts communication from student-to-agent to student-to-student. The Captain Dyad pattern — splitting orchestration between a Game Agent and a Teaching Agent — draws on the established finding that role specialization in pedagogical agent teams improves effectiveness [1], but applies it to autonomous rather than scripted agents.


Speaking gates. The speaking-gate concept was introduced in AgentJam [10] for one-on-one mentoring, formalized as G(utterance, fb) → {matched, unmatched, scaffold}. Spoken English Sessions extends this to multi-student classrooms via peer-to-peer speaking gates, where AI peers hold resources that real students must request from each other — creating communicative pressure between humans rather than between human and AI.


Voice AI for education. Commercial systems (ELSA Speak, Duolingo) use ASR for pronunciation assessment but focus on individual, scripted practice. Voice agent architectures [8, 9] have demonstrated low-latency conversational AI but in single-speaker contexts. Our system combines streaming ASR (faster-whisper) with multi-voice TTS (MOSS-TTS-Nano, distinct voices per speaker) for multi-party classroom interaction — the SpeechBus solves the "cacophony problem" that arises when multiple AI agents attempt to speak simultaneously.


1.2 Contributions


This paper makes the following contributions:

1. The Captain Dyad architectural pattern. A two-agent orchestration pattern (Game Agent $A_{\text{game}}$ ⊕ Teaching Agent $A_{\text{teach}}$) where both are text-only LLMs communicating over a typed-message Captain Bridge, with audio processing delegated to dedicated sidecar processes for independent scalability. Each agent maintains a focused context (world-state vs. student-assessment), with a shared coordination context preventing context pollution. This pattern generalizes beyond language education: any domain requiring simultaneous management of a virtual environment and a human-facing interaction could adopt the dyad structure.

2. Peer-to-peer speaking gates via resource gaps. A mechanism extending the speaking-gate concept from one-on-one mentoring (AgentJam [10]) to multi-student classrooms: AI peer bots hold items that real students need, creating resource gaps formalized through the gap predicate (Section 3.1). The key design principle is that the AI orchestrates the conditions for peer communication but does not participate in the communicative transaction — the exchange occurs between human learners.


3. The Session Manager with formal state machine. A deterministic runtime orchestrating the complete session lifecycle — player connections, voice bridges, scoring, state transitions, beat progression — as the system's single source of truth, achieving 100% structural reliability in test runs. The Session Manager enforces the structural scaffold on which the dynamic Captain Dyad operates, ensuring reliable session structure even when LLM content varies.


4. Multi-voice pipeline with SpeechBus. A voice architecture combining faster-whisper for streaming ASR, MOSS-TTS-Nano (< 100M parameters) for multi-voice TTS with distinct voices per speaker, and a SpeechBus serializer that prevents audio overlap in multi-agent scenarios through priority-sorted queuing (captain utterances > peer utterances).


2. System Architecture


The system is built on three pillars: a deterministic Session Manager (runtime, single source of truth), a dynamic Captain Dyad (text-only LLM: Game Agent ⊕ Teaching Agent), and a Voice Bridge (streaming ASR + multi-voice TTS sidecars). LLMs process text only; audio runs in dedicated processes.


2.1 Formal Model

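Let $P = \{p_1, \dots, p_k\}$ be the set of human students ($k \in [4, 6]$) and $B = \{b_1, \dots, b_m\}$ be the set of AI peer bots. The session is defined as a formal structure:

$$
\text{Session} = (\text{state},\ \text{beat},\ \text{scores},\ \text{clock},\ \text{bridge}_{\text{cap}},\ \text{bridge}_{\text{voice}})
$$

where

$$
\begin{aligned}
\text{state} &\in \{\text{idle}, \text{briefing}, \text{in\_progress}, \text{finale}, \text{debrief}, \text{ended}\} \\
\text{beat} &\in \mathbb{N} \quad \text{(current beat index, 0-indexed)} \\
\text{scores} &: P \to \mathbb{R}_{\ge 0} \quad \text{(per-student exponent score)} \\
\text{clock} &\in \mathbb{R}_{\ge 0} \quad \text{(session elapsed time in seconds)}
\end{aligned}
$$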

The Captain Dyad consists of two specialized LLM agents sharing context over the Captain Bridge:

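$$
\text{Captain} = (A_{\text{game}},\ A_{\text{teach}},\ \text{ctx}_{\text{shared}})
$$

$$
\begin{aligned}
A_{\text{game}} &: \text{ctx}_{\text{shared}} \times \text{obs}_{\text{world}} \to A_{\text{minecraft}} \\
A_{\text{teach}} &: \text{ctx}_{\text{shared}} \times \text{obs}_{\text{students}} \times \text{scores} \to A_{\text{pedagogy}}
\end{aligned}
$$

where

$$
\begin{aligned}
A_{\text{minecraft}} &= \{\texttt{MOVE\_BOT}, \texttt{PLACE\_BLOCK}, \texttt{GIVE\_ITEM}, \texttt{TELEPORT}, \texttt{BUILD\_STRUCTURE}, \texttt{ANNOUNCE\_GAP}\} \\
A_{\text{pedagogy}} &= \{\texttt{START\_BEAT}, \texttt{END\_BEAT}, \texttt{UPDATE\_SCORE}, \texttt{ANNOUNCE}, \texttt{PROMPT\_STUDENT}, \texttt{MODEL\_LANGUAGE}, \texttt{MANAGE\_PACING}\}
\end{aligned}
$$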

Communication over the Captain Bridge at port :8768 uses JSON-serialized typed messages:

$$
M_{\text{bridge}} \in \{\texttt{BEAT\_TRANSITION}, \texttt{SCORE\_UPDATE}, \texttt{PEER\_ACTION}, \texttt{STUDENT\_EVENT}, \texttt{CAPTAIN\_SPEECH}, \texttt{PEER\_SPEECH}, \texttt{RESOURCE\_GAP\_CREATE}, \texttt{RESOURCE\_GAP\_RESOLVE}\}
$$
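The typed, JSON-serialized bridge messages can be illustrated with a minimal Python sketch. Only the eight message type names come from the formal model; the `BridgeMessage` class, its `payload` field, and the example payload keys are hypothetical and are not taken from the released code.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class MessageType(str, Enum):
    """The eight bridge message types listed in the formal model."""
    BEAT_TRANSITION = "BEAT_TRANSITION"
    SCORE_UPDATE = "SCORE_UPDATE"
    PEER_ACTION = "PEER_ACTION"
    STUDENT_EVENT = "STUDENT_EVENT"
    CAPTAIN_SPEECH = "CAPTAIN_SPEECH"
    PEER_SPEECH = "PEER_SPEECH"
    RESOURCE_GAP_CREATE = "RESOURCE_GAP_CREATE"
    RESOURCE_GAP_RESOLVE = "RESOURCE_GAP_RESOLVE"


@dataclass
class BridgeMessage:
    type: MessageType
    payload: dict = field(default_factory=dict)  # payload schema is hypothetical

    def to_json(self) -> str:
        """Serialize to the JSON form sent over the bridge."""
        return json.dumps({"type": self.type.value, "payload": self.payload})

    @classmethod
    def from_json(cls, raw: str) -> "BridgeMessage":
        """Parse a message; an undeclared type raises ValueError."""
        data = json.loads(raw)
        return cls(MessageType(data["type"]), data.get("payload", {}))


# Example: announce that Student A lacks wood held by Student C.
msg = BridgeMessage(MessageType.RESOURCE_GAP_CREATE,
                    {"student": "A", "holder": "C", "item": "wood"})
round_tripped = BridgeMessage.from_json(msg.to_json())
```

Restricting the `type` field to a closed enumeration means a malformed or unexpected message is rejected at parse time rather than reaching either agent.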
[Figure 1: Session Manager + Captain Dyad + Voice Bridge. Panels: Session Manager (single source of truth: deterministic runtime, state, scores, player manager, beat timer); Captain Dyad (text-only LLM: $A_{\text{game}}$ for Minecraft actions, $A_{\text{teach}}$ for pedagogy and scoring); Voice Bridge (faster-whisper ASR, MOSS-TTS-Nano, SpeechBus with distinct voices); session beat flow over 50–60 minutes: Briefing (5 min) → 3–4 task beats with peer gates → Finale (5–10 min) → Debrief (5 min).]

Figure 1. System architecture. The Session Manager (deterministic runtime) orchestrates the Captain Dyad (text-only LLM: $A_{\text{game}}$ ⊕ $A_{\text{teach}}$) and the Voice Bridge (streaming ASR + multi-voice TTS). The Captain Bridge at :8768 carries all LLM communication. Session flow: Briefing → Task Beats with peer gates → Finale → Debrief.

2.2 Component Details


Session Manager


Deterministic Runtime

The Session Manager is the single source of truth for all session data. It orchestrates the complete lifecycle: player connections and disconnections, voice bridge activation, the scoring pipeline, session state transitions (idle → briefing → in_progress → finale → debrief → ended), and beat progression. It is implemented as deterministic TypeScript code with no LLM calls and no stochastic behavior, and it achieved 100% structural reliability across 10 test runs.

Session State · Player Manager · Scoring Pipeline
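The deterministic lifecycle can be sketched as a small state machine. This is an illustrative Python sketch rather than the TypeScript implementation described above; the `Session` class, its `advance` method, and the `num_beats` parameter are hypothetical names, and only the state names and their order come from the formal model.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "idle"
    BRIEFING = "briefing"
    IN_PROGRESS = "in_progress"
    FINALE = "finale"
    DEBRIEF = "debrief"
    ENDED = "ended"


# Fixed linear order of states; no branching and no randomness.
ORDER = [State.IDLE, State.BRIEFING, State.IN_PROGRESS,
         State.FINALE, State.DEBRIEF, State.ENDED]


@dataclass
class Session:
    num_beats: int = 3          # the report describes 3-4 task beats
    state: State = State.IDLE
    beat: int = 0               # 0-indexed, as in the formal model

    def advance(self) -> State:
        """Move to the next beat or the next state, deterministically."""
        if self.state is State.ENDED:
            raise RuntimeError("session already ended")
        # While in_progress, step through the beats before leaving the state.
        if self.state is State.IN_PROGRESS and self.beat < self.num_beats - 1:
            self.beat += 1
            return self.state
        self.state = ORDER[ORDER.index(self.state) + 1]
        return self.state


session = Session(num_beats=3)
trace = []
while session.state is not State.ENDED:
    session.advance()
    trace.append((session.state.value, session.beat))
```

Because `advance` depends only on the current state and beat index, the same sequence of calls always produces the same trace, which is the property that a structural-reliability test can check.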

Captain Dyad

$A_{\text{game}}$ ⊕ $A_{\text{teach}}$

The Captain Dyad pairs two specialized text-only LLM agents. $A_{\text{game}}$ controls Minecraft actions: peer bot movement, item distribution, building, and environment manipulation (6 action types). $A_{\text{teach}}$ manages pedagogy, scoring decisions, beat pacing, and session narrative (7 action types). Both communicate over the Captain Bridge at :8768. The text-only design keeps inference fast (roughly 800 ms on average with Gemini Flash, Table 2) and fully inspectable through logs.

$A_{\text{game}}$ · $A_{\text{teach}}$ · Text-only LLM

Voice Bridge


ASR + TTS Sidecars


Audio processing runs in dedicated sidecar processes, separate from LLM reasoning. Streaming ASR via faster-whisper transcribes student speech in real time with VAD gating (min 300 ms utterance). MOSS-TTS-Nano (< 100M parameters) generates speech with distinct voices per speaker. The SpeechBus serializes output — only one voice plays at a time, preventing cacophony through priority-sorted queuing (captain > peers).

faster-whisper · MOSS-TTS-Nano · SpeechBus

2.3 Design Rationale


Why text-only LLMs in the Captain Dyad? The decision keeps inference fast and inspectable. Text-only LLMs (Gemini Flash) produce sub-second responses for the short decision prompts used by both agents. Every pedagogical decision and game action is traceable through text logs, enabling post-session analysis and debugging. The reasoning pipeline (LLM) and audio pipeline (ASR/TTS) can be upgraded independently — we verified this by swapping between Gemini Flash and DeepSeek with zero changes to the voice bridge.

Why a dyad rather than a single agent? A single agent managing both Minecraft actions and pedagogical decisions faces a context-switching cost: world-state context and student-assessment context differ substantially. $A_{\text{game}}$ maintains a focused world-state context (bot positions, block states, item distribution) while $A_{\text{teach}}$ maintains a focused student-assessment context (exchange counts, frame accuracy, participation distribution). The shared context $\text{ctx}_{\text{shared}}$ carries only coordination data (current beat, resource gap status, session phase), preventing context pollution.

Why deterministic Session Manager? The Session Manager enforces the beat structure deterministically — the session always follows Briefing → Beats → Finale → Debrief with fixed time windows. This ensures that even if the Captain Dyad produces suboptimal output (off-topic responses, delayed scoring in ~15% of interactions), the session maintains structural integrity. Students know what to expect: the structure is reliable even when LLM content varies. This hybrid approach — deterministic structure, dynamic content — mirrors effective human teaching where lesson plans provide reliable structure while the teacher adapts delivery.


3. Peer-to-Peer Pedagogy


The core pedagogical contribution shifts communicative pressure from student↔AI to student↔student. AI peer bots model language and create resource gaps, but real students must speak to each other to progress.

[Figure 2: Resource gap mechanism, student↔student communication. Student A (inventory: stone, torch; missing: wood) asks Student C (inventory: wood × 8) "Can you give me wood?" after an AI peer bot announces "Student C has extra wood!". $A_{\text{teach}}$ assigns items such that each student lacks at least one item held by a peer; the peer bot releases its item after a valid English exchange is detected. Peer bot functions: language modeling, resource holding, participation balancing.]

Figure 2. Resource gap mechanism. The Teaching Agent ($A_{\text{teach}}$) assigns items such that Student A lacks wood that Student C holds. A peer bot announces the gap. Student A must walk to Student C and produce a spoken English request. The Session Manager detects the exchange and updates scores. Three peer bot functions (language modeling, resource holding, participation balancing) operate in concert.

3.1 The Resource Gap Mechanism

In one-on-one mentoring (AgentJam [10]), the AI mentor holds items the student needs, and the speaking gate requires the student to request them using target language. This model is effective for individual practice but does not scale to classrooms: in a multi-student setting, students should speak to each other, not just to AI agents. Our solution is the resource gap: $A_{\text{teach}}$ distributes required items among students unevenly and directs peer bots to announce who holds what.

Formally, let $\text{inv}(p)$ be the inventory of participant $p$ and $\text{req}(p_i, \text{beat}_j)$ be the set of items student $p_i$ needs to complete beat $j$. A resource gap exists when:

$$
\text{gap}(p_i, p_l, \text{item}) \iff \text{item} \in \text{req}(p_i, \text{beat}_j) \,\land\, \text{item} \notin \text{inv}(p_i) \,\land\, \text{item} \in \text{inv}(p_l)
$$

The Teaching Agent solves the assignment:

$$
\forall p_i \in P,\ \exists p_l \in P \setminus \{p_i\},\ \exists\, \text{item} : \text{gap}(p_i, p_l, \text{item})
$$

This constraint ensures every student must communicate with at least one peer.

A typical scenario: Student A needs wood to build a wall. Peer bot B1 announces "Student C has extra wood. You should ask C for it." Student A walks to Student C and produces a spoken English request. Student C (a real human) responds. The Session Manager tracks exchange completion through frame matching and updates scores. Peer bots hold additional items; these release only after detecting a valid peer exchange. This creates genuine communicative need between real students — the AI orchestrates the situation but does not participate in the communicative transaction.
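The gap predicate and the assignment constraint translate directly into set operations. The following Python sketch is illustrative only; the function names `gap` and `every_student_has_gap` and the dictionary-based representation of requirements and inventories are assumptions, not the system's data model.

```python
def gap(student, peer, item, required, inventory):
    """gap(p_i, p_l, item): the student needs the item, lacks it, and the peer holds it."""
    return (
        student != peer
        and item in required[student]
        and item not in inventory[student]
        and item in inventory[peer]
    )


def every_student_has_gap(students, required, inventory):
    """Assignment constraint: each student has at least one gap with some other student."""
    return all(
        any(
            gap(s, p, item, required, inventory)
            for p in students
            for item in required[s]
        )
        for s in students
    )


# Scenario from the text: Student A needs wood, which Student C holds.
students = ["A", "C"]
required = {"A": {"wood"}, "C": {"torch"}}
inventory = {"A": {"stone", "torch"}, "C": {"wood"}}

a_needs_c = gap("A", "C", "wood", required, inventory)
constraint_holds = every_student_has_gap(students, required, inventory)

# Once A obtains wood, A's gap closes and the constraint no longer holds for this beat.
inventory_after = {"A": {"stone", "torch", "wood"}, "C": {"wood"}}
constraint_after = every_student_has_gap(students, required, inventory_after)
```

Checking the constraint after each item assignment gives a cheap guard that no student can finish a beat without at least one peer exchange.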


3.2 Per-Student Exponent Scoring

Each student receives a live exponent score updating continuously throughout the session. The score is a real-time metric, not a post-hoc grade. $A_{\text{teach}}$ computes scores as a weighted combination:

$$
S(p_i) = 0.4 \cdot X_i + 0.35 \cdot F_i + 0.25 \cdot C_i
$$

where

$$
\begin{aligned}
X_i &= \text{normalized exchange count (peer-to-peer transactions)} \\
F_i &= \text{frame accuracy (correct usage of target frames)} \\
C_i &= \text{task completion (mission beats completed with communication weighting)}
\end{aligned}
$$
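As a worked example of the weighted combination, the sketch below computes a score from the three components. It assumes each component lies in [0, 1] and that the exchange count is normalized against a per-session target; the report does not specify the normalization, so `normalize_exchanges` and its `target` parameter are hypothetical.

```python
WEIGHTS = {"exchange": 0.4, "frame": 0.35, "completion": 0.25}


def normalize_exchanges(count, target):
    """Hypothetical normalization: fraction of a target exchange count, capped at 1."""
    if target <= 0:
        raise ValueError("target must be positive")
    return min(count / target, 1.0)


def exponent_score(x, f, c):
    """S(p_i) = 0.4 * X_i + 0.35 * F_i + 0.25 * C_i, with components assumed in [0, 1]."""
    for name, value in (("x", x), ("f", f), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (WEIGHTS["exchange"] * x
            + WEIGHTS["frame"] * f
            + WEIGHTS["completion"] * c)


# Example: 6 of 8 target exchanges, 80% frame accuracy, half the beats completed.
x = normalize_exchanges(count=6, target=8)   # 0.75
score = exponent_score(x, f=0.8, c=0.5)      # 0.30 + 0.28 + 0.125
```

With these inputs the exchange term contributes the most to the score, reflecting the weighting's emphasis on peer-to-peer communication over task completion.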

Metric 1

Exchange Count (0.4)

Number of successful English exchanges with other students. Each peer-to-peer transaction — request, response, acknowledgment — increments the count. Detected via frame matching on transcribed speech by the Session Manager.


Metric 2

Frame Accuracy (0.35)

Correct usage of target grammatical frames (because, request, firstThen). The frame matcher detects usage in transcribed speech. Close approximations receive partial credit; exact matches receive full credit.


Metric 3

Task Completion (0.25)

Successful completion of mission beats. Communication-weighted: finishing a beat through peer exchanges scores higher than completing it through individual resource collection. Encourages collaborative over solo play.
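Frame matching with partial credit, as described for Metric 2, could look like the following sketch. The regular expressions, the 0.5 partial-credit value, and the function names are all assumptions made for illustration; the report names the three frames (because, request, firstThen) but does not give the matcher's rules.

```python
import re

# For each frame: (exact pattern, approximate pattern). All patterns are hypothetical.
FRAMES = {
    "request": (
        re.compile(r"\b(can|could|may)\s+(you|i)\b.*\b(give|have|pass|share)\b"),
        re.compile(r"\b(give|want|need)\b"),
    ),
    "because": (
        re.compile(r"\bbecause\b"),
        re.compile(r"\b(cause|cuz)\b"),
    ),
    "firstThen": (
        re.compile(r"\bfirst\b.*\bthen\b"),
        re.compile(r"\b(first|then)\b"),
    ),
}


def frame_credit(utterance, frame):
    """Return 1.0 for an exact frame match, 0.5 for an approximation, 0.0 otherwise."""
    exact, approximate = FRAMES[frame]
    text = utterance.lower()
    if exact.search(text):
        return 1.0
    if approximate.search(text):
        return 0.5
    return 0.0


def frame_accuracy(utterances, frame):
    """Mean credit over a student's transcribed utterances for one target frame."""
    if not utterances:
        return 0.0
    return sum(frame_credit(u, frame) for u in utterances) / len(utterances)


accuracy = frame_accuracy(
    ["Can you give me wood?", "Give me wood", "Wood!"], "request"
)
```

A rule-based matcher of this kind operates on ASR transcripts, so its accuracy is bounded by transcription quality for the speaker in question.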


3.3 Peer Bots as Language Models and Resource Holders

AI peer bots serve three functions. Language modeling: bots speak at a proficiency level calibrated to the students, demonstrating target grammatical frames in natural conversation. Resource holding: bots carry items that students need but only release them after detecting valid peer exchanges; they are the mechanism through which resource gaps are instrumented. Participation balancing: $A_{\text{teach}}$ directs peer bots to engage quieter students, redistributing items or prompting specific students to speak, ensuring equitable speaking opportunities. Peer bots have distinct voices via MOSS-TTS-Nano, making them audibly distinguishable from the captain and from each other.

4. The Voice Pipeline

Voice processing runs in dedicated sidecar processes, separate from LLM reasoning. Streaming ASR via faster-whisper transcribes student speech, multi-voice TTS via MOSS-TTS-Nano gives each speaker a distinct voice, and the SpeechBus serializes output through priority-sorted queuing.

4.1 Streaming ASR


Student speech is captured via Simple Voice Chat, a Minecraft voice mod providing proximity-based audio. The audio stream feeds into faster-whisper [13], an optimized CTranslate2-based implementation of OpenAI's Whisper model running locally on consumer hardware at approximately 3× real-time on Apple Silicon M2 Pro, with word-level timestamps enabling real-time speaker identity detection and utterance boundary identification.

Voice activity detection (VAD) gates the ASR pipeline: audio processes only when speech is detected above a configurable energy threshold with a minimum utterance duration of 300 ms. This reduces computational load by an estimated 60–70% during silent periods. Transcribed text forwards to the Captain Bridge at :8768, where $A_{\text{teach}}$ processes it for frame matching, scoring, and pedagogical adaptation.
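The gating rule (energy threshold plus a 300 ms minimum duration) can be sketched over a sequence of per-frame energy values. This is a simplified illustration, not the system's VAD: the 20 ms frame length, the default threshold, and the function name `gate_utterances` are assumptions.

```python
def gate_utterances(energies, frame_ms=20, threshold=0.01, min_ms=300):
    """Return (start_ms, end_ms) spans where energy stays above threshold for >= min_ms.

    energies: per-frame energy values, one per frame_ms of audio (hypothetical input).
    """
    segments = []
    start = None  # index of the first frame of the current candidate utterance
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i
        elif start is not None:
            # Speech just ended: keep the span only if it is long enough.
            if (i - start) * frame_ms >= min_ms:
                segments.append((start * frame_ms, i * frame_ms))
            start = None
    # Handle speech that runs to the end of the buffer.
    if start is not None and (len(energies) - start) * frame_ms >= min_ms:
        segments.append((start * frame_ms, len(energies) * frame_ms))
    return segments


# 100 ms silence, 400 ms speech, 100 ms silence, 100 ms blip, 100 ms silence.
energies = [0.0] * 5 + [0.5] * 20 + [0.0] * 5 + [0.5] * 5 + [0.0] * 5
spans = gate_utterances(energies)
```

Only the 400 ms span is forwarded to ASR in this example; the 100 ms blip is discarded, which is how short noises avoid triggering transcription.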


4.2 Multi-Voice TTS and the SpeechBus


Spoken output uses MOSS-TTS-Nano, a lightweight TTS model with fewer than 100 million parameters supporting multiple distinct voices through speaker embedding conditioning. Each speaker — the captain and each peer bot — receives a unique voice embedding, enabling students to identify speakers without visual cues.


The SpeechBus serializes all spoken output through a priority-sorted buffer. Because multiple agents may attempt to speak simultaneously, the SpeechBus queues utterances and plays them sequentially. Captain speech receives priority over peer bot speech; within the same priority level, utterances play in FIFO order. Let Q be the SpeechBus queue with ordering <: captain utterances < peer utterances, and within each class, u1 < u2 iff arrival(u1) < arrival(u2). This prevents audio overlap — the "cacophony problem" — while maintaining natural conversational flow through rapid turn-taking. The SpeechBus successfully serialized all utterances without clipping in test scenarios with 3 simultaneous speaking requests.
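The ordering described above (captain before peers, FIFO within each class) maps naturally onto a binary heap keyed by priority class and arrival order. The sketch below is a minimal illustration using Python's standard `heapq`; the `SpeechBus` class shown here and its method names are hypothetical and do not reproduce the system's implementation, which also handles audio playback.

```python
import heapq
import itertools

CAPTAIN, PEER = 0, 1  # lower value plays first


class SpeechBus:
    """Serializes utterances: captain before peers, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing arrival index

    def submit(self, speaker_class, speaker, text):
        # The (class, arrival) prefix is unique, so ties never compare the strings.
        heapq.heappush(self._heap, (speaker_class, next(self._arrival), speaker, text))

    def next_utterance(self):
        """Pop the next utterance to play, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, speaker, text = heapq.heappop(self._heap)
        return speaker, text

    def drain(self):
        """Return all queued utterances in playback order."""
        played = []
        while self._heap:
            played.append(self.next_utterance())
        return played


# Three simultaneous speaking requests, as in the test scenario.
bus = SpeechBus()
bus.submit(PEER, "peer_b1", "Student C has extra wood.")
bus.submit(PEER, "peer_b2", "I am building the roof.")
bus.submit(CAPTAIN, "captain", "Two minutes left in this beat.")
order = [speaker for speaker, _ in bus.drain()]
```

The arrival counter is what guarantees FIFO order within a priority class, since a heap alone does not preserve insertion order among equal keys.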

[Figure 3: Session state machine and per-student scoring pipeline. States: idle (waiting) → briefing (~5 min) → in_progress (3–4 beats) → finale (5–10 min) → debrief (~5 min). Scoring pipeline: student speech → faster-whisper ASR → frame match ($X_i$, $F_i$ detected) → $A_{\text{teach}}$ computes $S(p_i)$ → Session Manager (single source of truth). Latency: VAD (~100 ms) → ASR (~400 ms) → LLM (~800 ms) → TTS (~400 ms) → SpeechBus (~200 ms) = 1.5–2.5 s end-to-end; score updates propagate with < 100 ms latency from detection to Session Manager display.]

Figure 3. Session state machine (idle → briefing → in_progress → finale → debrief → ended) enforced deterministically by the Session Manager, and the per-student scoring pipeline: student speech → ASR → frame match → $A_{\text{teach}}$ score update → Session Manager. The latency decomposition shows the 1.5–2.5 s end-to-end voice pipeline, with score updates propagating in < 100 ms.

5. Evaluation


Spoken English Sessions is a working prototype tested with simulated multi-student scenarios on macOS (Apple Silicon M2 Pro, 16GB RAM). We evaluate across five dimensions with 10+ test runs per dimension where applicable.


5.1 Comparison with Prior Classroom/Tutoring Systems

Property | Spoken English Sessions | AgentJam [10] | ITS [14] | Duolingo Classroom | Human Classroom
Multi-student orchestration | ✓ 4–6 students | ✗ 1:1 only | ✗ 1:1 only | ◐ Group tracking
Autonomous (no teacher) | ✓ Full session
Peer-to-peer gates | ✓ Resource gaps | ✗ AI↔student | ◐ Group work
3D immersive environment | ✓ Minecraft | ✓ Minecraft | ✗ Screen | ✗ Screen | ◐ Physical room
Spatial multi-voice TTS | ✓ Distinct voices | ◐ Single voice | ✓ Natural
Live per-student scoring | ✓ Exponent $S(p_i)$ | ◐ Delayed
Deterministic session flow | ✓ Session Manager | ◐ Beat structure | ◐ Lesson plan
Text-only LLM + audio sidecars | ✓ Independent scaling | ✗ N/A

Table 1. Comparison across eight properties. Spoken English Sessions uniquely combines multi-student orchestration, full autonomy, peer-to-peer speaking gates via resource gaps, live exponent scoring, and a deterministic session flow — extending the AgentJam approach into multi-student classroom scenarios.


5.2 Performance Benchmarks

| Metric | Value | n | Notes |
|---|---|---|---|
| Session structural reliability | 100% | 10 | All phases completed; no structural failures |
| Voice latency (end-to-end) | 1.5–2.5 s (mean 1.9 s) | 50 | VAD + ASR + LLM + TTS + SpeechBus |
| Exchange detection accuracy | 94% | 50 | vs. ground-truth scripted exchanges |
| Frame scoring agreement (human) | 88% | 40 | vs. human annotator judgments |
| Score update latency | < 100 ms | 30 | Detection → Session Manager display |
| SpeechBus overlap prevention | 100% | 20 | 3 simultaneous requests; no clipping |
| ASR latency (faster-whisper) | ~400 ms | 50 | Per utterance; ~3× real-time |
| TTS rendering (MOSS-TTS-Nano) | ~400 ms | 50 | Per utterance; < 100M parameters |
| LLM inference (Gemini Flash) | ~800 ms (mean) | 50 | $A_{\text{teach}}$ pedagogical decisions |
| Frame detection (native English) | 89% | 30 | Target frame recall in simulated exchanges |
| Frame detection (accented English) | ~76% | 20 | Non-native accents; primary failure mode |

Table 2. Performance benchmarks across eleven metrics. The Session Manager achieves 100% structural reliability. End-to-end voice latency of 1.9 s (mean) is acceptable for classroom interaction. Accented speech robustness (76% frame detection) represents the primary area for improvement.
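The reported mean latency can be checked against the per-stage figures with simple arithmetic. The sketch below sums the approximate stage values quoted in the paper; the dictionary and function names are illustrative, and the stage values are the rounded means given in the text rather than raw measurements.

```python
# Approximate per-stage latencies in milliseconds, as quoted in the report.
STAGE_MS = {"VAD": 100, "ASR": 400, "LLM": 800, "TTS": 400, "SpeechBus": 200}


def summarize(stage_ms):
    """Return (total latency in seconds, slowest stage, that stage's share of the total)."""
    total_ms = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return total_ms / 1000, slowest, stage_ms[slowest] / total_ms


total_s, slowest_stage, slowest_share = summarize(STAGE_MS)
```

The stage means sum to the reported 1.9 s mean, with LLM inference as the largest single contributor at roughly 42% of the budget.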


5.3 Limitations


No human learner studies. All evaluation uses simulated student speech, scripted scenarios, and author-operated test runs. We have not conducted experiments with real language learners in classroom settings. This means: (a) exchange detection accuracy (94%) was measured against scripted exchanges, not spontaneous learner interactions; (b) frame detection rates (89% native, 76% accented) come from controlled test utterances, not authentic classroom speech with overlapping talk, background noise, and conversational disfluencies; (c) the resource gap mechanism's claim to produce "genuine communicative need" is architecturally motivated but untested — we do not know whether students perceive engineered resource scarcity as genuine or artificial; and (d) all pedagogical claims rest on CSCL and TBLT theory rather than empirical outcomes from classroom deployment.

Scoring calibration untested. The exponent scoring weights (0.4 exchange count, 0.35 frame accuracy, 0.25 task completion) were chosen based on pedagogical principles, prioritizing communicative exchange over individual task completion, but have not been calibrated against external proficiency measures or human teacher judgments. The 88% agreement with human annotators (Table 2) refers to frame detection agreement, not overall score validity. The scoring formula $S(p_i) = 0.4 \cdot X_i + 0.35 \cdot F_i + 0.25 \cdot C_i$ should be validated against standardized speaking assessments (e.g., IELTS Speaking, TOEFL Speaking) or teacher-assigned grades before use in high-stakes educational contexts.

Cooperative student assumption. The resource gap mechanism assumes cooperative behavior. Griefing, refusal to speak, item trading without English communication, or one student dominating all exchanges are not detected or managed. Real classroom deployment requires behavioral moderation mechanisms — potentially including the Session Manager flagging non-participating students and the Teaching Agent re-engineering resource gaps to force participation from quieter students.


ASR robustness ceiling. faster-whisper WER increases substantially with non-native English accents (from ~8% native to 22–28% non-native), reducing frame detection accuracy to approximately 76% for accented speakers (Table 2). This creates an equity problem: students with stronger non-native accents receive less accurate assessment and potentially less effective scaffolding, widening rather than narrowing proficiency gaps.