Education Minecraft Multi-agent Voice

Spoken English Sessions

Autonomous AI Classes in Minecraft

Students speak. The captain listens, scores, and orchestrates. AI peers model the target language. Real English, real Minecraft.


Class Size

4–6 students

Duration

50–60 min

Scoring

Per-student exponent

Voice

Always-on ASR + TTS

Spoken English Sessions are fully autonomous AI-directed classes that run inside Minecraft. An AI captain directs the entire session — briefing, task beats, finale, and debrief — while peer AI bots model the target language and create resource gaps that require student↔student English communication. Every student has an always-on microphone. The captain listens, scores, and adapts in real time.

The system uses a dyad architecture: a Game Agent handles Minecraft actions while a Teaching Agent manages pedagogy, scoring, and session flow. A text-only LLM keeps reasoning fast and cheap. Audio is handled by sidecars: faster-whisper for streaming ASR and MOSS-TTS-Nano for multi-voice output, with a distinct voice per speaker.

Features

Autonomous Captain

AI Director

The AI captain runs the full class autonomously — briefing students, assigning task beats, orchestrating the finale, and leading the debrief. No human teacher required during the session.

Peer AI Bots

Language Models

Peer AI bots model the target language at an appropriate level. They create resource gaps by holding items other students need, so students must talk to each other in English to resolve them.
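A resource gap can be pictured as an inventory deal in which nobody starts with the item they need. The sketch below is only an illustration under stated assumptions: `Participant`, `deal_with_gaps`, `missing_items`, and the "give it to the next participant" rule are all hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    name: str
    is_bot: bool
    inventory: set = field(default_factory=set)


def deal_with_gaps(participants, needs):
    """Place each needed item with someone other than the one who needs it.

    `needs` maps a participant name to the item that participant must obtain.
    Assumed rule: the item goes to the next participant in the list, so it can
    only be obtained by asking. Requires at least two participants.
    """
    by_name = {p.name: p for p in participants}
    order = [p.name for p in participants]
    for name, item in needs.items():
        holder = order[(order.index(name) + 1) % len(order)]
        by_name[holder].inventory.add(item)
    return by_name


def missing_items(by_name, needs):
    """Return the items each participant still has to ask for."""
    return {n: item for n, item in needs.items() if item not in by_name[n].inventory}
```

With two students and one peer bot, a student who needs a torch would find it in a classmate's inventory, and a student who needs a pickaxe would find it with the bot.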

Real Voice

Always-On

Always-on microphone with voice activity detection. Streaming ASR via faster-whisper. TTS with distinct voices per speaker via MOSS-TTS-Nano. Students speak naturally — the system listens continuously.
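The page does not say which voice activity detector is used, so the following is a generic energy-threshold stand-in rather than the project's detector. It shows how an always-on stream could be cut into utterances before they reach ASR. The function names, the `threshold` value, and the `hangover` count are assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of PCM samples (floats in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(frames, threshold=0.02, hangover=2):
    """Yield (start, end) frame indices of speech; `end` is exclusive.

    A segment opens on the first frame at or above `threshold` and closes after
    `hangover` consecutive quiet frames, so short pauses do not split an utterance.
    """
    start = None
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # Trim the trailing quiet frames from the segment.
                yield (start, i - quiet + 1)
                start, quiet = None, 0
    if start is not None:
        yield (start, len(frames) - quiet)
```

Each yielded segment would then be handed to the ASR sidecar; a production system would more likely rely on a trained VAD model than on raw energy.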

Live Scoring

Per-Student Exponent

Per-student exponent scoring is updated in real time. Peer-exchange gates require English communication between students. The runtime is the single source of truth, so there is no post-hoc grading.
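How a peer-exchange gate is evaluated is not specified here, so the sketch below is a hypothetical reading: a gate opens once enough English exchanges between two human students have been logged. `Exchange`, `gate_open`, and the `required` parameter are illustrative names, and the exponent scoring formula is deliberately left out because the page does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exchange:
    speaker: str
    listener: str
    in_english: bool


def gate_open(exchanges, students, required=1):
    """Return True once enough English student-to-student exchanges are logged.

    Assumption: exchanges with a bot or the captain do not count toward the gate.
    """
    count = sum(
        1
        for e in exchanges
        if e.in_english
        and e.speaker in students
        and e.listener in students
        and e.speaker != e.listener
    )
    return count >= required
```

Under this reading, a student talking only to a peer bot would not open the gate, which matches the stated goal of requiring communication between students.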

Architecture

Session Manager + Captain Dyad + Voice Bridge. Text-only LLM for reasoning. Audio handled by sidecars. MOSS-TTS-Nano for multi-voice output.

Session Manager

Runtime

Orchestrates the entire session lifecycle — player connections, voice bridges, scoring pipeline, and session state. The single source of truth for all session data: scores, exchanges, and runtime events.
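One common way to make a runtime the single source of truth is to keep an append-only event log and derive everything else from it. The sketch below assumes that design; `SessionState`, its event fields, and the additive point totals are hypothetical placeholders, not the project's data model or its exponent scoring.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Holds every runtime event; scores are derived from the log on demand."""

    events: list = field(default_factory=list)

    def record(self, kind, student, points=0.0):
        # Append only: earlier events are never edited or removed.
        self.events.append({"kind": kind, "student": student, "points": points})

    def scores(self):
        # Placeholder aggregation: sums points from "score" events per student.
        totals = defaultdict(float)
        for e in self.events:
            if e["kind"] == "score":
                totals[e["student"]] += e["points"]
        return dict(totals)
```

Because scores are recomputed from the log, any report produced after the session sees exactly what the runtime saw during it.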

Session State Player Manager Scoring Pipeline

Captain Dyad

Game + Teaching Agent

Two specialized LLM agents: the Game Agent controls Minecraft actions (movement, building, item distribution), and the Teaching Agent manages pedagogy, scoring decisions, and session pacing. Both are text-only, which keeps them fast and cheap.
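The split between the two agents can be sketched with a stubbed text-only model. The page does not state how the agents communicate, so the order used here (the Teaching Agent decides, then the Game Agent acts) is an assumption, as are the class methods, the prompt prefixes, and `captain_step`.

```python
from typing import Callable

# Text in, text out. Any text-only model could sit behind this signature.
LLM = Callable[[str], str]


class TeachingAgent:
    def __init__(self, llm: LLM):
        self.llm = llm

    def next_beat(self, transcript: str) -> str:
        """Decide the next pedagogical step from what students just said."""
        return self.llm(f"TEACH: {transcript}")


class GameAgent:
    def __init__(self, llm: LLM):
        self.llm = llm

    def act(self, beat: str) -> str:
        """Turn a teaching decision into a Minecraft action description."""
        return self.llm(f"GAME: {beat}")


def captain_step(teaching: TeachingAgent, game: GameAgent, transcript: str):
    """Run one captain turn: a teaching decision first, then the game action."""
    beat = teaching.next_beat(transcript)
    action = game.act(beat)
    return beat, action
```

Passing a plain function as `llm` makes the pair easy to test without a model, and each agent could be given a different model or prompt without touching the other.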

Game Agent Teaching Agent Text-only LLM

Voice Bridge

ASR + TTS Sidecars

Streaming ASR via faster-whisper for student speech. MOSS-TTS-Nano for multi-voice output — distinct voice per speaker (captain, peer bots). Audio runs in sidecar processes, separate from the LLM reasoning.
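Keeping a distinct voice per speaker comes down to a stable mapping from speaker to voice. The registry below is a minimal sketch of that idea; `VoiceRegistry` and the voice ids are hypothetical and are not MOSS-TTS-Nano identifiers or part of its API.

```python
class VoiceRegistry:
    """Assigns each speaker a voice id that no other speaker shares."""

    def __init__(self, voice_ids):
        self._free = list(voice_ids)
        self._assigned = {}

    def voice_for(self, speaker):
        """Return the speaker's voice, assigning an unused one on first call."""
        if speaker not in self._assigned:
            if not self._free:
                raise RuntimeError("no unused voices left")
            self._assigned[speaker] = self._free.pop(0)
        return self._assigned[speaker]
```

The TTS sidecar would look up the voice before synthesizing each line, so the captain and every peer bot sound the same for the whole session.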

faster-whisper MOSS-TTS-Nano Multi-Voice

Captain Dyad faster-whisper ASR MOSS-TTS-Nano Open Source

Papers & Reports

Technical reports and papers. All work is open access with accompanying code.

Spoken English Sessions: Autonomous Multi-Student AI Classes in Minecraft

Technical Report · 2025

PDF ↗

AgentJam: AI Mentors That Play Minecraft With You

Technical Report · 2025

PDF ↗