System Overview
UnaMentis is designed as a modular, provider-agnostic voice AI tutoring platform. The architecture separates concerns into distinct layers, making it easy to swap providers, add new features, and maintain high performance.
Mobile Client Architecture (iOS)
The mobile client targets iOS and is written in Swift 6.0 with strict concurrency enabled. All services are implemented as actors to ensure thread safety, and every type crossing an actor boundary is Sendable. A web client (Next.js) provides browser-based voice tutoring, and an Android app is in development.
Directory Structure
UnaMentis/
├── Core/ # Business logic (actors)
│ ├── Audio/ # AudioEngine - iOS audio I/O
│ ├── Session/ # SessionManager - conversation orchestration
│ ├── Curriculum/ # CurriculumEngine - learning materials
│ ├── Telemetry/ # TelemetryEngine - metrics & costs
│ ├── Routing/ # PatchPanel - intelligent endpoint routing
│ ├── Config/ # API keys, server discovery
│ ├── Logging/ # Remote logging
│ └── Persistence/ # Core Data models
├── Services/ # Provider implementations
│ ├── STT/ # Speech-to-text (9 providers)
│ ├── TTS/ # Text-to-speech (7 providers)
│ ├── LLM/ # Language models (5 providers)
│ ├── VAD/ # Voice activity detection
│ ├── Embeddings/ # Semantic search
│ └── Protocols/ # Service interfaces
├── Intents/ # Siri & App Intents
│ ├── StartLessonIntent # "Hey Siri, start a lesson"
│ ├── ResumeLearningIntent # "Hey Siri, resume my lesson"
│ └── ShowProgressIntent # "Hey Siri, show my progress"
└── UI/ # SwiftUI views
├── Session/ # Live conversation interface
├── Curriculum/ # Topic browsing
├── History/ # Past sessions
├── Analytics/ # Performance metrics
├── Settings/ # Configuration
└── Debug/ # Development tools
Key Design Principles
Actor-Based Concurrency
All services are Swift actors, providing automatic thread safety and eliminating data races.
Protocol-First Design
Each service type has a protocol, enabling easy testing and provider swapping.
Sendable Types
All types crossing actor boundaries conform to Sendable for compile-time safety.
@MainActor ViewModels
All ViewModels run on the main actor, ensuring UI updates happen safely.
Siri & App Intents
Deep integration with Siri and iOS Shortcuts enables hands-free control. Start lessons, resume where you left off, or check progress using voice commands or automation.
Voice Commands
"Hey Siri, start a lesson on calculus" or "resume my lesson" launches directly into tutoring.
Shortcuts Integration
Build custom automations that incorporate UnaMentis into your learning routine.
Apple Watch Companion
The Apple Watch companion app, installed automatically alongside the iOS app, provides a control plane during active tutoring sessions. It uses WatchConnectivity for real-time state synchronization and supports extended runtime for 90+ minute sessions.
Session Controls
Mute/unmute microphone, pause/resume session, and stop session directly from your wrist.
At-a-Glance Progress
Circular progress gauge showing lesson completion, current topic, and session mode indicator.
Curriculum Auto-Continuation
Seamless topic-to-topic transitions enable uninterrupted learning sessions. When a topic completes, the next topic begins automatically with a brief audio announcement.
Pre-generation
At 70% progress through a topic, the system begins generating audio for the next topic in the background, ensuring zero delay at transition.
Segment Caching
All audio segments for the current topic are cached locally (up to 50MB), enabling instant replay and navigation.
Navigation Controls
Go back one segment, replay the current topic, or skip ahead to the next topic using on-screen controls.
User Control
Auto-continuation is enabled by default but can be toggled off in settings for manual topic progression.
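The auto-continuation behavior above can be sketched roughly as follows. This is an illustrative, language-agnostic sketch, not the actual CurriculumEngine code: the class, method names, and the idea of tracking pre-generated topics in a set are all assumptions; only the 70% threshold and the auto-advance behavior come from the description above.

```python
PREGEN_THRESHOLD = 0.70  # fraction of topic progress that triggers pre-generation


class TopicScheduler:
    def __init__(self, topics, auto_continue=True):
        self.topics = topics          # ordered list of topic IDs
        self.index = 0
        self.auto_continue = auto_continue
        self.pregenerated = set()     # topics whose audio is already synthesized

    def on_progress(self, fraction):
        """Called as playback advances through the current topic."""
        next_topic = self.peek_next()
        if (fraction >= PREGEN_THRESHOLD
                and next_topic is not None
                and next_topic not in self.pregenerated):
            self.pregenerate(next_topic)

    def on_topic_complete(self):
        """Advance to the next topic if auto-continuation is enabled."""
        if self.auto_continue and self.peek_next() is not None:
            self.index += 1
            return self.topics[self.index]   # caller announces and plays it
        return None

    def peek_next(self):
        if self.index + 1 < len(self.topics):
            return self.topics[self.index + 1]
        return None

    def pregenerate(self, topic):
        # Stand-in for kicking off background TTS synthesis of the topic.
        self.pregenerated.add(topic)
```

Because pre-generation starts at 70% progress, the next topic's audio is normally ready well before the current one finishes, which is what makes the zero-delay transition possible.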
Visual Content Support
Rich visual content enhances voice-based learning with diagrams, equations, and images.
LaTeX Rendering
Mathematical formulas render using SwiftMath with Unicode fallback for unsupported expressions.
Timed Display
Visual assets appear at specific points during audio playback, synchronized to the curriculum structure.
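As an illustration of the Unicode-fallback idea mentioned above, a minimal substitution pass might look like this. The command table and function are invented for the example; SwiftMath performs the real rendering, and this only shows the fallback concept.

```python
# Map a few common LaTeX commands to Unicode equivalents so a formula
# that cannot be rendered natively still reads sensibly as plain text.
UNICODE_MAP = {
    r"\alpha": "α", r"\beta": "β", r"\pi": "π",
    r"\times": "×", r"\leq": "≤", r"\geq": "≥",
    r"\rightarrow": "→", r"\infty": "∞",
}


def unicode_fallback(latex: str) -> str:
    """Replace supported LaTeX commands; leave everything else as-is."""
    # Replace longer commands first so shorter ones don't match inside them.
    for cmd in sorted(UNICODE_MAP, key=len, reverse=True):
        latex = latex.replace(cmd, UNICODE_MAP[cmd])
    return latex
```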
Web Client Architecture
The web client provides browser-based voice tutoring using Next.js 15+ and React 19. It is the planned path to every platform beyond the native iOS and Android apps.
Features
Real-time Voice
OpenAI Realtime API via WebRTC for low-latency conversations with streaming audio.
Curriculum Browser
Full UMCF content navigation with hierarchy, visual assets, and LaTeX rendering.
Responsive Design
Desktop and mobile optimized with light/dark theme support.
Cost Tracking
Real-time session cost display for transparency during conversations.
Provider Support
| Component | Provider |
|---|---|
| STT | OpenAI Realtime, Deepgram, AssemblyAI, Groq, Self-hosted |
| TTS | OpenAI Realtime, ElevenLabs, Self-hosted |
| LLM | OpenAI Realtime, Anthropic Claude, Groq |
Next.js 15+ / React 19 / TypeScript / Tailwind CSS
Voice Pipeline
The voice pipeline is the heart of UnaMentis. It handles real-time audio capture, voice activity detection, speech recognition, language model inference, and speech synthesis in a carefully orchestrated flow.
Session State Machine
The SessionManager maintains a state machine for turn-taking: each turn moves through listening, processing, and speaking phases, with transitions driven by VAD events, LLM streaming, and audio playback completion.
Interruption Handling
UnaMentis supports natural interruptions. When the user starts speaking while the AI is talking, the system gracefully stops TTS playback, cancels pending audio, and transitions to processing the user's new input.
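A minimal sketch of such a turn-taking machine with barge-in, assuming a simplified four-state model (the real SessionManager's state set and transitions are richer):

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # waiting for the user to speak
    LISTENING = auto()   # VAD detected speech, STT streaming
    PROCESSING = auto()  # LLM generating a response
    SPEAKING = auto()    # TTS audio playing


class TurnMachine:
    def __init__(self):
        self.state = State.IDLE

    def on_speech_start(self):
        if self.state == State.SPEAKING:
            # Barge-in: stop TTS, drop pending audio, handle the new input.
            self.stop_tts_playback()
        self.state = State.LISTENING

    def on_speech_end(self):
        if self.state == State.LISTENING:
            self.state = State.PROCESSING

    def on_response_audio_ready(self):
        if self.state == State.PROCESSING:
            self.state = State.SPEAKING

    def on_playback_finished(self):
        if self.state == State.SPEAKING:
            self.state = State.IDLE

    def stop_tts_playback(self):
        pass  # stand-in for cancelling synthesis and flushing audio buffers
```

The key property is that `on_speech_start` is valid in the SPEAKING state: the user interrupting is a normal transition, not an error.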
Provider Stack
UnaMentis supports multiple deployment configurations, from fully on-device processing to cloud-based services, with intelligent fallback between them.
Kyutai Pocket: On-Device Neural TTS
We've standardized on Kyutai Pocket as our primary on-device TTS engine. This 100M-parameter neural model marks a step change for on-device speech synthesis, delivering natural, expressive voices with zero network latency. Unlike system TTS (AVSpeechSynthesizer), Kyutai Pocket produces human-quality speech entirely on-device, eliminating the traditional trade-off between privacy and voice quality.
- 100M parameters: Compact enough for mobile, powerful enough for natural speech
- Zero network latency: No server round-trip, instant voice synthesis
- Complete privacy: Audio never leaves the device
- Battery efficient: Optimized for extended mobile sessions
On-Device Capabilities
Run entirely on-device for maximum privacy and offline use:
| Component | Technology |
|---|---|
| VAD | Silero (Core ML) |
| STT | Apple Speech, GLM-ASR (Whisper + GLM-ASR-Nano) |
| LLM | Ministral-3B-Instruct (primary), TinyLlama-1.1B (fallback) via llama.cpp |
| TTS | Kyutai Pocket (100M neural, primary), Apple AVSpeechSynthesizer (fallback) |
Self-Hosted Options
Run your own servers for privacy-first deployments:
| Component | Technology |
|---|---|
| LLM | Ollama, llama.cpp server, vLLM |
| TTS | Kyutai 1.6B (40+ voices), Chatterbox (emotion control, voice cloning), Fish Speech (30+ languages), VibeVoice, Piper |
| STT | GLM-ASR server (port 11401, WebSocket streaming) |
Cloud Fallback
High-quality cloud services when needed:
| Component | Provider |
|---|---|
| LLM | Anthropic Claude 3.5 Sonnet, OpenAI GPT-4o, GPT-4o-mini, Realtime API |
| STT | Deepgram Nova-3, AssemblyAI, OpenAI Whisper, Groq Whisper |
| TTS | ElevenLabs Turbo v2.5, Deepgram Aura-2, OpenAI Realtime |
Privacy Architecture
UnaMentis offers three privacy tiers, allowing users to choose their preferred balance between capability and data privacy. Each tier uses different providers with different data handling characteristics.
Tier 1: Maximum Privacy
On-Device Only
All processing happens locally on your device. No audio or text ever leaves your phone.
- STT: Apple Speech Framework
- TTS: Kyutai Pocket (100M neural, primary), AVSpeechSynthesizer (fallback)
- LLM: On-Device (Ministral-3B, TinyLlama)
Tier 2: High Privacy
Self-Hosted Servers
Data stays on servers you control. Ideal for organizations with compliance requirements.
- STT: Whisper.cpp, faster-whisper
- TTS: Kyutai 1.6B, Chatterbox, Fish Speech, VibeVoice, Piper
- LLM: Ollama, llama.cpp, vLLM
Tier 3: Standard Privacy
Cloud with DPAs
Commercial cloud providers with data processing agreements and enterprise-grade security.
- STT: Deepgram, AssemblyAI, Groq
- TTS: ElevenLabs, Deepgram
- LLM: OpenAI, Anthropic
Users can mix tiers based on their needs: for example, running on-device STT for privacy while using cloud TTS for voice quality. The provider-agnostic architecture makes these combinations seamless.
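A hypothetical per-component configuration illustrating a mixed-tier setup (the actual configuration format, keys, and provider identifiers are not shown in this document; these are invented for the example):

```python
PRIVACY_TIERS = {
    "on_device": 1,     # Tier 1: Maximum Privacy
    "self_hosted": 2,   # Tier 2: High Privacy
    "cloud": 3,         # Tier 3: Standard Privacy
}

# Example: on-device STT for privacy, cloud TTS for voice quality.
pipeline_config = {
    "stt": {"tier": "on_device",   "provider": "apple_speech"},
    "tts": {"tier": "cloud",       "provider": "elevenlabs"},
    "llm": {"tier": "self_hosted", "provider": "ollama"},
}


def effective_tier(config):
    """A session is only as private as its least-private component."""
    return max(PRIVACY_TIERS[c["tier"]] for c in config.values())
```

One useful consequence of modeling it this way: the UI can honestly report the session's effective privacy tier as the weakest component's tier.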
Core Services
AudioEngine
Core/Audio/AudioEngine.swift
Manages all iOS audio I/O with hardware voice processing optimization:
- Hardware AEC (Acoustic Echo Cancellation)
- AGC (Automatic Gain Control)
- Noise Suppression
- Real-time audio streaming
- Multi-buffer TTS playback scheduling
- Thermal state monitoring
SessionManager
Core/Session/SessionManager.swift (~1,367 lines)
Orchestrates voice conversation sessions:
- Turn-taking logic with state machine
- Natural interruption handling
- Context management for LLM prompts
- Session recording with transcripts
- TTS prefetching for smooth playback
- Word-level timing in transcripts
CurriculumEngine
Core/Curriculum/CurriculumEngine.swift
Manages learning materials and progress:
- Topic hierarchy navigation (unlimited depth)
- Progress tracking with mastery scores
- Dynamic context generation for prompts
- Alternative explanation handling
- Misconception detection triggers
- Visual asset caching
TelemetryEngine
Core/Telemetry/TelemetryEngine.swift (~613 lines)
Real-time performance monitoring:
- Latency measurement (TTFT, TTFB)
- Cost calculation per provider
- Memory monitoring and growth tracking
- Thermal state monitoring
- Per-turn performance analysis
- Aggregated session metrics
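Per-turn latency and cost bookkeeping of this kind can be sketched as follows. The field names and the token price are illustrative, not UnaMentis's actual values; only the TTFT concept comes from the list above.

```python
import time


class TurnMetrics:
    def __init__(self):
        self.turn_start = None
        self.first_token_at = None
        self.llm_tokens = 0

    def start_turn(self):
        self.turn_start = time.monotonic()

    def on_token(self):
        # Record the arrival of the first streamed token exactly once.
        if self.first_token_at is None:
            self.first_token_at = time.monotonic()
        self.llm_tokens += 1

    @property
    def ttft(self):
        """Time to first token, in seconds (None until a token arrives)."""
        if self.turn_start is None or self.first_token_at is None:
            return None
        return self.first_token_at - self.turn_start

    def cost(self, usd_per_1k_tokens=0.0006):  # illustrative price, not a real rate
        return self.llm_tokens / 1000 * usd_per_1k_tokens
```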
Intelligent Routing (PatchPanel)
The PatchPanel service provides intelligent LLM endpoint routing based on runtime conditions. This allows UnaMentis to automatically select the best provider based on the current situation.
Routing Conditions
Thermal State
Switch to lighter models when device is hot
Memory Pressure
Reduce model size under memory constraints
Battery Level
Use efficient endpoints when battery is low
Network Quality
Route based on latency and bandwidth
Cost Budget
Stay within per-session cost limits
Task Type
Match model to task complexity
Task Types
The router classifies requests into task types for optimal model selection:
- Quick Response: Simple acknowledgments, short answers
- Explanation: Teaching new concepts
- Deep Thinking: Complex problem solving
- Assessment: Evaluating student understanding
- Remediation: Addressing misconceptions
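A toy version of such a routing policy, combining the conditions and task types above. Endpoint names, priority order, and thresholds are made up for illustration; the real PatchPanel policy is richer.

```python
def choose_endpoint(task_type, thermal_hot, memory_pressure,
                    battery_low, budget_remaining_usd):
    """Pick an LLM endpoint given runtime conditions and task type."""
    # Device under stress: prefer the smallest local model.
    if thermal_hot or memory_pressure:
        return "on_device_small"
    # Low battery: avoid on-device inference; use an efficient cloud model.
    if battery_low:
        return "cloud_mini"
    # Session budget exhausted: stay local regardless of task type.
    if budget_remaining_usd <= 0:
        return "on_device_primary"
    # Otherwise match model capability to task complexity.
    if task_type in ("deep_thinking", "remediation"):
        return "cloud_frontier"
    if task_type in ("explanation", "assessment"):
        return "self_hosted_llm"
    return "on_device_primary"  # quick_response and anything else
```

The point of the sketch is the precedence: device-health conditions override task-type matching, so a hot phone never gets routed to a heavy model just because the question is hard.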
Server Components
The server-side infrastructure provides curriculum management, remote logging, and a web dashboard for monitoring and administration.
Management Server
Python 3.11+ / aiohttp / asyncio
Async HTTP server with WebSocket support running on port 8766:
- Remote logging aggregation from mobile clients
- Metrics history with time-series storage
- Resource monitoring (CPU, memory, thermal)
- Idle state management for power efficiency
- WebSocket streaming for real-time updates
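The metrics-history idea can be illustrated with a small in-memory time-series buffer; the server's real storage layer may differ, and the bounded deque is an assumption of this sketch.

```python
import time
from collections import deque


class MetricsHistory:
    def __init__(self, max_points=10_000):
        # Oldest samples are evicted automatically once the buffer is full.
        self.points = deque(maxlen=max_points)  # (timestamp, name, value)

    def record(self, name, value, ts=None):
        self.points.append((ts if ts is not None else time.time(), name, value))

    def query(self, name, since=0.0):
        """All samples of one metric recorded at or after `since`."""
        return [(t, v) for (t, n, v) in self.points if n == name and t >= since]
```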
Curriculum Database
SQLite / Python / UMCF JSON
Storage and retrieval for UMCF curriculum documents:
- SQLite database with shared access
- Plugin-based import from MIT OCW, CK-12, EngageNY, MERLOT
- Full-text search across curriculum content
- AI enrichment pipeline for content processing
- Visual asset management and caching
Operations Console
Next.js 16 / React 19 / TypeScript
Administration and monitoring interface:
- Curriculum Studio for viewing and editing content
- Service status monitoring (Ollama, Chatterbox, VibeVoice, Piper, Gateway)
- Real-time metrics and performance tracking
- Plugin Manager for content source configuration
- Logs and diagnostics with real-time filtering
- Voice Lab: AI model selection, TTS experimentation, and batch processing profiles
Authentication System
JWT / RFC 9700 Token Rotation
Secure multi-device authentication with token rotation:
- JWT access tokens with 15-minute expiry
- Refresh token rotation per RFC 9700 with reuse detection
- Device fingerprinting for multi-device support
- Session management with remote termination
- Rate limiting on auth endpoints
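Refresh-token rotation with reuse detection can be sketched like this, in the spirit of RFC 9700. Signing, persistence, and device binding are omitted, and the class and method names are illustrative; only the rotation bookkeeping is shown.

```python
import secrets


class RefreshTokenStore:
    def __init__(self):
        self.active = {}    # currently valid token -> session_id
        self.retired = {}   # previously issued token -> session_id

    def issue(self, session_id):
        token = secrets.token_urlsafe(32)
        self.active[token] = session_id
        return token

    def rotate(self, token):
        """Exchange a refresh token for a new one; detect reuse."""
        if token in self.retired:
            # A retired token was presented again: assume theft and
            # terminate the whole session (reuse detection).
            self.revoke_session(self.retired[token])
            raise PermissionError("refresh token reuse detected")
        session_id = self.active.pop(token, None)
        if session_id is None:
            raise PermissionError("unknown refresh token")
        self.retired[token] = session_id
        return self.issue(session_id)

    def revoke_session(self, session_id):
        self.active = {t: s for t, s in self.active.items() if s != session_id}
```

Note that reuse of an old token revokes the session's current token too, which is what makes rotation effective against a stolen refresh token.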
USM Core (Service Manager)
Rust / Tokio / Axum / Port 8787
Cross-platform service manager for development infrastructure. Manages templates and instances of backend services with real-time process monitoring.
- Template-based: Service definitions with variable substitution for multiple instances
- Real-time monitoring: CPU, memory, and status via WebSocket (under 50ms latency)
- HTTP REST API: Start, stop, restart services programmatically
- C FFI bindings: Swift and Python integration for native apps
- Platform support: macOS (libproc), Linux (procfs)
TTS Caching & Session Management
Python / async / Filesystem Cache
A global audio caching system enables efficient multi-user deployments. When one user generates audio for a curriculum segment, that audio is cached and instantly available to all other users with the same voice configuration.
- Cross-user caching: Cache keys use text + voice + provider (no user ID), so identical requests share cached audio
- Priority-based generation: Live requests (user waiting) get highest priority; background pre-generation uses separate resource pools
- Session state: Per-user playback position, voice preferences, and cross-device resume support
- Scheduled deployments: Administrators can pre-generate entire curricula overnight for zero-latency playback
Performance: 1,155 requests/second with 50 concurrent cache hits at 35ms average latency. Corporate training example: 500 employees can start the same course simultaneously with 100% cache hits.
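The cross-user cache-key scheme can be illustrated as follows. The hash choice and field separator are assumptions of this sketch; the essential property, taken from the description above, is that the key deliberately excludes any user identifier.

```python
import hashlib


def tts_cache_key(text: str, voice: str, provider: str) -> str:
    """Derive a cache key from text + voice + provider only (no user ID)."""
    # An unambiguous separator prevents ("ab", "c") colliding with ("a", "bc").
    payload = "\x1f".join([provider, voice, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because two users requesting the same segment with the same voice configuration derive the same key, the second request is a pure cache hit.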
Self-Hosted Server Support
Connect to local/LAN servers for zero-cost inference:
| Server | Port | Purpose |
|---|---|---|
| Ollama | 11434 | LLM inference (primary) |
| llama.cpp | 8080 | LLM inference |
| vLLM | 8000 | High-throughput LLM |
| GLM-ASR | 11401 | STT (WebSocket streaming) |
| whisper.cpp / faster-whisper | 8080 | Self-hosted STT (OpenAI-compatible) |
| Chatterbox TTS | 8004 | Expressive TTS with emotion control, voice cloning |
| VibeVoice TTS | 11403 | Real-time TTS |
| Piper TTS | 11402 | Lightweight TTS |
| UnaMentis Gateway | 11400 | Unified API gateway |
Data Flow
Understanding how data moves through the system during a typical conversation turn:
1. Audio Capture: AudioEngine captures microphone input at 16kHz and applies hardware AEC/AGC/NS
2. Voice Detection: Silero VAD (Core ML) detects speech start/end with confidence scores
3. Streaming STT: Audio streams to the STT provider; partial transcripts arrive in real-time
4. Context Assembly: SessionManager builds the LLM prompt with curriculum context and conversation history
5. LLM Streaming: PatchPanel routes to the optimal endpoint; tokens stream back as they are generated
6. Sentence Buffering: Tokens accumulate until a sentence boundary is detected
7. TTS Synthesis: Complete sentences are sent to TTS; audio chunks stream back
8. Audio Playback: AudioEngine schedules buffers for seamless playback while the next sentence synthesizes
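The sentence-buffering step above can be sketched as follows. The boundary heuristic here is deliberately simple (a production detector must handle abbreviations, decimals, and so on), and the class is invented for illustration.

```python
import re

# A sentence ends at ., !, or ? (optionally followed by closing quotes or
# brackets) and then whitespace.
SENTENCE_END = re.compile(r"([.!?][\"')\]]*)\s")


class SentenceBuffer:
    def __init__(self):
        self.buf = ""

    def feed(self, token: str):
        """Add a streamed token; return any complete sentences ready for TTS."""
        self.buf += token
        sentences = []
        while True:
            m = SENTENCE_END.search(self.buf)
            if m is None:
                break
            sentences.append(self.buf[:m.end(1)].strip())
            self.buf = self.buf[m.end():]
        return sentences

    def flush(self):
        """At end of the LLM stream, emit whatever remains."""
        rest, self.buf = self.buf.strip(), ""
        return [rest] if rest else []
```

Feeding tokens through such a buffer is what lets TTS start on sentence N while the LLM is still generating sentence N+1.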
Performance Targets
UnaMentis is designed to meet aggressive performance targets for natural conversation:
- Time from the user finishing speaking to the first audio from the AI
- 99th percentile turn latency for a consistent experience
- Continuous operation without crashes or degradation
- Maximum memory increase over a 90-minute session
Optimization Techniques
- Streaming Everywhere: STT, LLM, and TTS all stream to minimize time-to-first-byte
- Sentence Pipelining: Start TTS on sentence N while LLM generates sentence N+1
- Audio Prefetching: Buffer multiple audio chunks ahead for seamless playback
- Thermal Monitoring: Automatically reduce load when device heats up
- Memory Management: Careful lifecycle management of audio buffers
- Connection Pooling: Reuse WebSocket/HTTP connections across requests