System Overview
UnaMentis is designed as a modular, provider-agnostic voice AI tutoring platform. The architecture separates concerns into distinct layers, making it easy to swap providers, add new features, and maintain high performance.
Mobile Client Architecture (iOS)
The mobile client is currently implemented for iOS using Swift 6.0 with strict concurrency. An Android version is planned for the near future. All services are implemented as actors to ensure thread safety, and types crossing actor boundaries are Sendable.
Directory Structure
UnaMentis/
├── Core/ # Business logic (actors)
│ ├── Audio/ # AudioEngine - iOS audio I/O
│ ├── Session/ # SessionManager - conversation orchestration
│ ├── Curriculum/ # CurriculumEngine - learning materials
│ ├── Telemetry/ # TelemetryEngine - metrics & costs
│ ├── Routing/ # PatchPanel - intelligent endpoint routing
│ ├── Config/ # API keys, server discovery
│ ├── Logging/ # Remote logging
│ └── Persistence/ # Core Data models
├── Services/ # Provider implementations
│ ├── STT/ # Speech-to-text providers
│ ├── TTS/ # Text-to-speech providers
│ ├── LLM/ # Language model providers
│ ├── VAD/ # Voice activity detection
│ ├── Embeddings/ # Semantic search
│ └── Protocols/ # Service interfaces
└── UI/ # SwiftUI views
  ├── Session/ # Live conversation interface
  ├── Curriculum/ # Topic browsing
  ├── History/ # Past sessions
  ├── Analytics/ # Performance metrics
  ├── Settings/ # Configuration
  └── Debug/ # Development tools
Key Design Principles
- Actor-Based Concurrency: All services are Swift actors, providing automatic thread safety and eliminating data races.
- Protocol-First Design: Each service type has a protocol, enabling easy testing and provider swapping.
- Sendable Types: All types crossing actor boundaries conform to Sendable for compile-time safety.
- @MainActor ViewModels: All ViewModels run on the main actor, ensuring UI updates happen safely.
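To make these principles concrete, the sketch below shows how they compose. The protocol, type, and property names are illustrative assumptions, not the actual UnaMentis API.

import Foundation
import Combine

// Illustrative only: protocol, type, and property names are assumptions.
protocol STTServiceProtocol: Sendable {
    func transcribe(_ audio: Data) async throws -> Transcript
}

struct Transcript: Sendable {
    let text: String
    let isFinal: Bool
}

// Services are actors: mutable state is isolated, so calls from any thread are safe.
actor ExampleSTTService: STTServiceProtocol {
    private var bytesSent = 0

    func transcribe(_ audio: Data) async throws -> Transcript {
        bytesSent += audio.count
        // ...stream the audio to a provider and await its response...
        return Transcript(text: "", isFinal: true)
    }
}

// ViewModels are @MainActor, so published state only changes on the UI thread.
@MainActor
final class SessionViewModel: ObservableObject {
    @Published private(set) var latestTranscript = ""
    private let stt: any STTServiceProtocol

    init(stt: any STTServiceProtocol) { self.stt = stt }

    func didCaptureAudio(_ chunk: Data) async {
        if let transcript = try? await stt.transcribe(chunk) {
            latestTranscript = transcript.text   // safe: we are on the main actor
        }
    }
}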
Voice Pipeline
The voice pipeline is the heart of UnaMentis. It handles real-time audio capture, voice activity detection, speech recognition, language model inference, and speech synthesis in a carefully orchestrated flow.
Session State Machine
The SessionManager maintains a state machine that governs turn-taking between the user and the AI.
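The individual states are not enumerated in this overview; the following sketch shows one plausible shape for such a turn-taking state machine, with state names assumed for illustration.

import Foundation

// Hypothetical state names; the real SessionManager may differ.
enum SessionState: Sendable, Equatable {
    case idle          // no active turn
    case listening     // capturing user speech
    case processing    // STT finalizing + LLM generating
    case speaking      // TTS audio playing back
}

actor TurnStateMachine {
    private(set) var state: SessionState = .idle

    // Only legal transitions are applied; illegal ones are ignored.
    func transition(to next: SessionState) -> Bool {
        switch (state, next) {
        case (.idle, .listening),
             (.listening, .processing),
             (.processing, .speaking),
             (.speaking, .listening),   // barge-in: user interrupts the AI
             (.speaking, .idle),
             (.processing, .idle):
            state = next
            return true
        default:
            return false
        }
    }
}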
Interruption Handling
UnaMentis supports natural interruptions. When the user starts speaking while the AI is talking, the system gracefully stops TTS playback, cancels pending audio, and transitions to processing the user's new input.
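A self-contained sketch of how barge-in might be coordinated follows; the type and method names are assumptions rather than the actual SessionManager implementation.

import Foundation

actor BargeInCoordinator {
    private var ttsTask: Task<Void, Never>?
    private var llmTask: Task<Void, Never>?
    private var pendingSentences: [String] = []

    func register(ttsTask: Task<Void, Never>, llmTask: Task<Void, Never>) {
        self.ttsTask = ttsTask
        self.llmTask = llmTask
    }

    // Called when VAD detects user speech while the AI is talking.
    func handleBargeIn(stopPlayback: @Sendable () async -> Void) async {
        await stopPlayback()            // silence the AI immediately
        ttsTask?.cancel()               // discard audio still being synthesized
        llmTask?.cancel()               // stop generating the interrupted reply
        pendingSentences.removeAll()    // drop queued, not-yet-spoken text
        // Control then returns to the listening state, so the new utterance
        // flows through VAD -> STT -> LLM as a normal turn.
    }
}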
Core Services
AudioEngine
Core/Audio/AudioEngine.swift
Manages all iOS audio I/O with hardware voice processing optimization:
- Hardware AEC (Acoustic Echo Cancellation)
- AGC (Automatic Gain Control)
- Noise Suppression
- Real-time audio streaming
- Multi-buffer TTS playback scheduling
- Thermal state monitoring
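For orientation, the sketch below shows the kind of AVFoundation setup that enables Apple's hardware voice processing (AEC, AGC, noise suppression) and streams captured buffers downstream. It is a simplified stand-in, not the actual AudioEngine.

import AVFoundation

final class VoiceCaptureSketch {
    private let engine = AVAudioEngine()

    func start(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .voiceChat,
                                options: [.allowBluetooth, .defaultToSpeaker])
        try session.setPreferredSampleRate(16_000)
        try session.setActive(true)

        // Turns on the platform voice-processing unit (echo cancellation, gain control).
        try engine.inputNode.setVoiceProcessingEnabled(true)

        let format = engine.inputNode.outputFormat(forBus: 0)
        engine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            onBuffer(buffer)   // hand PCM chunks to VAD / STT downstream
        }
        try engine.start()
    }
}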
SessionManager
Core/Session/SessionManager.swift (~1,367 lines)
Orchestrates voice conversation sessions:
- Turn-taking logic with state machine
- Natural interruption handling
- Context management for LLM prompts
- Session recording with transcripts
- TTS prefetching for smooth playback
- Word-level timing in transcripts
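A word-level transcript might be modeled roughly as follows; the types and field names are illustrative assumptions.

import Foundation

struct TimedWord: Sendable, Codable {
    let text: String
    let start: TimeInterval   // seconds from the start of the turn
    let end: TimeInterval
}

struct TurnTranscript: Sendable, Codable {
    enum Speaker: String, Sendable, Codable { case user, assistant }

    let speaker: Speaker
    let words: [TimedWord]

    // Full text reconstructed from the timed words.
    var text: String { words.map(\.text).joined(separator: " ") }
}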
CurriculumEngine
Core/Curriculum/CurriculumEngine.swift
Manages learning materials and progress:
- Topic hierarchy navigation (unlimited depth)
- Progress tracking with mastery scores
- Dynamic context generation for prompts
- Alternative explanation handling
- Misconception detection triggers
- Visual asset caching
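A sketch of a topic hierarchy with mastery tracking is shown below; the Topic type and its fields are illustrative assumptions, not the actual CurriculumEngine model.

import Foundation

struct Topic: Sendable, Identifiable {
    let id: UUID
    let title: String
    var masteryScore: Double        // 0.0 (new) ... 1.0 (mastered)
    var children: [Topic] = []      // unlimited nesting depth

    // Depth-first search for the next leaf topic below a mastery threshold.
    func nextTopicNeedingWork(threshold: Double = 0.8) -> Topic? {
        if children.isEmpty { return masteryScore < threshold ? self : nil }
        for child in children {
            if let candidate = child.nextTopicNeedingWork(threshold: threshold) {
                return candidate
            }
        }
        return nil
    }
}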
TelemetryEngine
Core/Telemetry/TelemetryEngine.swift (~613 lines)
Real-time performance monitoring:
- Latency measurement (TTFT: time to first token, TTFB: time to first byte)
- Cost calculation per provider
- Memory monitoring and growth tracking
- Thermal state monitoring
- Per-turn performance analysis
- Aggregated session metrics
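Per-turn latency capture might look roughly like the sketch below, which uses ContinuousClock for monotonic timing; the API shown is an assumption, not the actual TelemetryEngine.

import Foundation

actor TurnLatencyRecorder {
    struct TurnMetrics: Sendable {
        var llmFirstToken: Duration? = nil   // turn start -> first LLM token (≈ TTFT)
        var ttsFirstByte: Duration? = nil    // turn start -> first TTS audio byte (≈ TTFB)
        var voiceToVoice: Duration? = nil    // turn start -> first audio played to the user
    }

    private let clock = ContinuousClock()
    private var turnStart: ContinuousClock.Instant?
    private var current = TurnMetrics()

    // Call when VAD reports the user has stopped speaking.
    func userStoppedSpeaking() { turnStart = clock.now }

    func firstLLMToken() {
        guard let start = turnStart else { return }
        current.llmFirstToken = start.duration(to: clock.now)
    }

    func firstTTSAudio() {
        guard let start = turnStart else { return }
        current.ttsFirstByte = start.duration(to: clock.now)
    }

    // Call when the first AI audio actually plays; returns the finished turn's metrics.
    func firstAudioPlayed() -> TurnMetrics {
        if let start = turnStart {
            current.voiceToVoice = start.duration(to: clock.now)
        }
        defer { current = TurnMetrics(); turnStart = nil }
        return current
    }
}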
Intelligent Routing (PatchPanel)
The PatchPanel service provides intelligent LLM endpoint routing based on runtime conditions, allowing UnaMentis to automatically select the best provider for the current situation.
Routing Conditions
- Thermal State: Switch to lighter models when the device runs hot
- Memory Pressure: Reduce model size under memory constraints
- Battery Level: Use efficient endpoints when battery is low
- Network Quality: Route based on latency and bandwidth
- Cost Budget: Stay within per-session cost limits
- Task Type: Match the model to task complexity
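The sketch below shows how such conditions might feed a routing decision; the thresholds, field names, and tier names are assumptions, not the actual PatchPanel policy.

import Foundation

struct RoutingContext: Sendable {
    var thermalState: ProcessInfo.ThermalState
    var batteryLevel: Float              // 0.0 ... 1.0
    var underMemoryPressure: Bool
    var estimatedNetworkRTT: Duration
    var remainingSessionBudgetUSD: Double
}

enum EndpointTier: Sendable { case lightweight, balanced, frontier }

func selectEndpoint(for context: RoutingContext) -> EndpointTier {
    // Protect the device first: heat and memory pressure trump everything else.
    if context.thermalState == .serious || context.thermalState == .critical
        || context.underMemoryPressure {
        return .lightweight
    }
    // Respect the per-session cost budget before picking an expensive model.
    if context.remainingSessionBudgetUSD < 0.05 { return .lightweight }
    // Low battery or a slow network favors a fast, efficient endpoint.
    if context.batteryLevel < 0.2 || context.estimatedNetworkRTT > .milliseconds(300) {
        return .balanced
    }
    return .frontier
}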
Task Types
The router classifies requests into task types for optimal model selection:
- Quick Response: Simple acknowledgments, short answers
- Explanation: Teaching new concepts
- Deep Thinking: Complex problem solving
- Assessment: Evaluating student understanding
- Remediation: Addressing misconceptions
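These task types might map onto model tiers roughly as sketched below (reusing the EndpointTier from the routing sketch above); the mapping is an assumption, not the actual classifier.

enum TaskType: Sendable {
    case quickResponse    // simple acknowledgments, short answers
    case explanation      // teaching new concepts
    case deepThinking     // complex problem solving
    case assessment       // evaluating student understanding
    case remediation      // addressing misconceptions
}

extension TaskType {
    // Heavier reasoning justifies a larger (slower, costlier) model.
    var preferredTier: EndpointTier {
        switch self {
        case .quickResponse:              return .lightweight
        case .explanation, .assessment:   return .balanced
        case .deepThinking, .remediation: return .frontier
        }
    }
}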
Server Components
The server-side infrastructure provides curriculum management, remote logging, and a web dashboard for monitoring and administration.
Management Server
Python 3.11+ / aiohttp / asyncio
Async HTTP server with WebSocket support running on port 8766:
- Remote logging aggregation from mobile clients
- Metrics history with time-series storage
- Resource monitoring (CPU, memory, thermal)
- Idle state management for power efficiency
- WebSocket streaming for real-time updates
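From the client's perspective, log upload might look roughly like the sketch below. The /api/logs path and payload shape are assumptions; only the port number comes from this page.

import Foundation

struct RemoteLogEntry: Codable, Sendable {
    let timestamp: Date
    let level: String
    let message: String
}

actor RemoteLogUploader {
    private var buffer: [RemoteLogEntry] = []
    private let endpoint = URL(string: "http://localhost:8766/api/logs")!  // hypothetical path

    func log(_ level: String, _ message: String) {
        buffer.append(RemoteLogEntry(timestamp: Date(), level: level, message: message))
    }

    // Ship buffered entries in one batch to reduce radio wake-ups.
    func flush() async throws {
        guard !buffer.isEmpty else { return }
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(buffer)
        _ = try await URLSession.shared.data(for: request)
        buffer.removeAll()
    }
}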
Curriculum Database
PostgreSQL / File-based / JSON
Storage and retrieval for UMLCF curriculum documents:
- File-based storage for development
- PostgreSQL support for production deployments
- Full-text search across curriculum content
- Topic hierarchy navigation APIs
- Metadata indexing for filtering
Web Dashboard
Next.js / React / TypeScript
Administration and monitoring interface:
- Curriculum browsing and management
- Session analytics visualization
- Real-time metrics streaming
- Provider health monitoring
- Configuration management
Data Flow
Understanding how data moves through the system during a typical conversation turn:
1. Audio Capture: AudioEngine captures microphone input at 16 kHz and applies hardware AEC/AGC/NS
2. Voice Detection: Silero VAD (CoreML) detects speech start/end with confidence scores
3. Streaming STT: Audio streams to the STT provider; partial transcripts arrive in real time
4. Context Assembly: SessionManager builds the LLM prompt with curriculum context and conversation history
5. LLM Streaming: PatchPanel routes to the optimal endpoint; tokens stream back as they are generated
6. Sentence Buffering: Tokens accumulate until a sentence boundary is detected
7. TTS Synthesis: Complete sentences are sent to TTS; audio chunks stream back
8. Audio Playback: AudioEngine schedules buffers for seamless playback while the next sentence synthesizes
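Step 6 is the hinge between LLM streaming and TTS. A naive sketch of that sentence-buffering stage, using simple punctuation-based boundary detection, might look like this:

import Foundation

// Real boundary detection is more careful than this punctuation check.
func sentences(from tokens: AsyncStream<String>) -> AsyncStream<String> {
    AsyncStream<String> { continuation in
        Task {
            var buffer = ""
            for await token in tokens {
                buffer += token
                // Flush once the buffer ends in terminal punctuation.
                if let last = buffer.trimmingCharacters(in: .whitespaces).last,
                   ".!?".contains(last) {
                    continuation.yield(buffer)
                    buffer = ""
                }
            }
            if !buffer.isEmpty { continuation.yield(buffer) }  // trailing partial sentence
            continuation.finish()
        }
    }
}

Each yielded sentence can be handed to TTS while the LLM keeps streaming, which is what allows sentence N to play while sentence N+1 is still being generated.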
Performance Targets
UnaMentis is designed to meet aggressive performance targets for natural conversation:
- Time from the user finishing speaking to the first audio from the AI
- 99th percentile turn latency, for a consistent experience
- Continuous operation without crashes or degradation
- Maximum memory increase over a 90-minute session
Optimization Techniques
- Streaming Everywhere: STT, LLM, and TTS all stream to minimize time-to-first-byte
- Sentence Pipelining: Start TTS on sentence N while LLM generates sentence N+1
- Audio Prefetching: Buffer multiple audio chunks ahead for seamless playback
- Thermal Monitoring: Automatically reduce load when device heats up
- Memory Management: Careful lifecycle management of audio buffers
- Connection Pooling: Reuse WebSocket/HTTP connections across requests
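As an illustration of audio prefetching, the sketch below queues TTS chunks on an AVAudioPlayerNode so playback continues without gaps while later sentences are still being synthesized. It is a simplified stand-in for the real scheduling logic.

import AVFoundation

final class PlaybackScheduler {
    private let engine = AVAudioEngine()
    private let player = AVAudioPlayerNode()

    init(format: AVAudioFormat) {
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: format)
    }

    func start() throws {
        try engine.start()
        player.play()
    }

    // Each TTS chunk is queued as soon as it arrives; the player node plays
    // scheduled buffers back-to-back, so playback never gaps while the next
    // sentence is still being synthesized.
    func enqueue(_ buffer: AVAudioPCMBuffer) {
        player.scheduleBuffer(buffer, completionHandler: nil)
    }

    // Barge-in: stop immediately and drop everything that was queued.
    func stopAndFlush() {
        player.stop()
    }
}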