System Overview

UnaMentis is designed as a modular, provider-agnostic voice AI tutoring platform. The architecture separates concerns into distinct layers, making it easy to swap providers, add new features, and maintain high performance.

  • UI Layer: SwiftUI views, session interface, curriculum navigator, analytics dashboard
  • Core Business Logic: SessionManager, CurriculumEngine, TelemetryEngine, PatchPanel Router
  • Service Layer: Protocol-based actors for STT, TTS, LLM, VAD, Embeddings
  • Infrastructure: Audio Engine, Core Data, URLSession, CoreML

Mobile Client Architecture (iOS)

The mobile client is currently implemented for iOS using Swift 6.0 with strict concurrency. An Android version is planned for the near future. All services are implemented as actors to ensure thread safety, and types crossing actor boundaries are Sendable.

Directory Structure

UnaMentis/
├── Core/                    # Business logic (actors)
│   ├── Audio/               # AudioEngine - iOS audio I/O
│   ├── Session/             # SessionManager - conversation orchestration
│   ├── Curriculum/          # CurriculumEngine - learning materials
│   ├── Telemetry/           # TelemetryEngine - metrics & costs
│   ├── Routing/             # PatchPanel - intelligent endpoint routing
│   ├── Config/              # API keys, server discovery
│   ├── Logging/             # Remote logging
│   └── Persistence/         # Core Data models
├── Services/                # Provider implementations
│   ├── STT/                 # Speech-to-text providers
│   ├── TTS/                 # Text-to-speech providers
│   ├── LLM/                 # Language model providers
│   ├── VAD/                 # Voice activity detection
│   ├── Embeddings/          # Semantic search
│   └── Protocols/           # Service interfaces
└── UI/                      # SwiftUI views
    ├── Session/             # Live conversation interface
    ├── Curriculum/          # Topic browsing
    ├── History/             # Past sessions
    ├── Analytics/           # Performance metrics
    ├── Settings/            # Configuration
    └── Debug/               # Development tools

Key Design Principles

Actor-Based Concurrency

All services are Swift actors. An actor serializes access to its own mutable state, so the compiler rules out data races on that state.

Protocol-First Design

Each service type has a protocol, enabling easy testing and provider swapping.

Sendable Types

All types crossing actor boundaries conform to Sendable for compile-time safety.

@MainActor ViewModels

All ViewModels run on the main actor, ensuring UI updates happen safely.
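
The four principles above can be combined in one small sketch. Every name in it (Transcript, SpeechToTextService, MockSTT, SessionViewModel) is hypothetical and chosen for illustration; none is taken from the UnaMentis codebase.

```swift
/// A value that crosses actor boundaries, so it is marked Sendable.
struct Transcript: Sendable, Equatable {
    let text: String
    let isFinal: Bool
}

/// Protocol-first: callers depend on this interface, not on one provider.
protocol SpeechToTextService: Actor {
    func transcribe(_ samples: [Float]) async throws -> Transcript
}

/// A test double. A real provider would call a network or on-device model.
actor MockSTT: SpeechToTextService {
    private(set) var callCount = 0

    func transcribe(_ samples: [Float]) async throws -> Transcript {
        callCount += 1  // actor isolation serializes access to this counter
        return Transcript(text: "heard \(samples.count) samples", isFinal: true)
    }
}

/// The view model is main-actor isolated, so UI state is only mutated there.
@MainActor
final class SessionViewModel {
    private let stt: any SpeechToTextService
    private(set) var lastTranscript = ""

    init(stt: any SpeechToTextService) {
        self.stt = stt
    }

    func handle(samples: [Float]) async {
        // Hop to the service actor, then resume on the main actor.
        if let result = try? await stt.transcribe(samples) {
            lastTranscript = result.text
        }
    }
}
```

Because the view model only sees the protocol, a test can pass in MockSTT while a production build passes in a real provider.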

Voice Pipeline

The voice pipeline is the heart of UnaMentis. It handles real-time audio capture, voice activity detection, speech recognition, language model inference, and speech synthesis in a carefully orchestrated flow.

Audio Capture (AVAudioEngine) → VAD (Silero, CoreML) → STT (streaming) → LLM (streaming) → TTS (streaming) → Playback (AVAudioEngine)

Session State Machine

The SessionManager maintains a state machine for turn-taking:

  • Idle → User Speaking (user speaks)
  • User Speaking → Processing (speech ends)
  • Processing → AI Thinking (LLM responds)
  • AI Thinking → AI Speaking (TTS starts)
  • AI Speaking → Idle (complete)
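
The turn-taking cycle can be written as a pure transition function. This is a minimal sketch, assuming the five states and five triggers listed above; the type and case names are hypothetical, and the real SessionManager may model its states differently.

```swift
enum SessionState: Equatable {
    case idle, userSpeaking, processing, aiThinking, aiSpeaking
}

enum SessionEvent {
    case userStartedSpeaking, speechEnded, llmResponded, ttsStarted, playbackCompleted
}

/// Returns the next state, or nil when the event is not valid in `state`.
func nextState(from state: SessionState, on event: SessionEvent) -> SessionState? {
    switch (state, event) {
    case (.idle, .userStartedSpeaking): return .userSpeaking
    case (.userSpeaking, .speechEnded): return .processing
    case (.processing, .llmResponded): return .aiThinking
    case (.aiThinking, .ttsStarted): return .aiSpeaking
    case (.aiSpeaking, .playbackCompleted): return .idle
    default: return nil  // reject transitions outside the documented cycle
    }
}
```

Returning nil for unlisted pairs makes illegal transitions explicit, which is convenient for logging and for unit tests.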

Interruption Handling

UnaMentis supports natural interruptions. When the user starts speaking while the AI is talking, the system gracefully stops TTS playback, cancels pending audio, and transitions to processing the user's new input.
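
The interruption (barge-in) behaviour described here might look roughly like the following. This is a sketch under stated assumptions: PlaybackQueue, TurnOwner, and handleBargeIn are hypothetical names, and audio chunks are modelled as strings, where a real implementation would hold audio buffers scheduled on AVAudioEngine.

```swift
/// Hypothetical stand-in for the queue of synthesized audio awaiting playback.
struct PlaybackQueue {
    private(set) var pendingChunks: [String] = []

    mutating func enqueue(_ chunk: String) {
        pendingChunks.append(chunk)
    }

    /// Drops all unplayed audio and returns how many chunks were discarded.
    @discardableResult
    mutating func cancelAll() -> Int {
        let dropped = pendingChunks.count
        pendingChunks.removeAll()
        return dropped
    }
}

enum TurnOwner { case ai, user }

/// Called when voice activity is detected. Only acts if the AI holds the turn.
@discardableResult
func handleBargeIn(owner: inout TurnOwner, queue: inout PlaybackQueue) -> Int {
    guard owner == .ai else { return 0 }  // nothing to interrupt
    owner = .user                          // hand the turn to the user
    return queue.cancelAll()               // discard unplayed AI audio
}
```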

Core Services

AudioEngine

Core/Audio/AudioEngine.swift

Manages all iOS audio I/O with hardware voice processing optimization:

  • Hardware AEC (Acoustic Echo Cancellation)
  • AGC (Automatic Gain Control)
  • Noise Suppression
  • Real-time audio streaming
  • Multi-buffer TTS playback scheduling
  • Thermal state monitoring

SessionManager

Core/Session/SessionManager.swift (~1,367 lines)

Orchestrates voice conversation sessions:

  • Turn-taking logic with state machine
  • Natural interruption handling
  • Context management for LLM prompts
  • Session recording with transcripts
  • TTS prefetching for smooth playback
  • Word-level timing in transcripts
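
As an illustration of context management for LLM prompts, the sketch below prepends a system message carrying curriculum context and keeps only the most recent turns. ChatMessage, assemblePrompt, the prompt wording, and the fixed-count history window are all assumptions made for this example; the source does not say how SessionManager trims history.

```swift
struct ChatMessage: Equatable {
    enum Role { case system, user, assistant }
    let role: Role
    let text: String
}

/// Builds the message list for one LLM request.
/// `maxHistory` caps how many of the most recent turns are included.
func assemblePrompt(curriculumContext: String,
                    history: [ChatMessage],
                    maxHistory: Int) -> [ChatMessage] {
    let system = ChatMessage(
        role: .system,
        text: "You are a tutor. Current topic context:\n\(curriculumContext)"
    )
    // suffix(_:) keeps the newest turns and preserves their order.
    return [system] + history.suffix(max(0, maxHistory))
}
```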

CurriculumEngine

Core/Curriculum/CurriculumEngine.swift

Manages learning materials and progress:

  • Topic hierarchy navigation (unlimited depth)
  • Progress tracking with mastery scores
  • Dynamic context generation for prompts
  • Alternative explanation handling
  • Misconception detection triggers
  • Visual asset caching
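
A topic hierarchy of unlimited depth maps naturally onto a recursive value type. The sketch below is hypothetical: TopicNode is an invented name, and rolling mastery up as an unweighted mean of child scores is an assumption for illustration, not the documented scoring rule.

```swift
struct TopicNode {
    let title: String
    var mastery: Double = 0          // 0.0...1.0, meaningful for leaf topics
    var children: [TopicNode] = []   // arrays allow arbitrary nesting depth

    /// A leaf has depth 1; a parent is one deeper than its deepest child.
    var depth: Int {
        1 + (children.map(\.depth).max() ?? 0)
    }

    /// Leaf: its own score. Parent: unweighted mean of its children's scores.
    var aggregateMastery: Double {
        guard !children.isEmpty else { return mastery }
        let total = children.map(\.aggregateMastery).reduce(0, +)
        return total / Double(children.count)
    }
}
```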

TelemetryEngine

Core/Telemetry/TelemetryEngine.swift (~613 lines)

Real-time performance monitoring:

  • Latency measurement (TTFT: time to first token; TTFB: time to first byte)
  • Cost calculation per provider
  • Memory monitoring and growth tracking
  • Thermal state monitoring
  • Per-turn performance analysis
  • Aggregated session metrics
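
Per-provider cost calculation can be sketched as an actor that accumulates spend from token counts. CostTracker, the provider keys, and the per-thousand-token pricing model are assumptions for illustration; actual providers may bill by characters, audio seconds, or separate input and output rates.

```swift
actor CostTracker {
    private let usdPerThousandTokens: [String: Double]
    private var spentUSD: [String: Double] = [:]

    init(usdPerThousandTokens: [String: Double]) {
        self.usdPerThousandTokens = usdPerThousandTokens
    }

    /// Adds the cost of `tokens` for a provider. Unknown providers are ignored.
    func record(provider: String, tokens: Int) {
        guard let rate = usdPerThousandTokens[provider] else { return }
        spentUSD[provider, default: 0] += Double(tokens) / 1000 * rate
    }

    func cost(for provider: String) -> Double {
        spentUSD[provider] ?? 0
    }

    func totalCost() -> Double {
        spentUSD.values.reduce(0, +)
    }
}
```

Keeping the running totals inside an actor lets STT, LLM, and TTS services report usage concurrently without extra locking.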

Intelligent Routing (PatchPanel)

The PatchPanel service routes each LLM request to an endpoint chosen from runtime conditions, so UnaMentis can select a suitable provider automatically as device, network, and budget state change.

Routing Conditions

  • Thermal State: Switch to lighter models when the device is hot
  • Memory Pressure: Reduce model size under memory constraints
  • Battery Level: Use efficient endpoints when battery is low
  • Network Quality: Route based on latency and bandwidth
  • Cost Budget: Stay within per-session cost limits
  • Task Type: Match model to task complexity

Task Types

The router classifies requests into task types for optimal model selection:

  • Quick Response: Simple acknowledgments, short answers
  • Explanation: Teaching new concepts
  • Deep Thinking: Complex problem solving
  • Assessment: Evaluating student understanding
  • Remediation: Addressing misconceptions
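
To make the routing idea concrete, here is a rule-based sketch that combines a subset of the conditions (thermal state, battery, network, budget) with the task types above. The endpoint names, thresholds, and rule order are all hypothetical; the source does not specify PatchPanel's actual policy, and memory pressure is left out for brevity.

```swift
enum TaskKind { case quickResponse, explanation, deepThinking, assessment, remediation }

// Declared coolest to hottest so the synthesized Comparable order is meaningful.
enum ThermalLevel: Comparable { case nominal, fair, serious, critical }

enum Endpoint: Equatable { case onDeviceSmall, cloudFast, cloudLarge }

struct RoutingConditions {
    var thermal: ThermalLevel = .nominal
    var batteryLevel: Double = 1.0       // 0.0...1.0
    var networkLatencyMs: Double? = 80   // nil means offline
    var remainingBudgetUSD: Double = 1.0
}

func selectEndpoint(for task: TaskKind, given c: RoutingConditions) -> Endpoint {
    // Offline or out of budget: only the free local model is usable.
    guard let latency = c.networkLatencyMs, c.remainingBudgetUSD > 0 else {
        return .onDeviceSmall
    }
    // Hot device or low battery: prefer the lighter endpoint for any task.
    if c.thermal >= .serious || c.batteryLevel < 0.2 {
        return .cloudFast
    }
    // Otherwise match the model to the task's complexity.
    switch task {
    case .quickResponse:
        return .cloudFast
    case .explanation:
        return latency < 150 ? .cloudLarge : .cloudFast
    case .deepThinking, .assessment, .remediation:
        return .cloudLarge
    }
}
```

Ordering the checks from hard constraints to preferences keeps the policy easy to read; a production router would likely score candidates rather than use fixed rules.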

Server Components

The server-side infrastructure provides curriculum management, remote logging, and a web dashboard for monitoring and administration.

Management Server

Python 3.11+ / aiohttp / asyncio

Async HTTP server with WebSocket support running on port 8766:

  • Remote logging aggregation from mobile clients
  • Metrics history with time-series storage
  • Resource monitoring (CPU, memory, thermal)
  • Idle state management for power efficiency
  • WebSocket streaming for real-time updates

Curriculum Database

PostgreSQL / File-based / JSON

Storage and retrieval for UMLCF curriculum documents:

  • File-based storage for development
  • PostgreSQL support for production deployments
  • Full-text search across curriculum content
  • Topic hierarchy navigation APIs
  • Metadata indexing for filtering

Web Dashboard

Next.js / React / TypeScript

Administration and monitoring interface:

  • Curriculum browsing and management
  • Session analytics visualization
  • Real-time metrics streaming
  • Provider health monitoring
  • Configuration management

Data Flow

During a typical conversation turn, data moves through the system as follows:

  1. Audio Capture: AudioEngine captures microphone input at 16 kHz and applies hardware AEC/AGC/NS
  2. Voice Detection: Silero VAD (CoreML) detects speech start and end with confidence scores
  3. Streaming STT: Audio streams to the STT provider; partial transcripts arrive in real time
  4. Context Assembly: SessionManager builds the LLM prompt from curriculum context and conversation history
  5. LLM Streaming: PatchPanel routes to the optimal endpoint; tokens stream back as they are generated
  6. Sentence Buffering: Tokens accumulate until a sentence boundary is detected
  7. TTS Synthesis: Complete sentences are sent to TTS; audio chunks stream back
  8. Audio Playback: AudioEngine schedules buffers for seamless playback while the next sentence synthesizes
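
The sentence buffering step is the bridge between token streaming and TTS, so a small sketch may help. SentenceBuffer is a hypothetical name, and the boundary rule (split on ".", "!" or "?") is deliberately naive: it would mis-split abbreviations, decimals, and ellipses, which a real implementation would need to handle.

```swift
struct SentenceBuffer {
    private var pending = ""

    /// Appends one streamed token and returns any sentences it completed.
    mutating func append(_ token: String) -> [String] {
        var completed: [String] = []
        for character in token {
            pending.append(character)
            if character == "." || character == "!" || character == "?" {
                completed.append(Self.trimmingLeadingSpace(pending))
                pending = ""
            }
        }
        return completed
    }

    /// Returns leftover text when the LLM stream ends without a terminator.
    mutating func flush() -> String? {
        let rest = Self.trimmingLeadingSpace(pending)
        pending = ""
        return rest.isEmpty ? nil : rest
    }

    private static func trimmingLeadingSpace(_ text: String) -> String {
        String(text.drop(while: { $0.isWhitespace }))
    }
}
```

Each string returned by append can be sent to TTS immediately, while later tokens keep accumulating for the next sentence.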

Performance Targets

UnaMentis is designed to meet aggressive performance targets for natural conversation:

  • Median Turn Latency (<500 ms): Time from the user finishing speaking to first audio from the AI
  • P99 Turn Latency (<1000 ms): 99th percentile turn latency for a consistent experience
  • Session Stability (90+ min): Continuous operation without crashes or degradation
  • Memory Growth (<50 MB): Maximum memory increase over a 90-minute session
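
The two latency targets can be checked from recorded per-turn latencies. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here; the function names are hypothetical, and the source does not say how TelemetryEngine computes percentiles.

```swift
/// Nearest-rank percentile: the smallest sample with at least p% of
/// samples at or below it. Returns nil for empty input or p outside 0...100.
func percentile(_ values: [Double], _ p: Double) -> Double? {
    guard !values.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = values.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]  // ranks are 1-based; p = 0 maps to the minimum
}

/// True when the median is under 500 ms and P99 is under 1000 ms.
func meetsLatencyTargets(turnLatenciesMs: [Double]) -> Bool {
    guard let median = percentile(turnLatenciesMs, 50),
          let p99 = percentile(turnLatenciesMs, 99) else { return false }
    return median < 500 && p99 < 1000
}
```

With only a handful of turns, P99 under this definition is simply the slowest turn, so the check becomes meaningful only over longer sessions.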

Optimization Techniques

  • Streaming Everywhere: STT, LLM, and TTS all stream to minimize time-to-first-byte
  • Sentence Pipelining: Start TTS on sentence N while LLM generates sentence N+1
  • Audio Prefetching: Buffer multiple audio chunks ahead for seamless playback
  • Thermal Monitoring: Automatically reduce load when device heats up
  • Memory Management: Careful lifecycle management of audio buffers
  • Connection Pooling: Reuse WebSocket/HTTP connections across requests