System Overview

UnaMentis is designed as a modular, provider-agnostic voice AI tutoring platform. The architecture separates concerns into distinct layers, making it easy to swap providers, add new features, and maintain high performance.

  • UI Layer: SwiftUI views, session interface, curriculum navigator, analytics dashboard
  • Core Business Logic: SessionManager, CurriculumEngine, TelemetryEngine, PatchPanel Router
  • Service Layer: protocol-based actors for STT, TTS, LLM, VAD, and Embeddings
  • Infrastructure: AudioEngine, Core Data, URLSession, CoreML

Mobile Client Architecture (iOS)

The mobile client is implemented for iOS using Swift 6.0 with strict concurrency. A web client (Next.js) provides browser-based voice tutoring, and an Android app is in development. All services are implemented as actors to ensure thread safety, and types crossing actor boundaries are Sendable.

Directory Structure

UnaMentis/
├── Core/                    # Business logic (actors)
│   ├── Audio/               # AudioEngine - iOS audio I/O
│   ├── Session/             # SessionManager - conversation orchestration
│   ├── Curriculum/          # CurriculumEngine - learning materials
│   ├── Telemetry/           # TelemetryEngine - metrics & costs
│   ├── Routing/             # PatchPanel - intelligent endpoint routing
│   ├── Config/              # API keys, server discovery
│   ├── Logging/             # Remote logging
│   └── Persistence/         # Core Data models
├── Services/                # Provider implementations
│   ├── STT/                 # Speech-to-text (9 providers)
│   ├── TTS/                 # Text-to-speech (7 providers)
│   ├── LLM/                 # Language models (5 providers)
│   ├── VAD/                 # Voice activity detection
│   ├── Embeddings/          # Semantic search
│   └── Protocols/           # Service interfaces
├── Intents/                 # Siri & App Intents
│   ├── StartLessonIntent    # "Hey Siri, start a lesson"
│   ├── ResumeLearningIntent # "Hey Siri, resume my lesson"
│   └── ShowProgressIntent   # "Hey Siri, show my progress"
└── UI/                      # SwiftUI views
    ├── Session/             # Live conversation interface
    ├── Curriculum/          # Topic browsing
    ├── History/             # Past sessions
    ├── Analytics/           # Performance metrics
    ├── Settings/            # Configuration
    └── Debug/               # Development tools

Key Design Principles

Actor-Based Concurrency

All services are Swift actors, providing automatic thread safety and eliminating data races.

Protocol-First Design

Each service type has a protocol, enabling easy testing and provider swapping.
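As an illustration of this principle (sketched in Python for brevity; in the app itself these are Swift protocols conformed to by actors, and every name below is hypothetical), call sites depend only on the interface, so providers swap freely:

```python
from abc import ABC, abstractmethod

class STTService(ABC):
    """Interface every speech-to-text provider implements (hypothetical names)."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class OnDeviceSTT(STTService):
    def transcribe(self, audio: bytes) -> str:
        return f"[on-device transcript of {len(audio)} bytes]"

class CloudSTT(STTService):
    def transcribe(self, audio: bytes) -> str:
        return f"[cloud transcript of {len(audio)} bytes]"

def run_turn(stt: STTService, audio: bytes) -> str:
    # Call sites depend only on the protocol, never on a concrete provider,
    # so tests can inject a stub and production can swap providers at runtime.
    return stt.transcribe(audio)
```

The same shape also makes unit testing straightforward: a mock conforming to the protocol stands in for any provider.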

Sendable Types

All types crossing actor boundaries conform to Sendable for compile-time safety.

@MainActor ViewModels

All ViewModels run on the main actor, ensuring UI updates happen safely.

Siri & App Intents

Deep integration with Siri and iOS Shortcuts enables hands-free control. Start lessons, resume where you left off, or check progress using voice commands or automation.

Voice Commands

"Hey Siri, start a lesson on calculus" or "resume my lesson" launches directly into tutoring.

Shortcuts Integration

Build custom automations that incorporate UnaMentis into your learning routine.

Apple Watch Companion

The Apple Watch companion app, which installs automatically alongside the iOS app, provides a control plane during active tutoring sessions. It uses WatchConnectivity for real-time state synchronization and supports extended runtime during 90+ minute sessions.

Session Controls

Mute/unmute microphone, pause/resume session, and stop session directly from your wrist.

At-a-Glance Progress

Circular progress gauge showing lesson completion, current topic, and session mode indicator.

Curriculum Auto-Continuation

Seamless topic-to-topic transitions enable uninterrupted learning sessions. When a topic completes, the next topic begins automatically with a brief audio announcement.

Pre-generation

At 70% progress through a topic, the system begins generating audio for the next topic in the background, ensuring zero delay at transition.
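A minimal sketch of that trigger logic (Python for illustration; the class shape and names are hypothetical, only the 70% threshold comes from the text above):

```python
PREGEN_THRESHOLD = 0.70  # pre-generation starts at 70% progress through a topic

class AutoContinuation:
    """Minimal sketch of the pre-generation trigger (hypothetical shape)."""
    def __init__(self, topics):
        self.topics = topics
        self.index = 0              # currently playing topic
        self.pregen_started = set() # topics already being generated

    def on_progress(self, fraction: float):
        """Called as playback advances through the current topic.
        Returns the next topic to start generating in the background, or None."""
        nxt = self.index + 1
        if (fraction >= PREGEN_THRESHOLD
                and nxt < len(self.topics)
                and nxt not in self.pregen_started):
            self.pregen_started.add(nxt)  # fire only once per topic
            return self.topics[nxt]
        return None
```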

Segment Caching

All audio segments for the current topic are cached locally (up to 50MB), enabling instant replay and navigation.

Navigation Controls

Go back one segment, replay the current topic, or skip ahead to the next topic using on-screen controls.

User Control

Auto-continuation is enabled by default but can be toggled off in settings for manual topic progression.

Visual Content Support

Rich visual content enhances voice-based learning with diagrams, equations, and images.

LaTeX Rendering

Mathematical formulas render using SwiftMath with Unicode fallback for unsupported expressions.

Timed Display

Visual assets appear at specific points during audio playback, synchronized to the curriculum structure.
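One way to resolve which asset to show at a given playback time, assuming a schedule of (start_time, asset) pairs sorted by time (a hypothetical data shape, sketched in Python):

```python
import bisect

def asset_for_time(schedule, t):
    """Return the visual asset scheduled at or before playback time t (seconds).
    `schedule` is a list of (start_time, asset_id) pairs sorted by start_time."""
    times = [start for start, _ in schedule]
    i = bisect.bisect_right(times, t) - 1  # last entry starting at or before t
    return schedule[i][1] if i >= 0 else None
```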

Web Client Architecture

The web client provides browser-based voice tutoring using Next.js 15+ and React 19. It is the planned path to all platforms beyond the native iOS and Android apps.

Features

Real-time Voice

OpenAI Realtime API via WebRTC for low-latency conversations with streaming audio.

Curriculum Browser

Full UMCF content navigation with hierarchy, visual assets, and LaTeX rendering.

Responsive Design

Desktop and mobile optimized with light/dark theme support.

Cost Tracking

Real-time session cost display for transparency during conversations.

Provider Support

Component Provider
STT OpenAI Realtime, Deepgram, AssemblyAI, Groq, Self-hosted
TTS OpenAI Realtime, ElevenLabs, Self-hosted
LLM OpenAI Realtime, Anthropic Claude, Groq

Stack: Next.js 15+ / React 19 / TypeScript / Tailwind CSS

Voice Pipeline

The voice pipeline is the heart of UnaMentis. It handles real-time audio capture, voice activity detection, speech recognition, language model inference, and speech synthesis in a carefully orchestrated flow.

🎤 Audio Capture (AVAudioEngine) → 👂 VAD (Silero, CoreML) → 📝 STT (streaming) → 🧠 LLM (streaming) → 🔊 TTS (streaming) → 🎧 Playback (AVAudioEngine)

Session State Machine

The SessionManager maintains a state machine for turn-taking:

Idle → (user speaks) → User Speaking → (speech ends) → Processing → (LLM responds) → AI Thinking → (TTS starts) → AI Speaking → (complete) → Idle

Interruption Handling

UnaMentis supports natural interruptions. When the user starts speaking while the AI is talking, the system gracefully stops TTS playback, cancels pending audio, and transitions to processing the user's new input.
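The turn-taking states and the interruption path can be sketched as a transition table (a Python sketch; state and event names here are illustrative, not the SessionManager's actual identifiers):

```python
# States and transitions from the state machine above; the last entry is the
# barge-in path: the user speaks while the AI is still talking.
TRANSITIONS = {
    ("idle", "user_speaks"): "user_speaking",
    ("user_speaking", "speech_ends"): "processing",
    ("processing", "llm_responds"): "ai_thinking",
    ("ai_thinking", "tts_starts"): "ai_speaking",
    ("ai_speaking", "complete"): "idle",
    ("ai_speaking", "user_speaks"): "user_speaking",  # interruption
}

def step(state, event):
    """Advance the state machine; events invalid in the current state are ignored."""
    return TRANSITIONS.get((state, event), state)
```

On the interruption transition the real system additionally stops TTS playback and cancels pending audio before processing the new input.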

Provider Stack

UnaMentis supports multiple deployment configurations, from fully on-device processing to cloud-based services, with intelligent fallback between them.

Kyutai Pocket: On-Device Neural TTS

We've standardized on Kyutai Pocket as our primary on-device TTS engine. This 100M parameter neural model represents a paradigm shift for on-device speech synthesis, delivering natural, expressive voices with zero network latency. Unlike system TTS (AVSpeechSynthesizer), Kyutai Pocket produces human-quality speech entirely on-device, eliminating the traditional trade-off between privacy and voice quality.

  • 100M parameters: Compact enough for mobile, powerful enough for natural speech
  • Zero latency: No network round-trip, instant voice synthesis
  • Complete privacy: Audio never leaves the device
  • Battery efficient: Optimized for extended mobile sessions

On-Device Capabilities

Run entirely on-device for maximum privacy and offline use:

Component Technology
VAD Silero (Core ML)
STT Apple Speech, GLM-ASR (Whisper + GLM-ASR-Nano)
LLM Ministral-3B-Instruct (primary), TinyLlama-1.1B (fallback) via llama.cpp
TTS Kyutai Pocket (100M neural, primary), Apple AVSpeechSynthesizer (fallback)

Self-Hosted Options

Run your own servers for privacy-first deployments:

Component Technology
LLM Ollama, llama.cpp server, vLLM
TTS Kyutai 1.6B (40+ voices), Chatterbox (emotion control, voice cloning), Fish Speech (30+ languages), VibeVoice, Piper
STT GLM-ASR server (port 11401, WebSocket streaming)

Cloud Fallback

High-quality cloud services when needed:

Component Provider
LLM Anthropic Claude 3.5 Sonnet, OpenAI GPT-4o, GPT-4o-mini, Realtime API
STT Deepgram Nova-3, AssemblyAI, OpenAI Whisper, Groq Whisper
TTS ElevenLabs Turbo v2.5, Deepgram Aura-2, OpenAI Realtime

Privacy Architecture

UnaMentis offers three privacy tiers, allowing users to choose their preferred balance between capability and data privacy. Each tier uses different providers with different data handling characteristics.

Tier 1: Maximum Privacy

On-Device Only

All processing happens locally on your device. No audio or text ever leaves your phone.

  • STT: Apple Speech Framework
  • TTS: Kyutai Pocket (100M neural, primary), AVSpeechSynthesizer (fallback)
  • LLM: On-Device (Ministral-3B, TinyLlama)

Tier 2: High Privacy

Self-Hosted Servers

Data stays on servers you control. Ideal for organizations with compliance requirements.

  • STT: Whisper.cpp, faster-whisper
  • TTS: Kyutai 1.6B, Chatterbox, Fish Speech, VibeVoice, Piper
  • LLM: Ollama, llama.cpp, vLLM

Tier 3: Standard Privacy

Cloud with DPAs

Commercial cloud providers with data processing agreements and enterprise-grade security.

  • STT: Deepgram, AssemblyAI, Groq
  • TTS: ElevenLabs, Deepgram
  • LLM: OpenAI, Anthropic

Users can mix tiers based on their needs, for example using on-device STT for privacy alongside cloud TTS for voice quality. The provider-agnostic architecture makes these combinations seamless.
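A sketch of such a mixed configuration (Python; the provider identifiers are illustrative shorthand for the tier lists above):

```python
# One representative provider per component per tier, drawn from the tier
# descriptions above (illustrative subset, not an exhaustive menu).
TIERS = {
    1: {"stt": "apple_speech",   "tts": "kyutai_pocket", "llm": "ministral_3b"},
    2: {"stt": "faster_whisper", "tts": "kyutai_1_6b",   "llm": "ollama"},
    3: {"stt": "deepgram",       "tts": "elevenlabs",    "llm": "anthropic"},
}

def build_config(stt_tier=1, tts_tier=1, llm_tier=1):
    """Mix tiers per component, e.g. on-device STT with cloud TTS."""
    return {
        "stt": TIERS[stt_tier]["stt"],
        "tts": TIERS[tts_tier]["tts"],
        "llm": TIERS[llm_tier]["llm"],
    }
```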

Core Services

AudioEngine

Core/Audio/AudioEngine.swift

Manages all iOS audio I/O with hardware voice processing optimization:

  • Hardware AEC (Acoustic Echo Cancellation)
  • AGC (Automatic Gain Control)
  • Noise Suppression
  • Real-time audio streaming
  • Multi-buffer TTS playback scheduling
  • Thermal state monitoring

SessionManager

Core/Session/SessionManager.swift (~1,367 lines)

Orchestrates voice conversation sessions:

  • Turn-taking logic with state machine
  • Natural interruption handling
  • Context management for LLM prompts
  • Session recording with transcripts
  • TTS prefetching for smooth playback
  • Word-level timing in transcripts

CurriculumEngine

Core/Curriculum/CurriculumEngine.swift

Manages learning materials and progress:

  • Topic hierarchy navigation (unlimited depth)
  • Progress tracking with mastery scores
  • Dynamic context generation for prompts
  • Alternative explanation handling
  • Misconception detection triggers
  • Visual asset caching

TelemetryEngine

Core/Telemetry/TelemetryEngine.swift (~613 lines)

Real-time performance monitoring:

  • Latency measurement (TTFT, TTFB)
  • Cost calculation per provider
  • Memory monitoring and growth tracking
  • Thermal state monitoring
  • Per-turn performance analysis
  • Aggregated session metrics
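The median and P99 latency figures reported in aggregated session metrics can be computed with a simple nearest-rank percentile (a Python sketch, not the engine's actual implementation):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (p in 0..100)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```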

Intelligent Routing (PatchPanel)

The PatchPanel service provides intelligent LLM endpoint routing based on runtime conditions. This allows UnaMentis to automatically select the best provider based on the current situation.

Routing Conditions

  • 🌡️ Thermal State: switch to lighter models when the device is hot
  • 💾 Memory Pressure: reduce model size under memory constraints
  • 🔋 Battery Level: use efficient endpoints when battery is low
  • 📶 Network Quality: route based on latency and bandwidth
  • 💰 Cost Budget: stay within per-session cost limits
  • 🎯 Task Type: match model to task complexity

Task Types

The router classifies requests into task types for optimal model selection:

  • Quick Response: Simple acknowledgments, short answers
  • Explanation: Teaching new concepts
  • Deep Thinking: Complex problem solving
  • Assessment: Evaluating student understanding
  • Remediation: Addressing misconceptions
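A toy routing policy combining the conditions and task types above (Python sketch; the thresholds, endpoint names, and exact precedence are hypothetical, not the PatchPanel's real rules):

```python
def route(task, thermal_hot=False, memory_pressure=False, battery_low=False,
          budget_remaining=1.0):
    """Pick an endpoint class from runtime conditions and task type (toy policy)."""
    if thermal_hot or memory_pressure or battery_low:
        return "on_device_small"   # shed load under device constraints
    if budget_remaining <= 0.0:
        return "on_device_small"   # stay within the per-session cost budget
    if task in ("deep_thinking", "remediation"):
        return "cloud_frontier"    # hardest tasks go to the strongest model
    if task in ("explanation", "assessment"):
        return "cloud_mid"
    return "on_device_small"       # quick responses stay local
```

Note that device constraints take precedence over task complexity here: a hot device downgrades even a deep-thinking request.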

Server Components

The server-side infrastructure provides curriculum management, remote logging, and a web dashboard for monitoring and administration.

Management Server

Python 3.11+ / aiohttp / asyncio

Async HTTP server with WebSocket support running on port 8766:

  • Remote logging aggregation from mobile clients
  • Metrics history with time-series storage
  • Resource monitoring (CPU, memory, thermal)
  • Idle state management for power efficiency
  • WebSocket streaming for real-time updates

Curriculum Database

SQLite / Python / UMCF JSON

Storage and retrieval for UMCF curriculum documents:

  • SQLite database with shared access
  • Plugin-based import from MIT OCW, CK-12, EngageNY, MERLOT
  • Full-text search across curriculum content
  • AI enrichment pipeline for content processing
  • Visual asset management and caching

Operations Console

Next.js 16 / React 19 / TypeScript

Administration and monitoring interface:

  • Curriculum Studio for viewing and editing content
  • Service status monitoring (Ollama, Chatterbox, VibeVoice, Piper, Gateway)
  • Real-time metrics and performance tracking
  • Plugin Manager for content source configuration
  • Logs and diagnostics with real-time filtering
  • Voice Lab: AI model selection, TTS experimentation, and batch processing profiles

Authentication System

JWT / RFC 9700 Token Rotation

Secure multi-device authentication with token rotation:

  • JWT access tokens with 15-minute expiry
  • Refresh token rotation per RFC 9700 with reuse detection
  • Device fingerprinting for multi-device support
  • Session management with remote termination
  • Rate limiting on auth endpoints
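The rotation-with-reuse-detection flow can be sketched as follows (Python; a simplification of the behavior RFC 9700 describes, with hypothetical class and field names):

```python
import secrets

class RefreshTokenStore:
    """Refresh token rotation with reuse detection (simplified sketch)."""
    def __init__(self):
        self.active = {}   # token -> session_id
        self.retired = {}  # rotated-out token -> session_id

    def issue(self, session_id):
        token = secrets.token_urlsafe(32)
        self.active[token] = session_id
        return token

    def rotate(self, token):
        """Exchange a refresh token for a new one; detect replay of old tokens."""
        if token in self.retired:
            # Reuse detected: someone replayed a rotated-out token, so revoke
            # every active token belonging to that session's family.
            sid = self.retired[token]
            self.active = {t: s for t, s in self.active.items() if s != sid}
            raise PermissionError("refresh token reuse detected; session revoked")
        sid = self.active.pop(token)   # KeyError if the token was never issued
        self.retired[token] = sid
        return self.issue(sid)
```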

USM Core (Service Manager)

Rust / Tokio / Axum / Port 8787

Cross-platform service manager for development infrastructure. Manages templates and instances of backend services with real-time process monitoring.

  • Template-based: Service definitions with variable substitution for multiple instances
  • Real-time monitoring: CPU, memory, and status via WebSocket (under 50ms latency)
  • HTTP REST API: Start, stop, restart services programmatically
  • C FFI bindings: Swift and Python integration for native apps
  • Platform support: macOS (libproc), Linux (procfs)

TTS Caching & Session Management

Python / async / Filesystem Cache

A global audio caching system enables efficient multi-user deployments. When one user generates audio for a curriculum segment, that audio is cached and instantly available to all other users with the same voice configuration.

  • Cross-user caching: Cache keys use text + voice + provider (no user ID), so identical requests share cached audio
  • Priority-based generation: Live requests (user waiting) get highest priority; background pre-generation uses separate resource pools
  • Session state: Per-user playback position, voice preferences, and cross-device resume support
  • Scheduled deployments: Administrators can pre-generate entire curricula overnight for zero-latency playback

Performance: 1,155 requests/second with 50 concurrent cache hits at 35ms average latency. Corporate training example: 500 employees can start the same course simultaneously with 100% cache hits.
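The cross-user cache key can be sketched as a hash over text, voice, and provider only (Python; the real key derivation may differ, the point is the deliberate absence of any user ID):

```python
import hashlib

def cache_key(text, voice, provider):
    """Derive a cache key from text + voice + provider only. Because no user ID
    is included, identical requests from different users share cached audio."""
    payload = "\x1f".join((provider, voice, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```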

Self-Hosted Server Support

Connect to local/LAN servers for zero-cost inference:

Server Port Purpose
Ollama 11434 LLM inference (primary)
llama.cpp 8080 LLM inference
vLLM 8000 High-throughput LLM
GLM-ASR 11401 STT (WebSocket streaming)
whisper.cpp / faster-whisper 8080 Self-hosted STT (OpenAI-compatible)
Chatterbox TTS 8004 Expressive TTS with emotion control, voice cloning
VibeVoice TTS 11403 Real-time TTS
Piper TTS 11402 Lightweight TTS
UnaMentis Gateway 11400 Unified API gateway

Data Flow

Understanding how data moves through the system during a typical conversation turn:

1. Audio Capture: AudioEngine captures microphone input at 16kHz and applies hardware AEC/AGC/NS
2. Voice Detection: Silero VAD (CoreML) detects speech start/end with confidence scores
3. Streaming STT: audio streams to the STT provider; partial transcripts arrive in real time
4. Context Assembly: SessionManager builds the LLM prompt with curriculum context and conversation history
5. LLM Streaming: PatchPanel routes to the optimal endpoint; tokens stream back as they are generated
6. Sentence Buffering: tokens accumulate until a sentence boundary is detected
7. TTS Synthesis: complete sentences are sent to TTS; audio chunks stream back
8. Audio Playback: AudioEngine schedules buffers for seamless playback while the next sentence synthesizes

Performance Targets

UnaMentis is designed to meet aggressive performance targets for natural conversation:

  • <500ms median turn latency: time from the user finishing speaking to first audio from the AI
  • <1000ms P99 turn latency: 99th-percentile latency for a consistent experience
  • 90+ min session stability: continuous operation without crashes or degradation
  • <50MB memory growth: maximum memory increase over a 90-minute session

Optimization Techniques

  • Streaming Everywhere: STT, LLM, and TTS all stream to minimize time-to-first-byte
  • Sentence Pipelining: Start TTS on sentence N while LLM generates sentence N+1
  • Audio Prefetching: Buffer multiple audio chunks ahead for seamless playback
  • Thermal Monitoring: Automatically reduce load when device heats up
  • Memory Management: Careful lifecycle management of audio buffers
  • Connection Pooling: Reuse WebSocket/HTTP connections across requests
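Sentence pipelining hinges on splitting completed sentences off the streaming token buffer as soon as a boundary appears, so TTS can start while the LLM is still generating. A deliberately naive sketch (Python; real boundary detection must handle abbreviations, decimals, and similar cases):

```python
import re

SENTENCE_END = re.compile(r'([.!?])\s')  # terminator followed by whitespace

def flush_sentences(buffer):
    """Split completed sentences off a streaming token buffer.
    Returns (completed_sentences, remaining_buffer); the remainder stays
    buffered until more tokens arrive."""
    sentences = []
    while True:
        m = SENTENCE_END.search(buffer)
        if not m:
            return sentences, buffer
        end = m.end(1)
        sentences.append(buffer[:end].strip())
        buffer = buffer[end:].lstrip()
```

Each completed sentence would be handed to TTS immediately, overlapping synthesis of sentence N with generation of sentence N+1.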