System Overview
UnaMentis is designed as a modular, provider-agnostic voice AI tutoring platform. The architecture separates concerns into distinct layers, making it easy to swap providers, add new features, and maintain high performance.
Mobile Client Architecture (iOS)
The mobile client targets iOS and is written in Swift 6.0 with strict concurrency enabled. All services are implemented as actors to ensure thread safety, and every type crossing an actor boundary is Sendable. A web client (Next.js) provides browser-based voice tutoring, and an Android app is in development.
Directory Structure
UnaMentis/
├── Core/ # Business logic (actors)
│ ├── Audio/ # AudioEngine - iOS audio I/O
│ ├── Session/ # SessionManager - conversation orchestration
│ ├── Curriculum/ # CurriculumEngine - learning materials
│ ├── Telemetry/ # TelemetryEngine - metrics & costs
│ ├── Routing/ # PatchPanel - intelligent endpoint routing
│ ├── Config/ # API keys, server discovery
│ ├── Logging/ # Remote logging
│ └── Persistence/ # Core Data models
├── Services/ # Provider implementations
│ ├── STT/ # Speech-to-text (9 providers)
│ ├── TTS/ # Text-to-speech (7 providers)
│ ├── LLM/ # Language models (5 providers)
│ ├── VAD/ # Voice activity detection
│ ├── Embeddings/ # Semantic search
│ └── Protocols/ # Service interfaces
├── Intents/ # Siri & App Intents
│ ├── StartLessonIntent # "Hey Siri, start a lesson"
│ ├── ResumeLearningIntent # "Hey Siri, resume my lesson"
│ └── ShowProgressIntent # "Hey Siri, show my progress"
└── UI/ # SwiftUI views
├── Session/ # Live conversation interface
├── Curriculum/ # Topic browsing
├── History/ # Past sessions
├── Analytics/ # Performance metrics
├── Settings/ # Configuration
└── Debug/ # Development tools
Key Design Principles
Actor-Based Concurrency
All services are Swift actors, providing automatic thread safety and eliminating data races.
Protocol-First Design
Each service type has a protocol, enabling easy testing and provider swapping.
Sendable Types
All types crossing actor boundaries conform to Sendable for compile-time safety.
@MainActor ViewModels
All ViewModels run on the main actor, ensuring UI updates happen safely.
Siri & App Intents
Deep integration with Siri and iOS Shortcuts enables hands-free control. Start lessons, resume where you left off, or check progress using voice commands or automation.
Voice Commands
"Hey Siri, start a lesson on calculus" or "resume my lesson" launches directly into tutoring.
Shortcuts Integration
Build custom automations that incorporate UnaMentis into your learning routine.
Apple Watch Companion
The Apple Watch companion app, installed automatically alongside the iOS app, provides a control plane during active tutoring sessions. It uses WatchConnectivity for real-time state synchronization and supports extended runtime for 90+ minute sessions.
Session Controls
Mute/unmute microphone, pause/resume session, and stop session directly from your wrist.
At-a-Glance Progress
Circular progress gauge showing lesson completion, current topic, and session mode indicator.
Curriculum Auto-Continuation
Seamless topic-to-topic transitions enable uninterrupted learning sessions. When a topic completes, the next topic begins automatically with a brief audio announcement.
Pre-generation
At 70% progress through a topic, the system begins generating audio for the next topic in the background, ensuring zero delay at transition.
Segment Caching
All audio segments for the current topic are cached locally (up to 50MB), enabling instant replay and navigation.
Navigation Controls
Go back one segment, replay the current topic, or skip ahead to the next topic using on-screen controls.
User Control
Auto-continuation is enabled by default but can be toggled off in settings for manual topic progression.
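The auto-continuation behavior above can be sketched roughly as follows. This is an illustrative, language-agnostic sketch, not the actual CurriculumEngine code: the class, method names, and the idea of tracking pre-generated topics in a set are all assumptions; only the 70% threshold and the auto-advance behavior come from the description above.

```python
PREGEN_THRESHOLD = 0.70  # fraction of topic progress that triggers pre-generation


class TopicScheduler:
    def __init__(self, topics, auto_continue=True):
        self.topics = topics          # ordered list of topic IDs
        self.index = 0
        self.auto_continue = auto_continue
        self.pregenerated = set()     # topics whose audio is already synthesized

    def on_progress(self, fraction):
        """Called as playback advances through the current topic."""
        next_topic = self.peek_next()
        if (fraction >= PREGEN_THRESHOLD
                and next_topic is not None
                and next_topic not in self.pregenerated):
            self.pregenerate(next_topic)

    def on_topic_complete(self):
        """Advance to the next topic if auto-continuation is enabled."""
        if self.auto_continue and self.peek_next() is not None:
            self.index += 1
            return self.topics[self.index]   # caller announces and plays it
        return None

    def peek_next(self):
        if self.index + 1 < len(self.topics):
            return self.topics[self.index + 1]
        return None

    def pregenerate(self, topic):
        # Stand-in for kicking off background TTS synthesis of the topic.
        self.pregenerated.add(topic)
```

Because pre-generation starts at 70% progress, the next topic's audio is normally ready well before the current one finishes, which is what makes the zero-delay transition possible.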
Visual Content Support
Rich visual content enhances voice-based learning with diagrams, equations, and images.
LaTeX Rendering
Mathematical formulas render using SwiftMath with Unicode fallback for unsupported expressions.
Timed Display
Visual assets appear at specific points during audio playback, synchronized to the curriculum structure.
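As an illustration of the Unicode-fallback idea mentioned above, a minimal substitution pass might look like this. The command table and function are invented for the example; SwiftMath performs the real rendering, and this only shows the fallback concept.

```python
# Map a few common LaTeX commands to Unicode equivalents so a formula
# that cannot be rendered natively still reads sensibly as plain text.
UNICODE_MAP = {
    r"\alpha": "α", r"\beta": "β", r"\pi": "π",
    r"\times": "×", r"\leq": "≤", r"\geq": "≥",
    r"\rightarrow": "→", r"\infty": "∞",
}


def unicode_fallback(latex: str) -> str:
    """Replace supported LaTeX commands; leave everything else as-is."""
    # Replace longer commands first so shorter ones don't match inside them.
    for cmd in sorted(UNICODE_MAP, key=len, reverse=True):
        latex = latex.replace(cmd, UNICODE_MAP[cmd])
    return latex
```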
Web Client Architecture
The web client provides browser-based voice tutoring using Next.js 15+ and React 19. It is the planned path to every platform beyond the native iOS and Android apps.
Features
Real-time Voice
OpenAI Realtime API via WebRTC for low-latency conversations with streaming audio.
Curriculum Browser
Full UMCF content navigation with hierarchy, visual assets, and LaTeX rendering.
Responsive Design
Desktop and mobile optimized with light/dark theme support.
Cost Tracking
Real-time session cost display for transparency during conversations.
Provider Support
| Component | Provider |
|---|---|
| STT | OpenAI Realtime, Deepgram, AssemblyAI, Groq, Self-hosted |
| TTS | OpenAI Realtime, ElevenLabs, Self-hosted |
| LLM | OpenAI Realtime, Anthropic Claude, Groq |
Next.js 15+ / React 19 / TypeScript / Tailwind CSS
Voice Pipeline
The voice pipeline is the heart of UnaMentis. It handles real-time audio capture, voice activity detection, speech recognition, language model inference, and speech synthesis in a carefully orchestrated flow.
Session State Machine
The SessionManager maintains a state machine for turn-taking: each turn moves through listening, processing, and speaking phases, with transitions driven by VAD events, LLM streaming, and audio playback completion.
Interruption Handling
UnaMentis supports natural interruptions. When the user starts speaking while the AI is talking, the system gracefully stops TTS playback, cancels pending audio, and transitions to processing the user's new input.
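A minimal sketch of such a turn-taking machine with barge-in, assuming a simplified four-state model (the real SessionManager's state set and transitions are richer):

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()        # waiting for the user to speak
    LISTENING = auto()   # VAD detected speech, STT streaming
    PROCESSING = auto()  # LLM generating a response
    SPEAKING = auto()    # TTS audio playing


class TurnMachine:
    def __init__(self):
        self.state = State.IDLE

    def on_speech_start(self):
        if self.state == State.SPEAKING:
            # Barge-in: stop TTS, drop pending audio, handle the new input.
            self.stop_tts_playback()
        self.state = State.LISTENING

    def on_speech_end(self):
        if self.state == State.LISTENING:
            self.state = State.PROCESSING

    def on_response_audio_ready(self):
        if self.state == State.PROCESSING:
            self.state = State.SPEAKING

    def on_playback_finished(self):
        if self.state == State.SPEAKING:
            self.state = State.IDLE

    def stop_tts_playback(self):
        pass  # stand-in for cancelling synthesis and flushing audio buffers
```

The key property is that `on_speech_start` is valid in the SPEAKING state: the user interrupting is a normal transition, not an error.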
Provider Stack
UnaMentis supports multiple deployment configurations, from fully on-device processing to cloud-based services, with intelligent fallback between them.
Kyutai Pocket: On-Device Neural TTS
We've standardized on Kyutai Pocket as our primary on-device TTS engine. This 100M-parameter neural model marks a step change for on-device speech synthesis, delivering natural, expressive voices with zero network latency. Unlike system TTS (AVSpeechSynthesizer), Kyutai Pocket produces human-quality speech entirely on-device, eliminating the traditional trade-off between privacy and voice quality.
- 100M parameters: Compact enough for mobile, powerful enough for natural speech
- Zero network latency: No server round-trip, instant voice synthesis
- Complete privacy: Audio never leaves the device
- Battery efficient: Optimized for extended mobile sessions
On-Device Capabilities
Run entirely on-device for maximum privacy and offline use:
| Component | Technology |
|---|---|
| VAD | Silero (Core ML) |
| STT | Apple Speech, GLM-ASR (Whisper + GLM-ASR-Nano) |
| LLM | Ministral-3B-Instruct (primary), TinyLlama-1.1B (fallback) via llama.cpp |
| TTS | Kyutai Pocket (100M neural, primary), Apple AVSpeechSynthesizer (fallback) |
Self-Hosted Options
Run your own servers for privacy-first deployments:
| Component | Technology |
|---|---|
| LLM | Ollama, llama.cpp server, vLLM |
| TTS | Kyutai 1.6B (40+ voices), Chatterbox (emotion control, voice cloning), Fish Speech (30+ languages), VibeVoice, Piper |
| STT | GLM-ASR server (port 11401, WebSocket streaming) |
Cloud Fallback
High-quality cloud services when needed:
| Component | Provider |
|---|---|
| LLM | Anthropic Claude 3.5 Sonnet, OpenAI GPT-4o, GPT-4o-mini, Realtime API |
| STT | Deepgram Nova-3, AssemblyAI, OpenAI Whisper, Groq Whisper |
| TTS | ElevenLabs Turbo v2.5, Deepgram Aura-2, OpenAI Realtime |
Privacy Architecture
UnaMentis offers three privacy tiers, allowing users to choose their preferred balance between capability and data privacy. Each tier uses different providers with different data handling characteristics.
Tier 1: Maximum Privacy
On-Device Only
All processing happens locally on your device. No audio or text ever leaves your phone.
- STT: Apple Speech Framework
- TTS: Kyutai Pocket (100M neural, primary), AVSpeechSynthesizer (fallback)
- LLM: On-Device (Ministral-3B, TinyLlama)
Tier 2: High Privacy
Self-Hosted Servers
Data stays on servers you control. Ideal for organizations with compliance requirements.
- STT: Whisper.cpp, faster-whisper
- TTS: Kyutai 1.6B, Chatterbox, Fish Speech, VibeVoice, Piper
- LLM: Ollama, llama.cpp, vLLM
Tier 3: Standard Privacy
Cloud with DPAs
Commercial cloud providers with data processing agreements and enterprise-grade security.
- STT: Deepgram, AssemblyAI, Groq
- TTS: ElevenLabs, Deepgram
- LLM: OpenAI, Anthropic
Users can mix tiers based on their needs: for example, running on-device STT for privacy while using cloud TTS for voice quality. The provider-agnostic architecture makes these combinations seamless.
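A hypothetical per-component configuration illustrating a mixed-tier setup (the actual configuration format, keys, and provider identifiers are not shown in this document; these are invented for the example):

```python
PRIVACY_TIERS = {
    "on_device": 1,     # Tier 1: Maximum Privacy
    "self_hosted": 2,   # Tier 2: High Privacy
    "cloud": 3,         # Tier 3: Standard Privacy
}

# Example: on-device STT for privacy, cloud TTS for voice quality.
pipeline_config = {
    "stt": {"tier": "on_device",   "provider": "apple_speech"},
    "tts": {"tier": "cloud",       "provider": "elevenlabs"},
    "llm": {"tier": "self_hosted", "provider": "ollama"},
}


def effective_tier(config):
    """A session is only as private as its least-private component."""
    return max(PRIVACY_TIERS[c["tier"]] for c in config.values())
```

One useful consequence of modeling it this way: the UI can honestly report the session's effective privacy tier as the weakest component's tier.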
Core Services
AudioEngine
Core/Audio/AudioEngine.swift
Manages all iOS audio I/O with hardware voice processing optimization:
- Hardware AEC (Acoustic Echo Cancellation)
- AGC (Automatic Gain Control)
- Noise Suppression
- Real-time audio streaming
- Multi-buffer TTS playback scheduling
- Thermal state monitoring
SessionManager
Core/Session/SessionManager.swift (~1,367 lines)
Orchestrates voice conversation sessions:
- Turn-taking logic with state machine
- Natural interruption handling
- Context management for LLM prompts
- Session recording with transcripts
- TTS prefetching for smooth playback
- Word-level timing in transcripts
CurriculumEngine
Core/Curriculum/CurriculumEngine.swift
Manages learning materials and progress:
- Topic hierarchy navigation (unlimited depth)
- Progress tracking with mastery scores
- Dynamic context generation for prompts
- Alternative explanation handling
- Misconception detection triggers
- Visual asset caching
TelemetryEngine
Core/Telemetry/TelemetryEngine.swift (~613 lines)
Real-time performance monitoring:
- Latency measurement (TTFT, TTFB)
- Cost calculation per provider
- Memory monitoring and growth tracking
- Thermal state monitoring
- Per-turn performance analysis
- Aggregated session metrics
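Per-turn latency and cost bookkeeping of this kind can be sketched as follows. The field names and the token price are illustrative, not UnaMentis's actual values; only the TTFT concept comes from the list above.

```python
import time


class TurnMetrics:
    def __init__(self):
        self.turn_start = None
        self.first_token_at = None
        self.llm_tokens = 0

    def start_turn(self):
        self.turn_start = time.monotonic()

    def on_token(self):
        # Record the arrival of the first streamed token exactly once.
        if self.first_token_at is None:
            self.first_token_at = time.monotonic()
        self.llm_tokens += 1

    @property
    def ttft(self):
        """Time to first token, in seconds (None until a token arrives)."""
        if self.turn_start is None or self.first_token_at is None:
            return None
        return self.first_token_at - self.turn_start

    def cost(self, usd_per_1k_tokens=0.0006):  # illustrative price, not a real rate
        return self.llm_tokens / 1000 * usd_per_1k_tokens
```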
Intelligent Routing (PatchPanel)
The PatchPanel service provides intelligent LLM endpoint routing based on runtime conditions. This allows UnaMentis to automatically select the best provider based on the current situation.
Routing Conditions
Thermal State
Switch to lighter models when device is hot
Memory Pressure
Reduce model size under memory constraints
Battery Level
Use efficient endpoints when battery is low
Network Quality
Route based on latency and bandwidth
Cost Budget
Stay within per-session cost limits
Task Type
Match model to task complexity
Task Types
The router classifies requests into task types for optimal model selection:
- Quick Response: Simple acknowledgments, short answers
- Explanation: Teaching new concepts
- Deep Thinking: Complex problem solving
- Assessment: Evaluating student understanding
- Remediation: Addressing misconceptions
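A toy version of such a routing policy, combining the conditions and task types above. Endpoint names, priority order, and thresholds are made up for illustration; the real PatchPanel policy is richer.

```python
def choose_endpoint(task_type, thermal_hot, memory_pressure,
                    battery_low, budget_remaining_usd):
    """Pick an LLM endpoint given runtime conditions and task type."""
    # Device under stress: prefer the smallest local model.
    if thermal_hot or memory_pressure:
        return "on_device_small"
    # Low battery: avoid on-device inference; use an efficient cloud model.
    if battery_low:
        return "cloud_mini"
    # Session budget exhausted: stay local regardless of task type.
    if budget_remaining_usd <= 0:
        return "on_device_primary"
    # Otherwise match model capability to task complexity.
    if task_type in ("deep_thinking", "remediation"):
        return "cloud_frontier"
    if task_type in ("explanation", "assessment"):
        return "self_hosted_llm"
    return "on_device_primary"  # quick_response and anything else
```

The point of the sketch is the precedence: device-health conditions override task-type matching, so a hot phone never gets routed to a heavy model just because the question is hard.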
Server Components
The server-side infrastructure provides curriculum management, remote logging, and a web dashboard for monitoring and administration.
Management Server
Python 3.11+ / aiohttp / asyncio
Async HTTP server with WebSocket support running on port 8766:
- Remote logging aggregation from mobile clients
- Metrics history with time-series storage
- Resource monitoring (CPU, memory, thermal)
- Idle state management for power efficiency
- WebSocket streaming for real-time updates
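The metrics-history idea can be illustrated with a small in-memory time-series buffer; the server's real storage layer may differ, and the bounded deque is an assumption of this sketch.

```python
import time
from collections import deque


class MetricsHistory:
    def __init__(self, max_points=10_000):
        # Oldest samples are evicted automatically once the buffer is full.
        self.points = deque(maxlen=max_points)  # (timestamp, name, value)

    def record(self, name, value, ts=None):
        self.points.append((ts if ts is not None else time.time(), name, value))

    def query(self, name, since=0.0):
        """All samples of one metric recorded at or after `since`."""
        return [(t, v) for (t, n, v) in self.points if n == name and t >= since]
```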
Curriculum Database
SQLite / Python / UMCF JSON
Storage and retrieval for UMCF curriculum documents:
- SQLite database with shared access
- Plugin-based import from MIT OCW, CK-12, EngageNY, MERLOT
- Full-text search across curriculum content
- AI enrichment pipeline for content processing
- Visual asset management and caching
Operations Console
Next.js 16 / React 19 / TypeScript
Administration and monitoring interface:
- Curriculum Studio for viewing and editing content
- Service status monitoring (Ollama, Chatterbox, VibeVoice, Piper, Gateway)
- Real-time metrics and performance tracking
- Plugin Manager for content source configuration
- Logs and diagnostics with real-time filtering
- Voice Lab: AI model selection, TTS experimentation, and batch processing profiles
Authentication System
JWT / RFC 9700 Token Rotation
Secure multi-device authentication with token rotation:
- JWT access tokens with 15-minute expiry
- Refresh token rotation per RFC 9700 with reuse detection
- Device fingerprinting for multi-device support
- Session management with remote termination
- Rate limiting on auth endpoints
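Refresh-token rotation with reuse detection can be sketched like this, in the spirit of RFC 9700. Signing, persistence, and device binding are omitted, and the class and method names are illustrative; only the rotation bookkeeping is shown.

```python
import secrets


class RefreshTokenStore:
    def __init__(self):
        self.active = {}    # currently valid token -> session_id
        self.retired = {}   # previously issued token -> session_id

    def issue(self, session_id):
        token = secrets.token_urlsafe(32)
        self.active[token] = session_id
        return token

    def rotate(self, token):
        """Exchange a refresh token for a new one; detect reuse."""
        if token in self.retired:
            # A retired token was presented again: assume theft and
            # terminate the whole session (reuse detection).
            self.revoke_session(self.retired[token])
            raise PermissionError("refresh token reuse detected")
        session_id = self.active.pop(token, None)
        if session_id is None:
            raise PermissionError("unknown refresh token")
        self.retired[token] = session_id
        return self.issue(session_id)

    def revoke_session(self, session_id):
        self.active = {t: s for t, s in self.active.items() if s != session_id}
```

Note that reuse of an old token revokes the session's current token too, which is what makes rotation effective against a stolen refresh token.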
USM Core (Service Manager)
Rust / Tokio / Axum / Port 8787
Cross-platform service manager for development infrastructure. Manages templates and instances of backend services with real-time process monitoring.
- Template-based: Service definitions with variable substitution for multiple instances
- Real-time monitoring: CPU, memory, and status via WebSocket (under 50ms latency)
- HTTP REST API: Start, stop, restart services programmatically
- C FFI bindings: Swift and Python integration for native apps
- Platform support: macOS (libproc), Linux (procfs)
TTS Caching & Session Management
Python / async / Filesystem Cache
A global audio caching system enables efficient multi-user deployments. When one user generates audio for a curriculum segment, that audio is cached and instantly available to all other users with the same voice configuration.
- Cross-user caching: Cache keys use text + voice + provider (no user ID), so identical requests share cached audio
- Priority-based generation: Live requests (user waiting) get highest priority; background pre-generation uses separate resource pools
- Session state: Per-user playback position, voice preferences, and cross-device resume support
- Scheduled deployments: Administrators can pre-generate entire curricula overnight for zero-latency playback
Performance: 1,155 requests/second with 50 concurrent cache hits at 35ms average latency. Corporate training example: 500 employees can start the same course simultaneously with 100% cache hits.
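The cross-user cache-key scheme can be illustrated as follows. The hash choice and field separator are assumptions of this sketch; the essential property, taken from the description above, is that the key deliberately excludes any user identifier.

```python
import hashlib


def tts_cache_key(text: str, voice: str, provider: str) -> str:
    """Derive a cache key from text + voice + provider only (no user ID)."""
    # An unambiguous separator prevents ("ab", "c") colliding with ("a", "bc").
    payload = "\x1f".join([provider, voice, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because two users requesting the same segment with the same voice configuration derive the same key, the second request is a pure cache hit.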
Self-Hosted Server Support
Connect to local/LAN servers for zero-cost inference:
| Server | Port | Purpose |
|---|---|---|
| Ollama | 11434 | LLM inference (primary) |
| llama.cpp | 8080 | LLM inference |
| vLLM | 8000 | High-throughput LLM |
| GLM-ASR | 11401 | STT (WebSocket streaming) |
| whisper.cpp / faster-whisper | 8080 | Self-hosted STT (OpenAI-compatible) |
| Chatterbox TTS | 8004 | Expressive TTS with emotion control, voice cloning |
| VibeVoice TTS | 11403 | Real-time TTS |
| Piper TTS | 11402 | Lightweight TTS |
| UnaMentis Gateway | 11400 | Unified API gateway |
Data Flow
Understanding how data moves through the system during a typical conversation turn:
1. Audio Capture: AudioEngine captures microphone input at 16kHz and applies hardware AEC/AGC/NS
2. Voice Detection: Silero VAD (Core ML) detects speech start/end with confidence scores
3. Streaming STT: Audio streams to the STT provider; partial transcripts arrive in real-time
4. Context Assembly: SessionManager builds the LLM prompt with curriculum context and conversation history
5. LLM Streaming: PatchPanel routes to the optimal endpoint; tokens stream back as they are generated
6. Sentence Buffering: Tokens accumulate until a sentence boundary is detected
7. TTS Synthesis: Complete sentences are sent to TTS; audio chunks stream back
8. Audio Playback: AudioEngine schedules buffers for seamless playback while the next sentence synthesizes
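The sentence-buffering step above can be sketched as follows. The boundary heuristic here is deliberately simple (a production detector must handle abbreviations, decimals, and so on), and the class is invented for illustration.

```python
import re

# A sentence ends at ., !, or ? (optionally followed by closing quotes or
# brackets) and then whitespace.
SENTENCE_END = re.compile(r"([.!?][\"')\]]*)\s")


class SentenceBuffer:
    def __init__(self):
        self.buf = ""

    def feed(self, token: str):
        """Add a streamed token; return any complete sentences ready for TTS."""
        self.buf += token
        sentences = []
        while True:
            m = SENTENCE_END.search(self.buf)
            if m is None:
                break
            sentences.append(self.buf[:m.end(1)].strip())
            self.buf = self.buf[m.end():]
        return sentences

    def flush(self):
        """At end of the LLM stream, emit whatever remains."""
        rest, self.buf = self.buf.strip(), ""
        return [rest] if rest else []
```

Feeding tokens through such a buffer is what lets TTS start on sentence N while the LLM is still generating sentence N+1.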
Performance Targets
UnaMentis is designed to meet aggressive performance targets for natural conversation:
- Time from the user finishing speaking to the first audio from the AI
- 99th percentile turn latency for a consistent experience
- Continuous operation without crashes or degradation
- Maximum memory increase over a 90-minute session
Optimization Techniques
- Streaming Everywhere: STT, LLM, and TTS all stream to minimize time-to-first-byte
- Sentence Pipelining: Start TTS on sentence N while LLM generates sentence N+1
- Audio Prefetching: Buffer multiple audio chunks ahead for seamless playback
- Thermal Monitoring: Automatically reduce load when device heats up
- Memory Management: Careful lifecycle management of audio buffers
- Connection Pooling: Reuse WebSocket/HTTP connections across requests