Our Philosophy: Experience-Driven AI Development

UnaMentis is built with AI-assisted development from the ground up, but this is not "vibecoding." The project founder brings over 30 years of experience in technology, the majority spent as a developer, with years contributing to open source projects. Every architectural decision, every tool selection, and every quality standard is informed by decades of real-world software engineering experience.

The goal is ambitious: use AI not just to move faster, but to approach the quality and review standards achieved by a thoughtful, attentive human developer. We have deep respect for that standard. A skilled developer bringing full attention to code review, architecture decisions, and quality assurance is not easily replicated by any single tool. However, when six or seven layers of AI-driven tools and processes work together, each with overlapping review and complementary perspectives, the cumulative effect can begin to approximate that level of rigor. We are documenting an ongoing experiment in what becomes possible when deep experience guides this layered approach with intention.

A Living Story

This page documents an evolving journey. We use Claude Code as our primary development partner, supplemented by a carefully chosen ecosystem of AI-powered tools. As new capabilities emerge and our understanding deepens, we continuously review and adapt our approach. What you read here represents our current state, our intentions, and our commitment to improvement.

AI handles the repetitive, error-prone aspects of software development while human experience guides architecture, quality standards, and the creative problem-solving that makes UnaMentis unique. The combination enables a small team to build and maintain a sophisticated, multi-platform voice AI tutoring system.

Shift Left

Catch issues at commit time, not in production. AI helps enforce quality standards before code ever leaves the developer's machine.

Automate Everything

Humans should not do what machines can do better. Every manual quality check becomes an automated gate.

Measure Continuously

You cannot improve what you do not measure. AI-powered observability gives us real-time insight into code quality and performance.

Learn from Data

DORA metrics and quality dashboards guide engineering decisions, creating a feedback loop that continuously improves our process.

AI Tools We Use

Our AI-assisted development workflow combines multiple specialized tools, each chosen for its strength in a specific domain. Together, they form a comprehensive system that touches every aspect of our development process.

Claude Code

Primary Development Partner

Our primary AI coding assistant for:

  • Code generation and refactoring
  • Architecture design and review
  • Documentation writing
  • Test creation and debugging
  • Cross-platform development (iOS, Web, Server)

CodeRabbit

Automated PR Review

AI-powered code review on every pull request:

  • Language-specific analysis (Swift, Python, TypeScript)
  • Concurrency safety checks for Swift 6
  • Security vulnerability detection
  • Architecture diagram generation
  • Free for open source projects

Intelligent Automation

CI/CD & Quality Gates

Automated quality enforcement:

  • Pre-commit hooks for linting and formatting
  • Renovate for dependency management
  • CodeQL for security analysis
  • Gitleaks for secrets detection
  • DevLake for DORA metrics

AI-Assisted Development Workflow

Design (Claude Code) → Code (Claude Code) → Pre-Commit (Hooks) → AI Review (CodeRabbit) → Security (CodeQL) → Metrics (DevLake)

The Code Quality Initiative

To achieve enterprise-grade quality with a small team, we implemented a systematic 5-phase Code Quality Initiative. Each phase builds on the previous, creating layers of automated protection that catch issues progressively earlier in the development cycle.

The Impact

This infrastructure enables a team of 2 people to maintain quality standards typically requiring 10+ engineers, while preserving the agility and velocity that makes small teams effective. Every commit passes the same quality checks. Every PR gets reviewed by AI. Every deployment is monitored.

Key Achievements

All of the following capabilities are implemented:

  • Pre-commit quality gates: issues caught before commit
  • Hook bypass auditing: detects when quality checks are skipped
  • Automated dependency management: zero manual dependency tracking
  • 80% code coverage enforcement: CI fails below threshold
  • Performance regression detection: automated latency monitoring
  • Security scanning: secrets, CodeQL, dependency audits
  • Feature flag lifecycle: safe rollouts with cleanup tracking
  • DORA metrics & observability: engineering health visibility
  • AI-powered code review: every PR reviewed by CodeRabbit
  • Mutation testing: weekly test quality validation
  • Chaos engineering: voice pipeline resilience testing

Phase 1: Foundation

The foundation phase automates existing manual quality gates across iOS, Server, and Web components. The goal: make quality enforcement invisible and unavoidable.

Pre-Commit Hooks

Pre-commit hooks are automated checkpoints that run every time a developer tries to save code changes. Think of them like a spell-checker that runs automatically before you can send an email. If the code has formatting problems, style violations, or accidentally included passwords, the save is blocked until those issues are fixed. This catches problems at the earliest possible moment, when they are easiest and cheapest to fix.
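
The core of such a hook fits in a few lines. Here is an illustrative Python sketch (the actual UnaMentis hooks invoke SwiftLint, Ruff, ESLint/Prettier, and Gitleaks; the commands below are examples, not the project's real configuration):

```python
"""Minimal pre-commit hook sketch: run each check, block the commit on failure."""
import subprocess
import sys

def run_checks(checks: list[list[str]]) -> bool:
    """Run each check command; return True only if every one exits cleanly."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-commit: {' '.join(cmd)} failed -- commit blocked")
            return False
    return True

if __name__ == "__main__":
    # Illustrative commands only; the real hook system covers all three platforms.
    checks = [
        ["ruff", "check", "."],               # Python lint
        ["gitleaks", "protect", "--staged"],  # secrets scan on staged changes
    ]
    sys.exit(0 if run_checks(checks) else 1)
```

Git invokes the script before each commit; a non-zero exit code aborts the commit, which is exactly the "spell-checker" behavior described above.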

Our unified hook system runs automatically before every commit, completing in under 30 seconds while checking code across all three platforms:

Swift (iOS)

SwiftLint enforces coding standards in strict mode. It catches potential bugs, ensures consistent style, and flags unsafe patterns like force-unwrapping that could cause crashes.

Python (Server)

Ruff checks for errors, potential bugs, and style consistency. It is significantly faster than older tools and catches issues like unused imports, undefined variables, and security vulnerabilities.

JavaScript/TypeScript (Web)

ESLint identifies problematic patterns and bugs, while Prettier automatically formats code. Together they keep the web codebase consistent and catch common React mistakes.

Secrets Detection

Gitleaks scans every file for accidentally included API keys, passwords, or access tokens. Committing secrets to code is a serious security risk; this prevents that mistake before it happens.

Hook Bypass Auditing

Developers can skip pre-commit checks in emergencies using a special flag. While sometimes necessary, frequent bypasses indicate a problem. Our audit system tracks every bypass, creating visibility into whether quality gates are being circumvented and enabling conversations about why.

Dependency Automation (Renovate)

Manual dependency tracking is eliminated. Renovate handles everything automatically:

  • Schedule: Updates run Mondays before 6am, minimizing disruption
  • Grouping: iOS, Python, and npm dependencies grouped separately for focused review
  • Auto-merge: Security patches, patch updates, and dev dependencies merge automatically
  • Manual review: Major version updates and breaking changes require human approval

Coverage Enforcement

Code coverage measures how much of our code is actually tested by our automated tests. When we say "80% coverage," it means that when all our tests run, they exercise at least 80% of the code paths in the application. The remaining 20% represents code that is not directly tested, which could hide undetected bugs.

We treat coverage as a hard gate, not a suggestion. If the iOS codebase drops below 80% coverage, the build automatically fails and cannot proceed. This forces new code to include tests and prevents gradual erosion of test quality over time. Coverage is extracted automatically from Xcode test results, so enforcement is completely automated.
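
The gate itself is a small comparison. A sketch, assuming the line-coverage fraction has already been extracted from the Xcode test results (the function name is illustrative, not the project's actual script):

```python
COVERAGE_THRESHOLD = 0.80  # the hard gate from the Code Quality Initiative

def coverage_gate(line_rate: float, threshold: float = COVERAGE_THRESHOLD) -> bool:
    """Return True when coverage meets the gate; a False return fails the CI job."""
    if line_rate < threshold:
        print(f"FAIL: coverage {line_rate:.1%} is below the {threshold:.0%} gate")
        return False
    print(f"OK: coverage {line_rate:.1%} meets the {threshold:.0%} gate")
    return True
```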

Phase 2: Enhanced Quality Gates

Phase 2 extends quality enforcement with nightly testing, performance regression detection, and comprehensive security scanning.

Nightly End-to-End Testing

Every night at 2am UTC, comprehensive end-to-end tests run against the full system:

  • iOS E2E tests with real API keys (from GitHub Secrets)
  • Latency regression tests using the provider comparison suite
  • Full voice pipeline validation
  • Automatic GitHub issue creation on failure with "nightly-failure" label

Performance Regression Detection

Voice applications live and die by latency. Our latency test harness ensures we never ship a slower release:

  • P50 target: 500ms median end-to-end turn latency
  • P99 target: 1000ms latency ceiling at the 99th percentile
  • Warning: a +10% regression over baseline triggers a warning
  • Failure: a +20% regression blocks CI
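
These thresholds reduce to a simple classification. A sketch of the comparison (the real harness aggregates many measurement runs before comparing; this function is illustrative):

```python
def classify_latency(baseline_ms: float, current_ms: float,
                     warn_at: float = 0.10, fail_at: float = 0.20) -> str:
    """Classify a latency measurement against its baseline: 'ok', 'warn', or 'fail'."""
    delta = (current_ms - baseline_ms) / baseline_ms
    if delta >= fail_at:
        return "fail"  # CI blocks the change at +20%
    if delta >= warn_at:
        return "warn"  # flagged for review at +10%
    return "ok"
```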

Multi-Layer Security Scanning

Security is not a single check. It is a layered defense where multiple specialized tools each look for different types of problems. No single tool catches everything, but together they provide comprehensive protection:

All four scanners run on every PR and on a weekly schedule:

  • Gitleaks: scans the entire code history for accidentally committed passwords, API keys, or tokens. Even if a secret was added and then deleted, Gitleaks finds it.
  • CodeQL: GitHub's static analysis engine reads code without running it, finding security vulnerabilities, bugs, and dangerous patterns in Swift, Python, and JavaScript. It catches issues like SQL injection, cross-site scripting, and unsafe data handling.
  • pip-audit: checks Python dependencies against known vulnerability databases. If any library we use has a published security flaw, this catches it before deployment.
  • npm audit: the same as pip-audit, but for JavaScript packages. Web applications often have hundreds of dependencies; this ensures none of them have known security issues.

Phase 3: Feature Flag System

Feature flags are on/off switches in code that control whether a feature is visible to users. They let us deploy new code to production but keep it hidden until we are ready to reveal it. This is powerful for several reasons:

  • Safe rollouts: Enable a new feature for 1% of users first, watch for problems, then gradually expand to everyone.
  • Instant rollback: If a new feature causes problems, disable it with a configuration change instead of deploying a code fix.
  • A/B testing: Show different experiences to different users and measure which performs better.
  • Operational control: Disable expensive features during high traffic periods or outages.
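
A gradual rollout boils down to deterministic bucketing. A minimal sketch of the idea (Unleash's actual activation strategies are richer than this; the flag name below is hypothetical):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Assign each (flag, user) pair a stable bucket from 0-99; users whose
    bucket falls below the rollout percentage see the feature."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent
```

Because the bucket is derived from a hash, a given user's result is stable across sessions, and raising the percentage only ever adds users to the rollout.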

Self-Hosted Unleash Infrastructure

We run our own feature flag system using Unleash, an open-source platform; commercial alternatives like LaunchDarkly charge $75+ per month for comparable capability. Self-hosting gives us full control over our data and zero ongoing licensing costs:

Unleash Server

Port 4242: Core flag management and administration interface.

Unleash Proxy

Port 3063: Edge proxy for client SDK connections with caching.

iOS SDK

Actor-based service with SwiftUI view modifier for seamless integration.

Web SDK

React context and hooks (useFlag, useFlagVariant) for web components.

Flag Lifecycle Management

Feature flags have a lifecycle. A flag created for a gradual rollout should be removed once the feature is fully launched. Forgotten flags accumulate as "flag debt," making code harder to understand and maintain. Our automated audit system prevents this:

  • Ownership: Every flag has a designated owner and a target removal date set when the flag is created.
  • Expiration enforcement: Weekly automated scans detect flags older than 90 days and create cleanup tickets.
  • Early warning: CI automatically creates GitHub issues when flags approach their expiration date.
  • Visibility: Any code change that adds, modifies, or removes a flag triggers an automatic comment on the pull request for review.
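
The weekly expiration scan is conceptually simple. A sketch (the flag-metadata shape here is hypothetical; the real audit reads flag definitions from the codebase and Unleash):

```python
from datetime import date, timedelta

MAX_FLAG_AGE_DAYS = 90  # the expiration limit described above

def expired_flags(created_on: dict[str, date], today: date) -> list[str]:
    """Return flags past the 90-day limit; each one gets an automated cleanup ticket."""
    cutoff = today - timedelta(days=MAX_FLAG_AGE_DAYS)
    return sorted(name for name, created in created_on.items() if created < cutoff)
```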

Phase 4: Observability & DORA Metrics

You cannot improve what you do not measure. Phase 4 provides visibility into quality trends and engineering health through industry-standard DORA metrics.

DORA Metrics (Apache DevLake)

DORA (DevOps Research and Assessment) is the largest and longest-running research program studying what makes software teams effective. The research, published in the book Accelerate by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, identified four key metrics that distinguish elite engineering teams from the rest. We track all four:

  • Deployment Frequency (elite target: multiple per day): how often new code reaches users. High frequency means small, safe changes; low frequency means big, risky releases.
  • Lead Time for Changes (elite target: under 1 hour): time from a developer writing code to users seeing it. Short lead times mean the team can respond quickly to feedback and fix issues fast.
  • Change Failure Rate (elite target: 0-15%): the percentage of deployments that cause problems requiring fixes. This measures whether speed is coming at the cost of quality.
  • Mean Time to Recovery (elite target: under 1 hour): when something breaks, how quickly is it fixed? This measures resilience and incident response capability.
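
As an illustration of the kind of computation behind these dashboards, lead time reduces to commit-to-deployment deltas (DevLake's actual data model is considerably more involved; this sketch is ours):

```python
from datetime import datetime
from statistics import median

def median_lead_time_hours(changes: list[tuple[datetime, datetime]]) -> float:
    """Median hours from commit to deployment across a set of changes."""
    return median((deployed - committed).total_seconds() / 3600
                  for committed, deployed in changes)
```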

Research shows teams that excel at these metrics are twice as likely to exceed their organizational performance goals and report significantly higher customer satisfaction. We use Apache DevLake, an open-source platform, to collect and visualize these metrics automatically from our GitHub repositories.

Quality Dashboard

Daily automated metrics collection provides ongoing visibility:

  • CI/CD success rates across iOS, Server, and Web pipelines
  • Pull request metrics (count, average size, review time)
  • Bug metrics (open count, closed in 30 days, age distribution)
  • Trend analysis with 90-day retention for pattern detection

Phase 5: AI-Powered Code Review

The final phase brings AI into the review process. Every pull request receives automated analysis from CodeRabbit, configured for maximum issue detection with language-specific rules for our tech stack.

Review Configuration

CodeRabbit is configured in "assertive" mode for comprehensive coverage:

Swift Reviews

  • Swift 6.0 concurrency safety verification
  • Actor isolation violation detection
  • Sendable conformance checks
  • Data race identification in async code
  • Memory leak and retain cycle detection
  • Force unwrap usage analysis

Python Reviews

  • Async/await usage patterns
  • Exception handling completeness
  • Type hint coverage
  • Security vulnerability scanning
  • aiohttp-specific best practices

TypeScript/React Reviews

  • React hook dependency arrays
  • Server/client component boundaries
  • Accessibility (a11y) compliance
  • Next.js App Router patterns
  • Type safety enforcement

CI/CD Reviews

  • GitHub Action version pinning
  • Permissions scope verification
  • Secrets handling review
  • Cache configuration optimization
  • Workflow efficiency suggestions

Cost: Free for Open Source

CodeRabbit provides this enterprise-grade AI review capability free for open source projects. The same service costs $24-30 per seat per month for private repositories, making this a significant value for the UnaMentis project.

Results & The Road Ahead

The Code Quality Initiative is an ongoing journey. All five phases are now substantially complete, and refinement continues. Here is where we stand:

Current Quality Gates

  • Code Coverage: 80% minimum; CI fails if below
  • Latency P50: 500ms; warns at +10%, fails at +20%
  • Latency P99: 1000ms; warns at +10%, fails at +20%
  • Lint (all languages): zero violations; pre-commit hook blocks
  • Secrets Detection: zero findings; pre-commit and CI block
  • Hook Bypass: logged and audited; weekly audit report
  • Feature Flag Age: 90 days maximum; weekly audit creates issues
  • Security Vulnerabilities: zero critical/high; security workflow blocks
  • Mutation Score: 70%+ target; weekly validation

Implemented Advanced Features

Mutation Testing

Code coverage tells you that tests ran certain lines of code, but not whether those tests would actually catch bugs. Mutation testing answers a harder question: if we deliberately introduce bugs, do the tests detect them? It works by making small changes to the code (like replacing + with -, or changing true to false) and checking if any test fails. If no test catches the mutation, that reveals a gap in test quality. We run mutation testing weekly using mutmut (Python), Stryker (Web), and Muter (Swift).
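
The core idea fits in a few lines. A toy illustration (the real tools mutate entire projects and run full test suites; everything named here is ours, not mutmut's API):

```python
def mutant_survives(source: str, test) -> bool:
    """Introduce one bug (the first + becomes -), run the test against the
    mutated code, and report whether the mutant slips through undetected."""
    namespace: dict = {}
    exec(source.replace("+", "-", 1), namespace)
    try:
        test(namespace)
        return True   # test passed against buggy code: a gap in test quality
    except AssertionError:
        return False  # test caught the mutation: the suite earns its coverage

ADD_SRC = "def add(a, b):\n    return a + b\n"

def weak_test(ns):
    assert ns["add"](0, 0) == 0   # also true of a - b, so the mutant survives

def strong_test(ns):
    assert ns["add"](2, 3) == 5   # false for a - b, so the mutant is killed
```

Both tests give `add` 100% line coverage, yet only the strong one would catch the injected bug, which is exactly the gap mutation testing exposes.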

Chaos Engineering

Voice applications fail differently than traditional apps. When a webpage fails, users see an error. When a voice app fails, users experience silence or confusion. Our chaos engineering runbook deliberately introduces failures to verify the system handles them gracefully: network degradation (high latency, packet loss), API timeouts, provider failures, and resource pressure. We test these scenarios to ensure users get clear feedback instead of mysterious silence.
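
A sketch of the injection idea behind such scenarios (a hypothetical wrapper of our own, not the project's runbook tooling):

```python
import random
import time

def with_chaos(stage, extra_latency_ms=0.0, failure_rate=0.0, rng=None):
    """Wrap a pipeline stage with injected latency and random provider failures."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        time.sleep(extra_latency_ms / 1000)        # simulate network degradation
        if rng.random() < failure_rate:
            raise TimeoutError("injected provider failure")
        return stage(*args, **kwargs)
    return wrapped
```

Wrapping, say, a transcription call this way lets a resilience test verify that the surrounding pipeline surfaces a clear error to the user instead of silence.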

Planned Advanced Features

Contract Testing

Ensures iOS client and Server API stay in sync using Pact. Deferred until APIs stabilize.

Predictive Alerts

Move from reactive to proactive: detect performance degradation trends before they impact users.

This Story Continues

AI-assisted development is not a destination. It is an evolving practice. As new tools emerge and our understanding deepens, we will continue to push the boundaries of what a small team can accomplish with intelligent automation. This page will be updated as our journey continues.