Our Philosophy: Experience-Driven AI Development
UnaMentis is built with AI-assisted development from the ground up, but this is not "vibecoding." The project founder brings over 30 years of experience in technology, the majority spent as a developer, with years contributing to open source projects. Every architectural decision, every tool selection, and every quality standard is informed by decades of real-world software engineering experience.
The goal is ambitious: use AI not just to move faster, but to approach the quality and review standards achieved by a thoughtful, attentive human developer. We have deep respect for that standard. A skilled developer bringing full attention to code review, architecture decisions, and quality assurance is not easily replicated by any single tool. However, when six or seven layers of AI-driven tools and processes work together, each with overlapping review and complementary perspectives, the cumulative effect can begin to approximate that level of rigor. We are documenting an ongoing experiment in what becomes possible when deep experience guides this layered approach with intention.
A Living Story
This page documents an evolving journey. We use Claude Code as our primary development partner, supplemented by a carefully chosen ecosystem of AI-powered tools. As new capabilities emerge and our understanding deepens, we continuously review and adapt our approach. What you read here represents our current state, our intentions, and our commitment to improvement.
AI handles the repetitive, error-prone aspects of software development while human experience guides architecture, quality standards, and the creative problem-solving that makes UnaMentis unique. The combination enables a small team to build and maintain a sophisticated, multi-platform voice AI tutoring system.
Shift Left
Catch issues at commit time, not in production. AI helps enforce quality standards before code ever leaves the developer's machine.
Automate Everything
Humans should not do what machines can do better. Every manual quality check becomes an automated gate.
Measure Continuously
You cannot improve what you do not measure. AI-powered observability gives us real-time insight into code quality and performance.
Learn from Data
DORA metrics and quality dashboards guide engineering decisions, creating a feedback loop that continuously improves our process.
AI Tools We Use
Our AI-assisted development workflow combines multiple specialized tools, each chosen for its strength in a specific domain. Together, they form a comprehensive system that touches every aspect of our development process.
Claude Code
Primary Development Partner
Our primary AI coding assistant for:
- Code generation and refactoring
- Architecture design and review
- Documentation writing
- Test creation and debugging
- Cross-platform development (iOS, Web, Server)
CodeRabbit
Automated PR Review
AI-powered code review on every pull request:
- Language-specific analysis (Swift, Python, TypeScript)
- Concurrency safety checks for Swift 6
- Security vulnerability detection
- Architecture diagram generation
- Free for open source projects
Intelligent Automation
CI/CD & Quality Gates
Automated quality enforcement:
- Pre-commit hooks for linting and formatting
- Renovate for dependency management
- CodeQL for security analysis
- Gitleaks for secrets detection
- DevLake for DORA metrics
AI-Assisted Development Workflow
The Code Quality Initiative
To achieve enterprise-grade quality with a small team, we implemented a systematic 5-phase Code Quality Initiative. Each phase builds on the previous, creating layers of automated protection that catch issues progressively earlier in the development cycle.
The Impact
This infrastructure enables a two-person team to maintain quality standards typically requiring 10+ engineers, while preserving the agility and velocity that make small teams effective. Every commit passes the same quality checks. Every PR gets reviewed by AI. Every deployment is monitored.
Key Achievements
| Capability | Status | Impact |
|---|---|---|
| Pre-commit quality gates | Implemented | Issues caught before commit |
| Hook bypass auditing | Implemented | Detects when quality checks are skipped |
| Automated dependency management | Implemented | Zero manual dependency tracking |
| 80% code coverage enforcement | Implemented | CI fails below threshold |
| Performance regression detection | Implemented | Automated latency monitoring |
| Security scanning | Implemented | Secrets, CodeQL, dependency audits |
| Feature flag lifecycle | Implemented | Safe rollouts with cleanup tracking |
| DORA metrics & observability | Implemented | Engineering health visibility |
| AI-powered code review | Implemented | Every PR reviewed by CodeRabbit |
| Mutation testing | Implemented | Weekly test quality validation |
| Chaos engineering | Implemented | Voice pipeline resilience testing |
Phase 1: Foundation
The foundation phase automates existing manual quality gates across iOS, Server, and Web components. The goal: make quality enforcement invisible and unavoidable.
Pre-Commit Hooks
Pre-commit hooks are automated checkpoints that run every time a developer tries to save code changes. Think of them like a spell-checker that runs automatically before you can send an email. If the code has formatting problems, style violations, or accidentally included passwords, the save is blocked until those issues are fixed. This catches problems at the earliest possible moment, when they are easiest and cheapest to fix.
Our unified hook system runs automatically before every commit, completing in under 30 seconds while checking code across all three platforms:
Swift (iOS)
SwiftLint enforces coding standards in strict mode. It catches potential bugs, ensures consistent style, and flags unsafe patterns like force-unwrapping that could cause crashes.
Python (Server)
Ruff checks for errors, potential bugs, and style consistency. It is significantly faster than older tools and catches issues like unused imports, undefined variables, and security vulnerabilities.
JavaScript/TypeScript (Web)
ESLint identifies problematic patterns and bugs, while Prettier automatically formats code. Together they keep the web codebase consistent and catch common React mistakes.
Secrets Detection
Gitleaks scans every file for accidentally included API keys, passwords, or access tokens. Committing secrets to code is a serious security risk; this prevents that mistake before it happens.
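To make the unified hook concrete, here is a minimal Python sketch of how staged files could be dispatched to the tools named above. The dispatch-table structure, function names, and exact tool flags are illustrative assumptions, not the project's actual hook script:

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical dispatch table: file extension -> linter commands to run.
# Tool names match those described above; flags are illustrative.
CHECKS = {
    ".swift": [["swiftlint", "--strict"]],
    ".py": [["ruff", "check"]],
    ".ts": [["eslint"], ["prettier", "--check"]],
    ".tsx": [["eslint"], ["prettier", "--check"]],
}

def select_checks(paths):
    """Return the deduplicated list of commands needed for the staged paths."""
    commands = []
    for path in paths:
        for cmd in CHECKS.get(Path(path).suffix, []):
            if cmd not in commands:
                commands.append(cmd)
    # Secrets scanning applies to every commit regardless of file type.
    commands.append(["gitleaks", "protect", "--staged"])
    return commands

def main():
    # Would be wired up as .git/hooks/pre-commit; a real hook would also
    # pass only the matching files to each tool (omitted for brevity).
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for cmd in select_checks(staged):
        if subprocess.run(cmd).returncode != 0:
            sys.exit(1)  # any failing check blocks the commit
```

The key design point is that only the checks relevant to the staged files run, which is how a multi-platform hook can stay under a 30-second budget.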
Hook Bypass Auditing
Developers can skip pre-commit checks in emergencies using a special flag. While sometimes necessary, frequent bypasses indicate a problem. Our audit system tracks every bypass, creating visibility into whether quality gates are being circumvented and enabling conversations about why.
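One simple way to detect bypasses, sketched below: a hook that runs normally records its commit SHA in an audit log, while `git commit --no-verify` skips the hook entirely, so bypassed commits are exactly those missing from the log. This is an illustrative technique, not necessarily the project's implementation:

```python
def find_bypassed(all_commits, hook_logged_commits):
    """Commits that reached the repository without a recorded hook run.

    `all_commits` comes from git history; `hook_logged_commits` is the
    audit log each successful hook run appends to. The difference is
    the set of commits where the hook was skipped.
    """
    logged = set(hook_logged_commits)
    return [sha for sha in all_commits if sha not in logged]

# Example: commit "b2" never triggered the hook, so it was bypassed.
bypassed = find_bypassed(["a1", "b2", "c3"], ["a1", "c3"])
```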
Dependency Automation (Renovate)
Manual dependency tracking is eliminated. Renovate handles everything automatically:
- Schedule: Updates run Mondays before 6am, minimizing disruption
- Grouping: iOS, Python, and npm dependencies grouped separately for focused review
- Auto-merge: Security patches, patch updates, and dev dependencies merge automatically
- Manual review: Major version updates and breaking changes require human approval
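A `renovate.json` implementing these rules might look like the following sketch. The structure uses real Renovate options, but the specific grouping and preset choices here are illustrative, not the project's actual configuration:

```json
{
  "extends": ["config:recommended"],
  "schedule": ["before 6am on monday"],
  "packageRules": [
    {
      "description": "Group npm dependencies for focused review",
      "matchManagers": ["npm"],
      "groupName": "npm dependencies"
    },
    {
      "description": "Auto-merge patch updates",
      "matchUpdateTypes": ["patch"],
      "automerge": true
    },
    {
      "description": "Auto-merge dev dependencies",
      "matchDepTypes": ["devDependencies"],
      "automerge": true
    }
  ],
  "vulnerabilityAlerts": { "automerge": true }
}
```

Major updates fall through to the default behavior and wait for human approval, matching the manual-review rule above.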
Coverage Enforcement
Code coverage measures how much of our code is actually tested by our automated tests. When we say "80% coverage," it means that when all our tests run, they exercise at least 80% of the code paths in the application. The remaining 20% represents code that is not directly tested, which could hide undetected bugs.
We treat coverage as a hard gate, not a suggestion. If the iOS codebase drops below 80% coverage, the build automatically fails and cannot proceed. This forces new code to include tests and prevents gradual erosion of test quality over time. Coverage is extracted automatically from Xcode test results, so enforcement is completely automated.
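Since coverage comes from Xcode test results, the gate can be a short script around `xcrun xccov`, which emits a JSON report with an overall `lineCoverage` fraction. This sketch shows the shape of such a gate; the script itself is a hypothetical illustration:

```python
import json
import subprocess
import sys

THRESHOLD = 0.80  # 80% minimum, per the quality gate described above

def extract_line_coverage(report_json):
    """Pull the overall line-coverage fraction out of an xccov JSON report."""
    return json.loads(report_json)["lineCoverage"]

def main(xcresult_path):
    # `xcrun xccov view --report --json` reports coverage for an .xcresult bundle
    report = subprocess.run(
        ["xcrun", "xccov", "view", "--report", "--json", xcresult_path],
        capture_output=True, text=True, check=True,
    ).stdout
    coverage = extract_line_coverage(report)
    print(f"Line coverage: {coverage:.1%}")
    if coverage < THRESHOLD:
        sys.exit(1)  # hard gate: the build fails below 80%
```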
Phase 2: Enhanced Quality Gates
Phase 2 extends quality enforcement with nightly testing, performance regression detection, and comprehensive security scanning.
Nightly End-to-End Testing
Every night at 2am UTC, comprehensive end-to-end tests run against the full system:
- iOS E2E tests with real API keys (from GitHub Secrets)
- Latency regression tests using the provider comparison suite
- Full voice pipeline validation
- Automatic GitHub issue creation on failure with "nightly-failure" label
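A GitHub Actions workflow for this nightly run could look like the sketch below. The cron schedule, label, and secret-based keys match the description above; the secret name and script path are hypothetical placeholders:

```yaml
name: nightly-e2e
on:
  schedule:
    - cron: "0 2 * * *"   # every night at 2am UTC, as described above
jobs:
  e2e:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E suite
        env:
          API_KEY: ${{ secrets.API_KEY }}   # hypothetical secret name
        run: ./scripts/run-e2e.sh            # hypothetical script path
      - name: File issue on failure
        if: failure()
        run: gh issue create --title "Nightly E2E failure" --label nightly-failure
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```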
Performance Regression Detection
Voice applications live and die by latency. Our latency test harness ensures we never ship a slower release:
- End-to-end turn latency median (P50) target: 500ms
- 99th percentile (P99) latency ceiling: 1000ms
- Regression warning threshold: +10% over baseline
- CI blocks at +20% regression
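The warn-at-10%, block-at-20% policy can be expressed in a few lines. This is an illustrative sketch using nearest-rank percentiles, not the project's actual test harness:

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[index]

def check_regression(samples, baseline_ms, warn_pct=10, fail_pct=20):
    """Compare this run's median latency against the stored baseline."""
    p50 = percentile(samples, 50)
    delta_pct = (p50 - baseline_ms) / baseline_ms * 100
    if delta_pct >= fail_pct:
        return "fail"    # CI blocks the merge
    if delta_pct >= warn_pct:
        return "warn"    # surfaced for review, but not blocking
    return "ok"
```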
Multi-Layer Security Scanning
Security is not a single check. It is a layered defense where multiple specialized tools each look for different types of problems. No single tool catches everything, but together they provide comprehensive protection:
| Scanner | What It Does | Schedule |
|---|---|---|
| Gitleaks | Scans the entire code history for accidentally committed passwords, API keys, or tokens. Even if a secret was added and then deleted, Gitleaks finds it. | Every PR + weekly |
| CodeQL | GitHub's static analysis engine that reads code without running it, finding security vulnerabilities, bugs, and dangerous patterns in Swift, Python, and JavaScript. It catches issues like SQL injection, cross-site scripting, and unsafe data handling. | Every PR + weekly |
| pip-audit | Checks Python dependencies against known vulnerability databases. If any library we use has a published security flaw, this catches it before deployment. | Every PR + weekly |
| npm audit | Same as pip-audit but for JavaScript packages. Web applications often have hundreds of dependencies; this ensures none of them have known security issues. | Every PR + weekly |
Phase 3: Feature Flag System
Feature flags are on/off switches in code that control whether a feature is visible to users. They let us deploy new code to production but keep it hidden until we are ready to reveal it. This is powerful for several reasons:
- Safe rollouts: Enable a new feature for 1% of users first, watch for problems, then gradually expand to everyone.
- Instant rollback: If a new feature causes problems, disable it with a configuration change instead of deploying a code fix.
- A/B testing: Show different experiences to different users and measure which performs better.
- Operational control: Disable expensive features during high traffic periods or outages.
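The gradual-rollout idea rests on deterministic bucketing: hash the (flag, user) pair into a bucket from 0 to 99, and enable the flag when the bucket falls below the rollout percentage. The sketch below illustrates the principle; Unleash's actual bucketing algorithm differs in detail:

```python
import hashlib

def is_enabled(flag_name, user_id, rollout_pct):
    """Deterministic percentage rollout.

    The same user always gets the same answer for a given flag, and the
    share of enabled users tracks rollout_pct as it is raised from 1 to 100.
    """
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Because bucketing is deterministic, raising the percentage only ever adds users; nobody flips back and forth between experiences.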
Self-Hosted Unleash Infrastructure
We run our own feature flag system using Unleash, an open-source alternative to commercial services like LaunchDarkly that start at $75+ per month. Self-hosting gives us full control over our data and zero ongoing subscription costs:
Unleash Server
Port 4242: Core flag management and administration interface.
Unleash Proxy
Port 3063: Edge proxy for client SDK connections with caching.
iOS SDK
Actor-based service with SwiftUI view modifier for seamless integration.
Web SDK
React context and hooks (useFlag, useFlagVariant) for web components.
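A self-hosted deployment of these two services could be wired together roughly as follows. The image names and proxy environment variables are real Unleash conventions, but the database URL, token, and port mapping here are illustrative assumptions:

```yaml
services:
  unleash:
    image: unleashorg/unleash-server
    ports: ["4242:4242"]
    environment:
      DATABASE_URL: postgres://unleash:unleash@db/unleash   # hypothetical
  unleash-proxy:
    image: unleashorg/unleash-proxy
    ports: ["3063:3063"]
    environment:
      PORT: "3063"
      UNLEASH_URL: http://unleash:4242/api
      UNLEASH_API_TOKEN: "<client token>"
      UNLEASH_PROXY_CLIENT_KEYS: "<proxy key>"
```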
Flag Lifecycle Management
Feature flags have a lifecycle. A flag created for a gradual rollout should be removed once the feature is fully launched. Forgotten flags accumulate as "flag debt," making code harder to understand and maintain. Our automated audit system prevents this:
- Ownership: Every flag has a designated owner and a target removal date set when the flag is created.
- Expiration enforcement: Weekly automated scans detect flags older than 90 days and create cleanup tickets.
- Early warning: CI automatically creates GitHub issues when flags approach their expiration date.
- Visibility: Any code change that adds, modifies, or removes a flag triggers an automatic comment on the pull request for review.
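The 90-day expiration scan reduces to a simple date comparison over the flag registry. A minimal sketch, assuming each flag record carries a name, owner, and creation date:

```python
from datetime import date

MAX_AGE_DAYS = 90  # flags older than this get a cleanup ticket

def expired_flags(flags, today=None):
    """Return (name, owner) pairs for flags past the 90-day ceiling.

    Each entry in `flags` is a (name, owner, created_date) tuple; the
    owner on the returned pairs is who the cleanup ticket is assigned to.
    """
    today = today or date.today()
    return [
        (name, owner)
        for name, owner, created in flags
        if (today - created).days > MAX_AGE_DAYS
    ]

# Example: one flag is 104 days old, the other only 45.
registry = [
    ("new-tutor-ui", "alice", date(2025, 1, 1)),
    ("beta-voice", "bob", date(2025, 3, 1)),
]
stale = expired_flags(registry, today=date(2025, 4, 15))
```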
Phase 4: Observability & DORA Metrics
You cannot improve what you do not measure. Phase 4 provides visibility into quality trends and engineering health through industry-standard DORA metrics.
DORA Metrics (Apache DevLake)
DORA (DevOps Research and Assessment) is the largest and longest-running research program studying what makes software teams effective. The research, published in the book Accelerate by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, identified four key metrics that distinguish elite engineering teams from the rest. We track all four:
| Metric | What It Really Means | Elite Target |
|---|---|---|
| Deployment Frequency | How often new code reaches users. High frequency means small, safe changes. Low frequency means big, risky releases. | Multiple per day |
| Lead Time for Changes | Time from a developer writing code to users seeing it. Short lead times mean the team can respond quickly to feedback and fix issues fast. | Less than 1 hour |
| Change Failure Rate | Percentage of deployments that cause problems requiring fixes. This measures whether speed is coming at the cost of quality. | 0-15% |
| Mean Time to Recovery | When something breaks, how quickly is it fixed? This measures resilience and incident response capability. | Less than 1 hour |
Research shows teams that excel at these metrics are twice as likely to exceed their organizational performance goals and report significantly higher customer satisfaction. We use Apache DevLake, an open-source platform, to collect and visualize these metrics automatically from our GitHub repositories.
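Two of the four metrics fall straight out of deployment and incident records; the sketch below shows the arithmetic (field names are hypothetical, and DevLake computes these automatically in practice):

```python
from datetime import datetime

def change_failure_rate(deployments):
    """Fraction of deployments flagged as having caused a failure."""
    failed = sum(1 for d in deployments if d["caused_failure"])
    return failed / len(deployments)

def mean_time_to_recovery(incidents):
    """Average hours from incident start to resolution."""
    hours = [
        (i["resolved"] - i["started"]).total_seconds() / 3600
        for i in incidents
    ]
    return sum(hours) / len(hours)

# Example: 1 failure in 4 deployments; one incident resolved in an hour.
deploys = [{"caused_failure": f} for f in (False, True, False, False)]
incidents = [
    {"started": datetime(2025, 1, 1, 10, 0), "resolved": datetime(2025, 1, 1, 11, 0)}
]
```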
Quality Dashboard
Daily automated metrics collection provides ongoing visibility:
- CI/CD success rates across iOS, Server, and Web pipelines
- Pull request metrics (count, average size, review time)
- Bug metrics (open count, closed in 30 days, age distribution)
- Trend analysis with 90-day retention for pattern detection
Phase 5: AI-Powered Code Review
The final phase brings AI into the review process. Every pull request receives automated analysis from CodeRabbit, configured for maximum issue detection with language-specific rules for our tech stack.
Review Configuration
CodeRabbit is configured in "assertive" mode for comprehensive coverage:
Swift Reviews
- Swift 6.0 concurrency safety verification
- Actor isolation violation detection
- Sendable conformance checks
- Data race identification in async code
- Memory leak and retain cycle detection
- Force unwrap usage analysis
Python Reviews
- Async/await usage patterns
- Exception handling completeness
- Type hint coverage
- Security vulnerability scanning
- aiohttp-specific best practices
TypeScript/React Reviews
- React hook dependency arrays
- Server/client component boundaries
- Accessibility (a11y) compliance
- Next.js App Router patterns
- Type safety enforcement
CI/CD Reviews
- GitHub Action version pinning
- Permissions scope verification
- Secrets handling review
- Cache configuration optimization
- Workflow efficiency suggestions
Cost: Free for Open Source
CodeRabbit provides this enterprise-grade AI review capability free for open source projects. The same service costs $24-30 per seat per month for private repositories, making this a significant value for the UnaMentis project.
Results & The Road Ahead
The Code Quality Initiative is an ongoing journey. All five phases are now substantially complete, and refinement continues. Here is where we stand:
Current Quality Gates
| Gate | Threshold | Enforcement |
|---|---|---|
| Code Coverage | 80% minimum | CI fails if below |
| Latency P50 | 500ms | Warns at +10%, fails at +20% |
| Latency P99 | 1000ms | Warns at +10%, fails at +20% |
| Lint (all languages) | Zero violations | Pre-commit hook blocks |
| Secrets Detection | Zero findings | Pre-commit + CI blocks |
| Hook Bypass | Logged and audited | Weekly audit report |
| Feature Flag Age | 90 days maximum | Weekly audit creates issues |
| Security Vulnerabilities | Zero critical/high | Security workflow blocks |
| Mutation Score | 70%+ target | Weekly validation |
Implemented Advanced Features
Mutation Testing
Code coverage tells you that tests ran certain lines of code, but not whether those tests would actually catch bugs. Mutation testing answers a harder question: if we deliberately introduce bugs, do the tests detect them? It works by making small changes to the code (like replacing + with -, or changing true to false) and checking if any test fails. If no test catches the mutation, that reveals a gap in test quality. We run mutation testing weekly using mutmut (Python), Stryker (Web), and Muter (Swift).
Chaos Engineering
Voice applications fail differently than traditional apps. When a webpage fails, users see an error. When a voice app fails, users experience silence or confusion. Our chaos engineering runbook deliberately introduces failures to verify the system handles them gracefully: network degradation (high latency, packet loss), API timeouts, provider failures, and resource pressure. We test these scenarios to ensure users get clear feedback instead of mysterious silence.
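The "clear feedback instead of silence" principle boils down to wrapping each provider call in a timeout with an explicit spoken fallback. A minimal asyncio sketch, with the injected delay standing in for a chaos experiment (function names and the timeout value are illustrative):

```python
import asyncio

async def stt_provider(audio, injected_delay):
    """Simulated speech-to-text call; chaos tests inject extra latency here."""
    await asyncio.sleep(injected_delay)
    return "transcript"

async def transcribe(audio, injected_delay=0.0, timeout=0.05):
    """Degrade gracefully: on timeout, return a clear retry prompt, not silence."""
    try:
        return await asyncio.wait_for(stt_provider(audio, injected_delay), timeout)
    except asyncio.TimeoutError:
        return "Sorry, I didn't catch that. Could you say it again?"
```

A chaos run then asserts the user-facing behavior under failure, not just the happy path: with injected latency above the timeout, the user must hear the retry prompt rather than nothing.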
Planned Advanced Features
Contract Testing
Ensures iOS client and Server API stay in sync using Pact. Deferred until APIs stabilize.
Predictive Alerts
Move from reactive to proactive: detect performance degradation trends before they impact users.
This Story Continues
AI-assisted development is not a destination. It is an evolving practice. As new tools emerge and our understanding deepens, we will continue to push the boundaries of what a small team can accomplish with intelligent automation. This page will be updated as our journey continues.