Technical Architecture
System design focused on maintainable, testable code with real-time performance constraints. Architecture decisions balance immediate delivery needs with long-term extensibility.
System Design Overview
Voice Processing Pipeline
Multi-layer audio analysis pipeline converting raw microphone input into gameplay actions. Designed for 60fps real-time processing with adaptive noise filtering.
Architecture: Microphone → Gate → Fusion → Decision → Gameplay
Dependency Injection Strategy
Clean separation of concerns using explicit interfaces and composition roots. Gameplay logic remains testable and independent of audio processing implementation.
Benefit: Swap voice detection algorithms without touching gameplay code
Performance-First Design
Zero-allocation architecture in hot paths with object pooling and buffer reuse. Every frame budget carefully managed for mobile device constraints.
Target: 0-byte GC allocation during sustained gameplay
Voice Processing Pipeline Implementation
Deep dive into the real-time audio processing architecture that converts raw microphone input into reliable gameplay triggers. The pipeline targets ~200–250 ms reaction time by default (optimized for stability and low false positives), with a chirp‑optimized profile reaching ~120–180 ms when needed. Latency depends on FFT size, audio buffer settings, smoothing, and confirm windows.
Note: Latency figures are measured ranges in current dev builds and will evolve as datasets, presets, and platform settings mature. Values vary by device, OS/DSP buffer, and scene configuration.
Multi-Layer Audio Analysis
The voice processing pipeline employs a sophisticated multi-layer approach where each layer provides increasing confidence about voice activity. This design allows for graceful degradation and adaptive behavior based on environmental conditions.
[Diagram: Voice Processing Pipeline]
Detection Layers Architecture
Granular breakdown of the four-layer detection system, each optimized for specific aspects of voice pattern recognition with progressive confidence refinement.
Layer 0: Intent Gate
Heuristic-based voice activity detection providing immediate negative-evidence filtering. Prevents false triggers from environmental noise and non-vocal audio sources by acting as a rapid veto seam ahead of the heavier layers.
The HeuristicIntentGate evaluates each audio frame and outputs an IntentGateResult containing IntentScore (0-1), GatePass boolean, and NegEvidenceFlags. The gate uses spectral analysis from FrequencyAnalyzer to detect voice activity while immediately flagging problematic audio conditions.
Implementation Details: Energy-based heuristics combined with spectral centroid and high-frequency ratio analysis. Negative evidence detection includes clipping detection and noise spike identification that triggers immediate arbiter cooldown. The gate operates with O(n) complexity over frame length with zero hot-path allocations.
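For orientation, here is a minimal sketch of the gate evaluation step, assuming a result struct shaped after the fields described above. The threshold values, scaling constants, and the parameters assumed to arrive from FrequencyAnalyzer are illustrative assumptions, not the shipped implementation.

```csharp
using UnityEngine;

// Sketch only: thresholds, scaling constants, and the FrequencyAnalyzer-derived
// parameters are assumptions; only the result fields follow the description above.
[System.Flags]
public enum NegEvidence : uint
{
    None       = 0,
    Clipping   = 1 << 0,   // samples at or near full scale
    NoiseSpike = 1 << 1,   // sudden broadband energy jump (detection omitted for brevity)
}

public readonly struct IntentGateResult
{
    public readonly float IntentScore;          // 0..1 likelihood of intentional voice
    public readonly bool  GatePass;             // true when the frame may proceed to fusion
    public readonly NegEvidence NegEvidenceFlags;

    public IntentGateResult(float score, bool pass, NegEvidence flags)
    {
        IntentScore = score; GatePass = pass; NegEvidenceFlags = flags;
    }
}

public sealed class HeuristicIntentGateSketch
{
    const float PassThreshold = 0.35f;          // illustrative, not a tuned value
    const float ClipLevel     = 0.98f;

    // frameRms, framePeak, spectralCentroidHz, and highFreqRatio would come from FrequencyAnalyzer.
    public IntentGateResult Evaluate(float frameRms, float framePeak,
                                     float spectralCentroidHz, float highFreqRatio)
    {
        var flags = NegEvidence.None;
        if (framePeak >= ClipLevel) flags |= NegEvidence.Clipping;

        // Energy + spectral heuristics: voiced frames sit in a mid centroid range
        // and are not dominated by high-frequency energy.
        float energyTerm   = Mathf.Clamp01(frameRms * 10f);
        float centroidTerm = 1f - Mathf.Clamp01(Mathf.Abs(spectralCentroidHz - 1000f) / 3000f);
        float hfTerm       = 1f - Mathf.Clamp01(highFreqRatio);

        float score = 0.5f * energyTerm + 0.3f * centroidTerm + 0.2f * hfTerm;
        bool pass   = flags == NegEvidence.None && score >= PassThreshold;
        return new IntentGateResult(score, pass, flags);
    }
}
```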
Note: Gate threshold tuning and negative evidence algorithms are actively being developed and refined based on field telemetry data.
[Diagram: Layer 0 Implementation]
Performance Characteristics: The intent gate provides rapid negative evidence filtering with ~1ms processing time per frame. Under silence conditions, IntentScore ≈ 0 and GatePass = false for most frames. When clipped input is detected, negative evidence flags become non-zero and the arbiter immediately enters cooldown state.
Integration Strategy: Gate output feeds into the fusion system as gateWeight, providing reliability estimates that reweight other layer contributions. The gate can optionally contribute directly as Layer 1 input when UseGateScoreAsLayer1 is enabled, though Layer 2 template matching is preferred for primary classification.
Layer 1: Prosodic Analysis
Frequency domain analysis using prosodic scalars extracted from windowed audio features. Provides baseline voice pattern classification through spectral characteristics and frequency band analysis.
Layer 1 implements prosodic analysis using the formula: L1 = 0.5·low_ratio + 0.3·(1−hf_ratio) + 0.2·centroid_comp, where components are derived from FrequencyAnalyzer spectral features.
Feature Extraction:
low_ratio = LowBandEnergy / (Low+Mid+High)
hf_ratio = HighBandEnergy / (Low+Mid+High)
centroid_comp = 1 − clamp01((SpectralCentroid − 300Hz)/3000Hz)
The centroid range (300-3300 Hz) covers fundamental and first formant regions typical of animal vocalizations and human voice. Weights (0.5/0.3/0.2) bias toward low-band dominance and high-frequency suppression.
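A minimal sketch of the Layer 1 scalar, translated directly from the formula above; the band energies and spectral centroid are assumed to arrive as parameters from FrequencyAnalyzer, whose accessor shape is not shown here.

```csharp
// Direct translation of the L1 formula; parameter plumbing is an assumption.
public static class ProsodicLayer1Sketch
{
    public static float Compute(float lowBandEnergy, float midBandEnergy,
                                float highBandEnergy, float spectralCentroidHz)
    {
        float total = lowBandEnergy + midBandEnergy + highBandEnergy + 1e-6f; // avoid div-by-zero

        float lowRatio     = lowBandEnergy  / total;
        float hfRatio      = highBandEnergy / total;
        float centroidComp = 1f - Clamp01((spectralCentroidHz - 300f) / 3000f);

        // L1 = 0.5·low_ratio + 0.3·(1 − hf_ratio) + 0.2·centroid_comp
        return 0.5f * lowRatio + 0.3f * (1f - hfRatio) + 0.2f * centroidComp;
    }

    static float Clamp01(float v) => v < 0f ? 0f : (v > 1f ? 1f : v);
}
```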
Note: Prosodic weights and frequency band configurations are being tuned based on microphone characteristics and room acoustics data.
[Diagram: Layer 1 Implementation]
Real-Time Performance: Prosodic analysis operates allocation-free with bounded computation per frame. Under silence/noise conditions, L1 typically remains below 0.2, while voiced segments push L1 above 0.5. The gate weight coupling derives from SNR and (1−hf_ratio) to dampen contributions during unreliable acoustic scenes.
Failure Mode Mitigation: High-frequency-heavy microphones call for a decreased centroid weight and an increased (1−hf_ratio) contribution. Boomy room acoustics call for capping the low_ratio contribution at 0.4 and requiring concurrent L2 confirmation.
Layer 2: Template Matching
Dynamic Time Warping correlation against curated animal sound templates. MFCC-based pattern matching with confidence scoring for intended voice classifications and environmental noise rejection.
Layer 2 employs MFCC-based template matching using multiple similarity engines: SimilarityEngineDtw (full), SimilarityEngineDtwBanded (band-limited), and SimilarityEngineCorrelation (baseline). MFCC matrices are windowed using configurable frameMs, hopMs, and framesInWindow parameters.
Template Matching Pipeline:
- Feature Extraction: 13-coefficient MFCC with 25ms frame, 10ms hop, ~20 frames per window
- Similarity Engines: DTW with band=3 for robustness, correlation for speed baseline
- Confidence Mapping: ExpConfidenceMapper converts raw distances to [0,1] confidence via sigma calibration
- Template Index: Groups templates by animal/tags with curated candidate sets
Integration: TemplateMatcherFeeds bridges to the fusion system, providing cached L2 confidence, winner ID, and quality flags via the IConfidenceFeeds interface.
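A minimal sketch of the distance-to-confidence step, assuming ExpConfidenceMapper follows an exponential decay of the form exp(−d/σ); the actual calibration curve and sigma value in the shipped mapper are not documented here and may differ.

```csharp
// Sketch of an exponential distance-to-confidence mapping, assuming exp(-d/sigma);
// the shipped ExpConfidenceMapper calibration may differ.
public sealed class ExpConfidenceMapperSketch
{
    readonly float _sigma;

    public ExpConfidenceMapperSketch(float sigma) => _sigma = sigma;

    // Maps a non-negative DTW/correlation distance to [0, 1]:
    // distance 0 → confidence 1, distance >> sigma → confidence ≈ 0.
    public float Map(float distance)
    {
        if (distance <= 0f) return 1f;
        return (float)System.Math.Exp(-distance / _sigma);
    }
}

// Usage sketch: sigma would be chosen so a typical in-set match lands around the
// ≥0.6 acceptance region described below. The value here is illustrative.
// var mapper = new ExpConfidenceMapperSketch(sigma: 42f);
// float l2Confidence = mapper.Map(dtwDistance);
```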
Note: Template curation, similarity engine selection, and confidence calibration parameters are actively refined through dedupe analysis and field testing.
[Diagram: Layer 2 Implementation]
Performance Optimization: Template matching runs at 20Hz cadence with caching to balance accuracy and computational cost. For in-set examples, L2 consistently achieves ≥0.6 confidence with stable winner identification. Quality flags reflect acoustic conditions including SNR levels, utterance length, and data staleness.
Computational Scaling: DTW cost scales with template count, mitigated through dedupe preprocessing and candidate set filtering. The system supports real-time operation with typical template sets while maintaining deterministic behavior for testing and validation.
Layer 3: ML Classifier
Machine learning classification layer providing refined confidence scoring. Deployment is optional based on computational budget and accuracy requirements, implemented with a swappable strategy pattern for flexible model integration.
Current Implementation: Layer 3 uses tonal stability analysis: L3 = 1 − clamp01(std(low_band_peak_Hz)/120Hz) over a ring buffer of ~12 recent LowBandPeakFrequencyHz updates from FrequencyAnalyzer.
Tonal Stability Metrics: Higher confidence indicates stable low-band peaks over time, characteristic of steady phonation or meow core tones. Window length trades responsiveness (smaller N) versus reliability (larger N), with 120Hz standard deviation threshold calibrated for typical animal vocalizations.
Future ML Integration: Planned architecture includes lightweight CNN/CRNN models over log-mel spectrograms or MFCC stacks. Integration contract: ILayer3Model.TryInfer(float[][] window, out float l3, out uint flags) with Unity Barracuda runtime, preallocated tensors, and zero per-frame allocations.
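A minimal sketch of both the current tonal-stability heuristic and the stated ILayer3Model seam; the TryInfer signature, the L3 formula, and the 12-update / 120 Hz constants come from the description above, while the class name and update plumbing are assumptions.

```csharp
// Stated integration seam for future ML models (signature from the contract above).
public interface ILayer3Model
{
    bool TryInfer(float[][] window, out float l3, out uint flags);
}

// Sketch of the current tonal-stability heuristic:
// L3 = 1 − clamp01(std(low_band_peak_Hz) / 120 Hz) over the last ~12 peak updates.
// Class name and update plumbing are assumptions.
public sealed class TonalStabilityLayer3Sketch
{
    const int   WindowSize = 12;     // ~12 recent LowBandPeakFrequencyHz updates
    const float StdRefHz   = 120f;   // calibration constant from the formula above

    readonly float[] _peaksHz = new float[WindowSize];
    int _count, _next;

    // Called whenever FrequencyAnalyzer publishes a new LowBandPeakFrequencyHz value.
    public void PushPeak(float lowBandPeakHz)
    {
        _peaksHz[_next] = lowBandPeakHz;
        _next = (_next + 1) % WindowSize;
        if (_count < WindowSize) _count++;
    }

    public float Compute()
    {
        if (_count < 2) return 0f;                 // not enough history yet

        float mean = 0f;
        for (int i = 0; i < _count; i++) mean += _peaksHz[i];
        mean /= _count;

        float variance = 0f;
        for (int i = 0; i < _count; i++)
        {
            float d = _peaksHz[i] - mean;
            variance += d * d;
        }
        float std = (float)System.Math.Sqrt(variance / _count);

        float clamped = std / StdRefHz;
        if (clamped > 1f) clamped = 1f;
        return 1f - clamped;                        // stable low-band peak → high confidence
    }
}
```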
Note: ML classifier architecture, training pipeline, and the swappable strategy pattern implementation are in active development, with a target inference latency of 2-5 ms.
[Diagram: Layer 3 Implementation]
Development Status: Sustained voiced tones currently produce L3 > 0.6 within 0.5-1.0 seconds, while noisy segments keep L3 below 0.3. The L3 weight remains modest (0.1-0.2) so that L2 template matching dominates during the transition period.
Future Benchmarks: Target specifications include ≤5ms inference latency on mobile devices, accuracy uplift over L2 on hold-out datasets, and stable confidence in noisy acoustic environments. Model deployment will use swappable strategy pattern for A/B testing different architectures without system disruption.
Confidence Fusion Strategy
Individual layer outputs are combined using weighted fusion based on environmental conditions and historical accuracy. The fusion algorithm adapts weights dynamically to optimize for current audio environment characteristics.
Adaptive Weighting
Layer weights adjust based on ambient noise levels, microphone quality, and recent classification accuracy. Noisy environments increase reliance on pattern matching over simple gate detection.
Algorithm: EWMA smoothing with environment-aware weight adjustment
Temporal Smoothing
Exponentially weighted moving average (EWMA) applied to fused confidence scores to eliminate jitter while maintaining responsiveness to genuine voice input changes.
Balance: 95% noise rejection, <100ms response time
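A minimal sketch of weighted fusion followed by EWMA smoothing; the per-layer weights, the gate re-weighting shape, and the smoothing factor shown here are illustrative placeholders, not the tuned defaults.

```csharp
// Sketch: weighted layer fusion followed by EWMA smoothing.
// Weights and the smoothing factor are illustrative, not tuned defaults.
public sealed class ConfidenceFusionSketch
{
    // Base layer weights (L1 prosody, L2 templates, L3 tonal/ML), dampened by gate reliability.
    float _w1 = 0.3f, _w2 = 0.5f, _w3 = 0.2f;
    float _alpha = 0.25f;          // EWMA smoothing factor (higher = more responsive)
    float _smoothed;

    public float Update(float gateWeight, float l1, float l2, float l3)
    {
        // gateWeight (0..1) reduces all contributions in unreliable acoustic scenes.
        float wSum  = _w1 + _w2 + _w3;
        float fused = gateWeight * (_w1 * l1 + _w2 * l2 + _w3 * l3) / wSum;

        // EWMA: smoothed = α·fused + (1 − α)·previous
        _smoothed = _alpha * fused + (1f - _alpha) * _smoothed;
        return _smoothed;
    }
}
```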
Architecture Decisions
Strategy Pattern Implementation
Explicit interfaces for Momentum, Chain Windows, and Mapping Curves allow runtime behavior swapping. Default policies ship first; variants are added when player data shows the need.
YAGNI Principle: Avoid premature abstraction, add complexity when proven necessary
Composition Root Pattern
Service locators are kept at the application edges. Gameplay code receives dependencies through constructor injection, maintaining testability and clear contracts.
Testing Benefit: Mock audio services for automated gameplay validation
Event-Driven Communication
Loose coupling between audio detection and gameplay systems through events. Audio confidence changes trigger gameplay reactions without direct dependencies.
Flexibility: Add new gameplay mechanics without modifying voice processing
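A minimal sketch of the event seam between detection and gameplay; the event and handler names are hypothetical placeholders chosen only to show the decoupling pattern.

```csharp
// Sketch of the detection→gameplay event seam; names are placeholders.
public sealed class VoiceDetectionEventsSketch
{
    // Raised with the latest smoothed confidence; gameplay subscribes, while the
    // detection side never references gameplay types.
    public event System.Action<float> ConfidenceChanged;
    public event System.Action        ActionTriggered;    // fired once per accepted utterance

    public void PublishConfidence(float confidence) => ConfidenceChanged?.Invoke(confidence);
    public void PublishAction()                     => ActionTriggered?.Invoke();
}

// Gameplay side subscribes without depending on pipeline internals, e.g.:
// detectionEvents.ActionTriggered += OnVoiceAction;
```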
System Flow Diagram
Complete pipeline from microphone input to gameplay actions. This diagram is automatically generated from source files to stay synchronized with implementation.
[Diagram: System Architecture]
Configuration & Feature Flags
Feature Toggle System
Runtime feature flags allow A/B testing of audio processing algorithms. Configuration-driven development enables safe experimentation with detection methods.
Example: VcrFeatureFlags.EnableFsmArbiter = true
Layered Audio Processing
Multi-layer confidence fusion combines gate detection, pattern matching, and optional ML classification. Layers can be enabled/disabled for performance tuning.
Configuration: EnableFusionFeeds + UseGateScoreAsLayer1
Dependency Injection Setup
CompositionRoot handles complex object graph construction. The Gate, Fusion, EWMA, Arbiter, and Layer 2 matcher are all configured through a single injection point.
Benefit: Centralized configuration, simplified testing, clear dependencies
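A minimal sketch of the pattern, using hypothetical interface and class names; only the roles (a single composition root constructing the pipeline, constructor injection into gameplay code) come from the text above.

```csharp
// Sketch of the composition-root pattern: construction happens in one place,
// gameplay code receives only interfaces through its constructor.
// Interface and class names are illustrative stand-ins, not the shipped types.
public interface IVoiceConfidenceSource { float CurrentConfidence { get; } }
public interface IActionArbiter         { bool  TryConsumeAction(); }

public sealed class GameplayVoiceController
{
    readonly IVoiceConfidenceSource _confidence;
    readonly IActionArbiter _arbiter;

    // Constructor injection: no service locator calls inside gameplay code.
    public GameplayVoiceController(IVoiceConfidenceSource confidence, IActionArbiter arbiter)
    {
        _confidence = confidence;
        _arbiter = arbiter;
    }

    public void Tick()
    {
        // Continuous feedback from fused confidence, discrete reactions from arbiter pulses.
        UpdateMeter(_confidence.CurrentConfidence);
        if (_arbiter.TryConsumeAction())
            OnVoiceAction();
    }

    void UpdateMeter(float level) { /* e.g. drive UI feedback */ }
    void OnVoiceAction()          { /* e.g. trigger the mapped gameplay move */ }
}

public static class CompositionRootSketch
{
    public static GameplayVoiceController Build(IVoiceConfidenceSource fusionOutput,
                                                IActionArbiter arbiter)
    {
        // In the real root, Gate, Fusion, EWMA, Arbiter, and the Layer 2 matcher
        // would be constructed and wired here; this sketch receives the leaves.
        return new GameplayVoiceController(fusionOutput, arbiter);
    }
}
```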
Decision Engine Detail
The FSM Arbiter represents the critical decision-making component that converts multi-layer audio confidence scores into definitive gameplay actions. This finite-state machine design isolates complex signal processing from game logic, ensuring both systems remain independently testable and maintainable.
State Machine Logic
Four-layer confidence inputs (L0-L3) are fused and smoothed with EWMA filtering before entering the FSM, which applies configurable thresholds and confirm-hold states.
Benefit: Eliminates false positives while maintaining responsiveness
Signal Processing Pipeline
Multi-stage confidence fusion combines gate detection, pattern matching, and optional ML classification into a single action pulse for gameplay consumption.
Architecture: Fusion → EWMA → FSM → Acting Event
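A minimal sketch of a confirm-hold arbiter along these lines, assuming four states and illustrative thresholds; the shipped FsmArbiter's states, timings, and negative-evidence handling may differ.

```csharp
// Sketch of a confirm-hold arbiter: smoothed confidence in, single action pulse out.
// State names, thresholds, and timings are illustrative, not the shipped FsmArbiter values.
public sealed class FsmArbiterSketch
{
    enum State { Idle, Confirming, Acting, Cooldown }

    const float EnterThreshold  = 0.55f;  // confidence needed to start confirming
    const float ExitThreshold   = 0.35f;  // hysteresis: drop below this to abort
    const float ConfirmSeconds  = 0.12f;  // confidence must hold this long before acting
    const float CooldownSeconds = 0.40f;  // refractory period after an action pulse

    State _state = State.Idle;
    float _timer;

    // Returns true for exactly one update when an action is accepted.
    public bool Update(float smoothedConfidence, bool negativeEvidence, float dt)
    {
        if (negativeEvidence) { _state = State.Cooldown; _timer = 0f; return false; }

        switch (_state)
        {
            case State.Idle:
                if (smoothedConfidence >= EnterThreshold) { _state = State.Confirming; _timer = 0f; }
                return false;

            case State.Confirming:
                if (smoothedConfidence < ExitThreshold) { _state = State.Idle; return false; }
                _timer += dt;
                if (_timer >= ConfirmSeconds) { _state = State.Acting; return true; }  // action pulse
                return false;

            case State.Acting:
                _state = State.Cooldown; _timer = 0f;
                return false;

            case State.Cooldown:
                _timer += dt;
                if (_timer >= CooldownSeconds) _state = State.Idle;
                return false;

            default:
                return false;
        }
    }
}
```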
Testability Design
Because decision logic is completely separated from signal extraction, state transitions can be unit tested with mock confidence inputs and the output timing can be validated.
Testing: Mock inputs → verify state transitions → assert action timing
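A minimal sketch of that testing flow, written with NUnit (which Unity's test framework builds on) against the FsmArbiterSketch above rather than the shipped arbiter.

```csharp
using NUnit.Framework;

// Sketch of decision-logic testing with synthetic confidence traces.
public class FsmArbiterSketchTests
{
    [Test]
    public void SustainedHighConfidence_ProducesExactlyOneActionPulse()
    {
        var arbiter = new FsmArbiterSketch();
        int pulses = 0;

        // Feed 0.5 s of high confidence at a 10 ms tick.
        for (int i = 0; i < 50; i++)
            if (arbiter.Update(smoothedConfidence: 0.8f, negativeEvidence: false, dt: 0.01f))
                pulses++;

        Assert.AreEqual(1, pulses);
    }

    [Test]
    public void ShortBlip_BelowConfirmWindow_NeverTriggers()
    {
        var arbiter = new FsmArbiterSketch();
        bool triggered = false;

        // A 50 ms blip is shorter than the confirm window, so no action should fire.
        for (int i = 0; i < 5; i++)
            triggered |= arbiter.Update(0.9f, false, 0.01f);

        Assert.IsFalse(triggered);
    }
}
```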
[Diagram: FSM Arbiter Implementation]