Technical Architecture
System design focused on maintainable, testable code with real-time performance constraints. Architecture decisions balance immediate delivery needs with long-term extensibility.
System Design Overview
Voice Processing Pipeline
Multi-layer audio analysis pipeline converting raw microphone input into gameplay actions. Designed for 60fps real-time processing with adaptive noise filtering.
Architecture: Microphone → Gate → Fusion → Decision → Gameplay
Dependency Injection Strategy
Clean separation of concerns using explicit interfaces and composition roots. Gameplay logic remains testable and independent of audio processing implementation.
Benefit: Swap voice detection algorithms without touching gameplay code
Performance-First Design
Zero-allocation architecture in hot paths with object pooling and buffer reuse. Every frame budget carefully managed for mobile device constraints.
Target: 0-byte GC allocation during sustained gameplay
Voice Processing Pipeline Implementation
Deep dive into the real-time audio processing architecture that converts raw microphone input into reliable gameplay triggers. The pipeline targets ~200–250 ms reaction time by default (optimized for stability and low false positives), with a chirp‑optimized profile reaching ~120–180 ms when needed. Latency depends on FFT size, audio buffer settings, smoothing, and confirm windows.
Note: Latency figures are measured ranges in current dev builds and will evolve as datasets, presets, and platform settings mature. Values vary by device, OS/DSP buffer, and scene configuration.
Multi-Layer Audio Analysis
The voice processing pipeline employs a sophisticated multi-layer approach where each layer provides increasing confidence about voice activity. This design allows for graceful degradation and adaptive behavior based on environmental conditions.
[Diagram: Voice Processing Pipeline]
Detection Layers Architecture
Granular breakdown of the four-layer detection system, each optimized for specific aspects of voice pattern recognition with progressive confidence refinement.
Layer 0: Intent Gate
Heuristic-based voice activity detection providing immediate negative-evidence filtering. Prevents false triggers from environmental noise and non-vocal audio sources by acting as a rapid veto seam ahead of the heavier layers.
The HeuristicIntentGate evaluates each audio frame and outputs an IntentGateResult containing IntentScore (0-1), GatePass boolean, and NegEvidenceFlags. The gate uses spectral analysis from FrequencyAnalyzer to detect voice activity while immediately flagging problematic audio conditions.
Implementation Details: Energy-based heuristics combined with spectral centroid and high-frequency ratio analysis. Negative evidence detection includes clipping detection and noise spike identification that triggers immediate arbiter cooldown. The gate operates with O(n) complexity over frame length with zero hot-path allocations.
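For orientation, here is a minimal sketch of the gate evaluation step, assuming a result struct shaped after the fields described above. The threshold values, scaling constants, and the parameters assumed to arrive from FrequencyAnalyzer are illustrative assumptions, not the shipped implementation.

```csharp
using UnityEngine;

// Sketch only: thresholds, scaling constants, and the FrequencyAnalyzer-derived
// parameters are assumptions; only the result fields follow the description above.
[System.Flags]
public enum NegEvidence : uint
{
    None       = 0,
    Clipping   = 1 << 0,   // samples at or near full scale
    NoiseSpike = 1 << 1,   // sudden broadband energy jump (detection omitted for brevity)
}

public readonly struct IntentGateResult
{
    public readonly float IntentScore;          // 0..1 likelihood of intentional voice
    public readonly bool  GatePass;             // true when the frame may proceed to fusion
    public readonly NegEvidence NegEvidenceFlags;

    public IntentGateResult(float score, bool pass, NegEvidence flags)
    {
        IntentScore = score; GatePass = pass; NegEvidenceFlags = flags;
    }
}

public sealed class HeuristicIntentGateSketch
{
    const float PassThreshold = 0.35f;          // illustrative, not a tuned value
    const float ClipLevel     = 0.98f;

    // frameRms, framePeak, spectralCentroidHz, and highFreqRatio would come from FrequencyAnalyzer.
    public IntentGateResult Evaluate(float frameRms, float framePeak,
                                     float spectralCentroidHz, float highFreqRatio)
    {
        var flags = NegEvidence.None;
        if (framePeak >= ClipLevel) flags |= NegEvidence.Clipping;

        // Energy + spectral heuristics: voiced frames sit in a mid centroid range
        // and are not dominated by high-frequency energy.
        float energyTerm   = Mathf.Clamp01(frameRms * 10f);
        float centroidTerm = 1f - Mathf.Clamp01(Mathf.Abs(spectralCentroidHz - 1000f) / 3000f);
        float hfTerm       = 1f - Mathf.Clamp01(highFreqRatio);

        float score = 0.5f * energyTerm + 0.3f * centroidTerm + 0.2f * hfTerm;
        bool pass   = flags == NegEvidence.None && score >= PassThreshold;
        return new IntentGateResult(score, pass, flags);
    }
}
```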
Note: Gate threshold tuning and negative evidence algorithms are actively being developed and refined based on field telemetry data.
[Diagram: Layer 0 Implementation]
Performance Characteristics: The intent gate provides rapid negative evidence filtering with ~1ms processing time per frame. Under silence conditions, IntentScore ≈ 0 and GatePass = false for most frames. When clipped input is detected, negative evidence flags become non-zero and the arbiter immediately enters cooldown state.
Integration Strategy: Gate output feeds into the fusion system as gateWeight, providing reliability estimates that reweight other layer contributions. The gate can optionally contribute directly as Layer 1 input when UseGateScoreAsLayer1 is enabled, though Layer 2 template matching is preferred for primary classification.
Layer 1: Prosodic Analysis
Frequency domain analysis using prosodic scalars extracted from windowed audio features. Provides baseline voice pattern classification through spectral characteristics and frequency band analysis.
Layer 1 implements prosodic analysis using the formula: L1 = 0.5·low_ratio + 0.3·(1−hf_ratio) + 0.2·centroid_comp, where components are derived from FrequencyAnalyzer spectral features.
Feature Extraction:
low_ratio = LowBandEnergy / (Low+Mid+High)
hf_ratio = HighBandEnergy / (Low+Mid+High)
centroid_comp = 1 − clamp01((SpectralCentroid − 300Hz)/3000Hz)
The centroid range (300-3300 Hz) covers fundamental and first formant regions typical of animal vocalizations and human voice. Weights (0.5/0.3/0.2) bias toward low-band dominance and high-frequency suppression.
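A minimal sketch of the Layer 1 scalar, translated directly from the formula above; the band energies and spectral centroid are assumed to arrive as parameters from FrequencyAnalyzer, whose accessor shape is not shown here.

```csharp
// Direct translation of the L1 formula; parameter plumbing is an assumption.
public static class ProsodicLayer1Sketch
{
    public static float Compute(float lowBandEnergy, float midBandEnergy,
                                float highBandEnergy, float spectralCentroidHz)
    {
        float total = lowBandEnergy + midBandEnergy + highBandEnergy + 1e-6f; // avoid div-by-zero

        float lowRatio     = lowBandEnergy  / total;
        float hfRatio      = highBandEnergy / total;
        float centroidComp = 1f - Clamp01((spectralCentroidHz - 300f) / 3000f);

        // L1 = 0.5·low_ratio + 0.3·(1 − hf_ratio) + 0.2·centroid_comp
        return 0.5f * lowRatio + 0.3f * (1f - hfRatio) + 0.2f * centroidComp;
    }

    static float Clamp01(float v) => v < 0f ? 0f : (v > 1f ? 1f : v);
}
```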
Note: Prosodic weights and frequency band configurations are being tuned based on microphone characteristics and room acoustics data.
[Diagram: Layer 1 Implementation]
Real-Time Performance: Prosodic analysis operates allocation-free with bounded computation per frame. Under silence/noise conditions, L1 typically remains below 0.2, while voiced segments push L1 above 0.5. The gate weight coupling derives from SNR and (1−hf_ratio) to dampen contributions during unreliable acoustic scenes.
Failure Mode Mitigation: High-frequency-heavy microphones call for a decreased centroid weight and an increased (1−hf_ratio) contribution. Boomy room acoustics call for capping the low_ratio contribution at 0.4 and requiring concurrent L2 confirmation.
Layer 2: Template Matching
Dynamic Time Warping correlation against curated animal sound templates. MFCC-based pattern matching with confidence scoring for intended voice classifications and environmental noise rejection.
Layer 2 employs MFCC-based template matching using multiple similarity engines: SimilarityEngineDtw (full), SimilarityEngineDtwBanded (band-limited), and SimilarityEngineCorrelation (baseline). MFCC matrices are windowed using configurable frameMs, hopMs, and framesInWindow parameters.
Template Matching Pipeline:
- Feature Extraction: 13-coefficient MFCC with 25ms frame, 10ms hop, ~20 frames per window
- Similarity Engines: DTW with band=3 for robustness, correlation for speed baseline
- Confidence Mapping: ExpConfidenceMapper converts raw distances to [0,1] confidence via sigma calibration
- Template Index: Groups templates by animal/tags with curated candidate sets
Integration: TemplateMatcherFeeds bridges to the fusion system, providing cached L2 confidence, winner ID, and quality flags via the IConfidenceFeeds interface.
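A minimal sketch of the distance-to-confidence step, assuming ExpConfidenceMapper follows an exponential decay of the form exp(−d/σ); the actual calibration curve and sigma value in the shipped mapper are not documented here and may differ.

```csharp
// Sketch of an exponential distance-to-confidence mapping, assuming exp(-d/sigma);
// the shipped ExpConfidenceMapper calibration may differ.
public sealed class ExpConfidenceMapperSketch
{
    readonly float _sigma;

    public ExpConfidenceMapperSketch(float sigma) => _sigma = sigma;

    // Maps a non-negative DTW/correlation distance to [0, 1]:
    // distance 0 → confidence 1, distance >> sigma → confidence ≈ 0.
    public float Map(float distance)
    {
        if (distance <= 0f) return 1f;
        return (float)System.Math.Exp(-distance / _sigma);
    }
}

// Usage sketch: sigma would be chosen so a typical in-set match lands around the
// ≥0.6 acceptance region described below. The value here is illustrative.
// var mapper = new ExpConfidenceMapperSketch(sigma: 42f);
// float l2Confidence = mapper.Map(dtwDistance);
```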
Note: Template curation, similarity engine selection, and confidence calibration parameters are actively refined through dedupe analysis and field testing.
[Diagram: Layer 2 Implementation]
Performance Optimization: Template matching runs at 20Hz cadence with caching to balance accuracy and computational cost. For in-set examples, L2 consistently achieves ≥0.6 confidence with stable winner identification. Quality flags reflect acoustic conditions including SNR levels, utterance length, and data staleness.
Computational Scaling: DTW cost scales with template count, mitigated through dedupe preprocessing and candidate set filtering. The system supports real-time operation with typical template sets while maintaining deterministic behavior for testing and validation.
Layer 3: ML Classifier
Machine learning classification layer providing refined confidence scoring. Deployment is optional based on computational budget and accuracy requirements, implemented with a swappable strategy pattern for flexible model integration.
Current Implementation: Layer 3 uses tonal stability analysis: L3 = 1 − clamp01(std(low_band_peak_Hz)/120Hz) over a ring buffer of ~12 recent LowBandPeakFrequencyHz updates from FrequencyAnalyzer.
Tonal Stability Metrics: Higher confidence indicates stable low-band peaks over time, characteristic of steady phonation or meow core tones. Window length trades responsiveness (smaller N) versus reliability (larger N), with 120Hz standard deviation threshold calibrated for typical animal vocalizations.
Future ML Integration: Planned architecture includes lightweight CNN/CRNN models over log-mel spectrograms or MFCC stacks. Integration contract: ILayer3Model.TryInfer(float[][] window, out float l3, out uint flags) with Unity Barracuda runtime, preallocated tensors, and zero per-frame allocations.
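A minimal sketch of both the current tonal-stability heuristic and the stated ILayer3Model seam; the TryInfer signature, the L3 formula, and the 12-update / 120 Hz constants come from the description above, while the class name and update plumbing are assumptions.

```csharp
// Stated integration seam for future ML models (signature from the contract above).
public interface ILayer3Model
{
    bool TryInfer(float[][] window, out float l3, out uint flags);
}

// Sketch of the current tonal-stability heuristic:
// L3 = 1 − clamp01(std(low_band_peak_Hz) / 120 Hz) over the last ~12 peak updates.
// Class name and update plumbing are assumptions.
public sealed class TonalStabilityLayer3Sketch
{
    const int   WindowSize = 12;     // ~12 recent LowBandPeakFrequencyHz updates
    const float StdRefHz   = 120f;   // calibration constant from the formula above

    readonly float[] _peaksHz = new float[WindowSize];
    int _count, _next;

    // Called whenever FrequencyAnalyzer publishes a new LowBandPeakFrequencyHz value.
    public void PushPeak(float lowBandPeakHz)
    {
        _peaksHz[_next] = lowBandPeakHz;
        _next = (_next + 1) % WindowSize;
        if (_count < WindowSize) _count++;
    }

    public float Compute()
    {
        if (_count < 2) return 0f;                 // not enough history yet

        float mean = 0f;
        for (int i = 0; i < _count; i++) mean += _peaksHz[i];
        mean /= _count;

        float variance = 0f;
        for (int i = 0; i < _count; i++)
        {
            float d = _peaksHz[i] - mean;
            variance += d * d;
        }
        float std = (float)System.Math.Sqrt(variance / _count);

        float clamped = std / StdRefHz;
        if (clamped > 1f) clamped = 1f;
        return 1f - clamped;                        // stable low-band peak → high confidence
    }
}
```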
Note: ML classifier architecture, training pipeline, and the swappable strategy pattern implementation are in active development, with a target inference latency of 2-5 ms.
[Diagram: Layer 3 Implementation]
Development Status: Sustained voiced tones currently produce L3 > 0.6 within 0.5-1.0 seconds, while noisy segments keep L3 below 0.3. The L3 weight remains modest (0.1-0.2) so that L2 template matching dominates during the transition period.
Future Benchmarks: Target specifications include ≤5ms inference latency on mobile devices, accuracy uplift over L2 on hold-out datasets, and stable confidence in noisy acoustic environments. Model deployment will use swappable strategy pattern for A/B testing different architectures without system disruption.
Confidence Fusion Strategy
Individual layer outputs are combined using weighted fusion based on environmental conditions and historical accuracy. The fusion algorithm adapts weights dynamically to optimize for current audio environment characteristics.
Adaptive Weighting
Layer weights adjust based on ambient noise levels, microphone quality, and recent classification accuracy. Noisy environments increase reliance on pattern matching over simple gate detection.
Algorithm: EWMA smoothing with environment-aware weight adjustment
Temporal Smoothing
Exponentially weighted moving average (EWMA) applied to fused confidence scores to eliminate jitter while maintaining responsiveness to genuine voice input changes.
Balance: 95% noise rejection, <100ms response time
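A minimal sketch of weighted fusion followed by EWMA smoothing; the per-layer weights, the gate re-weighting shape, and the smoothing factor shown here are illustrative placeholders, not the tuned defaults.

```csharp
// Sketch: weighted layer fusion followed by EWMA smoothing.
// Weights and the smoothing factor are illustrative, not tuned defaults.
public sealed class ConfidenceFusionSketch
{
    // Base layer weights (L1 prosody, L2 templates, L3 tonal/ML), dampened by gate reliability.
    float _w1 = 0.3f, _w2 = 0.5f, _w3 = 0.2f;
    float _alpha = 0.25f;          // EWMA smoothing factor (higher = more responsive)
    float _smoothed;

    public float Update(float gateWeight, float l1, float l2, float l3)
    {
        // gateWeight (0..1) reduces all contributions in unreliable acoustic scenes.
        float wSum  = _w1 + _w2 + _w3;
        float fused = gateWeight * (_w1 * l1 + _w2 * l2 + _w3 * l3) / wSum;

        // EWMA: smoothed = α·fused + (1 − α)·previous
        _smoothed = _alpha * fused + (1f - _alpha) * _smoothed;
        return _smoothed;
    }
}
```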
Architecture Decisions
Strategy Pattern Implementation
Explicit interfaces for Momentum, Chain Windows, and Mapping Curves allow runtime behavior swapping. Default policies ship first; variants are added when player data shows the need.
YAGNI Principle: Avoid premature abstraction, add complexity when proven necessary
Composition Root Pattern
Service locators are kept at the application edges. Gameplay code receives dependencies through constructor injection, maintaining testability and clear contracts.
Testing Benefit: Mock audio services for automated gameplay validation
Event-Driven Communication
Loose coupling between audio detection and gameplay systems through events. Audio confidence changes trigger gameplay reactions without direct dependencies.
Flexibility: Add new gameplay mechanics without modifying voice processing
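A minimal sketch of the event seam between detection and gameplay; the event and handler names are hypothetical placeholders chosen only to show the decoupling pattern.

```csharp
// Sketch of the detection→gameplay event seam; names are placeholders.
public sealed class VoiceDetectionEventsSketch
{
    // Raised with the latest smoothed confidence; gameplay subscribes, while the
    // detection side never references gameplay types.
    public event System.Action<float> ConfidenceChanged;
    public event System.Action        ActionTriggered;    // fired once per accepted utterance

    public void PublishConfidence(float confidence) => ConfidenceChanged?.Invoke(confidence);
    public void PublishAction()                     => ActionTriggered?.Invoke();
}

// Gameplay side subscribes without depending on pipeline internals, e.g.:
// detectionEvents.ActionTriggered += OnVoiceAction;
```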
System Flow Diagram
Complete pipeline from microphone input to gameplay actions. This diagram is automatically generated from source files to stay synchronized with implementation.
[Diagram: System Architecture]
Configuration & Feature Flags
Feature Toggle System
Runtime feature flags allow A/B testing of audio processing algorithms. Configuration-driven development enables safe experimentation with detection methods.
Example: VcrFeatureFlags.EnableFsmArbiter = true
Layered Audio Processing
Multi-layer confidence fusion combines gate detection, pattern matching, and optional ML classification. Layers can be enabled/disabled for performance tuning.
Configuration: EnableFusionFeeds + UseGateScoreAsLayer1
Dependency Injection Setup
CompositionRoot handles complex object graph construction. The Gate, Fusion, EWMA, Arbiter, and Layer 2 matcher are all configured through a single injection point.
Benefit: Centralized configuration, simplified testing, clear dependencies
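A minimal sketch of the pattern, using hypothetical interface and class names; only the roles (a single composition root constructing the pipeline, constructor injection into gameplay code) come from the text above.

```csharp
// Sketch of the composition-root pattern: construction happens in one place,
// gameplay code receives only interfaces through its constructor.
// Interface and class names are illustrative stand-ins, not the shipped types.
public interface IVoiceConfidenceSource { float CurrentConfidence { get; } }
public interface IActionArbiter         { bool  TryConsumeAction(); }

public sealed class GameplayVoiceController
{
    readonly IVoiceConfidenceSource _confidence;
    readonly IActionArbiter _arbiter;

    // Constructor injection: no service locator calls inside gameplay code.
    public GameplayVoiceController(IVoiceConfidenceSource confidence, IActionArbiter arbiter)
    {
        _confidence = confidence;
        _arbiter = arbiter;
    }

    public void Tick()
    {
        // Continuous feedback from fused confidence, discrete reactions from arbiter pulses.
        UpdateMeter(_confidence.CurrentConfidence);
        if (_arbiter.TryConsumeAction())
            OnVoiceAction();
    }

    void UpdateMeter(float level) { /* e.g. drive UI feedback */ }
    void OnVoiceAction()          { /* e.g. trigger the mapped gameplay move */ }
}

public static class CompositionRootSketch
{
    public static GameplayVoiceController Build(IVoiceConfidenceSource fusionOutput,
                                                IActionArbiter arbiter)
    {
        // In the real root, Gate, Fusion, EWMA, Arbiter, and the Layer 2 matcher
        // would be constructed and wired here; this sketch receives the leaves.
        return new GameplayVoiceController(fusionOutput, arbiter);
    }
}
```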
Decision Engine Detail
The FSM Arbiter represents the critical decision-making component that converts multi-layer audio confidence scores into definitive gameplay actions. This finite-state machine design isolates complex signal processing from game logic, ensuring both systems remain independently testable and maintainable.
State Machine Logic
Four-layer confidence inputs (L0-L3) are fused and smoothed with EWMA filtering before entering the FSM, which applies configurable thresholds and confirm-hold states.
Benefit: Eliminates false positives while maintaining responsiveness
Signal Processing Pipeline
Multi-stage confidence fusion combines gate detection, pattern matching, and optional ML classification into a single action pulse for gameplay consumption.
Architecture: Fusion → EWMA → FSM → Acting Event
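A minimal sketch of a confirm-hold arbiter along these lines, assuming four states and illustrative thresholds; the shipped FsmArbiter's states, timings, and negative-evidence handling may differ.

```csharp
// Sketch of a confirm-hold arbiter: smoothed confidence in, single action pulse out.
// State names, thresholds, and timings are illustrative, not the shipped FsmArbiter values.
public sealed class FsmArbiterSketch
{
    enum State { Idle, Confirming, Acting, Cooldown }

    const float EnterThreshold  = 0.55f;  // confidence needed to start confirming
    const float ExitThreshold   = 0.35f;  // hysteresis: drop below this to abort
    const float ConfirmSeconds  = 0.12f;  // confidence must hold this long before acting
    const float CooldownSeconds = 0.40f;  // refractory period after an action pulse

    State _state = State.Idle;
    float _timer;

    // Returns true for exactly one update when an action is accepted.
    public bool Update(float smoothedConfidence, bool negativeEvidence, float dt)
    {
        if (negativeEvidence) { _state = State.Cooldown; _timer = 0f; return false; }

        switch (_state)
        {
            case State.Idle:
                if (smoothedConfidence >= EnterThreshold) { _state = State.Confirming; _timer = 0f; }
                return false;

            case State.Confirming:
                if (smoothedConfidence < ExitThreshold) { _state = State.Idle; return false; }
                _timer += dt;
                if (_timer >= ConfirmSeconds) { _state = State.Acting; return true; }  // action pulse
                return false;

            case State.Acting:
                _state = State.Cooldown; _timer = 0f;
                return false;

            case State.Cooldown:
                _timer += dt;
                if (_timer >= CooldownSeconds) _state = State.Idle;
                return false;

            default:
                return false;
        }
    }
}
```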
Testability Design
Because decision logic is completely separated from signal extraction, state transitions can be unit tested with mock confidence inputs and the output timing can be validated.
Testing: Mock inputs → verify state transitions → assert action timing
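A minimal sketch of that testing flow, written with NUnit (which Unity's test framework builds on) against the FsmArbiterSketch above rather than the shipped arbiter.

```csharp
using NUnit.Framework;

// Sketch of decision-logic testing with synthetic confidence traces.
public class FsmArbiterSketchTests
{
    [Test]
    public void SustainedHighConfidence_ProducesExactlyOneActionPulse()
    {
        var arbiter = new FsmArbiterSketch();
        int pulses = 0;

        // Feed 0.5 s of high confidence at a 10 ms tick.
        for (int i = 0; i < 50; i++)
            if (arbiter.Update(smoothedConfidence: 0.8f, negativeEvidence: false, dt: 0.01f))
                pulses++;

        Assert.AreEqual(1, pulses);
    }

    [Test]
    public void ShortBlip_BelowConfirmWindow_NeverTriggers()
    {
        var arbiter = new FsmArbiterSketch();
        bool triggered = false;

        // A 50 ms blip is shorter than the confirm window, so no action should fire.
        for (int i = 0; i < 5; i++)
            triggered |= arbiter.Update(0.9f, false, 0.01f);

        Assert.IsFalse(triggered);
    }
}
```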
[Diagram: FSM Arbiter Implementation]