TurboVAD is a lightweight, high-speed Voice Activity Detection (VAD) system that utilizes self-attention and spectral features for accurate voice detection. Designed for real-time applications, TurboVAD achieves high efficiency and accuracy, making it ideal for low-latency environments.
Note: This project is currently under testing. The initial benchmark results are based on a small sample size and may vary as further testing and fine-tuning are performed.
- Self-Attention Mechanism for efficient feature weighting
- Spectral Feature Analysis including spectral centroid, entropy, flatness, and harmonic noise ratio
- Optimized C Implementation for performance and efficiency
- Model Size < 1KB, making it ideal for memory-constrained applications
- Fine-tuned for Bird Activity Detection to enhance avian vocalization analysis
- Highly optimized build system with extensive compiler optimizations and SIMD support
- Ultra-fast processing at ~1,198× real-time with 0.5 s segments (~13,889× with 50 ms segments)
TurboVAD processes 8 spectral-temporal features:
- Spectral Centroid (μ + σ)
- Spectral Entropy (μ + σ)
- Spectral Flatness (μ + σ)
- Harmonic Noise Ratio (μ + σ)
These features represent the mean (μ) and standard deviation (σ) of each measure over a time frame, providing robust temporal characteristics for accurate classification.
These features are processed using a self-attention mechanism, which assigns dynamic importance to each feature and refines classification using a softmax-based scoring system. The model is specifically optimized for bird activity detection, making it ideal for wildlife monitoring and ornithological research.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '12px'}}}%%
graph TD
    A["Input Features<br/>μ1-4, σ1-4"] --> B[Mu Branch]
    A --> C[Sigma Branch]
    B --> D["Dot Product<br/>W_mu[4×4]"]
    C --> E["Dot Product<br/>W_sigma[4×4]"]
    D --> F[Softmax]
    E --> G[Softmax]
    F --> H["Linear Comb.<br/>Classifier_mu[4]"]
    G --> I["Linear Comb.<br/>Classifier_sigma[4]"]
    H --> J[Mu Score]
    I --> K[Sigma Score]
    J --> L["Weighted Sum<br/>factors=[-11.0, 6.18]"]
    K --> L
    L --> M["Sigmoid<br/>threshold=0.5002"]
    M --> N{Prediction<br/>0/1}

    classDef feature fill:#f8bbd0,stroke:#c2185b;
    classDef weights fill:#e1bee7,stroke:#8e24aa;
    classDef math fill:#c5cae9,stroke:#3949ab;
    classDef output fill:#b2dfdb,stroke:#00796b;
    classDef decision fill:#ffcc80,stroke:#f57c00;

    class A feature;
    class D,E weights;
    class F,G,H,I math;
    class J,K,L,M output;
    class N decision;
```
- Mu Branch (μ): Processes mean spectral features; output: temporal stability of feature averages.
- Sigma Branch (σ): Analyzes feature variations; output: detection of transient patterns.
- Feature Projection: The 8D input (two 4D halves) is mapped to 8D attention scores via two 4×4 matrix multiplications, one per branch.
- Softmax Normalization: Applies numerically stable softmax to the score vectors.
- Classification Heads: Linear combination with 4-element classifier weights for each branch.
- Decision Fusion:
  - Formula: score = (-11.003 × μ_score) + (6.180 × σ_score)
  - Sigmoid activation with threshold 0.5002 for the final classification.
- Inference Time Optimization:
  - Cache-local weight access, branchless computation, compiler auto-vectorization (AVX2).
- Memory Safety: Stack-allocated intermediates, no heap allocations in the prediction path.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '12px'}}}%%
graph TD
    A[Audio Input] --> B[Format Detection]
    B --> C{Type?}
    C -->|MP3| D[Decode with SIMD]
    C -->|WAV| E[Direct Processing]
    D --> F[STFT Computation]
    E --> F

    subgraph Feature Extraction
        F --> G[Window Application<br/>size=window_size_det]
        G --> H[FFT via FFTW3]
        H --> I[Compute μ/σ Features]
    end

    I --> J[Attention Model]
    J --> K{Detection?}

    subgraph Output Generation
        K -->|Yes| L[Segment Slicing<br/>seg_len=0.05s]
        L --> M[STFT Recalculation<br/>size=window_size_img]
        M --> N[Save WAV]
        M --> O[Generate STFT PNG]
        M --> P[Create Mel Spectrogram]
    end
```
Input Handling:
- Audio Detection: Auto-detects MP3/WAV using file headers
```c
audio_data audio = auto_detect(filename);
```
- Decoding:
- MP3: SIMD-accelerated via libmpg123
- WAV: Direct memory mapping
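Header-based format sniffing of this kind is simple to do by hand. A minimal illustrative version (not the project's `auto_detect()`) checks the RIFF/WAVE magic for WAV, and an ID3 tag or MPEG frame sync for MP3:

```c
#include <stdint.h>
#include <string.h>

typedef enum { FMT_UNKNOWN, FMT_WAV, FMT_MP3 } audio_fmt;

/* Illustrative header sniffing, not the project's auto_detect(): WAV files
 * begin with "RIFF....WAVE"; MP3 files begin with an "ID3" tag or an MPEG
 * frame sync (0xFF followed by a byte whose top 3 bits are set). */
static audio_fmt sniff_format(const uint8_t *hdr, size_t len) {
    if (len >= 12 && memcmp(hdr, "RIFF", 4) == 0 &&
        memcmp(hdr + 8, "WAVE", 4) == 0)
        return FMT_WAV;
    if (len >= 3 && memcmp(hdr, "ID3", 3) == 0)
        return FMT_MP3;
    if (len >= 2 && hdr[0] == 0xFF && (hdr[1] & 0xE0) == 0xE0)
        return FMT_MP3;
    return FMT_UNKNOWN;
}
```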
Spectral Analysis:
- STFT Parameters:
```c
stft_d spec = stft(&audio, window_size_det, hop_size_det, ...);
```
- Dual window sizes:
- Detection: Optimal for model (default 2048 samples)
- Imaging: Visual quality (default 4096 samples)
- Feature Extraction: Computes 8 features (4 μ + 4 σ) per 50ms segment
Output Generation:
- Conditional Saving:
```c
if (pred.val > 0.5002) {
    sliced_write_wave(...);
    save_spectrogram(...);
}
```
Technical Highlights:
| Aspect | Implementation Detail |
|---|---|
| Memory Efficiency | <1KB model weights, stack allocation |
| Parallelism | 3.19× speedup on 6-core Ryzen |
| Numerical Stability | Safe division, softmax with max-subtraction |
| Reproducibility | FFTW wisdom caching |
| Portability | Single-precision floats throughout |
Dual-window architecture enables:
- Small windows (50ms) for detection latency
- Large windows (100ms+) for visual quality
- Zero-copy data sharing between stages
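To relate the default window sizes to those durations: at a 44.1 kHz sample rate (an assumption here; the actual rate comes from the input file), 2048 samples span roughly 46 ms and 4096 samples roughly 93 ms, matching the ~50 ms detection / ~100 ms imaging split:

```c
#include <math.h>

/* Window duration in milliseconds for a given sample rate. The 44.1 kHz
 * figure used in the worked numbers below is an assumption; the real rate
 * is read from the input file. */
static float window_ms(int window_size, float sample_rate) {
    return 1000.0f * (float)window_size / sample_rate;
}

/* window_ms(2048, 44100.0f) ≈ 46.4 ms  (detection window)
 * window_ms(4096, 44100.0f) ≈ 92.9 ms  (imaging window)   */
```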
Note: The benchmark results presented here are from initial tests using a small sample size. Further testing and benchmarking are required to refine the performance metrics.
| Metric | 0.5s Segments | 50ms Segments | Improvement |
|---|---|---|---|
| Processing Speed | ~1,198× real-time | ~13,889× real-time | 11.6× faster |
| Total Processing Time | 0.834 ms/audio-sec | 0.072 ms/audio-sec | 11.6× reduction |
| Model Inference Time | 87.686 ns/audio-sec | 65.711 ns/audio-sec | 25.1% faster |
| STFT | 0.392 ms | 27 µs (0.027 ms) | 14.5× faster |
| Normalization | 69.408 µs | 8.177 µs | 8.5× faster |
| Feature Extraction | 372.793 µs | 36.843 µs | 10.1× faster |
| System IPC | 0.229 | 0.759 | 3.3× better |
| Branch Miss Rate | 1.49% | 1.99% | 33.6% increase |
| Parallel Speedup | 7.96× | 3.31× | 58.4% reduction |
| CPUs Utilized | 8.31 | 4.30 | 48.3% reduction |
Key Observations:
- 14.5× faster STFT from improved cache locality
- Parallel efficiency tradeoff due to the increased I/O-bound workload
- 3.3× better IPC shows improved instruction throughput
- Branch prediction slightly impacted by finer segmentation
Note: The reduced parallel speedup and CPU utilization reflect increased I/O dominance at smaller segment sizes - a common tradeoff in real-time systems.
| Metric | Value |
|---|---|
| Accuracy | 79% |
| Model Size | < 1KB |
| Optimal Threshold | 0.5002 |
| Processing Speed | ~13,889× real-time |
| Total Processing Time | 0.072 ms/audio-sec |
The benchmarks were generated using random sampling across diverse bird species, with results averaged over multiple samples to ensure representative performance metrics. All timings are normalized per second of audio for consistent comparison.
TurboVAD employs a comprehensive build system with multiple optimization levels:
```sh
# Build with built-in color scheme (default)
make builtin

# Build with OpenCV-like color scheme
make opencv_like

# Build in debug mode
make debug

# Clean build artifacts
make clean

# Run the program
make run

# Run with memory checking
make run_debug
```
- Advanced Optimization with `-Ofast` for maximum performance
- SIMD Acceleration with AVX, AVX2, FMA, and SSE4.2 instruction sets
- Parallel Processing via OpenMP (achieving ~8× parallel speedup)
- Link-Time Optimization for cross-file optimizations
- Vectorization of loops for improved performance
- Aggressive loop optimizations including unrolling, peeling, and interchange
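The OpenMP parallelism follows the usual pattern of splitting independent segments across threads. A minimal sketch (illustrative names, not TurboVAD's API) with a trivial per-segment energy feature:

```c
#include <math.h>

/* Sketch of the OpenMP pattern behind the quoted multi-core speedups
 * (names are illustrative, not TurboVAD's API). Each iteration handles
 * one independent segment and writes a disjoint slice of out, so no
 * synchronization is needed. Compile with -fopenmp; without it the
 * pragma is ignored and the loop simply runs serially. */
static void segment_energy(const float *samples, int nseg, int seg_len,
                           float *out) {
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < nseg; s++) {
        float e = 0.0f;
        const float *p = samples + (long)s * seg_len;
        for (int i = 0; i < seg_len; i++)
            e += p[i] * p[i];
        out[s] = e;
    }
}
```

Because every iteration is independent, `schedule(static)` divides the segments evenly with no runtime scheduling overhead.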
TurboVAD accepts the following parameters:
- filename: Path to the input audio file
- window_size_det: Window size for detection
- window_size_img: Window size for output stft/mel spectrogram
- hop_size_det: Hop size for detection
- hop_size_img: Hop size for output stft/mel spectrogram
- window_type_det: Window function type for detection
- window_type_img: Window function type for output stft/mel spectrogram
- seg_len: Segment length
- num_mel: Number of Mel filterbanks
- mel_min: Minimum Mel frequency
- mel_max: Maximum Mel frequency
- wav_fol: Path to the waveform folder
- stft_fol: Path to the STFT cache folder
- mel_fol: Path to the Mel spectrogram cache folder
- cache_fol: Path to the FFT cache folder
- cs_stft: Color scheme ID for visualization output STFT
- cs_mel: Color scheme ID for visualization output Mel Spectrogram
- th: Audio filtration threshold. If set to 0, the STFT is filtered dynamically to remove noise; for noisy data, a value of 0.01 is recommended.
- Further fine-tuning for different bird species
- Expansion to support multi-environment bird call detection
- Optimizations for low-power embedded devices
- Benchmarking against traditional VAD systems
Contributions are welcome! Feel free to open issues, submit PRs, or discuss improvements in the repo.
TurboVAD is open-source under the MIT License.
Spectral Feature Based Voice Activity Detection using Self Attention