# Voice Activity Detector
The Voice Activity Detector (VAD) helps you determine when speech is present in an audio stream. It analyzes short chunks of PCM audio and returns the probability that the chunk contains voice.
This can be used to:
- Drive push-to-talk or auto-mute logic
- Skip encoding/sending silence to save bandwidth
- Trigger UI indicators when the user is speaking
API: `dev.onvoid.webrtc.media.audio.VoiceActivityDetector`
## Overview

`VoiceActivityDetector` exposes a minimal API:

- `process(byte[] audio, int samplesPerChannel, int sampleRate)`: Analyze one audio frame.
- `getLastVoiceProbability()`: Retrieve the probability (0.0..1.0) that the last processed frame contained voice.
- `dispose()`: Release native resources. Always call this when done.
Internally, the VAD uses a native implementation optimized for real-time analysis. The class itself does not perform resampling or channel mixing, so provide audio that matches the given `sampleRate` and the expected format.
## Audio format expectations

- PCM signed 16-bit little-endian (typical Java `byte[]` from microphone capture via this library)
- Mono is recommended. If you have stereo, downmix to mono before calling `process` (see the sketch after this list) or pass samples-per-channel accordingly
- Frame size: commonly 10 ms per call (e.g., 160 samples at 16 kHz for 10 ms)
- Supported sample rates: 8 kHz, 16 kHz, 32 kHz, 48 kHz (use one of these for best results)
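If your capture source delivers interleaved stereo, one simple approach is to average the two channels into a mono buffer before calling `process`. The helper below is a minimal sketch, not part of the library (the name `downmixToMono` is hypothetical); it assumes 16-bit little-endian PCM as described above.

```java
// Sketch: downmix interleaved 16-bit little-endian stereo PCM to mono
// by averaging the left and right channel of each sample pair.
static byte[] downmixToMono(byte[] stereo) {
    byte[] mono = new byte[stereo.length / 2];

    for (int in = 0, out = 0; in + 3 < stereo.length; in += 4, out += 2) {
        // Decode the left and right 16-bit little-endian samples.
        int left  = (short) ((stereo[in]     & 0xFF) | (stereo[in + 1] << 8));
        int right = (short) ((stereo[in + 2] & 0xFF) | (stereo[in + 3] << 8));

        // Average the channels; the result still fits in 16 bits.
        int mixed = (left + right) / 2;

        // Encode back as 16-bit little-endian.
        mono[out]     = (byte) mixed;
        mono[out + 1] = (byte) (mixed >> 8);
    }

    return mono;
}
```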
## Basic usage
```java
import dev.onvoid.webrtc.media.audio.VoiceActivityDetector;

// Create the detector
VoiceActivityDetector vad = new VoiceActivityDetector();

try {
    // Example parameters
    int sampleRate = 16000;                              // 16 kHz
    int frameMs = 10;                                    // 10 ms frames
    int samplesPerChannel = sampleRate * frameMs / 1000; // 160 samples

    // audioFrame must contain 16-bit PCM data for one frame (mono)
    byte[] audioFrame = new byte[samplesPerChannel * 2]; // 2 bytes per sample

    // Fill audioFrame from your audio source here
    // ...

    // Analyze frame
    vad.process(audioFrame, samplesPerChannel, sampleRate);

    // Query probability of voice in the last frame
    float prob = vad.getLastVoiceProbability(); // 0.0 .. 1.0
    boolean isSpeaking = prob >= 0.5f; // choose a threshold that works for your app
}
finally {
    // Always release resources
    vad.dispose();
}
```
## Continuous processing loop
```java
VoiceActivityDetector vad = new VoiceActivityDetector();

try {
    int sampleRate = 16000;
    int frameMs = 10;
    int samplesPerChannel = sampleRate * frameMs / 1000; // 160 samples

    byte[] audioFrame = new byte[samplesPerChannel * 2];

    // "running" is your application's loop flag, e.g. a volatile boolean.
    while (running) {
        // Read PCM frame from your capture pipeline into audioFrame
        // ...

        vad.process(audioFrame, samplesPerChannel, sampleRate);

        float prob = vad.getLastVoiceProbability();

        if (prob > 0.8f) {
            // High confidence of speech
            // e.g., enable VU meter, unmute, or mark active speaker
        }
        else {
            // Likely silence or noise
        }
    }
}
finally {
    vad.dispose();
}
```
## Tips and best practices
- Threshold selection: Start with 0.5–0.8 and tune for your environment (see the hysteresis sketch after this list).
- Frame size consistency: Use a consistent frame duration and sample rate.
- Resource management: VAD holds native resources; ensure `dispose()` is called.
- Preprocessing: Consider using `AudioProcessing` (noise suppression, gain control) before VAD for improved robustness in noisy environments. See the Audio Processing guide.
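A single fixed threshold compared frame-by-frame can flicker between "speaking" and "silent". One common remedy, sketched below with illustrative constants (none of them prescribed by this library), is hysteresis with a short hangover: require a high probability to enter the speaking state, and keep that state for a few frames after the probability drops so brief dips do not clip the end of an utterance.

```java
// Sketch: hysteresis with hangover on top of the per-frame probability.
// All constants are illustrative starting points; tune them for your app.
float enterThreshold = 0.8f; // probability needed to enter "speaking"
float exitThreshold  = 0.5f; // probability below which "speaking" may end
int   hangoverFrames = 30;   // ~300 ms of 10 ms frames kept after speech

boolean speaking = false;
int hangover = 0;

// Run once per processed frame, after vad.process(...):
float prob = vad.getLastVoiceProbability();

if (prob >= enterThreshold) {
    speaking = true;
    hangover = hangoverFrames;
}
else if (speaking && prob < exitThreshold) {
    // Only leave the speaking state once the hangover has elapsed.
    if (--hangover <= 0) {
        speaking = false;
    }
}
```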