Offline Speech Recognition (ASR)
An evaluation of offline speech recognition built on the Vosk/Kaldi framework, measuring the impact of environmental noise on Word Error Rate (WER) through a three-stage noise-reduction pipeline and Levenshtein-distance calculations.
Initialization
- Click 'Initialize Python Runtime' to load Pyodide and scipy
- Wait for the status to show 'Ready' before proceeding

Execution
- ASR analysis runs automatically after initialization
- Click 'Re-run Analysis' to execute noise reduction and WER calculation again
- View real-time updates in the status indicator

Layout
- Python Implementation Section: view the complete syntax-highlighted code (5 analysis steps)
- Analysis Results Section: interactive output with WER metrics, SNR charts, and recognition results
import numpy as np
from scipy import signal
import json
# ============================================================
# STEP 1: Generate Synthetic Airport Audio with Noise
# ============================================================
# Simulating 5 airport voice queries with varying noise levels
np.random.seed(42)
# Sample airport queries (word counts for WER calculation)
queries = [
    {"text": "where is the check in desk", "words": 6},
    {"text": "what time is my plane", "words": 5},
    {"text": "can you help me find my parents", "words": 7},
    {"text": "where can i check my suitcase", "words": 6},
    {"text": "please direct me to gate 23", "words": 6}
]
# Simulate noise levels (Signal-to-Noise Ratio in dB)
snr_levels = [20, 15, 10, 5, 0] # Clean to very noisy
# ============================================================
# STEP 2: Audio Noise Reduction Pipeline
# ============================================================
def apply_noise_reduction(signal_data, sample_rate=16000):
    """
    Three-stage noise reduction pipeline for airport audio.
    """
    # Stage 1: High-pass filter (remove low-frequency rumble)
    # Cutoff: 80 Hz (HVAC, distant engines)
    sos_hp = signal.butter(4, 80, btype='highpass',
                           fs=sample_rate, output='sos')
    filtered = signal.sosfilt(sos_hp, signal_data)
    # Stage 2: Normalize dynamic range
    max_val = np.max(np.abs(filtered))
    if max_val > 0:
        filtered = filtered / max_val
    # Stage 3: Low-pass filter (remove high-frequency hiss)
    # Cutoff: 3800 Hz (human speech energy is mostly below 4 kHz)
    sos_lp = signal.butter(4, 3800, btype='lowpass',
                           fs=sample_rate, output='sos')
    filtered = signal.sosfilt(sos_lp, filtered)
    return filtered
# ============================================================
# STEP 3: Word Error Rate (WER) Calculation
# ============================================================
def levenshtein_distance(ref, hyp):
    """
    Calculate the word-level edit distance between reference and hypothesis.
    """
    ref_words = ref.split()
    hyp_words = hyp.split()
    m, n = len(ref_words), len(hyp_words)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialize base cases
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    # Fill DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # Deletion
                    dp[i][j-1],    # Insertion
                    dp[i-1][j-1]   # Substitution
                )
    return dp[m][n]

def calculate_wer(reference, hypothesis):
    """
    Word Error Rate = (Edits / Reference Words) * 100
    """
    distance = levenshtein_distance(reference, hypothesis)
    ref_words = len(reference.split())
    wer = (distance / ref_words) * 100 if ref_words > 0 else 0
    return round(wer, 1)
# ============================================================
# STEP 4: Simulate ASR Recognition with Noise Impact
# ============================================================
recognition_results = []
for i, query in enumerate(queries):
    ref = query["text"]
    snr = snr_levels[i]
    # Simulate recognition errors based on SNR
    if snr >= 15:
        hyp = ref  # Perfect recognition
        wer = 0.0
    elif snr >= 10:
        # Corrupt one word (simulated misrecognition)
        words = ref.split()
        if len(words) > 2:
            words[2] = words[2][:-1] + "d" if words[2].endswith("e") else words[2] + "s"
        hyp = " ".join(words)
        wer = calculate_wer(ref, hyp)
    elif snr >= 5:
        hyp = ref.replace("please", "police").replace("where", "were")
        wer = calculate_wer(ref, hyp)
    else:
        # Severe degradation: every other word is lost
        words = ref.split()
        hyp = " ".join(words[::2])
        wer = calculate_wer(ref, hyp)
    recognition_results.append({
        "query": ref,
        "snr_db": snr,
        "hypothesis": hyp,
        "wer": wer,
        "word_count": query["words"]
    })
# ============================================================
# STEP 5: Performance Analysis
# ============================================================
total_wer = sum(r["wer"] for r in recognition_results)
avg_wer = round(total_wer / len(recognition_results), 1)
accuracy = round(100 - avg_wer, 1)
clean_results = [r for r in recognition_results if r["snr_db"] >= 15]
noisy_results = [r for r in recognition_results if r["snr_db"] < 10]
clean_wer = sum(r["wer"] for r in clean_results) / len(clean_results) if clean_results else 0
noisy_wer = sum(r["wer"] for r in noisy_results) / len(noisy_results) if noisy_results else 0
# Export results
results = {
    "summary": {
        "avg_wer": avg_wer,
        "accuracy": accuracy,
        "clean_wer": round(clean_wer, 1),
        "noisy_wer": round(noisy_wer, 1),
        "total_queries": len(queries)
    },
    "queries": recognition_results,
    "snr_levels": snr_levels,
    "wer_by_snr": [r["wer"] for r in recognition_results]
}
print(f"RESULTS_JSON:{json.dumps(results)}")
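Step 1 above labels the audio as synthetic, but the script works purely with text queries and SNR labels. For readers who want actual waveforms, here is a minimal, self-contained sketch of mixing white noise into a clean signal at a target SNR (the function name `mix_at_snr` is ours, not part of the lab code):

```python
import numpy as np

def mix_at_snr(clean, snr_db, rng=None):
    """Add white noise so the mixture has approximately `snr_db` dB SNR."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal(len(clean))
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so 10*log10(signal_power / scaled_noise_power) == snr_db
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: a 1 kHz tone at 16 kHz sampling, degraded to 5 dB SNR
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 1000 * t)
noisy = mix_at_snr(clean, snr_db=5)
```

Feeding such mixtures through `apply_noise_reduction` before recognition would let the simulation above operate on real audio rather than SNR labels alone.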
Listening Through the Noise
Imagine trying to talk to a kiosk in a busy airport. Announcements are blaring and jets are taking off. This lab tests if a computer can still understand you.
The Challenge
Our ears naturally focus on one voice and tune out the rest (the "cocktail party problem"). Computers find this incredibly hard. This demo shows that effective noise reduction can be the difference between "Flight cancelled" and "Flight confirmed."
The “Audio Sunglasses”
To help the computer focus, we filter the sound—like putting on sunglasses to cut through glare.
- Rumble Remover (<80Hz): Blocks low sounds like HVAC hums.
- Volume Knob (Dynamic Range): Boosts the voice so it doesn’t get lost.
- Hiss Eraser (>3800Hz): Cuts high-pitched sharp noises.
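The "rumble remover" is easy to check numerically. The sketch below (our own illustration, using the same scipy Butterworth design as the lab's Stage 1) mixes a 50 Hz rumble tone with a 1 kHz speech-band tone and measures how much the rumble-to-speech ratio shrinks after filtering:

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs  # one second of audio
mix = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 1000 * t)

# Same design as the lab's Stage 1: 4th-order high-pass at 80 Hz
sos_hp = signal.butter(4, 80, btype='highpass', fs=fs, output='sos')
out = signal.sosfilt(sos_hp, mix)

# With 1 s of audio at 16 kHz, rfft bin k corresponds to k Hz
spec_in = np.abs(np.fft.rfft(mix))
spec_out = np.abs(np.fft.rfft(out))
ratio_before = spec_in[50] / spec_in[1000]
ratio_after = spec_out[50] / spec_out[1000]
# A 4th-order Butterworth at 80 Hz attenuates 50 Hz by roughly 16 dB,
# while the 1 kHz speech tone passes essentially untouched.
```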
Measuring Success
We count the mistakes using Word Error Rate (WER).
- 0% WER: Perfect understanding.
- 50% WER: Roughly half the words need fixing (and because insertions count too, WER can even exceed 100%).
- The Math: We use the Levenshtein distance to find the minimum number of word-level edits (substitutions, insertions, deletions) needed to turn the computer's guess into the true sentence.
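A worked example (ours, using the same word-level Levenshtein idea as the lab code): if the kiosk hears "police direct me to gate 23" instead of "please direct me to gate 23", that is one substitution across six reference words, i.e. about 16.7% WER.

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum word-level edits divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return 100.0 * dp[len(r)][len(h)] / len(r)

print(round(wer("please direct me to gate 23",
                "police direct me to gate 23"), 1))  # prints 16.7
```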
Privacy First
This runs entirely in your browser using Vosk & Kaldi. No audio is ever sent to the cloud. This is critical for privacy and reliability—an airport kiosk must work even if the internet goes down.
References
[1] Povey, D., et al. (2011). “The Kaldi Speech Recognition Toolkit.” IEEE Workshop on Automatic Speech Recognition and Understanding. (The foundation of our offline engine).
[2] Levenshtein, V.I. (1966). “Binary codes capable of correcting deletions, insertions, and reversals.” Soviet Physics Doklady. (The algorithm we use to calculate error rates).