OPERATIONAL MANUAL

Initialization

  • Click 'Start Audio Engine' to initialize the speech synthesis system.

Operation

  • Click 'Play Distorted Audio' to hear the current challenge phrase.
  • Type the phrase you hear into the input field.
  • Click 'Verify' to check your answer.
  • Click 'Generate New Code' to create a fresh challenge.

Observation

  • Watch the real-time waveform visualization react to speech output.
  • Notice how pitch and rate change with each playback; this is the bot-thwarting mechanism.


The Turing Test Problem

In 1950, Alan Turing proposed the question: “Can machines think?” Today, the question is reversed — can a machine prove that its user is human? Visual CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) have been the standard approach, but they fail a critical population: visually impaired users.

This lab implements an audio-first CAPTCHA that exploits a fundamental weakness in Automatic Speech Recognition (ASR) engines while preserving human accessibility.

How ASR Engines Fail

Modern ASR models (Google Speech-to-Text, Whisper, Vosk) are trained on clean, predictable speech patterns. They rely on:

  • Consistent phoneme timing — each phoneme occupies an expected duration window.
  • Stable fundamental frequency (F0) — the pitch of the speaker remains within a narrow band.
  • Standard speaking rate — typically 120-150 words per minute.

This system disrupts all three assumptions simultaneously.
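The speaking-rate assumption can be illustrated numerically: slowing an utterance by the pipeline's 0.65x–0.95x rate factor (described in the next section) pushes the effective rate below the 120–150 wpm window for most of the range. A minimal sketch, assuming a nominal 135 wpm base rate (an illustrative value, not measured from any specific ASR engine):

```python
# Sketch: how rate modulation pushes speech outside the ASR timing window.
# BASE_WPM and the 120-150 wpm window are illustrative values from the text,
# not parameters of a real ASR decoder.

ASR_WPM_WINDOW = (120, 150)   # typical decoder expectation (words per minute)
BASE_WPM = 135                # assumed clean-TTS speaking rate

def effective_wpm(base_wpm: float, rate_factor: float) -> float:
    """Speaking rate after applying a playback-rate factor (<1 slows speech)."""
    return base_wpm * rate_factor

def inside_asr_window(wpm: float) -> bool:
    lo, hi = ASR_WPM_WINDOW
    return lo <= wpm <= hi

# The pipeline's rate-modulation range is 0.65x-0.95x:
for factor in (0.65, 0.80, 0.95):
    wpm = effective_wpm(BASE_WPM, factor)
    print(f"rate {factor:.2f}x -> {wpm:.2f} wpm, in window: {inside_asr_window(wpm)}")
```

At the extremes of the range, only the mildest slowdown (0.95x, about 128 wpm) stays inside the expected window; the rest of the range lands well below it.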

The Distortion Pipeline

Each time the challenge is spoken, three obfuscation operations are applied to the synthesized voice:

  1. Pitch Randomization (0.6x – 1.4x): The fundamental frequency is shifted randomly per utterance. This prevents acoustic fingerprinting — the same phrase never sounds identical twice, defeating pattern-matching strategies.
  2. Rate Modulation (0.65x – 0.95x): Speech is decelerated to varying degrees, stretching phoneme boundaries beyond the timing windows ASR decoders expect. The inter-word gaps become unpredictable.
  3. White Noise Injection: A controlled noise floor is mixed into the synthesized output, reducing the Signal-to-Noise Ratio (SNR) from effectively infinite (clean TTS) to approximately 15–20 dB — still comfortable for human comprehension but significantly degrading ASR Word Error Rate (WER).
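The three operations can be sketched offline in Python (the lab itself runs in the browser, so this is an illustrative stand-in, not the lab's actual code). Pitch and rate factors are drawn per utterance from the ranges above, and the noise amplitude is derived from the target SNR via SNR_dB = 10·log10(P_signal / P_noise). A sine tone stands in for the synthesized speech:

```python
import math
import random

def distortion_params(rng: random.Random):
    """Draw per-utterance pitch and rate factors (ranges from the pipeline above)."""
    pitch = rng.uniform(0.6, 1.4)    # 1. pitch randomization
    rate = rng.uniform(0.65, 0.95)   # 2. rate modulation (always slows speech)
    return pitch, rate

def add_noise_at_snr(signal, snr_db: float, rng: random.Random):
    """3. Mix white Gaussian noise into `signal` so the mix hits the target SNR."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = p_signal / (10 ** (snr_db / 10))   # SNR_dB = 10*log10(Ps/Pn)
    sigma = math.sqrt(p_noise)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(42)
pitch, rate = distortion_params(rng)

# A 1 kHz test tone standing in for synthesized speech (8 kHz sample rate, 1 s).
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8000)]
noisy = add_noise_at_snr(tone, snr_db=15.0, rng=rng)
```

Because the factors are redrawn from the random generator on every playback, no two utterances of the same phrase share an acoustic fingerprint, which is the property the pitch-randomization step relies on.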

Human Perception vs. Machine Perception

The human auditory system is remarkably robust. Psychoacoustic research demonstrates that speech intelligibility remains above 90% even at SNR levels as low as 5 dB, and pitch variations of ±40% are perceived as “different voices” rather than unintelligible noise. ASR engines, by contrast, show exponential WER increase below 15 dB SNR.

This asymmetry is the security margin of the system.

Accessibility Compliance

Unlike visual CAPTCHAs, this audio-first approach is inherently accessible to visually impaired users. The distortion parameters are calibrated to stay within the intelligible range for human hearing (ITU-T P.800 standard), ensuring the system does not create barriers while maintaining its security function.

References

[1] Turing, A.M. (1950). “Computing Machinery and Intelligence.” Mind, 59(236), 433-460. (The foundational paper on machine intelligence testing).

[2] von Ahn, L., Blum, M., Hopper, N.J., & Langford, J. (2003). “CAPTCHA: Using Hard AI Problems for Security.” EUROCRYPT 2003. (The original CAPTCHA concept).

[3] Tam, J., Simsa, J., Hyde, S., & von Ahn, L. (2008). “Breaking Audio CAPTCHAs.” Advances in Neural Information Processing Systems (NeurIPS). (Analysis of audio CAPTCHA vulnerabilities).

[4] ITU-T Recommendation P.800 (1996). “Methods for subjective determination of transmission quality.” (Standard for speech quality assessment).