Startled? So were a bunch of researchers experimenting with OpenAI's Whisper, a widely used speech-to-text model.
Despite being trained on 680,000 hours of audio data, Whisper sometimes "hallucinates," inventing entire phrases and sentences out of thin air. These hallucinations can include disturbing or entirely fabricated content.
For example, in one instance, Whisper accurately transcribed a simple sentence but then hallucinated five additional sentences peppered with words like “terror,” “knife,” and “killed.” In other cases, it generated random names, partial addresses, and irrelevant websites. Even phrases commonly used by YouTubers, such as “Thanks for watching and Electric Unicorn,” inexplicably appeared in some transcriptions.
While OpenAI has made strides in reducing Whisper's hallucination rate since its release in 2022, the issue persists, especially for speakers with speech impairments such as aphasia, whose speech tends to contain longer pauses.
The root of the problem seems to lie in how the underlying technology interprets pauses and silences, erroneously treating them as cues to generate words. Koenecke and her colleagues suggest that the large language model underlying Whisper fills these silent stretches with invented text rather than leaving them blank.
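To make the mechanism concrete, here is a minimal sketch, not the researchers' method, of one common mitigation: trimming long stretches of silence before transcription so the model has fewer empty spans to fill with invented text. It assumes the open-source openai-whisper and librosa Python packages; the audio file name is a hypothetical example.

```python
import numpy as np
import librosa
import whisper

AUDIO_PATH = "interview.wav"  # hypothetical example file

# Load audio at Whisper's expected 16 kHz sample rate.
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

# Find non-silent intervals (anything quieter than ~30 dB below the peak is
# treated as silence) and concatenate them, dropping the long pauses that
# can trigger hallucinated text.
intervals = librosa.effects.split(audio, top_db=30)
trimmed = np.concatenate([audio[start:end] for start, end in intervals])

model = whisper.load_model("base")

# condition_on_previous_text=False also helps: it stops an earlier
# hallucination from being fed back in as context for later segments.
result = model.transcribe(trimmed, condition_on_previous_text=False)
print(result["text"])
```

This kind of preprocessing reduces, but does not eliminate, the problem, which is why the researchers focus on the model itself rather than on workarounds.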
Koenecke warns that even a small proportion of these hallucinations can have serious implications. “While most transcriptions are accurate, the few that are not can cause significant harm,” she said. “This can lead to significant consequences if these transcriptions are used in AI-based hiring processes, legal settings, or medical records.”
As AI technology continues to evolve, it is crucial to address these hallucination problems to ensure speech-to-text systems are reliable and safe, particularly in sensitive applications like hiring, legal proceedings, and medical documentation. The work by Koenecke and her team underscores the importance of refining AI to truly understand human speech in all its varied forms, avoiding the pitfalls of creating something harmful from nothing.
The findings of this research can be accessed here.