Microsoft has created an AI that it thinks is "too dangerous" for public release
Jul 18, 2024, 10:54 IST
In a world where technological advancements are often heralded with great fanfare and widespread availability, Microsoft has taken an unusually cautious step. The tech giant has developed an artificial intelligence (AI) speech generator so convincing and advanced that it has decided to withhold it from public release.
VALL-E 2 is an AI marvel capable of mimicking human speech with uncanny accuracy, using just a few seconds of audio. Representing a significant leap in text-to-speech (TTS) technology, Microsoft’s researchers boast that it achieves "human parity" in generating speech — meaning its output is virtually indistinguishable from a human’s voice.
The findings of this research have been published in a pre-print paper.
What makes the AI so believable?
This extraordinary capability has been made possible through a couple of groundbreaking features. The first is “Repetition Aware Sampling”, which ensures that VALL-E 2 avoids the pitfalls of monotonous speech by addressing repetitions of "tokens" — the small units of language like words or syllables. This feature prevents the AI from getting stuck in a loop of sounds, making its speech flow more naturally.

Secondly, “Grouped Code Modeling” enhances efficiency by reducing the sequence length, allowing the model to process fewer individual tokens in a single input sequence. This improvement not only speeds up speech generation but also tackles the challenges of processing lengthy strings of sounds. As per the researchers, VALL-E 2 is the first voice AI to reach human parity in speech robustness, naturalness, and speaker similarity.
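Microsoft's paper describes these techniques only at a high level, but the general ideas can be illustrated with a short, hypothetical sketch. The Python snippet below is a simplified assumption of how such mechanisms might work (the function names, window size and threshold are illustrative, not Microsoft's code): the sampler falls back to probability-weighted random sampling when the most likely token has been repeating too often, and codec tokens are packed into fixed-size groups so the model handles a shorter sequence.

```python
import random

def repetition_aware_sample(candidate_probs, history, window=10, threshold=0.3):
    """Illustrative sketch (hypothetical helper, not Microsoft's implementation):
    pick the most likely next token, but if that token already dominates the
    recent decoding history, fall back to weighted random sampling to break the loop."""
    # Greedy choice: the token with the highest probability.
    top_token = max(candidate_probs, key=candidate_probs.get)

    # How often has this token appeared in the recent decoding window?
    recent = history[-window:]
    repeat_ratio = recent.count(top_token) / max(len(recent), 1)

    if repeat_ratio > threshold:
        # Too repetitive: sample randomly, weighted by probability,
        # so the output does not get stuck on one sound.
        tokens, probs = zip(*candidate_probs.items())
        return random.choices(tokens, weights=probs, k=1)[0]
    return top_token

def group_codes(codes, group_size=2):
    """Illustrative sketch of the grouped-code idea (hypothetical helper):
    pack consecutive codec tokens into fixed-size groups so each step
    covers more audio and the sequence the model processes is shorter."""
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]

# Hypothetical usage with made-up acoustic "tokens".
history = ["ah", "ah", "ah", "ah", "ah", "ah", "ah", "ah"]
probs = {"ah": 0.6, "oh": 0.3, "ee": 0.1}
print(repetition_aware_sample(probs, history))      # likely not "ah"
print(group_codes([12, 7, 33, 40, 5, 19], 2))       # [(12, 7), (33, 40), (5, 19)]
```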
Fears of misuse
While the potential applications of VALL-E 2 are vast — ranging from educational tools and entertainment to accessibility features and interactive voice response systems — Microsoft has opted to keep this technological marvel under wraps. The decision is driven by concerns over the potential misuse of such advanced voice cloning technology. The risks include the ability to spoof voice identification systems and impersonate individuals convincingly.

"VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," the researchers stated. This cautious approach aligns with similar restrictions placed by other AI companies, such as OpenAI, on their voice technology.
Despite the decision to withhold VALL-E 2 from public release, Microsoft’s researchers remain optimistic about the future of AI speech technology. They envision practical applications where synthesised speech maintains speaker identity and can be used safely and ethically. Any future deployment of such technology, they emphasise, must include protocols to ensure that the speaker approves the use of their voice and a robust synthesised speech detection model.