What makes the AI so believable?
This capability has been made possible by two key features. The first, “Repetition Aware Sampling”, helps VALL-E 2 avoid the pitfalls of monotonous speech by addressing repetitions of "tokens", the small units of language such as words or syllables. It prevents the AI from getting stuck in a loop of sounds, so its speech flows more naturally. The second, “Grouped Code Modeling”, improves efficiency by reducing the sequence length, so the model processes fewer individual tokens in a single input sequence. This not only speeds up speech generation but also eases the challenge of processing lengthy strings of sounds. According to the researchers, VALL-E 2 is the first voice AI to reach human parity in speech robustness, naturalness, and speaker similarity.
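For readers who want a concrete picture, the two ideas can be sketched in a few lines of Python. This is only an illustrative toy, not Microsoft's implementation: the function names, the `top_p`, `window` and `threshold` values, and the rule of falling back to plain random sampling when a token repeats too often are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token index from the smallest set of most likely tokens
    whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value, for illustration only
    window: int = 10,        # assumed value, for illustration only
    threshold: float = 0.5,  # assumed value, for illustration only
) -> int:
    """Pick the next token, but re-draw from the full distribution if the
    first pick already dominates the recent history (a sign of looping)."""
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Assumed fallback: sample from all tokens, not just the nucleus.
            token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int) -> List[Tuple[int, ...]]:
    """Bundle consecutive tokens into groups so the model steps through a
    shorter sequence. The last group may be shorter than group_size."""
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]          # token 0 is overwhelmingly likely
    stuck_history = [0] * 10           # the model has been repeating token 0
    print(repetition_aware_sample(probs, stuck_history, rng))
    print(group_codes(list(range(8)), 2))
```

In this toy setup, grouping eight tokens in pairs leaves the model with four steps instead of eight, and the sampler only departs from its usual choice when one token has taken over the recent output.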
Fears of misuse
While the potential applications of VALL-E 2 are vast, ranging from educational tools and entertainment to accessibility features and interactive voice response systems, Microsoft has opted to keep the technology under wraps. The decision is driven by concerns over the potential misuse of such advanced technology. "VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," the researchers stated. This cautious approach aligns with similar restrictions placed by other AI companies, such as OpenAI.
Despite the decision to withhold VALL-E 2 from public release, Microsoft’s researchers remain optimistic about the future of AI speech technology. They envision practical applications where synthesised speech maintains speaker identity and can be used safely and ethically. Any future deployment of such technology, they emphasise, must include both a protocol to ensure that the speaker approves the use of their voice and a robust model for detecting synthesised speech.
The findings of this research have been published in a pre-print paper.