Researchers at Meta AI have made a breakthrough in generative AI for speech with Voicebox, which they describe as the first speech generation model able to generalize across tasks, carrying out tasks it was not specifically trained to accomplish.
Voicebox is based on the Flow Matching method and can create and modify speech in a wide range of styles. Its capabilities include synthesizing speech in six languages, removing noise, editing content, converting styles, and generating diverse samples. The model has shown state-of-the-art performance, surpassing models such as VALL-E and YourTTS in intelligibility, audio similarity, and speed.
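To make the underlying method concrete, here is a minimal sketch of the conditional flow-matching regression objective (Lipman et al., 2022) that Voicebox builds on. The model signature, the conditioning tensor, and the hyperparameters below are illustrative assumptions, not Voicebox's actual implementation:

```python
import torch

def flow_matching_loss(model, x1, cond, sigma_min=1e-5):
    """Conditional flow-matching loss: regress a velocity field that
    transports Gaussian noise to the data along a straight-line path.

    x1:   target features, e.g. mel-spectrogram frames, shape (B, T, D)
    cond: conditioning inputs (e.g. text and audio context); the exact
          contents are an assumption here, not the paper's spec
    """
    x0 = torch.randn_like(x1)                  # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # time in [0, 1]
    # Point on the probability path interpolating noise and data.
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Target velocity along that path (constant in t for this path).
    u = x1 - (1 - sigma_min) * x0
    v = model(xt, t.view(-1), cond)            # model predicts the velocity
    return ((v - u) ** 2).mean()
```

At inference time, speech is generated by integrating the learned velocity field from a noise sample with an ODE solver, which is one source of the speed advantage over autoregressive models such as VALL-E.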
The training process for Voicebox involved over 50,000 hours of recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese. On this data, Voicebox was trained on a text-guided speech infilling task: given the surrounding speech and the transcript of a segment, it learns to predict the segment itself. This single objective is what makes the model versatile across speech generation tasks, as sketched below.
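As a rough illustration of that infilling setup, the snippet below constructs one masked training example from a spectrogram. The masking scheme, feature representation, and span fraction are assumptions made for illustration, not the paper's exact recipe:

```python
import torch

def make_infilling_example(mel, mask_frac=0.7):
    """Hide a contiguous span of frames; the model must reconstruct it.

    mel: (T, D) mel-spectrogram of one utterance. In Voicebox-style
    training, the transcript is also provided as conditioning.
    """
    T = mel.shape[0]
    span = max(1, int(T * mask_frac))
    start = torch.randint(0, T - span + 1, (1,)).item()
    mask = torch.zeros(T, dtype=torch.bool)
    mask[start:start + span] = True          # frames the model must predict
    context = mel.clone()
    context[mask] = 0.0                      # zero out the hidden span
    return context, mask                     # training target: mel[mask]
```

Because the masked span can sit anywhere in the utterance, and the transcript guides what should fill it, the same trained model can later be pointed at tasks like correcting a misspoken word or replacing a noisy segment.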
Voicebox has potential applications in a variety of areas. For example, it could bring speech to people who are unable to speak, allow individuals to customize the voices used by virtual assistants, facilitate communication across language barriers, assist in editing noisy or misspoken segments within audio recordings, and generate synthetic data for training other speech assistant models.
However, due to potential risks of misuse, Meta AI has decided not to make the Voicebox model or its code publicly available. The researchers are instead sharing audio samples, a research paper detailing their approach, and information about a classifier that can distinguish authentic speech from audio generated by Voicebox.
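Meta has not published that classifier's design, but as a loose sketch of what a detector of this kind can look like, here is a small binary classifier over mel-spectrogram inputs. Every architectural choice below is a hypothetical stand-in, not Meta's model:

```python
import torch
import torch.nn as nn

class SyntheticSpeechDetector(nn.Module):
    """Illustrative detector scoring audio as authentic vs. generated."""

    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time: handles variable length
            nn.Flatten(),
            nn.Linear(hidden, 1),     # single logit: likelihood of "generated"
        )

    def forward(self, mel):           # mel: (B, n_mels, T)
        return self.net(mel).squeeze(-1)
```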
The team at Meta AI believes Voicebox is an important step forward in generative AI research, potentially signaling a new era in this field. They have also emphasized the importance of responsibly sharing generative AI research and are hopeful about how other researchers will build upon their work.