I was just notified that our E2 TTS paper received the Best Paper Award at IEEE #SLT2024! Many thanks to all the remarkable collaborators who made this happen!
Paper: arxiv.org/abs/2406.18009
Demo: aka.ms/e2tts
Ah, no, TS3-Codec was trained with 10-second audio segments, while BigCodec-S was trained with 2.5-second audio segments (Section 4.5). This was a somewhat tricky (and perhaps debatable) part of the configuration, and we did our best to tune the hyperparameters within the constraints of GPU memory.
Thanks! To the extent that we checked, yes. The important point is limiting the attention window.
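To make "limiting the attention window" concrete, here is a minimal sketch of a banded (windowed) self-attention mask, where each frame may only attend to neighbors within a fixed radius. This is an illustrative reconstruction of the general idea, not code from the paper; the function name and window size are made up.

```python
def windowed_attention_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True iff position i may attend to position j.

    Restricting |i - j| <= window keeps attention cost linear in
    sequence length instead of quadratic, which is what makes long
    audio segments tractable. (The window value here is illustrative.)
    """
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

# toy usage: 6 frames, each attending one frame to either side
mask = windowed_attention_mask(6, 1)
```

In a real model this mask would be applied to the attention scores (disallowed positions set to negative infinity before the softmax) rather than materialized as a Python list.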
TS3-Codec: yet another audio codec from my former team—simple, fast, and high-quality.
Simple—just a stack of Transformer and linear layers; no convolutions.
Faster and better—superior audio reconstruction quality with fewer MACs compared to strong convolution-based baselines.
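A convolution-free front end of the kind described above can be sketched as plain framing plus a linear projection: slice the waveform into fixed-size, non-overlapping frames, then project each frame to the model dimension before the Transformer stack. This is a hedged illustration of the general pattern, not the paper's implementation; the patch size (320 samples, i.e. 20 ms at 16 kHz) and function names are assumptions.

```python
def patchify(wave: list[float], patch_size: int) -> list[list[float]]:
    """Slice a waveform into non-overlapping fixed-size frames.

    In a convolution-free codec, this framing plus a linear projection
    can stand in for the strided-conv downsampling front end used by
    convolution-based codecs. (patch_size is an illustrative choice.)
    """
    n = len(wave) // patch_size  # drop any trailing partial frame
    return [wave[i * patch_size:(i + 1) * patch_size] for i in range(n)]

def linear_embed(frame: list[float], weights: list[list[float]]) -> list[float]:
    """Project one frame to the model dimension: an ordinary linear layer.

    `weights` is a list of d_model rows, each of length patch_size.
    """
    return [sum(x * w for x, w in zip(frame, row)) for row in weights]

# toy usage: 1 second of 16 kHz audio -> 50 frame tokens of 320 samples each
wave = [0.0] * 16000
frames = patchify(wave, 320)
```

The resulting frame tokens would then pass through the stack of Transformer and linear layers; no convolution appears anywhere in the pipeline.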
Our GenAI-Speech team at Meta is hiring RS interns for summer 2025 to work on speech, LLMs, dialog generation, and other exciting stuff! Check out the job posting here: www.metacareers.com/jobs/3841154...