AI is already generating a ton of audio for podcasters and content creators. With VibeVoice, they can now use it for long, multi-speaker conversations too. It relies on continuous speech tokenizers operating “at an ultra-low frame rate of 7.5 Hz” to preserve audio fidelity while keeping long sequences manageable. It can synthesize up to 90 minutes of speech with as many as 4 distinct speakers.
VibeVoice-1.5B has a context length of 64K tokens, which is what lets it generate audio up to about 90 minutes long. The developers also note that the model may spontaneously produce background music or other sounds.
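To see why a 7.5 Hz frame rate and a 64K context fit together, here is a quick back-of-the-envelope check. It is only a sketch: it assumes one context token per tokenizer frame and that "64K" means 65,536 tokens, neither of which is spelled out above, and the helper name `speech_frames` is made up for illustration.

```python
FRAME_RATE_HZ = 7.5          # tokenizer frame rate quoted above
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


frames = speech_frames(90)            # frames for a 90-minute session
remaining = CONTEXT_TOKENS - frames   # budget left for the script and other tokens
print(frames, remaining)              # prints "40500 25036"
```

Under those assumptions, 90 minutes of audio takes roughly 40,500 frames, which leaves about 25,000 tokens of the context for the text script and anything else the model has to keep track of.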
Microsoft just dropped VibeVoice (open-source)
This AI turn text into a 90-min, up to 4-voice podcast.
With natural pauses, emotion, even singing.
6 wild examples + code:
1. Spontaneous singing pic.twitter.com/Q0MhlnMH8M
— Min Choi (@minchoi) August 27, 2025