We know that the voices tend to degrade or start whispering during longer audio generations, and our team is working hard to develop the technology to improve this. This issue is more prominent in the experimental multilingual model.
To help mitigate these problems, we recommend breaking down the text into shorter sections, preferably below 800 characters, as this can help maintain better quality. Additionally, if you use English voices, it is advisable to stick with the monolingual model for now, as it tends to exhibit more stability.
You can read more about it here: https://docs.elevenlabs.io/
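The chunking recommendation above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names `split_text` and `_split_words` are hypothetical, the 800-character limit is the guideline from this article rather than an API constraint, and sentence detection is a simple punctuation-based heuristic.

```python
import re

MAX_CHARS = 800  # guideline from this article, not an enforced API limit


def _split_words(sentence: str, max_chars: int) -> list[str]:
    """Break one over-long sentence at word boundaries (hypothetical helper)."""
    parts, current = [], ""
    for word in sentence.split():
        # A single word longer than the limit is sliced as a last resort.
        while len(word) > max_chars:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def split_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split on whitespace that follows '.', '!' or '?'; punctuation stays attached.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be generated separately and the resulting audio files joined afterwards. Splitting at sentence ends rather than at a fixed character count is a design choice intended to keep the pauses between chunks sounding natural.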
There are a few other factors that could contribute to these issues, and we'd like to highlight some of the key ones:
How long is the text chunk?
The voices tend to degrade as a generation gets longer. The experimental multilingual model tends to degrade more quickly than the monolingual model. The team is currently working hard on finding solutions to these problems.
Monolingual or Multilingual?
Monolingual is more stable but only officially supports English.
Multilingual is still experimental and has a few extra quirks that are being worked on.
Pre-made, voice-designed, or cloned voices?
Some of the pre-made voices have a tendency to start whispering during longer generations.
Similar problems have also been observed in the voice-designed voices, but this depends on the voice itself.
If you're using cloned voices, the samples' quality is very important to the final output. Noise and other artefacts tend to be amplified during long generations.
What settings are you using?
Both stability and similarity can change how the voice acts and how prominent the artefacts are. Hovering over the little "!" next to each side of the sliders will reveal more information.
Running a low-cut filter on the audio might help with a problem some users face, where the audio gets muffled and starts turning into a whispering robot voice. This can happen because the AI sometimes hears low-frequency rumble, thinks it is part of the voice, and then tries to replicate it.
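To make the low-cut idea concrete, here is a minimal sketch of a first-order high-pass (low-cut) filter in plain Python. It assumes the audio has already been decoded into a list of float samples; the function name `low_cut` and the 80 Hz cutoff are our own illustrative choices, not recommended settings. In practice, the low-cut or high-pass filter in any audio editor does the same job with a steeper slope.

```python
import math


def low_cut(samples: list[float], sample_rate: int,
            cutoff_hz: float = 80.0) -> list[float]:
    """First-order high-pass (low-cut) filter; 80 Hz is an assumed default."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # RC time constant for the cutoff
    dt = 1.0 / sample_rate                  # time between two samples
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Pass changes between samples, let slowly varying content decay away.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out
```

A first-order filter rolls off gently (about 6 dB per octave below the cutoff), so it reduces rumble rather than removing it outright.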
If your voice starts whispering, preprocessing the audio before cloning is also worth trying: remove artefacts, noise, and reverb from the audio, and reduce its dynamic range using a compressor or even a limiter.
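For readers unfamiliar with what a compressor does to dynamic range, the following is a rough sketch of a hard-knee, feed-forward compressor in plain Python. It assumes float samples in the range -1.0 to 1.0; the function name `compress` and every default (threshold, ratio, attack, release) are illustrative assumptions rather than recommended values. A compressor plugin in an audio editor is the more practical tool for real samples.

```python
import math


def compress(samples: list[float], sample_rate: int,
             threshold_db: float = -18.0, ratio: float = 4.0,
             attack_ms: float = 5.0, release_ms: float = 100.0) -> list[float]:
    """Reduce gain whenever the tracked level rises above the threshold."""
    # One-pole smoothing coefficients for the level (envelope) follower.
    attack = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    envelope = 0.0
    out = []
    for x in samples:
        level = abs(x)
        # Follow rising levels quickly (attack) and falling levels slowly (release).
        coeff = attack if level > envelope else release
        envelope = coeff * envelope + (1.0 - coeff) * level
        level_db = 20.0 * math.log10(max(envelope, 1e-9))
        over_db = level_db - threshold_db
        # Above the threshold, keep only 1/ratio of the overshoot.
        gain_db = -over_db * (1.0 - 1.0 / ratio) if over_db > 0 else 0.0
        out.append(x * 10.0 ** (gain_db / 20.0))
    return out
```

Loud passages are turned down while quiet passages below the threshold pass through unchanged, which narrows the gap between the two. No make-up gain is applied in this sketch, so the overall level drops.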
We acknowledge that these solutions are imperfect and may not address all issues fully. However, they can be beneficial in a lot of situations. Our team is committed to continuously improving the technology and providing the best user experience possible, and they are currently working hard on updates that will hopefully help address this issue, if not fully solve it. We are hoping to start rolling out the first of many updates shortly. However, at the moment, we do not have a specific timeframe or any additional information available.