F5-TTS: Vietnamese Text-to-Speech Synthesis
The model was trained for 500,000 steps on approximately 150 hours of data using an RTX 3090 GPU.
Enter text and upload a sample voice to generate natural speech.
Sample Voice
Text
Speed (range: 0.3 to 2)
Generate Voice
Generated Audio
Spectrogram
Model Limitations
1. The model may not perform well with numerals, dates, special characters, and similar input, so a text normalization module is needed.
2. The rhythm of some generated audio may be inconsistent or choppy. For better synthesis quality, choose sample audio that is clearly pronounced and has minimal pauses.
3. By default, the reference audio is transcribed with the whisper-large-v3-turbo model, which may not always recognize Vietnamese accurately, resulting in poor voice synthesis quality.
4. The checkpoint was stopped at step 500,000 and trained on 150 hours of public data, so voice cloning for non-native voices may not be perfectly accurate.
5. Inference on overly long paragraphs may produce poor results.
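One way to work around the long-paragraph limitation is to split the input into shorter, sentence-aligned pieces and synthesize each piece separately. The sketch below is only an illustration and is not part of this demo: the function name `split_into_chunks` and the 200-character default are assumptions, and the right chunk size for this model is not stated here.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of roughly max_chars characters.

    Hypothetical helper: max_chars=200 is an assumed default, not a
    value documented for this model.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole.
    return chunks


if __name__ == "__main__":
    sample = "One two. Three four five. Six seven eight nine!"
    for chunk in split_into_chunks(sample, max_chars=30):
        print(chunk)
```

Each chunk can then be entered as its own text input, and the resulting audio clips joined afterwards.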