Echo-TTS

Jordan Darefsky, 2025. See technical details here

License Notice: All audio outputs are restricted to non-commercial use because the Fish Speech S1-DAC autoencoder is licensed under CC-BY-NC-SA-4.0.

Audio output is watermarked with silentcipher using the message [91, 57, 81, 60, 83].

Simple Mode (Recommended for Beginners)

  1. Pick or upload a voice - Choose from the voicebank or upload your own audio (up to 2 minutes)
  2. Choose a text prompt preset or enter your own prompt - What you want the voice to say (the presets are a good guide for format/style)
  3. Select a Sampling preset - The default preset "Independent (High Speaker CFG)" is usually a good starting point
  4. Click Generate Audio - Wait for the model to generate your audio

💡 Tip: If the generated voice doesn't match the reference speaker at all, enable "Speaker KV Attention Scaling" and click Generate Audio again.

Advanced Mode

Switch to Advanced mode for full control over all generation parameters including CFG scales, sampling steps, truncation, and more.

Other tips

High CFG settings are recommended but may lead to oversaturation; APG might help with this. Flat settings tend to reduce "impulse" artifacts but might result in worse (blunted, compressed, or artifact-prone) generation of laughter, breathing, etc.
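To make the CFG/APG trade-off above more concrete, here is a minimal sketch of the two guidance rules on plain Python lists. It assumes that "APG" refers to adaptive projected guidance and shows only its core projection step (the momentum and norm-rescaling parts of that method are omitted). The function names and the `eta` parameter are illustrative assumptions; this is not Echo's actual sampler code.

```python
def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg(cond, uncond, scale, eta=0.0):
    """Projection step of adaptive projected guidance (simplified sketch).

    The guidance direction (cond - uncond) is split into a component parallel
    to the conditional prediction and a component orthogonal to it. The
    parallel part, which is associated with oversaturation, is scaled by eta.
    With eta = 1.0 this reduces exactly to standard CFG.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    denom = dot(cond, cond)
    coef = dot(diff, cond) / denom if denom > 0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


if __name__ == "__main__":
    cond, uncond = [1.0, 0.0], [0.5, 0.5]
    print("CFG:        ", cfg(cond, uncond, scale=3.0))
    print("APG eta=1.0:", apg(cond, uncond, scale=3.0, eta=1.0))
    print("APG eta=0.0:", apg(cond, uncond, scale=3.0, eta=0.0))
```

In this toy example, lowering `eta` keeps the guided output from growing along the direction of the conditional prediction, which is the intuition behind using APG when high CFG scales oversaturate.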

Echo tries to fit the entire text prompt into at most 30 seconds of audio. If your prompt is very long, the generated speech may be too fast (this is not an issue for shorter text prompts). For disfluent, single-speaker speech, we recommend starting from the reference text that begins with "[S1] ... explore how we can design".
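As a rough way to judge whether a prompt is too long for the 30-second limit, the sketch below estimates the minimum speaking rate a prompt would require. The 2.5 words-per-second threshold (about 150 words per minute) is an assumed figure for conversational speech, not a value taken from Echo, and the helper names are hypothetical.

```python
MAX_AUDIO_SECONDS = 30.0        # from the text above: at most 30 seconds of audio
ASSUMED_WORDS_PER_SECOND = 2.5  # assumption: roughly 150 words per minute


def count_spoken_words(prompt: str) -> int:
    """Count whitespace-separated tokens, skipping bracketed tags such as [S1]."""
    return sum(
        1
        for token in prompt.split()
        if not (token.startswith("[") and token.endswith("]"))
    )


def required_rate(prompt: str, seconds: float = MAX_AUDIO_SECONDS) -> float:
    """Minimum words per second needed to speak the whole prompt in `seconds`."""
    return count_spoken_words(prompt) / seconds


def is_probably_rushed(prompt: str, threshold: float = ASSUMED_WORDS_PER_SECOND) -> bool:
    """True if the prompt would need a faster pace than the assumed threshold."""
    return required_rate(prompt) > threshold


if __name__ == "__main__":
    prompt = "[S1] " + " ".join(["word"] * 150)
    print(f"words: {count_spoken_words(prompt)}")
    print(f"required rate: {required_rate(prompt):.2f} words/s")
    print(f"probably rushed: {is_probably_rushed(prompt)}")
```

If a prompt is flagged, splitting it into several shorter prompts and generating each separately is one way to keep the pace natural.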

Voice Selection

Select Dataset

Choose which voicebank to use

Audio Library (favorite examples from voicebank datasets)

Click to select (or upload your own audio file directly on the right)


Text Prompt


Generation

Sampler Preset

Load preset configurations

Speaker KV Attention Scaling: enable if the generated voice does not match the reference voice (otherwise leave off)
