Generate Speech with Dia

Use [S1]/[S2] tags for speakers and (laughs), (sighs), etc. for non-verbals. For voice cloning, prepend the reference transcript to your text.

Maximum input length: 8192 characters. Default chunk size: 120.

Splitting is automatically disabled if the text length is less than 2× the Chunk Size. Recommended size: ~100-300 characters (default: 120). Use Predefined Voices or Voice Cloning mode to keep voices consistent across chunks.
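The splitting rule above can be sketched as follows. This is a minimal illustration of the described behavior, not the server's actual implementation, and the function name is hypothetical:

```python
def should_split(text: str, chunk_size: int = 120) -> bool:
    # Splitting is disabled when the text is shorter than twice the chunk size.
    return len(text) >= 2 * chunk_size

should_split("A short line.")          # well under 240 chars, so no splitting
should_split("[S1] " + "hello " * 60)  # long dialogue, split into chunks
```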

Generation Parameters

Enter an integer (e.g. 1, 42, 901) for reproducible results, or -1 for a random seed.

Server Configuration

These settings are saved to config.yaml. Restart the server to apply changes to the Server, Model, or Paths sections.
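As a rough sketch, config.yaml might be structured like this. The section names (Server, Model, Paths) come from the UI above, but every individual key shown is an assumption; check your generated config.yaml for the real schema:

```yaml
# Hypothetical structure only: the key names below are assumptions, not the
# server's actual schema.
server:
  host: 0.0.0.0   # assumed
  port: 8003      # assumed
model:            # model/weights selection settings live here
paths:            # output and reference-audio directories live here
```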

Tips & Tricks for Dia

  • Use **Predefined Voices** for consistent, high-quality output based on provided samples.
  • For **Voice Clone**, upload clean reference audio (.wav/.mp3). Crucially, save the exact transcript of the reference audio in a .txt file with the same name as the audio file. Format the transcript as [S1] First speaker [S2] Second speaker, or [S1] First speaker if the reference audio has only one speaker.
  • Use **Random / Dialogue** for multi-speaker text ([S1]/[S2]) or single-speaker generation without cloning.
  • Experiment with **CFG Scale** (higher = more adherence) and **Temperature** (higher = more varied).
  • Set **Generation Seed** to a fixed integer (e.g. 1, 42, 901) for reproducible results.
  • Enable **Split text** for long inputs (> ~200-300 chars). Note: Using Random/Dialogue mode with splitting and a random seed (-1) may result in different voices per chunk. Use Predefined/Clone or a fixed seed for consistency across chunks.
  • Use the /v1/audio/speech endpoint for OpenAI compatibility.
  • Use the custom /tts endpoint for maximum flexibility: it exposes all Dia generation parameters and accepts reference audio and transcript information.
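As a concrete illustration, a request to the OpenAI-compatible endpoint might look like the sketch below. The host/port, voice name, and the /tts field names are assumptions inferred from the UI labels above, not a confirmed schema:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8003"  # assumed host/port; adjust to your server

def build_speech_request(text: str, voice: str = "S1") -> request.Request:
    # Payload shape follows the OpenAI audio/speech API that this endpoint mimics.
    payload = {"input": text, "voice": voice, "response_format": "wav"}
    return request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("[S1] Hello there. [S2] Hi! (laughs)")
# urllib.request.urlopen(req) returns the audio bytes once the server is running.

# The custom /tts endpoint takes a fuller payload; these field names are
# assumptions based on the UI controls above, not a documented schema.
tts_payload = {
    "text": "[S1] Hello there.",
    "voice_mode": "predefined",
    "cfg_scale": 3.0,
    "temperature": 1.2,
    "seed": 42,
    "split_text": True,
    "chunk_size": 120,
}
```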