UtterTune

Tune your TTS's phoneme-level pronunciation & prosody, even when it lacks a G2P frontend

LoRA modules that refine phoneme-level pronunciation and prosody on top of LLM-based TTS with no explicit G2P frontend. Lightweight, easy to drop in, research-friendly.

We currently support Japanese on CosyVoice 2 only.

Why UtterTune?

đŸŽ›ïž

Edit & Control Phoneme-Level Pronunciation & Prosody

Add controllability to LLM-based TTS with no explicit G2P frontend. Objective and subjective evaluations confirm its effectiveness: accent correctness improves from 0.472 (baseline) to 0.975 (UtterTune) while preserving naturalness and speaker similarity.

🎯

No Interference on Non-Target Languages

UtterTune tunes only the target language. You don't need to attach UtterTune for other languages, so it never degrades their performance (see the sketch after this feature list).

⚙

Drop-in LoRA

Attach to the LLM backbone via PEFT, preserving the base model's knowledge as much as possible (sketched after this feature list).

🚀

Lightweight

Tiny adapter weights, fast to download and update. You can train the LoRA on a single NVIDIA RTX 4090 (24 GB memory) within 1 hour.

🔒

License

The code is released under the MIT License. The currently available LoRA weights are released under CC BY-NC 4.0, reflecting the licenses of the training data.
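
To make the "Drop-in LoRA" and "No Interference" points above concrete, here is a minimal sketch of attaching an adapter with Hugging Face PEFT and bypassing it for non-target languages. The model paths, the lm_forward helper, and loading the backbone via transformers are illustrative assumptions, not the repository's actual entry points.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the TTS system's LLM backbone (placeholder path; CosyVoice 2
# normally loads its LLM through its own codebase).
base_llm = AutoModelForCausalLM.from_pretrained("path/to/cosyvoice2-llm")

# Drop-in LoRA: attach the lightweight adapter; base weights stay frozen.
model = PeftModel.from_pretrained(base_llm, "path/to/uttertune-lora-ja")

def lm_forward(inputs: dict, language: str):
    """Run the LLM, using the adapter only for the target language."""
    if language == "ja":
        return model(**inputs)        # adapter active (the default)
    with model.disable_adapter():     # bypass LoRA for other languages
        return model(**inputs)        # behaves exactly like the base model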

🎧 Audio Demos (Zero-Shot Cloning)

Baseline refers to the original CosyVoice 2. Baseline (kana input) refers to the original CosyVoice 2 with kana (phonogram) input for difficult-to-read words.

[Audio players and input text for Sentence 1, Sentence 2, and Sentence 3]

Citation

If you use UtterTune in your research, please cite the paper:

@misc{Kato2025UtterTune,
  title={UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech},
  author={Kato, Shuhei},
  year={2025},
  howpublished={arXiv:2508.09767 [cs.CL]},
}

License

The model used in this demo is released under CC BY-NC 4.0. It inherits the non-commercial restriction from the training data. Allowed: Academic research, non-commercial research (including within commercial companies), and personal use. Commercial use is prohibited.

The code used in this demo is released under the MIT License.

FAQ

Which modules should LoRA target?

We attached a PEFT adapter targeting all the Q/K/V/O projections in the LLM portion used by CosyVoice 2.
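
For reference, a minimal sketch of such a configuration with Hugging Face PEFT follows. The q_proj/k_proj/v_proj/o_proj module names assume a Hugging Face-style transformer backbone (CosyVoice 2's LLM builds on Qwen2); the model path, rank, alpha, and dropout values are illustrative, not the settings used in the paper.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_llm = AutoModelForCausalLM.from_pretrained("path/to/cosyvoice2-llm")

config = LoraConfig(
    r=16,            # illustrative rank
    lora_alpha=32,   # illustrative scaling factor
    lora_dropout=0.05,
    # all attention projections in the LLM portion:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
peft_llm = get_peft_model(base_llm, config)
peft_llm.print_trainable_parameters()  # only the LoRA weights are trainable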

Commercial use?

The currently available model is released under CC BY-NC 4.0, reflecting the licenses of the training data. You can train your own model using our code, which is released under the MIT License.

How to replicate evaluations?

We plan to provide scripts on GitHub for the objective evaluations conducted in the paper.