LoRA modules that refine phoneme-level pronunciation and prosody on top of LLM-based TTS with no explicit G2P frontend. Lightweight, easy to drop in, research-friendly.
We currently support Japanese on CosyVoice 2 only.
Add controllability to LLM-based TTS with no explicit G2P frontend. Objective and subjective evaluations confirm its effectiveness: accent correctness improves from 0.472 (baseline) to 0.975 (UtterTune) while preserving naturalness and speaker similarity.
UtterTune tunes only the target language. You don't need to attach UtterTune for other languages, so it never degrades their performance.
Attaches to the LLM backbone via PEFT, preserving the base model's knowledge as much as possible (see the sketch below).
Tiny adapter weights, fast to download and update. You can train the LoRA on a single NVIDIA RTX 4090 (24 GB memory) in under an hour.
The code is released under MIT License. The currently available LoRA weights are released under CC BY-NC 4.0 considering the licenses of the training data.
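As a rough illustration of the drop-in workflow, the sketch below attaches a standard PEFT LoRA checkpoint to a Hugging Face causal LM and bypasses it for non-target-language input. The paths are hypothetical and the actual CosyVoice 2 integration may differ; see the repository for the supported entry points.

```python
# Minimal sketch: attach a LoRA adapter to the TTS LLM backbone with PEFT.
# The paths and the assumption that the backbone is exposed as a Hugging Face
# causal LM are hypothetical; the real CosyVoice 2 wiring may differ.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_llm = AutoModelForCausalLM.from_pretrained("path/to/cosyvoice2-llm")
llm = PeftModel.from_pretrained(base_llm, "path/to/uttertune-lora-ja")

# The adapter is only needed for the target language (Japanese here);
# for other languages it can simply be disabled.
with llm.disable_adapter():
    pass  # run non-Japanese synthesis through the unmodified backbone
```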
Baseline refers to the original CosyVoice 2. Baseline (kana input) refers to the original CosyVoice 2 with kana (phonogram) input for difficult-to-read words.
If you use UtterTune in your research, please cite the paper
@misc{Kato2025UtterTune,
  title={UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech},
  author={Kato, Shuhei},
  year={2025},
  howpublished={arXiv:2508.09767 [cs.CL]},
}
The model used in this demo is released under CC BY-NC 4.0. It inherits the non-commercial restriction from the training data. Allowed: Academic research, non-commercial research (including within commercial companies), and personal use. Commercial use is prohibited.
The code used in this demo is released under MIT License.
We attach a PEFT adapter targeting all of the Q/K/V/O projections in the LLM portion of CosyVoice 2.
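For reference, a LoRA configuration of this shape can be written with the peft library as below. The rank, alpha, and dropout values are placeholders (the paper's exact hyperparameters may differ), and the module names assume a Qwen2-style attention block.

```python
# Sketch of a LoRA config covering the Q/K/V/O projections of the LLM.
# r / lora_alpha / lora_dropout are placeholder values, not the paper's.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen2-style names
    task_type="CAUSAL_LM",
)
peft_llm = get_peft_model(base_llm, lora_config)  # base_llm: the CosyVoice 2 LLM backbone
peft_llm.print_trainable_parameters()
```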
The currently available model is released under CC BY-NC 4.0, considering the licenses of the training data. You can train your own model using our code, which is released under the MIT License.
We plan to provide scripts on GitHub for the objective evaluations conducted in the paper.