Update Bert-VITS2 v2.3 (#121)
* Add download link

* Update Bert-VITS2 v2.3
Artrajz authored Jan 2, 2024
1 parent 9b5697d commit 89ade6e
Showing 16 changed files with 1,706 additions and 82 deletions.
26 changes: 19 additions & 7 deletions README.md
@@ -194,6 +194,20 @@ pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/w
## Linux
The installation process is similar, but I don't have the environment to test it.

# WebUI

## Inference Frontend

http://127.0.0.1:23456

*The address assumes the default port 23456; the port can be changed in the configuration.

## Admin Backend

The default address is http://127.0.0.1:23456/admin.

The initial username and password can be found at the bottom of the config.yml file after the first startup.
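The exact layout depends on your version, but the credential section at the bottom of `config.yml` typically looks something like the sketch below (the key names here are an illustrative assumption, not the project's documented schema):

```yaml
# Hypothetical sketch of the credential section at the bottom of config.yml;
# verify the actual key names in the file generated on first startup.
admin:
  username: admin
  password: "s3cr3t"   # randomly generated on first startup
```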

# Function Options Explanation

## Disable the Admin Backend
@@ -262,11 +276,6 @@ To ensure compatibility with the Bert-VITS2 model, modify the config.json file b
...
```

# Admin Backend
The default address is http://127.0.0.1:23456/admin.

The initial username and password can be found at the bottom of the config.yml file after the first startup.

# API

## GET
@@ -372,8 +381,11 @@ After enabling it, you need to add the `api_key` parameter in GET requests and a
| SDP noise | noisew | false | From `config.yml` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From `config.yml` | int | Divide the text into paragraphs based on punctuation marks, and combine them into one paragraph when the length exceeds segment_size. If segment_size<=0, the text will not be divided into paragraphs. |
| SDP/DP mix ratio | sdp_ratio | false | From `config.yml` | float | The theoretical proportion of SDP during synthesis; the higher the ratio, the larger the variance in synthesized voice tone. |
| Emotion | emotion | false | None | | Available for Bert-VITS2 v2.1, ranging from 0 to 9 |
| Reference Audio | reference_audio | false | None | | Available for Bert-VITS2 v2.1 |
| Emotion | emotion | false | None | int | Available for Bert-VITS2 v2.1, ranging from 0 to 9 |
| Emotion Reference Audio | reference_audio | false | None | | Bert-VITS2 v2.1 uses reference audio to control the synthesized audio's emotion |
| Text Prompt | text_prompt | false | None | str | Bert-VITS2 v2.2 text prompt used for emotion control |
| Style Text | style_text | false | None | str | Bert-VITS2 v2.3 text prompt used for emotion control |
| Style Text Weight | style_weight | false | From `config.yml` | float | Bert-VITS2 v2.3 weight applied to the style text |
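As a quick sanity check of the parameters above, a GET request URL can be assembled like this. The `/voice/bert-vits2` endpoint path is an assumption based on the default address; verify it against your deployment before relying on it:

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host/port to match your deployment.
API_BASE = "http://127.0.0.1:23456"

def build_bert_vits2_url(text, speaker_id=0, **extra):
    """Build a GET URL for the Bert-VITS2 synthesis endpoint.

    Parameter names follow the table above; anything not supplied
    falls back to the server-side defaults from config.yml.
    """
    params = {"text": text, "id": speaker_id}
    params.update(extra)  # e.g. lang, sdp_ratio, style_text, style_weight
    return f"{API_BASE}/voice/bert-vits2?{urlencode(params)}"

url = build_bert_vits2_url("你好", speaker_id=0, lang="zh",
                           sdp_ratio=0.4, style_text="开心", style_weight=0.7)
print(url)
```

Fetching the resulting URL (for example with `requests.get(url)`) should return the synthesized audio in the requested format.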


## SSML (Speech Synthesis Markup Language)
49 changes: 30 additions & 19 deletions README_zh.md
@@ -204,6 +204,20 @@ pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/w

The installation process is similar; you can consult installation guides online, or use the GPU version of the docker deployment script directly.

# WebUI

## Inference Frontend

http://127.0.0.1:23456

*The address assumes the default port 23456; the port can be changed in the configuration.

## Admin Backend

The default address is http://127.0.0.1:23456/admin.

The initial username and password can be found at the bottom of the config.yml file after the first startup.

# Function Options Explanation

## Disable the Admin Backend
@@ -268,12 +282,6 @@ pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
...
```
# Admin Backend
The default address is http://127.0.0.1:23456/admin.
The initial username and password can be found at the bottom of the config.yml file after the first startup.
# API
## GET
@@ -371,19 +379,22 @@ pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
## Bert-VITS2 Speech Synthesis
| Name | Parameter | Is must | Default | Type | Instruction |
| ------------- | --------------- | ------- | -------------------- | ----- | ------------------------------------------------------------ |
| Synthesized Text | text | true | | str | The text to synthesize. |
| Speaker ID | id | false | From `config.yml` | int | The speaker id. |
| Audio Format | format | false | From `config.yml` | str | Supports wav, ogg, silk, mp3, flac |
| Text Language | lang | false | From `config.yml` | str | auto is the automatic language detection mode and the default, but it can currently only detect the language of the text as a whole, not per sentence. The other options are zh and ja. |
| Audio Length / Speed | length | false | From `config.yml` | float | Adjusts the audio length, which is equivalent to adjusting the speaking speed; the larger the value, the slower the speech. |
| Noise | noise | false | From `config.yml` | float | Sample noise, controlling the randomness of synthesis. |
| SDP Noise | noisew | false | From `config.yml` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From `config.yml` | int | Splits the text at punctuation marks and merges pieces into one segment once their combined length exceeds segment_size. segment_size<=0 means no segmentation. |
| SDP/DP Mix Ratio | sdp_ratio | false | From `config.yml` | float | The proportion of SDP during synthesis; in theory, the higher the ratio, the larger the variance in the synthesized voice tone. |
| Emotion | emotion | false | None | | Available for Bert-VITS2 v2.1, ranging from 0 to 9 |
| Emotion Reference Audio | reference_audio | false | None | | Available for Bert-VITS2 v2.1 |
| Name | Parameter | Is must | Default | Type | Instruction |
| -------------- | --------------- | ------- | -------------------- | ----- | ------------------------------------------------------------ |
| Synthesized Text | text | true | | str | The text to synthesize. |
| Speaker ID | id | false | From `config.yml` | int | The speaker id. |
| Audio Format | format | false | From `config.yml` | str | Supports wav, ogg, silk, mp3, flac |
| Text Language | lang | false | From `config.yml` | str | auto is the automatic language detection mode and the default, but it can currently only detect the language of the text as a whole, not per sentence. The other options are zh and ja. |
| Audio Length / Speed | length | false | From `config.yml` | float | Adjusts the audio length, which is equivalent to adjusting the speaking speed; the larger the value, the slower the speech. |
| Noise | noise | false | From `config.yml` | float | Sample noise, controlling the randomness of synthesis. |
| SDP Noise | noisew | false | From `config.yml` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segment Size | segment_size | false | From `config.yml` | int | Splits the text at punctuation marks and merges pieces into one segment once their combined length exceeds segment_size. segment_size<=0 means no segmentation. |
| SDP/DP Mix Ratio | sdp_ratio | false | From `config.yml` | float | The proportion of SDP during synthesis; in theory, the higher the ratio, the larger the variance in the synthesized voice tone. |
| Emotion | emotion | false | None | int | Available for Bert-VITS2 v2.1, ranging from 0 to 9 |
| Emotion Reference Audio | reference_audio | false | None | | Bert-VITS2 v2.1 uses reference audio to control the synthesized audio's emotion |
| Text Prompt | text_prompt | false | None | str | Bert-VITS2 v2.2 text prompt used for emotion control |
| Style Text | style_text | false | None | str | Bert-VITS2 v2.3 text prompt used for emotion control |
| Style Text Weight | style_weight | false | From `config.yml` | float | Bert-VITS2 v2.3 weight applied to the style text |
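The `segment_size` behavior described in the table can be sketched as follows. The exact punctuation set and merge rule are assumptions for illustration; the server's implementation may differ in detail:

```python
import re

def segment_text(text, segment_size):
    """Split text at sentence-ending punctuation, then greedily merge
    pieces into segments until a segment's length exceeds segment_size.
    segment_size <= 0 disables segmentation entirely."""
    if segment_size <= 0:
        return [text]
    # Keep each punctuation mark attached to the piece that precedes it.
    pieces = [p for p in re.split(r"(?<=[。!?.!?])", text) if p]
    segments, current = [], ""
    for piece in pieces:
        current += piece
        if len(current) > segment_size:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments

print(segment_text("你好。今天天气不错。我们去散步吧。", 5))
```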
## SSML (Speech Synthesis Markup Language)
Currently supported elements and attributes:
3 changes: 2 additions & 1 deletion TTSManager.py
@@ -374,7 +374,8 @@ def bert_vits2_infer(self, state, encode=True):
for sentence in sentences:
audio = model.infer(sentence, state["id"], lang, state["sdp_ratio"], state["noise"],
                    state["noisew"], length, emotion=state["emotion"],
                    reference_audio=state["reference_audio"], text_prompt=state["text_prompt"],
                    style_text=state["style_text"], style_weight=state["style_weight"])
audios.append(audio)
audio = np.concatenate(audios)

Expand Down
61 changes: 40 additions & 21 deletions bert_vits2/bert_vits2.py
@@ -1,10 +1,13 @@
import logging

import torch

from bert_vits2 import commons
from bert_vits2 import utils as bert_vits2_utils
from bert_vits2.clap_wrapper import get_clap_audio_feature, get_clap_text_feature
from bert_vits2.get_emo import get_emo
from bert_vits2.models import SynthesizerTrn
from bert_vits2.models_v230 import SynthesizerTrn as SynthesizerTrn_v230
from bert_vits2.text import *
from bert_vits2.text.cleaner import clean_text
from bert_vits2.utils import process_legacy_versions
@@ -80,14 +83,19 @@ def __init__(self, model_path, config, device=torch.device("cpu"), **kwargs):
self.num_tones = num_tones
if "ja" in self.lang: self.bert_model_names.update({"ja": "DEBERTA_V2_LARGE_JAPANESE_CHAR_WWM"})
if "en" in self.lang: self.bert_model_names.update({"en": "DEBERTA_V3_LARGE"})

# else:
# self.hps_ms.model.n_layers_trans_flow = 4
# self.hps_ms.model.emotion_embedding = 1
# self.lang = getattr(self.hps_ms.data, "lang", ["zh", "ja", "en"])
# self.num_tones = num_tones
# if "ja" in self.lang: self.bert_model_names.update({"ja": "DEBERTA_V2_LARGE_JAPANESE_CHAR_WWM"})
# if "en" in self.lang: self.bert_model_names.update({"en": "DEBERTA_V3_LARGE"})
elif self.version in ["2.3", "2.3.0"]:
self.lang = getattr(self.hps_ms.data, "lang", ["zh", "ja", "en"])
self.num_tones = num_tones
self.text_extra_str_map.update({"en": "_v230"})
if "ja" in self.lang: self.bert_model_names.update({"ja": "DEBERTA_V2_LARGE_JAPANESE_CHAR_WWM"})
if "en" in self.lang: self.bert_model_names.update({"en": "DEBERTA_V3_LARGE"})
else:
logging.debug("Version information not found. Loaded as the newest version: v2.3.")
self.lang = getattr(self.hps_ms.data, "lang", ["zh", "ja", "en"])
self.num_tones = num_tones
self.text_extra_str_map.update({"en": "_v230"})
if "ja" in self.lang: self.bert_model_names.update({"ja": "DEBERTA_V2_LARGE_JAPANESE_CHAR_WWM"})
if "en" in self.lang: self.bert_model_names.update({"en": "DEBERTA_V3_LARGE"})

if "zh" in self.lang:
self.bert_model_names.update({"zh": "CHINESE_ROBERTA_WWM_EXT_LARGE"})
@@ -99,22 +107,31 @@ def __init__(self, model_path, config, device=torch.device("cpu"), **kwargs):
def load_model(self, model_handler):
self.model_handler = model_handler

self.net_g = SynthesizerTrn(
len(self.symbols),
self.hps_ms.data.filter_length // 2 + 1,
self.hps_ms.train.segment_size // self.hps_ms.data.hop_length,
n_speakers=self.hps_ms.data.n_speakers,
symbols=self.symbols,
ja_bert_dim=self.ja_bert_dim,
num_tones=self.num_tones,
**self.hps_ms.model).to(self.device)
if self.version in ["2.3", "2.3.0"]:
self.net_g = SynthesizerTrn_v230(
len(symbols),
self.hps_ms.data.filter_length // 2 + 1,
self.hps_ms.train.segment_size // self.hps_ms.data.hop_length,
n_speakers=self.hps_ms.data.n_speakers,
**self.hps_ms.model,
).to(self.device)
else:
self.net_g = SynthesizerTrn(
len(self.symbols),
self.hps_ms.data.filter_length // 2 + 1,
self.hps_ms.train.segment_size // self.hps_ms.data.hop_length,
n_speakers=self.hps_ms.data.n_speakers,
symbols=self.symbols,
ja_bert_dim=self.ja_bert_dim,
num_tones=self.num_tones,
**self.hps_ms.model).to(self.device)
_ = self.net_g.eval()
bert_vits2_utils.load_checkpoint(self.model_path, self.net_g, None, skip_optimizer=True, version=self.version)

def get_speakers(self):
return self.speakers

def get_text(self, text, language_str, hps):
def get_text(self, text, language_str, hps, style_text=None, style_weight=0.7):
clean_text_lang_str = language_str + self.text_extra_str_map.get(language_str, "")
bert_feature_lang_str = language_str + self.bert_extra_str_map.get(language_str, "")

@@ -132,8 +149,9 @@ def get_text(self, text, language_str, hps):
word2ph[i] = word2ph[i] * 2
word2ph[0] += 1

style_text = None if style_text == "" else style_text
bert = self.model_handler.get_bert_feature(norm_text, word2ph, bert_feature_lang_str,
self.bert_model_names[language_str])
self.bert_model_names[language_str], style_text, style_weight)
del word2ph
assert bert.shape[-1] == len(phone), phone

@@ -173,8 +191,9 @@ def get_emo_(self, reference_audio, emotion):
return emo

def infer(self, text, id, lang, sdp_ratio, noise, noisew, length, reference_audio=None, emotion=None,
skip_start=False, skip_end=False, text_prompt=None, **kwargs):
zh_bert, ja_bert, en_bert, phones, tones, lang_ids = self.get_text(text, lang, self.hps_ms)
skip_start=False, skip_end=False, text_prompt=None, style_text=None, style_weight=0.7, **kwargs):
zh_bert, ja_bert, en_bert, phones, tones, lang_ids = self.get_text(text, lang, self.hps_ms, style_text,
                                                                   style_weight)

if self.hps_ms.model.emotion_embedding == 1:
emo = self.get_emo_(reference_audio, emotion).to(self.device).unsqueeze(0)
56 changes: 38 additions & 18 deletions bert_vits2/model_handler.py
@@ -14,7 +14,6 @@
from bert_vits2.text.japanese_bert_v200 import get_bert_feature as ja_bert_v200
from bert_vits2.text.english_bert_mock_v200 import get_bert_feature as en_bert_v200


class ModelHandler:
def __init__(self, device):
self.DOWNLOAD_PATHS = {
@@ -47,10 +46,12 @@ def __init__(self, device):
"https://hf-mirror.com/ku-nlp/deberta-v2-large-japanese-char-wwm/resolve/main/pytorch_model.bin",
],
"WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM": [

"https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin",
"https://hf-mirror.com/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin",
],
"CLAP_HTSAT_FUSED": [

"https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin?download=true",
"https://hf-mirror.com/laion/clap-htsat-fused/resolve/main/pytorch_model.bin?download=true",
]
}

@@ -62,7 +63,8 @@ def __init__(self, device):
"DEBERTA_V3_LARGE": "dd5b5d93e2db101aaf281df0ea1216c07ad73620ff59c5b42dccac4bf2eef5b5",
"SPM": "c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd",
"DEBERTA_V2_LARGE_JAPANESE_CHAR_WWM": "bf0dab8ad87bd7c22e85ec71e04f2240804fda6d33196157d6b5923af6ea1201",
"CLAP_HTSAT_FUSED": ""
"WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM": "176d9d1ce29a8bddbab44068b9c1c194c51624c7f1812905e01355da58b18816",
"CLAP_HTSAT_FUSED": "1ed5d0215d887551ddd0a49ce7311b21429ebdf1e6a129d4e68f743357225253",
}
self.model_path = {
"CHINESE_ROBERTA_WWM_EXT_LARGE": os.path.join(config.ABS_PATH,
@@ -141,28 +143,45 @@ def load_bert(self, bert_model_name, max_retries=3):
tokenizer, model, count = self.bert_models[bert_model_name]
self.bert_models[bert_model_name] = (tokenizer, model, count + 1)

def load_emotion(self):
def load_emotion(self, max_retries=3):
"""Bert-VITS2 v2.1 EmotionModel"""
if self.emotion is None:
from transformers import Wav2Vec2Processor
from bert_vits2.get_emo import EmotionModel
self.emotion = {}
self.emotion["model"] = EmotionModel.from_pretrained(
self.model_path["WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM"]).to(self.device)
self.emotion["processor"] = Wav2Vec2Processor.from_pretrained(
self.model_path["WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM"])
self.emotion["reference_count"] = 1
retries = 0
model_path = self.model_path["WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM"]
while retries < max_retries:
try:
self.emotion = {}
self.emotion["model"] = EmotionModel.from_pretrained(model_path).to(self.device)
self.emotion["processor"] = Wav2Vec2Processor.from_pretrained(model_path)
self.emotion["reference_count"] = 1
break
except Exception as e:
logging.error(f"Failed loading {model_path}. {e}")
self._download_model("WAV2VEC2_LARGE_ROBUST_12_FT_EMOTION_MSP_DIM")
retries += 1
else:
self.emotion["reference_count"] += 1

def load_clap(self):
def load_clap(self, max_retries=3):
"""Bert-VITS2 v2.2 ClapModel"""
if self.clap is None:
from transformers import ClapModel, ClapProcessor
self.clap = {}
self.clap["model"] = ClapModel.from_pretrained(self.model_path["CLAP_HTSAT_FUSED"]).to(self.device)
self.clap["processor"] = ClapProcessor.from_pretrained(self.model_path["CLAP_HTSAT_FUSED"])
self.clap["reference_count"] = 1
retries = 0
model_path = self.model_path["CLAP_HTSAT_FUSED"]
while retries < max_retries:
try:
self.clap = {}
self.clap["model"] = ClapModel.from_pretrained(model_path).to(self.device)
self.clap["processor"] = ClapProcessor.from_pretrained(model_path)
self.clap["reference_count"] = 1
break
except Exception as e:
logging.error(f"Failed loading {model_path}. {e}")
self._download_model("CLAP_HTSAT_FUSED")
retries += 1
else:
self.clap["reference_count"] += 1

@@ -173,9 +192,10 @@ def get_bert_model(self, bert_model_name):
tokenizer, model, _ = self.bert_models[bert_model_name]
return tokenizer, model

def get_bert_feature(self, norm_text, word2ph, language, bert_model_name):
def get_bert_feature(self, norm_text, word2ph, language, bert_model_name, style_text=None, style_weight=0.7):
tokenizer, model = self.get_bert_model(bert_model_name)
bert_feature = self.lang_bert_func_map[language](norm_text, word2ph, tokenizer, model, self.device)
bert_feature = self.lang_bert_func_map[language](norm_text, word2ph, tokenizer, model, self.device, style_text,
style_weight)
return bert_feature

def release_bert(self, bert_model_name):