Notes on neural text-to-speech (as of March 28, 2020)


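The focus is on methods that generate natural (human-sounding) speech from text and that run, or look easy to get running, in real time or interactively on mobile (offline) with LibTorch or TensorFlow C++.

The scope is limited to English.

Learns from videos on the internet and reproduces a speaker's voice quality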

FastSpeech

Seems to be able to do TTS fast. The source code is planned to be released.

FastSpeech: Fast, Robust and Controllable Text to Speech
https://arxiv.org/abs/1905.09263

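A community(?) implementation

The pretrained model gives reasonably good inference results.

Even on a CPU, "I am happy to see you again" can be synthesized in about 1 second (roughly 0.1 s for the Transformer and 0.9 s for Griffin-Lim). Combined with WaveGlow it takes about 9 seconds.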
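A per-stage breakdown like the one above (acoustic model vs. vocoder) can be measured with a small timing harness. The following is only a minimal sketch: `fake_acoustic_model` and `fake_vocoder` are hypothetical stand-ins, not the FastSpeech, Griffin-Lim, or WaveGlow code, and the 22,050 Hz sample rate and hop length of 256 are assumptions. It also reports the real-time factor (synthesis time divided by audio duration), where a value below 1 means faster than real time.

```python
import time

SAMPLE_RATE = 22050  # assumption: LJSpeech-style 22.05 kHz audio


def fake_acoustic_model(text):
    """Hypothetical stand-in for an acoustic model: text -> list of 80-bin mel frames."""
    return [[0.0] * 80 for _ in range(len(text) * 5)]


def fake_vocoder(mel, hop_length=256):
    """Hypothetical stand-in for a vocoder: mel frames -> waveform samples."""
    return [0.0] * (len(mel) * hop_length)


def profile_tts(text, acoustic_model, vocoder, sample_rate=SAMPLE_RATE):
    """Time each stage of a two-stage TTS pipeline and compute the real-time factor."""
    t0 = time.perf_counter()
    mel = acoustic_model(text)   # stage 1: text -> mel spectrogram
    t1 = time.perf_counter()
    audio = vocoder(mel)         # stage 2: mel spectrogram -> waveform
    t2 = time.perf_counter()

    audio_seconds = len(audio) / sample_rate
    total = t2 - t0
    return {
        "acoustic_s": t1 - t0,
        "vocoder_s": t2 - t1,
        "total_s": total,
        "audio_s": audio_seconds,
        # RTF < 1 means synthesis is faster than playback
        "rtf": total / audio_seconds if audio_seconds > 0 else float("inf"),
    }


if __name__ == "__main__":
    stats = profile_tts("I am happy to see you again",
                        fake_acoustic_model, fake_vocoder)
    for name, value in stats.items():
        print(f"{name}: {value:.6f}")
```

To profile a real model, pass its inference functions in place of the two stand-ins; the harness itself does not depend on any TTS library.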

Transformer-TTS

Neural Speech Synthesis with Transformer Network
https://arxiv.org/abs/1809.08895

LPCNet

Runs on mobile

Mellotron

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Can generate emotional speech and singing voice without any training data for emotion or singing.

The official implementation was released on November 25, 2019.

WaveFlow

A more compact representation than WaveGlow (fewer parameters). Uses 2D convolution. WaveGlow and WaveNet can also be defined as cases derived from it.

WaveFlow: A Compact Flow-based Model for Raw Audio
https://arxiv.org/abs/1912.01219

Work-in-progress implementation
https://github.com/L0SG/WaveFlow
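The 2D convolution mentioned above operates on a waveform that has been reshaped from 1D into a 2D matrix. As a rough illustration of that reshaping step only (not of the flow model itself), here is a minimal sketch. It assumes, based on my reading of the paper, that the waveform is laid out in column-major order into `h` rows so that adjacent samples end up in the same column; the function names are made up for this example.

```python
def squeeze_to_2d(x, h):
    """Reshape a 1D waveform into an h-row matrix in column-major order.

    Assumption (my reading of the WaveFlow paper): adjacent samples
    are placed in the same column.
    """
    if len(x) % h != 0:
        raise ValueError("waveform length must be a multiple of h")
    w = len(x) // h
    return [[x[col * h + row] for col in range(w)] for row in range(h)]


def unsqueeze_to_1d(X):
    """Inverse of squeeze_to_2d: read the matrix back column by column."""
    h, w = len(X), len(X[0])
    return [X[row][col] for col in range(w) for row in range(h)]


if __name__ == "__main__":
    waveform = list(range(8))          # toy "waveform": sample indices 0..7
    X = squeeze_to_2d(waveform, h=4)
    for row in X:
        print(row)
    assert unsqueeze_to_1d(X) == waveform
```

With `h=4`, the toy waveform `0..7` becomes a 4x2 matrix whose first column holds samples 0 to 3 and whose second column holds samples 4 to 7. The number of rows `h` is the knob that changes the layout.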

ForwardTacotron

FastSpeech inspired.

Feels like it takes the best parts of Tacotron and FastSpeech.

Others

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
https://arxiv.org/abs/1807.07281

Close to Human Quality TTS with Transformer
https://arxiv.org/abs/1809.08895

It looks like this could be implemented fairly easily using https://github.com/tensorflow/models/tree/master/official/transformer ...?

FFTNet http://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/
Aimed at real time. Possibly better than WaveRNN?

GAN-based text-to-speech synthesis and voice conversion (VC)
https://github.com/r9y9/gantts

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
https://arxiv.org/abs/2003.01950

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU
https://arxiv.org/abs/2005.07412

Impressions

Tacotron has the advantage of being trainable end to end, but getting good quality seems to take a lot of trial and error in training.
As of March 28, 2020:

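- Tacotron2 + WaveGlow is the baseline for quality (English).
- FastSpeech + WaveGlow: somewhat more mechanical than Tacotron2 + WaveGlow, but intonation and the like are better controlled (English). Training on a dataset other than LJSpeech might give good results.
- For mobile, perhaps FastSpeech + SqueezeWave or ForwardTacotron + SqueezeWave?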

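Transformer-based models are fast and look well suited to running on mobile.

Tacotron + WaveRNN seems to be geared to a single speaker (one set of training data per speaker), so for multi-speaker synthesis, voice conversion, adding emotion and so on, a different model looks better (DeepVoice, VoiceLoop, Mellotron, etc.).

Looking forward to MelNet.

Implementations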

https://github.com/espnet/espnet
- End-to-End Speech Processing Toolkit
- Packed with all sorts of things. Thank you.
https://github.com/keithito/tacotron
- Chinese (Mandarin) version https://github.com/keithito/tacotron/issues/118
https://github.com/NVIDIA/tacotron2
https://github.com/fatchord/WaveRNN
- Network only, so just for reference
Tacotron2 + WaveRNN https://github.com/h-meru/Tacotron-WaveRNN
Combination of the Tacotron-2 implementation by Rayhane-mamah with the WaveRNN-inspired method by fatchord https://github.com/m-toman/tacorn

Author And Source

More information on this topic (Notes on neural text-to-speech (as of March 28, 2020)) can be found at the original article: https://qiita.com/syoyo/items/6b7faf99cbfc9e2e173a


