SNAC: Speaker-normalized affine coupling layer in
flow-based architecture for zero-shot multi-speaker
text-to-speech
Byoung Jin Choi, Student Member, IEEE, Myeonghun Jeong, Student Member, IEEE, Joun Yeop Lee,
and Nam Soo Kim, Senior Member, IEEE
Abstract—Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristics of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on this speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which enables speech synthesis for unseen speakers in a zero-shot manner by leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector during training, enabling an inverse process of denormalization for a new speaker embedding at inference. The proposed conditioning scheme yields state-of-the-art performance in terms of speech quality and speaker similarity in the ZSM-TTS setting.
Index Terms—speech synthesis, zero-shot multi-speaker text-
to-speech, conditional normalizing flow
I. INTRODUCTION
As the sample quality of recently proposed neural text-to-speech (TTS) models [1]–[11] approaches that of natural speech, research interest has extended to high-fidelity multi-speaker TTS systems, which enable speech generation for multiple speakers with a single trained model.
However, training a multi-speaker TTS system requires a large dataset of [text, audio, speaker] tuples, for which the labeling can be costly. Furthermore, such systems are limited to generating the voices of speakers seen during training, whereas instant adaptation to a new speaker's voice may be required in real-life applications. To this end, personalized TTS is gaining considerable attention from researchers.
Personalized TTS aims at generating new speakers’ speech
with limited resources. One possible approach is speaker
adaptation. The idea of adapting a pre-trained TTS model to a
new speaker with more than one [text, audio] pair dates back
to the hidden Markov model (HMM)-based TTS [12]–[15].
[16] and [17] extend the maximum likelihood linear regression
(MLLR) algorithm for speaker adaptation. For more robust
speaker adaptation, structured maximum a posteriori linear regression (SMAPLR) [18] was developed by combining the maximum a posteriori (MAP) criterion with MLLR. The adaptation process is based on affine transformations of the means and variances of the HMM parameters for the target speaker, where the transformation matrices are derived by the maximum likelihood and MAP criteria, respectively. With the recent development of non-autoregressive neural TTS systems, [19] and [20] focus on effectively fine-tuning the parameters of a pre-trained neural TTS model to adapt to a new speaker's characteristics.
Another approach deals with an extreme situation where
only an [audio] from a target speaker is available. The model
is required to correctly reflect the unseen target speaker’s
characteristics without further finetuning the model. This task
is known as zero-shot multi-speaker TTS (ZSM-TTS). Some
of the previous works, [21]–[24], propose using an external
speaker encoder trained for speaker verification, while [25]–[27] utilize adversarial training to enhance generalization to unseen speakers. On the other hand, normalization-based conditioning techniques used in style transfer [28], [29] were introduced to condition speaker embeddings in FastSpeech-based models [25], [30]. These conditioning methods first remove the instance-specific information from the input via speaker normalization to preserve the content. The normalized input is then scaled and shifted by affine parameters predicted from the target speaker embedding vector.
However, recently proposed flow-based TTS models are
rather under-explored in ZSM-TTS applications. Leveraging
the aforementioned normalization-based speaker conditioning
techniques in flow-based models is especially challenging
because, unlike feed-forward models, the flow requires the inverse operation of such normalization.
In this letter, we propose a speaker-normalized affine cou-
pling (SNAC) layer for flow-based TTS models in the ZSM-
TTS scenario. The proposed method explicitly normalizes the input with speaker-dependent parameters to preserve speaker-independent information during training, while the target speaker's information is injected through denormalization in the inverse transformation at inference. We compare the proposed conditioning method with the existing method in several experimental settings using VITS [7] as our base model and demonstrate that it outperforms the conventional technique in both subjective and objective measures.
The audio samples are available on the demo page (https://byoungjinchoi.github.io/snac/).
II. AFFINE COUPLING-BASED GENERATIVE FLOW
Normalizing flow models [31]–[33] learn an invertible mapping between a prior distribution $p_\theta(z)$ and a more complex data distribution $p_\theta(x)$ using a sequence of bijective functions. The log-likelihood computation is tractable via the change-of-variables rule. Let $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^D$ be a bijective function which maps the observed data $x$ to the latent variable $z$ drawn from a simple prior distribution $p_\theta(z)$, where $x, z \in \mathbb{R}^D$. Then the log-likelihood is obtained by

$$\log p_\theta(x) = \log p_\theta(z) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|. \tag{1}$$
Computing the log-determinant of the Jacobian matrix in (1) is computationally expensive in general. In addition, $f_\theta$ is strictly restricted to be a bijective function, and only certain types of transformations can be easily inverted. An affine coupling layer, first introduced in [31], allows for an efficient computation of the log-determinant with invertible transformations by generating an output $y \in \mathbb{R}^D$ given an input $x \in \mathbb{R}^D$ and $d < D$ via

$$\begin{aligned} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp(s_\theta(x_{1:d})) + b_\theta(x_{1:d}) \end{aligned} \tag{2}$$
where $s_\theta$ and $b_\theta$ are parameterized scale and bias functions mapping $\mathbb{R}^d \rightarrow \mathbb{R}^{D-d}$, and $\odot$ is an element-wise product. With this coupling architecture, the Jacobian becomes a lower triangular matrix as given by

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I_d & 0 \\ \dfrac{\partial y_{d+1:D}}{\partial x_{1:d}} & \mathrm{diag}(\exp(s_\theta(x_{1:d}))) \end{bmatrix} \tag{3}$$
where $I_d$ represents a $d \times d$ identity matrix. The determinant of the Jacobian matrix of the affine coupling layer does not depend on the Jacobians of $s_\theta$ and $b_\theta$. Therefore, they can be any type of complex functions modeled by highly expressive neural networks, such as a non-causal WaveNet [34]. The inverse transformation of the coupling layer can be easily derived as

$$\begin{aligned} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= \frac{y_{d+1:D} - b_\theta(y_{1:d})}{\exp(s_\theta(y_{1:d}))}, \end{aligned} \tag{4}$$
hence sampling is also efficient. Each coupling layer is then
followed by a layer which permutes the ordering of the
channels along the feature dimension.
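For concreteness, the following is a minimal PyTorch sketch of an affine coupling layer implementing (2)–(4). The small two-layer MLP stands in for the non-causal WaveNet-style coupling network used in practice, and the fixed half-split of channels is a simplification; class and variable names are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer sketch implementing (2)-(4)."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        # Stand-in for the non-causal WaveNet coupling network.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x_a, x_b = x[:, :self.d], x[:, self.d:]
        s, b = self.net(x_a).chunk(2, dim=-1)       # s_theta, b_theta in (2)
        y_b = x_b * torch.exp(s) + b
        logdet = s.sum(dim=-1)                      # log|det| from (3)
        return torch.cat([x_a, y_b], dim=-1), logdet

    def inverse(self, y):
        y_a, y_b = y[:, :self.d], y[:, self.d:]
        s, b = self.net(y_a).chunk(2, dim=-1)
        x_b = (y_b - b) * torch.exp(-s)             # inverse map (4)
        return torch.cat([y_a, x_b], dim=-1)
```

Stacking several such layers, each followed by a channel permutation, the log-likelihood in (1) is then the prior log-density of $z$ plus the accumulated log-determinant terms.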
III. SPEAKER-NORMALIZED AFFINE COUPLING LAYER FOR
ZSM-TTS
A conditional generative flow [35], [36] models a conditional probability distribution $p_\theta(x|g)$ where $g$ represents a conditioning term. Conventionally, a conditional flow extends the forward and inverse transformations of an affine coupling layer given in (2) and (4) by modifying $s_\theta$ and $b_\theta$ such that they take $g$ as an additional input.
For ZSM-TTS, the condition $g$ usually represents a specific speaker embedding vector. Our strategy for ZSM-TTS is to convert the speaker-dependent data distribution to a latent prior distribution which is speaker-independent. Then, when synthesizing speech, the speaker-independent latent prior distribution is mapped back to a speaker-specific data distribution depending on the given speaker embedding. In order to achieve this, we design each affine coupling layer to remove the information related to $g$ in the forward transformation. Conversely, $g$ is injected into the input embedding sequence in the inverse transformation. To obtain such a bijective transformation with explicit $g$ conditioning, we propose a speaker-normalized affine coupling (SNAC) layer, which normalizes and denormalizes the input embedding sequence by the mean and standard deviation parameters predicted from $g$. Speaker normalization ($SN$) and speaker denormalization ($SDN$) in SNAC are performed as follows:

$$\begin{aligned} SN(x; g) &= \frac{x - m_\theta(g)}{\exp(v_\theta(g))} \\ SDN(x; g) &= x \odot \exp(v_\theta(g)) + m_\theta(g) \end{aligned} \tag{5}$$
where $m_\theta$ and $v_\theta$ are simple linear projections to obtain the mean and standard deviation parameters from $g$. $SN$ and $SDN$ are applied across the temporal axis, thus normalizing and denormalizing each frame of the input $x$ with the same mean and standard deviation parameters.
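A minimal sketch of (5) is given below. The projections `m_proj` and `v_proj` play the roles of $m_\theta$ and $v_\theta$; tensor shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerNorm(nn.Module):
    """Speaker (de)normalization of (5), applied framewise along time."""

    def __init__(self, channels, spk_dim):
        super().__init__()
        self.m_proj = nn.Linear(spk_dim, channels)  # predicts mean m_theta(g)
        self.v_proj = nn.Linear(spk_dim, channels)  # predicts log-scale v_theta(g)

    def sn(self, x, g):
        # x: (batch, channels, time), g: (batch, spk_dim)
        m = self.m_proj(g).unsqueeze(-1)            # broadcast over time
        v = self.v_proj(g).unsqueeze(-1)
        return (x - m) * torch.exp(-v)              # SN(x; g)

    def sdn(self, x, g):
        m = self.m_proj(g).unsqueeze(-1)
        v = self.v_proj(g).unsqueeze(-1)
        return x * torch.exp(v) + m                 # SDN(x; g)
```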
The forward transformation of the SNAC layer is now given by

$$\begin{aligned} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= SN(x_{d+1:D}; g) \odot \exp\!\big(s_\theta(SN(x_{1:d}; g))\big) + b_\theta(SN(x_{1:d}; g)). \end{aligned} \tag{6}$$
The inverse transformation can be derived straightforwardly as follows:

$$\begin{aligned} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= SDN\!\left( \frac{y_{d+1:D} - b_\theta(SN(y_{1:d}; g))}{\exp\!\big(s_\theta(SN(y_{1:d}; g))\big)};\; g \right). \end{aligned} \tag{7}$$
At each SNAC layer, $SN$ is applied to the input of $s_\theta$ and $b_\theta$ so that the affine parameters contain information unrelated to the speaker. Since $x_{d+1:D}$ is also speaker-normalized in the forward transformation, this results in an extensive removal of speaker information during training. When inferring $x_{d+1:D}$ through the inverse transformation, $SDN$ is applied after the affine transformation of $y_{d+1:D}$ to appropriately inject information related to the target speaker.
The log-determinant of the conditional flow with SNAC layers can be obtained by

$$\log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| = \sum_j \log \frac{\exp\!\big(s_\theta(SN(x_{1:d}; g))_j\big)}{\exp\!\big(v_\theta(g)_j\big)}. \tag{8}$$
The complete architecture of the SNAC layer is presented in
Fig. 1.
Fig. 1: (a) Forward and (b) inverse transformations of the SNAC layer.
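Putting (5)–(8) together, the sketch below shows one possible SNAC coupling layer. Shapes are simplified to (batch, channels), the small MLP again stands in for the WaveNet-like coupling network, and how the predicted mean and scale vectors are partitioned across the two channel halves is an assumption of this sketch rather than a detail taken from the paper.

```python
import torch
import torch.nn as nn

class SNACCoupling(nn.Module):
    """Hedged sketch of a SNAC layer, eqs. (6)-(8)."""

    def __init__(self, dim, spk_dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.m_proj = nn.Linear(spk_dim, dim)       # m_theta(g)
        self.v_proj = nn.Linear(spk_dim, dim)       # v_theta(g), a log-scale
        self.net = nn.Sequential(                   # stand-in coupling network
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _sn(self, x, m, v):                         # SN(x; g) in (5)
        return (x - m) * torch.exp(-v)

    def _sdn(self, x, m, v):                        # SDN(x; g) in (5)
        return x * torch.exp(v) + m

    def forward(self, x, g):
        m, v = self.m_proj(g), self.v_proj(g)
        x_a, x_b = x[:, :self.d], x[:, self.d:]
        m_a, m_b = m[:, :self.d], m[:, self.d:]
        v_a, v_b = v[:, :self.d], v[:, self.d:]
        s, b = self.net(self._sn(x_a, m_a, v_a)).chunk(2, dim=-1)
        y_b = self._sn(x_b, m_b, v_b) * torch.exp(s) + b        # eq. (6)
        logdet = (s - v_b).sum(dim=-1)                          # eq. (8)
        return torch.cat([x_a, y_b], dim=-1), logdet

    def inverse(self, y, g):
        m, v = self.m_proj(g), self.v_proj(g)
        y_a, y_b = y[:, :self.d], y[:, self.d:]
        m_a, m_b = m[:, :self.d], m[:, self.d:]
        v_a, v_b = v[:, :self.d], v[:, self.d:]
        s, b = self.net(self._sn(y_a, m_a, v_a)).chunk(2, dim=-1)
        x_b = self._sdn((y_b - b) * torch.exp(-s), m_b, v_b)    # eq. (7)
        return torch.cat([y_a, x_b], dim=-1)
```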
IV. EXPERIMENTS
A. Model
We performed experiments on the proposed method by
replacing the affine coupling layer of the flow module in VITS
with the SNAC layer.
1) VITS overview: VITS leverages the variational autoen-
coder (VAE) [37] formulation and the adversarial training
strategy to successfully combine the joint training of acoustic
feature generation, vocoding, and duration prediction in an
end-to-end manner. The objective is to maximize the variational lower bound of a conditional log-likelihood,

$$\log p_\theta(o|l) \geq \mathbb{E}_{q_\phi(z|o)}\left[ \log p_\theta(o|z) - \log \frac{q_\phi(z|o)}{p_\theta(z|l)} \right] \tag{9}$$

where $p_\theta(o|z)$, $q_\phi(z|o)$, and $p_\theta(z|l)$ respectively denote the likelihood, the approximate posterior, and the conditional prior distributions. The target speech and the corresponding phoneme sequence are denoted as $o$ and $l = [l_{\text{text}}, A]$, and $z$ is a frame-level latent sequence representing the intermediate acoustic features. The alignment $A$ is estimated using the monotonic alignment search (MAS) algorithm proposed in [10]. The generator part of the VITS architecture consists of a posterior encoder, a prior encoder, a decoder, and a duration predictor, and it is trained with a discriminator in an adversarial manner. The prior encoder is composed of two parts: a text encoder and a flow module. The flow module plays an essential role in transforming a simple text-conditional distribution into a more complex distribution.
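As a reading aid, the following hedged sketch shows how a bound of the form (9) is typically estimated with a single reparameterized sample. The distribution parameterizations, the prior log-probability function, and the decoder are placeholders; they do not reflect the actual VITS losses, which in practice use a mel-spectrogram reconstruction term and evaluate the prior through the flow and its log-determinant.

```python
import torch
import torch.distributions as D

def elbo_estimate(o, post_mean, post_std, prior_log_prob, decoder_dist):
    """One-sample Monte Carlo estimate of the bound in (9).

    post_mean/post_std parameterize q_phi(z|o); prior_log_prob(z) returns
    log p_theta(z|l); decoder_dist(z) returns a distribution modeling
    p_theta(o|z). All of these are placeholders, not the actual VITS modules.
    """
    q = D.Normal(post_mean, post_std)
    z = q.rsample()                                   # reparameterized sample
    log_lik = decoder_dist(z).log_prob(o).sum(-1)     # log p_theta(o|z)
    kl_est = q.log_prob(z).sum(-1) - prior_log_prob(z)
    return (log_lik - kl_est).mean()
```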
2) Multi-speaker VITS: In a multi-speaker setting, the likelihood $p_\theta(o|l)$ is substituted with $p_\theta(o|l, g)$, where $g$ represents a speaker embedding. For training, the original work uses a speaker label as an additional input, which is transformed into a fixed-dimensional vector $g$ via a learnable embedding table; $g$ is then used as a conditioning input to every module of the generator.
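For illustration only, this lookup-table conditioning could be sketched as follows; the number of speakers and the embedding dimension are assumptions, not values from the VITS implementation.

```python
import torch
import torch.nn as nn

# Hypothetical speaker lookup table mapping a speaker label to a fixed-size
# vector g (sizes are assumptions).
spk_table = nn.Embedding(num_embeddings=109, embedding_dim=256)
speaker_id = torch.tensor([3])     # an arbitrary training-set speaker label
g = spk_table(speaker_id)          # g is then fed to every generator module
```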
B. Datasets
All tested models were trained on the VCTK dataset [38], a multi-speaker audio corpus containing approximately 44 hours of speech recorded by 109 speakers. We selected 11 speakers as an in-domain test set following [24]. To evaluate performance on out-of-domain data, we randomly selected 20 speakers from the LibriTTS [39] test-clean set, which consists of 8.56 hours of audio from 39 speakers. Each utterance was downsampled to 22050 Hz for training.
C. Implementation details
Our proposed method modifies the official implementation
of VITS
2
. For the partitioning scheme of affine coupling layer
at flow module, we chose channel-wise masking pattern [32].
To ensure that all input entries are processed, we reverse
the ordering of the feature dimension at each layer of flow
module. We employ a reference encoder to extract the speaker
embedding vector. The reference encoder is composed of a
stack of 2-D convolutional layers and a gated recurrent unit
(GRU) [40], following global style token (GST) [41]. The
reference encoder takes a sequence of linear spectrograms of
the reference audio as an input and outputs a 256-dimensional
embedding vector. Two separate linear projection layers, m
θ
and v
θ
, are employed to predict the mean and standard
deviation parameters of the speaker in (5) from the reference
embedding vector.
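As a concrete but hedged illustration of this reference encoder, the sketch below follows the GST-style design of a 2-D convolutional stack followed by a GRU producing a 256-dimensional embedding; the channel counts, kernel sizes, strides, and spectrogram dimension are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """GST-style reference encoder sketch: strided 2-D convs over a linear
    spectrogram followed by a GRU, producing a fixed-size speaker embedding."""

    def __init__(self, n_freq=513, emb_dim=256):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3,
                          stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(len(channels) - 1)
        ])
        freq_out = n_freq
        for _ in range(len(channels) - 1):      # frequency size after strided convs
            freq_out = (freq_out + 1) // 2
        self.gru = nn.GRU(input_size=channels[-1] * freq_out,
                          hidden_size=emb_dim, batch_first=True)

    def forward(self, spec):
        # spec: (batch, time, n_freq) linear spectrogram of the reference audio
        x = self.convs(spec.unsqueeze(1))               # (B, C, T', F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten channel x freq
        _, h = self.gru(x)                              # final GRU state
        return h.squeeze(0)                             # (batch, emb_dim) = g
```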
D. Experimental setup
We evaluated our method in several different settings. We
built our first baseline, Baseline+REF+ALL, by attaching a reference encoder to the vanilla multi-speaker VITS model, which applied the speaker conditioning at every module of the generator. The second baseline, Baseline+REF+FLOW, conditioned the speaker embedding only on the duration predictor and the flow module to focus on the effect of the proposed method. The last baseline, Baseline+PRE-TRAINED+FLOW, substituted the reference encoder with a pre-trained speaker encoder trained for speaker verification, representing a different speaker embedding scenario. We used an H/ASP model [42] obtained from an open-source project (https://github.com/clovaai/voxceleb_trainer). In this baseline, the pre-trained speaker encoder weights were fixed so that speaker embedding vectors were consistently drawn from a learned speaker embedding space. For the above three baselines, the speaker embedding vector was used as a conditional input to produce $s_\theta$ and $b_\theta$ at the affine coupling layers of the flow module.
To demonstrate the effect of the proposed method for each
setting, we replaced the conventional affine coupling layers
with the SNAC layers for the above three baselines. We name
the three proposed models corresponding to each baseline
as follows: Proposed+REF+ALL, Proposed+REF+FLOW, and Proposed+PRE-TRAINED+FLOW.

Model                         | VCTK MOS (↑) | VCTK SMOS (↑) | VCTK SECS (↑) | LibriTTS MOS (↑) | LibriTTS SMOS (↑) | LibriTTS SECS (↑)
Ground Truth                  | 4.76±0.02    | 4.19±0.04     | 0.748         | 4.80±0.02        | 4.51±0.03         | 0.646
Meta-StyleSpeech              | 2.06±0.04    | 2.62±0.05     | 0.212         | 2.00±0.03        | 2.50±0.04         | 0.131
YourTTS                       | 4.42±0.03    | 3.86±0.04     | 0.447         | 4.23±0.03        | 3.35±0.04         | 0.317
Baseline+REF+ALL              | 4.22±0.04    | 4.11±0.04     | 0.350         | 4.30±0.03        | 3.67±0.04         | 0.143
Baseline+REF+FLOW             | 4.08±0.04    | 4.01±0.04     | 0.339         | 3.98±0.04        | 3.64±0.04         | 0.135
Baseline+PRE-TRAINED+FLOW     | 4.38±0.03    | 3.52±0.04     | 0.321         | 4.17±0.03        | 2.91±0.05         | 0.135
Proposed+REF+ALL              | 4.30±0.03    | 4.07±0.04     | 0.320         | 4.11±0.03        | 3.56±0.04         | 0.145
Proposed+REF+FLOW             | 4.48±0.03    | 4.19±0.04     | 0.352         | 4.41±0.03        | 3.70±0.04         | 0.151
Proposed+PRE-TRAINED+FLOW     | 4.46±0.03    | 3.61±0.04     | 0.270         | 4.40±0.03        | 3.18±0.04         | 0.116

TABLE I: MOS, SMOS, and SECS on unseen speakers of VCTK and LibriTTS
Furthermore, we compared our models with two other
baseline models: Meta-StyleSpeech [25] and YourTTS [24].
Meta-StyleSpeech is trained with a meta-learning scheme on a modified structure of FastSpeech 2 [9]. YourTTS is built on the VITS architecture with an external speaker encoder and an additional speaker consistency loss. We used an open-source implementation of Meta-StyleSpeech (https://github.com/keonlee9420/StyleSpeech) and followed the paper to implement YourTTS on the official VITS code.
E. Evaluation method
We first conducted subjective tests to measure the overall speech quality using the mean opinion score (MOS). To assess the effectiveness of the proposed speaker conditioning method, we also measured the similarity mean opinion score (SMOS), which evaluates how similar the synthesized samples are to the reference speech samples in terms of speaker characteristics. Both MOS and SMOS are rated on a 5-point scale ranging from 1 to 5 and reported with 95% confidence intervals. For the in-domain evaluation, we randomly drew 3 pairs of text and reference audio from each of the 11 unseen speakers of the VCTK test set. For the out-of-domain case, 2 pairs of text and reference audio were drawn from each of 20 randomly selected speakers of the LibriTTS test-clean set. 15 judges participated in the subjective tests for both the VCTK and LibriTTS unseen speakers.

In addition, we evaluated an objective score for speaker similarity between the generated samples and the ground-truth samples using the speaker embedding cosine similarity (SECS). The SECS ranges from -1 to 1, where 1 indicates that both samples are from the same speaker. We computed SECS using a pre-trained speaker encoder model provided by the SpeechBrain toolkit [43] (https://github.com/speechbrain/speechbrain). The results of MOS, SMOS, and SECS are presented in Table I.
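For reference, here is a hedged sketch of how SECS between two utterances can be computed with a pre-trained SpeechBrain speaker encoder; the specific checkpoint (speechbrain/spkrec-ecapa-voxceleb) and the 16 kHz resampling step are assumptions, since the letter only states that a SpeechBrain model was used.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Assumed checkpoint; the letter does not specify which SpeechBrain model was used.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_spkrec",
)

def secs(path_a: str, path_b: str) -> float:
    """Cosine similarity between speaker embeddings of two audio files."""
    embs = []
    for path in (path_a, path_b):
        wav, sr = torchaudio.load(path)
        if sr != 16000:  # the assumed encoder expects 16 kHz input
            wav = torchaudio.functional.resample(wav, sr, 16000)
        embs.append(encoder.encode_batch(wav).squeeze())
    return torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0).item()
```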
F. Results
The MOS and SMOS results shown in Table I indicate that
Proposed+REF+FLOW consistently shows superior perfor-
mance over the baseline models in terms of sample quality
and speaker similarity.
In Proposed+REF+ALL, the SNAC-based flow module enforces the explicit removal of speaker information in the forward transformation, while the speaker information is still injected into the other generator modules. Training this model therefore neutralizes the effect of the SNAC layer, which accounts for the MOS and SMOS drop from Baseline+REF+ALL to Proposed+REF+ALL on the LibriTTS dataset. In contrast, the best performance in the subjective tests is consistently achieved by Proposed+REF+FLOW on both the VCTK and LibriTTS datasets. Nonetheless, YourTTS shows the highest SECS scores among all models, since YourTTS is trained with a speaker consistency loss that directly optimizes the speaker embedding cosine similarity.
From the synthesized samples, we noticed that the models using pre-trained speaker encoders occasionally produce the voice of a different speaker. This phenomenon is reflected in the lower SMOS of Baseline+PRE-TRAINED+FLOW and Proposed+PRE-TRAINED+FLOW. It suggests that jointly training a reference encoder may be more suitable for the ZSM-TTS task than using a pre-trained speaker encoder in terms of speaker stability. However, this does not affect the MOS as much, since the generated samples maintain consistent quality.
Although Proposed+REF+FLOW outperforms the baseline models on both the in-domain and out-of-domain datasets, an overall performance drop between the two settings still exists in terms of speaker similarity. Since the LibriTTS dataset inherently includes various channel conditions which may interfere with accurate inference of the speaker embedding, while VCTK contains only clean speech data, we conjecture that this domain mismatch accounts for the performance drop.
V. CONCLUSION
We have proposed a novel speaker conditioning method
for flow-based multi-speaker TTS. The experimental results
show that the proposed method outperforms the conventional
conditioning technique in a ZSM-TTS setting and achieves
the best performance in subjective tests compared to the other
baseline models. As future work, we intend to incorporate locally varying features related to prosody and accents.
ACKNOWLEDGMENT
This work was supported by Samsung Research, Samsung Electronics Co., Ltd.
REFERENCES
[1] Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis, in
Proc. Interspeech 2017, pp. 4006–4010, 2017.
[2] J. Shen et al., “Natural tts synthesis by conditioning wavenet on mel
spectrogram predictions, in Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783,
IEEE, 2018.
[3] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-
to-speech system based on deep convolutional networks with guided
attention,” in Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4784–4788, IEEE, 2018.
[4] S. Ö. Arık et al., “Deep voice: Real-time neural text-to-speech,” in Proc. International Conference on Machine Learning, pp. 195–204, PMLR, 2017.
[5] W. Ping et al., “Deep voice 3: Scaling text-to-speech with convolutional
sequence learning, arXiv preprint arXiv:1710.07654, 2017.
[6] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis
with transformer network, in Proc. AAAI Conference on Artificial
Intelligence, vol. 33, pp. 6706–6713, 2019.
[7] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with
adversarial learning for end-to-end text-to-speech, in Proc. Interna-
tional Conference on Machine Learning, pp. 5530–5540, PMLR, 2021.
[8] Y. Ren et al., “Fastspeech: Fast, robust and controllable text to speech,
Proc. Advances in Neural Information Processing Systems, vol. 32, 2019.
[9] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu,
“Fastspeech 2: Fast and high-quality end-to-end text to speech, arXiv
preprint arXiv:2006.04558, 2020.
[10] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow
for text-to-speech via monotonic alignment search, Proc. Advances in
Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
[11] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-end adversarial text-to-speech,” arXiv preprint arXiv:2006.03575, 2020.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura,
“Simultaneous modeling of spectrum, pitch and duration in hmm-
based speech synthesis,” in Proc. Sixth European Conference on Speech
Communication and Technology, 1999.
[13] K. Tokuda, T. Kobayashi, and S. Imai, “Speech parameter generation
from hmm using dynamic features, in Proc. IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1,
pp. 660–663, IEEE, 1995.
[14] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Speech synthesis
using hmms with dynamic features, in Proc. IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1,
pp. 389–392, IEEE, 1996.
[15] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Voice characteristics
conversion for hmm-based speech synthesis system, in Proc. IEEE
international conference on acoustics, speech, and signal processing
(ICASSP), vol. 3, pp. 1611–1614, IEEE, 1997.
[16] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “Speaker adap-
tation for hmm-based speech synthesis system using mllr, in the third
ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, 1998.
[17] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, Adaptation of
pitch and spectrum for hmm-based speech synthesis using mllr, in
Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), vol. 2, pp. 805–808, IEEE, 2001.
[18] O. Siohan, T. A. Myrvoll, and C.-H. Lee, “Structural maximum a
posteriori linear regression for fast hmm adaptation, Computer Speech
& Language, vol. 16, no. 1, pp. 5–24, 2002.
[19] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning
with a few samples, Proc. Advances in Neural Information Processing
Systems, vol. 31, 2018.
[20] M. Chen et al., Adaspeech: Adaptive text to speech for custom voice,
arXiv preprint arXiv:2103.00993, 2021.
[21] Y. Jia et al., “Transfer learning from speaker verification to multispeaker
text-to-speech synthesis, Proc. Advances in Neural Information Pro-
cessing Systems, vol. 31, 2018.
[22] E. Cooper et al., “Zero-shot multi-speaker text-to-speech with state-
of-the-art neural speaker embeddings, in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 6184–6188, IEEE, 2020.
[23] E. Casanova et al., “Sc-glowtts: an efficient zero-shot multi-speaker text-
to-speech model, arXiv preprint arXiv:2104.05557, 2021.
[24] E. Casanova et al., “Yourtts: Towards zero-shot multi-speaker tts and
zero-shot voice conversion for everyone, in Proc. International Con-
ference on Machine Learning, pp. 2709–2720, PMLR, 2022.
[25] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-stylespeech:
Multi-speaker adaptive text-to-speech generation,” in Proc. International
Conference on Machine Learning, pp. 7748–7759, PMLR, 2021.
[26] S.-H. Lee, H.-W. Yoon, H.-R. Noh, J.-H. Kim, and S.-W. Lee, “Multi-
spectrogan: High-diversity and high-fidelity spectrogram generation with
adversarial style combination for speech synthesis, in Proc. AAAI
Conference on Artificial Intelligence, vol. 35, pp. 13198–13206, 2021.
[27] B. J. Choi, M. Jeong, M. Kim, S. H. Mun, and N. S. Kim, Adversarial
speaker-consistency learning using untranscribed speech data for zero-
shot multi-speaker text-to-speech, arXiv preprint arXiv:2210.05979,
2022.
[28] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The
missing ingredient for fast stylization,arXiv preprint arXiv:1607.08022,
2016.
[29] X. Huang and S. Belongie, Arbitrary style transfer in real-time with
adaptive instance normalization, in Proc. IEEE International Confer-
ence on Computer Vision, pp. 1501–1510, 2017.
[30] N. Kumar, S. Goel, A. Narang, and B. Lall, “Normalization driven
zero-shot multi-speaker speech synthesis., in Proc. Interspeech 2021,
pp. 1354–1358, 2021.
[31] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent
components estimation, arXiv preprint arXiv:1410.8516, 2014.
[32] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using
real nvp, arXiv preprint arXiv:1605.08803, 2016.
[33] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible
1x1 convolutions, Proc. Advances in Neural Information Processing
Systems, vol. 31, 2018.
[34] A. v. d. Oord et al., “Wavenet: A generative model for raw audio,arXiv
preprint arXiv:1609.03499, 2016.
[35] A. Atanov, A. Volokhova, A. Ashukha, I. Sosnovik, and D. Vetrov,
“Semi-conditional normalizing flows for semi-supervised learning,
arXiv preprint arXiv:1905.00505, 2019.
[36] J. Serrà, S. Pascual, and C. Segura Perales, “Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion,” Proc. Advances in Neural Information Processing Systems, vol. 32, 2019.
[37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes, arXiv
preprint arXiv:1312.6114, 2013.
[38] J. Yamagishi, C. Veaux, and K. MacDonald, “Cstr vctk corpus: English
multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” 2019.
University of Edinburgh. The Centre for Speech Technology Research
(CSTR). https://doi.org/10.7488/ds/2645.
[39] H. Zen et al., “Libritts: A corpus derived from librispeech for text-to-
speech, Proc. Interspeech 2019, pp. 1526–1530, 2019.
[40] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling, arXiv preprint
arXiv:1412.3555, 2014.
[41] Y. Wang et al., “Style tokens: Unsupervised style modeling, control
and transfer in end-to-end speech synthesis, in Proc. International
Conference on Machine Learning, pp. 5180–5189, PMLR, 2018.
[42] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, “Clova baseline system
for the voxceleb speaker recognition challenge 2020, arXiv preprint
arXiv:2009.14153, 2020.
[43] M. Ravanelli et al., “Speechbrain: A general-purpose speech toolkit,
arXiv preprint arXiv:2106.04624, 2021.