SNAC: Speaker-normalized affine coupling layer in
flow-based architecture for zero-shot multi-speaker
text-to-speech
Byoung Jin Choi, Student Member, IEEE, Myeonghun Jeong, Student Member, IEEE, Joun Yeop Lee,
and Nam Soo Kim, Senior Member, IEEE
Abstract—Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristics of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on this speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which enables speech synthesis for unseen speakers in a zero-shot manner by leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector during training, enabling an inverse process of denormalization for a new speaker embedding at inference. The proposed conditioning scheme yields state-of-the-art performance in terms of speech quality and speaker similarity in the ZSM-TTS setting.
Index Terms—speech synthesis, zero-shot multi-speaker text-
to-speech, conditional normalizing flow
I. INTRODUCTION
As the sample quality of recently proposed neural text-to-speech (TTS) models [1]–[11] approaches that of natural speech, research interest has extended to high-fidelity multi-speaker TTS systems, which enable speech generation for multiple speakers with a single trained model.
However, training a multi-speaker TTS system requires a large dataset of [text, audio, speaker] tuples, for which the labeling can be costly. Furthermore, such systems are limited to generating the voices of speakers seen during training, whereas instant adaptation to a new speaker's voice may be required in real-life applications. To this end, personalized TTS is gaining considerable attention from researchers.
Personalized TTS aims at generating new speakers’ speech
with limited resources. One possible approach is speaker
adaptation. The idea of adapting a pre-trained TTS model to a
new speaker with more than one [text, audio] pair dates back
to the hidden Markov model (HMM)-based TTS [12]–[15].
[16] and [17] extend the maximum likelihood linear regression
(MLLR) algorithm for speaker adaptation. For more robust
speaker adaptation, structured maximum a posteriori linear regression (SMAPLR) [18] was developed by combining the maximum a posteriori (MAP) criterion with MLLR. The adaptation process is based on affine transformations of the means and variances of the HMM parameters for the target speaker, where the transformation matrices are derived by the maximum likelihood and MAP criteria, respectively. With the recent development of non-autoregressive neural TTS systems, [19] and [20] focus on effectively fine-tuning the parameters of a pre-trained neural TTS model to adapt to a new speaker's characteristics.
Another approach deals with an extreme situation where
only an [audio] from a target speaker is available. The model
is required to correctly reflect the unseen target speaker’s
characteristics without further finetuning the model. This task
is known as zero-shot multi-speaker TTS (ZSM-TTS). Some
of the previous works, [21]–[24], propose using an external
speaker encoder trained for speaker verification, while [25]–[27] utilize adversarial training to enhance generalization to unseen speakers. On the other hand, normalization-based conditioning techniques used in style transfer [28], [29] were introduced to condition speaker embeddings in FastSpeech-based models [25], [30]. These conditioning methods first remove the instance-specific information from the input via speaker normalization to preserve the content. The normalized input is then scaled and shifted by affine parameters predicted from the target speaker embedding vector.
However, recently proposed flow-based TTS models are
rather under-explored in ZSM-TTS applications. Leveraging
the aforementioned normalization-based speaker conditioning
techniques in flow-based models is especially challenging
because, unlike feed-forward models, the flow requires the inverse operation of such normalization.
In this letter, we propose a speaker-normalized affine cou-
pling (SNAC) layer for flow-based TTS models in the ZSM-
TTS scenario. The proposed method explicitly normalizes the input with speaker-dependent parameters to preserve speaker-independent information during training, while the target speaker's information is injected through denormalization in the inverse transformation at inference. We compare the proposed conditioning method with the existing method in several experimental settings using VITS [7] as our base model and demonstrate that it outperforms the conventional technique in both subjective and objective measures.
The audio samples are available on the demo page (https://byoungjinchoi.github.io/snac/).
II. AFFINE COUPLING-BASED GENERATIVE FLOW
Normalizing flow models [31]–[33] learn an invertible mapping between a prior distribution $p_\theta(z)$ and a more complex data distribution $p_\theta(x)$ using a sequence of bijective functions. The log-likelihood computation is tractable via the change-of-variables rule. Let $f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^D$ be a bijective function which maps the observed data $x$ to the latent variable $z$ drawn from a simple prior distribution $p_\theta(z)$, where $x, z \in \mathbb{R}^D$. Then the log-likelihood is obtained by

$$\log p_\theta(x) = \log p_\theta(z) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|. \tag{1}$$
Computing the log-determinant of the Jacobian matrix in (1) is computationally expensive in general. In addition, $f_\theta$ is strictly restricted to be a bijective function, and only certain types of transformations can be easily inverted. An affine coupling layer, first introduced in [31], allows for an efficient computation of the log-determinant with invertible transformations by generating an output $y \in \mathbb{R}^D$ given an input $x \in \mathbb{R}^D$ and $d < D$ via

$$\begin{aligned} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp(s_\theta(x_{1:d})) + b_\theta(x_{1:d}) \end{aligned} \tag{2}$$
where $s_\theta$ and $b_\theta$ are parameterized scale and bias functions mapping $\mathbb{R}^d \rightarrow \mathbb{R}^{D-d}$, and $\odot$ is an element-wise product. With this coupling architecture, the Jacobian becomes a lower triangular matrix as given by

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I_d & 0 \\ \dfrac{\partial y_{d+1:D}}{\partial x_{1:d}} & \mathrm{diag}(\exp(s_\theta(x_{1:d}))) \end{bmatrix} \tag{3}$$
where $I_d$ represents a $d \times d$ identity matrix. The determinant of the Jacobian matrix of the affine coupling layer does not depend on the Jacobians of $s_\theta$ and $b_\theta$. Therefore, they can be any type of complex functions modeled by highly expressive neural networks, such as a non-causal WaveNet [34]. The inverse transformation of the coupling layer can be easily derived as

$$\begin{aligned} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= \frac{y_{d+1:D} - b_\theta(y_{1:d})}{\exp(s_\theta(y_{1:d}))}, \end{aligned} \tag{4}$$
hence sampling is also efficient. Each coupling layer is then
followed by a layer which permutes the ordering of the
channels along the feature dimension.
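For concreteness, the following is a minimal PyTorch sketch of an affine coupling layer implementing (2)–(4). The small two-layer MLP stands in for the non-causal WaveNet-style coupling network used in practice, and the fixed half-split of channels is a simplification; class and variable names are illustrative, not taken from any specific codebase.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer sketch implementing (2)-(4)."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        # Stand-in for the non-causal WaveNet coupling network.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x_a, x_b = x[:, :self.d], x[:, self.d:]
        s, b = self.net(x_a).chunk(2, dim=-1)       # s_theta, b_theta in (2)
        y_b = x_b * torch.exp(s) + b
        logdet = s.sum(dim=-1)                      # log|det| from (3)
        return torch.cat([x_a, y_b], dim=-1), logdet

    def inverse(self, y):
        y_a, y_b = y[:, :self.d], y[:, self.d:]
        s, b = self.net(y_a).chunk(2, dim=-1)
        x_b = (y_b - b) * torch.exp(-s)             # inverse map (4)
        return torch.cat([y_a, x_b], dim=-1)
```

Stacking several such layers, each followed by a channel permutation, the log-likelihood in (1) is then the prior log-density of $z$ plus the accumulated log-determinant terms.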
III. SPEAKER-NORMALIZED AFFINE COUPLING LAYER FOR
ZSM-TTS
A conditional generative flow [35], [36] models a conditional probability distribution $p_\theta(x|g)$ where $g$ represents a conditioning term. Conventionally, a conditional flow extends the forward and inverse transformations of an affine coupling layer given in (2) and (4) by modifying $s_\theta$ and $b_\theta$ such that they take $g$ as an additional input.
For ZSM-TTS, the condition $g$ usually represents a specific speaker embedding vector. Our strategy for ZSM-TTS is to convert the speaker-dependent data distribution to a latent prior distribution which is speaker-independent. Then, when synthesizing speech, the speaker-independent latent prior distribution is mapped back to a speaker-specific data distribution depending on the given speaker embedding. In order to achieve this, we design each affine coupling layer to remove the information related to $g$ in the forward transformation. Conversely, $g$ is injected into the input embedding sequence in the inverse transformation. To obtain such a bijective transformation with explicit $g$ conditioning, we propose a speaker-normalized affine coupling (SNAC) layer, which normalizes and denormalizes the input embedding sequence by the mean and standard deviation parameters predicted from $g$. Speaker normalization ($SN$) and speaker denormalization ($SDN$) in SNAC are performed as follows:

$$\begin{aligned} SN(x; g) &= \frac{x - m_\theta(g)}{\exp(v_\theta(g))} \\ SDN(x; g) &= x \odot \exp(v_\theta(g)) + m_\theta(g) \end{aligned} \tag{5}$$
where $m_\theta$ and $v_\theta$ are simple linear projections to obtain the mean and standard deviation parameters from $g$. $SN$ and $SDN$ are applied across the temporal axis, thus normalizing and denormalizing each frame of the input $x$ with the same mean and standard deviation parameters.
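A minimal sketch of (5) is given below. The projections `m_proj` and `v_proj` play the roles of $m_\theta$ and $v_\theta$; tensor shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerNorm(nn.Module):
    """Speaker (de)normalization of (5), applied framewise along time."""

    def __init__(self, channels, spk_dim):
        super().__init__()
        self.m_proj = nn.Linear(spk_dim, channels)  # predicts mean m_theta(g)
        self.v_proj = nn.Linear(spk_dim, channels)  # predicts log-scale v_theta(g)

    def sn(self, x, g):
        # x: (batch, channels, time), g: (batch, spk_dim)
        m = self.m_proj(g).unsqueeze(-1)            # broadcast over time
        v = self.v_proj(g).unsqueeze(-1)
        return (x - m) * torch.exp(-v)              # SN(x; g)

    def sdn(self, x, g):
        m = self.m_proj(g).unsqueeze(-1)
        v = self.v_proj(g).unsqueeze(-1)
        return x * torch.exp(v) + m                 # SDN(x; g)
```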
The forward transformation of the SNAC layer is now given by

$$\begin{aligned} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= SN(x_{d+1:D}; g) \odot \exp\!\big(s_\theta(SN(x_{1:d}; g))\big) + b_\theta(SN(x_{1:d}; g)). \end{aligned} \tag{6}$$
The inverse transformation can be derived straightforwardly as follows:

$$\begin{aligned} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= SDN\!\left( \frac{y_{d+1:D} - b_\theta(SN(y_{1:d}; g))}{\exp\!\big(s_\theta(SN(y_{1:d}; g))\big)};\; g \right). \end{aligned} \tag{7}$$
At each SNAC layer, $SN$ is applied to the input of $s_\theta$ and $b_\theta$ so that the affine parameters contain information unrelated to the speaker. Since $x_{d+1:D}$ is also speaker-normalized in the forward transformation, this results in an extensive removal of speaker information during training. When inferring $x_{d+1:D}$ through the inverse transformation, $SDN$ is applied after the affine transformation of $y_{d+1:D}$ to appropriately inject information related to the target speaker.
The log-determinant of the conditional flow with SNAC layers can be obtained by

$$\log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right| = \sum_j \log \frac{\exp\!\big(s_\theta(SN(x_{1:d}; g))_j\big)}{\exp\!\big(v_\theta(g)_j\big)}. \tag{8}$$
The complete architecture of the SNAC layer is presented in
Fig. 1.
Fig. 1: (a) Forward and (b) inverse transformations of the SNAC layer.
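Putting (5)–(8) together, the sketch below shows one possible SNAC coupling layer. Shapes are simplified to (batch, channels), the small MLP again stands in for the WaveNet-like coupling network, and how the predicted mean and scale vectors are partitioned across the two channel halves is an assumption of this sketch rather than a detail taken from the paper.

```python
import torch
import torch.nn as nn

class SNACCoupling(nn.Module):
    """Hedged sketch of a SNAC layer, eqs. (6)-(8)."""

    def __init__(self, dim, spk_dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.m_proj = nn.Linear(spk_dim, dim)       # m_theta(g)
        self.v_proj = nn.Linear(spk_dim, dim)       # v_theta(g), a log-scale
        self.net = nn.Sequential(                   # stand-in coupling network
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def _sn(self, x, m, v):                         # SN(x; g) in (5)
        return (x - m) * torch.exp(-v)

    def _sdn(self, x, m, v):                        # SDN(x; g) in (5)
        return x * torch.exp(v) + m

    def forward(self, x, g):
        m, v = self.m_proj(g), self.v_proj(g)
        x_a, x_b = x[:, :self.d], x[:, self.d:]
        m_a, m_b = m[:, :self.d], m[:, self.d:]
        v_a, v_b = v[:, :self.d], v[:, self.d:]
        s, b = self.net(self._sn(x_a, m_a, v_a)).chunk(2, dim=-1)
        y_b = self._sn(x_b, m_b, v_b) * torch.exp(s) + b        # eq. (6)
        logdet = (s - v_b).sum(dim=-1)                          # eq. (8)
        return torch.cat([x_a, y_b], dim=-1), logdet

    def inverse(self, y, g):
        m, v = self.m_proj(g), self.v_proj(g)
        y_a, y_b = y[:, :self.d], y[:, self.d:]
        m_a, m_b = m[:, :self.d], m[:, self.d:]
        v_a, v_b = v[:, :self.d], v[:, self.d:]
        s, b = self.net(self._sn(y_a, m_a, v_a)).chunk(2, dim=-1)
        x_b = self._sdn((y_b - b) * torch.exp(-s), m_b, v_b)    # eq. (7)
        return torch.cat([y_a, x_b], dim=-1)
```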
IV. EXPERIMENTS
A. Model
We performed experiments on the proposed method by
replacing the affine coupling layer of the flow module in VITS
with the SNAC layer.
1) VITS overview: VITS leverages the variational autoen-
coder (VAE) [37] formulation and the adversarial training
strategy to successfully combine the joint training of acoustic
feature generation, vocoding, and duration prediction in an
end-to-end manner. The objective is to maximize the variational lower bound of a conditional log-likelihood,

$$\log p_\theta(o|l) \geq \mathbb{E}_{q_\phi(z|o)}\left[ \log p_\theta(o|z) - \log \frac{q_\phi(z|o)}{p_\theta(z|l)} \right] \tag{9}$$

where $p_\theta(o|z)$, $q_\phi(z|o)$, and $p_\theta(z|l)$ respectively denote the likelihood, the approximate posterior, and the conditional prior distributions. The target speech and the corresponding phoneme sequence are denoted as $o$ and $l = [l_{\text{text}}, A]$, and $z$ is a frame-level latent sequence representing the intermediate acoustic features. The alignment $A$ is estimated using the monotonic alignment search (MAS) algorithm proposed in [10]. The generator part of the VITS architecture consists of a posterior encoder, a prior encoder, a decoder, and a duration predictor, and it is trained with a discriminator in an adversarial manner. The prior encoder is composed of two parts: a text encoder and a flow module. The flow module plays an essential role in transforming a simple text-conditional distribution into a more complex distribution.
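As a reading aid, the following hedged sketch shows how a bound of the form (9) is typically estimated with a single reparameterized sample. The distribution parameterizations, the prior log-probability function, and the decoder are placeholders; they do not reflect the actual VITS losses, which in practice use a mel-spectrogram reconstruction term and evaluate the prior through the flow and its log-determinant.

```python
import torch
import torch.distributions as D

def elbo_estimate(o, post_mean, post_std, prior_log_prob, decoder_dist):
    """One-sample Monte Carlo estimate of the bound in (9).

    post_mean/post_std parameterize q_phi(z|o); prior_log_prob(z) returns
    log p_theta(z|l); decoder_dist(z) returns a distribution modeling
    p_theta(o|z). All of these are placeholders, not the actual VITS modules.
    """
    q = D.Normal(post_mean, post_std)
    z = q.rsample()                                   # reparameterized sample
    log_lik = decoder_dist(z).log_prob(o).sum(-1)     # log p_theta(o|z)
    kl_est = q.log_prob(z).sum(-1) - prior_log_prob(z)
    return (log_lik - kl_est).mean()
```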
2) Multi-speaker VITS: In a multi-speaker setting, the likelihood $p_\theta(o|l)$ is substituted with $p_\theta(o|l, g)$, where $g$ represents a speaker embedding. For training, the original work uses a speaker label as an additional input, which is transformed into a fixed-dimensional vector $g$ via a learnable embedding table; $g$ is then used as a conditioning input to every module of the generator.
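For illustration only, this lookup-table conditioning could be sketched as follows; the number of speakers and the embedding dimension are assumptions, not values from the VITS implementation.

```python
import torch
import torch.nn as nn

# Hypothetical speaker lookup table mapping a speaker label to a fixed-size
# vector g (sizes are assumptions).
spk_table = nn.Embedding(num_embeddings=109, embedding_dim=256)
speaker_id = torch.tensor([3])     # an arbitrary training-set speaker label
g = spk_table(speaker_id)          # g is then fed to every generator module
```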
B. Datasets
All tested models were trained on the VCTK dataset [38], a multi-speaker audio corpus containing approximately 44 hours of speech recorded by 109 speakers. We selected 11 speakers as an in-domain test set following [24]. To evaluate performance on out-of-domain data, we randomly selected 20 speakers from the LibriTTS [39] test-clean set, which consists of 8.56 hours of audio from 39 speakers. Each utterance was downsampled to 22050 Hz for training.
C. Implementation details
Our proposed method modifies the official implementation
of VITS
2
. For the partitioning scheme of affine coupling layer
at flow module, we chose channel-wise masking pattern [32].
To ensure that all input entries are processed, we reverse
the ordering of the feature dimension at each layer of flow
module. We employ a reference encoder to extract the speaker
embedding vector. The reference encoder is composed of a
stack of 2-D convolutional layers and a gated recurrent unit
(GRU) [40], following global style token (GST) [41]. The
reference encoder takes a sequence of linear spectrograms of
the reference audio as an input and outputs a 256-dimensional
embedding vector. Two separate linear projection layers, m
θ
and v
θ
, are employed to predict the mean and standard
deviation parameters of the speaker in (5) from the reference
embedding vector.
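As a concrete but hedged illustration of this reference encoder, the sketch below follows the GST-style design of a 2-D convolutional stack followed by a GRU producing a 256-dimensional embedding; the channel counts, kernel sizes, strides, and spectrogram dimension are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """GST-style reference encoder sketch: strided 2-D convs over a linear
    spectrogram followed by a GRU, producing a fixed-size speaker embedding."""

    def __init__(self, n_freq=513, emb_dim=256):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3,
                          stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(len(channels) - 1)
        ])
        freq_out = n_freq
        for _ in range(len(channels) - 1):      # frequency size after strided convs
            freq_out = (freq_out + 1) // 2
        self.gru = nn.GRU(input_size=channels[-1] * freq_out,
                          hidden_size=emb_dim, batch_first=True)

    def forward(self, spec):
        # spec: (batch, time, n_freq) linear spectrogram of the reference audio
        x = self.convs(spec.unsqueeze(1))               # (B, C, T', F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten channel x freq
        _, h = self.gru(x)                              # final GRU state
        return h.squeeze(0)                             # (batch, emb_dim) = g
```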
D. Experimental setup
We evaluated our method in several different settings. We
built our first baseline, Baseline+REF+ALL, by attaching a reference encoder to the vanilla multi-speaker VITS model, which applied the speaker conditioning at every module of the generator. The second baseline, Baseline+REF+FLOW, conditioned the speaker embedding only on the duration predictor and the flow module to focus on the effect of the proposed method. The last baseline, Baseline+PRE-TRAINED+FLOW, substituted the reference encoder with a pre-trained speaker encoder trained for speaker verification, representing a different speaker embedding scenario. We used an H/ASP model [42] obtained from an open-source project (https://github.com/clovaai/voxceleb_trainer). In this baseline, the pre-trained speaker encoder weights were fixed so that speaker embedding vectors were consistently drawn from a learned speaker embedding space. For the above three baselines, the speaker embedding vector was used as a conditional input to produce $s_\theta$ and $b_\theta$ at the affine coupling layers of the flow module.
To demonstrate the effect of the proposed method for each
setting, we replaced the conventional affine coupling layers
with the SNAC layers for the above three baselines. We name
the three proposed models corresponding to each baseline
as follows: Proposed+REF+ALL, Proposed+REF+FLOW, and Proposed+PRE-TRAINED+FLOW.

Model                         | VCTK MOS (↑) | VCTK SMOS (↑) | VCTK SECS (↑) | LibriTTS MOS (↑) | LibriTTS SMOS (↑) | LibriTTS SECS (↑)
Ground Truth                  | 4.76±0.02    | 4.19±0.04     | 0.748         | 4.80±0.02        | 4.51±0.03         | 0.646
Meta-StyleSpeech              | 2.06±0.04    | 2.62±0.05     | 0.212         | 2.00±0.03        | 2.50±0.04         | 0.131
YourTTS                       | 4.42±0.03    | 3.86±0.04     | 0.447         | 4.23±0.03        | 3.35±0.04         | 0.317
Baseline+REF+ALL              | 4.22±0.04    | 4.11±0.04     | 0.350         | 4.30±0.03        | 3.67±0.04         | 0.143
Baseline+REF+FLOW             | 4.08±0.04    | 4.01±0.04     | 0.339         | 3.98±0.04        | 3.64±0.04         | 0.135
Baseline+PRE-TRAINED+FLOW     | 4.38±0.03    | 3.52±0.04     | 0.321         | 4.17±0.03        | 2.91±0.05         | 0.135
Proposed+REF+ALL              | 4.30±0.03    | 4.07±0.04     | 0.320         | 4.11±0.03        | 3.56±0.04         | 0.145
Proposed+REF+FLOW             | 4.48±0.03    | 4.19±0.04     | 0.352         | 4.41±0.03        | 3.70±0.04         | 0.151
Proposed+PRE-TRAINED+FLOW     | 4.46±0.03    | 3.61±0.04     | 0.270         | 4.40±0.03        | 3.18±0.04         | 0.116

TABLE I: MOS, SMOS, and SECS on unseen speakers of VCTK and LibriTTS
Furthermore, we compared our models with two other
baseline models: Meta-StyleSpeech [25] and YourTTS [24].
Meta-StyleSpeech is trained with a meta-learning scheme on a modified structure of FastSpeech 2 [9]. YourTTS is built on the VITS architecture with an external speaker encoder and an additional speaker consistency loss. We used an open-source implementation of Meta-StyleSpeech (https://github.com/keonlee9420/StyleSpeech) and followed the paper to implement YourTTS on the official VITS code.
E. Evaluation method
We first conducted subjective tests to measure the overall speech quality using the mean opinion score (MOS). To assess the effectiveness of the proposed speaker conditioning method, we also measured the similarity mean opinion score (SMOS), which evaluates how similar the synthesized samples are to the reference speech samples in terms of speaker characteristics. Both MOS and SMOS are rated on a 5-point scale ranging from 1 to 5 and reported with 95% confidence intervals. For the in-domain evaluation, we randomly drew 3 pairs of text and reference audio from each of the 11 unseen speakers of the VCTK test set. For the out-of-domain case, 2 pairs of text and reference audio were drawn from each of 20 randomly selected speakers of the LibriTTS test-clean set. 15 judges participated in the subjective tests for both the VCTK and LibriTTS unseen speakers.

In addition, we evaluated an objective score for speaker similarity between the generated samples and the ground-truth samples using the speaker embedding cosine similarity (SECS). The SECS ranges from -1 to 1, where 1 indicates that both samples are from the same speaker. We computed SECS using a pre-trained speaker encoder model provided by the SpeechBrain toolkit [43] (https://github.com/speechbrain/speechbrain). The results of MOS, SMOS, and SECS are presented in Table I.
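For reference, here is a hedged sketch of how SECS between two utterances can be computed with a pre-trained SpeechBrain speaker encoder; the specific checkpoint (speechbrain/spkrec-ecapa-voxceleb) and the 16 kHz resampling step are assumptions, since the letter only states that a SpeechBrain model was used.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Assumed checkpoint; the letter does not specify which SpeechBrain model was used.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_spkrec",
)

def secs(path_a: str, path_b: str) -> float:
    """Cosine similarity between speaker embeddings of two audio files."""
    embs = []
    for path in (path_a, path_b):
        wav, sr = torchaudio.load(path)
        if sr != 16000:  # the assumed encoder expects 16 kHz input
            wav = torchaudio.functional.resample(wav, sr, 16000)
        embs.append(encoder.encode_batch(wav).squeeze())
    return torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0).item()
```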
F. Results
The MOS and SMOS results shown in Table I indicate that
Proposed+REF+FLOW consistently shows superior perfor-
mance over the baseline models in terms of sample quality
and speaker similarity.
In Proposed+REF+ALL, the SNAC-based flow module enforces the explicit removal of speaker information in the forward transformation, while the speaker information is still injected into the other generator modules. Training this model therefore neutralizes the effect of the SNAC layer, which accounts for the MOS and SMOS drop from Baseline+REF+ALL to Proposed+REF+ALL on the LibriTTS dataset. In contrast, the best performance in the subjective tests is consistently achieved by Proposed+REF+FLOW on both the VCTK and LibriTTS datasets. Nonetheless, YourTTS shows the highest SECS scores among all models, since YourTTS is trained with a speaker consistency loss that directly optimizes the speaker embedding cosine similarity.
From the synthesized samples, we noticed that the models using pre-trained speaker encoders occasionally produce the voice of a different speaker. This phenomenon is reflected in the lower SMOS of Baseline+PRE-TRAINED+FLOW and Proposed+PRE-TRAINED+FLOW. It suggests that jointly training a reference encoder may be more suitable for the ZSM-TTS task than using a pre-trained speaker encoder in terms of speaker stability. However, this does not affect the MOS as much, since the generated samples maintain consistent quality.
Although Proposed+REF+FLOW outperforms the baseline models on both the in-domain and out-of-domain datasets, an overall performance drop between the two settings still exists in terms of speaker similarity. Since the LibriTTS dataset inherently includes various channel conditions which may interfere with accurate inference of the speaker embedding, while VCTK contains only clean speech data, we conjecture that this domain mismatch accounts for the performance drop.
V. CONCLUSION
We have proposed a novel speaker conditioning method
for flow-based multi-speaker TTS. The experimental results
show that the proposed method outperforms the conventional
conditioning technique in a ZSM-TTS setting and achieves
the best performance in subjective tests compared to the other
baseline models. As future work, we intend to incorporate locally varying features related to prosody and accents.
ACKNOWLEDGMENT
This work was supported by Samsung Research, Samsung Electronics Co., Ltd.
REFERENCES
[1] Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis, in
Proc. Interspeech 2017, pp. 4006–4010, 2017.
[2] J. Shen et al., “Natural tts synthesis by conditioning wavenet on mel
spectrogram predictions, in Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783,
IEEE, 2018.
[3] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-
to-speech system based on deep convolutional networks with guided
attention,” in Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 4784–4788, IEEE, 2018.
[4] S. Ö. Arık et al., “Deep voice: Real-time neural text-to-speech,” in Proc. International Conference on Machine Learning, pp. 195–204, PMLR, 2017.
[5] W. Ping et al., “Deep voice 3: Scaling text-to-speech with convolutional
sequence learning, arXiv preprint arXiv:1710.07654, 2017.
[6] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis
with transformer network, in Proc. AAAI Conference on Artificial
Intelligence, vol. 33, pp. 6706–6713, 2019.
[7] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with
adversarial learning for end-to-end text-to-speech, in Proc. Interna-
tional Conference on Machine Learning, pp. 5530–5540, PMLR, 2021.
[8] Y. Ren et al., “Fastspeech: Fast, robust and controllable text to speech,
Proc. Advances in Neural Information Processing Systems, vol. 32, 2019.
[9] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu,
“Fastspeech 2: Fast and high-quality end-to-end text to speech, arXiv
preprint arXiv:2006.04558, 2020.
[10] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow
for text-to-speech via monotonic alignment search, Proc. Advances in
Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
[11] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-end adversarial text-to-speech,” arXiv preprint arXiv:2006.03575, 2020.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura,
“Simultaneous modeling of spectrum, pitch and duration in hmm-
based speech synthesis,” in Proc. Sixth European Conference on Speech
Communication and Technology, 1999.
[13] K. Tokuda, T. Kobayashi, and S. Imai, “Speech parameter generation
from hmm using dynamic features, in Proc. IEEE International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1,
pp. 660–663, IEEE, 1995.
[14] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Speech synthesis
using hmms with dynamic features, in Proc. IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1,
pp. 389–392, IEEE, 1996.
[15] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Voice characteristics
conversion for hmm-based speech synthesis system, in Proc. IEEE
international conference on acoustics, speech, and signal processing
(ICASSP), vol. 3, pp. 1611–1614, IEEE, 1997.
[16] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “Speaker adap-
tation for hmm-based speech synthesis system using mllr, in the third
ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, 1998.
[17] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, Adaptation of
pitch and spectrum for hmm-based speech synthesis using mllr, in
Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), vol. 2, pp. 805–808, IEEE, 2001.
[18] O. Siohan, T. A. Myrvoll, and C.-H. Lee, “Structural maximum a
posteriori linear regression for fast hmm adaptation, Computer Speech
& Language, vol. 16, no. 1, pp. 5–24, 2002.
[19] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning
with a few samples, Proc. Advances in Neural Information Processing
Systems, vol. 31, 2018.
[20] M. Chen et al., Adaspeech: Adaptive text to speech for custom voice,
arXiv preprint arXiv:2103.00993, 2021.
[21] Y. Jia et al., “Transfer learning from speaker verification to multispeaker
text-to-speech synthesis, Proc. Advances in Neural Information Pro-
cessing Systems, vol. 31, 2018.
[22] E. Cooper et al., “Zero-shot multi-speaker text-to-speech with state-
of-the-art neural speaker embeddings, in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 6184–6188, IEEE, 2020.
[23] E. Casanova et al., “Sc-glowtts: an efficient zero-shot multi-speaker text-
to-speech model, arXiv preprint arXiv:2104.05557, 2021.
[24] E. Casanova et al., “Yourtts: Towards zero-shot multi-speaker tts and
zero-shot voice conversion for everyone, in Proc. International Con-
ference on Machine Learning, pp. 2709–2720, PMLR, 2022.
[25] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-stylespeech:
Multi-speaker adaptive text-to-speech generation,” in Proc. International
Conference on Machine Learning, pp. 7748–7759, PMLR, 2021.
[26] S.-H. Lee, H.-W. Yoon, H.-R. Noh, J.-H. Kim, and S.-W. Lee, “Multi-
spectrogan: High-diversity and high-fidelity spectrogram generation with
adversarial style combination for speech synthesis, in Proc. AAAI
Conference on Artificial Intelligence, vol. 35, pp. 13198–13206, 2021.
[27] B. J. Choi, M. Jeong, M. Kim, S. H. Mun, and N. S. Kim, Adversarial
speaker-consistency learning using untranscribed speech data for zero-
shot multi-speaker text-to-speech, arXiv preprint arXiv:2210.05979,
2022.
[28] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The
missing ingredient for fast stylization,arXiv preprint arXiv:1607.08022,
2016.
[29] X. Huang and S. Belongie, Arbitrary style transfer in real-time with
adaptive instance normalization, in Proc. IEEE International Confer-
ence on Computer Vision, pp. 1501–1510, 2017.
[30] N. Kumar, S. Goel, A. Narang, and B. Lall, “Normalization driven
zero-shot multi-speaker speech synthesis., in Proc. Interspeech 2021,
pp. 1354–1358, 2021.
[31] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent
components estimation, arXiv preprint arXiv:1410.8516, 2014.
[32] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using
real nvp, arXiv preprint arXiv:1605.08803, 2016.
[33] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible
1x1 convolutions, Proc. Advances in Neural Information Processing
Systems, vol. 31, 2018.
[34] A. v. d. Oord et al., “Wavenet: A generative model for raw audio,arXiv
preprint arXiv:1609.03499, 2016.
[35] A. Atanov, A. Volokhova, A. Ashukha, I. Sosnovik, and D. Vetrov,
“Semi-conditional normalizing flows for semi-supervised learning,
arXiv preprint arXiv:1905.00505, 2019.
[36] J. Serrà, S. Pascual, and C. Segura Perales, “Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion,” Proc. Advances in Neural Information Processing Systems, vol. 32, 2019.
[37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes, arXiv
preprint arXiv:1312.6114, 2013.
[38] J. Yamagishi, C. Veaux, and K. MacDonald, “Cstr vctk corpus: English
multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” 2019.
University of Edinburgh. The Centre for Speech Technology Research
(CSTR). https://doi.org/10.7488/ds/2645.
[39] H. Zen et al., “Libritts: A corpus derived from librispeech for text-to-
speech, Proc. Interspeech 2019, pp. 1526–1530, 2019.
[40] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling, arXiv preprint
arXiv:1412.3555, 2014.
[41] Y. Wang et al., “Style tokens: Unsupervised style modeling, control
and transfer in end-to-end speech synthesis, in Proc. International
Conference on Machine Learning, pp. 5180–5189, PMLR, 2018.
[42] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, “Clova baseline system
for the voxceleb speaker recognition challenge 2020, arXiv preprint
arXiv:2009.14153, 2020.
[43] M. Ravanelli et al., “Speechbrain: A general-purpose speech toolkit,
arXiv preprint arXiv:2106.04624, 2021.