The Eects of a Self-Similar Avatar Voice in Educational
Games
DOMINIC KAO, Purdue University, USA
RABINDRA RATAN, Michigan State University, USA
CHRISTOS MOUSAS, Purdue University, USA
ALEJANDRA J. MAGANA, Purdue University, USA
Avatar identication is one of the most promising research areas in games user research. Greater identication
with one’s avatar has been associated with improved outcomes in the domains of health, entertainment, and
education. However, existing studies have focused almost exclusively on the visual appearance of avatars.
Yet audio is known to inuence immersion/presence, performance, and physiological responses. We perform
one of the rst studies to date on avatar self-similar audio. We conducted a 2 x 3 (similar/dissimilar x mod-
ulation upwards/downwards/none) study in a Java programming game. We nd that voice similarity leads
to a signicant increase in performance, time spent, similarity identication, competence, relatedness, and
immersion. Similarity identication acts as a signicant mediator variable between voice similarity and all
measured outcomes. Our study demonstrates the importance of avatar audio and has implications for avatar
design more generally across digital applications.
CCS Concepts: Human-centered computing Empirical studies in HCI.
Additional Key Words and Phrases: Games; Avatar; Audio; Voice; Identication; Player Experience
ACM Reference Format:
Dominic Kao, Rabindra Ratan, Christos Mousas, and Alejandra J. Magana. 2021. The Effects of a Self-Similar Avatar Voice in Educational Games. Proc. ACM Hum.-Comput. Interact. 5, CHI PLAY, Article 238 (September 2021), 28 pages. https://doi.org/10.1145/3474665
1 INTRODUCTION
Virtual identities exist everywhere. From social network profiles, to video games, to virtual reality, there is almost always a representation of the self. Because these virtual representations serve as extensions of ourselves, we can identify with them, meaning we temporarily merge their identities with our own self-perception [36]. This identification can be so strong that studies have shown that we conform to the virtual representation's expected behaviors [158, 200]. This influences outcomes including negotiation aggressiveness [200, 201], food choices [66, 169], physical exercise [116, 150, 151], racial bias [149], math performance [113, 159], and creative thinking [26, 45, 76].
Greater identification with a virtual representation—often referred to as an avatar—is associated with increased motivation [16, 17, 182], performance [100], enjoyment [18, 115, 139, 180], flow [175], and trust in others [101]. Yet, despite the extensive literature attesting to avatars' influence, research has focused almost exclusively on visual aspects of the avatar rather than audial aspects.
Authors' addresses: Dominic Kao, kaod@purdue.edu, Purdue University, USA; Rabindra Ratan, [email protected], Michigan State University, USA; Christos Mousas, cmousas@purdue.edu, Purdue University, USA; Alejandra J. Magana, admagana@purdue.edu, Purdue University, USA.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s).
2573-0142/2021/9-ART238
https://doi.org/10.1145/3474665
This focus is potentially because audial aspects tend to be perceived as a non-critical element of avatar use, because silent avatars are often perceived as more identifiable, and because developing a variety of character voices is resource-intensive (e.g., hiring multiple voice actors, programming branching dialog trees) [199]. However, technological solutions to these challenges (e.g., high-quality text-to-speech engines, voice cloning software) can now greatly reduce the resources needed for developing avatar voices, signaling a new opportunity for research on avatar voice effects. For example, consider an avatar in an exercise application for running that speaks using voice characteristics with which the user identifies; greater identification with the avatar's voice could translate into increased exercise performance. Or consider a digital self-help application for smoking cessation; greater identification with the avatar's voice could decrease user attrition, increasing the odds of successfully overcoming addiction. Similarly, when using immersive technologies for learning, such as training how to perform surgery in virtual reality [136], greater identification with the avatar's voice could result in increased presence and motivation, increasing training effectiveness.
There is good reason to believe that an avatar's voice could influence outcomes. A meta-analysis of 83 studies in virtual environments found that the presence of audio contributes a small- to medium-sized effect on presence [43]. Furthermore, audio in games has been linked to greater immersion [55, 109, 135], physiological responses [79], performance [90], and emotional realism [13, 54]. Prior studies give us reason to believe that avatar audio in particular could influence avatar identification. Functional neuroimaging shows that perceived similarity is critical to simulating another person's internal state [131]. When a study participant watched a game show contestant with high perceived similarity, the participant experienced a significant increase in vicarious reward [132]. Researchers suggest that similar others trigger likeability, familiarity, and kin-motivated responses [59, 132, 146]. This is often referred to as similarity-attraction [27] and is highly relevant in the extensive literature on pedagogical agents and avatars [100]. Therefore, an avatar with a voice that is more similar to the user's own voice could increase engagement. Nevertheless, one can imagine that a self-similar avatar voice might instead break immersion because the avatar is speaking when the user is not. Additionally, Wauck et al. have shown that self-similarity in the context of visual appearance did not make a difference to game performance and experience [196]. As such, self-similar avatar audio might also produce negligible differences.
Our project treats voice similarity as a holistic quality of sound, encompassing characteristics of voice that people use to discriminate between speakers, such as tone, stress, intonation, rhythm, and tempo. Together [107], they comprise a voice identity that individuals might associate with other characteristics they identify with, such as masculinity and femininity [179]. Voice similarity is not a solved technical problem, and the most reliable measure of similarity is subjective ratings—e.g., [89], section 3.2. In this paper, our goal is to study how voice similarity (versus voice dissimilarity) influences users in an educational programming game. We chose to study avatar voice in an educational game as opposed to a game primarily for entertainment because we are also interested in whether STEM gender stereotypes influence the effects of avatar voice. For example, stereotyped avatars' identity characteristics have been found to influence performance in STEM learning contexts. Studies suggest that people perform better on STEM-related tasks when they are represented by a male avatar compared to a female avatar [112, 113, 159]. We expected a similar effect might occur due to avatar voice characteristics associated with masculinity and femininity.
We conducted an online study on Amazon's Mechanical Turk (MTurk) in which half of the participants were given an avatar voice that matched their own voice, while the other half of the participants were given an avatar voice that was randomly chosen from a pool of prior participants' voices. Additionally, we varied voice modulation across all participants; this consisted of pitch shifting the voice upwards, downwards, or not at all. Their avatar's voice was then used inside of the Java programming game as they played. Participants could spend as much time and complete as many
puzzles as they liked, reflecting motivation to engage in and learn from the game. Afterward, we collected measures of need satisfaction, intrinsic motivation, and avatar identification.
Our results show that voice similarity increases performance, time spent, similarity identification, competence, relatedness, and immersion. Similarity identification acts as a significant mediating variable between voice similarity and all measured outcomes. However, there was no evidence that voice modulation significantly influenced outcomes. Our study suggests that games can be made more engaging through self-similar avatar voice audio. Moreover, our study provides motivation for applying similar methods to virtual reality (e.g., effect on presence), voice assistants (e.g., Siri [6]), digital learning (e.g., second-language learning through hearing a self-similar voice), avatar customization (e.g., customizing avatar audio to be similar), and the Proteus effect (the phenomenon that avatar users tend to conform behaviorally to the identity characteristics that they associate with their avatars [200]) in the context of audio.
2 RELATED WORK
We describe research in three domains of interest: identification with avatars, audio in games, and player experience in games.
2.1 Avatar Identification
2.1.1 Identification. Identication is a temporary change in a user’s self-concept by adoption of
a media persona’s perceived characteristics [
36
]. Identication is one of the core components of
why media experiences are enjoyable [
37
,
38
]. For example, in literary ction, the reader is said
to adopt the protagonist’s emotions, experiences, and objectives such that they feel as if they are
the protagonist [
143
]. Or in television, the audience member is said to not only feel sympathetic
towards a character, but to feel with the character [
37
]. However, one key dierence in video games
from other genres of media, such as television, is that players have direct control over the behavior
and actions of their characters. Through this active participation, video games can override the
distance between players and their avatars [
36
]. Avatar identication can positively inuence
enjoyment [
18
,
115
,
139
,
180
], health outcomes [
104
], and learning interest [
8
]. Moreover, it can
positively inuence intrinsic motivation [
16
,
182
], ow [
175
], motivation to exercise [
115
,
190
],
trust in others [
101
], self-esteem [
195
], loyalty to a game [
178
], and appreciation of the game [
23
].
Avatar identication has also been associated with aggression [
105
], addiction [
174
], and depression
[14, 126].
2.1.2 Avatar Identification. Avatar identification is typically operationalized as a multi-faceted construct [50, 185]. Similarity identification can be understood as the extent to which we feel similar to the avatar. People expect to build more rewarding interpersonal interactions with [80], to more easily like, and to identify with media characters perceived as being similar [80, 206]. Therefore, avatars that are similar facilitate feelings of closeness and stronger vicarious experiences [185]. This phenomenon (sometimes called similarity-attraction [27, 85]) has been studied for decades in education, wherein pedagogical agents that are similar to users (e.g., in gender [75] and race [153, 165]) are more influential. Likewise, greater physical similarity with an on-screen avatar has been shown to significantly increase exercise effort [67]. Nevertheless, avatar dissimilarities can be valuable. For example, users are known to sometimes create avatars that represent idealized versions of themselves [14, 51]. This is known to foster wishful identification, wherein the avatar represents an improved version of the real-life self (e.g., leaner, more attractive, and fashionable [51]), represents an ideal, and is someone the user would like to be [185]. Embodied identification represents the concept of presence in a virtual environment through a "body container" [15, 121]. This concept refers to being the avatar, or feeling as if one is inside the avatar with the body of the avatar as
being one's own [185]. Studies have found that perceived embodiment—induced through increased control of an avatar—heightens the outcomes associated with the avatar's identity [202], supporting the notion that embodied identification should be considered in avatar-effects research. In this paper, we measure the influence of similarity, wishful, and embodied identification as mediators between voice similarity and other outcome variables.
2.1.3 Facilitating Avatar Identification. Currently, the prevailing method of increasing avatar identification is through avatar customization. Customization of one's avatar has been shown to positively increase avatar identification in a variety of contexts [16, 106, 181, 204]. Other factors that can increase identification include the presence of narrative [171] and the character's name [42]. However, no study to date has manipulated the avatar's voice audio.
2.2 Audio in Games
2.2.1 Audio Types. Audio can significantly influence players' experiences. A meta-analysis found that the existence of audio, compared to its absence, has a significant effect on presence [43]. Researchers have classified audio into speech and dialog, sound effects, and music [117]. Sound effects are further classified into avatar sounds, object sounds, character sounds, and ornamental sounds [117]. All categories of audio appear to have effects. Game music has been found to influence immersion [135, 170, 197], tension/anxiety [31], risk-taking behavior [163], and concentration [94]. Game sound effects, often an important source of feedback [53, 91, 96, 147, 161], affect immersion [73] and performance [32]. Additionally, the effects of audio are often contextually dependent on game genre [95], device type [164], and preferences [170]. Other studies, though, have found that audio has little effect [164]. Our goal in this paper is to study self-similar avatar voices.
2.2.2 Avatars and Audio. Researchers have suggested that avatar-based sounds, such as breathing (proprioceptive) and footsteps (exteroceptive), can facilitate imaginative immersion and help the player identify with their avatar [74]. It has also been suggested that audio creates a sense of self-representation, which can intensify self-awareness, body ownership, and place illusion [142]. A few early studies have explored how the addition of footstep sounds can influence presence [140] and movement behavior [141]. Researchers have also suggested that using one's own voice to interact with a game (e.g., voice commands) can positively affect avatar identification, despite the dissonance produced in speaking to the game [30].
Several studies show that voice affects users. Voiceovers for non-player characters have been shown to increase engagement in a role-playing game [28]. Virtual customer service representatives that include a text-to-speech voice increase flow [156] and trust [157]. However, not only the presence vs. absence of voice, but also voice similarity affects users. In a public-speaking experiment in front of a virtual classroom, participants either gave their own speech out loud, or had another participant's speech audio played back. Participants using their own voice experienced significantly higher presence [7]. However, this may have been a result of the voice similarity group actually having to give the speech while the dissimilarity group only had to act it out. In a study on synthesized voices, participants evaluated voices that were designed to have different personalities (extroverted vs. introverted). The authors found consistent support for similarity-attraction—i.e., participants rated more highly the voice that was more similar to their own personality [137]. Studies suggest that the perceived gender of the voice can also influence users.
During a lecture in which the same person spoke as both a male and a female (both voice morphed), students evaluated the female as more likeable and the male as more intelligent [49]. In a decision-making study, a male-voiced computer influenced the user's decision significantly more often than the female counterpart [111]. In a study with computerized voice output, three gender stereotypes
were found: male evaluation as more valid than female evaluation; dominance in females as unbecoming; and women knowing more about feminine topics (facts relating to love-and-relationships), with men knowing more about masculine topics (facts relating to computers) [138].¹ Consistent with such stereotyping, one study found that informative male and sociable female voice agents led to more positive assessments of an autonomous vehicle compared to stereotype-inconsistent gender matching (i.e., informative female, sociable male) [114]. Therefore, the gender of the avatar voice may be crucial to its influence on player experience.
Studies show, however, that pitch can also affect voice evaluation. Studies that manipulate voice pitch across multiple languages and cultures have found that men's and women's voices with lowered pitch are perceived as more dominant and masculine than those with raised pitch [5, 22, 93, 110, 145, 152]. People often assess voice pitch as being associated with a certain body mass and height [39, 40, 63, 69, 184], attractiveness [39, 63, 144, 154, 207], and age [63]. Studies show that a pitch manipulation of 20 Hz is sufficient to alter attractiveness ratings of voices [60–63, 92, 189]. In our study, we examine the effects of voice modulation (i.e., pitch manipulation). Specifically, we are interested in the interaction between gender and modulation direction, hypothesizing that a lower modulation will result in more positive outcomes in our programming game because of STEM gender stereotypes. Here, we conduct one of the first studies to look at either voice similarity or modulation and their effects on player experience (PX) in games.
2.3 Player Experience in Games
The past two decades have seen the development of a number of instruments to measure PX. These include the Game Immersion Questionnaire (GIQ) [35], the Immersive Experience Questionnaire (IEQ) [88], the Game Engagement Questionnaire (GEQ) [24], the Game Experience Questionnaire (GEQ-IJ) [84], the Digital Games Motives Scale (DGMS) [44], the Player Experience of Need Satisfaction (PENS) [168], and the Player Experience Inventory (PXI) [1]. In this study, we leverage the PENS because it is based on a well-grounded theoretical framework [168] and allows us to better contextualize our results in the existing literature, which uses the PENS as a theoretical framework to explicate avatar identification (e.g., [16]). More specifically, the PENS is based on self-determination theory (SDT). SDT, as originally conceptualized, consists of three core building blocks to explain human motivation, which in turn lead to greater performance, persistence, and creativity [46, 47, 167]. These building blocks are competence, the need for being effective at achieving desired objectives; autonomy, the need for having the ability to make decisions; and relatedness, the need for social closeness with others. This original model has been extended to games by including presence/immersion, the sense of actually being transported into the game world, and intuitive controls, the intuitiveness of the controls. In addition to the PENS, we also leverage the Intrinsic Motivation Inventory (IMI) [125], through which we measure interest/enjoyment, effort/importance, pressure/tension, and value/usefulness. Through the interest/enjoyment subscale, the IMI measures intrinsic motivation—the desire to complete an activity because of the satisfaction of doing so in and of itself.
Need satisfaction is essential for intrinsic motivation to exist [125]. A study in an endless runner game found that avatar identification increases autonomy, immersion, interest/enjoyment, effort/importance, positive affect, and time spent [16]. A study that involved playing an educational programming game, then making a custom game level for that same game, found that avatar identification increases need satisfaction, intrinsic motivation, self-efficacy, time spent, and quality of game levels [100]. A study in a jumping game found that avatar identification increases need satisfaction and time spent playing [98]. Here, we are interested in determining whether voice similarity influences need satisfaction and intrinsic motivation, as well as the potential mediating effect of avatar identification.

¹In discussing gender stereotypes, we acknowledge that although researchers have found these stereotypes to often be consistent across culture [48] and time [77], other studies have found some variability [52, 64]. Therefore, it is important to note that studies cited in this section were (1) based on stereotypes validated in the social scientific literature during a recent time period prior to each study and (2) for the culture from which the studies' participants are drawn.
2.4 Hypotheses
Building on the literature and arguments presented thus far, we pose the following hypotheses.
H1: Higher voice similarity will lead to more positive avatar identification, need satisfaction, intrinsic motivation, and performance.
H2: Avatar identification will mediate more positive need satisfaction, intrinsic motivation, and performance—i.e., voice similarity will lead to a higher level of avatar identification, which will in turn increase these outcomes.
H3: Consistent with gender stereotypes in STEM, voice modulation upwards/downwards will have a negative/positive effect, respectively, on avatar identification, need satisfaction, intrinsic motivation, and performance.
3 EXPERIMENTAL TESTBED
3.1 The Game
Our experimental testbed is CodeBreakers² [97], which was created for conducting avatar-based studies. CodeBreakers is a Java programming game in which players solve increasingly difficult problems by throwing snippets of code. See Figure 1. CodeBreakers was iteratively created with feedback from professional game developers, game designers, and Java developers, and included informal play testing with playtesters over an eighteen-month span. There were 14 total puzzles, spanning 6 levels. CodeBreakers was designed to incorporate best practices on effective learning curves [119]. Programming topics include data types, conditionals and control flow, classes and objects, inheritance and interfaces, loops and recursion, and data structures. Each puzzle had up to 3 hints, which are increasingly detailed. Players controlled their character using the keyboard and mouse. We measured performance through the number of puzzles completed. Players could exit at any time once they began playing. CodeBreakers was made available for machines running either Microsoft Windows or macOS.

Fig. 1. Data type puzzle (L). Curing a wounded knight (R). Placeholders (. . .) indicate where code snippets can be thrown.

²Gameplay video: https://youtu.be/x5U-Jd6tKXA
3.2 Validating Visual Avatar Design
For this experiment, the player avatar was purposefully designed to avoid known color effects (e.g., the color red is known to reduce mood, affect, and performance in cognitive-oriented tasks [71, 83, 99, 108, 127, 128]), to have ambiguous identity characteristics besides its gender, and to fit the game. We chose blue for the avatar color because it is not associated with negative cognitive effects and has comparable effects to other more neutral colors, such as gray, on test performance and heart rate variability (HF-HRV) [57]. Blue was also chosen to match the aesthetic of the game. The avatar models themselves were designed and created from scratch by a professional 3D game artist and were made intentionally abstract for ambiguity in identity. This was to reduce variance in how much players identified with the avatar's visual appearance. For example, an avatar with unambiguous identity characteristics would have a high similarity with only the subset of players who match those identity characteristics. See Figure 2.

Fig. 2. Male (L) and female (R) avatars.
To validate that these goals were met, we ran a validation study with 140 participants on Amazon Mechanical Turk (MTurk). We used a screening survey to retrieve 70 participants who self-identified as male and 70 participants who self-identified as female. After an audio check to ensure participants had their audio turned on, each participant played the base version of CodeBreakers (i.e., without any voice-related aspects) for a minimum of five minutes. All participants played with a gender-matched avatar. After five minutes, participants were allowed to quit at any time. After quitting, we asked participants the following questions: "How appropriate was the avatar color for the game?", "How appropriate was the avatar color for the avatar?", "How appropriate was the avatar clothing for the game?", "How appropriate was the avatar clothing for the avatar?", and "How appropriate was the avatar design overall?" on a scale from 1:Very Inappropriate to 5:Very Appropriate. Participants then answered two additional questions: "Besides its gender, the identity of my avatar was ambiguous (e.g., ethnicity/race)" and "My avatar resembled me" on a scale from 1:Strongly Disagree to 5:Strongly Agree. See Table 1. Participants were compensated $5.00 (USD) for taking part in this validation study.
We then performed independent samples t-tests between male and female participants. None of the tests were significant. ColorG: t(138)=0.54, p=0.59, d=0.09; ColorA: t(138)=0.87, p=0.39, d=0.15; ClothesG: t(138)=0.98, p=0.33, d=0.17; ClothesA: t(138)=0.87, p=0.39, d=0.15; DesignO: t(138)=0.31, p=0.76, d=0.05; Ambiguous: t(138)=0.64, p=0.52, d=0.11; Resemble: t(138)=0.36, p=0.72, d=0.06. Moreover, the average of all responses scored higher than Agree (4)—except for resemblance, which scored only slightly higher than neutral (~2.6–2.7). Therefore, these results validate our goals of avatars that fit the game and have ambiguous identity characteristics, without significant deviations across gender.

Gender   ColorG      ColorA      ClothesG    ClothesA    DesignO     Ambiguous   Resemble
         M    SD     M    SD     M    SD     M    SD     M    SD     M    SD     M    SD
Male     4.23 0.80   4.24 0.79   4.24 0.79   4.23 0.80   4.27 0.80   4.34 0.98   2.64 1.14
Female   4.16 0.77   4.11 0.96   4.37 0.77   4.34 0.76   4.23 0.85   4.24 0.86   2.71 1.19

Table 1. Descriptive results from the validation study, confirming that the avatar's visual characteristics were appropriate for the game (ColorG, ClothesG), for the avatar (ColorA, ClothesA), and overall (DesignO), that the avatar's identity was viewed as ambiguous (Ambiguous), and that resemblance to the avatar was similar across gender (Resemble). All measures are on a 5-pt Likert scale.
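The comparisons above reduce to a standard independent samples t-test plus a pooled-SD effect size; the following minimal Python sketch (assuming numpy and scipy are available, and using simulated placeholder ratings rather than our survey data) illustrates the computation for one item:

    import numpy as np
    from scipy import stats

    def cohens_d(a, b):
        # Cohen's d using the pooled standard deviation of two independent samples.
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * np.var(a, ddof=1)
                      + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
        return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

    # Placeholder 5-pt Likert ratings for one item (e.g., ColorG), 70 per group.
    rng = np.random.default_rng(7)
    male_ratings = rng.integers(1, 6, size=70).astype(float)
    female_ratings = rng.integers(1, 6, size=70).astype(float)

    t, p = stats.ttest_ind(male_ratings, female_ratings)  # df = 70 + 70 - 2 = 138
    print(f"t(138)={t:.2f}, p={p:.2f}, d={cohens_d(male_ratings, female_ratings):.2f}")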
3.3 Voice Manipulation Platform
For this study, we developed an online platform that takes as input a single audio voice clip and is able to generate an arbitrary number of similar voice clips. We did this by leveraging recent advances in neural network-based speech synthesis [89, 194]. We started with an open-source implementation of real-time voice cloning [87] (see Section 3.3.1) and made several significant additions. In order to run a large-scale study using this, it was necessary to create a version that could be deployed remotely in the cloud and could be accessed on-demand. We did this by first deploying the software to an Amazon EC2 P2 server (type p2.xlarge), a GPU-based computing instance that has 1 GPU, 4 vCPUs, and 61 GiB of RAM located in the region us-east-1b [4]. This server leverages NVIDIA's Compute Unified Device Architecture (CUDA), which allows for GPU support in running the software. We then created a server which uses HTTP POST requests to communicate with clients with the following message types: Upload (uploads the sample voice clip while specifying any modulation parameters, and returns a unique key for subsequent messages), Status Check (checks if a job has completed), and Download (gets a single voice file, or gets all voice files). When a client makes an upload request, it is placed in a queue and then served in order. We use 30 worker threads on our server so that multiple jobs can be processed concurrently. During the study, we kept watch over server performance (i.e., memory and GPU usage), which is important for consistent participant experiences. For example, high GPU usage would delay new requests from being processed in a timely manner. This was managed by limiting the number of concurrent study participants. To reduce wait time, clients only need to download the voice files for the next level to begin playing (while the remaining voice files are downloaded in the background asynchronously). It takes 15 ± 5 s (variation dependent on internet speed) for a U.S.-based client to request and to download voice files for the first level. All voice files on the server are deleted on download.
To perform voice modulation (i.e., pitch shifting), we first measure average fundamental frequency using a pitch floor of 75 Hz (Male) and 100 Hz (Female) and a pitch ceiling of 300 Hz (Male) and 500 Hz (Female) [102]. We performed pitch shifting using pitch synchronous overlap and add (PSOLA) [34]. PSOLA is frequently used in studies that manipulate voice [60–62, 70, 92, 93, 160, 189], and it affects only pitch, leaving other properties of voice perceptually unchanged [60, 61, 63]. After modulation, the sample is then processed through the voice cloning software. To mitigate differences in processing time, unmodulated samples are still dummy pitch-shifted to simulate the same time delay as modulated samples.
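This kind of pitch shifting can be reproduced with any PSOLA implementation; the sketch below uses Praat's overlap-add resynthesis through the parselmouth Python bindings, which is one common choice and an assumption on our part rather than the exact software used:

    import parselmouth
    from parselmouth.praat import call

    def shift_pitch_psola(wav_path, shift_hz, floor_hz, ceiling_hz):
        # Analyze the sample within the gender-specific pitch floor/ceiling.
        sound = parselmouth.Sound(wav_path)
        manipulation = call(sound, "To Manipulation", 0.01, floor_hz, ceiling_hz)
        # Shift all pitch points by a constant offset, then resynthesize with
        # pitch synchronous overlap-add, leaving other voice properties intact.
        pitch_tier = call(manipulation, "Extract pitch tier")
        call(pitch_tier, "Shift frequencies", sound.xmin, sound.xmax, shift_hz, "Hertz")
        call([manipulation, pitch_tier], "Replace pitch tier")
        return call(manipulation, "Get resynthesis (overlap-add)")

    # Example: shift a male participant's sample downwards by 20 Hz.
    shifted = shift_pitch_psola("voice_sample.wav", -20, 75, 300)
    shifted.save("voice_sample_down20.wav", "WAV")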
3.3.1 Voice Cloning Software Architecture. The open-source implementation of real-time voice cloning is described at a more granular level in [86], and the framework as a whole borrows heavily from [89]. There are three main components to the software architecture, each trained separately: 1) a speaker encoder, which creates embedding vectors representing the voice of the speaker [194]; 2) a synthesizer, which takes input text and, conditioning the text on the speaker embedding vectors, generates a mel spectrogram [172]; and 3) a vocoder, which converts the spectrogram into an audio waveform [183]. Implementation and training details can be found in [89] and [86].
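As an illustration of how the three components chain together at inference time, the following sketch is modeled on the open-source implementation [87]; the module paths and checkpoint filenames vary by version and are assumptions here:

    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    # Load the three separately trained models.
    encoder.load_model("encoder.pt")
    synthesizer = Synthesizer("synthesizer.pt")
    vocoder.load_model("vocoder.pt")

    # 1) Speaker encoder: a ~5 s voice sample becomes a fixed-size embedding vector.
    wav = encoder.preprocess_wav("voice_sample.wav")
    embedding = encoder.embed_utterance(wav)

    # 2) Synthesizer: input text, conditioned on the speaker embedding,
    #    becomes a mel spectrogram.
    mel = synthesizer.synthesize_spectrograms(["The castle is under siege!"],
                                              [embedding])[0]

    # 3) Vocoder: the mel spectrogram becomes an audio waveform in the cloned voice.
    waveform = vocoder.infer_waveform(mel)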
4 METHODS
4.1 Initial Voice Processing
To collect a sample of the participant's voice, during the beginning of the game, the player was asked to speak to an animated robot (Harley) which introduced the game to the participant. The participant was requested to speak the audio line: Hello, my name is [the character name chosen by the participant], I am about to play CodeBreakers, and it is very nice to meet you, Harley. We then checked the recorded audio for any long pauses without voice (>1 second) and that the entire audio length fit in a roughly acceptable interval for the voice cloning software (3–7 seconds). If these checks were violated, the participant was asked to re-record the audio until the audio was acceptable. The participant was then asked to listen to their recorded audio to ensure they could hear themselves speaking the sentence, and they were given the option of re-recording. After the voice sample was collected, it was sent to the server and processed while the participant completed the rest of the game introduction. This entire process was identical across conditions. During analysis, we manually checked every sample to ensure that the participant was clearly audible and had followed instructions; ~4.3% of participants were excluded based on this check. For sample rate, we use the default sampling rate from the participant's microphone. All samples are normalized to the same perceived volume using RMS (root mean square) normalization.
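The checks and normalization described above amount to a few lines of signal processing; here is a minimal sketch (using numpy and soundfile, with an assumed amplitude threshold for the pause heuristic and an assumed target RMS level):

    import numpy as np
    import soundfile as sf

    MIN_LEN, MAX_LEN = 3.0, 7.0   # acceptable clip length for the cloning software (s)
    MAX_PAUSE = 1.0               # reject recordings with >1 s of silence
    TARGET_RMS = 0.1              # assumed common perceived-volume target

    def validate_and_normalize(path, out_path, silence_thresh=0.01):
        samples, rate = sf.read(path)
        if samples.ndim > 1:
            samples = samples.mean(axis=1)          # mix down to mono
        duration = len(samples) / rate
        if not MIN_LEN <= duration <= MAX_LEN:
            return False                            # too short/long: ask to re-record
        # Longest run of consecutive low-amplitude samples approximates a pause.
        quiet = np.abs(samples) < silence_thresh
        run, longest = 0, 0
        for q in quiet:
            run = run + 1 if q else 0
            longest = max(longest, run)
        if longest / rate > MAX_PAUSE:
            return False                            # long pause: ask to re-record
        # RMS normalization to the same perceived volume across samples.
        rms = np.sqrt(np.mean(samples ** 2))
        sf.write(out_path, samples * (TARGET_RMS / rms), rate)
        return True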
4.2 Conditions
The study uses a 2 x 3 factorial design. We manipulate avatar voice similarity (similar vs. dissimilar)
and voice modulation (upwards vs. downwards vs. none). The manipulations are as follows:
Similar Voice: Avatar voice is generated using the participant’s voice.
Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CHI PLAY, Article 238. Publication date: September 2021.
238:10 Dominic Kao et al.
Dissimilar Voice: Avatar voice is generated using a gender-matched prior participant's voice. The voice was selected at random from a corpus of 10 (5 male and 5 female) samples collected during pilot testing.
Modulation Upwards: Original voice sample is pitch-shifted upwards by 20 Hz. We choose 20 Hz as it is a manipulation used in prior voice studies [60–63, 92, 189].
Modulation Downwards: Original voice sample is pitch-shifted downwards by 20 Hz.
No Modulation: Original voice sample is used. Dummy pitch-shifted to simulate time delay and ensure consistent user experiences across conditions.
Other than the above, all other aspects of the experiment were identical across conditions. In total, there were 30 possible voice lines that could have been triggered. Other than the first voice line (What am I doing here? Did my ship crash? How long have I been lying here for? I guess I should get up and look around.), audio lines typically come before and after each puzzle. For example, prior to puzzle #7: The castle is under siege!. And after completing puzzle #7: It worked! I neutralized all of the bugs by using the staff. These voice lines were accompanied by speech bubbles (see Figure 3). All audio aside from the avatar voice was identical across conditions, i.e., music and sound effects.
4.3 Measures
We use three validated PX questionnaires and gameplay metrics.
4.3.1 Avatar Identification. For measuring avatar identification, we use the player identification scale (PIS) [185]. The PIS measures three dimensions of avatar identification on a 5-pt Likert scale: similarity identification (e.g., "My character is similar to me"), embodied identification (e.g., "In the game, it is as if I become one with my character"), and wishful identification (e.g., "I would like to be more like my character").
4.3.2 Player Experience of Need Satisfaction. To measure need satisfaction, we use the PENS scale [168]. The PENS measures the following dimensions on a 7-pt Likert scale: competence (e.g., "I feel competent at the game"), autonomy (e.g., "The game provides me with interesting options and choices"), relatedness (e.g., "I find the relationships I form in this game fulfilling"), presence/immersion (e.g., "When playing the game, I feel transported to another time and place"), and intuitive controls (e.g., "Learning the game controls was easy").
4.3.3 Intrinsic Motivation Inventory. To measure intrinsic motivation, we use the IMI [125]. We leverage the following IMI dimensions, which use a 7-pt Likert scale: interest/enjoyment (e.g., "I enjoyed doing this activity very much"), effort/importance (e.g., "I put a lot of effort into this"), pressure/tension (e.g., "I felt very tense while doing this activity"), and value/usefulness (e.g., "I believe this activity could be of some value to me").
4.3.4 Game Performance. We automatically recorded metrics for game performance, including puzzles completed (max 14) and number of hints accessed (max 42). Re-played puzzles or re-accessed hints are not counted. These metrics were considered a reflection of motivation to engage in and learn from the game, which are clear intended outcomes of educational games.
4.3.5 Time Played. We operationalize motivated behavior as the time spent playing the game.
4.4 Participants
In total, 698 participants (30% female)³ with an average age of M = 33.53 (SD = 9.55) were recruited through Amazon Mechanical Turk (MTurk). MTurk is a platform in which workers complete Human
Intelligence Tasks (HITs), including tasks for research studies. Studies show that MTurk provides data of similar quality [25], diversity [12, 33, 82], and reliability [25, 123] as typical samples (e.g., college students). Participants were each paid $7.50. Participants who answered multiple surveys with zero variance, or whose responses on multiple surveys fell beyond ±3 SD, were excluded. Participant voice recordings were manually checked by an author blind to condition to ensure they were audible and had followed instructions. After exclusion based on these criteria, we were left with 657 participants (29% female) for analysis, with an average age of M = 33.45 (SD = 9.66). The HIT was available to workers in the U.S. over the age of 18 who had a computer with a working microphone. For quality control, workers were required to have a HIT approval rate >95%. The Purdue University Institutional Review Board (IRB) approved the study. All participants were asked to provide informed consent.

³The smaller proportion of female participants was unexpected given the female skew overall on MTurk (over 60% in the U.S.) [166], and it was a possible byproduct of male participants being more attracted to playing a programming game. Nevertheless, we proceeded with our analyses as planned because the total number of female participants was still high (>200) and our statistical testing is robust to unequal group sizes.
4.4.1 Experience With Video Games and Programming. Participants reported playing an average of M=11.6 (SD=10.5) hours of video games per week, above the global average of M=8.45 [118]. On a scale from 1:Minimal to 7:Extensive, participants rated their prior experience playing video games ("How would you rate your prior experience playing video games?") as M=5.38 (SD=1.66) and their prior programming experience ("How would you rate your prior programming experience?") as M=2.64 (SD=1.77). Next, we adapted several questions on programming experience from [173]. On a scale from 1:Very Inexperienced to 5:Very Experienced, participants rated their programming experience compared to experts ("How do you estimate your programming experience compared to experts with 20 years of practical experience?") as M=1.43 (SD=0.93), their programming experience compared to beginners ("How do you estimate your programming experience compared to beginner programmers?") as M=2.33 (SD=1.29), their programming experience in Java specifically ("How experienced are you with the Java programming language?") as M=1.73 (SD=1.05), and their experience with the object-oriented paradigm ("How experienced are you with the object-oriented programming paradigm?") as M=1.91 (SD=1.21). Therefore, our sample contains participants who are regularly exposed to video games and have low prior programming experience. ANOVAs found that there were no significant differences between conditions on prior gaming experience (F[5, 651]=0.053, p=0.998, η²p=.000), programming experience (F[5, 651]=0.345, p=0.886, η²p=.003), and Java programming experience (F[5, 651]=0.459, p=0.807, η²p=.004).
4.5 Design
A between-subjects factorial design was used. Each participant was randomly assigned to one of
six possible conditions. Participant counts in each condition were approximately equal (M=109.5,
SD=4.2), with a similar number of male (M=77.5, SD=3.6) and female (M=32.0, SD=5.3) participants
across each condition.
4.6 Procedure
Participants rst lled out an IRB-approved consent form. Participants were informed that they
could exit the game at any time. Participants then began playing CodeBreakers. At the beginning of
the game, participants underwent an audio check during which they were required to type a spoken
English word. Next, the participant was asked to speak into their microphone to conrm that we
could detect their audio input. Participants then selected a name and gender for their character. For
the purposes of the experiment, participants were asked to choose the same gender as their real-life
gender. We manually double-checked that their selected gender matched the gender reported post-
experiment. A robotic agent (see Figure 4) then engaged in a short conversation with the player.
The robot was animated with audio dialog generated through an automatic voice generator [
120
].
Nevertheless, we proceeded with our analyses as planned because the total number of female participants was still high
(>200) and our statistical testing is robust to unequal group sizes.
Fig. 4. The robotic agent during the introduction asks the player to speak (L), and introduces the game (R).
After a brief introduction, the robot asked the participant to introduce himself/herself through their microphone. When the participant was ready to speak, they clicked on the Record button, then clicked Stop Recording when finished. In case the recording was too short (<3 seconds), too long (>7 seconds), or contained long pauses (>1 second), the player was asked to retry and to keep their dialog a continuous ~5 seconds in length. Once completed, participants were asked to confirm they could hear their recorded audio and to re-record otherwise. Next, the participant's audio was sent to the server for processing as they completed the remainder of the introduction. During the rest of the introduction, the participant was briefed on how to play the game. Participants were told they could exit the game at any time by pressing ESC on their keyboard, then clicking quit game. The participant then began playing the game. In the event that the participant's avatar voice files for the first level had not yet been completely downloaded, we showed a loading screen. Participants played the game with a blue gender-matched avatar. All participant game data was automatically logged. Once participants quit the game (or completed all 6 levels), they filled out a set of manipulation check questions, the PIS, PENS, and IMI. Participants were then asked to describe in their own words any problems encountered and what they thought the purpose of the experiment was. None of the participants correctly guessed the purpose of the experiment. Participants then filled out a set of questions about prior video game experience, programming experience, and demographics.
4.7 Analysis
Data was analyzed using SPSS 23 and the PROCESS macro for SPSS [78]. Independent t-tests were used to compare voice similarity versus voice dissimilarity on the outcomes of performance, time spent, PIS, PENS, and IMI. We then performed a mediation analysis using avatar identification as the mediator. Voice similarity coded as a dichotomous variable is X, avatar identification is the mediator M, and performance, time spent, PENS, and IMI are Y. We use mediation with each dimension of identification modeled individually (similarity, embodied, wishful) rather than parallel mediation. We chose to do this because of multicollinearity between identification dimensions (correlations > 0.7), which can affect the estimation of mediation relationships in parallel mediation [78]. To investigate voice modulation, we use a 3 x 2 (modulation x gender) ANOVA⁴ because we expected voice modulation effects to be moderated by gender. We use an α of 0.05.

⁴Both t-tests and ANOVAs are considered robust to non-normality, especially at larger sample sizes [20, 122].
Voice        Pitch   Similar     Like Me     Gender      Age         Speaking    Friends     Likeable    Friendly    Fit
Condition            M    SD     M    SD     M    SD     M    SD     M    SD     M    SD     M    SD     M    SD     M    SD
Similar      None    4.33 1.54   4.31 1.61   6.28 1.08   4.79 1.52   4.22 1.67   4.20 1.70   4.71 1.67   4.79 1.64   4.78 1.50
             Up      4.17 1.51   4.17 1.45   6.13 1.28   4.91 1.34   4.27 1.53   4.25 1.50   4.68 1.42   4.81 1.46   4.75 1.16
             Down    4.19 1.68   4.17 1.79   6.17 1.32   4.81 1.46   4.18 1.81   4.26 1.66   4.40 1.61   4.52 1.66   4.90 1.52
Dissimilar   None    3.07 1.78   3.08 1.73   6.12 1.52   4.61 1.71   3.28 1.85   4.10 1.76   4.62 1.76   4.66 1.76   4.76 1.73
             Up      3.06 1.72   2.96 1.67   6.23 1.32   4.45 1.72   3.16 1.88   4.12 1.80   4.66 1.74   4.75 1.63   4.78 1.51
             Down    2.92 1.76   2.75 1.64   6.18 1.48   4.54 1.66   3.01 1.76   3.90 1.73   4.50 1.64   4.58 1.62   4.92 1.59

Table 2. Descriptive results of the manipulation check.
Voice        Pitch   Similar     Like Me     Gender      Age         Speaking    Friends     Likeable    Friendly    Fit
Condition            M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F    M_M  M_F
Similar      None    4.32 4.35   4.36 4.22   6.33 6.19   4.86 4.65   4.15 4.35   4.15 4.30   4.74 4.65   4.78 4.81   4.76 4.81
             Up      4.13 4.29   4.11 4.36   6.13 6.11   4.97 4.75   4.21 4.43   4.20 4.39   4.65 4.79   4.79 4.86   4.71 4.86
             Down    4.20 4.17   4.14 4.22   6.22 6.06   4.94 4.53   4.18 4.19   4.32 4.14   4.35 4.50   4.49 4.58   4.84 5.06
Dissimilar   None    3.02 3.20   3.05 3.20   6.13 6.08   4.65 4.48   3.29 3.24   4.10 4.12   4.63 4.60   4.71 4.48   4.74 4.84
             Up      3.17 2.79   2.99 2.90   6.26 6.17   4.45 4.45   3.21 3.04   4.17 4.00   4.77 4.38   4.81 4.59   4.77 4.79
             Down    2.97 2.81   2.79 2.68   6.26 6.03   4.61 4.41   3.05 2.92   3.92 3.84   4.55 4.41   4.64 4.46   4.90 4.97

Table 3. Manipulation check mean scores for participants self-identifying as male (M_M) and female (M_F).
5 RESULTS
5.1 Manipulation Check
The manipulation check consisted of 9 questions. The first 8 questions all began with "My avatar's voice sounded..." and ended with "similar to me"; "like me when I talk"; "the same gender as me"; "about my age"; "as if I was the one speaking"; "like someone I would be friends with"; "likeable"; and "friendly" on a 7-pt Likert scale (1:Strongly Disagree to 7:Strongly Agree). These questions assessed the voice similarity manipulation. The last question, "How well do you feel your avatar's voice fit with your avatar?", was on a 7-pt Likert scale (1:Very Poorly to 7:Very Well). This question assessed the perceived fit between the voice and the game avatar.
Between-subjects testing found that participants in the voice similarity condition scored significantly higher on "similar to me," t(655)=9.36, p<0.001, d=0.73; "like me when I talk," t(655)=9.97, p<0.001, d=0.78; "about my age," t(655)=2.45, p<0.05, d=0.19; and "as if I was the one speaking," t(655)=7.87, p<0.001, d=0.61, compared to participants in the voice dissimilarity condition. There were no significant differences across the remaining questions. From these results, the voice similarity manipulation was successful at inducing higher perceived avatar voice similarity. All conditions had a slightly above neutral fit between voice and avatar. See Table 2 and Table 3 for a more detailed breakdown of the manipulation check measures by voice similarity conditions as well as the voice modulation conditions and participant gender. The remaining results are organized by our hypotheses.
5.2 Eects of Voice Similarity
H1: Higher voice similarity will lead to more positive avatar identification, need satisfaction, intrinsic motivation, and performance.
Variable                 Similar Voice        Dissimilar Voice
                         M        SD          M        SD
**Puzzles Completed      6.58     5.64        5.47     5.28
Hints Accessed           13.95    13.15       12.07    12.50
*Time Spent Sec.         899.75   838.26      743.80   739.30
***PIS Similarity        2.99     1.03        2.75     1.02
PIS Embodied             2.99     1.12        2.93     1.10
PIS Wishful              2.53     1.04        2.50     1.09
***PENS Competence       4.57     1.60        4.21     1.62
PENS Autonomy            4.30     1.58        4.17     1.59
**PENS Relatedness       3.44     1.55        3.10     1.58
***PENS Immersion        3.88     1.58        3.52     1.60
PENS Controls            4.73     1.57        4.75     1.58
IMI Enjoyment            4.73     1.48        4.60     1.55
IMI Effort               5.47     1.23        5.37     1.31
IMI Pressure             3.34     1.56        3.29     1.60
IMI Value                4.64     1.59        4.55     1.63
* significant at p < .05; ** significant at p < .01; *** significant at p < .005.

Table 4. Results for effects of voice similarity (H1). PIS was on a 5-pt Likert scale, while IMI and PENS were on a 7-pt Likert scale.
5.2.1 Performance. Participants in the voice similarity condition completed significantly more puzzles than participants in the voice dissimilarity condition, t(655)=2.61, p<0.01, d=0.20. There was no significant difference in hints used between the voice similarity condition and the voice dissimilarity condition, t(655)=1.89, p=0.06, d=0.15.
5.2.2 Time Spent. Participants in the voice similarity condition played for a significantly longer period of time than participants in the voice dissimilarity condition, t(655)=2.53, p<0.05, d=0.20.
5.2.3 PIS. Participants in the voice similarity condition had significantly higher similarity identification than participants in the voice dissimilarity condition, t(655)=2.94, p<0.005, d=0.23. There was no significant difference in embodied identification between the voice similarity condition and the voice dissimilarity condition, t(655)=0.70, p=0.48, d=0.06. There was no significant difference in wishful identification between the voice similarity condition and the voice dissimilarity condition, t(655)=0.34, p=0.74, d=0.03.
5.2.4 PENS. Participants in the voice similarity condition experienced significantly higher competence than participants in the voice dissimilarity condition, t(655)=2.89, p<0.005, d=0.23. There was no significant difference in autonomy between the voice similarity condition and the voice dissimilarity condition, t(655)=1.02, p=0.31, d=0.08. Participants in the voice similarity condition experienced significantly higher relatedness than participants in the voice dissimilarity condition, t(655)=2.78, p<0.01, d=0.22. Participants in the voice similarity condition experienced significantly higher immersion than participants in the voice dissimilarity condition, t(655)=2.94, p<0.005, d=0.23. There was no significant difference in intuitive controls between the voice similarity condition and the voice dissimilarity condition, t(655)=0.12, p=0.90, d=0.01.
Outcome      Similarity Identification (a, b, c′, c, ab; 95% CI) | Embodied Identification (a, b, c′, c, ab; 95% CI) | Wishful Identification (a, b, c′, c, ab; 95% CI)
Performance
Puzzles      0.24*** 0.660*** 0.959* 1.114** 0.155; CI [0.038, 0.320] | 0.061 0.791*** 1.066* 1.114** 0.048; CI [-0.088, 0.196] | 0.028 -0.231 1.121** 1.114** -0.006; CI [-0.066, 0.045]
Hints        1.411*** 1.555 1.887 0.332; CI [0.068, 0.704] | 1.975*** 1.767 1.887 0.120; CI [-0.213, 0.494] | 0.137 1.884 1.887 0.004; CI [-0.080, 0.107]
Time Spent
Time Spent   88.74*** 135.1* 156.0* 20.90; CI [4.366, 44.60] | 98.66*** 150.0* 156.0* 5.99; CI [-11.46, 25.23] | -21.27 156.5* 156.0* -0.590; CI [-8.263, 5.479]
Player Experience of Need Satisfaction (PENS)
Competence   0.579*** 0.226 0.362*** 0.136; CI [0.046, 0.231] | 0.596*** 0.326*** 0.362*** 0.036; CI [-0.065, 0.139] | 0.524*** 0.348*** 0.362*** 0.015; CI [-0.070, 0.101]
Autonomy     0.640*** -0.025 0.126 0.151; CI [0.050, 0.257] | 0.682*** 0.084 0.126 0.041; CI [-0.074, 0.156] | 0.592*** 0.109 0.126 0.016; CI [-0.081, 0.115]
Relatedness  0.694*** 0.176 0.340** 0.164; CI [0.055, 0.276] | 0.618*** 0.302** 0.340** 0.038; CI [-0.068, 0.141] | 0.655*** 0.322*** 0.340** 0.018; CI [-0.088, 0.126]
Immersion    0.760*** 0.186 0.365*** 0.179; CI [0.057, 0.302] | 0.769*** 0.318*** 0.365*** 0.047; CI [-0.084, 0.176] | 0.743*** 0.344*** 0.365*** 0.021; CI [-0.100, 0.142]
Controls     0.460*** -0.123 -0.015 0.108; CI [0.034, 0.189] | 0.498*** -0.045 -0.015 0.030; CI [-0.056, 0.116] | 0.370*** -0.025 -0.015 0.010; CI [-0.051, 0.070]
Intrinsic Motivation Inventory (IMI)
Enjoyment    0.591*** -0.004 0.135 0.139; CI [0.047, 0.236] | 0.628*** 0.097 0.135 0.038; CI [-0.069, 0.145] | 0.479*** 0.122 0.135 0.013; CI [-0.065, 0.095]
Effort       0.032 0.094 0.102 0.008; CI [-0.016, 0.035] | 0.118** 0.095 0.102 0.007; CI [-0.014, 0.032] | 0.042 0.101 0.102 0.001; CI [-0.009, 0.015]
Tension      -0.066 0.063 0.048 -0.016; CI [-0.051, 0.015] | -0.063 0.052 0.048 -0.004; CI [-0.023, 0.011] | 0.023 0.047 0.048 0.001; CI [-0.010, 0.014]
Usefulness   0.594*** -0.050 0.090 0.140; CI [0.048, 0.239] | 0.584*** 0.055 0.090 0.036; CI [-0.065, 0.133] | 0.529*** 0.076 0.090 0.015; CI [-0.071, 0.102]
* significant at p < .05; ** significant at p < .01; *** significant at p < .005; significant ab based on 95% CI.

Table 5. Mediation results with voice similarity (X), avatar identification (M), and outcome (Y). Regression coefficients a (X→M), b (M→Y), c′ (direct X→Y), c (total X→Y), and ab. Significant results are bold.
5.2.5 IMI. There was no significant difference in enjoyment between the voice similarity condition and the voice dissimilarity condition, t(655)=1.14, p=0.25, d=0.09. There was no significant difference in effort between the voice similarity condition and the voice dissimilarity condition, t(655)=1.03, p=0.31, d=0.08. There was no significant difference in pressure between the voice similarity condition and the voice dissimilarity condition, t(655)=0.39, p=0.70, d=0.03. There was no significant difference in value between the voice similarity condition and the voice dissimilarity condition, t(655)=0.72, p=0.47, d=0.06.
5.2.6 Summary of Results. Higher voice similarity leads to a significant increase in performance, time spent, similarity identification, competence, relatedness, and immersion. Effect sizes (d) range from 0.20 to 0.23, making these effects small. However, given the complexity of player-game interactions, small effect sizes are not uncommon in games user research [16, 19, 177, 205]. Embodied and wishful identification, autonomy, controls, and intrinsic motivation were unaffected. See Table 4.
5.3 Avatar Identification as a Mediator
H2: Avatar identification will mediate more positive need satisfaction, intrinsic motivation, and performance.
From Table 5, we can see that voice similarity leads to higher similarity identification (a) and that higher similarity identification was subsequently related to higher performance, time spent, need satisfaction, interest/enjoyment, and value/usefulness (b). A 95% bias-corrected confidence interval based on 10,000 bootstrap samples indicates that the indirect effects (ab) are also significant. Therefore, we conclude that similarity identification significantly mediates the relationship between voice similarity and performance, time spent, need satisfaction, and intrinsic motivation.
Embodied identification was related to higher performance, time spent, need satisfaction, interest/enjoyment, effort/importance, and value/usefulness. Wishful identification was related to higher need satisfaction, interest/enjoyment, and value/usefulness. Indirect effects for embodied and wishful identification were non-significant.
5.4 Eects of Voice Modulation
H3: Consistent with gender stereotypes in STEM, voice modulation upwards/downwards will have a negative/positive effect, respectively, on avatar identification, need satisfaction, intrinsic motivation, and performance.
            (1)        (2)          (3)           (4)        (5)        (6)        (7)        (8)        (9)        (10)       (11)       (12)       (13)       (14)       (15)
            M    SD    M     SD     M      SD     M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD    M    SD
Male
Pitch —     6.63 5.63  13.68 13.13  876.4  765.8  3.02 1.05  3.06 1.05  2.58 1.08  4.62 1.53  4.38 1.61  3.42 1.59  3.80 1.60  4.90 1.47  4.68 1.38  5.36 1.23  3.10 1.49  4.60 1.58
Pitch ↑     6.53 5.39  13.60 12.53  820.4  766.3  2.99 0.98  3.08 1.09  2.56 1.07  4.69 1.58  4.50 1.54  3.32 1.56  3.86 1.56  5.18 1.46  4.87 1.48  5.55 1.29  2.90 1.39  4.84 1.57
Pitch ↓     6.56 5.39  13.37 13.14  809.0  712.5  2.92 1.03  3.01 1.08  2.59 1.08  4.53 1.49  4.20 1.41  3.32 1.55  3.64 1.60  4.82 1.45  4.55 1.54  5.22 1.22  3.04 1.36  4.45 1.58
Female
Pitch —     4.45 5.07  11.61 11.84  693.1  644.6  2.59 1.06  2.74 1.18  2.39 1.07  3.89 1.73  3.69 1.80  3.06 1.50  3.63 1.65  4.12 1.72  4.51 1.73  5.55 1.28  4.11 1.76  4.30 1.66
Pitch ↑     4.30 5.11  11.21 12.81  820.7  940.3  2.58 0.92  2.73 1.15  2.23 0.93  3.57 1.67  3.84 1.63  2.94 1.68  3.25 1.54  4.16 1.58  4.53 1.59  5.59 1.27  3.71 1.68  4.56 1.60
Pitch ↓     5.18 5.74  12.15 13.34  845.1 1039.4  2.64 1.09  2.76 1.16  2.47 1.04  4.00 1.68  4.18 1.61  3.22 1.59  3.67 1.66  4.32 1.79  4.68 1.57  5.49 1.37  4.23 1.73  4.62 1.75
Main Effect Gender
F           17.035     2.915        0.512         18.217     10.353     5.359      34.152     11.391     4.289      3.345      33.320     0.915      2.394      59.511     0.949
p           <0.001     0.088        0.474         <0.001     <0.005     <0.05      <0.001     <0.001     <0.05      0.068      <0.001     0.339      0.122      <0.001     0.330
η²p         0.026      0.004        0.001         0.027      0.016      0.008      0.050      0.017      0.007      0.005      0.049      0.001      0.004      0.084      0.001
Main Effect Pitch
F           0.352      0.035        0.149         0.033      0.021      0.737      0.379      0.497      0.393      0.435      0.460      0.234      1.331      2.501      1.079
p           0.703      0.966        0.861         0.968      0.980      0.479      0.684      0.609      0.675      0.648      0.632      0.792      0.265      0.083      0.340
η²p         0.001      0.000        0.000         0.000      0.000      0.002      0.001      0.002      0.001      0.001      0.001      0.001      0.004      0.008      0.003
Interaction Effect
F           0.357      0.104        0.996         0.296      0.104      0.466      1.556      2.818      0.451      1.854      1.271      1.115      0.359      0.727      1.308
p           0.700      0.901        0.370         0.744      0.901      0.628      0.212      0.060      0.637      0.157      0.281      0.328      0.698      0.484      0.271
η²p         0.001      0.000        0.003         0.001      0.000      0.001      0.005      0.009      0.001      0.006      0.004      0.003      0.001      0.002      0.004
Gender df=1, Pitch df=2, Interaction df=2, Error df=651
(1) Puzzles Comp. (2) Hints Accessed (3) Time Spent Sec. (4) PIS Similarity (5) PIS Embodied (6) PIS Wishful (7) PENS Comp. (8) PENS Autonomy (9) PENS Related. (10) PENS Immersion (11) PENS Controls (12) IMI Enjoyment (13) IMI Effort (14) IMI Pressure (15) IMI Value

Table 6. Results for effects of voice modulation (H3). Significant results are bold.
From Table 6, 2x3 ANOVAs (gender x voice modulation) found main effects of gender for puzzles
completed, similarity identification, embodied identification, wishful identification, competence,
autonomy, relatedness, intuitive controls, and pressure/tension. No main effects of voice modulation
were found, and no interaction effects between gender and voice modulation were found. Therefore,
voice modulation had a negligible impact on outcomes.
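Each column of Table 6 corresponds to one such analysis. As an illustration, the sketch below runs a single 2x3 ANOVA and derives partial eta squared from the sums of squares; the DataFrame, its column names, and the condition labels are hypothetical stand-ins for the per-participant data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 657
df = pd.DataFrame({
    "gender": rng.choice(["male", "female"], n),
    "pitch": rng.choice(["a", "b", "c"], n),        # three modulation levels, synthetic
    "puzzles": rng.poisson(6, n).astype(float),     # synthetic stand-in for outcome (1)
})

model = ols("puzzles ~ C(gender) * C(pitch)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)             # F and p per effect (Type II SS)

# Partial eta squared for each effect: SS_effect / (SS_effect + SS_error).
ss_error = table.loc["Residual", "sum_sq"]
effects = table.drop(index="Residual").copy()
effects["eta_sq_p"] = effects["sum_sq"] / (effects["sum_sq"] + ss_error)
print(effects[["F", "PR(>F)", "eta_sq_p"]])
```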
6 DISCUSSION
Existing literature has shown that an avatar's visual appearance affects its user [200]. Such effects
are moderated by how much we identify with the avatar, and this identification can be increased
through visual avatar customization. However, it remains unclear whether the audial aspects of an
avatar influence identification and other outcomes.
Here, we conducted a 2 x 3 (voice similarity x voice modulation) experiment with neural network
voice cloning. Higher voice similarity directly increases game performance⁵, time spent, similarity
identification, competence, relatedness, and immersion. Mediation analysis found that similarity
identification (M) mediates between voice similarity (X) and performance, time spent, need
satisfaction, and intrinsic motivation (Y). Therefore, avatar voice influences crucial PX outcomes.
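For readers unfamiliar with the cloning pipeline, the sketch below outlines the three-stage SV2TTS approach [89] as exposed by the open-source Real-Time Voice Cloning repository [86, 87]: a speaker encoder embeds a short sample of the player's voice, a synthesizer generates a mel spectrogram conditioned on that embedding, and a vocoder renders the waveform. Module and function names follow that repository's demo script; the checkpoint paths and input files are hypothetical and may differ across versions.

```python
from pathlib import Path

# Modules from the Real-Time Voice Cloning repository [87].
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Hypothetical checkpoint locations; the repository ships pretrained models.
encoder.load_model(Path("saved_models/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/synthesizer.pt"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 1. Speaker encoder: derive a fixed-size embedding from a short voice sample.
wav = encoder.preprocess_wav("player_sample.wav")
embedding = encoder.embed_utterance(wav)

# 2. Synthesizer: generate a mel spectrogram for a line of game dialogue,
#    conditioned on the player's embedding.
specs = synthesizer.synthesize_spectrograms(["Look around the environment."], [embedding])

# 3. Vocoder: invert the spectrogram to a waveform in the cloned voice.
cloned_wav = vocoder.infer_waveform(specs[0])
```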
Surprisingly, pitch shifting had no significant effect. Although studies have shown that manipu-
lating pitch by 20 Hz alters the attractiveness of a voice [60-63, 92, 189], our measurement instruments
did not focus on attractiveness and were instead geared towards PX and performance outcomes.
Furthermore, our study took place during gameplay, not in an environment where the player's sole
focus is on evaluating the audio being presented. The player's focus is instead divided among
multiple cognitive processes, e.g., visually interpreting the scene and learning gameplay. Therefore, the voice
modulation may have been too subtle to produce observable effects. Larger modulations (e.g., 40 Hz) should
be considered in future research on the potential for avatar voices to influence stereotype effects.

⁵For the similar voice condition, we see a significant increase in performance, but also a (non-significant) increase in hints
accessed (see Table 4). The mean increase in puzzles completed is ~1, while the mean increase in hints accessed is ~2. Each
puzzle contains three hints, with the first two designed to guide the player towards the answer (e.g., "Look around the
environment.") and the last hint providing the answer. Therefore, the increase in hints accessed alone does not explain the
performance increase. We interpret both the increased performance and the increased time spent as behavioral indicators of
increased motivation to engage in and learn from the game.
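As a concrete illustration of the kind of manipulation discussed above, the following sketch shifts a voice line by a fixed amount in Hz. The study's own modulation procedure is described in the Methods; this version uses librosa, which shifts pitch in semitones, so the Hz offset is converted relative to the estimated median f0. The file names and f0 search range are hypothetical.

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("voice_line.wav", sr=None)

# Estimate the median fundamental frequency over voiced frames.
f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
median_f0 = float(np.nanmedian(f0[voiced_flag]))

shift_hz = 40.0  # e.g., the larger modulation suggested above
n_steps = 12 * np.log2((median_f0 + shift_hz) / median_f0)  # Hz offset -> semitones
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
sf.write("voice_line_up40.wav", y_shifted, sr)
```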
6.1 Applications to Games
Although the amount of dialogue in our game can be considered minimal compared to other games,
such as Mass Effect [130], even this amount of player-similar audio significantly promoted
gameplay performance and influenced PX. This has significant implications for audio in games.
Game companies can create more engaging experiences through similar voice audio, leading to
greater commercial success. Similarly, games that promote health (e.g., exercise [10]), learning
(e.g., educational games [2]), and discovery (e.g., citizen science [41]) could benefit from increased
engagement. Engagement can translate to better habits, greater learning gains, and increased
scientific discoveries. Our results show that increasing similarity identification through higher
voice similarity results in increased need satisfaction, intrinsic motivation, and motivated behavior.
These outcomes are important across virtually all games.
6.2 Broader Voice Applications
More generally, virtual environments that contain voiced characters could also benefit from voice
similarity. For example, consider an intelligent agent, designed for math learning, whose voice
resembles the user's. An intelligent agent perceived as similar could improve learning
outcomes [9, 11, 75, 103]. Applications for learning a new language could similarly benefit users,
as hearing how one's own voice should sound could help users more easily imitate speech. Or
consider VR oil rig safety training where the narrator's voice resembles the trainee's. A similar
narrator could lead to more engaged and immersive training.
Many real-world devices incorporate voice assistants such as Siri [6], Cortana [129], Google
Assistant [72], and Alexa [3], and these are increasingly prevalent in homes, cars, and mobile
devices. Although the effect of voice similarity with these assistants has not been studied directly,
the present findings suggest that increasing voice similarity would lead to more positive interactions
with such voice assistants. More research is needed on the extensive number of potential use cases
for voice similarity.
6.3 Audio Customization
While we demonstrated these results in a controlled lab experiment, players will likely experience
even greater similarity identification and affected outcomes in realistic volitional play contexts
where players engage with their virtual representations over a longer period of time. For in-
stance, research suggests that over time we become more congruent with our virtual identities
[51, 159, 200, 203]. Of the types of identification measured (similarity, embodied, and wishful), only
similarity was affected. While this is expected given that we manipulated voice similarity, avatar
customization, by contrast, has been shown to increase similarity, embodied, and wishful
identification [16]. For example, the options during avatar customization allow players to
create not only themselves but an ideal that they would like to become [14]. This leads us to believe
that customization of avatar audio, similar to customization of an avatar's visual appearance, would
be beneficial for fostering avatar identification.
Although still not common, some games allow for customization of avatar audio. Games such
as Final Fantasy XIV [176], Saints Row IV [188], and Monster Hunter: World [29] allow for selec-
tion of different pre-created collections of voice audio. Other games allow the player to directly
manipulate the voice itself. Black Desert Online [148] and Red Dead Redemption 2 [162] both allow
for customization of pitch, with the latter introducing an additional "clarity" parameter. The Sims
4 [56] allows pitch adjustment and choosing between 'sweet,' 'melodic,' and 'lilted' for women,
and between 'clear,' 'warm,' and 'brash' for men. However, more extensive audio customization in
games does not currently exist. With these limited parameters, a self-similar voice is not possible
in most circumstances.
Nevertheless, more complex avatar audio customization could be highly beneficial. Allowing users
to create similar (and perhaps embodied and wishful, as is possible with visual avatar customization)
audial identities gives rise to new possibilities for identification (possibly leading to stronger
emotional attachments [21]), thereby enhancing a wide range of PX outcomes.
6.4 Behavioral Influence
This line of research on audial avatar identities is also relevant to the Proteus effect, the phenomenon
whereby avatar users tend to conform behaviorally to the identity characteristics that they associate
with their avatars [200]. This phenomenon has been studied extensively with respect to avatar
appearance [158], but not with respect to avatar voice characteristics. Just as taller avatars lead to
more aggressive negotiation [201], healthier-weight avatars lead to more physical activity [116],
and inventor-looking avatars lead to more creative brainstorming [76], an avatar that sounds more
confident, healthy, or creative could also cause enactments of those attributes. Future research on
the Proteus effect could use the methods adopted in the present study to confirm these expectations.
7 LIMITATIONS
Controlled experiments with random assignment are considered robust. However, compensating
participants to play a game in a controlled lab setting is fundamentally different from playing
of one's own volition. Future studies should seek to understand whether these results extend to
voluntary play.
As our study design was relatively complex, there was an inherent degree of randomness in
our conditions. For example, the three conditions that were aggregated into the similar voice
condition had slight degrees of dissimilarity due to the pitch modulation. Similarly, the dissimilar
voice was cloned from a random corpus of 10 participant voices and was also aggregated with
pitch-modulated versions. Nevertheless, these comparisons can be performed given the large
sample size and the manipulation check. For example, Table 2 validates that pitch-modulated
voices did not differ greatly in similarity from their unmodulated counterparts. That being said, it
is important to note that while the similar voice was rated as having higher-than-average
similarity with the player (~4.23), it cannot be considered a truly similar voice.
This is mostly a technological constraint: the state of the art in voice cloning is currently
unable to consistently generate very similar voices across all speakers. Future studies might address
this by collecting a larger corpus of participant audio to train deep learning algorithms to
create an even better matching voice prior to conducting the experiment. Moreover, the effect sizes
of our results fall in the small range. Nonetheless, this study has successfully compared a voice
that sounds more like the player to a voice that sounds less like the player, illustrating significant
differences in PX. The implications of such results can be of value to the HCI community more
broadly, as audio is often understudied in comparison to visual aspects of games and other systems.
This study used a single, education-oriented game that was designed for research purposes.
Hence, generalizability was not established for the types of games or media applications that are
used more commonly, such as entertainment-oriented action games or mobile phone operating
systems. The influence of voice similarity may depend on facets of the media design (e.g., pacing,
opportunities for voice-based interaction) as well as user orientations toward the media (e.g.,
playing for fun or to learn). Future research could examine such factors as moderators of the
effects of voice similarity on PX.
This research was designed to examine voiced avatars that speak for the avatar user, presumably
within single-player games or applications. However, many multi-user applications offer voice-
based communication [191, 192], which enhances user experiences and social trust [198], although
users rarely actually hear their own voices. That said, previous research suggests that when
gender is communicated through voices in online games, women are more likely to receive toxic
treatment [186, 193], which potentially triggers stereotype threat and causes psychological harm
[65]. Although the present research did not find any differences in stereotype-related outcomes due
to voice pitch modulation, the findings do suggest that user voices are malleable, just like the visual
characteristics of avatars. Technologies are currently available to consumers that facilitate voice
modification in multi-user games and other applications (e.g., [124, 187]), offering the potential to
switch genders or even species. Future research could use such tools to examine voice avatars and
stereotype effects in multi-user voice-communication contexts, e.g., social VR [68].
There were aspects of the experiment that were not entirely under our control. The quality of
the microphone and audio, for example, depended on the devices owned by each participant.
However, using participants' own devices increases ecological validity, as this is more typical of
how a person would play a game than a lab setting is. Other aspects could also have played
a role in the experiment. For example, we performed an audio check to ensure participants could
hear audio at the beginning of the experiment, and we additionally recorded participants' system
audio level whenever a voice line was triggered, but we had no control over the specific volume
being used or whether participants were really listening (e.g., putting their headphones down on the table).
Our research on voice pitch and stereotypes is based on decades of work on evolutionary
behavior. There are common associations between voice pitch and masculinity, femininity, and
dominance, and these associations exist across animal species and nonhuman primates [81, 134].
Furthermore, the "universality of voice pitch sexual dimorphism" has led researchers to argue that
such associations are expected to hold across cultures [155]. Nevertheless, this should not be taken
for granted, and such studies should be replicated in non-U.S. contexts.
One aspect not directly studied is the degree of similarity. For example, with too little similarity,
there may be no effect; with too much similarity, the voice may seem strange (e.g., an audial analogue
of the uncanny valley, which refers to revulsion toward nearly human-looking avatars [133]). Similarly, there
are ethical concerns that need to be explored prior to broadly deploying voice manipulation. A
recent workshop hosted by the U.S. Federal Trade Commission (FTC) discussed both the risks
and benefits of voice cloning [58]. Risks include fraud and harassment, while benefits include
synthesizing voices for those suffering from amyotrophic lateral sclerosis (ALS), Huntington's
disease, and autism. Nevertheless, the full implications of voice cloning are still unfolding.
8 CONCLUSION
Avatar identification is a topic of extensive research. Despite widespread acknowledgment of how
avatar identification benefits users, existing studies have focused on the visual appearance of avatars.
We presented one of the first studies to date on avatar self-similar audio. Higher voice similarity
leads to a significant increase in performance, time spent, similarity identification, competence,
relatedness, and immersion. Similarity identification acts as a significant mediator between
voice similarity and performance, time spent, need satisfaction, and intrinsic motivation. We
discussed the wide-ranging implications of these results for games and beyond. This study is an
important step towards understanding voice audio effects.
REFERENCES
[1]
Vero Vanden Abeele, Katta Spiel, Lennart Nacke, Daniel Johnson, and Kathrin Gerling. 2020. Development and
validation of the player experience inventory: A scale to measure player experiences at the level of functional and
psychosocial consequences. International Journal of Human Computer Studies 135, January 2019 (2020), 102370.
https://doi.org/10.1016/j.ijhcs.2019.102370
[2]
Ma Victoria Almeda, Erica Kleinman, Chaima Jemmali, Carter Ithier, Elizabeth Rowe, and Magy Seif El-Nasr. 2020.
Labeling debugging in may’s journey gameplay. In Proceedings of the 51st ACM Technical Symposium on Computer
Science Education. https://doi.org/10.1145/3328778.3372624
[3] Amazon. 2020. Alexa. https://developer.amazon.com/en-US/alexa
[4] Amazon. 2020. Amazon EC2 P2 Instances. https://aws.amazon.com/ec2/instance-types/p2/
[5]
Moya L. Andrews and Charles P. Schmidt. 1997. Gender presentation: Perceptual and acoustical analyses of voice.
Journal of Voice 11, 3 (1997), 307–313. https://doi.org/10.1016/S0892-1997(97)80009-4
[6] Apple. 2020. Siri. https://www.apple.com/siri/
[7]
Laura Aymerich-Franch, Cody Karutz, and Jeremy N Bailenson. 2012. Effects of Facial and Voice Similarity on
Presence in a Public Speaking Virtual Environment. ISPR Presence Live Conference (2012), 1–7.
[8]
Christine M. Bachen, Pedro Hernández-Ramos, Chad Raphael, and Amanda Waldron. 2016. How do presence, flow,
and character identification affect players' empathy and interest in learning from a serious computer game? Computers
in Human Behavior 64 (2016), 77–87. https://doi.org/10.1016/j.chb.2016.06.043
[9]
Jeremy N. Bailenson, Jim Blascovich, and Rosanna E. Guadagno. 2008. Self-representations in immersive virtual
environments. Journal of Applied Social Psychology 38, 11 (2008), 2673–2690.
[10]
Anna Barenbrock, Marc Herrlich, Kathrin Maria Gerling, Jan David Smeddinck, and Rainer Malaka. 2018. Varying
avatar weight to increase player motivation: Challenges of a gaming setup. Conference on Human Factors in Computing
Systems - Proceedings 2018-April (2018), 1–6. https://doi.org/10.1145/3170427.3188634
[11]
Al Baylor and Yanghee Kim. 2004. Pedagogical agent design: The impact of agent realism, gender, ethnicity, and
instructional role. Intelligent Tutoring Systems 1997 (2004), 592–603. https://doi.org/10.1007/978-3-540-30139-4_56
[12]
Adam J Berinsky, Gregory A Huber, and Gabriel S Lenz. 2012. Evaluating online labor markets for experimental
research: Amazon. com’s Mechanical Turk. Political Analysis 20, 3 (2012), 351–368.
[13]
Axel Berndt and Knut Hartmann. 2008. The functions of music in interactive media. Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5334 LNCS
(2008), 126–131. https://doi.org/10.1007/978-3-540-89454-4_19
[14]
K Bessière, AF Seay, and S Kiesler. 2007. The ideal elf: Identity exploration in World of Warcraft. CyberPsychology &
Behavior (2007). http://online.liebertpub.com/doi/abs/10.1089/cpb.2007.9994
[15]
Frank Biocca. 1997. Cyborg’s dilemma: Embodiment in virtual environments. In Proceedings of the International
Conference on Cognitive Technology.
[16]
Max V Birk, Cheralyn Atkins, Jason T Bowey, and Regan L Mandryk. 2016. Fostering Intrinsic Motivation through
Avatar Identication in Digital Games. CHI (2016). https://doi.org/10.1145/2858036.2858062
[17]
Max V. Birk and Regan L. Mandryk. 2018. Combating Attrition in Digital Self-Improvement Programs using Avatar
Customization. CHI ’18: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems (2018), 1–15. https://doi.org/10.1145/3173574.3174234
[18]
Max V Birk, Regan L Mandryk, and Cheralyn Atkins. 2016. The Motivational Push of Games. Proceedings of
the 2016 Annual Symposium on Computer-Human Interaction in Play - CHI PLAY ’16 April 2017 (2016), 291–303.
https://doi.org/10.1145/2967934.2968091
[19]
Max V. Birk, Regan L. Mandryk, Matthew K. Miller, and Kathrin M. Gerling. 2015. How self-esteem shapes our
interactions with play technologies. In CHI PLAY 2015 - Proceedings of the 2015 Annual Symposium on Computer-Human
Interaction in Play. https://doi.org/10.1145/2793107.2793111
[20]
María J. Blanca, Rafael Alarcón, Jaume Arnau, Roser Bono, and Rebecca Bendayan. 2017. Non-normal data: Is ANOVA
still a valid option? Psicothema (2017). https://doi.org/10.7334/psicothema2016.383
[21]
Julia Ayumi Bopp, Livia J. Müller, Lena Fanya Aeschbach, Klaus Opwis, and Elisa D. Mekler. 2019. Exploring emotional
attachment to game characters. CHI PLAY 2019 - Proceedings of the Annual Symposium on Computer-Human Interaction
in Play (2019), 313–324. https://doi.org/10.1145/3311350.3347169
[22]
Barbara Borkowska and Boguslaw Pawlowski. 2011. Female voice frequency in the context of dominance and
attractiveness perception. Animal Behaviour 82, 1 (2011), 55–59. https://doi.org/10.1016/j.anbehav.2011.03.024
[23]
Nicholas David Bowman, Mary Beth Oliver, Ryan Rogers, Brett Sherrick, Julia Woolley, and Mun-Young Chung.
2016. In control or in their shoes? How character attachment differentially influences video game enjoyment and
appreciation. Journal of Gaming & Virtual Worlds 8, 1 (2016), 83–99. https://doi.org/10.1386/jgvw.8.1.83_1
[24]
Jeanne H Brockmyer, Christine M Fox, Kathleen A Curtiss, Evan McBroom, Kimberly M Burkhart, and Jacquelyn N
Pidruzny. 2009. The development of the Game Engagement Questionnaire: A measure of engagement in video
game-playing. Journal of Experimental Social Psychology 45, 4 (2009), 624–634.
[25]
Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. 2011. Amazon’s Mechanical Turk: A new source of
inexpensive, yet high-quality, data? Perspectives on psychological science 6, 1 (2011), 3–5.
[26]
Stéphanie Buisine, Jérôme Guegan, Jessy Barré, Frédéric Segonds, and Améziane Aoussat. 2016. Using avatars to
tailor ideation process to innovation strategy. Cognition, Technology and Work (2016). https://doi.org/10.1007/s10111-
016-0378-y
[27]
Donn Byrne and Don Nelson. 1965. Attraction as a linear function of proportion of positive reinforcements. Journal
of Personality and Social Psychology 1, 6 (1965), 659–663. https://doi.org/10.1037/h0022073
[28]
Jaehwan Byun and Christian S. Loh. 2015. Audial engagement: Effects of game sound on learner engagement
in digital game-based learning environments. Computers in Human Behavior 46, May (2015), 129–138. https:
//doi.org/10.1016/j.chb.2014.12.052
[29] Capcom. 2018. Monster Hunter: World. Game [Multiple Platforms].
[30]
Marcus Carter, Fraser Allison, John Downs, and Martin Gibbs. 2015. Player identity dissonance and voice interaction
in games. CHI PLAY 2015 - Proceedings of the 2015 Annual Symposium on Computer-Human Interaction in Play (2015),
265–270. https://doi.org/10.1145/2793107.2793144
[31]
Gianna Cassidy and Raymond MacDonald. 2009. The effects of music choice on task performance: A study of the
impact of self-selected and experimenter-selected music on driving game performance and experience. Musicae
Scientiae (2009). https://doi.org/10.1177/102986490901300207
[32]
G G Cassidy and Raymond A R MacDonald. 2010. The effects of music on time perception and performance of a
driving game. Scandinavian journal of psychology 51, 6 (2010), 455–464.
[33]
Jesse Chandler and Danielle Shapiro. 2016. Conducting clinical research using crowdsourced convenience samples.
Annual Review of Clinical Psychology 12 (2016).
[34]
Ff Charpentier and M Stella. 1986. Diphone synthesis using an overlap-add technique for speech waveforms
concatenation. In ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 11. IEEE,
2015–2018.
[35]
M-T Cheng, H-C She, and Leonard A Annetta. 2015. Game immersion experience: its hierarchical structure and
impact on game-based science learning. Journal of Computer Assisted Learning 31, 3 (2015), 232–253.
[36]
Klimmt Christoph, Hefner Dorothée, and Vorderer Peter. 2009. The Video Game Experience as "True" Identification:
A Theory of Enjoyable Alterations of Players’ Self-Perception. Communication theory 19, 4 (2009), 351–373.
[37]
Jonathan Cohen. 2001. Defining identification: A theoretical look at the identification of audiences with media
characters. Mass communication & society 4, 3 (2001), 245–264.
[38]
Jonathan Cohen. 2006. Audience identication with media characters. Psychology of entertainment 13 (2006), 183–197.
[39] Sarah A Collins. 2000. Men’s voices and women’s choices. Animal behaviour 60, 6 (2000), 773–780.
[40]
Sarah A Collins and Caroline Missing. 2003. Vocal and visual attractiveness are related in women. Animal behaviour
65, 5 (2003), 997–1004.
[41]
Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David
Baker, Zoran Popović, and Foldit Players. 2010. Predicting protein structures with a multiplayer online game. Nature
(2010). https://doi.org/10.1038/nature09304
[42]
Nicole Crenshaw and Bonnie Nardi. 2014. What’s in a Name? Naming Practices in Online Video Games. CHI PLAY
(2014), 67–76.
[43]
James J. Cummings and Jeremy N. Bailenson. 2016. How Immersive Is Enough? A Meta-Analysis of the Effect of
Immersive Technology on User Presence. Media Psychology 19, 2 (2016), 272–309. https://doi.org/10.1080/15213269.
2015.1015740
[44]
Frederik De Grove, Verolien Cauberghe, and Jan Van Looy. 2016. Development and validation of an instrument for
measuring individual motives for playing digital games. Media Psychology 19, 1 (2016), 101–125.
[45]
Alwin De Rooij, Sarah Van Der Land, and Shelly Van Erp. 2017. The creative proteus effect: How self-similarity,
embodiment, and priming of creative stereotypes with avatars influences creative ideation. In C and C 2017 - Proceedings
of the 2017 ACM SIGCHI Conference on Creativity and Cognition. https://doi.org/10.1145/3059454.3078856
[46]
Edward Deci and Richard M. Ryan. 1985. Intrinsic Motivation and Self-Determination in Human Behavior. Plenum
Press.
[47]
Edward L. Deci and Richard M. Ryan. 2000. The "what" and "why" of goal pursuits: Human needs and the self-
determination of behavior. Psychological Inquiry (2000). https://doi.org/10.1207/S15327965PLI1104_01
[48]
Michel Désert and Jacques-Philippe Leyens. 2006. Social comparisons across cultures I: Gender. Social comparison
and social psychology: Understanding cognition, intergroup relations, and culture (2006), 303.
[49]
Mats Deutschmann, Anders Steinvall, Anna Lagerström, Mats Deutschmann, Anders Steinvall, and Anna Lagerström.
2011. Gender-Bending in Virtual Space - Using Voice-morphing in Second Life to Raise Sociolinguistic Gender
Awareness. V-lang International Conference, Warsaw November (2011), 54–61.
[50]
Edward Downs, Nicholas D. Bowman, and Jaime Banks. 2019. A polythetic model of player-avatar identification:
Synthesizing multiple mechanisms. Psychology of Popular Media Culture (2019). https://doi.org/10.1037/ppm0000170
[51]
Nicolas Ducheneaut, MH Wen, Nicholas Yee, and Greg Wadley. 2009. Body and mind: a study of avatar personalization
in three virtual worlds. CHI 2009 (2009). http://dl.acm.org/citation.cfm?id=1518877
[52]
Alice H. Eagly, Christa Nater, David I. Miller, Michèle Kaufmann, and Sabine Sczesny. 2020. Gender stereotypes have
changed: A cross-temporal meta-analysis of U.S. public opinion polls from 1946 to 2018. American Psychologist (2020).
https://doi.org/10.1037/amp0000494
[53]
Inger Ekman. 2005. Meaningful noise: Understanding sound effects in computer games. Proc. Digital Arts and Cultures
17 (2005).
[54]
Inger Ekman. 2008. Psychologically Motivated Techniques for Emotional Sound in Computer Games. Proc. AudioMostly
2008 January 2008 (2008), 20–26. https://meaningfulnoise.wordpress.com/psychologically-motivated-techniques-for-
emotional-sound-in-computer-games/
[55]
Inger Ekman. 2013. On the desire to not kill your players: Rethinking sound in pervasive and mixed reality games.
FDG (2013), 142–149.
[56] Electronic Arts. 2014. The Sims 4. Game [Multiple Platforms].
[57]
Andrew J. Elliot, Vincent Payen, Jeanick Brisswalter, Francois Cury, and Julian F. Thayer. 2011. A subtle threat cue, heart
rate variability, and cognitive performance. Psychophysiology (2011). https://doi.org/10.1111/j.1469-8986.2011.01216.x
[58]
Federal Trade Commission. 2020. You Don’t Say: An FTC Workshop on Voice Cloning Technologies. https:
//www.ftc.gov/news-events/events-calendar/you-dont-say-ftc-workshop-voice-cloning-technologies
[59] Ernst Fehr and Urs Fischbacher. 2003. The nature of human altruism. Nature 425, 6960 (2003), 785–791.
[60]
David R Feinberg, Lisa M DeBruine, Benedict C Jones, and Anthony C Little. 2008. Correlated preferences for men’s
facial and vocal masculinity. Evolution and Human Behavior 29, 4 (2008), 233–241.
[61]
David R. Feinberg, Lisa M. Debruine, Benedict C. Jones, and David I. Perrett. 2008. The role of femininity and
averageness of voice pitch in aesthetic judgments of women’s voices. Perception (2008). https://doi.org/10.1068/p5514
[62]
D. R. Feinberg, B. C. Jones, M. J. Law Smith, F. R. Moore, L. M. DeBruine, R. E. Cornwell, S. G. Hillier, and D. I. Perrett.
2006. Menstrual cycle, trait estrogen level, and masculinity preferences in the human voice. Hormones and Behavior
(2006). https://doi.org/10.1016/j.yhbeh.2005.07.004
[63]
David R Feinberg, Benedict C Jones, Anthony C Little, D Michael Burt, and David I Perrett. 2005. Manipulations of
fundamental and formant frequencies influence the attractiveness of human male voices. Animal behaviour 69, 3
(2005), 561–568.
[64]
Susan T. Fiske. 2017. Prejudices in Cultural Contexts: Shared Stereotypes (Gender, Age) Versus Variable Stereotypes
(Race, Ethnicity, Religion). Perspectives on Psychological Science (2017). https://doi.org/10.1177/1745691617708204
[65]
Joseph Fordham, Rabindra Ratan, Kuo-Ting Huang, and Kyle Silva. 2020. Stereotype Threat in a Video Game Context
and Its Influence on Perceptions of Science, Technology, Engineering, and Mathematics (STEM): Avatar-Induced
Active Self-Concept as a Possible Mitigator. American Behavioral Scientist (2020), 0002764220919148.
[66]
Jesse Fox, Jeremy Bailenson, and Joseph Binney. 2009. Virtual experiences, physical behaviors: The effect of presence
on imitation of an eating avatar. Presence: Teleoperators and Virtual Environments 18, 4 (2009), 294–303.
[67]
Jesse Fox and Jeremy N. Bailenson. 2009. Virtual Self-Modeling: The Effects of Vicarious Reinforcement and
Identification on Exercise Behaviors. Media Psychology 12 (2009), 1–25. https://doi.org/10.1080/15213260802669474
[68]
Guo Freeman, Samaneh Zamanifard, Divine Maloney, and Alexandra Adkins. 2020. My body, my avatar: How people
perceive their avatars in social virtual reality. Conference on Human Factors in Computing Systems - Proceedings (2020),
1–8. https://doi.org/10.1145/3334480.3382923
[69]
Asif A Ghazanfar and Drew Rendall. 2008. Evolution of human vocal production. Current Biology 18, 11 (2008),
R457—-R460.
[70]
Asif A. Ghazanfar, Hjalmar K. Turesson, Joost X. Maier, Ralph van Dinther, Roy D. Patterson, and Nikos K. Logothetis.
2007. Vocal-Tract Resonances as Indexical Cues in Rhesus Monkeys. Current Biology (2007). https://doi.org/10.1016/
j.cub.2007.01.029
[71]
Timo Gnambs, Markus Appel, and Bernad Batinic. 2010. Color red in web-based knowledge testing. Computers in
Human Behavior 26, 6 (2010), 1625–1631. https://doi.org/10.1016/j.chb.2010.06.010
[72] Google. 2020. Google Assistant. https://assistant.google.com/
[73]
Mark Grimshaw. 2007. Sound and immersion in the first-person shooter. In Proceedings of CGAMES 2007 - 11th
International Conference on Computer Games: AI, Animation, Mobile, Educational and Serious Games.
[74]
Mark Grimshaw. 2007. Sound and immersion in the first-person shooter. Proceedings of CGAMES 2007 - 11th
International Conference on Computer Games: AI, Animation, Mobile, Educational and Serious Games January 2007
(2007), 119–124.
[75]
Rosanna E Guadagno, Jim Blascovich, Jeremy N Bailenson, and Cade McCall. 2007. Virtual humans and persuasion:
The effects of agency and behavioral realism. Media Psychology 10, 1 (2007), 1–22. https://doi.org/10.1080/15213260701300865
[76]
Jérôme Guegan, Stéphanie Buisine, Fabrice Mantelet, Nicolas Maranzana, and Frédéric Segonds. 2016. Avatar-mediated
creativity: When embodying inventors makes engineers more creative. Computers in Human Behavior 61 (2016),
165–175. https://doi.org/10.1016/j.chb.2016.03.024
[77]
Elizabeth L. Haines, Kay Deaux, and Nicole Lofaro. 2016. The Times They Are a-Changing ... or Are They Not?
A Comparison of Gender Stereotypes, 1983-2014. Psychology of Women Quarterly (2016). https://doi.org/10.1177/
0361684316634081
[78]
Andrew F Hayes. 2017. Introduction to mediation, moderation, and conditional process analysis: A regression-based
approach. Guilford publications.
[79]
Sylvie Hébert, Renée Béland, Odrée Dionne-Fournelle, Martine Crête, and Sonia J. Lupien. 2005. Physiological
stress response to video-game playing: The contribution of built-in music. Life Sciences 76, 20 (2005), 2371–2380.
https://doi.org/10.1016/j.lfs.2004.11.011
[80]
Cynthia Honer and Martha Buchanan. 2005. Young adults’ wishful identication with television characters: The
role of perceived similarity and character attributes. https://doi.org/10.1207/S1532785XMEP0704_2
[81]
T. C. Holyoke, Eugene S. Morton, and Jake Page. 1992. Animal Talk: Science and the Voices of Nature. The Antioch
Review (1992). https://doi.org/10.2307/4612642
[82]
John J Horton, David G Rand, and Richard J Zeckhauser. 2011. The online laboratory: Conducting experiments in a
real labor market. Experimental Economics 14, 3 (2011), 399–425.
[83]
Bart Hulshof. 2013. The influence of colour and scent on people's mood and cognitive performance in meeting rooms.
Master Thesis May (2013), 1–97.
[84]
W IJsselsteijn, Y De Kort, K Poels, A Jurgelionis, and Francesco Bellotti. 2007. Characterising and Measuring User
Experiences in Digital Games. International Conference on Advances in Computer Entertainment Technology 620 (2007),
1–4. https://doi.org/10.1007/978-1-60761-580-4
[85]
Katherine Isbister and Clifford Nass. 2000. Consistency of personality in interactive characters: verbal cues, non-
verbal cues, and user characteristics. International Journal of Human-Computer Studies 53 (2000), 251–267. https:
//doi.org/10.1006/ijhc.2000.0368
[86]
Corentin Jemine. 2019. Master’s thesis: Real-Time Voice Cloning. (2019). https://matheo.uliege.be/handle/2268.2/6801
[87] Corentin Jemine. 2020. Real-Time Voice Cloning. https://github.com/CorentinJ/Real-Time-Voice-Cloning
[88]
Charlene Jennett, Anna L. Cox, Paul Cairns, Samira Dhoparee, Andrew Epps, Tim Tijs, and Alison Walton. 2008.
Measuring and dening the experience of immersion in games. International Journal of Human-Computer Studies 66,
9 (sep 2008), 641–661. https://doi.org/10.1016/j.ijhcs.2008.04.004
[89]
Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming
Pang, Ignacio Lopez Moreno, and Yonghui Wu. 2018. Transfer learning from speaker verification to multispeaker
text-to-speech synthesis. In Advances in Neural Information Processing Systems. arXiv:1806.04558
[90]
Colby Johanson and Regan L. Mandryk. 2016. Scaffolding Player Location Awareness through Audio Cues in First-
Person Shooters. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems - CHI ’16 (2016),
3450–3461. https://doi.org/10.1145/2858036.2858172
[91]
Colby Johanson and Regan L. Mandryk. 2016. Scaffolding Player Location Awareness through Audio Cues in
First-Person Shooters. (2016), 3450–3461. https://doi.org/10.1145/2858036.2858172
[92]
Benedict C Jones, David R Feinberg, Lisa M DeBruine, Anthony C Little, and Jovana Vukovic. 2008. Integrating cues
of social interest and voice pitch in men’s preferences for women’s voices. Biology Letters 4, 2 (2008), 192–194.
[93]
Benedict C. Jones, David R. Feinberg, Lisa M. DeBruine, Anthony C. Little, and Jovana Vukovic. 2010. A domain-
specic opposite-sex bias in human preferences for manipulated voice pitch. Animal Behaviour 79, 1 (2010), 57–62.
https://doi.org/10.1016/j.anbehav.2009.10.003
[94] Kristine Jørgensen. 2008. Left in the dark: playing computer games with the sound turned off. Ashgate.
[95]
Kristine Jørgensen. 2008. Left in the dark: playing computer games with the sound turned off. From Pac-Man to Pop
Music: Interactive Audio in Games and New Media (2008), 163–176. http://hdl.handle.net/1956/7855
[96]
Kristine Jørgensen. 2010. Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited.
In Game Sound Technology and Player Interaction: Concepts and Developments. https://doi.org/10.4018/978-1-61692-
828-5.ch005
[97]
Dominic Kao. 2019. JavaStrike: A Java Programming Engine Embedded in Virtual Worlds. In Proceedings of The
Fourteenth International Conference on the Foundations of Digital Games.
[98]
Dominic Kao. 2019. The Eects of Anthropomorphic Avatars vs. Non-Anthropomorphic Avatars in a Jumping Game.
In The Fourteenth International Conference on the Foundations of Digital Games.
[99]
Dominic Kao and D. Fox Harrell. 2016. Exploring the Impact of Avatar Color on Game Experience in Educational
Games. Proceedings of the 34th Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems
(CHI 2016) (2016).
[100]
Dominic Kao and D. Fox Harrell. 2018. The Effects of Badges and Avatar Identification on Play and Making in
Educational Games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems - CHI’18.
[101]
Changsoo Kim, Sang Gun Lee, and Minchoel Kang. 2012. I became an attractive person in the virtual world:
Users’ identication with virtual communities and avatars. Computers in Human Behavior 28, 5 (2012), 1663–1669.
https://doi.org/10.1016/j.chb.2012.04.004
[102]
Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. 2018. Crepe: A Convolutional Representation for
Pitch Estimation. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings.
https://doi.org/10.1109/ICASSP.2018.8461329 arXiv:1802.06182
[103]
Yanghee Kim and Amy L. Baylor. 2006. Pedagogical agents as learning companions: The role of agent competency
and type of interaction. Educational Technology Research and Development 54, 3 (2006), 223–243.
[104]
Youjeong Kim and S Shyam Sundar. 2012. Visualizing ideal self vs. actual self through avatars: Impact on preventive
health outcomes. Computers in Human Behavior 28, 4 (2012), 1356–1364.
[105]
Elly A Konijn, Marije Nije Bijvank, and Brad J Bushman. 2007. I wish I were a warrior: the role of wishful identification
in the effects of violent video games on aggression in adolescent boys. Developmental psychology 43, 4 (2007), 1038.
[106]
Jordan Koulouris, Zoe Jeffery, James Best, Eamonn O'Neill, and Christof Lutteroth. 2020. Me vs. Super(wo)man:
Effects of Customization and Identification in a VR Exergame. (2020), 1–17. https://doi.org/10.1145/3313831.3376661
[107]
Jody Kreiman, Diana Vanlancker-Sidtis, and Bruce R Gerratt. 2003. Defining and Measuring Voice Quality. VOQUAL'03,
Geneva, August 27-29, 2003 (2003).
[108]
Christof Kuhbandner and Reinhard Pekrun. 2013. Joint effects of emotion and color on memory. Emotion (Washington,
D.C.) 13, 3 (2013), 375–9. https://doi.org/10.1037/a0031821
[109]
Pontus Larsson, Aleksander Väljamäe, Daniel Västfjäll, Ana Tajadura-Jiménez, and Mendel Kleiner. 2010. Auditory-
Induced Presence in Mixed Reality Environments and Related Technology. (2010), 143–163. https://doi.org/10.1007/
978-1-84882-733-2_8
[110]
Marianne Latinus and Margot J Taylor. 2012. Discriminating male and female voices: differentiating pitch and gender.
Brain topography 25, 2 (2012), 194–204.
[111]
Eun Ju Lee, Clifford Nass, and Scott Brave. 2000. Can computer-generated speech have gender? An experimental
test of gender stereotype. Conference on Human Factors in Computing Systems - Proceedings (2000), 289–290. https:
//doi.org/10.1145/633292.633461
[112]
Jong Eun Roselyn Lee and Clifford Nass. 2012. Distinctiveness-based stereotype threat and the moderating role of
coaction contexts. Journal of Experimental Social Psychology (2012). https://doi.org/10.1016/j.jesp.2011.06.018
[113]
Jong-Eun Roselyn Lee, Clifford I Nass, and Jeremy N Bailenson. 2014. Does the mask govern the mind?: Effects of arbi-
trary gender representation on quantitative task performance in avatar-represented virtual groups. Cyberpsychology,
Behavior, and Social Networking 17, 4 (2014), 248–254.
[114]
Sanguk Lee, Rabindra Ratan, and Taiwoo Park. 2019. The voice makes the car: Enhancing autonomous vehicle
perceptions and adoption intention through voice agent gender and style. Multimodal Technologies and Interaction
(2019). https://doi.org/10.3390/mti3010020
[115]
Benjamin J. Li and May O. Lwin. 2016. Player see, player do: Testing an exergame motivation model based on the
inuence of the self avatar. Computers in Human Behavior 59 (2016), 350–357. https://doi.org/10.1016/j.chb.2016.02.034
[116]
Benjamin J. Li, May O. Lwin, and Younbo Jung. 2014. Wii, Myself, and Size: The Influence of Proteus Effect and
Stereotype Threat on Overweight Children’s Exercise Motivation and Behavior in Exergames. Games for Health
Journal (2014). https://doi.org/10.1089/g4h.2013.0081
[117]
Mats Liljedahl. 2011. Sound for Fantasy and Freedom. Game Sound Technology and Player Interaction (2011), 264–285.
https://doi.org/10.4018/978-1-61692-828-5.ch017
[118]
Limelight Networks. 2021. State of Online Gaming 2021. (2021). https://www.limelight.com/lp/state-of-online-
gaming-2021/
[119]
Conor Linehan, George Bellord, Ben Kirman, Zachary H. Morford, and Bryan Roche. 2014. Learning curves: Analysing
pace and challenge in four successful puzzle games. CHI PLAY 2014 - Proceedings of the 2014 Annual Symposium on
Computer-Human Interaction in Play (2014), 181–190. https://doi.org/10.1145/2658537.2658695
[120] LingoJam. 2020. Robot Voice Generator. https://lingojam.com/RobotVoiceGenerator
[121]
Van Looy and De Grove. 2013. Avatar identification in serious games - The role of avatar identification in the learning
experience of a serious game. Proceeding of: The Power of Play : Motivational Uses and Applications. Pre-Conference to
the 63rd International Communication Association (ICA) Annual Conference, Abstracts (2013).
[122]
Thomas Lumley, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The importance of the normality assumption in
large public health data sets. https://doi.org/10.1146/annurev.publhealth.23.100901.140546
[123]
Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Re-
search Methods 44, 1 (2012), 1–23. https://doi.org/10.3758/s13428-011-0124-6
[124]
Oscar Mayor, Jordi Bonada, and Jordi Janer. 2009. Kaleivoicecope: Voice transformation from interactive installations
to video-games. In Proceedings of the AES International Conference.
[125]
Edward McAuley, Terry Duncan, and Vance V Tammen. 1989. Psychometric properties of the Intrinsic Motivation
Inventory in a competitive sport setting: A confirmatory factor analysis. Research Quarterly for Exercise and Sport 60,
1 (1989), 48–58.
[126]
Daniel G McDonald and Hyeok Kim. 2001. When I die, I feel small: Electronic game characters and the social self.
Journal of Broadcasting & Electronic Media 45, 2 (2001), 241–258.
[127]
Ravi Mehta and Rui(Juliet) Zhu. 2008. Blue or Red? Exploring the Effect of Color on Cognitive Task Performances.
Science 323, February (2008), 1226–1229. https://doi.org/10.1126/science.1169144
[128]
M.A. Meier, Russell A. Hill, Andrew J. Elliot, and R.A. Barton. 2015. Color in Achievement Contexts in Humans.
Handbook of Color Psychology 44, February (2015), 0–103. https://doi.org/10.1063/1.2756072
[129] Microsoft. 2020. Cortana. https://www.microsoft.com/en-us/cortana
[130] Microsoft Game Studios and Electronic Arts. 2007. Mass Effect. Game [Multiple Platforms].
[131]
Jason P. Mitchell, C. Neil Macrae, and Mahzarin R. Banaji. 2006. Dissociable Medial Prefrontal Contributions to
Judgments of Similar and Dissimilar Others. Neuron 50, 4 (2006), 655–663. https://doi.org/10.1016/j.neuron.2006.03.040
[132]
Dean Mobbs, Rongjun Yu, Marcel Meyer, Luca Passamonti, Ben Seymour, Andrew J Calder, Susanne Schweizer,
Chris D Frith, and Tim Dalgleish. 2009. A key role for similarity in vicarious reward. Science 324, 5929 (2009), 900.
https://doi.org/10.1126/science.1170539
[133] Masahiro Mori. 1970. The uncanny valley. Energy 7, 4 (1970), 33–35. https://doi.org/10.1109/MRA.2012.2192811
[134]
Eugene S. Morton. 1977. On the Occurrence and Significance of Motivation-Structural Rules in Some Bird and
Mammal Sounds. The American Naturalist (1977). https://doi.org/10.1086/283219
[135]
Lennart E. Nacke and Mark Grimshaw. 2011. Player-Game Interaction Through Affective Sound. Game Sound
Technology and Player Interaction (2011), 264–285. https://doi.org/10.4018/978-1-61692-828-5.ch013
[136]
Myura Nagendran, Kurinchi Selvan Gurusamy, Rajesh Aggarwal, Marilena Loizidou, and Brian R. Davidson. 2013.
Virtual reality training for surgical trainees in laparoscopic surgery. https://doi.org/10.1002/14651858.CD006575.pub3
[137]
Cliord Nass and Kwan Min Lee. 2001. Does computer-synthesized speech manifest personality? Experimental tests
of recognition, similarity-attraction, and consistency-attraction. Journal of Experimental Psychology: Applied 7, 3
(2001), 171–181. https://doi.org/10.1037/1076-898X.7.3.171
[138]
Cliord Nass, Youngme Moon, and Nancy Green. 1997. Are machines gender neutral? Gender-stereotypic responses
to computers with voices. Journal of Applied Social Psychology 27, 10 (1997), 864–876. https://doi.org/10.1111/j.1559-
1816.1997.tb00275.x
[139]
Raymond Ng and Robb Lindgren. 2013. Examining the effects of avatar customization and narrative on engagement
and learning in video games. Proceedings of CGAMES 2013 USA - 18th International Conference on Computer Games:
AI, Animation, Mobile, Interactive Multimedia, Educational and Serious Games (2013), 87–90. https://doi.org/10.1109/
CGames.2013.6632611
[140]
Rolf Nordahl. 2005. Self-induced Footsteps Sounds in Virtual Reality: Latency, Recognition, Quality and Presence.
(2005).
[141]
Rolf Nordahl. 2006. Increasing the Motion of Users in Photo-realistic Virtual Environments by Utilising Auditory
Rendering of the Environment and Ego-motion. Presence 2006 (2006), 57–62.
[142] Rolf Nordahl and Niels C Nilsson. 2014. The sound of being there. In The Oxford handbook of interactive audio.
[143]
Keith Oatley. 1995. A taxonomy of the emotions of literary response and a theory of identification in fictional
narrative. Poetics (1995). https://doi.org/10.1016/0304-422X(94)P4296-S
[144]
Takashi Oguchi and Hiroto Kikuchi. 1997. Voice and interpersonal attraction. Japanese Psychological Research 39, 1
(1997), 56–61.
[145]
Yumiko Ohara. 1999. Performing gender through voice pitch: A cross-cultural analysis of Japanese and American
English. In Wahrnehmung und Herstellung von Geschlecht. Springer, 105–116.
[146]
Justin H Park and Mark Schaller. 2005. Does attitude similarity serve as a heuristic cue for kinship? Evidence of an
implicit cognitive association. Evolution and Human Behavior 26, 2 (2005), 158–170.
[147]
Jim R Parker and John Heerema. 2008. Audio interaction in computer mediated games. International Journal of
Computer Games Technology 2008 (2008).
[148] Pearl Abyss. 2015. Black Desert Online. Game [Multiple Platforms].
[149]
Tabitha C. Peck, Sofia Seinfeld, Salvatore M. Aglioti, and Mel Slater. 2013. Putting yourself in the skin of a black
avatar reduces implicit racial bias. Consciousness and Cognition (2013). https://doi.org/10.1016/j.concog.2013.04.016
[150]
Jorge Peña, Subuhi Khan, and Cassandra Alexopoulos. 2016. I Am What I See: How Avatar and Opponent Agent
Body Size Aects Physical Activity Among Men Playing Exergames. Journal of Computer-Mediated Communication
(2016). https://doi.org/10.1111/jcc4.12151
[151]
Jorge Peña and Eunice Kim. 2014. Increasing exergame physical activity through self and opponent avatar appearance.
Computers in Human Behavior (2014). https://doi.org/10.1016/j.chb.2014.09.038
[152]
Cyril R Pernet and Pascal Belin. 2012. The role of pitch and timbre in voice gender categorization. Frontiers in
psychology 3 (2012), 23.
[153]
Jean A. Pratt, Karina Hauser, Zsolt Ugray, and Olga Patterson. 2007. Looking at human-computer interface design:
Eects of ethnicity in computer agents. Interacting with Computers 19, 4 (2007), 512–523.
[154]
David Andrew Puts. 2005. Mating context and menstrual phase affect women's preferences for male voice pitch.
Evolution and Human Behavior 26, 5 (2005), 388–397.
[155]
David Andrew Puts, Steven J.C. Gaulin, and Katherine Verdolini. 2006. Dominance and the evolution of sexual
dimorphism in human voice pitch. Evolution and Human Behavior (2006). https://doi.org/10.1016/j.evolhumbehav.
2005.11.003
[156]
Lingyun Qiu and Izak Benbasat. 2005. An investigation into the effects of text-to-speech voice and 3D avatars on
the perception of presence and flow of Live Help in electronic commerce. ACM Transactions on Computer-Human
Interaction 12, 4 (2005), 329–355. https://doi.org/10.1145/1121112.1121113
[157]
Lingyun Qiu and Izak Benbasat. 2005. Online consumer trust and live help interfaces: The effects of text-to-speech
voice and three-dimensional avatars. International Journal of Human-Computer Interaction 19, 1 (2005), 75–94.
https://doi.org/10.1207/s15327590ijhc1901_6
[158]
Rabindra Ratan, David Beyea, Benjamin J. Li, and Luis Graciano. 2019. Avatar characteristics induce users’ behavioral
conformity with small-to-medium effect sizes: a meta-analysis of the proteus effect. Media Psychology 0, 0 (2019),
1–25. https://doi.org/10.1080/15213269.2019.1623698
[159]
Rabindra Ratan and Young June Sah. 2015. Leveling up on stereotype threat: The role of avatar customization and
avatar embodiment. Computers in Human Behavior 50 (2015), 367–374. https://doi.org/10.1016/j.chb.2015.04.010
[160]
David Reby, Karen McComb, Bruno Cargnelutti, Chris Darwin, W. Tecumseh Fitch, and Tim Clutton-Brock. 2005.
Red deer stags use formants as assessment cues during intrasexual agonistic interactions. Proceedings of the Royal
Society B: Biological Sciences (2005). https://doi.org/10.1098/rspb.2004.2954
[161]
James Robb, Tom Garner, Karen Collins, and Lennart E. Nacke. 2017. The Impact of Health-Related User Interface
Sounds on Player Experience. Simulation and Gaming (2017). https://doi.org/10.1177/1046878116688236
[162] Rockstar Games. 2018. Red Dead Redemption 2. Game [Multiple Platforms].
[163]
Katja Rogers, Matthias Jörg, and Michael Weber. 2019. Effects of background music on risk-taking and general player
experience. CHI PLAY 2019 - Proceedings of the Annual Symposium on Computer-Human Interaction in Play (2019),
213–224. https://doi.org/10.1145/3311350.3347158
[164]
Katja Rogers, Giovanni Ribeiro, Rina R. Wehbe, Michael Weber, and Lennart E. Nacke. 2018. Vanishing Importance.
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18 (2018), 1–13. https:
//doi.org/10.1145/3173574.3173902
[165]
Rinat B. Rosenberg-Kima, E. Ashby Plant, Celestee E. Doerr, and Amy Baylor. 2010. The influence of computer-based
model’s race and gender on female students’ attitudes and beliefs towards engineering. Journal of Engineering
Education (2010), 35–44. https://doi.org/10.1002/j.2168-9830.2010.tb01040.x
[166]
Joel Ross, Lilly Irani, M Six Silberman, Andrew Zaldivar, and Bill Tomlinson. 2010. Who are the crowdworkers?
Shifting demographics in Mechanical Turk. In CHI’10 extended abstracts on Human factors in computing systems.
2863–2872.
[167]
R M Ryan and E L Deci. 2000. Self-determination theory and the facilitation of intrinsic motivation, social devel-
opment, and well-being. The American psychologist 55, 1 (2000), 68–78. https://doi.org/10.1037/0003-066X.55.1.68
[168]
Richard M. Ryan, C. Scott Rigby, and Andrew Przybylski. 2006. The Motivational Pull of Video Games: A Self-
Determination Theory Approach. Motivation and Emotion 30, 4 (2006), 344–360. https://doi.org/10.1007/s11031-006-
9051-8
[169]
Young June Sah, Rabindra Ratan, Hsin-Yi Sandy Tsai, Wei Peng, and Issidoros Sarinopoulos. 2017. Are you what your
avatar eats? Health-behavior effects of avatar-manifested self-concept. Media Psychology 20, 4 (2017), 632–657.
[170]
Timothy Sanders and Paul Cairns. 2010. Time perception, immersion and music in videogames. In Proceedings of the
24th BCS interaction specialist group conference.
[171]
Edward F. Schneider, Annie Lang, Mija Shin, and Samuel D. Bradley. 2004. Death with a story: How story impacts
emotional, motivational, and physiological responses to first-person shooter video games. Human Communication
Research (2004). https://doi.org/10.1093/hcr/30.3.361
[172]
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang,
Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS
synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv (2017), 4779–4783.
[173]
Janet Siegmund, Christian Kästner, Jörg Liebig, Sven Apel, and Stefan Hanenberg. 2014. Measuring and modeling
programming experience. Empirical Software Engineering 19, 5 (2014), 1299–1334. https://doi.org/10.1007/s10664-
013-9286-4
[174]
David Smahel, Lukas Blinka, and Ondrej Ledabyl. 2008. Playing MMORPGs: connections between addiction and
identifying with a character. Cyberpsychology & Behavior 11, 6 (2008), 715–718. https://doi.org/10.1089/cpb.2007.0210
[175]
Alistair Raymond Bryce Soutter and Michael Hitchens. 2016. The relationship between character identification
and flow state within video games. Computers in Human Behavior 55, December 2015 (2016), 1030–1038. https:
//doi.org/10.1016/j.chb.2015.11.012
[176] Square Enix. 2013. Final Fantasy XIV. Game [Multiple Platforms].
[177]
Sharon T. Steinemann, Elisa D. Mekler, and Klaus Opwis. 2015. Increasing Donating Behavior Through a Game for
Change. https://doi.org/10.1145/2793107.2793125
[178]
Ching-I Teng. 2017. Impact of avatar identification on online gamer loyalty: Perspectives of social identity and social
capital theories. International Journal of Information Management 37, 6 (2017), 601–610. https://doi.org/10.1016/j.
ijinfomgt.2017.06.006
[179] Larry Terango. 1966. Pitch and Duration Characteristics of the Oral Reading of Males on a Masculinity-Femininity Dimension. Journal of Speech and Hearing Research (1966). https://doi.org/10.1044/jshr.0904.590
[180] Sabine Trepte and Leonard Reinecke. 2010. Avatar creation and video game enjoyment. Journal of Media Psychology (2010).
[181] Sabine Trepte, Leonard Reinecke, and Katharina-Maria Behr. 2010. Avatar Creation and Video Game Enjoyment: Effects of Life-Satisfaction, Game Competitiveness and Identification with the Avatar. In 60th Annual Conference of the International Communication Association (ICA). https://doi.org/10.1027/1864-1105/a000022
[182] Selen Turkay and Charles K. Kinzer. 2017. The Relationship between Avatar-Based Customization, Player Identification, and Motivation. 48–79 pages. https://doi.org/10.4018/978-1-5225-1817-4.ch003
[183] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. (2016), 1–15. arXiv:1609.03499 http://arxiv.org/abs/1609.03499
[184] Wim A Van Dommelen and Bente H Moxness. 1995. Acoustic parameters in speaker height and weight identification: sex-specific behaviour. Language and speech 38, 3 (1995), 267–287.
[185] Jan Van Looy, Cédric Courtois, Melanie De Vocht, and Lieven De Marez. 2012. Player Identification in Online Games: Validation of a Scale for Measuring Identification in MMOGs. Media Psychology 15, 2 (2012), 197–221. https://doi.org/10.1080/15213269.2012.674917
[186] Kellie Vella, Madison Klarkowski, Selen Turkay, and Daniel Johnson. 2020. Making friends in online games: gender differences and designing for greater social connectedness. Behaviour and Information Technology (2020). https://doi.org/10.1080/0144929X.2019.1625442
[187] Voicemod. 2020. Voicemod. https://www.voicemod.net/
[188] Volition and Deep Silver. 2013. Saints Row IV. Game [Multiple Platforms].
[189] Jovana Vukovic, David R Feinberg, Benedict C Jones, Lisa M DeBruine, Lisa L M Welling, Anthony C Little, and Finlay G Smith. 2008. Self-rated attractiveness predicts individual differences in women’s preferences for masculine men’s voices. Personality and Individual Differences 45, 6 (2008), 451–456.
[190] T Franklin Waddell, S Shyam Sundar, and Joshua Auriemma. 2015. Can customizing an avatar motivate exercise intentions and health behaviors among those with low health ideals? Cyberpsychology, Behavior, and Social Networking 18, 11 (2015), 687–690.
[191] Greg Wadley, Marcus Carter, and Martin Gibbs. 2015. Voice in virtual worlds: The design, use, and influence of voice chat in online play. Human–Computer Interaction 30, 3-4 (2015), 336–365.
[192] Greg Wadley, Martin Gibbs, and Peter Benda. 2007. Speaking in character: using voice-over-IP to communicate within MMORPGs. In Proceedings of the 4th Australasian conference on Interactive entertainment. 1–8.
[193] Greg Wadley, Martin R. Gibbs, and Nicolas Ducheneaut. 2009. You can be too rich: Mediated communication in a virtual world. In Proceedings of the 21st Annual Conference of the Australian Computer-Human Interaction Special Interest Group - Design: Open 24/7, OZCHI ’09. https://doi.org/10.1145/1738826.1738835
[194] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4879–4883.
[195] Melissa Watts. 2016. Avatar Self-Identification, Self-Esteem, and Perceived Social Capital in the Real World: A Study of World of Warcraft Players and their Avatars. University of South Florida.
[196] Helen Wauck, Gale Lucas, Ari Shapiro, Andrew Feng, Jill Boberg, and Jonathan Gratch. 2018. Analyzing the effect of avatar self-similarity on men and women in a search and rescue game. Conference on Human Factors in Computing Systems - Proceedings 2018-April (2018). https://doi.org/10.1145/3173574.3174059
[197] Alexander Wharton and Karen Collins. 2011. Subjective measures of the influence of music customization on the video game play experience: A pilot study. Game Studies (2011).
[198] Dmitri Williams, Scott Caplan, and Li Xiong. 2007. Can you hear me now? The impact of voice in an online gaming community. Human Communication Research (2007). https://doi.org/10.1111/j.1468-2958.2007.00306.x
[199] Hanna Elina Wirman and Rhys Jones. 2017. Voice and Sound: Player Contributions to Speech. Peter Lang, Digital Formations Series.
[200] Nick Yee and J Bailenson. 2007. The Proteus Effect: The Effect of Transformed Self-Representation on Behavior. Human communication research (2007), 1–38. http://onlinelibrary.wiley.com/doi/10.1111/j.1468-2958.2007.00299.x/full
[201] Nick Yee, Jeremy N Bailenson, and Nicolas Ducheneaut. 2009. The Proteus Effect: Implications of Transformed Digital Self-Representation on Online and Offline Behavior. Communication Research 36, 2 (2009), 285–312. https://doi.org/10.1177/0093650208330254
[202] Nick Yee, Jeremy N. Bailenson, Mark Urbanek, Francis Chang, and Dan Merget. 2007. The unbearable likeness of being digital: The persistence of nonverbal social norms in online virtual environments. Cyberpsychology and Behavior 10, 1 (2007), 115–121. https://doi.org/10.1089/cpb.2006.9984
[203] Nick Yee, Nicolas Ducheneaut, Mike Yao, and Les Nelson. 2011. Do Men Heal More When in Drag? CHI 2011 (2011), 1–4.
[204] Sukkyung You, Euikyung Kim, and Donguk Lee. 2017. Virtually Real: Exploring Avatar Identification in Game Addiction among Massively Multiplayer Online Role-Playing Games (MMORPG) Players. Games and Culture (2017). https://doi.org/10.1177/1555412015581087
[205] David Zendle, Paul Cairns, and Daniel Kudenko. 2015. Higher graphical fidelity decreases players’ access to aggressive concepts in violent video games. In CHI PLAY 2015 - Proceedings of the 2015 Annual Symposium on Computer-Human Interaction in Play. https://doi.org/10.1145/2793107.2793113
[206] Dolf Zillmann. 1995. Mechanisms of emotional involvement with drama. Poetics (1995). https://doi.org/10.1016/0304-422X(94)00020-7
[207] Miron Zuckerman and Kunitate Miyake. 1993. The attractive voice: What makes it so? Journal of nonverbal behavior 17, 2 (1993), 119–135.
Received February 2021; revised June 2021; accepted July 2021