The Eects of a Self-Similar Avatar Voice in Educational
Avatar identication is one of the most promising research areas in games user research. Greater identication
with one’s avatar has been associated with improved outcomes in the domains of health, entertainment, and
education. However, existing studies have focused almost exclusively on the visual appearance of avatars.
Yet audio is known to influence immersion/presence, performance, and physiological responses. We perform
one of the first studies to date on avatar self-similar audio. We conducted a 2 x 3 (similar/dissimilar x modulation
upwards/downwards/none) study in a Java programming game. We find that voice similarity leads
to a significant increase in performance, time spent, similarity identification, competence, relatedness, and
immersion. Similarity identification acts as a significant mediator variable between voice similarity and all
measured outcomes. Our study demonstrates the importance of avatar audio and has implications for avatar
design more generally across digital applications.
CCS Concepts: • Human-centered computing → Empirical studies in HCI.
Additional Key Words and Phrases: Games; Avatar; Audio; Voice; Identification; Player Experience
Virtual identities exist everywhere. From social network profiles, to video games, to virtual reality,
there is almost always a representation of the self. Because these virtual representations serve as
extensions of ourselves, we can identify with them, meaning we temporarily merge their identities
with our own self-perception [
]. This identication can be so strong that studies have shown
that we conform to the virtual representation’s expected behaviors [
]. This inuences
outcomes including negotiation aggressiveness [
], food choices [
], physical exercise
], racial bias [
], math performance [
], and creative thinking [
Greater identication with a virtual representation—which are often referred to as avatars—is
associated with increased motivation [
], performance [
], enjoyment [
ow [
], and trust in others [
]. Yet, despite the extensive literature attesting to avatars’
inuence, research has focused almost exclusively on visual aspects of the avatar rather than audial
aspects, potentially because the latter tend to be perceived as a non-critical element of avatar
238:2 Dominic Kao et al.
use, silent avatars are often perceived as more identifiable, and developing a variety in character
voices is resource intensive (e.g., hiring multiple voice actors, programming branching dialog
trees) [
]. However, technological solutions to these challenges (e.g., high-quality text-to-speech
engines, voice cloning software) can now greatly reduce the resources needed for developing avatar
voices, signaling a new opportunity for research on avatar voice effects. For example, consider an
avatar in an exercise application for running that speaks using voice characteristics with which the
user identifies; greater identification with the avatar’s voice could translate into increased exercise
performance. Or consider a digital self-help application for smoking cessation; greater identification
with the avatar’s voice could decrease user attrition, increasing the odds of successfully overcoming
addiction. Similarly, when using immersive technologies for learning, such as training how to
perform surgery in virtual reality [
], greater identification with the avatar’s voice could result
in increased presence and motivation, increasing training effectiveness.
There is good reason to believe that an avatar’s voice could influence outcomes. A meta-analysis
of 83 studies in virtual environments found that the presence of audio contributes a small- to
medium-sized effect on presence [
]. Furthermore, audio in games has been linked to greater
immersion [
], physiological responses [
], performance [
], and emotional realism
]. Prior studies give us reason to believe that avatar audio in particular could influence avatar
identification. Functional neuroimaging shows that perceived similarity is critical to simulating
another person’s internal state [
]. When a study participant watched a game show contestant
with high perceived similarity, the participant experienced a significant increase in vicarious reward
]. Researchers suggest that similar others trigger likeability, familiarity, and kin-motivated
responses [
]. This is often referred to as similarity-attraction [
] and is highly relevant
in the extensive literature on pedagogical agents and avatars [
]. Therefore, an avatar with a
voice that is more similar to the user’s own voice could increase engagement. Nevertheless, one
can imagine that a self-similar avatar voice might instead break immersion because the avatar is
speaking when the user is not. Additionally, Wauck et al. have shown that self-similarity in the
context of visual appearance did not make a difference to game performance and experience [
As such, self-similar avatar audio might also produce negligible differences.
Our project treats voice similarity as a holistic quality of sound, encompassing characteristics of
voice that people use to discriminate between speakers, such as tone, stress, intonation, rhythm,
and tempo. Together [
], they comprise a voice identity that individuals might associate with
other characteristics they identify with, such as masculinity and femininity [
]. Voice similarity
is not a solved technical problem, and the most reliable measure of similarity is subjective
ratings (e.g., [
], section 3.2). In this paper, our goal is to study how voice similarity (versus voice
dissimilarity) influences users in an educational programming game. We chose to study avatar
voice in an educational game as opposed to a game primarily for entertainment because we are also
interested in whether STEM gender stereotypes influence the effects of avatar voice. For example,
stereotyped avatars’ identity characteristics have been found to influence performance in STEM
learning contexts. Studies suggest that people perform better on STEM-related tasks when they are
represented by a male avatar compared to a female avatar [
]. We expected a similar
effect might occur due to avatar voice characteristics associated with masculinity and femininity.
We conducted an online study on Amazon’s Mechanical Turk (MTurk) in which half of the
participants were given an avatar voice that matched their own voice, while the other half of the
participants were given an avatar voice that was randomly chosen from a pool of prior participants.
Additionally, we varied voice modulation across all participants; this consisted of pitch shifting
the voice upwards, downwards, or not at all. Their avatar’s voice was then used inside of the Java
programming game as they played. Participants could spend as much time and complete as many
Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CHI PLAY, Article 238. Publication date: September 2021.
The Eects of a Self-Similar Avatar Voice in Educational Games 238:3
puzzles as they liked, reflecting motivation to engage in and learn from the game. Afterward, we
collected measures of need satisfaction, intrinsic motivation, and avatar identification.
Our results show that voice similarity increases performance, time spent, similarity identification,
competence, relatedness, and immersion. Similarity identification acts as a significant mediating
variable between voice similarity and all measured outcomes. However, there was no evidence that
voice modulation significantly influenced outcomes. Our study suggests that games can be made
more engaging through self-similar avatar voice audio. Moreover, our study provides motivation
for applying similar methods to virtual reality (e.g., effect on presence), voice assistants (e.g., Siri
]), digital learning (e.g., second-language learning through hearing a self-similar voice), avatar
customization (e.g., customizing avatar audio to be similar), and the Proteus effect (the phenomenon
that avatar users tend to conform behaviorally to the identity characteristics that they associate
with their avatars [200]) in the context of audio.
We describe research in three domains of interest: identification with avatars, audio in games, and
player experience in games.
2.1 Avatar Identification
2.1.1 Identification. Identification is a temporary change in a user’s self-concept by adoption of
a media persona’s perceived characteristics [
]. Identification is one of the core components of
why media experiences are enjoyable [
]. For example, in literary fiction, the reader is said
to adopt the protagonist’s emotions, experiences, and objectives such that they feel as if they are
the protagonist [
]. Or in television, the audience member is said to not only feel sympathetic
towards a character, but to feel with the character [
]. However, one key difference in video games
from other genres of media, such as television, is that players have direct control over the behavior
and actions of their characters. Through this active participation, video games can override the
distance between players and their avatars [
]. Avatar identication can positively inuence
enjoyment [
], health outcomes [
], and learning interest [
]. Moreover, it can
positively inuence intrinsic motivation [
], ow [
], motivation to exercise [
trust in others [
], self-esteem [
], loyalty to a game [
], and appreciation of the game [
Avatar identication has also been associated with aggression [
], addiction [
], and depression
[14, 126].
2.1.2 Avatar Identification. Avatar identification is typically operationalized as a multi-faceted
construct [
]. Similarity identification can be understood as the extent to which we feel similar
to the avatar. People expect to be able to build more rewarding interpersonal interactions [
more easily like, and identify with media characters perceived as being similar [
]. Therefore,
avatars that are similar facilitate feelings of closeness and stronger vicarious experiences [
This phenomenon (sometimes called similarity-attraction [
]) has been studied for decades in
education, wherein pedagogical agents that are similar to users (e.g., gender [
] and race [
are more inuential. Likewise, greater physical similarity with an on-screen avatar has been shown
to signicantly increase exercise eort [
]. Nevertheless, avatar dissimilarities can be valuable.
For example, users are known to sometimes create avatars that represent idealized versions of
themselves [
]. This is known to foster wishful identification, wherein the avatar represents an
improved version of the real-life self (e.g., leaner, more attractive, and fashionable [
]), represents
an ideal, and is someone the user would like to be [
]. Embodied identication represents the
concept of presence in a virtual environment through a “body container” [
]. This concept
refers to being the avatar, or feeling as if one is inside the avatar with the body of the avatar as
being one’s own [
]. Studies have found that perceived embodiment (induced through increased
control of an avatar) heightens the outcomes associated with the avatar’s identity [
], supporting
the notion that embodied identification should be considered in avatar-effects research. In this
paper, we measure the influence of similarity, wishful, and embodied identification as mediators
between voice similarity and other outcome variables.
2.1.3 Facilitating Avatar Identification. Currently, the prevailing method of increasing avatar
identification is through avatar customization. Customization of one’s avatar has been shown to
increase avatar identification in a variety of contexts [
]. Other factors
that can increase identification include the presence of narrative [
] and the character’s name
[42]. However, no study to date has manipulated the avatar’s voice audio.
2.2 Audio in Games
2.2.1 Audio Types. Audio can significantly influence players’ experiences. A meta-analysis found
that the existence of audio, compared to its absence, has a significant effect on presence [
Researchers have classified audio into speech and dialog, sound effects, and music [
]. Sound
effects are further classified into avatar sounds, object sounds, character sounds, and ornamental
sounds [
]. All categories of audio appear to have effects. Game music has been found to influence
immersion [
], tension/anxiety [
], risk-taking behavior [
], and concentration [
Game sound effects, often an important source of feedback [
], affect immersion
] and performance [
]. Additionally, the effects of audio are often contextually dependent on
game genre [
], device type [
], and preferences [
]. Other studies, though, have found that
audio has little effect [164]. Our goal in this paper is to study self-similar avatar voices.
2.2.2 Avatars and Audio. Researchers have suggested that avatar-based sounds, such as breathing
(proprioceptive) and footsteps (exteroceptive), can facilitate imaginative immersion and help the
player identify with their avatar [
]. It has also been suggested that audio creates a sense of
self-representation, which can intensify self-awareness, body ownership, and place illusion [
A few early studies have explored how the addition of footstep sounds can influence presence
] and movement behavior [
]. Researchers have also suggested that using one’s own voice
to interact with a game (e.g., voice commands) can positively affect avatar identification, despite
the dissonance produced in speaking to the game [30].
Several studies show that voice affects users. Voiceovers for non-player characters have been
shown to increase engagement in a role-playing game [
]. Virtual customer service representatives
that include a text-to-speech voice increase flow [
] and trust [
]. However, it is not only the
presence vs. absence of voice that affects users; voice similarity does as well. In a public-speaking experiment
in front of a virtual classroom, participants either gave their own speech out loud, or had another
participant’s speech audio played back. Participants using their own voice experienced significantly
higher presence [
]. However, this may have been a result of the voice similarity group actually
having to give the speech while the dissimilarity group only had to act it out. In a study on
synthesized voices, participants evaluated voices that were designed to have different personalities
(extroverted vs. introverted). The authors found consistent support for similarity-attraction, i.e.,
participants gave higher evaluations to the voice more similar to their own personality [
]. Studies suggest
that the perceived gender of the voice can also influence users.
During a lecture in which the same person spoke as both a male and female (both voice morphed),
students evaluated the female as more likeable and the male as more intelligent [
]. In a decision-
making study, a male-voiced computer influenced the user’s decision significantly more often than
the female counterpart [
]. In a study with computerized voice output, three gender stereotypes
were found: male evaluation as more valid than female evaluation; dominance in females as unbecoming;
and women knowing more about feminine topics (facts relating to love-and-relationships),
with men knowing more about masculine topics (facts relating to computers) [
]. Consistent
with such stereotyping, one study found that informative male and sociable female voice agents
led to more positive assessments of an autonomous vehicle compared to stereotype-inconsistent
gender matching (i.e., informative female, sociable male) [
]. Therefore, the gender of the avatar
voice may be crucial to its influence on player experience.
Studies show, however, that pitch can also affect voice evaluation. Studies that manipulate
voice pitch across multiple languages and cultures have found that men’s and women’s voices
with lowered pitch are perceived as more dominant and masculine than those with raised pitch
]. People often assess voice pitch as being associated with a certain body mass
and height [
], attractiveness [
], and age [
]. Studies show that
a pitch manipulation of 20 Hz is sufficient to alter attractiveness ratings of voices [
In our study, we examine the effects of voice modulation (i.e., pitch manipulation). Specifically,
we are interested in the interaction between gender and modulation direction, hypothesizing that
a lower modulation will result in more positive outcomes in our programming game because of
STEM gender stereotypes. Here, we conduct one of the first studies to look at either voice
similarity or modulation and their effects on player experience (PX) in games.
2.3 Player Experience in Games
The past two decades have seen the development of a number of instruments to measure PX. These
include the Game Immersion Questionnaire (GIQ) [
], the Immersive Experience Questionnaire
(IEQ) [
], the Game Engagement Questionnaire (GEQ) [
], the Game Experience Questionnaire
], the Digital Games Motives Scale (DGMS) [
], the Player Experience of Need Sat-
isfaction (PENS) [
], and the Player Experience Inventory (PXI) [
]. In this study, we leverage
the PENS because it is based on a well-grounded theoretical framework [
] and allows us to
better contextualize our results in the existing literature, which uses the PENS as a theoretical
framework to explicate avatar identification (e.g., [
]). More specifically, the PENS is based on
self-determination theory (SDT). SDT, as originally conceptualized, consists of three core building
blocks to explain human motivation, which in turn lead to greater performance, persistence, and
creativity [
]. These building blocks are competence, the need for being effective at achieving
desired objectives; autonomy, the need for having the ability to make decisions; and relatedness,
the need for social closeness with others. This original model has been extended to games by
including presence/immersion, the sense of actually being transported into the game world, and
intuitive controls, the intuitiveness of controls. In addition to the PENS, we also leverage the Intrinsic
Motivation Inventory (IMI) [
], through which we measure interest/enjoyment, effort/importance,
pressure/tension, and value/usefulness. Through the interest/enjoyment subscale, the IMI measures
intrinsic motivation: the desire to complete an activity because of the satisfaction of doing so in
and of itself.
Need satisfaction is essential for intrinsic motivation to exist [
]. A study in an endless
runner game found that avatar identification increases autonomy, immersion, interest/enjoyment,
effort/importance, positive affect, and time spent [
]. A study that involved playing an educational
programming game, then making a custom game level for that same game, found that avatar
identication increases need satisfaction, intrinsic motivation, self-ecacy, time spent, and quality
In discussing gender stereotypes, we acknowledge that although researchers have found these stereotypes to often be
consistent across culture [
] and time [
], other studies have found some variability [
]. Therefore, it is important
to note that studies cited in this section were (1) based on stereotypes validated in the social scientic literature during a
recent time period prior to each study and (2) for the culture from which the studies’ participants are drawn.
Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CHI PLAY, Article 238. Publication date: September 2021.
238:6 Dominic Kao et al.
Fig. 1. Data type puzzle (L). Curing a wounded knight (R). Placeholders . . . indicate where code snippets can
be thrown.
of game levels [
]. A study in a jumping game found that avatar identication increases need
satisfaction and time spent playing [
]. Here, we are interested in determining whether voice
similarity inuences need satisfaction and intrinsic motivation, as well as the potential mediating
eect of avatar identication.
2.4 Hypotheses
Building on the literature and arguments presented thus far, we pose the following hypotheses.
Higher voice similarity will lead to more positive avatar identification, need satisfaction,
intrinsic motivation, and performance.
Avatar identification will mediate more positive need satisfaction, intrinsic motivation, and
performance, i.e., voice similarity will lead to a higher level of avatar identification, which will in
turn increase these outcomes.
Consistent with gender stereotypes in STEM, voice modulation upwards/downwards will
have a negative/positive effect, respectively, on avatar identification, need satisfaction, intrinsic
motivation, and performance.
3.1 The Game
Our experimental testbed is CodeBreakers
], which was created for conducting avatar-based
studies. CodeBreakers is a Java programming game in which players solve increasingly difficult
problems by throwing snippets of code. See Figure 1. CodeBreakers was iteratively created with
feedback from professional game developers, game designers, and Java developers, and included
informal play testing over an eighteen-month span with playtesters. There were 14 total puzzles,
spanning 6 levels. CodeBreakers was designed to incorporate best practices on effective learning
curves [
]. Programming topics include data types, conditionals and control flow, classes and
objects, inheritance and interfaces, loops and recursion, and data structures. Each puzzle had up to
3 hints, which are increasingly detailed. Players controlled their character using the keyboard and
mouse. We measured performance through the number of puzzles completed. Players could exit at
any time once they began playing. CodeBreakers was made available for machines running either
Microsoft Windows or macOS.
Gameplay video: https://youtu.be/x5U-Jd6tKXA
Fig. 2. Male (L) and female (R) avatars.
3.2 Validating Visual Avatar Design
For this experiment, the player avatar was purposefully designed to avoid known color effects
(e.g., the color red is known to reduce mood, affect, and performance in cognitive-oriented tasks
]), to have ambiguous identity characteristics besides its gender, and to fit
the game. We chose blue for the avatar color because it is not associated with negative cognitive
effects and has comparable effects to other more neutral colors, such as gray, on test performance
and heart rate variability (HF-HRV) [
]. Blue was also chosen to match the aesthetic of the game.
The avatar models themselves were designed and created from scratch by a professional 3D game
artist and were made intentionally abstract for ambiguity in identity. This was to reduce variance
in how much players identified with the avatar’s visual appearance. For example, an avatar with
unambiguous identity characteristics would have a high similarity with only the subset of players
who match those identity characteristics. See Figure 2.
To validate that these goals were met, we ran a validation study with 140 participants on
Amazon Mechanical Turk (MTurk). We used a screening survey to recruit 70 participants who
self-identified as male and 70 participants who self-identified as female. After an audio check to ensure
participants had their audio turned on, each participant played the base version of CodeBreakers
(i.e., without any voice-related aspects) for a minimum of five minutes. All participants played with
a gender-matched avatar. After five minutes, participants were allowed to quit at any time. After
quitting, we asked participants the following questions: “How appropriate was the avatar color for
the game?”, “How appropriate was the avatar color for the avatar?”, “How appropriate was the
avatar clothing for the game?”, “How appropriate was the avatar clothing for the avatar?”, and
“How appropriate was the avatar design overall?” on a scale from 1:Very Inappropriate to 5:Very
Appropriate. Participants then answered two additional questions: “Besides its gender, the identity
of my avatar was ambiguous (e.g., ethnicity/race)” and “My avatar resembled me” on a scale from
1:Strongly Disagree to 5:Strongly Agree. See Table 1. Participants were compensated $5.00 (USD) for
taking part in this validation study.
We then performed independent samples t-tests between male and female participants. None
of the tests were significant. ColorG: d=0.09; ColorA: d=0.15; ClothesG: d=0.17; ClothesA: d=0.15;
DesignO: d=0.05; Ambiguous: d=0.11; Resemble: d=0.06. Moreover, the average of all responses scored
higher than Agree (4), except for resemblance, which scored slightly below neutral (~2.6–2.7).
Therefore, these results validate our goals of avatars that fit the game and have ambiguous identity
characteristics, without significant deviations across gender.
Gender   ColorG       ColorA       ClothesG     ClothesA     DesignO      Ambiguous    Resemble
Male     4.23 ± 0.80  4.24 ± 0.79  4.24 ± 0.79  4.23 ± 0.80  4.27 ± 0.80  4.34 ± 0.98  2.64 ± 1.14
Female   4.16 ± 0.77  4.11 ± 0.96  4.37 ± 0.77  4.34 ± 0.76  4.23 ± 0.85  4.24 ± 0.86  2.71 ± 1.19
Table 1. Descriptive results (mean ± SD) from validation study validating that the avatar’s visual characteristics were
appropriate for the game (ColorG, ClothesG), for the avatar (ColorA, ClothesA), overall (DesignO), that
the avatar’s identity was viewed as ambiguous (Ambiguous), and that resemblance to the avatar across gender
was similar (Resemble). All measures are on a 5-pt Likert scale.
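The gender comparison in this section can be roughly re-derived from the summary statistics in Table 1. The sketch below is not the authors' analysis: it assumes the second number in each table cell is a standard deviation, uses the stated 70 participants per group, and computes a pooled-variance t statistic and a standardized mean difference (Cohen's d). The critical value is the usual two-tailed 5% threshold for 138 degrees of freedom.

```python
import math

N = 70  # participants per gender group, as stated in the text

# Means and (assumed) standard deviations copied from Table 1.
TABLE_1 = {
    # measure: ((male_mean, male_sd), (female_mean, female_sd))
    "ColorG":    ((4.23, 0.80), (4.16, 0.77)),
    "ColorA":    ((4.24, 0.79), (4.11, 0.96)),
    "ClothesG":  ((4.24, 0.79), (4.37, 0.77)),
    "ClothesA":  ((4.23, 0.80), (4.34, 0.76)),
    "DesignO":   ((4.27, 0.80), (4.23, 0.85)),
    "Ambiguous": ((4.34, 0.98), (4.24, 0.86)),
    "Resemble":  ((2.64, 1.14), (2.71, 1.19)),
}

T_CRITICAL = 1.977  # approximate two-tailed critical t, alpha = 0.05, df = 138


def pooled_sd(sd1, sd2, n1=N, n2=N):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, m2, sd2):
    """Absolute standardized mean difference."""
    return abs(m1 - m2) / pooled_sd(sd1, sd2)


def t_statistic(m1, sd1, m2, sd2, n1=N, n2=N):
    """Student's t for two independent samples with pooled variance."""
    standard_error = pooled_sd(sd1, sd2, n1, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / standard_error


results = {
    name: (cohens_d(m1, s1, m2, s2), t_statistic(m1, s1, m2, s2))
    for name, ((m1, s1), (m2, s2)) in TABLE_1.items()
}

for name, (d, t) in results.items():
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{name:<10} d = {d:.2f}  t = {t:+.2f}  {verdict}")
```

Because the table values are rounded to two decimals, the recomputed effect sizes can differ from the reported ones in the last digit.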
3.3 Voice Manipulation Platform
For this study, we developed an online platform that takes as input a single audio voice clip and
is able to generate an arbitrary number of similar voice clips. We did this by leveraging recent
advances in neural network-based speech synthesis [
]. We started with an open-source
implementation of real-time voice cloning [
] (see Section 3.3.1) and made several significant
additions. In order to run a large-scale study using this, it was necessary to create a version that
could be deployed remotely in the cloud and could be accessed on-demand. We did this by first
deploying the software to an Amazon EC2 P2 server (type p2.xlarge), a GPU-based computing
instance that has 1 GPU, 4 vCPUs, and 61 GiB of RAM located in the availability zone us-east-1b [
]. This
server leverages NVIDIA’s Compute Unified Device Architecture (CUDA), which allows for GPU
support in running the software. We then created a server which uses HTTP POST requests to
communicate with clients with the following message types: Upload (uploads the sample voice clip
while specifying any modulation parameters, returns a unique key for subsequent messages), Status
Check (checks if job completed), and Download (get a single voice file, or get all voice files). When
a client makes an upload request, this is placed in a queue and then served in order. We use 30
worker threads on our server so that multiple jobs can be processed concurrently. During the study,
we kept watch over server performance (i.e., memory and GPU usage), which is important for
consistent participant experiences. For example, high GPU usage would delay new requests from
being processed in a timely manner. This was managed by limiting the number of concurrent study
participants. To reduce wait time, clients only need to download the voice files for the next level to
begin playing (while the remaining voice files are downloaded in the background asynchronously).
It takes 15 ± 5 s (variation dependent on internet speed) for a U.S. based client to request and to
download voice files for the first level. All voice files on the server are deleted on download.
To perform voice modulation (i.e., pitch shifting), we first measure average fundamental frequency
using a pitch floor of 75 Hz (Male) & 100 Hz (Female) and a pitch ceiling of 300 Hz (Male) & 500 Hz
(Female) [
]. We performed pitch shifting using pitch synchronous overlap and add (PSOLA)
]. PSOLA is frequently used in studies that manipulate voice [
], and
it affects only pitch, leaving other properties of voice perceptually unchanged [
]. After
modulation, the sample is then processed through the voice cloning software. To mitigate differences
in processing time, unmodulated samples are still dummy pitch-shifted to simulate the same time
delay as modulated samples.
Fig. 3. Voice audio occurs in conjunction with speech bubbles that appear on top of the avatar.
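The Upload / Status Check / Download flow described in this section can be sketched as an in-memory job queue. This is only an illustration under stated assumptions: the HTTP layer is omitted, the class and method names are hypothetical, and the `synthesize` callable stands in for the modulation and voice cloning steps. The FIFO queue, the 30 worker threads, the unique key, and delete-on-download follow the description in the text.

```python
import queue
import threading
import uuid

NUM_WORKERS = 30  # worker-thread count reported in the text


class VoiceJobServer:
    """In-memory stand-in for the three message types (names are hypothetical)."""

    def __init__(self, synthesize, num_workers=NUM_WORKERS):
        # synthesize: hypothetical callable (clip, modulation) -> {line_id: audio}
        self._synthesize = synthesize
        self._jobs = queue.Queue()   # FIFO, so uploads are served in arrival order
        self._results = {}           # key -> generated voice files
        self._lock = threading.Lock()
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def upload(self, clip, modulation="none"):
        """Queue a job and return a unique key for later messages."""
        key = uuid.uuid4().hex
        self._jobs.put((key, clip, modulation))
        return key

    def status(self, key):
        """True once the job has finished and its files are still on the server."""
        with self._lock:
            return key in self._results

    def download(self, key):
        """Return all files for a finished job and delete them from the server."""
        with self._lock:
            return self._results.pop(key)

    def wait_all(self):
        """Block until every queued job has been processed (for demos and tests)."""
        self._jobs.join()

    def _worker(self):
        while True:
            key, clip, modulation = self._jobs.get()
            files = self._synthesize(clip, modulation)
            with self._lock:
                self._results[key] = files
            self._jobs.task_done()


def fake_synthesize(clip, modulation):
    # Placeholder for pitch shifting plus voice cloning.
    return {f"line_{i:02d}": f"{clip}|{modulation}|{i}" for i in range(3)}


server = VoiceJobServer(fake_synthesize, num_workers=4)
job_key = server.upload("sample.wav", modulation="up")
server.wait_all()
voice_files = server.download(job_key)
```

A real client would poll the status message rather than block, and could request only the first level's files before fetching the rest in the background.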
3.3.1 Voice Cloning Software Architecture. The open-source implementation of real-time voice
cloning is described at a more granular level in [
], and the framework as a whole borrows
heavily from [
]. There are three main components to the software architecture which are each
trained separately: 1) a speaker encoder which creates embedding vectors representing the voice
of the speaker [
]; 2) a synthesizer which takes input text and, conditioning the text on the
speaker embedding vectors, generates a mel spectrogram [
]; and 3) a vocoder that converts the
spectrogram into an audio waveform [
]. Implementation and training details can be found in
[89] and [86].
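The data flow between the three components can be sketched as follows. This is not the interface of the open-source implementation: the class, the type aliases, and the toy stages are all hypothetical and carry no acoustic meaning. The sketch only shows that the speaker embedding is computed once from the reference clip and then conditions the synthesis of every text line.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]              # audio samples
Embedding = List[float]             # fixed-length speaker vector
MelSpectrogram = List[List[float]]  # frames x mel bins


@dataclass
class VoiceCloner:
    """Chains three separately trained stages (all callables are stand-ins)."""
    encode_speaker: Callable[[Waveform], Embedding]
    synthesize: Callable[[str, Embedding], MelSpectrogram]
    vocode: Callable[[MelSpectrogram], Waveform]

    def clone(self, reference: Waveform, lines: List[str]) -> List[Waveform]:
        # The embedding is computed once and reused for every line, which is
        # how a single short clip can yield an arbitrary number of voice lines.
        embedding = self.encode_speaker(reference)
        return [self.vocode(self.synthesize(text, embedding)) for text in lines]


# Toy stages so the sketch runs end to end.
def toy_encoder(wave: Waveform) -> Embedding:
    return [sum(wave) / len(wave)]


def toy_synthesizer(text: str, emb: Embedding) -> MelSpectrogram:
    return [[emb[0], float(ord(ch))] for ch in text]  # one "frame" per character


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    return [value for frame in mel for value in frame]


cloner = VoiceCloner(toy_encoder, toy_synthesizer, toy_vocoder)
clips = cloner.clone([0.1, 0.3], ["The castle is under siege!", "It worked!"])
```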
4.1 Initial Voice Processing
To collect a sample of the participant’s voice, during the beginning of the game, the player was
asked to speak to an animated robot (Harley) which introduced the game to the participant. The
participant was requested to speak the audio line: Hello, my name is [the character name chosen
by the participant], I am about to play CodeBreakers, and it is very nice to meet you, Harley. We
then checked the recorded audio for any long pauses without voice (>1 second) and that the entire
audio length t in a roughly acceptable interval for the voice cloning software (3—7 seconds). If
these checks were violated, the participant was asked to re-record the audio until the audio was
acceptable. The participant was then asked to listen to their recorded audio to ensure they could
hear themselves speaking the sentence, and they were given the option of re-recording. After the
voice sample was collected, it was sent to the server and processed while the participant completed
the rest of the game introduction. This entire process was identical across conditions. During
analysis, we manually checked every sample to ensure that the participant was clearly audible and
had followed instructions; ~4.3% of participants were excluded based on this check. For sample rate,
we use the default sampling rate from the participant’s microphone. All samples are normalized to
the same perceived volume using RMS (root mean square) normalization.
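The recording checks and the RMS normalization described above can be sketched as follows. The 3 to 7 second window and the 1 second pause limit come from the text; the silence threshold, the target RMS level, and the per-sample silence test are assumptions made for illustration (a production check would more likely use short-time energy over frames).

```python
import math

MIN_SECONDS, MAX_SECONDS = 3.0, 7.0  # acceptable clip length from the text
MAX_PAUSE_SECONDS = 1.0              # longest allowed pause without voice
SILENCE_THRESHOLD = 0.01             # assumed amplitude threshold (not from the paper)
TARGET_RMS = 0.1                     # assumed normalization target (not from the paper)


def longest_silence(samples, sample_rate, threshold=SILENCE_THRESHOLD):
    """Duration in seconds of the longest run of samples below the threshold."""
    longest = run = 0
    for s in samples:
        run = run + 1 if abs(s) < threshold else 0
        longest = max(longest, run)
    return longest / sample_rate


def clip_is_acceptable(samples, sample_rate):
    """Apply the duration check, then the long-pause check."""
    duration = len(samples) / sample_rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        return False
    return longest_silence(samples, sample_rate) <= MAX_PAUSE_SECONDS


def rms(samples):
    """Root mean square level of a clip."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_normalize(samples, target=TARGET_RMS):
    """Scale a clip so that its RMS level equals the target."""
    level = rms(samples)
    if level == 0:
        return list(samples)  # an all-zero clip cannot be rescaled
    gain = target / level
    return [s * gain for s in samples]


# Synthetic 4-second tone used only to exercise the functions.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(4 * RATE)]
with_pause = tone[: 2 * RATE] + [0.0] * int(1.5 * RATE) + tone[2 * RATE:]
```

With these definitions, a clip that fails `clip_is_acceptable` would trigger the re-recording prompt, and every accepted clip would pass through `rms_normalize` before being uploaded.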
4.2 Conditions
The study uses a 2 x 3 factorial design. We manipulate avatar voice similarity (similar vs. dissimilar)
and voice modulation (upwards vs. downwards vs. none). The manipulations are as follows:
Similar Voice: Avatar voice is generated using the participant’s voice.
Dissimilar Voice: Avatar voice is generated using a gender-matched prior participant’s voice. The
voice was selected at random from a corpus of 10 (5 male and 5 female) samples collected during
pilot testing.
Modulation Upwards: Original voice sample is pitch-shifted upwards by 20 Hz. We choose 20 Hz
as it is a manipulation used in prior voice studies [60–63, 92, 189].
Modulation Downwards: Original voice sample is pitch-shifted downwards by 20 Hz.
No Modulation: Original voice sample is used. Dummy pitch-shifted to simulate time delay and
ensure consistent user experiences across conditions.
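A fixed 20 Hz shift corresponds to a different relative pitch change depending on the speaker's average fundamental frequency. The short calculation below illustrates this, assuming the shift is applied relative to the measured mean F0; the two F0 values are illustrative examples and not data from the study, and the function names are hypothetical.

```python
import math

SHIFT_HZ = 20.0  # shift size used in the study
OFFSETS = {"up": SHIFT_HZ, "down": -SHIFT_HZ, "none": 0.0}


def shifted_f0(mean_f0_hz, direction):
    """Target mean F0 after modulation in the given direction."""
    return mean_f0_hz + OFFSETS[direction]


def shift_in_semitones(mean_f0_hz, direction):
    """Relative size of the shift; 12 semitones correspond to a doubling of F0."""
    return 12.0 * math.log2(shifted_f0(mean_f0_hz, direction) / mean_f0_hz)


for f0 in (120.0, 210.0):  # illustrative mean F0 values, not study data
    for direction in ("up", "down", "none"):
        print(f"F0 {f0:.0f} Hz, {direction:>4}: "
              f"{shifted_f0(f0, direction):.0f} Hz, "
              f"{shift_in_semitones(f0, direction):+.2f} semitones")
```

The same 20 Hz offset is a larger step in semitones for a lower-pitched voice than for a higher-pitched one, which may be worth keeping in mind when comparing modulation effects across speakers.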
Other than the above, all other aspects of the experiment were identical across conditions. In
total, there were 30 possible voice lines that could have been triggered. Other than the first voice
line (“What am I doing here? Did my ship crash? How long have I been lying here for? I guess I should
get up and look around.”), audio lines typically come before and after each puzzle. For example, prior
to puzzle #7: “The castle is under siege!” And after completing puzzle #7: “It worked! I neutralized all of
the bugs by using the staff.” These voice lines were accompanied by speech bubbles (see Figure 3).
All audio aside from the avatar voice was identical across conditions, i.e., music and sound effects.
4.3 Measures
We use three validated PX questionnaires and gameplay metrics.
4.3.1 Avatar Identification. For measuring avatar identification, we use the player identification
scale (PIS) [ ]. The PIS measures three dimensions of avatar identification on a 5-pt Likert scale:
similarity identification (e.g., “My character is similar to me”), embodied identification (e.g., “In the
game, it is as if I become one with my character”), and wishful identification (e.g., “I would like to
be more like my character”).
4.3.2 Player Experience of Need Satisfaction. To measure need satisfaction, we use the PENS scale
[ ]. PENS measures the following dimensions on a 7-pt Likert scale: competence (e.g., “I feel
competent at the game”), autonomy (e.g., “The game provides me with interesting options and choices”),
relatedness (e.g., “I find the relationships I form in this game fulfilling”), presence/immersion (e.g.,
“When playing the game, I feel transported to another time and place”), and intuitive controls (e.g.,
“Learning the game controls was easy”).
4.3.3 Intrinsic Motivation Inventory. To measure intrinsic motivation, we use the IMI [ ]. We
leverage the following IMI dimensions, which use a 7-pt Likert scale: interest/enjoyment (e.g., “I
enjoyed doing this activity very much”), effort/importance (e.g., “I put a lot of effort into this”),
pressure/tension (e.g., “I felt very tense while doing this activity”), and value/usefulness (e.g., “I
believe this activity could be of some value to me”).
4.3.4 Game Performance. We automatically recorded metrics for game performance, including
puzzles completed (max 14) and number of hints accessed (max 42). Re-played puzzles or re-accessed
hints are not counted. These metrics were considered a reflection of motivation to engage in and
learn from the game, which are clear intended outcomes of educational games.
4.3.5 Time Played. We operationalize motivated behavior as the time spent playing the game.
4.4 Participants
In total, 698 participants (30% female) with an average age of M = 33.53 (SD = 9.55) were recruited
through Amazon Mechanical Turk (MTurk). MTurk is a platform in which workers complete Human
Intelligence Tasks (HITs), including tasks for research studies. Studies show that MTurk provides
data of similar quality [ ], diversity [ ], and reliability [ ] as typical samples (e.g.,
college students). Participants were each paid $7.50. Participants who answered multiple surveys
with zero variance, or multiple surveys with 3SD, were excluded. Participant voice recordings were
manually checked by an author blind to condition to ensure they were audible and had followed
instructions. After exclusion based on these criteria, we were left with 657 participants (29% female)
for analysis, with an average age of M = 33.45 (SD = 9.66). The HIT was available to workers in
the U.S. over the age of 18 who had a computer with a working microphone. For quality control,
workers were required to have a HIT approval rate >95%. The Purdue University Institutional
Review Board (IRB) approved the study. All participants were asked to provide informed consent.
The smaller proportion of female participants was unexpected given the female skew overall on MTurk (over 60% in the
U.S.) [ ], and it was a possible byproduct of male participants being more attracted to playing a programming game.
4.4.1 Experience With Video Games and Programming. Participants reported playing an average of
M=11.6 (SD=10.5) hours of video games per week, above the global average of M=8.45 [ ]. On a
scale from 1:Minimal to 7:Extensive, participants rated their prior experience playing video games
(“How would you rate your prior experience playing video games?”) as M=5.38 (SD=1.66) and their
prior programming experience (“How would you rate your prior programming experience?”) as
M=2.64 (SD=1.77). Next, we adapted several questions on programming experience from [ ].
On a scale from 1:Very Inexperienced to 5:Very Experienced, participants rated their programming
experience compared to experts (“How do you estimate your programming experience compared to
experts with 20 years of practical experience?”) as M=1.43 (SD=0.93), their programming experience
compared to beginners (“How do you estimate your programming experience compared to beginner
programmers?”) as M=2.33 (SD=1.29), their programming experience in Java specifically (“How
experienced are you with the Java programming language?”) as M=1.73 (SD=1.05), and their
experience with an object-oriented paradigm (“How experienced are you with the object-oriented
programming paradigm?”) as M=1.91 (SD=1.21). Therefore, our sample contains participants who
are regularly exposed to video games and have low prior programming experience. ANOVAs
found that there were no significant differences between conditions on prior gaming experience
F [
=.000), programming experience (
F [
and Java programming experience (F[5, 651] =0.459, p=0.807, η
4.5 Design
A between-subjects factorial design was used. Each participant was randomly assigned to one of
six possible conditions. Participant counts in each condition were approximately equal (M=109.5,
SD=4.2), with a similar number of male (M=77.5, SD=3.6) and female (M=32.0, SD=5.3) participants
across each condition.
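The 2 x 3 assignment can be sketched as below. This is an illustration only: the text states that assignment was random but not how it was implemented, so simple (unblocked) random assignment, the seed, and the names used here are assumptions.

```python
import itertools
import random
from collections import Counter

SIMILARITY = ("similar", "dissimilar")
MODULATION = ("up", "down", "none")
CONDITIONS = list(itertools.product(SIMILARITY, MODULATION))  # the six cells

def assign_conditions(n_participants, seed=0):
    """Draw one of the six cells uniformly at random for each participant."""
    rng = random.Random(seed)
    return [rng.choice(CONDITIONS) for _ in range(n_participants)]

assignments = assign_conditions(657)
counts = Counter(assignments)
# With 657 analyzed participants and six cells, the mean cell size is 657 / 6.
mean_cell_size = sum(counts.values()) / len(CONDITIONS)
```

The mean cell size of 109.5 matches the reported per-condition mean; the spread of the simulated cell counts depends on the seed and need not match the reported SD.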
4.6 Procedure
Participants first filled out an IRB-approved consent form. Participants were informed that they
could exit the game at any time. Participants then began playing CodeBreakers. At the beginning of
the game, participants underwent an audio check during which they were required to type a spoken
English word. Next, the participant was asked to speak into their microphone to confirm that we
could detect their audio input. Participants then selected a name and gender for their character. For
the purposes of the experiment, participants were asked to choose the same gender as their real-life
gender. We manually double-checked that their selected gender matched the gender reported
post-experiment. A robotic agent (see Figure 4) then engaged in a short conversation with the player.
The robot was animated with audio dialog generated through an automatic voice generator [ ].
Nevertheless, we proceeded with our analyses as planned because the total number of female participants was still high
(>200) and our statistical testing is robust to unequal group sizes.
Fig. 4. The robotic agent during the introduction asks the player to speak (L), and introduces the game (R).
After a brief introduction, the robot asked the participant to introduce himself/herself through
their microphone. When the participant was ready to speak, they clicked on the Record button,
then clicked Stop Recording when finished. In case the recording was too short (<3 seconds), too
long (>7 seconds), or contained long pauses (>1 second), the player was asked to retry and to keep
their dialog a continuous ~5 seconds in length. Once completed, participants were asked to confirm
they could hear their recorded audio and to re-record otherwise. Next, the participant’s audio was
sent to the server for processing as they completed the remainder of the introduction. During
the rest of the introduction, the participant was briefed on how to play the game. Participants
were told they could exit the game at any time by pressing ESC on their keyboard, then clicking
quit game. The participant then began playing the game. In the event that the participant’s avatar
voice files for the first level had not yet been completely downloaded, we showed a loading screen.
Participants played the game with a blue gender-matched avatar. All participant game data was
automatically logged. Once participants quit the game (or completed all 6 levels), they filled out
a set of manipulation check questions, the PIS, PENS, and IMI. Participants were then asked to
describe in their own words any problems encountered and what they thought the purpose of
the experiment was. None of the participants correctly guessed the purpose of the experiment.
Participants then filled out a set of questions about prior video game experience, programming
experience, and demographics.
4.7 Analysis
Data was analyzed using SPSS 23 and the PROCESS macro for SPSS [ ]. Independent t-tests were
used to compare voice similarity versus voice dissimilarity on the outcomes of performance, time
spent, PIS, PENS, and IMI. We then performed a mediation analysis using avatar identification
as the mediator. Voice similarity coded as a dichotomous variable is X, avatar identification is
the mediator M, and performance, time spent, PENS, and IMI are Y. We use mediation with each
dimension of identification modeled individually (similarity, embodied, wishful) rather than parallel
mediation. We chose to do this because of multicollinearity between identification dimensions
(correlations>0.7), which can affect the estimation of mediation relationships in parallel mediation
[ ]. To investigate voice modulation, we use a 3x2 (modulation x gender) ANOVA because we
expected voice modulation effects to be moderated by gender. We use an α of 0.05.
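The single-mediator model described above can be sketched in a few lines. This is not the PROCESS macro: it is a hedged, pure-Python illustration that estimates the paths a, b, c', and c by ordinary least squares and builds a plain percentile bootstrap interval for ab (a simpler stand-in for a bias-corrected interval). The synthetic data, effect sizes, seeds, and function names are all assumptions.

```python
import random

def _centered_sums(x, m, y):
    """Centered sums of squares and cross-products for X, M, and Y."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    return sxx, smm, sxm, sxy, smy

def mediation_paths(x, m, y):
    """OLS estimates: a (X->M), b (M->Y given X), c' (X->Y given M), c (total X->Y)."""
    sxx, smm, sxm, sxy, smy = _centered_sums(x, m, y)
    det = sxx * smm - sxm ** 2
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / det
    c_prime = (smm * sxy - sxm * smy) / det
    c = sxy / sxx
    return {"a": a, "b": b, "c_prime": c_prime, "c": c, "ab": a * b}

def bootstrap_ab_ci(x, m, y, n_boot=1000, seed=1):
    """Percentile 95% interval for the indirect effect ab (not bias-corrected)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ms = [m[i] for i in idx]
        ys = [y[i] for i in idx]
        estimates.append(mediation_paths(xs, ms, ys)["ab"])
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic data: X is a dichotomous condition code, M and Y are continuous.
rng = random.Random(42)
x = [i % 2 for i in range(300)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.8 * mi + 0.2 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
paths = mediation_paths(x, m, y)
ci_low, ci_high = bootstrap_ab_ci(x, m, y)
```

An indirect effect would be read as significant when the interval excludes zero, which mirrors how significance of ab is judged from a 95% CI.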
Voice       Modulation  Similar      Like Me      Gender       Age          Speaking     Friends      Likeable     Friendly     Fit
Similar     None        4.33 (1.54)  4.31 (1.61)  6.28 (1.08)  4.79 (1.52)  4.22 (1.67)  4.20 (1.70)  4.71 (1.67)  4.79 (1.64)  4.78 (1.50)
Similar     Up          4.17 (1.51)  4.17 (1.45)  6.13 (1.28)  4.91 (1.34)  4.27 (1.53)  4.25 (1.50)  4.68 (1.42)  4.81 (1.46)  4.75 (1.16)
Similar     Down        4.19 (1.68)  4.17 (1.79)  6.17 (1.32)  4.81 (1.46)  4.18 (1.81)  4.26 (1.66)  4.40 (1.61)  4.52 (1.66)  4.90 (1.52)
Dissimilar  None        3.07 (1.78)  3.08 (1.73)  6.12 (1.52)  4.61 (1.71)  3.28 (1.85)  4.10 (1.76)  4.62 (1.76)  4.66 (1.76)  4.76 (1.73)
Dissimilar  Up          3.06 (1.72)  2.96 (1.67)  6.23 (1.32)  4.45 (1.72)  3.16 (1.88)  4.12 (1.80)  4.66 (1.74)  4.75 (1.63)  4.78 (1.51)
Dissimilar  Down        2.92 (1.76)  2.75 (1.64)  6.18 (1.48)  4.54 (1.66)  3.01 (1.76)  3.90 (1.73)  4.50 (1.64)  4.58 (1.62)  4.92 (1.59)
Table 2. Descriptive results of manipulation check. Cells show M (SD).
Voice       Modulation  Similar      Like Me      Gender       Age          Speaking     Friends      Likeable     Friendly     Fit
Similar     None        4.32 / 4.35  4.36 / 4.22  6.33 / 6.19  4.86 / 4.65  4.15 / 4.35  4.15 / 4.30  4.74 / 4.65  4.78 / 4.81  4.76 / 4.81
Similar     Up          4.13 / 4.29  4.11 / 4.36  6.13 / 6.11  4.97 / 4.75  4.21 / 4.43  4.20 / 4.39  4.65 / 4.79  4.79 / 4.86  4.71 / 4.86
Similar     Down        4.20 / 4.17  4.14 / 4.22  6.22 / 6.06  4.94 / 4.53  4.18 / 4.19  4.32 / 4.14  4.35 / 4.50  4.49 / 4.58  4.84 / 5.06
Dissimilar  None        3.02 / 3.20  3.05 / 3.20  6.13 / 6.08  4.65 / 4.48  3.29 / 3.24  4.10 / 4.12  4.63 / 4.60  4.71 / 4.48  4.74 / 4.84
Dissimilar  Up          3.17 / 2.79  2.99 / 2.90  6.26 / 6.17  4.45 / 4.45  3.21 / 3.04  4.17 / 4.00  4.77 / 4.38  4.81 / 4.59  4.77 / 4.79
Dissimilar  Down        2.97 / 2.81  2.79 / 2.68  6.26 / 6.03  4.61 / 4.41  3.05 / 2.92  3.92 / 3.84  4.55 / 4.41  4.64 / 4.46  4.90 / 4.97
Table 3. Manipulation check mean scores for participants self-identifying as male and female. Cells show male / female means.
5.1 Manipulation Check
The manipulation check consisted of 9 questions. The first 8 questions all began with “My avatar’s
voice sounded...” and ended with “similar to me”; “like me when I talk”; “the same gender as me”;
“about my age”; “as if I was the one speaking”; “like someone I would be friends with”; “likeable”;
and “friendly” on a 7-pt Likert scale (1:Strongly Disagree to 7:Strongly Agree). These questions
assessed the voice similarity manipulation. The last question, “How well do you feel your avatar’s
voice fit with your avatar?” was on a 7-pt Likert scale (1:Very Poorly to 7:Very Well). This question
assessed the perceived fit between the voice and the game avatar.
Between-subjects testing found that participants in the voice similarity condition scored
significantly higher on “similar to me” (d=0.73), “like me when I talk” (d=0.78), “about my age”
(d=0.19), and “as if I was the one speaking” (d=0.61), compared to participants in the voice
dissimilarity condition. There were no significant differences across the remaining questions. From
these results, the voice similarity manipulation was successful at inducing higher perceived avatar
voice similarity. All conditions had a slightly above neutral fit between voice and avatar. See Table 2
and Table 3 for a more detailed breakdown of the manipulation check measures by voice similarity
conditions as well as the voice modulation conditions and participant gender. The remaining results
are organized by our hypotheses.
5.2 Effects of Voice Similarity
H1: Higher voice similarity will lead to more positive avatar identification, need satisfaction, intrinsic
motivation, and performance.
Both t-tests and ANOVAs are considered robust to non-normality, especially at larger sample sizes [20, 122].
Variable               Similar Voice      Dissimilar Voice
                       M        SD        M        SD
**Puzzles Completed    6.58     5.64      5.47     5.28
Hints Accessed         13.95    13.15     12.07    12.50
*Time Spent Sec.       899.75   838.26    743.80   739.30
***PIS Similarity      2.99     1.03      2.75     1.02
PIS Embodied           2.99     1.12      2.93     1.10
PIS Wishful            2.53     1.04      2.50     1.09
***PENS Competence     4.57     1.60      4.21     1.62
PENS Autonomy          4.30     1.58      4.17     1.59
**PENS Relatedness     3.44     1.55      3.10     1.58
***PENS Immersion      3.88     1.58      3.52     1.60
PENS Controls          4.73     1.57      4.75     1.58
IMI Enjoyment          4.73     1.48      4.60     1.55
IMI Effort             5.47     1.23      5.37     1.31
IMI Pressure           3.34     1.56      3.29     1.60
IMI Value              4.64     1.59      4.55     1.63
* significant at p < .05; ** significant at p < .01; *** significant at p < .005.
Table 4. Results for effects of voice similarity (H1). PIS was on a 5-pt Likert scale, while IMI and PENS were
on a 7-pt Likert scale.
5.2.1 Performance. Participants in the voice similarity condition completed significantly more
puzzles than participants in the voice dissimilarity condition, d=0.20. There
was no significant difference in hints used between the voice similarity condition and the voice
dissimilarity condition, t(655)=1.89, p=0.06, d=0.15.
5.2.2 Time Spent. Participants in the voice similarity condition played for a significantly longer
period of time than participants in the voice dissimilarity condition, t(655)=2.53, p<0.05, d=0.20.
5.2.3 PIS. Participants in the voice similarity condition had significantly higher similarity
identification than participants in the voice dissimilarity condition, d=0.23. There
was no significant difference in embodied identification between the voice similarity condition and
the voice dissimilarity condition, d=0.06. There was no significant difference in
wishful identification between the voice similarity condition and the voice dissimilarity condition,
t(655)=0.34, p=0.74, d=0.03.
5.2.4 PENS. Participants in the voice similarity condition experienced significantly higher
competence than participants in the voice dissimilarity condition, d=0.23. There
was no significant difference in autonomy between the voice similarity condition and the voice
dissimilarity condition, d=0.08. Participants in the voice similarity condition
experienced significantly higher relatedness than participants in the voice dissimilarity condition,
d=0.22. Participants in the voice similarity condition experienced significantly
higher immersion than participants in the voice dissimilarity condition.
There was no significant difference in intuitive controls between the voice similarity condition and
the voice dissimilarity condition, t(655)=0.12, p=0.90, d=0.01.
Mediator: Similarity Identification (a = 0.24***)
Outcome            b          c’         c          ab; 95% CI
Puzzles            0.660***   0.959*     1.114**    0.155; CI [0.038, 0.320]
Hints              1.411***   1.555      1.887      0.332; CI [0.068, 0.704]
Time Spent         88.74***   135.1*     156.0*     20.90; CI [4.366, 44.60]
PENS Competence    0.579***   0.226      0.362***   0.136; CI [0.046, 0.231]
PENS Autonomy      0.640***   -0.025     0.126      0.151; CI [0.050, 0.257]
PENS Relatedness   0.694***   0.176      0.340**    0.164; CI [0.055, 0.276]
PENS Immersion     0.760***   0.186      0.365***   0.179; CI [0.057, 0.302]
PENS Controls      0.460***   -0.123     -0.015     0.108; CI [0.034, 0.189]
IMI Enjoyment      0.591***   -0.004     0.135      0.139; CI [0.047, 0.236]
IMI Effort         0.032      0.094      0.102      0.008; CI [-0.016, 0.035]
IMI Tension        -0.066     0.063      0.048      -0.016; CI [-0.051, 0.015]
IMI Usefulness     0.594***   -0.050     0.090      0.140; CI [0.048, 0.239]

Mediator: Embodied Identification (a = 0.061)
Outcome            b          c’         c          ab; 95% CI
Puzzles            0.791***   1.066*     1.114**    0.048; CI [-0.088, 0.196]
Hints              1.975***   1.767      1.887      0.120; CI [-0.213, 0.494]
Time Spent         98.66***   150.0*     156.0*     5.99; CI [-11.46, 25.23]
PENS Competence    0.596***   0.326***   0.362***   0.036; CI [-0.065, 0.139]
PENS Autonomy      0.682***   0.084      0.126      0.041; CI [-0.074, 0.156]
PENS Relatedness   0.618***   0.302**    0.340**    0.038; CI [-0.068, 0.141]
PENS Immersion     0.769***   0.318***   0.365***   0.047; CI [-0.084, 0.176]
PENS Controls      0.498***   -0.045     -0.015     0.030; CI [-0.056, 0.116]
IMI Enjoyment      0.628***   0.097      0.135      0.038; CI [-0.069, 0.145]
IMI Effort         0.118**    0.095      0.102      0.007; CI [-0.014, 0.032]
IMI Tension        -0.063     0.052      0.048      -0.004; CI [-0.023, 0.011]
IMI Usefulness     0.584***   0.055      0.090      0.036; CI [-0.065, 0.133]

Mediator: Wishful Identification (a = 0.028)
Outcome            b          c’         c          ab; 95% CI
Puzzles            -0.231     1.121**    1.114**    -0.006; CI [-0.066, 0.045]
Hints              0.137      1.884      1.887      0.004; CI [-0.080, 0.107]
Time Spent         -21.27     156.5*     156.0*     -0.590; CI [-8.263, 5.479]
PENS Competence    0.524***   0.348***   0.362***   0.015; CI [-0.070, 0.101]
PENS Autonomy      0.592***   0.109      0.126      0.016; CI [-0.081, 0.115]
PENS Relatedness   0.655***   0.322***   0.340**    0.018; CI [-0.088, 0.126]
PENS Immersion     0.743***   0.344***   0.365***   0.021; CI [-0.100, 0.142]
PENS Controls      0.370***   -0.025     -0.015     0.010; CI [-0.051, 0.070]
IMI Enjoyment      0.479***   0.122      0.135      0.013; CI [-0.065, 0.095]
IMI Effort         0.042      0.101      0.102      0.001; CI [-0.009, 0.015]
IMI Tension        0.023      0.047      0.048      0.001; CI [-0.010, 0.014]
IMI Usefulness     0.529***   0.076      0.090      0.015; CI [-0.071, 0.102]

* significant at p < .05; ** significant at p < .01; *** significant at p < .005; significant ab based on 95% CI.
Table 5. Mediation results with voice similarity (X), avatar identification (M), and outcome (Y). Regression
coefficients a (X→M), b (M→Y), c’ (direct X→Y), c (total X→Y), and ab.
5.2.5 IMI. There was no significant difference in enjoyment between the voice similarity condition
and the voice dissimilarity condition, d=0.09. There was no significant difference
in effort between the voice similarity condition and the voice dissimilarity condition,
d=0.08. There was no significant difference in pressure between the voice similarity condition
and the voice dissimilarity condition, d=0.03. There was no significant difference
in value between the voice similarity condition and the voice dissimilarity condition,
p=0.47, d=0.06.
5.2.6 Summary of Results. Higher voice similarity leads to a significant increase in performance,
time spent, similarity identification, competence, relatedness, and immersion. Effect sizes (d) range
from 0.2 to 0.23, making these effects small. However, given the complexity of player-game
interactions, small effect sizes are not uncommon in games user research [ ]. Embodied and
wishful identification, autonomy, controls, and intrinsic motivation were unaffected. See Table 4.
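As a rough check on the reported range, Cohen's d can be recomputed from the means and standard deviations in Table 4. The sketch below assumes equal group sizes when pooling the standard deviations (cell sizes are reported as approximately equal) and that d was computed from a pooled SD; both are assumptions, and the rounded table values limit precision.

```python
import math

def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d with a pooled SD that assumes equal group sizes."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_1 - mean_2) / pooled_sd

# (similar M, similar SD, dissimilar M, dissimilar SD), copied from Table 4.
TABLE_4 = {
    "Puzzles Completed": (6.58, 5.64, 5.47, 5.28),
    "Time Spent Sec.": (899.75, 838.26, 743.80, 739.30),
    "PIS Similarity": (2.99, 1.03, 2.75, 1.02),
    "PENS Competence": (4.57, 1.60, 4.21, 1.62),
    "PENS Relatedness": (3.44, 1.55, 3.10, 1.58),
    "PENS Immersion": (3.88, 1.58, 3.52, 1.60),
}

effect_sizes = {name: cohens_d(*row) for name, row in TABLE_4.items()}
for name, d in effect_sizes.items():
    print(f"{name}: d = {d:.2f}")
```

Under these assumptions, all six significant outcomes land at roughly 0.2, in line with the small effects described in the summary.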
5.3 Avatar Identification as a Mediator
H2: Avatar identification will mediate more positive need satisfaction, intrinsic motivation, and
performance.
From Table 5, we can see that voice similarity leads to higher similarity identification (a) and
that higher similarity identification was subsequently related to higher performance, time spent,
need satisfaction, interest/enjoyment, and value/usefulness (b). A 95% bias-corrected confidence
interval based on 10,000 bootstrap samples indicates that the indirect effects (ab) are also significant.
Therefore, we conclude that similarity identification significantly mediates the relationship between
voice similarity and performance, time spent, need satisfaction, and intrinsic motivation.
Embodied identification was related to higher performance, time spent, need satisfaction,
interest/enjoyment, effort/importance, and value/usefulness. Wishful identification was related to
higher need satisfaction, interest/enjoyment, and value/usefulness. Indirect effects for embodied
and wishful identification were non-significant.
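For a linear mediation model estimated by ordinary least squares on a single sample, the total effect decomposes as c = c’ + ab. The sketch below uses that identity as a consistency check on the similarity identification values reported in Table 5; the tolerance is an assumption chosen to allow for rounding to three decimals.

```python
# outcome: (c_prime, c, ab), copied from the similarity identification block of Table 5.
SIMILARITY_ROWS = {
    "Puzzles": (0.959, 1.114, 0.155),
    "Hints": (1.555, 1.887, 0.332),
    "Time Spent": (135.1, 156.0, 20.90),
    "Competence": (0.226, 0.362, 0.136),
    "Autonomy": (-0.025, 0.126, 0.151),
    "Relatedness": (0.176, 0.340, 0.164),
    "Immersion": (0.186, 0.365, 0.179),
    "Controls": (-0.123, -0.015, 0.108),
    "Enjoyment": (-0.004, 0.135, 0.139),
    "Effort": (0.094, 0.102, 0.008),
    "Tension": (0.063, 0.048, -0.016),
    "Usefulness": (-0.050, 0.090, 0.140),
}

def implied_indirect(c_total, c_direct):
    """Indirect effect implied by the decomposition c = c' + ab."""
    return c_total - c_direct

# Largest gap between the implied and the reported indirect effect across rows.
max_gap = max(
    abs(implied_indirect(c, c_prime) - ab)
    for c_prime, c, ab in SIMILARITY_ROWS.values()
)
```

Every row agrees with the reported ab to within rounding, which supports reading the columns as c’, c, and ab in that order.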
5.4 Effects of Voice Modulation
H3: Consistent with gender stereotypes in STEM, voice modulation upwards/downwards will have a
negative/positive effect, respectively, on avatar identification, need satisfaction, intrinsic motivation,
and performance.
Row     Stat  (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)    (12)    (13)    (14)    (15)
Pitch - M     6.63    13.68   876.4   3.02    3.06    2.58    4.62    4.38    3.42    3.80    4.90    4.68    5.36    3.10    4.60
        SD    5.63    13.13   765.8   1.05    1.05    1.08    1.53    1.61    1.59    1.60    1.47    1.38    1.23    1.49    1.58
Pitch   M     6.53    13.60   820.4   2.99    3.08    2.56    4.69    4.50    3.32    3.86    5.18    4.87    5.55    2.90    4.84
        SD    5.39    12.53   766.3   0.98    1.09    1.07    1.58    1.54    1.56    1.56    1.46    1.48    1.29    1.39    1.57
Pitch   M     6.56    13.37   809.0   2.92    3.01    2.59    4.53    4.20    3.32    3.64    4.82    4.55    5.22    3.04    4.45
        SD    5.39    13.14   712.5   1.03    1.08    1.08    1.49    1.41    1.55    1.60    1.45    1.54    1.22    1.36    1.58
Pitch - M     4.45    11.61   693.1   2.59    2.74    2.39    3.89    3.69    3.06    3.63    4.12    4.51    5.55    4.11    4.30
        SD    5.07    11.84   644.6   1.06    1.18    1.07    1.73    1.80    1.50    1.65    1.72    1.73    1.28    1.76    1.66
Pitch   M     4.30    11.21   820.7   2.58    2.73    2.23    3.57    3.84    2.94    3.25    4.16    4.53    5.59    3.71    4.56
        SD    5.11    12.81   940.3   0.92    1.15    0.93    1.67    1.63    1.68    1.54    1.58    1.59    1.27    1.68    1.60
Pitch   M     5.18    12.15   845.1   2.64    2.76    2.47    4.00    4.18    3.22    3.67    4.32    4.68    5.49    4.23    4.62
        SD    5.74    13.34   1039.4  1.09    1.16    1.04    1.68    1.61    1.59    1.66    1.79    1.57    1.37    1.73    1.75
Main Effect Gender
        F     17.035  2.915   0.512   18.217  10.353  5.359   34.152  11.391  4.289   3.345   33.320  0.915   2.394   59.511  0.949
        p     <0.001  0.088   0.474   <0.001  <0.005  <0.05   <0.001  <0.001  <0.05   0.068   <0.001  0.339   0.122   <0.001  0.330
        ηp²   0.026   0.004   0.001   0.027   0.016   0.008   0.050   0.017   0.007   0.005   0.049   0.001   0.004   0.084   0.001
Main Effect Pitch
        F     0.352   0.035   0.149   0.033   0.021   0.737   0.379   0.497   0.393   0.435   0.460   0.234   1.331   2.501   1.079
        p     0.703   0.966   0.861   0.968   0.980   0.479   0.684   0.609   0.675   0.648   0.632   0.792   0.265   0.083   0.340
        ηp²   0.001   0.000   0.000   0.000   0.000   0.002   0.001   0.002   0.001   0.001   0.001   0.001   0.004   0.008   0.003
Interaction Effect
        F     0.357   0.104   0.996   0.296   0.104   0.466   1.556   2.818   0.451   1.854   1.271   1.115   0.359   0.727   1.308
        p     0.700   0.901   0.370   0.744   0.901   0.628   0.212   0.060   0.637   0.157   0.281   0.328   0.698   0.484   0.271
        ηp²   0.001   0.000   0.003   0.001   0.000   0.001   0.005   0.009   0.001   0.006   0.004   0.003   0.001   0.002   0.004
Gender df=1, Pitch df=2, Interaction df=2, Error df=651
(1) Puzzles Comp.  (2) Hints Accessed  (3) Time Spent Sec.  (4) PIS Similarity  (5) PIS Embodied
(6) PIS Wishful  (7) PENS Comp.  (8) PENS Autonomy  (9) PENS Related.  (10) PENS Immersion
(11) PENS Controls  (12) IMI Enjoyment  (13) IMI Effort  (14) IMI Pressure  (15) IMI Value
Table 6. Results for effects of voice modulation (H3). Each condition row shows M and SD.
From Table 6, 2x3 ANOVAs (gender x voice modulation) found main effects of gender for puzzles
completed, similarity identification, embodied identification, wishful identification, competence,
autonomy, relatedness, intuitive controls, and pressure/tension. No main effects of voice modulation
were found. No interaction effects between gender and voice modulation were found. Therefore,
voice modulation had a negligible impact on outcomes.
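The effect sizes in Table 6 can be related to the F statistics through the standard identity for partial eta squared, F·df_effect / (F·df_effect + df_error). The sketch below assumes the reported effect sizes are partial eta squared (the symbol is not fully legible in the table) and checks a few entries against that identity.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

DF_ERROR = 651  # error degrees of freedom reported with Table 6

# (effect, outcome, F, df_effect, reported effect size), copied from Table 6.
REPORTED = [
    ("gender", "Puzzles Comp.", 17.035, 1, 0.026),
    ("gender", "PENS Comp.", 34.152, 1, 0.050),
    ("gender", "IMI Pressure", 59.511, 1, 0.084),
    ("pitch", "IMI Pressure", 2.501, 2, 0.008),
    ("interaction", "PENS Autonomy", 2.818, 2, 0.009),
]

gaps = [
    abs(partial_eta_squared(f_value, df_effect, DF_ERROR) - reported)
    for _, _, f_value, df_effect, reported in REPORTED
]
```

The recomputed values match the table to three decimals, and even the largest gender effect (pressure/tension) accounts for under a tenth of the variance, while the pitch and interaction effects are all below 0.01.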
Existing literature has shown that an avatar’s visual appearance affects its user [ ]. Such effects
are moderated by how much we identify with the avatar. This identification can be increased
through visual avatar customization. However, it remains unclear whether the audial aspects of an
avatar influence identification and other outcomes.
Here, we conducted a 2 x 3 (voice similarity x voice modulation) experiment with neural network
voice cloning. Higher voice similarity directly increases game performance, time spent, similarity
identification, competence, relatedness, and immersion. Mediation analysis found that similarity
identification (M) mediates between voice similarity (X) and performance, time spent, need
satisfaction, and intrinsic motivation (Y). Therefore, avatar voice influences crucial PX outcomes.
Surprisingly, pitch shifting had no significant effect. Although studies have shown that
manipulating pitch by 20 Hz alters attractiveness of a voice [ ], our measurement instruments
did not focus on attractiveness and were instead geared towards PX and performance outcomes.
Furthermore, our study takes place during gameplay, not an environment where the player’s sole
focus is on evaluating the audio being presented. The player’s focus is instead divided between
cognitive processes (e.g., visually interpreting the scene and learning gameplay). Therefore, voice
modulation may have been too subtle for observable effects. Larger modulations (e.g., 40 Hz) should
be considered in future research on the potential for avatar voices to influence stereotype effects.
For the similar voice condition, we see a significant increase in performance, but also a (non-significant) increase in hints
accessed (see Table 4). The mean increase in puzzles completed is ~1, while the mean increase in hints accessed is ~2. Each
puzzle contains three hints, with the first two designed to guide the player towards the answer (e.g., “Look around the
environment.”) and the last hint providing the answer. Therefore, the increase in hints accessed alone does not explain the
performance increase. We interpret both the increased performance and the increased time spent as behavioral indicators of
increased motivation to engage in and learn from the game.
6.1 Applications to Games
Although the amount of dialogue in our game can be considered minimal compared to other games,
such as Mass Effect [ ], even this small amount of player-similar audio significantly promoted
gameplay performance and influenced PX. This has significant implications for audio in games.
Game companies can create more engaging experiences through similar voice audio, leading to
greater commercial success. Similarly, games that promote health (e.g., exercise [ ]), learning
(e.g., educational games [ ]), and discovery (e.g., citizen science [ ]) could benefit from increased
engagement. Engagement can translate to better habits, greater learning gains, and increased
scientific discoveries. Our results show that increasing similarity identification through higher
voice similarity results in increased need satisfaction, intrinsic motivation, and motivated behavior.
These outcomes are important across virtually all games.
6.2 Broader Voice Applications
Virtual environments more generally that contain voiced characters could also benefit from voice
similarity. For example, consider an intelligent agent, designed for math learning, whose voice
resembles the user’s. An intelligent agent perceived as being similar could improve learning
outcomes [ ]. Applications for learning a new language could similarly benefit users,
as hearing how one’s own voice should sound could help users more easily imitate speech. Or
consider VR oil rig safety training where the narrator’s voice resembles the trainee’s. A similar
narrator could lead to more engaged and immersive training.
Many real-world devices incorporate voice assistants such as Siri [ ], Cortana [ ], Google
Assistant [ ], and Alexa [ ], and these are increasingly prevalent in homes, cars, and mobile
devices. Although the effect of voice similarity with these assistants has not been studied directly,
the present findings suggest that increasing voice similarity would lead to more positive interactions
with such voice assistants. More research is needed on the extensive number of potential use cases
for voice similarity.
6.3 Audio Customization
While we demonstrated these results in a controlled lab experiment, players will likely experience
even greater similarity identification and affected outcomes in realistic volitional play contexts
where players engage with their virtual representations over a longer period of time. For
instance, research suggests that over time we become more congruent with our virtual identities
[ ]. Of the types of identification measured (similarity, embodied, and wishful), only
similarity was affected. While this was expected given the manipulation of voice similarity, the avatar
customization process has been shown to increase similarity, embodied, and
wishful identification [ ]. For example, the options during avatar customization allow players to
create not only themselves but an ideal that they would like to become [ ]. This leads us to believe
that customization of avatar audio, similar to customization of an avatar’s visual appearance, would
be beneficial for fostering avatar identification.
Although still not common, some games allow for customization of avatar audio. Games such
as Final Fantasy XIV [ ], Saints Row IV [ ], and Monster Hunter: World [ ] allow for
selection of different pre-created collections of voice audio. Other games allow the player to directly
manipulate the voice itself. Black Desert Online [ ] and Red Dead Redemption 2 [ ] both allow
for customization of pitch, with the latter introducing an additional “clarity” parameter. The Sims
4 [ ] allows pitch adjustment and choosing between ‘sweet,’ ‘melodic,’ and ‘lilted’ for women,
and between ‘clear,’ ‘warm,’ and ‘brash’ for men. However, more extensive audio customization in
games does not currently exist. With these limited parameters, a self-similar voice is not possible
in most circumstances.
Nevertheless, more complex avatar audio customization could be highly beneficial. Allowing users
to create similar (and perhaps embodied and wishful, as is possible with visual avatar customization)
audial identities gives rise to new possibilities for identification (possibly leading to stronger
emotional attachments [21]), thereby enhancing a wide range of PX outcomes.
6.4 Behavioral Influence
This line of research on audial avatar identities is also relevant to the Proteus effect, the phenomenon
that avatar users tend to conform behaviorally to the identity characteristics that they associate
with their avatars [ ]. This phenomenon has been studied extensively with respect to avatar
appearance [ ], but not with respect to avatar voice characteristics. Just as taller avatars lead to
more aggressive negotiation [ ], healthier-weight avatars lead to more physical activity [ ],
and inventor-looking avatars lead to more creative brainstorming [ ], an avatar that sounds more
confident, healthy, and creative could also cause enactments of those attributes. Future research on
the Proteus effect could use the methods adopted in the present study to confirm these expectations.
Controlled experiments with random assignment are considered robust. However, compensating
participants to play a game in a controlled lab setting is fundamentally different from playing
of one’s own volition. Future studies should seek to understand whether these results extend to
voluntary play.
As our study design was relatively complex, there was an inherent degree of randomness in
our conditions. For example, the three conditions that were aggregated into the similar voice
condition had slight degrees of dissimilarity due to the pitch modulation. Similarly, the dissimilar
voice was cloned from a random corpus of 10 participant voices and was also aggregated with
pitch-modulated versions. Nevertheless, these comparisons can be performed given the large
sample size and the manipulation check. For example, Table 2 validates that pitch-modulated
voices did not differ greatly in similarity from their unmodulated counterparts. That being said, it
is important to note that technically, while the similar voice was viewed as having higher than
average similarity with the player (~4.23), this cannot be considered to be truly a very similar voice.
This is mostly a technological constraint in that the state-of-the-art in voice cloning is currently
unable to consistently generate very similar voices across all speakers. Future studies might address
this through collecting a larger corpus of participant audio to train deep learning algorithms to
create an even better matching voice prior to conducting the experiment. Moreover, the effect sizes
of our results fall in the small range. Nonetheless, this study has successfully compared a voice
that sounds more like the player to a voice that sounds less like the player, illustrating significant
differences in PX. The implications of such results can be of value to the HCI community more
broadly, as audio is often understudied in comparison to visual aspects of games and other systems.
This study used a single, education-oriented game that was designed for research purposes.
Hence, generalizability was not established for the types of games or media applications that are
used more commonly, such as entertainment-oriented action games or mobile phone operating
systems. The influence of voice similarity may depend on facets of the media design (e.g., pacing,
opportunities for voice-based interaction) as well as user orientations toward the media (e.g.,
playing for fun or to learn). Future research could examine such factors as moderators of the
effects of voice similarity on PX.
The Eects of a Self-Similar Avatar Voice in Educational Games 238:19
This research was designed to examine voiced avatars that speak for the avatar user, presumably
within single-player games or applications. However, many multi-user applications offer voice-based
communication [ ], which enhances user experiences and social trust [ ], although
users rarely actually hear their own voices. That said, previous research suggests that when
gender is communicated through voices in online games, women are more likely to receive toxic
treatment [ ], which potentially triggers stereotype threat and causes psychological harm
[ ]. Although the present research did not find any differences in stereotype-related outcomes due
to voice pitch modulation, the findings do suggest that user voices are malleable, just like the visual
characteristics of avatars. Technologies are currently available to consumers that facilitate voice
modification in multi-user games and other applications (e.g., [ ]), offering the potential to
switch genders or even species. Future research could use such tools to examine voice avatars and
stereotype effects in multi-user voice-communication contexts, e.g., social VR [68].
There were aspects of the experiment that were not entirely under our control. The quality of
the microphone and audio, for example, depends on what devices are owned by the participant.
However, using participants’ own devices increases ecological validity as this is more typical of
how a person would play a game compared to a lab. Other aspects, however, could have also played
a role in the experiment. For example, we performed an audio check to ensure participants could
hear audio at the beginning of the experiment, and we additionally recorded participants’ system
audio level whenever a voice line was triggered, but we had no control over the specific volume
being used or whether they were really listening (e.g., putting their headphones down on the table).
Our research on voice pitch and stereotypes is based on decades of work on evolutionary
behavior. There are common associations between voice pitch and masculinity, femininity, and
dominance, and these associations exist across animal species and nonhuman primates [ ].
Furthermore, the “universality of voice pitch sexual dimorphism” has led researchers to argue that
such associations are expected to hold across cultures [ ]. Nevertheless, this should not be taken
for granted and such studies should be replicated in non-U.S. contexts.
One aspect not directly studied is the degree of similarity. For example, with too little similarity,
there may be no effect; too much similarity and it may be strange (e.g., an audial analogue to the
uncanny valley, which refers to revulsion for nearly human-looking avatars [ ]). Similarly, there
are ethical concerns that need to be explored prior to broadly deploying voice manipulation. A
recent workshop hosted by the U.S. Federal Trade Commission (FTC) discussed both the risks
and benefits of voice cloning [ ]. Risks include fraud and harassment, while benefits include
synthesizing voices for those suffering from amyotrophic lateral sclerosis (ALS), Huntington’s
disease, and autism. Nevertheless, the full implications of voice cloning are still unfolding.
Avatar identification is a topic of extensive research. Despite widespread acknowledgment of how
avatar identification benefits users, existing studies have focused on visual appearance of avatars.
We presented one of the first studies to date on avatar self-similar audio. Higher voice similarity
leads to a significant increase in performance, time spent, similarity identification, competence,
relatedness, and immersion. Similarity identification acts as a significant mediator variable between
voice similarity and performance, time spent, need satisfaction, and intrinsic motivation. We
discussed the wide-ranging implications of these results for games and beyond. This study is an
important step towards understanding voice audio effects.