2-Sample t-Test

WWW.MINITAB.COM

MINITAB ASSISTANT WHITE PAPER

This paper explains the research conducted by Minitab statisticians to develop the methods and

data checks used in the Assistant in Minitab Statistical Software.

2-Sample t-Test

Overview

A 2-sample t-test can be used to compare whether two independent groups differ. This test is

derived under the assumptions that both populations are normally distributed and have equal

variances. Although the assumption of normality is not critical (Pearson, 1931; Barlett, 1935;

Geary, 1947), the assumption of equal variances is critical if the sample sizes are markedly

different (Welch, 1937; Horsnell, 1953).

Some practitioners first perform a preliminary test to evaluate equal variances before they

perform the classical 2-sample t procedure. This approach has serious drawbacks, however,

because these variance tests are subject to important assumptions and limitations. For example,

many tests for equal variances, such as the classical F-test, are sensitive to departures from

normality. Other tests that do not rely on the assumption of normality, such as Levene/Brown-

Forsythe, have low power to detect a difference between variances.

B.L. Welch developed an approximation method for comparing the means of two independent

normal populations when their variances are not necessarily equal (Welch, 1947). Because

Welch’s modified t-test is not derived under the assumption of equal variances, it allows users to

compare the means of two populations without first having to test for equal variances.

In this paper, we compare Welch’s modified t method with the classical 2-sample t procedure

and determine which procedure is the most reliable. We also describe the following data checks

that are automatically performed and displayed in the Assistant Report Card and explain how

they affect the results of the analysis:

 Normality

 Unusual data

 Sample size

2-SAMPLE t-TEST 2

2-sample t-test method

Classical 2- -test

If data come from two normal populations with the same variances, the classical 2-sample t-test

is as powerful or more powerful than Welch’s t-test. The normality assumption is not critical for

the classical procedure (Pearson, 1931; Barlett, 1935; Geary, 1947), but the equal-variance

assumption is important to ensure valid results. More specifically, the classical procedure is

sensitive to the assumption of equal variances when the sample sizes differ regardless of how

large the samples are (Welch, 1937; Horsnell, 1953). In practice, however, the equal variance

assumption rarely holds true, which can lead to higher Type I error rates. Therefore, if the

classical 2-sample t-test is used when two samples have different variances, the test is more

likely to produce incorrect results.

Welch’s t-test is a viable alternative to the classical t-test because it does not assume equal

variances and therefore is insensitive to unequal variances for all sample sizes. However, Welch’s

t-test is approximation-based and its performance in small sample sizes may be questionable.

We wanted to determine whether Welch’s t-test or the classical 2-sample t-test is the most

reliable and practical test to use in the Assistant.

Objective

We wanted to determine, through simulation studies and theoretical derivations, whether

Welch’s t-test or the classical 2-sample t-test is more reliable. More specifically, we want to

examine:

 The Type I and Type II error rates of both the classical 2-sample t-test and Welch’s t-test

at various sample sizes when the data are normally distributed and the variances are

equal.

 The Type I and Type II error rates of Welch’s t-test for unbalanced and unequal-variance

designs for which the classical 2-sample t-test fails.

Method

Our simulations focused on three areas:

 We compared simulated test results of the classical 2-sample t-test and Welch’s t-test

under various model assumptions, including normality, nonnormality, equal variances,

unequal variances, balanced, and unbalanced designs. For more details, see Appendix A.

 We derived the power function for Welch’s t-test and compared it with the power

function of the classical 2-sample t-test. For more details, see Appendix B.

 We studied the impact of nonnormality on the theoretical power function of Welch’s

t-test.

2-SAMPLE t-TEST 3

Results

When the assumptions for the classical 2-sample t model hold, Welch’s t-test performs as well

or nearly as well as the classical 2-sample t-test except for small unbalanced designs. However,

the classical 2-sample t-test may also perform poorly when designs are small and unbalanced,

due to its sensitivity to the equal variance assumption. Moreover, in practical settings, it is

difficult to establish that two populations have exactly the same variance. Therefore, the

theoretical superiority of the classical 2-sample test over Welch’s t-test has a little or no practical

value. For this reason, the Assistant uses Welch’s t-test to compare the means of two

populations. For the detailed simulation results, see Appendices A, B, and C.

2-SAMPLE t-TEST 4

Data checks

Normality

Welch’s t-test, the method used in the Assistant to compare the means of two independent

populations, is derived under the assumption that the populations are normally distributed.

Fortunately, even when data are not normally distributed, Welch’s t-test works well if the

samples are large enough.

Objective

We wanted to determine how closely the simulated levels of significance for the Welch method

and the classical 2-sample t-test matched the target level of significance (Type I error rate) of

0.05.

Method

We performed simulations of Welch’s t-test and the classical 2-sample t-test on 10,000 pairs of

independent samples generated from normal, skewed, and contaminated normal (equal and

unequal variances) populations. The samples were of various sizes. The normal population

serves as a control population for comparison purposes. For each condition, we calculated the

simulated significance levels and compared them with the target, or nominal, significance level

of 0.05. If the test performs well, the simulated significance levels should be close to 0.05.

Results

For moderate or large samples, Welch’s t-test maintains its Type I error rates for normal as well

as nonnormal data. The simulated significance levels are close to the targeted significance level

when both sample sizes are at least 15. See Appendix A for more details.

Because the test performs well with relatively small samples, the Assistant does not test the data

for normality. Instead, it checks the size of the samples and displays the following status

indicators in the Report Card:

Status

Condition

Both sample sizes are at least 15; normality is not an issue.

At least one of the sample sizes < 15; normality may be an issue.

2-SAMPLE t-TEST 5

Unusual data

Unusual data are extremely large or small data values, also known as outliers. Unusual data can

have a strong influence on the results of the analysis. When the sample is small, they can affect

the chances of finding statistically significant results. Unusual data can indicate problems with

data collection or unusual behavior of a process. Therefore, these data points are often worth

investigating and should be corrected when possible.

Objective

We wanted to develop a method to check for data values that are very large or very small

relative to the overall sample and that may affect the results of the analysis

Method

We developed a method to check for unusual data based on the method described by Hoaglin,

Iglewicz, and Tukey (1986) to identify outliers in boxplots.

Results

The Assistant identifies a data point as unusual if it is more than 1.5 times the interquartile range

beyond the lower or upper quartile of the distribution. The lower and upper quartiles are the

and 75

percentiles of the data. The interquartile range is the difference between the two

quartiles. This method works well even when there are multiple outliers because it makes it

possible to detect each specific outlier.

Outliers tend to have an influence on the power function only when the sample sizes are very

small. In general, when outliers are present the observed power values tend to be a bit higher

than the targeted theoretical power values. This pattern can be seen in Figure 10 in Appendix C

where the simulated and theoretical power curves are not reasonably close until the minimum

sample size reaches 15.

When checking for unusual data, the Assistant Report Card for the 2-sample t-test displays the

following status indicators:

Status

Condition

There are no unusual data points.

At least one data point is unusual and may affect the test results.

2-SAMPLE t-TEST 6

Sample size

Typically, a hypothesis test is performed to gather evidence to reject the null hypothesis of “no

difference”. If the samples are too small, the power of the test may not be adequate to detect a

difference between the means when one actually exists, which results in a Type II error. It is

therefore crucial to ensure that the sample sizes are sufficiently large to detect practically

important differences with high probability.

Objective

If the current data does not provide sufficient evidence against the null hypothesis, we want to

determine if the sample sizes are large enough for the test to detect practical differences of

interest with high probability. Although the objective of sample size planning is to ensure that

the samples are large enough to detect important differences with high probability, the samples

should not be so large that meaningless differences become statistically significant with high

probability.

Method

The power and sample size analysis is based upon the theoretical power function of the specific

test that is used to perform the statistical analysis. For the Welch t-test, this power function

depends upon the sample sizes, the difference between the two population means, and the true

variances of the two populations. For more details, see Appendix B.

Results

When the data does not provide enough evidence against the null hypothesis, the Assistant

calculates practical differences that can be detected with an 80% and a 90% probability for the

given sample sizes. In addition, if the user provides a practical difference of interest, the

Assistant calculates the sample sizes that yield an 80% and a 90% chance of detecting the

difference.

There is no general result to report because the results depend on the user’s specific samples.

However, you can refer to Appendices B and C for more information about Welch’s test power

function.

When checking for power and sample size, the Assistant Report Card for the 2-sample t-test

displays the following status indicators:

Status

Condition

The test finds a difference between the means, so power is not an issue.

Power is sufficient. The test did not find a difference between the means, but the sample is large

enough to provide at least a 90% chance of detecting the given difference.

2-SAMPLE t-TEST 7

Status

Condition

Power may be sufficient. The test did not find a difference between the means, but the sample is

large enough to provide an 80% to 90% chance of detecting the given difference. The sample size

required to achieve 90% power is reported.

Power might not be sufficient. The test did not find a difference between the means, and the

sample is large enough to provide a 60% to 80% chance of detecting the given difference. The

sample sizes required to achieve 80% power and 90% power are reported.

The power is not sufficient. The test did not find a difference between the means, and the sample is

not large enough to provide at least a 60% chance of detecting the given difference. The sample

sizes required to achieve 80% power and 90% power are reported.

The test did not find a difference between the means. You did not specify a practical difference

between the means to detect; therefore, the report indicates the differences that you could detect

with 80% and 90% chance, based on your sample sizes, standard deviations, and alpha.

2-SAMPLE t-TEST 8

References

Arnold, S. F. (1990). Mathematical Statistics. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Aspin, A. A. (1949). Tables for Use in Comparisons whose Accuracy Involves Two Variances,

Separately Estimated, Biometrika, 36, 290-296.

Bartlett, M. S. (1935). The effect of non-normality on the t-distribution. Proceedings of the

Cambridge Philosophical Society, 31, 223-231.

Box, G. E. P. (1953). Non-normality and Tests on Variances, Biometrika, 40, 318-335.

Geary, R. C. (1947). Testing for Normality, Biometrika, 34, 209-242.

Hoaglin, D. C., Iglewicz, B., and Tukey, J. W. (1986). Performance of Some Resistant Rules for

Outlier Labeling. Journal of the American Statistical Association, 81, 991-999.

Horsnell, G. (1953). The effect of unequal group variances on the F test for homogeneity of

group means. Biometrika, 40, 128-136.

James, G. S. (1951). The comparison of several groups of observations when the ratios of the

populations variances are unknown, Biometrika, 38, 324-329.

Kulinskaya, E. Staudte, R. G. and Gao, H. (2003). Power Approximations in Testing for unequal

Means in a One-Way Anova Weighted for Unequal Variances, Communication in Statistics,

32(12), 2353-2371.

Lehmann, E. L. (1959). Testing statistical hypotheses. New York, NY: Wiley.

Neyman, J., Iwaszkiewicz, K. & Kolodziejczyk, S. (1935). Statistical problems in agricultural

experimentation, Journal of the Royal Statistical Society, Series B, 2, 107-180.

Pearson, E. S. (1931). The Analysis of variance in case of non-normal variation, Biometrika, 23,

114-133.

Pearson, E.S. & Hartley, H.O. (Eds.). (1954). Biometrika Tables for Statisticians, Vol. I. London:

Cambridge University Press.

Srivastava, A. B. L. (1958). Effect of non-normality on the power function of t-test, Biometrika, 45,

421-429.

Welch, B. L. (1951). On the comparison of several mean values: an alternative approach.

Biometrika, 38, 330-336.

Welch, B. L. (1947). The generalization of “Student’s” problem when several different population

variances are involved. Biometrika, 34, 28-35.

Welch, B. L. (1938). The significance of the difference between two means when the population

variances are unequal, Biometrika, 29, 350-362.

Wolfram, S. (1999). The Mathematica Book (4th ed.). Champaign, IL: Wolfram Media/Cambridge

University Press.

2-SAMPLE t-TEST 9

Appendix A: Impact of nonnormality

and heterogeneity on the classical

2-sample t-test and the Welch t-test

We conducted several simulation studies designed to compare the classical 2-sample t-test and

the Welch’s t-test under different model assumptions.

Simulation study A

We performed the study in three parts:

 In the first part of the study, we explored the sensitivity of the classical 2-sample t-test

and Welch’s t-test to the assumption of equal variance when the normality assumption

holds true. Two samples were generated from two independent normal populations. The

first sample, the base sample, was drawn from a normal population with mean 0 and

standard deviation 



, . The second sample was also drawn from a normal

population with mean 0, but with the standard deviation 



chosen so that the ratio 









0.5, 1.0, 1.5 and 2. In other words, the second samples were drawn from the

populations , , , and , respectively. In addition, the base

sample size in each case was fixed at 



 and for each given 



, the second

sample size, 



, was chosen so that the ratio of sample sizes, 







,was

approximately equal to 0.5, 1, 1.5, and 2.0.

For each of these 2-sample designs, we generated 10,000 pairs of independent samples

from the respective populations. Then we performed the classical 2-sample

t-test and Welch’s t-test on each of the 10,000 pairs of samples to test the null

hypothesis of no difference between the means. Because the true difference between the

means is null, the fraction of the 10,000 replicates for which the null hypothesis is

rejected represents the simulated level of significance of the test. Because the targeted

significance level for each of the tests is , the simulation error associated with

each test and each experiment is about 0.2%.

 In the second part, we investigated the impact of nonnormality, specifically skewness, on

the simulated significance levels of the two tests. This simulation was set up the same

way as the previous simulation except that the base sample was drawn from the chi-

squared distribution with 2 degrees of freedom, Chi(2) and the second samples were

drawn from other chi-square distributions so that 







takes on the values 0.5, 1.0,

1.5 and 2. The hypothesized difference between the means was set to be the true

difference between the means of the parent populations.

 In the third part, we examined the effect of outliers on the performance of the two t-

tests. For this reason, the two samples were drawn from contaminated normal

2-SAMPLE t-TEST 10

distributions. A contaminated normal population 







is a mixture of two normal

populations: the  population and the normal  population. We define a

contaminated normal distribution as:





















  





where  is the mixing parameter and    is the proportion of contamination or proportion of

outliers. It is easy to show that if  is distributed as 







then its mean is 



 and its

standard deviation is 







    



The base sample was drawn from 







and the second sample was drawn from the

contaminated normal 







. The parameter  was chosen so that the ratio of the standard

deviations of the two (contaminated) populations 







equals 0.5, 1.0, 1.5 and 2, just like in

parts I and II. Because 







 



  



 , this results in choosing ,

respectively. In other words, the second samples were drawn from 







, 















, and 







. We then performed the simulations as described in Part I.

The results of the study are organized in Table 1 and displayed in Figures 1, 2, and 3.

Results and summary

In general, the simulation results support the theoretical results that under the assumption of

normality and equal variances, the classical 2-sample t-test yields significance levels that are

close to the targeted level even when the sample sizes are small. The second column of plots in

Figure 1 displays the simulated significance levels in designs where the variances of the two

normal populations are equal. The simulated significance levels curves based on the classical

2-sample t-test are undistinguishable from the targeted level lines.

The tables below show the simulated significance levels of two-sided tests for both the classical

2-sample t-test and Welch’s t-test, each with  based on pairs of samples generated

from normal population, skewed populations (Chi-square), and contaminated normal

populations. The pairs of samples are from the same family of distribution but the variances of

the respective parent populations are not necessarily equal.

Table 1 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch’s

t-test, each with ) for n = 5.

Base Pop.: N(0,2)

2nd Pop: N(0,





Base Population: Chi(2)

2nd Pop: Chi-square

Base Pop.: CN(.8,4)

2nd Pop.: CN(.8,)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















.035

.050

.079

.105

.058

.042

.078

.113

.031

.036

.035

.034

Welch

.035

.039

.049

.055

.048

.029

.055

.063

.029

.024

.021

.020

2-SAMPLE t-TEST 11

Base Pop.: N(0,2)

2nd Pop: N(0,





Base Population: Chi(2)

2nd Pop: Chi-square

Base Pop.: CN(.8,4)

2nd Pop.: CN(.8,)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















1.0

.061

.052

.054

.058

.086

.036

.054

.064

.035

.031

.025

.023

Welch

.048

.042

.044

.047

.066

.021

.040

.050

.027

.023

.018

.016

1.6

.096

.048

.033

.027

.133

.041

.033

.032

.059

.037

.029

.024

Welch

.050

.045

.043

.042

.094

.034

.032

.041

.034

.029

.026

.022

2.0

.118

.055

.034

.025

.139

.041

.028

.024

.073

.041

.028

.023

Welch

.052

.051

.050

.051

.097

.041

.033

.042

.035

.032

.028

.025

Table 2 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch’s

t-test, each with ) for n = 10

Base Pop.: N(0,2)

2nd Pop: N(0,





Base Population: Chi(2)

2nd Pop: Chi-square

Base Pop.: CN(.8,4)

2nd Pop.: CN(.8,)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















.020

.050

.081

.112

.039

.044

.091

.123

.021

.035

.045

.047

Welch

.046

.048

.050

.043

.047

.067

.063

.034

.028

.022

.019

1.0

.057

.051

.053

.055

.068

.044

.053

.054

.043

.042

.037

.032

Welch

.051

.049

.062

.037

.046

.049

.039

.038

.032

.027

1.5

.088

.048

.034

.029

.100

.043

.032

.064

.040

.028

.021

Welch

.050

.048

.047

.048

.074

.044

.041

.046

.035

.037

.035

.031

.110

.048

.026

.019

.133

.042

.026

.022

.093

.046

.029

.019

Welch

.048

.047

.045

.046

.083

.050

.044

.049

.036

.039

.040

.038

2-SAMPLE t-TEST 12

Table 3 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch’s

t-test, each with ) for n = 15











  













1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















.53

.021

.050

.083

.110

.036

.041

.089

.114

.022

.044

.056

.062

Welch

.050

.051

.050

.047

.049

.067

.062

.044

.036

.027

.022

1.0

.049

.047

.050

.053

.064

.046

.051

.061

.045

.041

.037

Welch

.045

.046

.049

.048

.060

.042

.048

.057

.042

.043

.039

.033

1.53

.081

.049

.033

.028

.103

.042

.036

.030

.075

.048

.033

.024

Welch

.048

.049

.048

.050

.071

.042

.048

.050

.042

.045

.044

.041

2.0

.111

.050

.028

.018

.123

.049

.027

.020

.100

.046

.025

.016

Welch

.049

.051

.053

.074

.056

.045

.047

.039

.044

.042

.040

Table 4 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch’s

t-test, each with ) for n = 20











  













1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















.019

.052

.087

.115

.028

.048

.087

.119

.021

.048

.067

.079

Welch

.050

.054

.053

.044

.054

.061

.048

.042

.035

.028

1.0

.048

.049

.052

.053

.057

.046

.052

.056

.049

.044

.042

.040

Welch

.045

.049

.051

.050

.055

.044

.050

.052

.047

.042

.040

.037

2-SAMPLE t-TEST 13











  













1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0













Meth.



















1.5

.086

.054

.039

.032

.098

.047

.035

.033

.075

.047

.033

.022

Welch

.054

.053

.052

.068

.047

.051

.053

.041

.043

.044

.042

2.0

.107

.049

.026

.016

.123

.046

.027

.019

.107

.047

.026

.016

Welch

.048

.049

.046

.047

.070

.054

.046

.045

.044

.043

.042

Figure 1 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch’s

t-test, each with ) based on pairs of samples generated from two normal populations

with equal or unequal variances plotted against ratio of sample sizes.

The simulation results show that for relatively small samples the classical 2-sample t-test is

robust against non-normality but is sensitive to the assumption of equal variances, unless the

two-sample design is nearly balanced. This is graphically shown in Figures 1, 2, and 3. The

simulated significance level curves based on the classical 2-sample t-test cross the targeted level

2-SAMPLE t-TEST 14

line at the point where the ratio of sample sizes is 1.0, even when the variances are very

different. For all three families of distributions (normal, chi-square, and contaminated normal

populations), if the sample sizes are different, the simulated significance levels of the classical

2-sample t-test are near the targeted level only when the variances are equal. This is depicted

on the second column of plots in each of the figures 1, 2, and 3.

The performance of the classical t-test is undesirable when the design is unbalanced and the

variances are unequal. Even small disparities between the variances are troublesome. For those

unequal-variance, unbalanced designs, normality of the data does not improve the simulated

significance levels. In fact, the simulated significance levels fall away from the targeted level as

the sample sizes increase regardless of the parent population. When the larger sample is drawn

from the population with larger variance, the simulated significance levels are smaller than the

targeted level. When the larger samples are drawn from the population with smaller variance,

the simulated levels are larger than the targeted levels. Arnold (1990, page 372) makes a similar

remark when examining the asymptotic distribution of the classical 2-sample t-test statistic

under the unequal-variance assumption.

The Welch 2-sample t-test, on the other hand, is impervious to departures from the equal-

variance assumption, as illustrated in Figures 1, 2, and 3. This is not surprising since the Welch

t-test is not derived under the assumption of equal variances. The normal assumption from

which Welch’s t-test is derived appears to be important only when the minimum of the two

samples sizes is very small. For larger samples, however, the test becomes immune to departures

from the normality assumption. This is illustrated in Figures 2 and 3 where the simulated

significance levels remain consistently close to the targeted level when the minimum size of the

two samples is 15. When both samples are generated from the chi-square distribution with 2

degrees of freedom and the size of both samples is 15, the simulated significance level is 0.042

(see Table 3).

Outliers also do not appear to affect the performance of Welch’s t-test when the minimum size

of the two samples is large enough. Table 3 and Figure 3 show that when the minimum size of

the two samples is at least 15, then the simulated significance levels are near the targeted level

(the simulated significance levels are 0.045, 0.045, 0.041, 0.037 when the ratio of standard

deviations are 0.5, 1.0, 1.5 and 2.0 respectively).

These results show that for most practical purposes, the Welch 2-sample t-test performs better

than the classical 2-sample t-test in terms of its simulated significance levels or Type I error rate.

2-SAMPLE t-TEST 15

Figure 2 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch

t-test) based on pairs of samples generated from two normal populations with equal or unequal

variances plotted against ratio of sample sizes.

2-SAMPLE t-TEST 16

Figure 3 Simulated significance levels of two-sided tests (classical 2-sample t-test and Welch

t-test) based on pairs of samples generated from two normal populations with equal or unequal

variances plotted against ratio of sample sizes.

2-SAMPLE t-TEST 17

Appendix B: Comparison of the

power functions of the two tests

We wanted to determine the conditions under which power function for Welch’s t-test may be

equal or approximately equal to the power function for the classical 2-sample t-test.

In general, the power functions of t-tests (1-sample or 2-sample) are well known and discussed

in many publications (Pearson and Hartley, 1952; Neyman et al., 1935; Srivastava, 1958). The

following theorem states the power function for each of the three different alternative

hypotheses in two-sample designs.

THEOREM B1

Under the assumptions of normality and the equality of variances, the power function of a

two-sided two-sample t-test which has nominal size may be expressed as a function of the

sample sizes and the difference 



 



















  















  

















where 







 is the C.D.F of the non-central t distribution with 







 



  degrees of

freedom and non-centrality parameter













 



Moreover, the power function associated with the alternative hypothesis 







is given as

















  



















On the other hand, when testing against the alternative 







the power is expressed as





































Although the result in the above theorem is well known, the power function of the test based

upon Welch’s modified t-test has not been specifically discussed in the literature. An

approximation can be deduced from the approximate power function given for the one-way

ANOVA model (see Kulinskaya et. al, 2003). Unfortunately, this power function is only applicable

to two-sided alternatives. However, the two-sample design is such a special case that a different

approach can be adopted to obtain the (exact) power function of Welch’s t-test for each of the

three alternatives. These functions are given in the following theorem.

THEOREM B2

Under the assumption that the populations are normally distributed (but not necessarily with

the same variance), the power function a two-sided Welch’s t-test which has nominal size  may

be expressed as a function of the sample sizes and the difference 



 





















  

















  



















2-SAMPLE t-TEST 18

where 



 is the C.D.F of the non-central t distribution with 



degrees of freedom given as

















































 

















 

and non-centrality parameter





















 







For the one-sided alternatives, the power functions are given as



















  





















and









































for testing the null hypothesis against the alternative 







and for testing null hypothesis

against the alternative 







, respectively.

The proof of the result is given in Appendix D.

Before we compare these two power functions, note that because the classical 2-sample t-test is

derived under the additional assumption that the variances of the populations are equal, the

theoretical power functions of the two tests should be compared when this second assumption

holds for Welch’s t-test.

In theory, we know that under the normality and equal variances assumptions,



































for all 









The next result states conditions under which the two functions are (approximately) equal.

THEOREM B3

Under the assumptions of normality and equality of variances we have the following:

1. If 







then 

































for each difference . In particular, if 







then 

































for each difference , so that the Welch t-test is as

powerful as the classical 2-sample t-test.

2. If 



and 



are small and 







then Welch’s t-test has less power than the classical 2-

sample t-test. However, if 



and 



are large then

































(regardless

of the difference between the sample sizes).

The proof of the result is provided in Appendix E.

Under the assumption of equality of variances, the non-centrality parameters associated with

the power functions of the two tests are identical. The difference between the power functions

can only be attributed to the difference between their respective degrees of freedom. From

theory, we know that under the stated assumptions the classical t-test is UMP (uniformly most

2-SAMPLE t-TEST 19

powerful) and therefore it has higher degrees of freedom. The point of the above results,

however, is that if the design is balanced or approximately balanced then the power functions

are identical or approximately identical. The only case where the classical t-test is noticeably

more powerful than the Welch t-test is when the design is markedly unbalanced and the

samples are small. Unfortunately, that also happens to be the case where the classical 2-sample

t-test is particularly sensitive to the assumption of equal variance as shown in Appendix A. As a

result, Welch’s t-test power function is the more reliable function for practical purposes.

We illustrate the results of Theorem B3 through the following example where the two normal

populations have the same standard deviation of 3. Power values based on the (two-sided)

power functions of Theorem B1 and Theorem B2 are calculated under the following four

scenarios:

1. Both samples are small but have the same size (







).

2. Both samples are small but one sample is two times larger than the other sample

( 







).

3. One sample is small and the other sample is moderate in size but the moderate sample

is four times larger than the smaller sample (







).

4. One sample is moderate in size and the other is large but the larger sample is four times

larger than the moderate sample (







).

Assuming that  for both tests, the power functions are evaluated in each scenario at the

difference . The results are displayed in Table 5 and the functions

are plotted in Figure 4.

Table 5 Comparison of the theoretical power functions of two-sided classical 2-sample t-tests

and two-sided Welch t-tests, . The sample sizes, 



and 



, are fixed and the power

functions are evaluated at differences  ranging from 0.0 to 5.0.



0.0

0.5

1.0

1.5

2.0

2.5

3.5

4.5

5.0



























.05

.064

.109

.185

.292

.422

.562

.694

.805

.887

.941



















.05

.064

.109

.185

.292

.422

.562

.694

.805

.887

.941



























.05

.070

.132

.239

.383

.547

.703

.828

.913

.962

.986



















.05

.070

.129

.231

.371

.531

.686

.813

.902

.955

.982



























.05

.075

.152

.283

.455

.637

.791

.899

.959

.986

.996



















.05

.072

.142

.261

.419

.592

.748

.865

.938

.976

.992

2-SAMPLE t-TEST 20



0.0

0.5

1.0

1.5

2.0

2.5

3.5

4.5

5.0



























.05

.182

.556

.883

.987

.999



















.05

.180

.548

.877

.986

.999

Figure 4 Plots of theoretical power functions of two-sided classical 2-sample t-tests and

two-sided Welch’s t-tests versus  the difference between the means to be detected. Both tests

use . The assumed populations are normal with same standard deviation of 3.

Simulation study B

The purpose of this simulation study is to compare power levels associated with the classical

2-sample t-test to power levels associated with Welch’s 2-sample t-test in balanced designs

where the variances are assumed to be unequal. The experiments in these studies are similar to

discussed in Appendix A.

In the first group of experiments, we generated pairs of sample of equal sizes from the normal

populations with unequal variances. The base population was fixed to be  and the second

normal populations were chosen such the ratio of standard deviations 







equals 0.5, 1.5

and 2. Similarly, in a second group, the two samples were drawn from chi-square distributions

with unequal variances (base population is Chi(2)). In the last set of experiments the pairs of

2-SAMPLE t-TEST 21

samples were generated from the contaminated normal distribution (base population CN(.8,4))

as previously defined in Appendix A.

For each set of experiments, we calculated the simulated power levels (at a given detectable

difference ) associated with each test for the sample sizes 







. In

each experiment, the simulated power level was calculated as the proportion of instances when

the null hypothesis was rejected when it was false. For all the experiments, the difference

between the means was specified in one unit of the standard in the base population (the first of

the two samples). More specifically, we fixed   



 because it is relatively small for

all three families of distributions in this study. The simulations results are reported in Table 2.2

and displayed in Figure 2.2a, Figure 2.2b and Figure 2.2c.

Results and summary

The results in Table 6 and Figure 4 show that under the equal-variance assumptions the

theoretical power functions are identical in balanced designs, as indicated in Theorem 2.3. In

addition, when the sample sizes are relatively small but nearly the same size, the two functions

yield power values that are approximately equal. It is only when the samples are relatively small

and one sample is about four times larger than the other sample that some noticeable

differences between the power functions begin to emerge (for example, when 







).

Even in this case the theoretical power values based on the classical 2-sample t-test are only

slightly higher than the power values based on Welch’s t-test. Finally, when the designs are

markedly unbalanced but the samples are (relatively) large, the two power functions are

essentially identical, as stated in Theorem B3.

Moreover, in balanced designs with unequal variances, the two tests yield power values that are

practically identical. In very small samples (), however, the classical 2-sample t-test

performs slightly better.

Table 6 Comparison of simulated power levels of the classical 2-sample t-test and Welch’s test

in balanced and unequal-variance designs

Base Population:

N(0,2)

Base Population:

Chi(2)

Base Population:

CN(.8,4)











1.5

2.0

1.5

2.0

1.5

2.0

0.431

0.196

0.152

0.555

0.281

0.215

0.579

0.373

0.335

Welch

0.366

0.166

0.119

0.424

0.250

0.184

0.521

0.320

0.283

0.770

0.385

0.270

0.846

0.438

0.324

0.790

0.510

0.435

Welch

0.747

0.372

0.253

0.832

0.427

0.308

0.776

0.493

0.417

0.916

0.539

0.387

0.948

0.565

0.424

0.898

0.615

0.508

Welch

0.908

0.532

0.375

0.945

0.557

0.413

0.891

0.605

0.497

2-SAMPLE t-TEST 22

Base Population:

N(0,2)

Base Population:

Chi(2)

Base Population:

CN(.8,4)

0.971

0.682

0.497

0.982

0.680

0.521

0.952

0.702

0.573

Welch

0.969

0.677

0.487

0.981

0.676

0.511

0.947

0.697

0.563

0.990

0.779

0.591

0.994

0.765

0.605

0.980

0.783

0.641

Welch

0.990

0.777

0.582

0.994

0.762

0.597

0.979

0.778

0.636

0.998

0.851

0.675

0.998

0.826

0.676

0.994

0.839

0.699

Welch

0.998

0.849

0.670

0.998

0.824

0.668

0.994

0.836

0.694

Figure 5 Comparison of simulated power levels of the classical 2-sample t-test and Welch’s

2-sample t-test in balanced and unequal-variance designs. Samples were drawn from unequal-

variance normal populations such that the ratio of standard deviations is 0.5, 1.5 and 2.0.

2-SAMPLE t-TEST 23

Figure 6 Comparison of simulated power levels of the classical 2-sample t-test and Welch’s

2-sample t-test in balanced and unequal-variance designs. Samples were drawn from unequal-

variance chi-square populations such that the ratio of standard deviations is 0.5, 1.5 and 2.0.

2-SAMPLE t-TEST 24

Figure 7 Comparison of simulated power levels of the classical 2-sample t-test and Welch’s

2-sample t-test in balanced and unequal-variance designs. Samples were drawn from unequal-

variance contaminated normal populations such that the ratio of standard deviations is 0.5, 1.5

and 2.0.

2-SAMPLE t-TEST 25

Appendix C: Power and sample size

and sensitivity to normality

In the Assistant, the power analysis for comparing the means of two populations is based upon

the power function of the Welch t-test. Should this function be sensitive to the normal

assumption under which it is derived, the power analysis may yield erroneous conclusions. For

this reason, we conducted a simulation study to examine the sensitivity of this function to the

normal assumption. Sensitivity is assessed as the consistency between simulated power levels

and power levels calculated from the theoretical power function when samples are generated

from nonnormal distributions. The normal distribution serves as the control population because,

according to Theorem B2, simulated power levels and theoretical power levels are closest when

samples are generated from the normal populations.

Simulation study C

The study is conducted in three parts using three distributions: normal, chi-square, and the

contaminated normal distribution. Refer to Appendix A for more details. For each part of the

study, the simulated power is calculated (for the given sample sizes 



and 



at a given

detectable difference ) as the proportion of instances when the null hypothesis was rejected

when it was false. In all the cases the difference to be detected is specified in one unit of the

standard in the base population. That is   



 for all three families of distributions

in this study. The theoretical power values based on Welch’s t-test are calculated as well for

comparison.

Simulation results and summary

The results show that for relatively small sample sizes the power function of the Welch t-test is

robust against the normality assumption. In general, when the minimum size of the two samples

is as low as 15, the simulated power values are near their corresponding targeted theoretical

power levels (see Tables 7-10 and Figures 8-10).

Tables 7-10 show the simulated power levels of a two-sided Welch t-test with  based on

pairs of samples generated from normal population, skew populations (chi-square), and

contaminated normal populations. The pairs of samples are from the same family of distribution

but the variances of the parent populations are not necessarily equal. The theoretical power

values were calculated for comparison.

2-SAMPLE t-TEST 26

Table 7 Simulated power levels of a two-sided Welch t-test with  for n=5

Base Population: N(0,2)

Base Population: Chi(2)

Base Population:

CN(.8,4)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0































Obs.

.288

.158

.113

.091

.432

.305

.211

.149

.361

.257

.234

.220

Target

.353

.192

.116

.092

.353

.192

.116

.092

.353

.192

.116

.092

1.0

Obs.

.370

.252

.169

.121

.427

.334

.248

.189

.522

.380

.319

.284

Target

.389

.286

.190

.137

.389

.286

.190

.137

.389

.286

.190

.137

1.6

Obs.

.387

.326

.242

.179

.427

.364

.286

.225

.573

.453

.374

.319

Target

.400

.345

.260

.193

.400

.345

.260

.193

.400

.345

.260

.193

2.0

Obs.

.390

.351

.272

.208

.421

.373

.296

.235

.590

.483

.394

.336

Target

.402

.364

.291

.223

.402

.364

.291

.223

.402

.364

.291

.223

Table 8 Simulated power levels of a two-sided Welch t-test with  for n=10

Base Population: N(0,2)

Base Population: Chi(2)

Base Population:

CN(.8,4)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0































Obs.

.651

.346

.197

.131

.768

.493

.320

.221

.689

.484

.404

.358

Target

.666

.364

.206

.139

.666

.364

.206

.139

.666

.364

.206

.139

1.0

Obs.

.742

.556

.369

.254

.831

.612

.430

.308

.776

.619

.496

.419

Target

.745

.562

.337

.259

.745

.562

.337

.259

.745

.562

.337

.259

1.5

Obs.

.765

.641

.483

.358

.865

.679

.511

.377

.792

.679

.547

.456

Target

.767

.643

.483

.352

.767

.643

.483

.352

.767

.643

.483

.352

Obs.

.774

.683

.549

.417

.898

.737

.565

.448

.797

.716

.596

.490

Target

.777

.686

.551

.422

.777

.686

.551

.422

.777

.686

.551

.422

2-SAMPLE t-TEST 27

Table 9 Simulated power levels of a two-sided Welch t-test with  for n=15

Base Population: N(0,2)

Base Population: Chi(2)

Base Population:

CN(.8,4)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0































.53

Obs.

.857

.569

.342

.229

.871

.651

.421

.293

.853

.632

.505

.428

Target

.861

.568

.338

.221

.861

.568

.338

.221

.861

.568

.338

.221

1.0

Obs.

.906

.745

.535

.368

.942

.763

.563

.415

.891

.760

.611

.500

Target

.910

.753

.541

.379

.910

.753

.541

.379

.910

.753

.541

.379

1.53

Obs.

.928

.831

.667

.502

.975

.858

.676

.517

.898

.825

.698

.572

Target

.925

.830

.670

.509

.925

.830

.670

.509

.925

.830

.670

.509

2.0

Obs.

.933

.861

.737

.589

.984

.903

.750

.598

.902

.847

.742

.619

Target

.931

.863

.736

.589

.931

.863

.736

.589

.931

.863

.736

.589

Table 10 Simulated power levels of a two-sided Welch t-test with  for n=20

Base Population: N(0,2)

Base Population: Chi(2)

Base Population:

CN(.8,4)









1.0

1.5

2.0

1.0

1.5

2.0

1.0

1.5

2.0































Obs.

.938

.687

.426

.275

.920

.698

.486

.333

.923

.716

.568

.476

Target

.941

.686

.424

.277

.941

.686

.424

.277

.941

.686

.424

.277

1.0

Obs.

.971

.866

.672

.485

.981

.858

.670

.506

.952

.856

.696

.567

Target

.971

.869

.673

.489

.971

.869

.673

.489

.971

.869

.673

.489

1.5

Obs.

.977

.923

.791

.629

.995

.932

.785

.631

.960

.908

.798

.662

Target

.978

.922

.791

.628

.978

.922

.791

.628

.978

.922

.791

.628

2.0

Obs.

.983

.950

.858

.724

.998

.966

.864

.726

.958

.929

.845

.725

Target

.981

.945

.854

.719

.981

.945

.854

.719

.981

.945

.854

.719

2-SAMPLE t-TEST 28

When the two samples are generated from normal populations, the simulated power values are

consistent with the theoretical power values, even for very small samples. As illustrated in Figure

7, the theoretical and simulated power curves are practically indistinguishable. These results are

consistent with Theorem B2.

Figure 8 Simulated and targeted theoretical power levels of a two-sided Welch t-test with 

 based on pairs of samples generated from two normal populations with equal or unequal

variances plotted against ratio of sample sizes.

When the samples are generated from the skew chi-square distributions, the simulated power

values are higher than the theoretical power values for very small samples; however, the power

values become closer as the sample sizes increase. Figure 9 shows that the targeted theoretical

and simulated power curves are consistently close when the minimum size of the two samples is

at least 10. This illustrates that skewed data has no remarkable effect on the power function of

the Welch t-test, even when the samples are relatively small.

2-SAMPLE t-TEST 29

Figure 9 Simulated and targeted theoretical power levels of a two-sided Welch t-test with 

 based on pairs of samples generated from two normal populations with equal or unequal

variances plotted against ratio of sample sizes.

In addition, outliers tend to influence the power function only when the sample sizes are very

small. In general, when outliers are present the simulated power values tend to be a bit higher

than the targeted theoretical power values. This is depicted in Figure 10 where the simulated

and theoretical power curves are not reasonably close until the minimum sample size reaches

15.

2-SAMPLE t-TEST 30

Figure 10 Simulated and targeted theoretical power levels of a two-sided Welch t-test with 

 based on pairs of samples generated from two normal populations with equal or unequal

variances plotted against ratio of sample sizes.

2-SAMPLE t-TEST 31

Appendix D: Proof of theorem B2

For the two-sample model, Welch’s approach for deriving the distribution of the test statistic













 























under the null hypothesis is based upon an approximation of the distribution of























as proportional to a chi-square distribution. More specifically,



























is approximately distributed as a chi-square distribution with 



degrees of freedom where

















































 

















 

(Note that in a one-sample setting this reduces to the well known classical result that



  



















)

Consider the test of the null hypothesis 











(or equivalently ) against the

alternative 











(or equivalently )

Under the null hypothesis, the power function

































  









 















where 





denotes the 100 upper percentile point of the t-distribution with  degrees of

freedom.

Under the alternative hypothesis,

 







  





















































































2-SAMPLE t-TEST 32

has the approximate non-central t distribution with 



degrees of freedom with non-centrality

parameter





















 







because as stated earlier



























is approximately distributed as a chi-square distribution with 



degrees of freedom, and

  























is distributed as a standard normal distribution.

It follows that under the alternative,

















  









 













  

















  



















where 







 is the C.D.F of the non-central t distribution with 



degrees of freedom and

non centrality parameter  as given above.

2-SAMPLE t-TEST 33

Appendix E: Proof of theorem B3

First, note that 



can be rewritten as









































 















 

where 







Similarly, the non-centrality parameter associated with the Welch t-test power function can also

be written as















 







Under the assumption of equal variance, the non-centrality parameters associated with the

power functions of the classical 2-sample t-test and the Welch test coincide. That is

















 



where  is the common variance of the two populations. Thus, the only difference in the power

functions of the two tests resides in the difference between their respective degrees of freedom.

But, under the equal variance assumption, the degrees of freedom associated with the Welch

t-test power function becomes







































 













 









 













 







 













 



 







 

By Theorem 1, the degrees of freedom related to the power function of the classical 2-sample

t-test is 







 



 . After some algebraic manipulations, we have





 











 













 



 

















 



 







 



The fact that   



 is not surprising because we know that under the assumption of

equality of variances the classical 2-sample t-test is UMP (uniformly most powerful), as a result

one should expect the degrees of freedom associated with its power function to be higher.

Now, if 







then 



and as a result the power functions have the same order of

magnitude. In particular, the power functions of the two tests are identical if 







. This proves

the first part of theorem 2.3.

If 







, then 



 



 so that the Welch t-test has less power than the classical 2-sample

t-test.

2-SAMPLE t-TEST 34

In addition, if the samples are large, that is if 



 and 



 then 



 and 



 so

that the asymptotic distribution of the test statistics associated with both tests is the standard

normal distribution. Thus, the tests are asymptotically equivalent and yield the same asymptotic

power function.

Minitab®, Quality. Analysis. Results.® and the Minitab® logo are all registered trademarks of Minitab,

Inc., in the United States and other countries. See minitab.com/legal/trademarks for more information.