Sample Size Estimation: How Many Individuals Should Be Studied?¹

John Eng, MD

Index terms: Radiology and radiologists, research; Statistical analysis
Radiology 2003; 227:309–313
Published online 10.1148/radiol.2272012051
© RSNA, 2003

¹ From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, 600 N Wolfe St, Central Radiology Viewing Area, Rm 117, Baltimore, MD 21287. Received December 17, 2001; revision requested January 29, 2002; revision received March 7; accepted March 13. Address correspondence to the author (e-mail: [email protected]).
The number of individuals to include in a research study, the sample size of the
study, is an important consideration in the design of many clinical studies. This
article reviews the basic factors that determine an appropriate sample size and
provides methods for its calculation in some simple, yet common, cases. Sample size
is closely tied to statistical power, which is the ability of a study to enable detection
of a statistically significant difference when there truly is one. A trade-off exists
between a feasible sample size and adequate statistical power. Strategies for reduc-
ing the necessary sample size while maintaining a reasonable power will also be
discussed.
How many individuals will I need to study? This question is commonly asked by the
clinical investigator and exposes one of many issues that are best settled before actually
carrying out a study. Consultation with a statistician is worthwhile in addressing many
issues of study design, but a statistician is not always readily available. Fortunately, many
studies in radiology have simple designs for which determination of an appropriate sample
size—the number of individuals that should be included for study—is relatively straight-
forward.
Superficial discussions of sample size determination are included in typical introductory
biostatistics texts (1–3). The goal of this article is to augment these introductory discus-
sions with additional practical material. First, the need for considering sample size will be
reviewed. Second, the study design parameters affecting sample size will be identified.
Third, formulae for calculating appropriate sample sizes for some common study designs
will be defined. Finally, some advice will be offered on what to do if the calculated sample
size is impracticably large. To assist the reader in performing the calculations described in
this article and to encourage experimentation with them, a World Wide Web page has
been developed that closely parallels the equations presented in this article. This page can
be found at www.rad.jhmi.edu/jeng/javarad/samplesize/.
Even if a statistician is readily available, the investigator may find that a working
knowledge of the factors affecting sample size will result in more fruitful communication
with the statistician and in better research design. A working knowledge of these factors is
also required to use one of the numerous Web pages (4–6) and computer programs (7–9)
that have been developed for calculating appropriate sample sizes. It should be noted that
Web pages for calculating sample size are typically limited for use in situations involving
the well-known parametric statistics, which are those involving the calculation of summary
means, proportions, or other parameters of an assumed underlying statistical distribution
such as the normal, Student t, or binomial distributions. The calculation of sample size for
nonparametric statistics such as the Wilcoxon rank sum test is performed by some
computer programs (7,9).
IMPORTANCE OF SAMPLE SIZE
In a comparative research study, the means or proportions of some characteristic in two or
more comparison groups are measured. A statistical test is then applied to determine
whether or not there is a significant difference between the means or proportions observed
in the comparison groups. We will first consider the comparative type of study.
Sample size is important primarily because of its effect on statistical power. Statistical power is the probability that a statistical test will indicate a significant difference when there truly is one. Statistical power is analogous to the sensitivity of a diagnostic test (10), and one could mentally substitute the word "sensitivity" for the word "power" during statistical discussions.
In a study comparing two groups of individuals, the power (sensitivity) of a statistical test must be sufficient to enable detection of a statistically significant difference between the two groups if a difference is truly present. This issue becomes important if the study results were to demonstrate no statistically significant difference. If such a negative result were to occur, there would be two possible interpretations. The first interpretation is that the results of the statistical test are correct and that there truly is no statistically significant difference (a true-negative result). The second interpretation is that the results of the statistical test are erroneous and that there is actually an underlying difference, but the study was not powerful enough (sensitive enough) to find the difference, yielding a false-negative result. In statistical terminology, a false-negative result is known as a type II error. An adequate sample size gives a statistical test enough power (sensitivity) so that the first interpretation (that the results are true-negative) is much more plausible than the second interpretation (that a type II error occurred) in the event no statistically significant difference is found in the study.
It is well known that many published
clinical research studies possess low sta-
tistical power owing to inadequate sam-
ple size or other design issues (11,12).
One could argue that it is as wasteful and
inappropriate to conduct a study with
inadequate power as it is to obtain a di-
agnostic test of insufficient sensitivity to
rule out a disease.
PARAMETERS THAT
DETERMINE APPROPRIATE
SAMPLE SIZE
An appropriate sample size generally depends on five study design parameters: minimum expected difference (also known as the effect size), estimated measurement variability, desired statistical power, significance criterion, and whether a one- or two-tailed statistical analysis is planned.
Minimum Expected Difference
This parameter is the smallest measured
difference between comparison groups that
the investigator would like the study to
detect. As the minimum expected differ-
ence is made smaller, the sample size
needed to detect statistical significance
increases. The setting of this parameter is
subjective and is based on clinical judg-
ment and experience with the problem
being investigated. For example, suppose
a study is designed to compare a standard
diagnostic procedure of 80% accuracy with
a new procedure of unknown but poten-
tially higher accuracy. It would probably
be clinically unimportant if the new pro-
cedure were only 81% accurate, but sup-
pose the investigator believes that it
would be a clinically important improve-
ment if the new procedure were 90% ac-
curate. Therefore, the investigator would
choose a minimum expected difference
of 10% (0.10). The results of pilot studies
or a literature review can also guide the
selection of a reasonable minimum dif-
ference.
Estimated Measurement Variability
This parameter is represented by the
expected SD in the measurements made
within each comparison group. As statis-
tical variability increases, the sample size
needed to detect the minimum difference
increases. Ideally, the estimated measure-
ment variability should be determined
on the basis of preliminary data collected
from a similar study population. A review
of the literature can also provide esti-
mates of this parameter. If preliminary
data are not available, this parameter may
have to be estimated on the basis of sub-
jective experience, or a range of values
may be assumed. A separate estimate of
measurement variability is not required
when the measurement being compared
is a proportion (in contrast to a mean),
because the SD is mathematically derived
from the proportion.
Statistical Power
This parameter is the power that is de-
sired from the study. As power is increased,
sample size increases. While high power is
always desirable, there is an obvious
trade-off with the number of individuals
that can feasibly be studied, given the
usually fixed amount of time and re-
sources available to conduct a study. In
randomized controlled trials, the statisti-
cal power is customarily set to a number
greater than or equal to 0.80, with many
clinical trial experts now advocating a
power of 0.90.
Significance Criterion
This parameter is the maximum P value for which a difference is to be considered statistically significant. As the significance criterion is decreased (made more strict), the sample size needed to detect the minimum difference increases. The significance criterion is customarily set to .05.
One- or Two-tailed Statistical
Analysis
In a few cases, it may be known before
the study that any difference between
comparison groups is possible in only
one direction. In such cases, use of a one-
tailed statistical analysis, which would re-
quire a smaller sample size for detection
of the minimum difference than would a
two-tailed analysis, may be considered.
The sample size of a one-tailed design with a given significance criterion (α, for example) is equal to the sample size of a two-tailed design with a significance criterion of 2α, all other parameters being equal. Because of this simple relationship and because truly appropriate one-tailed analyses are rare, a two-tailed analysis is assumed in the remainder of this article.
SAMPLE SIZES FOR
COMPARATIVE RESEARCH
STUDIES
With knowledge of the design parameters detailed in the previous section, the calculation of an appropriate sample size simply involves selecting an appropriate equation. For a study comparing two means, the equation for sample size (13) is

N = \frac{4\sigma^{2}(z_{\mathrm{crit}} + z_{\mathrm{pwr}})^{2}}{D^{2}}, \qquad (1)
where N is the total sample size (the sum of the sizes of both comparison groups), σ is the assumed SD of each group (assumed to be equal for both groups), the z_crit value is that given in Table 1 for the desired significance criterion, the z_pwr value is that given in Table 2 for the desired statistical power, and D is the minimum expected difference between the two means. Both z_crit and z_pwr are cutoff points along the x axis of a standard normal probability distribution that demarcate probabilities matching the specified significance criterion and statistical power, respectively. The two groups that make up
N are assumed to be equal in number,
and it is assumed that two-tailed statisti-
cal analysis will be used. Note that N de-
pends only on the difference between the
two means; it does not depend on the
magnitude of either one.
As an example, suppose a study is pro-
posed to compare a renovascular proce-
dure versus medical therapy in lowering
the systolic blood pressure of patients
with hypertension secondary to renal ar-
tery stenosis. On the basis of results of
preliminary studies, the investigators es-
timate that the vascular procedure may
help lower blood pressure by 20 mm Hg,
while medical therapy may help lower
blood pressure by only 10 mm Hg. On
the basis of their clinical judgment, the
investigators might also argue that the
vascular procedure would have to be
twice as effective as medical therapy to
justify the higher cost and discomfort of
the vascular procedure. On the basis of
results of preliminary studies, the SD for
blood pressure lowering is estimated to
be 15 mm Hg. According to the normal
distribution, this SD indicates an expec-
tation that 95% of the patients in either
group will experience a blood pressure
lowering within 30 mm Hg (2 SDs) of the
mean. A significance criterion of .05 and power of 0.80 are chosen. With these assumptions, D = 20 − 10 = 10 mm Hg, σ = 15 mm Hg, z_crit = 1.960 (from Table 1), and z_pwr = 0.842 (from Table 2). Equation (1) yields a sample size of N = 70.6.
Therefore, a total of 70 patients (round-
ing N to the nearest even number) should
be enrolled in the study: 35 to undergo
the vascular procedure and 35 to receive
medical therapy.
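For readers who wish to verify this calculation programmatically, the following is a minimal sketch of Equation (1) in Python; the function name and the use of the scipy library are choices of this illustration, not part of the original study tools.

    from scipy.stats import norm  # norm.ppf is the equivalent of Excel's NORMSINV

    def sample_size_two_means(sigma, d, alpha=0.05, power=0.80):
        # Equation (1): total N for comparing two means, two-tailed design
        z_crit = norm.ppf(1 - alpha / 2)  # 1.960 for a significance criterion of .05 (Table 1)
        z_pwr = norm.ppf(power)           # 0.842 for a power of 0.80 (Table 2)
        return 4 * sigma**2 * (z_crit + z_pwr)**2 / d**2

    # Blood pressure example: sigma = 15 mm Hg, D = 10 mm Hg
    print(sample_size_two_means(sigma=15, d=10))  # prints about 70.6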
For a study in which two proportions are compared with a χ² test or a z test, which is based on the normal approximation to the binomial distribution, the equation for sample size (14) is

N = \frac{2\left[z_{\mathrm{crit}}\sqrt{2\bar{p}(1-\bar{p})} + z_{\mathrm{pwr}}\sqrt{p_{1}(1-p_{1}) + p_{2}(1-p_{2})}\right]^{2}}{D^{2}}, \qquad (2)
where p_1 and p_2 are pre-study estimates of the two proportions to be compared, D = p_1 − p_2 (ie, the minimum expected difference), p̄ = (p_1 + p_2)/2, and N, z_crit, and z_pwr are defined as they are for Equation (1). The two groups comprising N are assumed to be equal in number, and it is assumed that two-tailed statistical analysis will be used. Note that in this case, N depends not only on the difference between the two proportions but also on the magnitude of the proportions themselves. Therefore, Equation (2) requires the investigator to estimate p_1 and p_2, as well as their difference, before performing the study. However, Equation (2) does not require an independent estimate of SD because it is calculated from p_1 and p_2 within the equation.
As an example, suppose a standard di-
agnostic procedure has an accuracy of
80% for the diagnosis of a certain disease.
A study is proposed to evaluate a new
diagnostic procedure that may have
greater accuracy. On the basis of their
experience, the investigators decide that
the new procedure would have to be at
least 90% accurate to be considered significantly better than the standard procedure. A significance criterion of .05 and a power of 0.80 are chosen. With these assumptions, p_1 = 0.80, p_2 = 0.90, D = 0.10, p̄ = 0.85, z_crit = 1.960, and z_pwr = 0.842. Equation (2) yields a sample size of N = 398. Therefore, a total of 398 patients should be enrolled: 199 to undergo the standard diagnostic procedure and 199 to undergo the new one.
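A corresponding sketch of Equation (2), under the same assumption that scipy is available and with hypothetical helper names:

    from scipy.stats import norm

    def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
        # Equation (2): total N for comparing two proportions, two-tailed design
        z_crit = norm.ppf(1 - alpha / 2)
        z_pwr = norm.ppf(power)
        p_bar = (p1 + p2) / 2  # average of the two proportions
        term = (z_crit * (2 * p_bar * (1 - p_bar)) ** 0.5
                + z_pwr * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5)
        return 2 * term**2 / (p1 - p2)**2

    # Diagnostic accuracy example: p1 = 0.80, p2 = 0.90
    print(sample_size_two_proportions(0.80, 0.90))  # prints about 398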
SAMPLE SIZES FOR
DESCRIPTIVE STUDIES
Not all research studies involve the com-
parison of two groups. The purpose of
many studies is simply to describe, with
means or proportions, one or more char-
acteristics in one particular group. In
these types of studies, known as descrip-
tive studies, sample size is important be-
cause it affects how precise the observed
means or proportions are expected to be.
In the case of a descriptive study, the minimum expected difference reflects the difference between the upper and lower limits of an expected confidence interval (CI), which is described with a percentage. For example, a 95% CI indicates the range in which 95% of results would fall if a study were to be repeated an infinite number of times, with each repetition including the number of individuals specified by the sample size.
In studies designed to estimate a mean, the equation for sample size (2,15) is

N = \frac{4\sigma^{2}(z_{\mathrm{crit}})^{2}}{D^{2}}, \qquad (3)
where N is the sample size of the single study group, σ is the assumed SD for the group, the z_crit value is that given in Table 1, and D is the total width of the expected CI. Note that Equation (3) does not depend on statistical power because this concept applies only to statistical comparisons.
As an example, suppose a fetal sonographer wants to determine the mean fetal crown-rump length in a group of pregnancies. The sonographer would like the limits of the 95% confidence interval to be no more than 1 mm above or 1 mm below the mean crown-rump length of the group. From previous studies, it is known that the SD for the measurement is 3 mm. On the basis of these assumptions, D = 2 mm, σ = 3 mm, and z_crit = 1.960 (from Table 1). Equation (3) yields a sample size of N = 35. Therefore, 35 fetuses should be examined in the study.
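Equation (3) can be sketched the same way; this is again a hypothetical illustration assuming scipy rather than part of the original article's tools:

    from scipy.stats import norm

    def sample_size_mean_ci(sigma, d, confidence=0.95):
        # Equation (3): N for estimating a mean; d is the total width of the CI
        z_crit = norm.ppf(1 - (1 - confidence) / 2)
        return 4 * sigma**2 * z_crit**2 / d**2

    # Crown-rump length example: sigma = 3 mm, D = 2 mm
    print(sample_size_mean_ci(sigma=3, d=2))  # prints about 34.6, reported as 35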
In studies designed to measure a characteristic in terms of a proportion, the equation for sample size (2,15) is

N = \frac{4(z_{\mathrm{crit}})^{2}\,p(1-p)}{D^{2}}, \qquad (4)

where p is a pre-study estimate of the proportion to be measured, and N, z_crit, and D are defined as they are for Equation (3).
TABLE 1
Standard Normal Deviate (z_crit) Corresponding to Selected Significance Criteria and CIs

Significance Criterion*    z_crit Value
.01 (99)                   2.576
.02 (98)                   2.326
.05 (95)                   1.960
.10 (90)                   1.645

* Numbers in parentheses are the probabilities (expressed as percentages) associated with the corresponding CIs. Confidence probability is the probability associated with the corresponding CI. A stricter (smaller) significance criterion is associated with a larger z_crit value. Values not shown in this table may be calculated in Excel version 97 (Microsoft, Redmond, Wash) by using the formula z_crit = NORMSINV(1 − (P/2)), where P is the significance criterion.
TABLE 2
Standard Normal Deviate (z_pwr) Corresponding to Selected Statistical Powers

Statistical Power    z_pwr Value*
.80                  0.842
.85                  1.036
.90                  1.282
.95                  1.645

* A higher power is associated with a larger value for z_pwr. Values not shown in this table may be calculated in Excel version 97 (Microsoft, Redmond, Wash) by using the formula z_pwr = NORMSINV(power). For calculating power, the inverse formula is power = NORMSDIST(z_pwr), where z_pwr is calculated from Equation (1) or Equation (2) by solving for z_pwr.
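For readers not using Excel, the same values can be obtained from any standard normal quantile function; for example, a plausible equivalent in Python with scipy (an assumption of this illustration) is:

    from scipy.stats import norm

    z_crit = norm.ppf(1 - 0.05 / 2)  # NORMSINV(1 - P/2) for P = .05 gives 1.960
    z_pwr = norm.ppf(0.80)           # NORMSINV(power) for power = .80 gives 0.842
    power = norm.cdf(z_pwr)          # NORMSDIST(z_pwr) inverts back to 0.80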
Like Equation (2), Equation (4) depends not only on the width of the expected CI but also on the magnitude of the proportion itself. Also like Equation (2), Equation (4) does not require an independent estimate of SD because it is calculated from p within the equation.
As an example, suppose an investigator would like to determine the accuracy of a diagnostic test with a 95% CI of ±10%. Suppose that, on the basis of results of preliminary studies, the estimated accuracy is 80%. With these assumptions, D = 0.20, p = 0.80, and z_crit = 1.960. Equation (4) yields a sample size of N = 61. Therefore, 61 patients should be examined in the study.
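Finally, a sketch of Equation (4) under the same hypothetical setup:

    from scipy.stats import norm

    def sample_size_proportion_ci(p, d, confidence=0.95):
        # Equation (4): N for estimating a proportion; d is the total width of the CI
        z_crit = norm.ppf(1 - (1 - confidence) / 2)
        return 4 * z_crit**2 * p * (1 - p) / d**2

    # Diagnostic accuracy example: p = 0.80, D = 0.20
    print(sample_size_proportion_ci(p=0.80, d=0.20))  # prints about 61.5, reported as 61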
MINIMIZING THE SAMPLE SIZE
Now that we understand how to calcu-
late sample size, what if the sample size
we calculate is too large to be feasibly
studied? Browner et al (16) list a number
of strategies for minimizing the sample
size. These strategies are briefly discussed
in the following paragraphs.
Use Continuous Measurements
Instead of Categories
Because a radiologic diagnosis is often expressed in terms of a binary result, such as the presence or absence of a disease, it is natural to convert continuous measurements into categories. For example, the size of a lesion might be encoded as "small" or "large." For a sample of fixed size, the use of the actual measurement rather than the proportion in each category yields more power. This is because statistical tests that incorporate continuous values are mathematically more powerful than those used for proportions, given the same sample size.
Use More Precise Measurements
For studies in which Equation (1) or
Equation (2) applies, any way to increase
the precision (decrease the variability) of
the measurement process should be sought.
For some types of research, precision can
be increased by simply repeating the
measurement. More complex equations
are necessary for studies involving re-
peated measurements in the same indi-
viduals (17), but the basic principles are
similar.
Use Paired Measurements
Statistical tests like the paired t test are
mathematically more powerful for a
given sample size than are unpaired tests
because in paired tests, each measure-
ment is matched with its own control.
For example, instead of comparing the
average lesion size in a group of treated
patients with that in a control group,
measuring the change in lesion size in
each patient after treatment allows each
patient to serve as his or her own control
and yields more statistical power. Equation (1) can still be used in this case: D represents the expected change in the measurement, and σ is the expected SD of this change. The additional power and reduction in sample size are due to the SD being smaller for changes within individuals than for overall differences between groups of individuals.
Use Unequal Group Sizes
Equations (1) and (2) involve the as-
sumption that the comparison groups are
equal in size. Although it is statistically most efficient if the two groups are equal in size, benefit is still gained by studying more individuals, even if the additional individuals all belong to one of the groups. For example, it may be feasible to recruit additional individuals into the control group even if it is difficult to recruit more individuals into the noncontrol group.
More complex equations are necessary
for calculating sample sizes when com-
paring means (13) and proportions (18)
of unequal group sizes.
Expand the Minimum Expected
Difference
Perhaps the minimum expected differ-
ence that has been specified is unnecessarily small, and a larger expected difference could be justified, especially if the
planned study is a preliminary one. The
results of a preliminary study could be
used to justify a more ambitious follow-up
study of a larger number of individuals
and a smaller minimum difference.
DISCUSSION
The formulation of Equations (1)–(4) involves two statistical assumptions that should be kept in mind when these equations are applied to a particular study. First, it is assumed that the selection of individuals is random and unbiased. The decision to include an individual in the study cannot depend on whether or not that individual has the characteristic or outcome being studied. Second, in studies in which a mean is calculated from measurements of individuals, the measurements are assumed to be normally distributed. Both of these assumptions are required not only by the sample size calculation method but also by the statistical tests themselves (such as the t test). The situations in which Equations (1)–(4) are appropriate all involve parametric statistics. Different methods for determining sample size are required for nonparametric statistics such as the Wilcoxon rank sum test.
Equations for calculating sample size, such as Equations (1) and (2), also provide a method for determining the statistical power corresponding to a given sample size. To calculate power, solve for z_pwr in the equation corresponding to the design of the study. The power can then be determined by referring to Table 2. In this way, an observed power can be calculated after a study has been completed, where the observed difference is used in place of the minimum expected difference. This calculation is known as retrospective power analysis and is sometimes used to aid in the interpretation of the statistical results of a study. However, retrospective power analysis is controversial because it can be shown that observed power is completely determined by the P value and therefore cannot add any additional information to its interpretation (19). Power calculations are most appropriate when they incorporate a minimum difference that is stated prospectively.
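For the two-means design, solving Equation (1) for z_pwr gives z_pwr = D√N/(2σ) − z_crit, from which power follows via Table 2 (or NORMSDIST). A minimal sketch, again assuming scipy and a function name of this illustration's choosing:

    from scipy.stats import norm

    def power_two_means(sigma, d, n_total, alpha=0.05):
        # Equation (1) solved for z_pwr; power = NORMSDIST(z_pwr)
        z_crit = norm.ppf(1 - alpha / 2)
        z_pwr = d * n_total**0.5 / (2 * sigma) - z_crit
        return norm.cdf(z_pwr)

    # Blood pressure example with the planned total sample size of 70:
    print(power_two_means(sigma=15, d=10, n_total=70))  # prints about 0.80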
The accuracy of sample size calcula-
tions obviously depends on the accuracy
of the estimates of the parameters used in
the calculations. Therefore, these calcula-
tions should always be considered esti-
mates of an absolute minimum. It is usu-
ally prudent for the investigator to plan
to include more than the minimum
number of individuals in a study to com-
pensate for loss during follow-up or other
causes of attrition.
Sample size is best considered early in the planning of a study, when modifications in study design can still be made.
Attention to sample size will hopefully
result in a more meaningful study whose
results will eventually receive a high pri-
ority for publication.
References
1. Pagano M, Gauvreau K. Principles of biostatistics. 2nd ed. Pacific Grove, Calif: Duxbury, 2000; 246–249, 330–331.
2. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 7th ed. New York, NY: Wiley, 1999; 180–185, 268–270.
3. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall, 1991.
4. Bond J. Power calculator. Available at: http://calculators.stat.ucla.edu/powercalc/. Accessed March 11, 2003.
5. Uitenbroek DG. Sample size: SISA (simple interactive statistical analysis). Available at: http://home.clara.net/sisa/samsize.htm. Accessed March 3, 2003.
6. Lenth R. Java applets for power and sample size. Available at: www.stat.uiowa.edu/rlenth/Power/index.html. Accessed March 3, 2003.
7. NCSS Statistical Software. PASS 2002. Available at: www.ncss.com/pass.html. Accessed March 3, 2003.
8. SPSS. SamplePower. Available at: www.spss.com/SPSSBI/SamplePower/. Accessed March 3, 2003.
9. Statistical Solutions. nQuery Advisor. Available at: www.statsolusa.com/nquery/nquery.htm. Accessed March 3, 2003.
10. Browner WS, Newman TB. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. JAMA 1987; 257:2459–2463.
11. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994; 272:122–124.
12. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978; 299:690–694.
13. Rosner B. Fundamentals of biostatistics. 5th ed. Pacific Grove, Calif: Duxbury, 2000; 308.
14. Feinstein AR. Principles of medical statistics. Boca Raton, Fla: CRC, 2002; 503.
15. Snedecor GW, Cochran WG. Statistical methods. 8th ed. Ames, Iowa: Iowa State University Press, 1989; 52, 439.
16. Browner WS, Newman TB, Cummings SR, Hulley SB. Estimating sample size and power. In: Hulley SB, Cummings SR, Browner WS, Grady D, Hearst N, Newman TB. Designing clinical research: an epidemiologic approach. 2nd ed. Philadelphia, Pa: Lippincott Williams & Wilkins, 2001; 65–84.
17. Frison L, Pocock S. Repeated measurements in clinical trials: analysis using mean summary statistics and its implications for design. Stat Med 1992; 11:1685–1704.
18. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981; 45.
19. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001; 55:19–24.