
Common Methodology Mistakes in Educational Research, Revisited,

Along with a Primer on both Effect Sizes and the Bootstrap

Bruce Thompson

Texas A&M University 77843-4225

and

Baylor College of Medicine

Correct APA citation style:

Thompson, B. (1999, April). __Common methodology mistakes in
educational research, revisited, along with a primer on both effect sizes and
the bootstrap__. Invited address presented at the annual meeting of the
American Educational Research Association, Montreal. (ERIC Document Reproduction
Service No. ED forthcoming)

____________

Invited address presented at the annual meeting of the American
Educational Research Association (session #44.25), Montreal, April 22, 1999.
Justin Levitov first introduced me to the bootstrap, for which I remain most
grateful. I also appreciate the thoughtful comments of Cliff Lunneborg and
Russell Thompson on a previous draft of this paper. The author and related
reprints may be accessed through Internet URL:
"index.htm".

Abstract

The present AERA invited address was solicited to address the theme for the 1999 annual meeting, "On the Threshold of the Millennium: Challenges and Opportunities." The paper represents an extension of my 1998 invited address, and cites two additional common methodology faux pas to complement those enumerated in the previous address. The remainder of these remarks is forward-looking. The paper then considers (a) the proper role of statistical significance tests in contemporary behavioral research, (b) the utility of the descriptive bootstrap, especially as regards the use of "modern" statistics, and (c) the various types of effect sizes from which researchers should be expected to select in characterizing quantitative results. The paper concludes with an exploration of the conditions necessary and sufficient for the realization of improved practices in educational research.

In 1993, Carl Kaestle, prior to his term as President of the
National Academy of Education, published in the __Educational Researcher__ an
article titled, "The Awful Reputation of Education Research." It is noteworthy
that the article took as a given the conclusion that educational research
suffers an awful reputation, and rather than justifying this conclusion, Kaestle
focused instead on exploring the etiology of this reality. For example, Kaestle
(1993) noted that the education R&D community is seemingly in perpetual
disarray, and that there is a

...lack of consensus--lack of consensus on goals, lack of consensus on research results, and lack of a united front on funding priorities and procedures.... [T]he lack of consensus on goals is more than political; it is the result of a weak field that cannot make tough decisions to do some things and not others, so it does a little of everything... (p. 29)

Although Kaestle (1993) did not find it necessary to provide a warrant for his conclusion that educational research has an awful reputation, others have directly addressed this concern.

The National Academy of Science evaluated educational research generically, and found "methodologically weak research, trivial studies, an infatuation with jargon, and a tendency toward fads with a consequent fragmentation of effort" (Atkinson & Jackson, 1992, p. 20). Others also have argued that "too much of what we see in print is seriously flawed" as regards research methods, and that "much of the work in print ought not to be there" (Tuckman, 1990, p. 22). Gall, Borg and Gall (1996) concurred, noting that "the quality of published studies in education and related disciplines is, unfortunately, not high" (p. 151).

Indeed, __empirical__ studies of published research
involving methodology experts as judges corroborate these impressions. For
example, Hall, Ward and Comer (1988) and Ward, Hall and Schramm (1975) found
that over 40% and over 60%, respectively, of published research was seriously or
completely flawed. Wandt (1967) and Vockell and Asher (1974) reported similar
results from their empirical studies of the quality of published research.
Dissertations, too, have been examined, and have been found methodologically
wanting (cf. Thompson, 1988a, 1994a).

Researchers have also questioned the ecological validity of
both quantitative and qualitative educational studies. For example, Elliot
Eisner studied two volumes of the flagship journal of the American Educational
Research Association, the __American Educational Research Journal__
(*AERJ*). He reported that,

The median experimental treatment time for seven of the 15
experimental studies that reported experimental treatment time in Volume 18 of
the *AERJ* is 1 hour and 15 minutes. I suppose that we should take some
comfort in the fact that this represents a 66 percent increase over a 3-year
period. In 1978 the median experimental treatment time per subject was 45
minutes. (Eisner, 1983, p. 14)

Similarly, Fetterman (1982) studied major qualitative projects, and reported that, "In one study, labeled 'An ethnographic study of...,' observers were on site at only one point in time for five days. In a[nother] national study purporting to be ethnographic, once-a-week, on-site observations were made for 4 months" (p. 17).

None of this is to deny that educational research, whatever its methodological and other limits, has influenced and informed educational practice (cf. Gage, 1985; Travers, 1983). Even a methodologically flawed study may still contribute something to our understanding of educational phenomena. As Glass (1979) noted, "Our research literature in education is not of the highest quality, but I suspect that it is good enough on most topics" (p. 12).

However, as I pointed out in a 1998 AERA invited address, the
problem with methodologically flawed educational studies is that these flaws are
entirely *gratuitous*. I argued that

incorrect analyses arise from doctoral methodology
*instruction* that teaches research methods as series of rotely-followed
routines, as against thoughtful elements of a reflective enterprise; from
doctoral *curricula* that seemingly have less and less room for
quantitative statistics and measurement content, even while our knowledge base
in these areas is burgeoning (Aiken, West, Sechrest, Reno, with Roediger, Scarr,
Kazdin & Sherman, 1990; Pedhazur & Schmelkin, 1991, pp. 2-3); and, in
some cases, from an unfortunate *atavistic impulse* to somehow escape
responsibility for analytic decisions by justifying choices, sans rationale,
solely on the basis that the choices are common or traditional. (Thompson,
1998a, p. 4)

Such concerns have certainly been voiced by others. For example, following the 1998 annual AERA meeting, one conference attendee wrote AERA President Alan Schoenfeld to complain that

At [the 1998 annual meeting] we had a hard time finding rigorous research that reported actual conclusions. Perhaps we should rename the association the American Educational Discussion Association.... This is a serious problem. By encouraging anything that passes for inquiry to be a valid way of discovering answers to complex questions, we support a culture of intuition and artistry rather than building reliable research bases and robust theories. Incidentally, theory was even harder to find than good research. (Anonymous, 1998, p. 41)

Subsequently, Schoenfeld appointed a new AERA committee, the Research Advisory Committee, which currently is chaired by Edmund Gordon. The current members of the Committee are: Ann Brown, Gary Fenstermacher, Eugene Garcia, Robert Glaser, James Greeno, Margaret LeCompte, Richard Shavelson, Vanessa Siddle Walker, and Alan Schoenfeld, ex officio, Lorrie Shepard, ex officio, and William Russell, ex officio. The Committee is charged to strengthen the research-related capacity of AERA and its members, coordinate its activities with appropriate AERA programs, and be entrepreneurial in nature. [In some respects, the AERA Research Advisory Committee has a mission similar to that of the APA Task Force on Statistical Inference, which was appointed in 1996 (Azar, 1997; Shea, 1996).]

AERA President Alan Schoenfeld also appointed Geoffrey Saxe the 1999 annual meeting program chair. Together, they then described the theme for the AERA annual meeting in Montreal:

As we thought about possible themes for the upcoming annual meeting, we were pressed by a sense of timeliness and urgency. With regard to timeliness, ...the calendar year for the next annual meeting is 1999, the year that heralds the new millennium.... It's a propitious time to think about what we know, what we need to know, and where we should be heading. Thus, our overarching theme [for the 1999 annual meeting] is "On the Threshold of the Millennium: Challenges and Opportunities."

There is also a sense of urgency. Like many others, we see the field of education at a point of critical choices--in some arenas, one might say crises. (Saxe & Schoenfeld, 1998, p. 41)

The present paper was among those invited by various divisions to address this theme, and is an extension of my 1998 AERA address (Thompson, 1998a).

Purpose of the Present Paper

In my 1998 AERA invited address I advocated the improvement of
educational research via the eradication of five identified *faux pas*:

(1) the use of *stepwise* methods;

(2) the failure to consider in result interpretation the
*context specificity* of analytic weights (e.g., regression beta weights,
factor pattern coefficients, discriminant function coefficients, canonical
function coefficients) that are part of all parametric quantitative
analyses;

(3) the failure to interpret *both weights and
structure coefficients* as part of result interpretation;

(4) the failure to recognize that *reliability* is a
characteristic of scores, and __not__ of tests; and

(5) the incorrect interpretation of *statistical
significance* and the related failure to report and interpret the *effect
sizes* present in all quantitative analyses.

Two Additional Methodology *Faux Pas*

The present __didactic essay__ elaborates two additional
common methodology errors to delineate a constellation of seven cardinal sins of
analytic research practice:

(6) the use of univariate analyses in the presence of multiple
outcome variables, and the converse use of univariate analyses in post hoc
explorations of detected *multivariate effects*; and

(7) the *conversion of intervally-scaled predictor
variables* into nominally-scaled data in service of OVA (i.e., ANOVA, ANCOVA,
MANOVA, MANCOVA) analyses.

However, the present paper is more than a further elaboration of bad behaviors. Here the discussion of these two errors focuses on driving home two important realizations that should undergird best methodological practice:

1. All statistical analyses of scores on measured/observed variables actually focus on correlational analyses of scores on synthetic/latent variables derived by applying weights to the observed variables; and

2. The researcher's fundamental task in deriving defensible results is to employ an analytic model that matches the researcher's (too often implicit) model of reality.

These two realizations will provide a __conceptual
foundation__ for the treatment in the remainder of the paper.

Focus on the Future: Improving Educational Research

Although the focus on common methodological *faux pas* has
some merit, in keeping with the theme of this 1999 annual meeting of AERA, the
present invited address then turns toward the constructive portrayal of a
brighter research future. Three issues are addressed. First, the proper role of
*statistical significance* testing in future practice is explored. Second,
the use of so-called "internal replicability" analyses in the form of the
*bootstrap* is described. As part of this discussion some "modern"
statistics are briefly discussed. Third, the computation and interpretation of
*effect sizes* are described.

Other methods *faux pas* and other methods improvements
might both have been elaborated. However, the proposed changes would result in
considerable improvement in future educational research. In my view, (a)
informed use of statistical tests, (b) the more frequent use of external and
internal replicability analyses, and especially (c) required reporting and
interpretation of effect sizes in all quantitative research are both necessary
and sufficient conditions for realizing improvements.

Essentials for Realizing Improvements

The essay ends by considering how fields move and what must be done to realize these potential improvements. In my view, AERA must exercise visible and coherent academic leadership if change is to occur. To date, such leadership has not often been within the organization's traditions.

Faux Pas #6: Univariate as Against Multivariate Analyses

Too often, educational researchers invoke a series of
univariate analyses (e.g., ANOVA, regression) to analyze multiple dependent
variable scores from a single sample of participants. Conversely, too often
researchers who correctly select a multivariate analysis invoke univariate
analyses *post hoc* in their investigation of the origins of multivariate
effects. Here it will be demonstrated once again, using heuristic data to make
the discussion completely concrete, that in both cases these choices may lead to
serious interpretation errors.

The fundamental conceptual emphasis of this discussion, as previously noted, is on making the point that:

1. *All statistical analyses of scores on measured/observed
variables actually focus on correlational analyses of scores on synthetic/latent
variables derived by applying weights to the observed
variables.*

Two small heuristic data sets are employed to illustrate the relevant dynamics, respectively, for the univariate (i.e., single dependent/outcome variable) and multivariate (i.e., multiple outcome variables) cases.

Univariate Case

Table 1 presents a heuristic data set involving scores on three
measured/observed variables: __Y__, __X1__, and __X2__. These variables
are called "measured" (or "observed") because they are __directly__ measured,
*without any application of additive or multiplicative weights*, via
rulers, scales, or psychometric tools.

__________________________

INSERT TABLE 1 ABOUT HERE.

__________________________

However, ALL parametric analyses apply weights to the measured/observed variables to estimate scores for each person on synthetic or latent variables. This is true notwithstanding the fact that for some statistical analyses (e.g., ANOVA) the weights are not printed by some statistical packages. As I have noted elsewhere, the weights in different analyses

...are all analogous, but are given different names in different analyses (e.g., beta weights in regression, pattern coefficients in factor analysis, discriminant function coefficients in discriminant analysis, and canonical function coefficients in canonical correlation analysis), mainly to obfuscate the commonalities of [all] parametric methods, and to confuse graduate students. (Thompson, 1992a, pp. 906-907)

The synthetic variables derived by applying weights to the measured variables then become the focus of the statistical analyses.

The fact that all analyses are part of one single General Linear Model (GLM) family is a foundational understanding essential (in my view) to the informed selection of analytic methods. The seminal readings have been provided by Cohen (1968) viz. the univariate case, by Knapp (1978) viz. the multivariate case, and by Bagozzi, Fornell and Larcker (1981) regarding the most general case of the GLM: structural equation modeling. Related heuristic demonstrations of General Linear Model dynamics have been offered by Fan (1996, 1997) and Thompson (1984, 1991, 1998a, in press-a).

In the multiple regression case, a given i_{th} person's score on the
measured/observed variable __Y___{i} is estimated as the synthetic/latent
variable __Y__^_{i}. The predicted outcome score for a given person equals
__Y__^_{i} = a + b_{1}(__X1___{i}) + b_{2}(__X2___{i}),
which for these data, as reported in Figure 1, equals -581.735382 + [1.301899 x
X1_{i}] + [0.862072 x X2_{i}]. For example, for person 1,
__Y__^_{1} = -581.735382 + [1.301899 x 392] + [0.862072 x 573] = 422.58.

___________________________

INSERT FIGURE 1 ABOUT HERE.

___________________________
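The Figure 1 prediction equation can be verified with a few lines of arithmetic. The sketch below simply restates the text's computation for person 1, using the intercept and weights reported in Figure 1 and person 1's Table 1 predictor scores:

```python
# Intercept and regression weights reported in Figure 1
a = -581.735382
b1, b2 = 1.301899, 0.862072

# Person 1's observed predictor scores (Table 1)
x1, x2 = 392, 573

# Synthetic/latent score: Y-hat = a + b1*X1 + b2*X2
y_hat_1 = a + b1 * x1 + b2 * x2
print(round(y_hat_1, 2))  # 422.58
```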

__Some Noteworthy Revelations__. The "ordinary least
squares" (OLS) estimation used in classical regression analysis optimizes the
fit in the sample of each __Y__^_{i} to each __Y___{i}
score. Consequently, as noted by Thompson (1992b), even if all the predictors
are useless, the means of __Y__^ and __Y__ will __always__ be equal
(here 500.25), and the mean of the __e__ scores (__e___{i} =
__Y___{i} - __Y__^_{i}) will __always__ be zero. These
expectations are confirmed in the Table 1 results.

It is also worth noting that the sum of squares (i.e., the sum
of the squared deviations of each person's score from the mean) of the __Y__^
scores (i.e., 167,218.50) computed in Table 1 matches the "regression" sum of
squares (variously synonymously called "explained," "model," "between," so as to
confuse the graduate students) reported in the Figure 1 SPSS output.
Furthermore, the sum of squares of the __e__ scores reported in Table 1
(i.e., 32,821.26) exactly matches the "residual" sum of squares (variously
called "error," "unexplained," and "residual") value reported in the Figure 1
SPSS output.

It is especially noteworthy that the sum of squares explained
(i.e., 167,218.50) divided by the sum of squares of the __Y__ scores (i.e., the
sum of squares "total" = 167,218.50 + 32,821.26 = 200,039.76) tells us the
proportion of the variance in the __Y__ scores that we can predict given
knowledge of the __X1__ and the __X2__ scores. For these data the
proportion is 167,218.50 / 200,039.76 = .83593. This formula is one of several
formulas with which to compute the uncorrected regression effect size, the
multiple R^{2}.
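This generic sum-of-squares computation can be sketched in a few lines, using the values reported above:

```python
ss_regression = 167218.50  # "explained" / "model" / "between" sum of squares
ss_residual = 32821.26     # "error" / "unexplained" / "residual" sum of squares
ss_total = ss_regression + ss_residual

# Uncorrected effect size: proportion of Y variance predictable from X1 and X2
r_squared = ss_regression / ss_total
print(round(r_squared, 5))  # 0.83593
```

The same ratio computed in an ANOVA context would be labelled eta^{2} rather than R^{2}, as noted below.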

Indeed, for the univariate case, because ALL analyses are
correlational, an __r__^{2} analog of this effect size can always be
computed, using this formula across analyses. However, in ANOVA, for example,
when we compute this effect size using this generic formula, we call the result
eta^{2} (η^{2}; or synonymously the
correlation ratio [not the correlation coefficient!]), primarily to confuse the
graduate students.

__Even More Important Revelations__. Figure 2 presents the
correlation coefficients involving all possible pairs of the five (three
measured, two synthetic) variables. Several additional revelations become
obvious.

___________________________

INSERT FIGURE 2 ABOUT HERE.

___________________________

First, note that the __Y__^ scores and the __e__ scores
are perfectly uncorrelated. This will ALWAYS be the case, by definition, since
the __Y__^ scores are the aspects of the __Y__ scores that the predictors
can explain or predict, and the __e__ scores are the aspects of the __Y__
scores that the predictors cannot explain or predict (i.e., because
__e___{i} is defined as __Y___{i} - __Y__^_{i},
therefore __r___{YHAT x e} = 0). Similarly, the measured predictor
variables (here __X1__ and __X2__) always have correlations of zero with
the __e__ scores, again because the __e__ scores by definition are the
parts of the __Y__ scores that the predictors cannot explain.

Second, note that the __r___{Y x YHAT} reported in
Figure 2 (i.e., .9143) matches the multiple __R__ reported in Figure 1 (i.e.,
.91429), except for the arbitrary decision by different computer programs to
present these statistics to different numbers of decimal places. The equality
makes sense conceptually, if we think of the __Y__^ scores as being the part
of the predictors useful in predicting/explaining the __Y__ scores,
discarding all the parts of the measured predictors that are not useful (about
which we are completely uninterested, because the focus of the analysis is
solely on the outcome variable).

This last revelation is __extremely__ important to a
conceptual understanding of statistical analyses. The fact that __R___{Y
with X1, X2} = __r___{Y x YHAT} means that the synthetic
variable, __Y__^, is actually the focus of the analysis. Indeed, synthetic
variables are ALWAYS the real focus of statistical analyses!

This makes sense, when we realize that our measures are only
indicators of our psychological constructs, and that what we really care about
in educational research are not the observed scores on our measurement tools
*per se*, but instead is the underlying construct. For example, if I wish
to improve the self-concepts of third-grade elementary students, what I really
care about is improving their unobservable self-concepts, and not the scores on
an imperfect measure of this construct, which I __only__ use as a vehicle to
estimate the latent construct of interest, because the construct cannot be
directly observed.

Third, the correlations of the measured predictor variables
with the synthetic variable (i.e., .7512 and -.0741) are called "structure"
coefficients. These can also be derived by computation (cf. Thompson &
Borrello, 1985) as __r___{S} = __r___{Y with X} / __R__
(e.g., .6868 / .91429 = .7512). [Due to a strategic error on the part of
methodology professors, who convene annually in a secret coven to generate more
statistical terminology with which to confuse the graduate students, for some
reason the mathematically analogous structure coefficients across all analyses
are uniformly called by the same name--an oversight that will doubtless soon be
corrected.]

The reason structure coefficients are called "structure"
coefficients is that these coefficients provide insight regarding the nature or
structure of the underlying synthetic variables that are the actual research
focus. Although space precludes further detail here, I regard the
interpretation of structure coefficients as being __essential__ in most
research applications (Thompson, 1997b, 1998a; Thompson & Borrello, 1985).
Some educational researchers erroneously believe that these coefficients are
unimportant insofar as they are not reported for all analyses by some computer
packages; these researchers incorrectly believe that SPSS and other
computer packages were written in a sole authorship venture by a benevolent God
who has elected judiciously to report on printouts (a) *all* results of
interest and (b) *only* the results of genuine interest.

__The Critical, Essential Revelation__. Figure 2 also
provides the basis for delineating a paradox which, once resolved, leads to a
fundamentally important insight regarding statistical analyses. Notice for these
data the __r__^{2} between __Y__ and __X1__ is
.6868^{2} = 47.17% and the __r__^{2} between __Y__ and
__X2__ is (-.0677)^{2} = 0.46%. The sum of these two values is
47.63%.

Yet, as reported in Figures 1 and 2, the __R__^{2}
value for these data is .91429^{2} = 83.593%, a value approaching the
mathematical limit for __R__^{2}. How can the multiple
__R__^{2} value (83.593%) be not only larger, but nearly twice as
large as the sum of the __r__^{2} values of the two predictor
variables with __Y__?

These data illustrate a "suppressor" effect. These effects were
first noted in World War II when psychologists used paper-and-pencil measures of
spatial and mechanical ability to predict ability to pilot planes.
Counterintuitively, it was discovered that verbal ability, which is essentially
unrelated to pilot ability, nevertheless substantially improved the
__R__^{2} when used as a predictor in conjunction with spatial and
mechanical ability scores. As Horst (1966, p. 355) explained, "To include the
verbal score with a negative weight served to suppress or subtract irrelevant
[measurement artifact] ability [in the spatial and mechanical ability scores],
and to discount the scores of those who did well on the test simply because of
their verbal ability rather than because of abilities required for success in
pilot training."

Thus, suppressor effects are desirable, notwithstanding what some may deem a pejorative name, because suppressor effects actually increase effect sizes. Henard (1998) and Lancaster (in press) provide readable elaborations. All this discussion leads to the extremely important point that

__The latent or synthetic variables analyzed in all parametric
methods are always more than the sum of their constituent parts__. If we
only look at observed variables, such as by only examining a series of bivariate
correlations, we risk missing relationships (e.g., suppressor effects) that
emerge only when the variables are considered simultaneously.

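The arithmetic of suppression can be sketched with the standard two-predictor formula expressing __R__^{2} in terms of the bivariate correlations among standardized variables. The correlation values below are hypothetical (the Figure 2 values are not reproduced here); the point is only that __R__^{2} can exceed the sum of the squared bivariate correlations:

```python
def multiple_r2(r_y1, r_y2, r_12):
    """R^2 for two standardized predictors (textbook formula)."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical suppressor pattern: X2 is unrelated to Y but shares
# criterion-irrelevant variance with X1, which it "suppresses."
r2 = multiple_r2(r_y1=0.50, r_y2=0.00, r_12=0.70)
print(round(r2, 4))  # 0.4902, versus r_y1^2 + r_y2^2 = 0.25
```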
Multivariate Case

Table 2 presents heuristic data for 10 people in each of two
groups on two measured/observed outcome/response variables, __X__ and
__Y__. These data are somewhat similar to those reported by Fish (1988), who
argued that multivariate analyses are usually vital. The Table 2 data are used
here to illustrate that (a) when you have more than one outcome variable,
multivariate analyses may be essential, and (b) when you do a multivariate
analysis, you must __not__ use a univariate method *post hoc* to explore
the detected multivariate effects.

__________________________

INSERT TABLE 2 ABOUT HERE.

__________________________

For these heuristic data, the outcome scores of __X__ and
__Y__ have exactly the same variance in both groups 1 and 2, as reported in
the bottom of Table 2. This exactly equal __SD__ (and variance and sum of
squares) means that the ANOVA "homogeneity of variance" assumption (called this
because this characterization sounds fancier than simply saying "the outcome
variable scores were equally 'spread out' in all groups") was perfectly met, and
therefore the calculated ANOVA __F__ test results are exactly accurate for
these data. Furthermore, the analogous multivariate "homogeneity of dispersion
matrices" assumption (meaning simply that the variance/covariance matrices in
the two groups were equal) was also perfectly met, and therefore the MANOVA
__F__ tests are exactly accurate as well. In short, the demonstrations here
are not contaminated by the failure to meet statistical assumptions!

Figure 3 presents ANOVA results for separate analyses of the
__X__ and __Y__ scores presented in Table 2. For both __X__ and
__Y__, the two means do not differ to a statistically significant degree. In
fact, for both variables the p_{CALCULATED} values were .774.
Furthermore, the eta^{2} effect sizes were both computed to be 0.469%
(e.g., 5.0 / [5.0 + 1061.0] = 5.0 / 1066.0 = .00469). Thus, the two sets of
ANOVA results are not statistically significant and they both involve extremely
small effect sizes.
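The univariate eta^{2} values can be recomputed directly from the sums of squares cited above; a minimal check:

```python
ss_between = 5.0     # "between" sum of squares for each univariate ANOVA
ss_within = 1061.0   # "within" sum of squares for each univariate ANOVA
eta_squared = ss_between / (ss_between + ss_within)
print(round(eta_squared, 5))  # 0.00469, i.e., 0.469%
```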

___________________________

INSERT FIGURE 3 ABOUT HERE.

___________________________

However, as also reported in the Figure 3 results, a
MANOVA/Descriptive Discriminant Analysis (DDA; for a one-way MANOVA, MANOVA and
DDA yield the same results, but the DDA provides more detailed analysis--see
Huberty, 1994; Huberty & Barton, 1989; Thompson, 1995b) of the *same
data* yields a p_{CALCULATED} value of .000239, and an
eta^{2} of 62.5%. Clearly, the resulting interpretation of the same data
would be night-and-day different for these two sets of analyses. Again, the
synthetic variables in some senses can become more than the sum of their parts,
as was also the case in the previous heuristic demonstration.

Table 2 reports these latent variable scores for the 20
participants, derived by applying the weights (-1.225 and 1.225) reported in
Figure 3 to the two measured outcome variables. For heuristic purposes only, the
scores on the synthetic variable labelled "DSCORE" were then subjected to the
ANOVA reported in Figure 4. As reported in Figure 4, this analysis of the
multivariate synthetic variable, a weighted aggregation of the outcome variables
__X__ and __Y__, yields the same eta^{2} effect size (i.e., 62.5%)
reported in Figure 3 for the DDA/MANOVA results. Again, all statistical analyses
actually focus on the synthetic/latent variables actually derived in the
analyses, *quod erat demonstrandum*.

___________________________

INSERT FIGURE 4 ABOUT HERE.

___________________________

The present heuristic example can be framed in either of two
ways, both of which highlight common errors in contemporary analytic practice.
The first error involves conducting multiple univariate analyses to evaluate
multivariate data; the second error involves using univariate analyses (e.g.,
ANOVAs) in *post hoc* analyses of detected multivariate effects.

__Using Several Univariate Analyses to Analyze Multivariate
Data__. The present example might be framed as an illustration of a researcher
conducting *only* two ANOVAs to analyze the two sets of dependent variable
scores. The researcher here would find no statistically significant effects (both
__p___{CALCULATED} values = .774) nor (probably, depending upon the
context of the study and researcher personal values) any noteworthy effects (both
eta^{2} values = 0.469%). This researcher would remain oblivious to the
statistically significant effect (__p___{CALCULATED} = .000239) and
huge (as regards typicality; see Cohen, 1988) effect size (multivariate
eta^{2} = 62.5%).

One potentially noteworthy argument in favor of employing
multivariate methods with data involving more than one outcome variable involves
the inflation of "experimentwise" Type I error rates (α_{EW}; i.e., the
probability of making one or more Type I errors in a set of hypothesis
tests--see Thompson, 1994d). At the extreme, when the outcome variables or the
hypotheses (as in a balanced ANOVA design) are perfectly uncorrelated,
α_{EW} is a function of the "testwise" alpha level (α_{TW}) and the
number of outcome variables or hypotheses tested (__k__), and equals

1 - (1 - α_{TW})^{k}.

Because this function is exponential, experimentwise error rates can inflate quite rapidly! [Imagine my consternation when I detected a local dissertation invoking more than 1,000 univariate statistical significance tests (Thompson, 1994a).]
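The inflation formula is easily evaluated; the alpha level and numbers of tests below are illustrative, with __k__ = 1,000 echoing the dissertation anecdote:

```python
def experimentwise_alpha(alpha_tw, k):
    """P(at least one Type I error) across k perfectly uncorrelated tests."""
    return 1 - (1 - alpha_tw) ** k

print(round(experimentwise_alpha(0.05, 10), 4))    # 0.4013
print(round(experimentwise_alpha(0.05, 1000), 4))  # 1.0 (effectively certain)
```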

One way to control the inflation of experimentwise error is to
use a "Bonferroni correction" which adjusts the α_{TW} downward so as to
minimize the final α_{EW}. Of course, one consequence of this strategy is lessened
statistical power against Type II error. However, the primary argument against
using a series of univariate analyses to evaluate data involving multiple
outcome variables does not invoke statistical significance testing concepts.
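A quick numeric sketch of the Bonferroni correction (the target α_{EW} and number of tests are illustrative):

```python
k, alpha_ew_target = 10, 0.05
alpha_tw = alpha_ew_target / k        # Bonferroni-adjusted testwise alpha
alpha_ew = 1 - (1 - alpha_tw) ** k    # resulting experimentwise error rate
print(round(alpha_tw, 4), round(alpha_ew, 4))  # 0.005 0.0489 -- below target
```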

Multivariate methods are often vital in behavioral research
simply because *multivariate methods best honor the reality to which the
researcher is purportedly trying to generalize*. Implicit within every
analysis is an analytic model. Each researcher also has a presumptive model of
what reality is believed to be like. It is critical that our analytic models and
our models of reality match, otherwise our conclusions will be invalid. It is
generally best to consciously reflect on the fit of these two models whenever we
do research. Of course, researchers with different models of reality may make
different analytic choices, but this is not disturbing because analytic choices
are philosophically driven anyway (Cliff, 1987, p. 349).

My personal model of reality is one "in which the researcher
cares about multiple outcomes, in which most outcomes have multiple causes, and
in which most causes have multiple effects" (Thompson, 1986b, p. 9). Given such
a model of reality, it is critical that the full network of all possible
relationships be considered *simultaneously* within the analysis.
Otherwise, the Figure 3 multivariate effects, presumptively real given my model
of reality, would go undetected. Thus, Tatsuoka's (1973b) previous remarks
remain telling:

The often-heard argument, "I'm more interested in seeing how each variable, in its own right, affects the outcome" overlooks the fact that any variable taken in isolation may affect the criterion differently from the way it will act in the company of other variables. It also overlooks the fact that multivariate analysis--precisely by considering all the variables simultaneously--can throw light on how each one contributes to the relation. (p. 273)

For these various reasons __empirical__ studies (Emmons,
Stallings & Layne, 1990) show that, "In the last 20 years, the use of
multivariate statistics has become commonplace" (Grimm & Yarnold, 1995, p.
vii).

__Using Univariate Analyses post hoc to Investigate
Detected Multivariate Effects__. In ANOVA and ANCOVA, no *post hoc* tests are
needed for a statistically significant two-level way, because with a single
outcome variable there is no ambiguity about where the detected effect
originated.

However, in MANOVA and MANCOVA *post hoc* tests are
necessary to evaluate (a) which groups differ (b) as regards which one or more
outcome variables. Even in a two-level way (or "factor"), if the effect is
statistically significant, further analyses are necessary to determine on which
one or more outcome/response variables the two groups differ. An alarming number
of researchers employ ANOVA as a *post hoc* analysis to explore detected
MANOVA effects (Thompson, 1999b).

Unfortunately, as the previous example made clear, because the
two *post hoc* ANOVAs would fail to explain where the incredibly large and
statistically significant MANOVA effect originated, ANOVA is __not__ a
suitable MANOVA *post hoc* analysis. As Borgen and Seling (1978) argued,
"When data truly are multivariate, as implied by the application of MANOVA, a
multivariate follow-up technique seems necessary to 'discover' the complexity of
the data" (p. 696). It is simply illogical to first declare interest in a
multivariate omnibus system of variables, and to then explore detected effects
in this multivariate world by conducting non-multivariate tests!

Faux Pas #7: Discarding Variance in Intervally-Scaled Variables

Historically, OVA methods (i.e., ANOVA, ANCOVA, MANOVA, MANCOVA) dominated the social scientist's analytic landscape (Edgington, 1964, 1974). However, more recently the proportion of uses of OVA methods has declined (cf. Elmore & Woehlke, 1988; Goodwin & Goodwin, 1985; Willson, 1980). Planned contrasts (Thompson, 1985, 1986a, 1994c) have been increasingly favored over omnibus tests. And regression and related techniques within the GLM family have been increasingly employed.

Improved analytic choices have partially been a function of growing researcher awareness that:

2. *The researcher's fundamental task in deriving defensible
results is to employ an analytic model that matches the researcher's (too often
implicit) model of reality.*

This growing awareness can largely be traced to a seminal article written by Jacob Cohen (1968, p. 426).

Theory

Cohen (1968) noted that ANOVA and ANCOVA are special cases of multiple regression analysis, and argued that in this realization "lie possibilities for more relevant and therefore more powerful exploitation of research data." Since that time researchers have increasingly recognized that conventional multiple regression analysis of data as they were initially collected (no conversion of intervally scaled independent variables into dichotomies or trichotomies) does not discard information or distort reality, and that the "general linear model"

...can be used equally well in experimental or non-experimental research. It can handle continuous and categorical variables. It can handle two, three, four, or more independent variables... Finally, as we will abundantly show, multiple regression analysis can do anything the analysis of variance does--sums of squares, mean squares, F ratios--and more. (Kerlinger & Pedhazur, 1973, p. 3)

Discarding variance is generally __not__ good research
practice. As Kerlinger (1986) explained,

...partitioning a continuous variable into a dichotomy or
trichotomy throws information away... To reduce a set of values with a
relatively wide range to a dichotomy is to reduce its variance and thus its
possible correlation with other variables. A good rule of research data
analysis, therefore, is: __Do not reduce continuous variables to partitioned
variables__ (dichotomies, trichotomies, etc.) unless compelled to do so by
circumstances or the nature of the data (seriously skewed, bimodal, etc.). (p.
558, emphasis in original)

Kerlinger (1986, p. 558) noted that variance is the "stuff" on which all analysis is based. Discarding variance by categorizing intervally-scaled variables amounts to the "squandering of information" (Cohen, 1968, p. 441). As Pedhazur (1982, pp. 452-453) emphasized,

Categorization of attribute variables is all too frequently resorted to in the social sciences.... It is possible that some of the conflicting evidence in the research literature of a given area may be attributed to the practice of categorization of continuous variables.... Categorization leads to a loss of information, and consequently to a less sensitive analysis.

Some researchers may be prone to categorizing continuous
variables and overuse of ANOVA because they __unconsciously__ and
__erroneously__ associate ANOVA with the power of experimental designs. As I
have noted previously,

Even most experimental studies invoke intervally scaled "aptitude" variables (e.g., IQ scores in a study with academic achievement as a dependent variable), to conduct the aptitude-treatment interaction (ATI) analyses recommended so persuasively by Cronbach (1957, 1975) in his 1957 APA Presidential address. (Thompson, 1993a, pp. 7-8)

Thus, many researchers employ interval predictor variables, even in experimental designs, but these same researchers too often convert their interval predictor variables to nominal scale merely to conduct OVA analyses.

It is *true* that experimental designs allow causal
inferences and that ANOVA is appropriate for many experimental designs. However,
it is *not* therefore *true* that doing an ANOVA makes the design
experimental and thus allows causal inferences.

Humphreys (1978, p. 873, emphasis added) noted that:

The basic fact is that a measure of individual differences is
not an independent variable [in an experimental design], and it *does not
become one* by categorizing the scores and treating the categories as if they
defined a variable under experimental control in a factorially designed analysis
of variance.

Similarly, Humphreys and Fleishman (1974, p. 468) noted that
categorizing variables in a nonexperimental design using an ANOVA analysis "not
infrequently produces in both the investigator and his audience the illusion
that he has experimental control over the independent variable. Nothing could be
more wrong." Because within the general linear model all analyses are
correlational, and it is the design and __not__ the analysis that yields the
capacity to make causal inferences, the practice of converting intervally-scaled
predictor variables to nominal scale so that ANOVA and other OVAs (i.e., ANCOVA,
MANOVA, MANCOVA) can be conducted is inexcusable, at least in most cases.

As Cliff (1987, p. 130, emphasis added) noted, the practice of discarding variance on intervally-scaled predictor variables to perform OVA analyses creates problems in almost all cases:

Such divisions are not infallible; think of the persons near
the borders. Some who should be highs are actually classified as lows, and vice
versa. In addition, the "barely highs" are classified the same as the "very
highs," even though they are different. Therefore, reducing a reliable variable
to a dichotomy [or a trichotomy] makes the variable __more unreliable__, not
less.

In such cases, it is the reliability of the dichotomy, and __not__ the
reliability of the highly-reliable, intervally-scaled data that we originally
collected, that impacts the analysis we are actually conducting.

Heuristic Examples for Three Possible Cases

When we convert an intervally-scaled independent variable into a nominally-scaled way in service of performing an OVA analysis, we are implicitly invoking a model of reality with two strict assumptions:

1. all the participants assigned to a given level of the way (or "factor") are the same, and

2. all the participants assigned to different levels of the way are different.

For example, if we have a normal distribution of IQ scores, and we use scores of 90 and 110 to trichotomize our interval data, we are saying that:

1. the 2 people in the High IQ group with IQs of 111 and 145 are the same, and

2. the 2 people in the Low and Middle IQ groups with IQs of 89 and 91, respectively, are different.
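A minimal sketch of what this trichotomization implies (the scores are hypothetical, chosen to include the borderline and extreme cases just described; grouping via numpy's `digitize`):

```python
import numpy as np

# Hypothetical IQ scores, including the cases discussed above.
iq = np.array([89, 91, 111, 145])

# Trichotomize at 90 and 110: < 90 -> Low, 90-109 -> Middle, >= 110 -> High.
group = np.digitize(iq, bins=[90, 110])
labels = np.array(["Low", "Middle", "High"])[group]

for score, label in zip(iq, labels):
    print(score, label)
# The analysis now treats the IQs of 111 and 145 as identical ("High"),
# while the IQs of 89 and 91 land in different groups.
```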

Whether our decision to convert our intervally-scaled data to nominal scale is appropriate depends entirely on the research situation. There are three possible situations.

Table 3 presents heuristic data illustrating the three
possibilities. The measured/observed outcome variable in all three cases is
__Y__.

__________________________

INSERT TABLE 3 ABOUT HERE.

__________________________

__Case #1: No harm, no foul__. In case #1 the
intervally-scaled variable __X1__ is re-expressed as a trichotomy in the form
of variable __X1'__. Assuming that the standard error of the measurement is
something like 3 to 6, the conversion in this instance does not seem
problematic, because it appears reasonable to assume that:

1. all the participants assigned to a given level of the way are the same, and

2. all the participants assigned to different levels of the way are different.

__Case #2: Creating variance where there is none__. Case #2
again assumes that the standard error of the measurement is something like 3 to
6 for the hypothetical scores. Here none of the 21 participants appear to be
different as regards their scores on Table 3 variable __X2__, so assigning
the participants to three groups via variable __X2'__ seems to create
differences where there are none. This will generate analytic results in which
the analytic model does not honor our model of reality, which in turn
compromises the integrity of our results.

Some may protest that no real researcher would ever, ever assign people to groups where there are, in fact, no meaningful differences among the participants as regards their scores on an independent variable. But consider a recent local dissertation that involved administration of a depression measure to children; based on scores on this measure the children were assigned to one of three depression groups. Regrettably, these children were all apparently happy and well-adjusted.

It is especially interesting that the highest score on this
[depression] variable... was apparently 3.43 (p. 57). As... [the student]
acknowledged, the PNID authors themselves recommend a cutoff score of 4 for
classifying subjects as being severely depressed. Thus, the __highest__ score
in... [the] entire sample appeared to be __less than the minimum__ cutoff
score suggested by the test's own authors! (Thompson, 1994a, p.
24)

__Case #3: Discarding variance, distorting distribution
shape__. Alternatively, presume that the intervally-scaled independent
variable (e.g., an aptitude way in an ATI design) is somewhat normally
distributed. Variable __X3__ in Table 3 can be used to illustrate the
potential consequences of re-expressing this information in the form of a
nominally-scaled variable such as __X3'__.

Figure 5 presents the SPSS output from analyzing the data in
both unmutilated (i.e., __X3__) and mutilated (i.e., __X3'__) form. In
unmutilated form, the results are statistically significant
(p_{CALCULATED} = .00004) and the __R__^{2} effect size is
59.7%. For the mutilated data, the results are not statistically significant at
a conventional alpha level (p_{CALCULATED} = .1145) and the
eta^{2} effect size is 21.4%, roughly a third of the effect for the
regression analysis.

___________________________

INSERT FIGURE 5 ABOUT HERE.

___________________________
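The general pattern in Figure 5 can be mimicked in a simulation (a hedged sketch with invented data, not the Table 3 values): compare the __R__^{2} for the continuous predictor with the eta^{2} obtained after trichotomizing that same predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 21
x = rng.normal(100, 15, n)           # continuous predictor (e.g., an aptitude)
y = 0.5 * x + rng.normal(0, 10, n)   # outcome linearly related to the predictor

# Effect size for the unmutilated, intervally-scaled predictor
r2 = np.corrcoef(x, y)[0, 1] ** 2

# Mutilate: trichotomize x at 90 and 110, then compute eta-squared from groups
g = np.digitize(x, bins=[90, 110])
grand = y.mean()
ss_between = sum(np.sum(g == k) * (y[g == k].mean() - grand) ** 2
                 for k in np.unique(g))
eta2 = ss_between / np.sum((y - grand) ** 2)

print(f"R^2 (continuous) = {r2:.3f}; eta^2 (trichotomized) = {eta2:.3f}")
```

Across repeated simulated draws the trichotomized analysis typically detects a noticeably smaller effect, as in the mutilated Figure 5 results.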

Criticisms of Statistical Significance Tests

Tenor of Past Criticism

The last several decades have seen exponential, decade-by-decade growth in criticisms of statistical testing practices across disciplines (Anderson, Burnham & Thompson, 1999). In their historical summary dating back to the origins of these tests, Huberty and Pike (in press) provide a thoughtful review of how we got to where we are. Among the recent commentaries on statistical testing practices, I prefer Cohen (1994), Kirk (1996), Rosnow and Rosenthal (1989), Schmidt (1996), and Thompson (1996). Among the classical criticisms, my favorites are Carver (1978), Meehl (1978), and Rozeboom (1960).

Among the more thoughtful works advocating statistical testing, I would cite Cortina and Dunlap (1997), Frick (1996), and especially Abelson (1997). The most balanced and comprehensive treatment is provided by Harlow, Mulaik and Steiger (1997) (for reviews of this book, see Levin, 1998 and Thompson, 1998c).

My purpose here is not to further articulate the various criticisms of statistical significance tests. My own recent thinking is elaborated in the several reports enumerated in Table 4. The focus here is on what should be the future. Therefore, criticisms of statistical tests are only briefly summarized in the present treatment.

__________________________

INSERT TABLE 4 ABOUT HERE.

__________________________

But two quotations may convey the tenor of some of these commentaries. Rozeboom (1997) recently argued that

Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students... [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism... (p. 335)

And Tryon (1998) recently lamented,

[T]he fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in minuscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are doubtless substantial... (p. 796)

Indeed, __empirical__ studies confirm that many researchers
do not fully understand the logic of their statistical tests (cf. Mittag, 1999;
Nelson, Rosenthal & Rosnow, 1986; Oakes, 1986; Rosenthal & Gaito, 1963;
Zuckerman, Hodgins, Zuckerman & Rosenthal, 1993). Misconceptions are taught
even in widely-used statistics textbooks (Carver, 1978).

Brief Summary of Four Criticisms of Common Practice

Statistical significance tests evaluate the probability of obtaining sample statistics (e.g., means, medians, correlation coefficients) that diverge as far from the null hypothesis as the sample statistics, or further, assuming that the null hypothesis is true in the population, and given the sample size (Cohen, 1994; Thompson, 1996). The utility of these estimates has been questioned on various grounds, four of which are briefly summarized here.

__Conventionally, Statistical Tests Assume "Nil" Null
Hypotheses__. Cohen (1994) defined a "nil" null hypothesis as a null
specifying no differences (e.g., H_{0}: __SD___{1} -
__SD___{2} = 0) or zero correlations (e.g., __R__^{2}=0).
Researchers must specify some null hypothesis, or otherwise the probability of
the sample statistics is completely indeterminate (Thompson, 1996)--infinitely
many __p__ values become equally plausible. But "nil" nulls are __not__
required. Nevertheless, "as almost universally used, the null in H_{0}
is taken to mean nil, zero" (Cohen, 1994, p. 1000).

Some researchers employ nil nulls because statistical theory does not easily accommodate the testing of some non-nil nulls. But probably most researchers employ nil nulls because these nulls have been unconsciously accepted as traditional, because these nulls can be mindlessly formulated without consulting previous literature, or because most computer software defaults to tests of nil nulls (Thompson, 1998c, 1999a). As Boring (1919) argued 80 years ago, in his critique of the mindless use of statistical tests titled, "Mathematical vs. scientific significance,"

The case is one of many where statistical ability, divorced from a scientific intimacy with the fundamental observations, leads nowhere. (p. 338)

I believe that when researchers presume a nil null is true in the population, an untruth is posited. As Meehl (1978, p. 822) noted, "As I believe is generally recognized by statisticians today and by thoughtful social scientists, the [nil] null hypothesis, taken literally, is always false." Similarly, Hays (1981, p. 293) pointed out that "[t]here is surely nothing on earth that is completely independent of anything else [in the population]. The strength of association may approach zero, but it should seldom or never be exactly zero." Roger Kirk (1996) concurred, noting that:

It is ironic that a ritualistic adherence to null hypothesis
significance testing has led researchers to focus on controlling the Type I
error that cannot occur because *all* null hypotheses are false. (p. 747,
emphasis added)

A __p___{CALCULATED} value computed on the
foundation of a false premise is inherently of somewhat limited utility. As I
have noted previously, "in many contexts the use of a 'nil' hypothesis as the
hypothesis we assume can render me largely disinterested in whether a result is
'nonchance'" (Thompson, 1997a, p. 30).

Particularly egregious is the use of "nil" nulls to test measurement hypotheses, where wildly non-nil results are both anticipated and demanded. As Abelson (1997) explained,

And when a reliability coefficient is declared to be nonzero, that is the ultimate in stupefyingly vacuous information. What we really want to know is whether an estimated reliability is .50'ish or .80'ish. (p. 121)

__Statistical Tests Can be a Tautological Evaluation of Sample
Size__. When "nil" nulls are used, the null will always be rejected at some
sample size. There are infinitely many possible sample effects. Given this, the
probability of realizing an exactly zero sample effect is infinitely small.
Therefore, given a "nil" null, and a non-zero sample effect, the null hypothesis
will __always__ be rejected at some sample size!

Consequently, as Hays (1981) emphasized, "virtually any study can be made to show significant results if one uses enough subjects" (p. 293). This means that

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they're tired. (Thompson, 1992c, p. 436)

Certainly this dynamic is well known, if it is just as widely ignored. More than 60 years ago, Berkson (1938) wrote an article titled, "Some difficulties of interpretation encountered in the application of the chi-square test." He noted that when working with data from roughly 200,000 people,

an observant statistician who has had any considerable
experience with applying the chi-square test repeatedly will agree with my
statement that, as a matter of observation, when the numbers in the data are
quite large, the __P__'s tend to come out small... [W]e know in advance the
__P__ that will result from an application of a chi-square test to a large
sample... But since the result of the former is known, it is no test at all!
(pp. 526-527)

Some 30 years ago, Bakan (1966) reported that, "The author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant" (p. 425). Shortly thereafter, Kaiser (1976) reported not being surprised when many substantively trivial factors were found to be statistically significant when data were available from 40,000 participants.
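These anecdotes are easy to reproduce arithmetically. In the sketch below (the effect and sample sizes are illustrative choices of my own), a trivially small standardized mean difference of 0.05 is held fixed while __n__ grows; the two-sided __p__ for a simple z test then collapses toward zero:

```python
import math

d = 0.05  # a trivially small standardized effect, held constant

ps = []
for n in (100, 10_000, 1_000_000):
    z = d * math.sqrt(n)             # z statistic for a one-sample test
    p = math.erfc(z / math.sqrt(2))  # two-sided p, normal approximation
    ps.append(p)
    print(f"n = {n:>9,}  z = {z:6.2f}  p = {p:.2e}")
```

With the effect frozen, the statistical test is in effect measuring only the sample size.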

__Because Statistical Tests Assume Rather than Test the
Population, Statistical Tests Do Not Evaluate Result Replicability__. Too many
researchers incorrectly assume, consciously or unconsciously, that the __p__
values calculated in statistical significance tests evaluate the
__p__robability that results will replicate (Carver, 1978, 1993). But
statistical tests do __not__ evaluate the probability that the sample
statistics occur in the population as parameters (Cohen, 1994).

Obviously, knowing the probability of the sample is less interesting than knowing the probability of the population. Knowing the probability of population parameters would bear upon result replicability, because we would then know something about the population from which future researchers would also draw their samples. But as Shaver (1993) argued so emphatically:

[A] test of statistical significance is not an indication of the probability that a result would be obtained upon replication of the study.... Carver's (1978) treatment should have dealt a death blow to this fallacy.... (p. 304)

And so Cohen (1994) concluded that the statistical significance test "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (p. 997).

__Statistical Significance Tests Do Not Solely Evaluate Effect
Magnitude__. Because various study features (including score reliability)
impact calculated __p__ values, __p___{CALCULATED} cannot be used
as a satisfactory index of study effect size. As I have noted elsewhere,

The calculated __p__ values in a given study are a function
of several study features, but are particularly influenced by the confounded,
joint influence of study sample size and study effect sizes. Because __p__
values are confounded indices, in theory 100 studies with varying sample sizes
and 100 different effect sizes could each have the same single
__p___{CALCULATED}, and 100 studies with the same single effect size
could each have 100 different values for __p___{CALCULATED}.
(Thompson, 1999a, pp. 169-170)
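The confounding is easy to exhibit with a simple z test (a sketch; the sample sizes are arbitrary): fixing z at 1.96 fixes the two-sided __p__ at about .05, no matter how wildly the implied effect size varies.

```python
import math

Z = 1.96                          # z value yielding a two-sided p of about .05
p = math.erfc(Z / math.sqrt(2))   # identical p for every row below

effects = []
for n in (10, 100, 1000, 10_000):
    d = Z / math.sqrt(n)          # effect size producing this same p at this n
    effects.append(d)
    print(f"n = {n:>6}  d = {d:.3f}  p = {p:.4f}")
```

A single __p__ of .05 is thus compatible with anything from a large effect in a tiny sample to a trivial effect in a huge one.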

The recent fourth edition of the American Psychological
Association style manual (APA, 1994) explicitly acknowledged that __p__
values are not acceptable indices of effect:

Neither of the two types of probability values [statistical
significance tests] reflects the importance or magnitude of an effect because
both depend on sample size... You are [therefore] *encouraged* to provide
effect-size information. (APA, 1994, p. 18, emphasis
added)

In short, effect sizes should be reported in every quantitative study.

The "Bootstrap"

Explanation of the "bootstrap" will provide a concrete basis for facilitating genuine understanding of what statistical tests do (and do not do). The "bootstrap" has been so named because this statistical procedure represents an attempt to "pull oneself up" on one's own, using one's sample data, without external assistance from a theoretically-derived sampling distribution.

Related books have been offered by Davison and Hinkley (1997), Efron and Tibshirani (1993), Manly (1994), and Sprent (1998). Accessible shorter conceptual treatments have been presented by Diaconis and Efron (1983) and Thompson (1993b). I especially and particularly recommend the remarkable book by Lunneborg (1999).

Software to invoke the bootstrap is available in most structural equation modeling software (e.g., EQS, AMOS). Specialized bootstrap software for microcomputers (e.g., S Plus, SC, and Resampling Stats) is also readily available.
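A minimal descriptive bootstrap can also be sketched in a few lines of plain numpy (the scores below are invented), without any of the specialized packages just named:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([3, 7, 8, 12, 13, 15, 18, 21, 24, 30])  # invented scores

# Resample the observed scores with replacement, computing the statistic of
# interest (here the median) in each resample. The spread of these estimates
# describes sampling variability "pulled up" from the sample itself, with no
# theoretically derived sampling distribution.
B = 5000
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap 95% percentile interval for the median: [{lo}, {hi}]")
```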

The Sampling Distribution

Key to understanding statistical significance tests is distinguishing among (a) the sampling distribution, (b) the population distribution, and (c) the sample score distribution. Among the better book treatments is one offered by Hinkle, Wiersma and Jurs (1998, pp. 176-178). Shorter treatments include those by Breunig (1995), Mittag (1992), and Rennie (1997).

The *population* distribution consists of the
__scores__ of the __N__ entities (e.g., people, laboratory mice) of
interest to the researcher, regarding whom the researcher wishes to generalize.
In the social sciences, many researchers deem the population to be infinite. For
example, an educational researcher may hope to generalize about the effects of a
teaching method on all human beings across time.

Researchers typically describe the population by computing or estimating characterizations of the population scores (e.g., means, interquartile ranges), so that the population can be more readily comprehended. These characterizations of the population are called "parameters," and are conventionally symbolized using Greek letters (e.g., μ for the population score mean, σ for the population score standard deviation).

The *sample* distribution __also__ consists of
__scores__, but only a subsample of __n__ scores from the population. The
characterizations of the sample scores are called "statistics," and are
conventionally represented by Roman letters (e.g., __M__, __SD__,
__r__). Strictly speaking, statistical significance tests evaluate the
probability of a given set of statistics occurring, assuming that the sample
came from a population exactly described by the null hypothesis, given the
sample size.

Because each sample is only a subset of the population scores,
the sample does not exactly reproduce the population distribution. Thus, each
set of sample scores contains some idiosyncratic variance, called "sampling
error" variance, much like each person has idiosyncratic personality features.
[Of course, sampling error variance should __not__ be confused with either
"measurement error" variance or "model specification" error variance (sometimes
modeled as the "within" or "residual" sum of squares in univariate analyses)
(Thompson, 1998a).] Of course, like people, samples may differ in
how much idiosyncratic "flukiness" they each contain.

Statistical tests evaluate the probability that the deviation of the sample statistics from the assumed population parameters is due to sampling error. That is, statistical tests evaluate whether random sampling from the population may explain the deviations of the sample statistics from the hypothesized population parameters.

However, very few researchers employ random samples from the population. Rokeach (1973) was an exception; being a different person living in a different era, he was able to hire the Gallup polling organization to provide a representative national sample for his inquiry. But in the social sciences fewer than 5% of studies are based on random samples (Ludbrook & Dudley, 1998).

On the basis that most researchers do not have random samples from the population, some (cf. Shaver, 1993) have argued that statistical significance tests should almost never be used. However, most researchers presume that statistical tests may be reasonable if there are grounds to believe that the score sample of convenience is expected to be reasonably representative of a population.

In order to evaluate the probability that the sample scores
came from a population of scores described exactly by the null hypothesis, given
the sample size, researchers typically invoke the *sampling distribution*.
The sampling distribution does __not__ consist of scores (except when the
sample size is one). Rather, the sampling distribution consists of estimated
__parameters__, each computed for samples of exactly size __n__, so as to
model the influences of random sampling error on the statistics estimating the
population parameters, given the sample size.

This sampling distribution is then used to estimate the probability of the observed sample statistic(s) occurring due to sampling error. For example, we might take the population to be infinitely many IQ scores normally distributed with a mean, median and mode of 100 and a standard deviation of 15. Perhaps we have drawn a sample of 10 people, and compute the sample median (not all hypotheses have to be about means!) to be 110. We wish to know how unlikely a sample median this large or larger is, assuming the sample came from the posited population.

We can make this determination by drawing all possible samples of size 10 from the population, computing the median of each sample, and then creating the distribution of these statistics (i.e., the sampling distribution). We then examine the sampling distribution, and locate the value of 110. Perhaps only 2% of the sample statistics in the sampling distribution are 110 or higher. This suggests to us that our observed sample median of 110 is relatively unlikely to have come from the hypothesized population.
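Because this population is infinite, all possible samples cannot literally be drawn, but the sampling distribution of the median can be approximated by Monte Carlo (a sketch; the 100,000 replications are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate the sampling distribution of the median for samples of n = 10
# drawn from a normal population with mean 100 and SD 15.
reps = 100_000
medians = np.median(rng.normal(100, 15, size=(reps, 10)), axis=1)

# Estimated probability of observing a sample median of 110 or higher
p_est = (medians >= 110).mean()
print(f"P(median >= 110 | hypothesized population) ~= {p_est:.3f}")
```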

The number of samples drawn for the sampling distribution from
a given population is a function of the population size, and the sample size.
The number of such different sets of population cases for a population of size
__N__ and a sample of size __n__ equals:

M = N! / (n! (N - n)!)

Clearly, if the population size is infinite (or even only
large), deriving all possible estimates becomes unmanageable. In such cases the
sampling distribution may be theoretically (i.e., mathematically) estimated,
rather than actually observed. Sometimes, rather than estimating the sampling
distribution, estimating an analog of the sampling distribution, called a "test
distribution" (e.g., __F__, __t__, χ^{2}) may be more manageable.

Heuristic Example for a Finite Population Case

Table 5 presents a finite population of scores for __N__=20
people. Presume that we wish to evaluate a sample mean for __n__=3 people. If
we know (or presume) the population, we can derive the sampling distribution (or
the test distribution) for this problem, so that we can then evaluate the
probability that the sample statistic of interest came from the assumed
population.

__________________________

INSERT TABLE 5 ABOUT HERE.

__________________________

Note that we are ultimately inferring the probability of the
sample statistic, and __not__ of the population parameter(s). Remember also
that some __specific__ population must be presumed, or infinitely many
sampling distributions (and consequently infinitely many
__p___{CALCULATED} values) are plausible, and the solution becomes
indeterminate.

Here the problem is manageable, given the relatively small population and sample sizes. The number of statistics creating this sampling distribution is

M = N! / (n! (N - n)!)

= 20! / (3! (20 - 3)!)

= 20! / (3! x 17!)

= 2.433E+18 / (6 x 3.557E+14)

= 2.433E+18 / 2.134E+15

= 1,140.
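The count is quickly verified; Python's standard library computes the combination directly:

```python
import math

# M = N! / (n! (N - n)!) for N = 20, n = 3
M = math.comb(20, 3)
print(M)  # → 1140
```

(`math.comb` is available in Python 3.8 and later.)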

Table 6 presents the first 85 and the last 10 potential samples. [The full sampling distribution takes 25 pages to present, and so is not presented here in its entirety.]

__________________________

INSERT TABLE 6 ABOUT HERE.

__________________________

Figure 6 presents the full sampling distribution of 1,140
estimates of the mean based on samples of size __n__=3 from the Table 5
population of __N__=20 scores. Figure 7 presents the analog of a test
statistic distribution (i.e., the sampling distribution in standardized
form).

__________________________________

INSERT FIGURES 6 AND 7 ABOUT HERE.

__________________________________

If we had a sample of size __n__=3, and had some reason to
believe and wished to evaluate the probability that the sample with a mean of
__M__ = 524.0 came from the Table 5 population of __N__=20 scores, we
could use the Figure 6 sampling distribution to do so. Statistic means (i.e.,
sample means) this large or larger occur about 25% of the time due to sampling
error.

In practice researchers most frequently use sampling
distributions of test statistics (e.g., __F__, __t__, χ^{2}), rather than the
sampling distributions of sample statistics, to evaluate sample results. This is
typical because the sampling distributions for many sample statistics change for
every study variation (e.g., changes for different statistics, changes for each
different sample size even for a given statistic). Sampling distributions of
test statistics (e.g., distributions of sample means each divided by the
population __SD__) are more general or invariant over these changes, and thus,
once they are estimated, can be used with greater regularity than the related
sampling distributions for statistics.

The problem is that the applicability and generalizability of
test distributions tend to be based on fairly strict assumptions (e.g., equal
variances of outcome variable scores across all groups in ANOVA). Furthermore,
test distributions have been developed for only a limited range of classical
statistics. For example, test distributions have __not__ been developed for
some "modern" statistics.

"Modern" Statistics

All "classical" statistics are centered about the arithmetic
mean, __M__. For example, the standard deviation (__SD__), the coefficient
of skewness (__S__), and the coefficient of kurtosis (__K__) are all
moments about the mean, respectively:

__SD___{X} = ((Σ (__X___{i} - __M___{X})^{2}) / (__n__ - 1))^{.5} = ((Σ __x___{i}^{2}) / (__n__ - 1))^{.5};

Coefficient of __S__kewness_{X} (__S___{X}) = (Σ [(__X___{i} - __M___{X}) / __SD___{X}]^{3}) / __n__; and

Coefficient of __K__urtosis_{X} (__K___{X}) = ((Σ [(__X___{i} - __M___{X}) / __SD___{X}]^{4}) / __n__) - 3.
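A minimal Python sketch of these three moment-based statistics, exactly as defined above (the __n__ - 1 divisor for the __SD__, the __n__ divisor for the skewness and kurtosis coefficients; the scores shown are hypothetical):

```python
# Moment-based "classical" statistics: SD (n - 1 divisor), then the
# coefficients of skewness and kurtosis as sums of standardized
# deviations raised to the 3rd and 4th powers, divided by n.
def moments(scores):
    n = len(scores)
    m = sum(scores) / n
    sd = (sum((x - m) ** 2 for x in scores) / (n - 1)) ** 0.5
    skew = sum(((x - m) / sd) ** 3 for x in scores) / n
    kurt = sum(((x - m) / sd) ** 4 for x in scores) / n - 3
    return m, sd, skew, kurt

m, sd, skew, kurt = moments([2, 4, 4, 4, 5, 5, 7, 9])
```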

Similarly, the Pearson product-moment correlation invokes deviations from the means of the two variables being correlated:

__r___{XY} = ((Σ (__X___{i} - __M___{X})(__Y___{i} - __M___{Y})) / (__n__ - 1)) / (__SD___{X} * __SD___{Y}).
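The Pearson formula can be computed directly from this definition:

```python
# Pearson r: the covariance (deviation cross-products divided by n - 1)
# divided by the product of the two standard deviations (also n - 1 based).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sd_y = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return cov / (sd_x * sd_y)
```

For perfectly linear data the function returns 1.0 (or -1.0 for a perfect inverse relationship), as expected.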

The problem with "classical" statistics invoking the mean is that these estimates are notoriously influenced by atypical scores (outliers), partly because the mean itself is differentially influenced by outliers. Table 7 presents a heuristic data set that can be used to illustrate both these dynamics and two alternative "modern" statistics that can be employed to mitigate these problems.

__________________________

INSERT TABLE 7 ABOUT HERE.

__________________________

Wilcox (1997) presents an elaboration of some "modern" statistics choices. A shorter accessible treatment is provided by Wilcox (1998). Also see Keselman, Kowalchuk, and Lix (1998) and Keselman, Lix and Kowalchuk (1998).

The variable __X__ in Table 7 is somewhat positively skewed
(__S___{X} = 2.40), as reflected by the fact that the mean
(__M___{X} = 500.00) is to the right of the median
(__Md___{X} = 461.00). One "modern" method "winsorizes" (à la
statistician Charles Winsor) the score distribution by substituting less extreme
values in the distribution for more extreme values. In this example, the 4th
score (i.e., 433) is substituted for scores 1 through 3, and in the other tail
the 17th score (i.e., 560) is substituted for scores 18 through 20. Note that
the mean of this distribution, __M___{X'} = 480.10, is less extreme
than the original value (i.e., __M___{X} = 500.00).

Another "modern" alternative "trims" the more extreme scores,
and then computes a "trimmed" mean. In this example, .15 of the distribution is
trimmed from each tail. The resulting mean, __M___{X-} = 473.07, is
closer to the median of the distribution, which has remained 461.00.

Some "classical" statistics can also be framed as "modern." For example, the interquartile range (75th %ile - 25th %ile) might be thought of as a "trimmed" range.
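The two "modern" estimates just described can be sketched as follows, for an already sorted list of scores; with 20 scores and a proportion of .15, g = 3 scores are winsorized or trimmed in each tail, matching the Table 7 example (the test data below are hypothetical):

```python
# Winsorizing substitutes the g-th score from each end for the g more
# extreme scores in that tail; trimming simply drops those scores.
def winsorized_mean(sorted_scores, prop=0.15):
    n = len(sorted_scores)
    g = int(prop * n)  # scores replaced in each tail
    w = ([sorted_scores[g]] * g
         + sorted_scores[g:n - g]
         + [sorted_scores[n - g - 1]] * g)
    return sum(w) / n

def trimmed_mean(sorted_scores, prop=0.15):
    n = len(sorted_scores)
    g = int(prop * n)  # scores dropped from each tail
    kept = sorted_scores[g:n - g]
    return sum(kept) / len(kept)
```

With a single large outlier in one tail, both estimates fall much closer to the median than the ordinary mean does.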

In theory, "modern" statistics may generate more replicable characterizations of data, because at least in some respects the influence of more extreme scores, which are less likely to be drawn in future samples from the tails of a non-uniform (non-rectangular or non-flat) population distribution, has been minimized. However, "modern" statistics have not been widely employed in contemporary research, primarily because generally-applicable test distributions are often not available for such statistics.

Traditionally, the tail of statistical significance testing has wagged the dog of characterizing our data in the most replicable manner. However, the "bootstrap" may provide a vehicle for statistically testing, or otherwise exploring, "modern" statistics.

Univariate Bootstrap Heuristic Example

The *bootstrap* logic has been elaborated by various
methodologists, but much of this development has been due to Efron and his
colleagues (cf. Efron, 1979). As explained elsewhere,

Conceptually, these methods involve copying the data set on top
of itself again and again infinitely many times to thus create an infinitely
large "mega" data set (what's actually done is resampling from the original data
set *with replacement*). Then hundreds or thousands of different samples
[each of size __n__] are drawn from the "mega" file, and results [i.e., the
statistics of interest] are computed separately for each sample and then
averaged [and characterized in various ways]. (Thompson, 1993b, p.
369)

Table 8 presents a heuristic data set to make concrete selected aspects of bootstrap analysis. The example involves the numbers of churches and murders in 45 cities. These two variables are highly correlated. [The illustration makes clear the folly of inferring causal relationships, even from a "causal modeling" SEM analysis, if the model is not exactly correctly "specified" (cf. Thompson, 1998a).] The statistic examined here is the bivariate product-moment correlation coefficient. This statistic is "univariate" in the sense that only a single dependent/outcome variable is involved.

__________________________

INSERT TABLE 8 ABOUT HERE.

__________________________

Figure 8 presents a scattergram portraying the linear
relationship between the two measured/observed variables. For the heuristic
data, __r__ equals .779.

___________________________

INSERT FIGURE 8 ABOUT HERE.

___________________________

In this example 1,000 resamples of the rows of the Table 8 data
were drawn, each of size __n__=45, so as to model the sampling error
influences in the actual data set. In each "resample," because sampling from the
Table 8 data was done "with replacement," a given row of the data may have been
sampled multiple times, while another row of scores may not have been drawn at
all. For this analysis the bootstrap software developed by Lunneborg (1987) was
used. Table 9 presents some of the 1,000 bootstrapped estimates of __r__.

__________________________

INSERT TABLE 9 ABOUT HERE.

__________________________
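The resampling loop itself is simple to sketch in Python (rather than in Lunneborg's software); rows are drawn with replacement, __n__ at a time, and __r__ is recomputed for each resample. The data in the test below are hypothetical, not the Table 8 values:

```python
import random

# Pearson r from raw sums of squares and cross-products.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_r(rows, resamples=1000, seed=1):
    """Bootstrap-estimated sampling distribution of r for paired rows."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(resamples):
        # Sampling WITH replacement: a row may be drawn several times,
        # another row not at all, within any one resample.
        sample = [rng.choice(rows) for _ in range(n)]
        xs, ys = zip(*sample)
        estimates.append(pearson_r(xs, ys))
    return estimates
```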

Figure 9 presents a graphic representation of the
bootstrap-estimated sampling distribution for this case. Because __r__,
although a characterization of linear relation, is not itself linear (i.e.,
__r__=1.00 is not twice __r__=.50), Fisher's __r__-to-Z transformations
of the 1,000 resampled __r__ values were also computed as:

__r__-to-Z = .5 ln[(1 + __r__)/(1 - __r__)] (Hays, 1981,
p. 465).

In SPSS this could be computed as:

compute r_to_z=.5 * ln ((1 + r)/(1 - r)).

Figure 10 presents the bootstrap-estimated sampling distribution for these values.

___________________________________

INSERT FIGURES 9 AND 10 ABOUT HERE.

___________________________________

Descriptive vs. Inferential Uses of the Bootstrap

The bootstrap *can* be used to test statistical
significance. For example, the bootstrap can be used to estimate, through Monte
Carlo simulation, sampling distributions when theoretical distributions (e.g.,
test distributions) are not known for some problems (e.g., "modern"
statistics).

The standard deviation of the bootstrap-estimated sampling
distribution characterizes the variability of the statistics estimating given
population parameters. The standard deviation of the sampling distribution is
called the "standard error of the estimate" (e.g., the standard error of the
mean, __SE___{M}). [The decision to call this standard deviation the
"standard error," so as to confuse the graduate students into not realizing that
__SE__ is an __SD__, was taken decades ago at an annual methodologists'
coven--in the coven priority is typically afforded to most confusing the
students regarding the most important concepts.] The __SE__ of a statistic
characterizes the precision or variability of the estimate.

The ratio of the statistic estimating a parameter to the
__SE__ of that estimate is a very important idea in statistics, and thus is
called by various names, such as "__t__," "Wald statistic," and "critical
ratio" (so as to confuse the students regarding an important concept). If the
statistic is large, but the __SE__ is even larger, a researcher may elect not
to vest much confidence in the estimate. Conversely, even if a statistic is
small (i.e., near zero), if the __SE__ of the statistic is very, very small,
the researcher may deem the estimate reasonably precise.

In classical statistics researchers typically estimate the
__SE__ as part of statistical testing by invoking numerous assumptions about
the population and the sampling distribution (e.g., normality of the sampling
distribution). Such __SE__ estimates are __theoretical__.

The __SD__ of the bootstrapped sampling distribution, on the
other hand, is an __empirical__ estimate of the sampling distribution's
variability. This estimate does not require as many assumptions.
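The contrast can be made concrete for the mean, where theory gives __SE___{M} = __SD__ / √__n__, while the empirical estimate is simply the __SD__ of the bootstrapped sampling distribution of means. The scores below are hypothetical; for well-behaved data the two estimates roughly agree:

```python
import random
from statistics import mean, stdev

def bootstrap_se_of_mean(scores, resamples=2000, seed=1):
    """Empirical SE: the SD of the bootstrapped distribution of means."""
    rng = random.Random(seed)
    n = len(scores)
    means = [mean(rng.choices(scores, k=n)) for _ in range(resamples)]
    return stdev(means)

scores = [484, 461, 433, 560, 505, 521, 498, 476, 512, 447,
          530, 466, 491, 503, 458, 519, 472, 540, 487, 495]
theoretical = stdev(scores) / len(scores) ** 0.5  # SD / sqrt(n)
empirical = bootstrap_se_of_mean(scores)
```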

Table 10 presents selected percentiles for two bootstrapped
__r__-to-z sampling distributions for the Table 8 data, one involving 100
resamples, and one involving 1,000 resamples. Notice that percentiles near the
means or the medians of the two distributions tend to be closer than the values
in the tails, and here especially in the left tail (small __z__ values) where
there are fewer values, because the distribution is skewed left. This purely
heuristic comparison makes an extremely important conceptual point that clearly
distinguishes inferential versus descriptive applications of the bootstrap.

___________________________

INSERT TABLE 10 ABOUT HERE.

___________________________

When we employ the bootstrap for inferential purposes (i.e., to
estimate the probability of the sample statistics), focus shifts to the extreme
tails of the distributions, where the less likely (and less frequent) statistics
are located, because we typically invoke small values of __p__ in statistical
tests. These are exactly the locations where the estimated distribution
densities are most unstable, because there are relatively few scores here
(presuming the sampling distribution does not have an extraordinarily small
__SE__). Thus, when we invoke the bootstrap to conduct statistical
significance tests, extremely large numbers of resamples are required (e.g.,
2,000, 5,000).

However, when our application is descriptive, we are primarily
interested in the mean (or median) statistic and the __SD__/__SE__ from
the sampling distribution. These values are less dependent on large numbers of
resamples. This is said not to discourage large numbers of resamples (which are
essentially free to use, given modern microcomputers), but is noted instead to
emphasize these two very distinct uses of the bootstrap.

The descriptive focus is appropriate. We hope to avoid obtaining results that no one else can replicate (partly because we are good scientists searching for generalizable results, and partly simply because we do not wish to be embarrassed by discovering the social science equivalent of cold fusion). The challenge is obtaining results that reproduce over the wide range of idiosyncrasies of human personality.

The descriptive use of the bootstrap provides some evidence, short of a real (and preferred) "external" replication (cf. Thompson, 1996) of our study, that results may generalize. As noted elsewhere,

If the mean estimate [in the estimated sampling distribution] is like our sample estimate, and the standard deviation of estimates from the resampling is small, then we have some indication that the result is stable over many different configurations of subjects. (Thompson, 1993b, p. 373)

Multivariate Bootstrap Heuristic Example

The bootstrap can also be generalized to multivariate cases (e.g., Thompson, 1988b, 1992a, 1995a). The barrier to this application is that a given multivariate "factor" (also called "equation," "function," or "rule," for reasons that are, by now, obvious) may be manifested in different locations.

For example, perhaps a measurement of androgyny purports to measure two factors: masculine and feminine. In one resample masculine may be the first factor, while in the second resample masculine might be the second factor. In most applications we have no particular theoretical expectation that "factors" ("functions," etc.) will always replicate in a given order. However, if we average and otherwise characterize statistics across resamples without initially locating given constructs in the same locations, we will be pooling apples, oranges, and tangerines, and merely be creating a mess.

This barrier to the multivariate use of the bootstrap can be resolved by using Procrustean methods to rotate all "factors" into a single, common factor space prior to characterizing the results across the resamples. A brief example may be useful in communicating the procedure.

Figure 11 presents DDA/MANOVA results from an analysis of Sir Ronald Fisher's (1936) classic data for iris flowers. Here the bootstrap was conducted using my DISCSTRA program (Thompson, 1992a) to conduct 2,000 resamples.

____________________________

INSERT FIGURE 11 ABOUT HERE.

____________________________

Figure 12 presents a partial listing of the resampling of
__n__=150 rows of data (i.e., the resample size exactly matches the original
sample size). Notice in Figure 12 that case #27 was selected at least twice as
part of the first resample.

____________________________

INSERT FIGURE 12 ABOUT HERE.

____________________________

Figure 13 presents selected results for both the first and the last resamples. Notice that the function coefficients are first rotated to best-fit position with a common designated target matrix, and then the structure coefficients are computed using these rotated results. [Here the rotations made little difference, because the functions by happenstance already fairly closely matched the target matrix--here the function coefficients from the original sample.]

____________________________

INSERT FIGURE 13 ABOUT HERE.

____________________________

Figure 14 presents an abridged map of participant selection across the 2,000 resamples. We can see that the 150 flowers were each selected approximately 2,000 times, as expected if the random selection with replacement is truly random.

____________________________

INSERT FIGURE 14 ABOUT HERE.

____________________________

Figure 15 presents a summary of the bootstrap DDA results. For
example, the mean statistic across the 2,000 resamples is computed along with
the __empirically-estimated__ standard error of each statistic. As generally
occurs, __SE__'s tend to be smaller for statistics that deviate most from
zero; these coefficients tend to reflect real (non-sampling error variance)
dynamics within the data, and therefore tend to re-occur across samples.

____________________________

INSERT FIGURE 15 ABOUT HERE.

____________________________

However, notice in Figure 15 that the __SE__'s for the
standardized function coefficients on Function I for variables __X2__ and
__X4__ were both essentially .40, even though the mean estimates of the two
coefficients appear to be markedly different (i.e., |1.6| and |2.9|). In a
theoretically-grounded estimate, for a given __n__ and a given population
estimate, the __SE__ will be identical. But bootstrap methods do not require
the sometimes unrealistic assumption that related coefficients even in a given
analysis with a common fixed __n__ have the same sampling
distributions.

Clarification and an Important Caveat

The bootstrap methods modeled here presume that the sample size is somewhat large (i.e., more than 20 to 40). In these cases the bootstrap invokes resampling with replacement. For small samples other methods are employed.

It is also important to emphasize that "bootstrap methods do not magically take us beyond the limits of our data" (Thompson, 1993b, p. 373). For example, the bootstrap cannot make an unrepresentative sample representative. And the bootstrap cannot make a quasi-experiment with intact groups mimic results for a true experiment in which random assignment is invoked. The bootstrap cannot make data from a correlational (i.e., non-experimental) design yield unequivocal causal conclusions.

Thus, Lunneborg (1999) makes very clear and careful
distinctions between bootstrap applications that may support either (a)
population inference (i.e., the study design invoked random sampling), __or__
(b) evaluation of how "local" a causal inference may be (i.e., the study design
invoked random assignment to experimental groups, but not random selection),
__or__ (c) evaluation of how "local" non-causal descriptions may be (i.e.,
the design invoked neither random sampling nor random assignment). Lunneborg
(1999) quite rightly emphasizes how critical it is to match study
design/purposes and the bootstrap modeling procedures.

The bootstrap and related "internal" replicability analyses are not magical. Nevertheless, these methods can be useful because

the methods combine the subjects in hand in [numerous] different ways to determine whether results are stable across sample variations, i.e., across the idiosyncracies of individuals which make generalization in social science so challenging. (Thompson, 1996, p. 29)

Effect Sizes

As noted previously, __p___{CALCULATED} values are
__not__ suitable indices of effect, "because both [types of __p__ values]
*depend on sample size*" (APA, 1994, p. 18, emphasis added). Furthermore,
unlikely events are not intrinsically noteworthy (see Shaver's (1985) classic
example). Consequently, the APA publication manual now "encourages" (p. 18)
authors to report effect sizes.

Unfortunately, a growing corpus of __empirical__ studies of
published articles portrays a consensual view that merely "encouraging" effect
size reporting (APA, 1994) has __not__ appreciably affected actual reporting
practices (e.g., Keselman et al., 1998; Kirk, 1996; Lance & Vacha-Haase,
1998; Nilsson & Vacha-Haase, 1998; Reetz & Vacha-Haase, 1998; Snyder
& Thompson, 1998; Thompson, 1999b; Thompson & Snyder, 1997, 1998;
Vacha-Haase & Ness, 1999; Vacha-Haase & Nilsson, 1998). Table 11
summarizes 11 empirical studies of recent effect size reporting practices in 23
journals.

___________________________

INSERT TABLE 11 ABOUT HERE.

___________________________

Although some of the Table 11 results appear to be more
favorable than others, it is important to note that in some of the 11 studies
effect sizes were counted as being reported even if the relevant results were
not interpreted (e.g., an __r__^{2} was reported but not interpreted
as being big or small, or noteworthy or not). This dynamic is dramatically
illustrated in the Keselman et al. (1998) results, because the reported results
involved an exclusive focus on between-subjects OVA designs, and thus there were
no spurious counts of incidental variance-accounted-for statistic reports. Here
Keselman et al. (1998) concluded that, "as anticipated, effect sizes were almost
never reported along with __p__-values" (p. 358).

If the baseline expectation is that effect sizes should be reported in 100% of quantitative studies (mine is), the Table 11 results are disheartening. Elsewhere I have presented various reasons why I anticipate that the current APA (1994, p. 18) "encouragement" will remain largely ineffective. I have noted that an "encouragement" is so vague as to be unenforceable (Thompson, in press-b). I have also observed that only "encouraging" effect size reporting:

presents a self-canceling mixed-message. To present an "encouragement" in the context of strict absolute standards regarding the esoterics of author note placement, pagination, and margins is to send the message, "these myriad requirements count, this encouragement doesn't." (Thompson, in press-b)

Two Heuristic Hypothetical Literatures

Two heuristic hypothetical literatures can be presented to illustrate the deleterious impacts of contemporary traditions. Here, results are reported for both statistical tests and effect sizes.

__Twenty "TinkieWinkie" Studies__. First, presume that a
televangelist suddenly denounces a hypothetical children's television character,
"TinkieWinkie," based on a claim that the character, by its very appearance
and behavior, incites moral depravity in 4-year-olds.

This claim immediately incites inquiries by 20 research teams, each working independently without knowledge of each others' results. These researchers conduct experiments comparing the differential effects of "The TinkieWinkie Show" against those of "Sesame Street," or "Mr. Rogers," or both.

This work results in the nascent literature presented in
Table 12. The eta^{2} effect sizes from the 20 (10 two-level one-way and
10 three-level one-way) ANOVAs range from 1.2% to 9.9% (__M___{sq
eta}=3.00%; __SD___{sq eta}=2.0%) as regards moral depravity
being induced by "The TinkieWinkie Show." However, as reported in Table 12, only
1 of the 20 studies results in a statistically significant effect.

___________________________

INSERT TABLE 12 ABOUT HERE.

___________________________

The 19 research teams finding no statistically significant
differences in the treatment effects on the moral depravity of 4-year-olds
obtained effect sizes ranging from eta^{2}=1.2% to eta^{2}=4.8%.
Unfortunately, these 19 research teams are acutely aware of how
non-statistically significant findings are valued within the profession.

They are acutely aware, for example, that revised versions of published articles were rated more highly by counseling practitioners if the revisions reported statistically significant findings than if they reported statistically nonsignificant findings (Cohen, 1979). The research teams are also acutely aware of Atkinson, Furlong and Wampold's (1982) study in which

101 consulting editors of the __Journal of Counseling
Psychology__ and the __Journal of Consulting and Clinical Practice__ were
asked to evaluate three versions, differing only with regard to level of
statistical significance, of a research manuscript. The statistically
nonsignificant and approach significance versions were more than three times as
likely to be recommended for rejection than was the statistically significant
version. (p. 189)

Indeed, Greenwald (1975) conducted a study of 48 authors and 47
reviewers for the __Journal of Personality and Social Psychology__ and
reported a

0.49 (± .06) probability of submitting a rejection of the null
hypothesis for publication (Question 4a) compared to the low probability of 0.06
(± .03) for submitting a nonrejection of the null hypothesis for publication
(Question 5a). A secondary bias is apparent [as well] *in the probability of
continuing with a problem [in future inquiry]*. (p. 5, emphasis
added)

This is the well known "file drawer problem" (Rosenthal, 1979). In the present instance, some of the 19 research teams failing to reject the null hypothesis decide not to even submit their work, while the remaining teams have their reports rejected for publication. Perhaps these researchers were socialized by a previous version of the APA publication manual, which noted that:

Even when the theoretical basis for the prediction is clear and defensible, the burden of methodological precision falls heavily on the investigator who reports negative results. (APA, 1974, p. 21)

Here only the one statistically significant result is published; everyone remains happily oblivious to the overarching substance of the literature in its entirety.

The problem is that setting a low alpha only means that the
probability of a Type I error will be small __on the average__. In the
literature as a whole, some unlikely Type I errors are still inevitable. These
will be afforded priority for publication. Yet publishing replication
disconfirmations of these Type I errors will be discouraged normatively.
Greenwald (1975, pp. 13-15) cites the expected actual examples of such
epidemics. In short, contemporary practice as regards statistical tests actively
discourages some forms of replication, or at least discourages disconfirming
replications being published.

__Twenty Cancer Treatment Studies__. Here researchers learn
of a new theory that a newly synthesized protein regulates the growth of blood
supply to cancer tumors. It is theorized that the protein might be used to
prevent new blood supplies from flowing to new tumors, or even that the protein
might be used to reduce existing blood flow to tumors and thus lead to cancer
destruction. The protein is synthesized.

Unfortunately, given the newness of the theory and the absence of previous related empirical studies upon which to ground power analyses for their new studies, the 20 research teams institute inquiries that are slightly under-powered. The results from these 20 experiments are presented in Table 13.

___________________________

INSERT TABLE 13 ABOUT HERE.

___________________________

Here all 20 studies yield __p___{CALCULATED} values
of roughly .06 (range = .0598 to .0605). As reported in Table 13, the effect
sizes range from 15.1% to 62.8%. In the present scenario, only a few of the
reports are submitted for publication, and none are published.

Yet, these inquiries yielded effect sizes ranging from
eta^{2}=15.1%, which Cohen (1988, pp. 26-27) characterized as "large,"
at least as regards result typicality, up to eta^{2}=62.8%. And a
life-saving outcome variable is being measured! At the individual study level,
perhaps each research team has decided that __p__ values evaluate result
replicability, and remain oblivious to the uniformity of efficacy findings
across the literature.

Some researchers remain devoted to statistical tests because
of their professed dedication to reporting only replicable
results, and because they erroneously believe that statistical significance
evaluates result replicability (Cohen, 1994). In summary, **it would be the
abject height of irony if, out of devotion to replication, we continued to
worship at the tabernacle of statistical significance testing, and at the same
time we declined to (a) formulate our hypotheses by explicit consultation of the
effect sizes reported in previous studies and (b) explicitly interpret our
obtained effect sizes in relation to those reported in related previous
inquiries**.

An Effect Size Primer

Given the central role that effect sizes should play with quantitative studies, at least a brief review of the available choices is warranted here. Very good treatments are also available from Kirk (1996), Rosenthal (1994), and Snyder and Lawson (1993).

There are dozens of effect size estimates, and no single one-size-fits-all choice. The effect sizes can be divided into two major classes: (a) standardized differences and (b) variance-accounted-for measures of strength of association. [Kirk (1996) identifies a third, "miscellaneous" category, and also summarizes some of these choices.]

__Standardized differences__. In experimental studies, and
especially studies with only two groups where the mean is of primary interest,
the difference in means can be "standardized" by dividing the difference by
some estimate of the population score standard deviation, σ. For
example, in his seminal work on meta-analysis, Glass (cf. 1976) proposed that
the difference in the two means could be divided by the *control group*
standard deviation to estimate Δ.

Glass presumed that the control group standard deviation is the
best estimate of σ. This is reasonable particularly if
the control group received no treatment, or a placebo treatment. For example,
for the Table 2 variable, __X__, if the second of the two groups was taken as
the control group,

Δ_{X} = (12.50 - 11.50) / 7.68 = .130.

In this estimation the variance (see Table 2 note) is computed
by dividing the sum of squares by __n__-1.

However, others have taken the view that the most accurate
standardization can be realized by use of a *"pooled"* (across groups)
estimate of the population standard deviation. Hedges (1981) advocated
computation of __g__ using the standard deviation computed as the square root
of a pooled variance based on division of the sum of squares by __n__-1. For
the Table 2 variable, __X__,

__g___{X} = (12.50 - 11.50) / 7.49 = .134.

Cohen (1969) argued for the use of __d__, which divides the
mean difference by a *"pooled"* standard deviation computed as the square
root of a pooled variance based on division of the sum of squares by __n__.
For the Table 2 variable, __X__,

__d___{X} = (12.50 - 11.50) / 7.30 = .137.
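The three standardized-difference estimates differ only in the divisor, which a short sketch makes explicit (the groups in the test are hypothetical, not the Table 2 data): Glass's estimate divides by the control-group __SD__ (__n__ - 1 divisor), Hedges's __g__ by a pooled __n__ - 1 based __SD__, and Cohen's __d__ by a pooled __n__-based __SD__.

```python
def _mean(scores):
    return sum(scores) / len(scores)

def _ss(scores):
    # Sum of squared deviations about the group mean.
    m = _mean(scores)
    return sum((x - m) ** 2 for x in scores)

def glass_delta(exp, ctrl):
    # Standardize by the control-group SD (n - 1 divisor).
    sd_ctrl = (_ss(ctrl) / (len(ctrl) - 1)) ** 0.5
    return (_mean(exp) - _mean(ctrl)) / sd_ctrl

def hedges_g(exp, ctrl):
    # Standardize by the pooled SD with an n1 + n2 - 2 divisor.
    sd = ((_ss(exp) + _ss(ctrl)) / (len(exp) + len(ctrl) - 2)) ** 0.5
    return (_mean(exp) - _mean(ctrl)) / sd

def cohen_d(exp, ctrl):
    # Standardize by the pooled SD with an n1 + n2 divisor.
    sd = ((_ss(exp) + _ss(ctrl)) / (len(exp) + len(ctrl))) ** 0.5
    return (_mean(exp) - _mean(ctrl)) / sd
```

Because the __n__-based pooled __SD__ is the smallest of the three divisors, __d__ is always at least as large as __g__ for the same data.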

As regards these choices, there is (as usual) no one always right one-size-fits-all choice. The comment by Huberty and Morris (1988, p. 573) is worth remembering generically: "As in all of statistical inference, subjective judgment cannot be avoided. Neither can reasonableness!"

In some studies the control group standard deviation provides the most reasonable standardization, while in others a "pooling" mechanism may be preferred. For example, an intervention may itself change score variability, and in these cases Glass's Δ may be preferred. But otherwise the "pooled" value may provide the more statistically precise estimate.

As regards correction for statistical bias by division by
__n__-1 versus __n__, of course the competitive differences here are a
function of the value of __n__. As __n__ gets larger, it makes less
difference which choice is made. This division is equivalent to multiplication
by 1 / the divisor. Consider the differential impacts on estimates derived using
the following selected choices of divisors.

__n__       1/Divisor   __n__-1   1/Divisor    Difference

10          .1000       9         .111111      .011111

100         .0100       99        .010101      .000101

1000        .0010       999       .001001      .000001

10000       .0001       9999      .00010001    .00000001

__Variance-accounted-for__. Given the omnipresence of the
General Linear Model, __all__ analyses are correlational (cf. Thompson,
1998a), and (as noted previously) an r^{2} effect size (e.g.,
eta^{2}, __R__^{2}, omega^{2} [ω^{2}; Hays, 1981], adjusted __R__^{2})
can be computed in __all__ studies. Generically, in univariate analyses
"uncorrected" variance-accounted-for effect sizes (e.g., eta^{2},
__R__^{2}) can be computed by dividing the sum of squares "explained"
("between," "model," "regression") by the sum of squares of the outcome variable
(i.e., the sum of squares "total"). For example, in the Figure 3 results, the
univariate eta^{2} effect sizes were both computed to be 0.469% (e.g.,
5.0 / [5.0 + 1061.0] = 5.0 / 1065.0 = .00469).

In multivariate analysis, one estimate of eta^{2} can
be computed as 1 minus Wilks' lambda (Λ). For example, for the
Figure 3 results, the multivariate eta^{2} effect size was computed as
(1 - .37500) = .625.

__Correcting for score measurement unreliability__. It is
well known that score unreliability tends to attenuate __r__ values (cf.
Walsh, 1996). Thus, some (e.g., Hunter & Schmidt, 1990) have recommended
that effect sizes be estimated incorporating statistical corrections for
measurement error. However, such corrections must be used with caution, because
any error in estimating the reliability will considerably distort the effect
sizes (cf. Rosenthal, 1991).

Because scores (__not__ tests) are reliable, reliability
coefficients fluctuate from administration to administration (Reinhardt, 1996).
In a given empirical study, the reliability for the data in hand may be used for
such corrections. In other cases, more confidence may be vested in these
corrections if the reliability estimates employed are based on the important
meta-analytic "reliability generalization" method proposed by Vacha-Haase
(1998).

__"Corrected" vs. "uncorrected" variance-accounted-for
estimates__. "Classical" statistical methods (e.g., ANOVA, regression, DDA)
use the statistical theory called "ordinary least squares." This theory
optimizes the fit of the synthetic/latent variables (e.g., Y^) to the
observed/measured outcome/response variables (e.g., __Y__) in the sample
data, and capitalizes on __all__ the variance present in the observed sample
scores, including the "sampling error variance" that is idiosyncratic to the
particular sample. Because sampling error variance is unique to a given sample
(i.e., each sample has its own sampling error variance), "uncorrected"
variance-accounted-for effect sizes somewhat overestimate the effects that would
be replicated by applying the same weights (e.g., regression beta weights) in
either (a) the population or (b) a different sample.

However, statistical theory (or the descriptive bootstrap) can be invoked to estimate the extent of overestimation (i.e., positive bias) in the variance-accounted-for effect size estimate. [Note that "corrected" estimates are always less than or equal to "uncorrected" values.] The difference between the "uncorrected" and "corrected" variance-accounted-for effect sizes is called "shrinkage."

For example, for regression the "corrected" effect size
"adjusted __R__^{2}" is routinely provided by most statistical
packages. This correction is due to Ezekiel (1930), although the formula is
often incorrectly attributed to Wherry (Kromrey & Hines, 1996):

1 - ((__n__ - 1) / (__n__ - __v__ - 1)) x (1 -
__R__^{2}),

where __n__ is the sample size and __v__ is the number of
predictor variables. The formula can be equivalently expressed as:

__R__^{2} - ((1 - __R__^{2}) x (__v__ /
(__n__ - __v__ -1))).
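A minimal sketch of the Ezekiel (1930) correction in both of the algebraically equivalent forms shown above; the __n__, __v__, and __R__^{2} values here are hypothetical.

```python
def adjusted_r2(r2, n, v):
    """1 - ((n - 1) / (n - v - 1)) * (1 - R^2)."""
    return 1.0 - ((n - 1) / (n - v - 1)) * (1.0 - r2)

def adjusted_r2_alt(r2, n, v):
    """R^2 - ((1 - R^2) * (v / (n - v - 1)))."""
    return r2 - (1.0 - r2) * (v / (n - v - 1))

a = adjusted_r2(0.50, 30, 3)      # n = 30 cases, v = 3 predictors
b = adjusted_r2_alt(0.50, 30, 3)
print(round(a, 4), round(b, 4))   # both forms yield 0.4423
```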

In the ANOVA case, the analogous "corrected" omega^{2} can be
computed using the formula due to Hays (1981, p. 349):

(SS_{BETWEEN} - (__k__ - 1) x MS_{WITHIN}) /
(SS_{TOTAL} + MS_{WITHIN}),

where __k__ is the number of groups.
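A minimal sketch of Hays' (1981) corrected omega-squared for a one-way ANOVA, using the formula above; the sums of squares, __k__ = 3 groups, and __N__ = 43 total cases are made-up illustrative numbers.

```python
def omega_squared(ss_between, ss_within, k, n_total):
    # MS_within = SS_within / df_within, where df_within = N - k.
    ms_within = ss_within / (n_total - k)
    ss_total = ss_between + ss_within
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

w = omega_squared(100.0, 400.0, 3, 43)
print(round(w, 4))   # 0.1569, somewhat below the uncorrected 100/500 = .20
```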

In the multivariate case, a multivariate omega^{2} due
to Tatsuoka (1973a) can be used as a "corrected" effect estimate. Of course, using
univariate effect sizes to characterize multivariate results would be just as
wrong-headed as using ANOVA methods *post hoc* to MANOVA. As Snyder and
Lawson (1993) perceptively noted, "researchers asking multivariate questions
will need to use magnitude-of-effect indices that are consistent with their
multivariate view of the research problem" (p. 341).

Although "uncorrected" effects for a sample are larger than the
"corrected" effects estimated for the population, the "corrected" estimates for
the population effect (e.g., omega^{2}) tend in turn to be larger than
the "corrected" estimates for a future sample (e.g., Herzberg, 1969; Lord,
1950). As Snyder and Lawson (1993) explained, "the reason why estimates for
future samples result in the most *shrinkage* is that these statistical
corrections must adjust for the sampling error present in *both* the given
present study and some future study" (p. 340, emphasis in original).

It should also be noted that variance-accounted-for effect
sizes can be negative, notwithstanding the fact that a squared-metric statistic
is being estimated. This was seen in some of the omega^{2} values
reported in Table 12. Dramatic amounts of shrinkage, especially to negative
variance-accounted-for values, suggest a somewhat dire research experience.
Thus, I was somewhat distressed to see a local dissertation in which
__R__^{2}=44.6% shrunk to 0.45%, and yet it was still claimed that
"it may be possible to generalize prediction in a referred population"
(Thompson, 1994a, p. 12).

__Factors that inflate sampling error variance__.
Understanding what design features generate sampling error variance can
facilitate more thoughtful design formulation, and thus has some value in its
own right. Sampling error variance is *greater* when:

(a) *sample size* is *smaller*;

(b) the number of *measured variables* is *greater*;
and

(c) the *population effect size* (i.e., parameter) is
*smaller*.

The deleterious effects of __small sample size__ are
obvious. When we sample, there is more likelihood of "flukie" characterizations
of the population with smaller samples, and the relative influence of anomalous
scores (i.e., outliers) is greater in smaller samples, at least if we use
"classical" as against "modern" statistics.

Table 14 illustrates these variations as a function of
different sample sizes for regression analyses each involving 3 predictor
variables and presumed population parameter __R__^{2} equal to 50%.
These results illustrate that the shrinkage due to sample size is not a
linear (i.e., constant-rate) function of sample size changes. For example,
when sample size changes from __n__=10 to __n__=20, the shrinkage changes
from 25.00% (__R__^{2}=50% - __R__^{2}*=25.00%) to 9.37%
(__R__^{2}=50% - __R__^{2}*=40.63%). But even more than
doubling sample size from __n__=20 to __n__=45 changes shrinkage only from
9.37% (__R__^{2}=50% - __R__^{2}*=40.63%) to 3.66%
(__R__^{2}=50% - __R__^{2}*=46.34%).

___________________________

INSERT TABLE 14 ABOUT HERE.

___________________________
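A minimal sketch reproducing the shrinkage figures just discussed, assuming Table 14 was built from the Ezekiel correction presented earlier (__v__ = 3 predictors, presumed population __R__^{2} of 50%).

```python
def shrinkage(r2, n, v):
    """R^2 minus the Ezekiel-adjusted R^2* (as a proportion)."""
    r2_star = 1.0 - ((n - 1) / (n - v - 1)) * (1.0 - r2)
    return r2 - r2_star

vals = {n: 100.0 * shrinkage(0.50, n, 3) for n in (10, 20, 45)}
for n, s in vals.items():
    print(n, f"{s:.3f}")
# n=10 -> 25.000, n=20 -> 9.375, n=45 -> 3.659 (percentage points)
```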

The influence of the __number of measured variables__ is
also fairly straightforward. The more variables we sample, the greater the
likelihood that an anomalous score will be incorporated in the sample data.

The common language describing a person as an "outlier" should
__not__ be erroneously interpreted to mean either (a) that a given person is
an outlier on all variables or (b) that a given score is an outlier as regards
all statistics (e.g., on the mean versus the correlation). For example, for the
following data, Amanda's score may be outlying as regards __M___{Y},
but not as regards __r___{XY} (which here equals +1; see Walsh,
1996).

Person   __X___{i}   __Y___{i}
Kevin        1           2
Jason        2           4
Sherry       3           6
Amanda      48          96
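A minimal sketch verifying the point: Amanda's scores pull the means sharply upward, yet __r__ remains exactly +1 because every pair falls on the line Y = 2X.

```python
import math

x = [1.0, 2.0, 3.0, 48.0]     # Kevin, Jason, Sherry, Amanda
y = [2.0, 4.0, 6.0, 96.0]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

m_y = sum(y) / len(y)
r = pearson_r(x, y)
print(m_y)                # 27.0 -- dominated by Amanda's 96
print(round(r, 6))        # 1.0
```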

Again, as reported in Table 14, the influence of the number of measured variables on shrinkage is not linear.

Less obvious is why the estimated __population parameter
effect size__ (i.e., the estimate based on the sample statistic) impacts
shrinkage. The easiest way to understand this is to conceptualize the population
for a Pearson product-moment study. Let's say the population squared correlation
is +1. In this instance, even ridiculously small samples of any 2 or 3 or 4
pairs of scores will invariably yield a sample __r__^{2} of 100% (as
long as both __X__ and __Y__ as sampled are variables, and therefore
__r__ is "defined," in that illegal division is not required by the formula
__r__ = __COV___{XY} / [__SD___{X} x
__SD___{Y}]).

Again as suggested by the Table 14 examples, the influence of
a larger estimated effect size on decreased shrinkage is also not linear. [Thus, the use of
a sample __r__=.779 in the Table 8 heuristic data for the bootstrap example
theoretically should have resulted in relatively little variation in sample
estimates across resamples.]

Indeed, these three influences on sampling error must be
considered as they simultaneously interact with each other. For example, as
suggested by the previous discussion, the influence of sample size is an
influence conditional on the estimated parameter effect size. Table 15
illustrates these interactions for examples all of which involve shrinkage of a
5% decrement downward from the original __R__^{2} value.

___________________________

INSERT TABLE 15 ABOUT HERE.

___________________________

__Pros and cons of the effect size classes__. It is not
clear that researchers should uniformly prefer one effect index over another, or
even one class of indices over the other. The standardized difference indices do
have one considerable advantage: they tend to be readily comparable across
studies because they are expressed "metric-free" (i.e., the division by
__SD__ removes the metric from the characterization).

However, variance-accounted-for effect sizes can be directly computed in all studies. Furthermore, the use of variance-accounted-for effect sizes has the considerable heuristic value of forcing researchers to recognize that all parametric methods are part of a single general linear model family (cf. Cohen, 1968; Knapp, 1978).

In any case, the two effect sizes can be re-expressed in terms
of each other. Cohen (1988, p. 22) provided a general table for this purpose. A
__d__ can also be converted to an __r__ using Cohen's (1988, p. 23)
formula #2.2.6:

__r__ = __d__ / [(__d__^{2} +
4)^{.5}]

= __0.8 __/ [(0.8^{2} + 4)^{.5}]

= 0.8 / [(0.64 + 4)^{.5}]

= 0.8 / [( 4.64 )^{.5}]

= 0.8 / 2.154

= __0.371 __.

An __r__ can be converted to a __d__ using Friedman's
(1968, p. 246) formula #6:

__d__ = [2 (__r__)] / [(1 -
__r__^{2})^{.5}]

= [2 (__ 0.371 __)] / [(1 -
0.371^{2})^{.5}]

= [2 (0.371)] / [(1 - 0.1376)^{.5}]

= [2 (0.371)] / (0.8624)^{.5}

= [2 (0.371)] / 0.9286

= 0.742 / 0.9286

= __0.799 __.
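The two conversions just worked by hand can be sketched directly; this assumes only the formulas quoted above (Cohen's #2.2.6 and Friedman's #6).

```python
import math

def d_to_r(d):
    """Cohen's (1988) formula #2.2.6: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4.0)

def r_to_d(r):
    """Friedman's (1968) formula #6: d = 2r / sqrt(1 - r^2)."""
    return (2.0 * r) / math.sqrt(1.0 - r ** 2)

r = d_to_r(0.8)
print(round(r, 3))           # 0.371, as in the worked example
print(round(r_to_d(r), 3))   # 0.8 -- the two formulas are exact inverses
```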

__Effect Size Interpretation__. Schmidt and Hunter (1997)
recently argued that "logic-based arguments [against statistical testing] seem
to have had only a limited impact... [perhaps due to] the virtual brainwashing
in significance testing that all of us have undergone" (pp. 38-39). They also
spoke of a "psychology of addiction to significance testing" (Schmidt &
Hunter, 1997, p. 49).

For too long researchers have used statistical significance
tests in an illusory atavistic escape from the responsibility for defending the
value of their results. Our __p__ values were implicitly invoked as the
universal coinage with which to argue result noteworthiness (and replicability).
But as I have previously noted,

Statistics can be employed to evaluate the probability of an
event. But importance is a question of human values, and math cannot be employed
as an atavistic escape (à la Fromm's __Escape from Freedom__) from the
existential human responsibility for making value judgments. If the computer
package did not ask you your values prior to its analysis, it could not have
considered your value system in calculating __p__'s, and so __p__'s cannot
be blithely used to infer the value of research results. (Thompson, 1993b, p.
365)

The problem is that the normative traditions of contemporary social science have not yet evolved to accommodate personal values explication as part of our work. As I have suggested elsewhere (Thompson, 1999a),

Normative practices for evaluating such [values] assertions will have to evolve. Research results should not be published merely because the individual researcher thinks the results are noteworthy. By the same token, editors should not quash research reports merely because they find explicated values unappealing. These resolutions will have to be formulated in a spirit of reasoned comity. (p. 175)

In his seminal book on power analysis, Cohen (1969, 1988, pp. 24-27) suggested values for what he judged to be "small," "medium," and "large" effect sizes:

Characterization   __d__   __r__^{2}
"small"             .2      1.0%
"medium"            .5      5.9%
"large"             .8     13.8%
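The tabled __r__^{2} values are not independent of the __d__ values; a minimal sketch shows they follow from the benchmark __d__ values via Cohen's conversion r = d / sqrt(d^2 + 4).

```python
import math

benchmarks = {"small": 0.2, "medium": 0.5, "large": 0.8}
r2_pct = {label: round(100.0 * (d / math.sqrt(d ** 2 + 4.0)) ** 2, 1)
          for label, d in benchmarks.items()}
print(r2_pct)    # {'small': 1.0, 'medium': 5.9, 'large': 13.8}
```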

Cohen (1988) was characterizing what he regarded as the typicality of effect sizes across the broad published literature of the social sciences. Indeed, some empirical studies suggest that Cohen's characterization of typicality is reasonably accurate (Glass, 1979; Olejnik, 1984).

However, as Cohen (1988) himself emphasized:

The terms "small," "medium," and "large" are relative, not only
to each other, but to the content area of behavioral science or even more
particularly to the specific content and research method being employed in any
given investigation... In the face of this relativity, there is a certain risk
inherent in offering conventional operational definitions... in as diverse a
field of inquiry as behavioral science... [This] common conventional frame of
reference... is recommended for use *only when no better basis for estimating
the ES index is available*. (p. 25, emphasis added)

If in evaluating effect size we apply Cohen's conventions (against his wishes) with the same rigidity with which we have traditionally applied the α=.05 statistical significance testing convention, we will merely be being stupid in a new metric.

In defending our subjective judgments that an effect size is
noteworthy in our personal value system, we must recognize that inherently any
two researchers with individual values differences may reach different
conclusions regarding the noteworthiness of the exact same effect even in the
same study. And, of course, the same effect size in two different inquiries may
differ radically in noteworthiness. Even small effects will be deemed
noteworthy, if they are replicable, when inquiry is conducted as regards highly
valued outcomes. Thus, Gage (1978) pointed out that even though the relationship
between cigarette smoking and lung cancer is relatively "small" (i.e.,
__r__^{2} = 1% to 2%):

Sometimes even very weak relationships can be important... [O]n the basis of such correlations, important public health policy has been made and millions of people have changed strong habits. (p. 21)

__Confidence Intervals for Effects__. It often is useful to
present confidence intervals for effect sizes. For example, a series of
confidence intervals across variables or studies can be conveyed in a concise
and powerful graphic. Such intervals might incorporate information regarding the
theoretical or the empirical (i.e., bootstrap) estimates of effect variability
across samples. However, as I have noted elsewhere,

If we mindlessly interpret a confidence interval with reference to whether the interval subsumes zero, we are doing little more than nil hypothesis statistical testing. But if we interpret the confidence intervals in our study in the context of the intervals in all related previous studies, the true population parameters will eventually be estimated across studies, even if our prior expectations regarding the parameters are wildly wrong (Schmidt, 1996). (Thompson, 1998b, p. 799)
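As one illustration of the empirical (i.e., bootstrap) approach to effect variability just mentioned, the following minimal sketch computes a percentile bootstrap confidence interval for an __r__^{2} effect size. The data set, the seed, and the choice of 1,000 resamples are all hypothetical.

```python
import random

def r_squared(pairs):
    """Squared Pearson correlation: cov^2 / (var_x * var_y)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov ** 2 / (vx * vy)

data = [(1, 2.1), (2, 3.9), (3, 6.3), (4, 7.8), (5, 10.4),
        (6, 11.7), (7, 14.2), (8, 16.1), (9, 17.6), (10, 20.3)]

random.seed(1999)
# Resample cases with replacement, recompute r^2 in each resample.
estimates = sorted(r_squared([random.choice(data) for _ in data])
                   for _ in range(1000))
lo, hi = estimates[25], estimates[974]   # 2.5th and 97.5th percentiles
print(round(lo, 3), round(hi, 3))        # empirical 95% interval for r^2
```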

Conditions Necessary (and Sufficient) for Change

Criticisms of conventional statistical significance are not new (cf. Berkson, 1938; Boring, 1919), though the publication of such criticisms does appear to be escalating at an exponential rate (Anderson et al., 1999). Nearly 40 years ago Rozeboom (1960) observed that "the perceptual defenses of psychologists [and other researchers, too] are particularly efficient when dealing with matters of methodology, and so the statistical folkways of a more primitive past continue to dominate the local scene" (p. 417).

Table 16 summarizes some of the features of contemporary practice, the problems associated with these practices, and potential improvements in practice. The implementation of these "modern" inquiry methods would result in the more thoughtful specification of research hypotheses. The design of studies with more statistical power and precision would be more likely, because power analyses would be based on more informed and realistic effect size estimates as an effect literature matured (Rossi, 1997).

___________________________

INSERT TABLE 16 ABOUT HERE.

___________________________

Emphasizing effect size reporting would eventually facilitate the development of theories that support more specific expectations. Universal effect size reporting would also facilitate improved meta-analyses, in which cumulated effects would not rest on so many strong assumptions that are probably infrequently met. Social science would finally become the business of identifying valuable effects that replicate under stated conditions; replication would no longer receive the hollow affection of the statistical significance test, and instead the replication of specific effects would be explicitly and directly addressed.

What are the conditions necessary and sufficient to persuade
researchers to pay less attention to the likelihood of sample statistics, based
on assumptions that "nil" null hypotheses are true in the population, and more
attention to (a) effect sizes and (b) evidence of effect replicability?
Certainly current doctoral curricula seem to have less and less space for
quantitative training (Aiken et al., 1990). And too much instruction teaches
analysis as the rote application of methods *sans* rationale (Thompson,
1998a). And many textbooks, too, are flawed (Carver, 1978; Cohen, 1994).

But improved textbooks will not alone provide the magic bullet leading to improved practice. The computation and interpretation of effect sizes are already emphasized in some texts (cf. Hays, 1981). For example, Loftus and Loftus (1982) in their book argued that "it is our judgment that accounting for variance is really much more meaningful than testing for [statistical] significance" (p. 499).

Editorial Policies

I believe that changes in journal editorial policies are the
necessary (and sufficient) conditions to move the field. As Sedlmeier and
Gigerenzer (1989) argued, "there is only one force that can effect a change, and
that is the same force that helped institutionalize null hypothesis testing as
the *sine qua non* for publication, namely, the editors of the major
journals" (p. 315). Glantz (1980) agreed, noting that "The journals are the
major force for quality control in scientific work" (p. 3). And as Kirk (1996)
argued, changing requirements in journal editorial policies as regards effect
size reporting "would cause a chain reaction: Statistics teachers would change
their courses, textbook authors would revise their statistics books, and journal
authors would modify their inference strategies" (p. 757).

Fortunately, some journal editors have elaborated policies "requiring" rather than merely "encouraging" (APA, 1994, p. 18) effect size reporting (cf. Heldref Foundation, 1997, pp. 95-96; Thompson, 1994b, p. 845). It is particularly noteworthy that editorial policies even at one APA journal now indicate that:

If an author decides not to present an effect size estimate along with the outcome of a significance test, I will ask the author to provide specific justification for why effect sizes are not reported. So far, I have not heard a good argument against presenting effect sizes. Therefore, unless there is a real impediment to doing so, you should routinely include effect size information in the papers you submit. (Murphy, 1997, p. 4)

Leadership from AERA

Professional disciplines, like glaciers, move slowly, but inexorably. The hallmark of a profession is standards of conduct. And, as Biesanz and Biesanz (1969) observed, "all members of the profession are considered colleagues, equals, who are expected to uphold the dignity and mystique of the profession in return for the protection of their colleagues" (p. 155). Especially in academic professions, there is some hesitance to change existing standards, or to impose more standards than seem necessary to realize common purposes.

As might be expected, given these considerations, in its long
history AERA has been reluctant to articulate standards for the conduct of
educational inquiry. Most such expectations have been articulated only in
conjunction with other organizations (e.g., AERA/APA/NCME, 1985). For example,
AERA participated with 15 other organizations in the Joint Committee on
Standards for Educational Evaluation's (1994) articulation of the program
evaluation standards. These were the first-ever American National Standards
Institute (ANSI)-approved standards for professional conduct. As ANSI-approved
standards, these represent *de facto* THE American standards for program
evaluation (cf. Sanders, 1994).

As Kaestle (1993) noted some years ago,

...[I]f education researchers could reverse their reputation for irrelevance, politicization, and disarray, however, they could rely on better support because most people, in the government and the public at large, believe that education is critically important. (pp. 30-31)

Some of the desirable movements of the field may be facilitated by the on-going work of the APA Task Force on Statistical Inference (Azar, 1997; Shea, 1996).

But AERA, too, could offer academic leadership. The children who are served by education need not wait for AERA to wait for APA to lead via continuing revisions of the APA publication manual. AERA, through the new Research Advisory Committee, and other AERA organs, might encourage the formulation of editorial policies that place less emphasis on statistical tests based on "nil" null hypotheses, and more emphasis on evaluating whether educational interventions and theories yield valued effect sizes that replicate under stated conditions.

It would be a gratifying experience to see our organization lead a movement of the social sciences. Offering credible academic leadership might be one way that educators could confront the "awful reputation" (Kaestle, 1993) ascribed to our research. As I argued 3 years ago, if education "studies inform best practice in classrooms and other educational settings, the stakeholders in these locations certainly deserve better treatment from the [educational] research community via our analytic choices" (p. 29).

References

Abelson, R.P. (1997). A retrospective on the significance test
ban of 1999 (If there were no significance tests, they would be invented). In
L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), __What if there were no
significance tests?__ (pp. 117-141). Mahwah, NJ: Erlbaum.

Aiken, L.S., West, S.G., Sechrest, L., Reno, R.R., with
Roediger, H.L., Scarr, S., Kazdin, A.E., & Sherman, S.J. (1990). The
training in statistics, methodology, and measurement in psychology. __American
Psychologist__, __45__, 721-734.

American Educational Research Association, American
Psychological Association, National Council on Measurement in Education. (1985).
__Standards for educational and psychological testing__. Washington, DC:
Author.

American Psychological Association. (1974). __Publication
manual of the American Psychological Association__ (2nd ed.). Washington, DC:
Author.

American Psychological Association. (1994). __Publication
manual of the American Psychological Association__ (4th ed.). Washington, DC:
Author.

Anderson, D.R., Burnham, K.P., & Thompson, W.L. (1999).
__Null hypothesis testing in ecological studies: Problems, prevalence, and an
alternative__. Manuscript submitted for publication.

Anonymous. (1998). [Untitled letter]. In G. Saxe & A.
Schoenfeld, Annual meeting 1999. __Educational Researcher__, __27__(5),
41.

*****Atkinson, D.R., Furlong, M.J., & Wampold, B.E. (1982).
Statistical significance, reviewer evaluations, and the scientific process: Is
there a (statistically) significant relationship? __Journal of Counseling
Psychology__, __29__, 189-194.

Atkinson, R.C., & Jackson, G.B. (Eds.). (1992). __Research
and education reform: Roles for the Office of Educational Research and
Improvement__. Washington, DC: National Academy of Sciences. (ERIC Document
Reproduction Service No. ED 343 961)

Azar, B. (1997). APA task force urges a harder look at data.
__The APA Monitor__, __28__(3), 26.

Bagozzi, R.P., Fornell, C., & Larcker, D.F. (1981).
Canonical correlation analysis as a special case of a structural relations
model. __Multivariate Behavioral Research__, __16__, 437-454.

Bakan, D. (1966). The test of significance in psychological
research. __Psychological Bulletin__, __66__, 423-437.

Barnette, J.J., & McLean, J.E. (1998, November).
__Protected versus unprotected multiple comparison procedures__. Paper
presented at the annual meeting of the Mid-South Educational Research
Association, New Orleans.

Berkson, J. (1938). Some difficulties of interpretation
encountered in the application of the chi-square test. __Journal of the
American Statistical Association__, __33__, 526-536.

Biesanz, J., & Biesanz, M. (1969). __Introduction to
sociology__. Englewood Cliffs, NJ: Prentice-Hall.

____________

References designated with asterisks are __empirical__
studies of research practices.

Borgen, F.H., & Seling, M.J. (1978). Uses of discriminant
analysis following MANOVA: Multivariate statistics for multivariate purposes.
__Journal of Applied Psychology__, __63__, 689-697.

Boring, E.G. (1919). Mathematical vs. scientific importance.
__Psychological Bulletin__, __16__, 335-338.

Breunig, N.A. (1995, November). __Understanding the sampling
distribution and its use in testing statistical significance__. Paper
presented at the annual meeting of the Mid-South Educational Research
Association, Biloxi, MS. (ERIC Document Reproduction Service No. ED 393 939)

Carver, R. (1978). The case against statistical significance
testing. __Harvard Educational Review__, __48__, 378-399.

Carver, R. (1993). The case against statistical significance
testing, revisited. __Journal of Experimental Education__, __61__,
287-292.

Cliff, N. (1987). __Analyzing multivariate data__. San
Diego: Harcourt Brace Jovanovich.

Cohen, J. (1968). Multiple regression as a general
data-analytic system. __Psychological Bulletin__, __70__, 426-443.

Cohen, J. (1969). __Statistical power analysis for the
behavioral sciences__. New York: Academic Press.

*****Cohen, J. (1979). Clinical psychologists' judgments of the
scientific merit and clinical relevance of psychotherapy outcome research.
__Journal of Consulting and Clinical Psychology__, __47__, 421-423.

Cohen, J. (1988). __Statistical power analysis for the
behavioral sciences__ (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1994). The earth is round (__p__ < .05).
__American Psychologist__, __49__, 997-1003.

Cortina, J.M., & Dunlap, W.P. (1997). Logic and purpose of
significance testing. __Psychological Methods__, __2__, 161-172.

Cronbach, L.J. (1957). The two disciplines of scientific
psychology. __American Psychologist__, __12__, 671-684.

Cronbach, L.J. (1975). Beyond the two disciplines of
psychology. __American Psychologist__, __30__, 116-127.

Davison, A.C., & Hinkley, D.V. (1997). __Bootstrap methods
and their applications__. Cambridge: Cambridge University Press.

Diaconis, P., & Efron, B. (1983). Computer-intensive
methods in statistics. __Scientific American__, __248__(5),
116-130.

*****Edgington, E.S. (1964). A tabulation of inferential
statistics used in psychology journals. __American Psychologist__, __19__,
202-203.

*****Edgington, E.S. (1974). A new tabulation of statistical
procedures used in APA journals. __American Psychologist__, __29__,
25-26.

Efron, B. (1979). Bootstrap methods: Another look at the
jackknife. __The Annals of Statistics__, __7__, 1-26.

Efron, B., & Tibshirani, R.J. (1993). __An introduction to
the bootstrap__. New York: Chapman and Hall.

Eisner, E.W. (1983). Anastasia might still be alive, but the
monarchy is dead. __Educational Researcher__, __12__(5), 13-14,
23-34.

*****Elmore, P.B., & Woehlke, P.L. (1988). Statistical
methods employed in __American Educational Research Journal__, __Educational
Researcher__, and __Review of Educational Research__ from 1978 to 1987.
__Educational Researcher__, __17__(9), 19-20.

*****Emmons, N.J., Stallings, W.M., & Layne, B.H. (1990,
April). __Statistical methods used in American Educational Research
Journal, Journal of Educational Psychology, and Sociology of
Education from 1972 through 1987__. Paper presented at the annual meeting
of the American Educational Research Association, Boston, MA. (ERIC Document
Reproduction Service No. ED 319 797)

Ezekiel, M. (1930). __Methods of correlational analysis__.
New York: Wiley.

Fan, X. (1996). Canonical correlation analysis as a general
analytic model. In B. Thompson (Ed.), __Advances in social science
methodology__ (Vol. 4, pp. 71-94). Greenwich, CT: JAI Press.

Fan, X. (1997). Canonical correlation analysis and structural
equation modeling: What do they have in common? __Structural Equation
Modeling__, __4__, 65-79.

Fetterman, D.M. (1982). Ethnography in educational research:
The dynamics of diffusion. __Educational Researcher__, __11__(3), 17-22,
29.

Fish, L.J. (1988). Why multivariate methods are usually vital.
__Measurement and Evaluation in Counseling and Development__, __21__,
130-137.

Fisher, R.A. (1936). The use of multiple measurements in
taxonomic problems. __Annals of Eugenics__, __7__, 179-188.

Frick, R.W. (1996). The appropriate use of null hypothesis
testing. __Psychological Methods__, __1__, 379-390.

Friedman, H. (1968). Magnitude of experimental effect and a
table for its rapid estimation. __Psychological Bulletin__, __70__,
245-251.

Gage, N.L. (1978). __The scientific basis of the art of
teaching__. New York: Teachers College Press.

Gage, N.L. (1985). __Hard gains in the soft sciences: The case
of pedagogy__. Bloomington, IN: Phi Delta Kappa Center on Evaluation,
Development, and Research.

Gall, M.D., Borg, W.R., & Gall, J.P. (1996). __Educational
research: An introduction__ (6th ed.). White Plains, NY: Longman.

Glantz, S.A. (1980). Biostatistics: How to detect, correct and
prevent errors in the medical literature. __Circulation__, __61__,
1-7.

Glass, G.V (1976). Primary, secondary, and meta-analysis of
research. __Educational Researcher__, __5__(10), 3-8.

*****Glass, G.V (1979). Policy for the unpredictable
(uncertainty research and policy). __Educational Researcher__, __8__(9),
12-14.

*****Goodwin, L.D., & Goodwin, W.L. (1985). Statistical
techniques in __AERJ__ articles, 1979-1983: The preparation of graduate
students to read the educational research literature. __Educational
Researcher__, __14__(2), 5-11.

*****Greenwald, A. (1975). Consequences of prejudice against
the null hypothesis. __Psychological Bulletin__, __82__, 1-20.

Grimm, L.G., & Yarnold, P.R. (Eds.). (1995). __Reading and
understanding multivariate statistics__. Washington, DC: American
Psychological Association.

*****Hall, B.W., Ward, A.W., & Comer, C.B. (1988).
Published educational research: An empirical study of its quality. __Journal of
Educational Research__, __81__, 182-189.

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997).
__What if there were no significance tests?__. Mahwah, NJ: Erlbaum.

Hays, W. L. (1981). __Statistics__ (3rd ed.). New York:
Holt, Rinehart and Winston.

Hedges, L.V. (1981). Distribution theory for Glass's estimator
of effect sizes and related estimators. __Journal of Educational
Statistics__, __6__, 107-128.

Heldref Foundation. (1997). Guidelines for contributors.
__Journal of Experimental Education__, __65__, 95-96.

Henard, D.H. (1998, January). __Suppressor variable effects:
Toward understanding an elusive data dynamic__. Paper presented at the annual
meeting of the Southwest Educational Research Association, Houston. (ERIC
Document Reproduction Service No. ED 416 215)

Herzberg, P.A. (1969). The parameters of cross-validation.
__Psychometrika Monograph Supplement__, __16__, 1-67.

Hinkle, D.E., Wiersma, W., & Jurs, S.G. (1998). __Applied
statistics for the behavioral sciences__ (4th ed.). Boston: Houghton
Mifflin.

Horst, P. (1966). __Psychological measurement and
prediction__. Belmont, CA: Wadsworth.

Huberty, C.J (1994). __Applied discriminant analysis__. New
York: Wiley and Sons.

Huberty, C.J, & Barton, R. (1989). An introduction to
discriminant analysis. __Measurement and Evaluation in Counseling and
Development__, __22__, 158-168.

Huberty, C.J, & Morris, J.D. (1988). A single contrast test
procedure. __Educational and Psychological Measurement__, __48__,
567-578.

Huberty, C.J, & Pike, C.J. (in press). On some history
regarding statistical testing. In B. Thompson (Ed.), __Advances in social
science methodology__ (Vol. 5). Stamford, CT: JAI Press.

Humphreys, L.G. (1978). Doing research the hard way:
Substituting analysis of variance for a problem in correlational analysis.
__Journal of Educational Psychology__, __70__, 873-876.

Humphreys, L.G., & Fleishman, A. (1974). Pseudo-orthogonal
and other analysis of variance designs involving individual-differences
variables. __Journal of Educational Psychology__, __66__, 464-472.

Hunter, J.E., & Schmidt, F.L. (1990). __Methods of
meta-analysis: Correcting error and bias in research findings__. Newbury Park,
CA: Sage.

Joint Committee on Standards for Educational Evaluation.
(1994). __The program evaluation standards: How to assess evaluations of
educational programs__ (2nd ed.). Newbury Park, CA: SAGE.

Kaestle, C.F. (1993). The awful reputation of education
research. __Educational Researcher__, __22__(1), 23, 26-31.

Kaiser, H.F. (1976). Review of *Factor analysis as a
statistical method*. __Educational and Psychological Measurement__,
__36__, 586-589.

Kerlinger, F. N. (1986). __Foundations of behavioral
research__ (3rd ed.). New York: Holt, Rinehart and Winston.

Kerlinger, F. N., & Pedhazur, E. J. (1973). __Multiple
regression in behavioral research__. New York: Holt, Rinehart and
Winston.

*****Keselman, H.J., Huberty, C.J, Lix, L.M., Olejnik, S.,
Cribbie, R., Donahue, B., Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D.,
Keselman, J.C., & Levin, J.R. (1998). Statistical practices of educational
researchers: An analysis of their ANOVA, MANOVA and ANCOVA analyses. __Review
of Educational Research__, __68__, 350-386.

Keselman, H.J., Kowalchuk, R.K., & Lix, L.M. (1998). Robust
nonorthogonal analyses revisited: An update based on trimmed means.
__Psychometrika__, __63__, 145-163.

Keselman, H.J., Lix, L.M., & Kowalchuk, R.K. (1998).
Multiple comparison procedures for trimmed means. __Psychological Methods__,
__3__, 123-141.

*****Kirk, R. (1996). Practical significance: A concept whose
time has come. __Educational and Psychological Measurement__, __56__,
746-759.

Knapp, T. R. (1978). Canonical correlation analysis: A general
parametric significance testing system. __Psychological Bulletin__,
__85__, 410-416.

Kromrey, J.D., & Hines, C.V. (1996). Estimating the
coefficient of cross-validity in multiple regression: A comparison of analytical
and empirical methods. __Journal of Experimental Education__, __64__,
240-266.

Lancaster, B.P. (in press). Defining and interpreting
suppressor effects: Advantages and limitations. In B. Thompson (Ed.),
__Advances in social science methodology__ (Vol. 5). Stamford, CT: JAI
Press.

*****Lance, T., & Vacha-Haase, T. (1998, August). __The
Counseling Psychologist: Trends and usages of statistical significance
testing__. Paper presented at the annual meeting of the American Psychological
Association, San Francisco.

Levin, J.R. (1998). To test or not to test H_{0}?
__Educational and Psychological Measurement__, __58__, 311-331.

Loftus, G.R., & Loftus, E.F. (1982). __Essence of
statistics__. Monterey, CA: Brooks/Cole.

Lord, F.M. (1950). __Efficiency of prediction when a
regression equation from one sample is used in a new sample__ (Research
Bulletin 50-110). Princeton, NJ: Educational Testing Service.

Ludbrook, J., & Dudley, H. (1998). Why permutation tests
are superior to __t__ and __F__ tests in medical research. __The American
Statistician__, __52__, 127-132.

Lunneborg, C.E. (1987). __Bootstrap applications for the
behavioral sciences__. Seattle: University of Washington.

Lunneborg, C.E. (1999). __Data analysis by resampling:
Concepts and applications__. Pacific Grove, CA: Duxbury.

Manly, B.F.J. (1994). __Randomization and Monte Carlo methods
in biology__ (2nd ed.). London: Chapman and Hall.

Meehl, P.E. (1978). Theoretical risks and tabular asterisks:
Sir Karl, Sir Ronald, and the slow progress of soft psychology. __Journal of
Consulting and Clinical Psychology__, __46__, 806-834.

Mittag, K. (1992, January). __Correcting for systematic bias
in sample estimates of population variances: Why do we divide by n-1?__. Paper
presented at the annual meeting of the Southwest Educational Research
Association, Houston, TX. (ERIC Document Reproduction Service No. ED 341
728)

*****Mittag, K.G. (1999, April). __A national survey of AERA
members' perceptions of the nature and meaning of statistical significance
tests__. Paper presented at the annual meeting of the American Educational
Research Association, Montreal.

Murphy, K.R. (1997). Editorial. __Journal of Applied
Psychology__, __82__, 3-5.

*****Nelson, N., Rosenthal, R., & Rosnow, R.L. (1986).
Interpretation of significance levels and effect sizes by psychological
researchers. __American Psychologist__, __41__, 1299-1301.

*****Nilsson, J., & Vacha-Haase, T. (1998, August). __A
review of statistical significance reporting in the Journal of Counseling
Psychology__. Paper presented at the annual meeting of the American
Psychological Association, San Francisco.

*****Oakes, M. (1986). __Statistical inference: A commentary
for the social and behavioral sciences__. New York: Wiley.

Olejnik, S.F. (1984). Planning educational research:
Determining the necessary sample size. __Journal of Experimental Education__,
__53__, 40-48.

Pedhazur, E. J. (1982). __Multiple regression in behavioral
research: Explanation and prediction__ (2nd ed.). New York: Holt, Rinehart and
Winston.

Pedhazur, E. J., & Schmelkin, L. P. (1991). __Measurement,
design, and analysis: An integrated approach__. Hillsdale, NJ: Erlbaum.

*****Reetz, D., & Vacha-Haase, T. (1998, August). __Trends
and usages of statistical significance testing in adult development and aging
research: A review of Psychology and Aging__. Paper presented at the
annual meeting of the American Psychological Association, San Francisco.

Reinhardt, B. (1996). Factors affecting coefficient alpha: A
mini Monte Carlo study. In B. Thompson (Ed.), __Advances in social science
methodology__ (Vol. 4, pp. 3-20). Greenwich, CT: JAI Press.

Rennie, K.M. (1997, January). __Understanding the sampling
distribution: Why we divide by n-1 to estimate the population variance__.
Paper presented at the annual meeting of the Southwest Educational Research
Association, Austin. (ERIC Document Reproduction Service No. ED 406 442)

Rokeach, M. (1973). __The nature of human values__. New
York: Free Press.

Rosenthal, R. (1979). The "file drawer problem" and tolerance
for null results. __Psychological Bulletin__, __86__, 638-641.

Rosenthal, R. (1991). __Meta-analytic procedures for social
research__ (rev. ed.). Newbury Park, CA: Sage.

Rosenthal, R. (1994). Parametric measures of effect size. In H.
Cooper & L.V. Hedges (Eds.), __The handbook of research synthesis__ (pp.
231-244). New York: Russell Sage Foundation.

*****Rosenthal, R., & Gaito, J. (1963). The interpretation
of level of significance by psychological researchers. __Journal of
Psychology__, __55__, 33-38.

Rosnow, R.L., & Rosenthal, R. (1989). Statistical
procedures and the justification of knowledge in psychological science.
__American Psychologist__, __44__, 1276-1284.

Rossi, J.S. (1997). A case study in the failure of psychology
as a cumulative science: The spontaneous recovery of verbal learning. In L.L.
Harlow, S.A. Mulaik & J.H. Steiger (Eds.), __What if there were no
significance tests?__ (pp. 176-197). Mahwah, NJ: Erlbaum.

Rozeboom, W.W. (1960). The fallacy of the null hypothesis
significance test. __Psychological Bulletin__, __57__, 416-428.

Rozeboom, W.W. (1997). Good science is abductive, not
hypothetico-deductive. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.),
__What if there were no significance tests?__ (pp. 335-392). Mahwah, NJ:
Erlbaum.

Sanders, J.R. (1994). The process of developing national
standards that meet ANSI guidelines. __Journal of Experimental Education__,
__63__, 5-12.

Saxe, G., & Schoenfeld, A. (1998). Annual meeting 1999.
__Educational Researcher__, __27__(5), 41.

Schmidt, F.L. (1996). Statistical significance testing and
cumulative knowledge in psychology: Implications for the training of
researchers. __Psychological Methods__, __1__, 115-129.

Schmidt, F.L., & Hunter, J.E. (1997). Eight common but
false objections to the discontinuation of significance testing in the analysis
of research data. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), __What
if there were no significance tests?__ (pp. 37-64). Mahwah, NJ: Erlbaum.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of
statistical power have an effect on the power of studies? __Psychological
Bulletin__, __105__, 309-316.

Shaver, J. (1985). Chance and nonsense. __Phi Delta
Kappan__, __67__(1), 57-60.

Shaver, J. (1993). What statistical significance testing is,
and what it is not. __Journal of Experimental Education__, __61__,
293-316.

Shea, C. (1996). Psychologists debate accuracy of "significance
test." __Chronicle of Higher Education__, __42__(49), A12, A16.

Snyder, P., & Lawson, S. (1993). Evaluating results using
corrected and uncorrected effect size estimates. __Journal of Experimental
Education__, __61__, 334-349.

*****Snyder, P.A., & Thompson, B. (1998). Use of tests of
statistical significance and other analytic choices in a school psychology
journal: Review of practices and suggested alternatives. __School Psychology
Quarterly__, __13__, 335-348.

Sprent, P. (1998). __Data driven statistical methods__.
London: Chapman and Hall.

Tatsuoka, M.M. (1973a). __An examination of the statistical
properties of a multivariate measure of strength of relationship__. Urbana:
University of Illinois. (ERIC Document Reproduction Service No. ED 099 406)

Tatsuoka, M.M. (1973b). Multivariate analysis in educational
research. In F. N. Kerlinger (Ed.), __Review of research in education__ (pp.
273-319). Itasca, IL: Peacock.

Thompson, B. (1984). __Canonical correlation analysis: Uses
and interpretation__. Newbury Park, CA: Sage.

Thompson, B. (1985). Alternate methods for analyzing data from
experiments. __Journal of Experimental Education__, __54__, 50-55.

Thompson, B. (1986a). ANOVA versus regression analysis of ATI
designs: An empirical investigation. __Educational and Psychological
Measurement__, __46__, 917-928.

Thompson, B. (1986b, November). __Two reasons why multivariate
methods are usually vital__. Paper presented at the annual meeting of the
Mid-South Educational Research Association, Memphis.

Thompson, B. (1988a, November). __Common methodology mistakes
in dissertations: Improving dissertation quality__. Paper presented at the
annual meeting of the Mid-South Educational Research Association, Louisville,
KY. (ERIC Document Reproduction Service No. ED 301 595)

Thompson, B. (1988b). Program FACSTRAP: A program that computes
bootstrap estimates of factor structure. __Educational and Psychological
Measurement__, __48__, 681-686.

Thompson, B. (1991). A primer on the logic and use of canonical
correlation analysis. __Measurement and Evaluation in Counseling and
Development__, __24__, 80-95.

Thompson, B. (1992a). DISCSTRA: A computer program that
computes bootstrap resampling estimates of descriptive discriminant analysis
function and structure coefficients and group centroids. __Educational and
Psychological Measurement__, __52__, 905-911.

Thompson, B. (1992b, April). __Interpreting regression
results: beta weights and structure coefficients are both important__. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco. (ERIC Document Reproduction Service No. ED 344
897)

Thompson, B. (1992c). Two and one-half decades of leadership in
measurement and evaluation. __Journal of Counseling and Development__,
__70__, 434-438.

Thompson, B. (1993a, April). __The General Linear Model (as
opposed to the classical ordinary sums of squares) approach to analysis of
variance should be taught in introductory statistical methods classes__. Paper
presented at the annual meeting of the American Educational Research
Association, Atlanta. (ERIC Document Reproduction Service No. ED 358 134)

Thompson, B. (1993b). The use of statistical significance tests
in research: Bootstrap and other alternatives. __Journal of Experimental
Education__, __61__, 361-377.

Thompson, B. (1994a, April). __Common methodology mistakes in
dissertations, revisited__. Paper presented at the annual meeting of the
American Educational Research Association, New Orleans. (ERIC Document
Reproduction Service No. ED 368 771)

Thompson, B. (1994b). Guidelines for authors. __Educational
and Psychological Measurement__, __54__(4), 837-847.

Thompson, B. (1994c). Planned versus unplanned and orthogonal
versus nonorthogonal contrasts: The neo-classical perspective. In B. Thompson
(Ed.), __Advances in social science methodology__ (Vol. 3, pp. 3-27).
Greenwich, CT: JAI Press.

Thompson, B. (1994d, February). __Why multivariate methods are
usually vital in research: Some basic concepts__. Paper presented as a
Featured Speaker at the biennial meeting of the Southwestern Society for
Research in Human Development (SWSRHD), Austin, TX. (ERIC Document Reproduction
Service No. ED 367 687)

Thompson, B. (1995a). Exploring the replicability of a study's
results: Bootstrap statistics for the multivariate case. __Educational and
Psychological Measurement__, __55__, 84-94.

Thompson, B. (1995b). Review of *Applied discriminant
analysis* by C.J Huberty. __Educational and Psychological Measurement__,
__55__, 340-350.

Thompson, B. (1996). AERA editorial policies regarding
statistical significance testing: Three suggested reforms. __Educational
Researcher__, __25__(2), 26-30.

Thompson, B. (1997a). Editorial policies regarding statistical
significance tests: Further comments. __Educational Researcher__,
__26__(5), 29-32.

Thompson, B. (1997b). The importance of structure coefficients
in structural equation modeling confirmatory factor analysis. __Educational and
Psychological Measurement__, __57__, 5-19.

Thompson, B. (1998a, April). __Five methodology errors in
educational research: The pantheon of statistical significance and other faux
pas__. Invited address presented at the annual meeting of the American
Educational Research Association, San Diego. (ERIC Document Reproduction Service
No. ED 419 023) [also available on the Internet through URL:
"**index.htm**"]

Thompson, B. (1998b). In praise of brilliance: Where that
praise really belongs. __American Psychologist__, __53__, 799-800.

Thompson, B. (1998c). Review of *What if there were no
significance tests?* by L. Harlow, S. Mulaik & J. Steiger (Eds.).
__Educational and Psychological Measurement__, __58__, 332-344.