aeraaddr.wp1 4/3/98
Five Methodology Errors in Educational Research:
The Pantheon of
Statistical Significance and Other Faux Pas
Bruce Thompson
Texas A&M University 77843-4225
and
Baylor College of Medicine
____________
Invited address (Divisions E, D, and C) presented at the annual
meeting (session #25.66) of the American Educational Research Association, San
Diego, April 15, 1998. The assistance of Xitao Fan, Utah State University, in
running the LISREL structural equation modeling program as the general linear
model, is appreciated. The author may be contacted through Internet
URL:
http:www.coe.tamu.edu/~bthompson.
ABSTRACT
After presenting a general linear model as a framework for discussion, the present paper reviews five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. In several cases small heuristic discriminant analysis data sets are presented to make more concrete and accessible the discussion of each of these five methodology errors.
A well-known popular cliche holds that a chain is only as strong as its weakest link. So, too, a research study will be at least partially compromised by whatever is the weakest link in the sequence of activities that cumulate in a completed investigation. Too often the weakest link in contemporary quantitative educational research involves the methodologies of statistical analysis.
There is no question that educational research, whatever its methodological and other limits, has influenced and informed educational practice (cf. Gage, 1985; Travers, 1983). But there seems to be some consensus that "too much of what we see in print is seriously flawed" as regards research methods, and that "much of the work in print ought not to be there" (Tuckman, 1990, p. 22). Gall, Borg and Gall (1996) concurred, noting that "the quality of published studies in education and related disciplines is, unfortunately, not high" (p. 151).
Empirical studies of published research involving methodology experts as judges corroborate these holistic impressions. For example, Hall, Ward and Comer (1988) and Ward, Hall and Schramm (1975) found that over 40% and over 60%, respectively, of published research was judged by methods experts as being seriously or completely flawed. Wandt (1967) and Vockell and Asher (1974) reported similar results from their empirical studies of the quality of published research. Dissertations, too, have been examined, and too often have been found methodologically wanting (cf. Thompson, 1988a, 1994a).
Of course, it must be acknowledged that even a methodologically flawed study may still contribute something to our understanding of educational phenomena. As Glass (1979) noted, "Our research literature in education is not of the highest quality, but I suspect that it is good enough on most topics" (p. 12).
But the problem with methodologically flawed studies is that these methodological flaws are entirely gratuitous. There is no upside to conducting incorrect statistical analyses. Usually a more thoughtful analysis is not appreciably more demanding in time or expertise than is a compromised choice. Rather, incorrect analyses arise from doctoral methodology instruction that teaches research methods as series of rotely-followed routines, as against thoughtful elements of a reflective enterprise; from doctoral curricula that seemingly have less and less room for quantitative statistics and measurement content, even while our knowledge base in these areas is burgeoning (Aiken, West, Sechrest, Reno, with Roediger, Scarr, Kazdin & Sherman, 1990; Pedhazur & Schmelkin, 1991, pp. 2-3); and, in some cases, from an unfortunate atavistic impulse to somehow escape responsibility for analytic decisions by justifying choices, sans rationale, solely on the basis that the choices are common or traditional.
Purpose of the Paper
The purpose of the present paper is to review five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. These comments are not new to the literature, or even to my own writing. But the field has seemingly remained somewhat recalcitrant in reflecting evolution as regards these methodological issues.
The paper presents a conceptual overview of each concern. In several cases small heuristic data sets are presented to make more concrete and accessible the discussion of each of these five methodology errors. Because, as will be shown, all parametric methods are part of one general linear model (GLM) family, methodology dynamics illustrated for one heuristic example generalize to other related cases. In the present paper, discriminant analysis examples are consistently (but arbitrarily) employed as heuristics. Nevertheless, the illustrations necessarily generalize to other analyses within the GLM family.
Delimitation
Of course, methodological errors other than these five might have been cited. For example, empirical studies (Emmons, Stallings & Layne, 1990) show that, "In the last 20 years, the use of multivariate statistics has become commonplace" (Grimm & Yarnold, 1995, p. vii), probably for very good reasons (Fish, 1988; Thompson, 1984, 1994e). Many such studies employ MANOVA (all to the good), but an unfortunate number of these studies then use ANOVA methods post hoc to explore detected multivariate effects (all to the bad) (Borgen & Seling, 1978). As I have noted elsewhere,
The multivariate analysis evaluates multivariate synthetic variables, while the univariate analysis only considers univariate latent variables. Thus, univariate post hoc tests do not inform the researcher about the differences in the multivariate latent variables actually analyzed in the multivariate analysis... It is illogical to first declare interest in a multivariate omnibus system of variables, and to then explore detected effects in this multivariate world by conducting non-multivariate tests! (Thompson, 1994e, p. 14, emphasis in original)
Similarly, all too often researchers erroneously interpret the eigenvalues in factor analysis as reflecting the variance contained in the individual factors after rotation (Thompson & Daniel, 1996a). Or the discarding of variance in order to conduct ANOVA (cf. Thompson, 1985) or incorrect use of ANCOVA (Thompson, 1992b) might have been discussed. However, space precludes discussion here of all possible common methodology errors; the present discussion necessarily must be delimited in some manner.
Premise Regarding Movement in Fields
In considering these five methodology errors, it may be important for each of us to remember that, over the course of careers, fields, including the methodology-related fields, do move. Invariably, those of us in the late stages of our careers will confront the realization that some methodology choices in our own work, published decades earlier, no longer reflect standards of present best practice, or might even now be deemed fully inappropriate. Responsible scholars must remain open, and be willing to engage in continual reflection as to whether our own personal analytic traditions remain viable.
Some have suggested that resistance to adopting revised methodological practice may in some cases be an artifact of denial, cognitive dissonance, and other classical psychological dynamics (Thompson, in press-d). For example, Schmidt and Hunter (1997) noted that "changing the beliefs and practices of a lifetime... naturally... provokes resistance" (Schmidt & Hunter, 1997, p. 49). Similarly, Rozeboom (1960) observed that "the perceptual defenses of psychologists are particularly efficient when dealing with matters of methodology, and so the statistical folkways of a more primitive past continue to dominate the local scene" (p. 417).
Recognizing the reality that fields move, and that to be fair works must be evaluated primarily against the methodological standards contemporary at the time of a given report, may facilitate helpful change. Prior to advocating selected changes, however, the general linear model (GLM) will be briefly described so as to provide a unifying conceptual framework for the remaining discussion. Structural equation modeling (SEM) will be presented as the most general case of the general linear model (GLM).
In one of his innumerable seminal contributions, the late Jacob "Jack" Cohen (1968) demonstrated that multiple regression subsumes all the univariate parametric methods as special cases, and thus provides a univariate general linear model that can be employed in all univariate analyses. Ten years later, in an equally important article Knapp (1978) presented the mathematical theory showing that canonical correlation analysis subsumes all the parametric analyses, both univariate and multivariate, as special cases. More concrete demonstrations of these relationships have also been offered (Fan, 1996; Thompson, 1984, 1991, in press-a). Both the Cohen (1968) and the Knapp (1978) articles were cited within a compilation of the most noteworthy methodology articles published during the last 50 years (Thompson & Daniel, 1996b).
However, structural equation modeling (SEM) represents an even bigger conceptual tent subsuming more restrictive methods (Bagozzi, 1981). Instructive illustrations of these relationships have been offered by Fan (1997). Prior to extracting the conceptual implications of the realization that a general linear model underlies all parametric analyses, a concrete demonstration that SEM is a general linear model subsuming canonical correlation analysis (CCA) (and its multivariate and univariate special cases) may be useful.
Heuristic Illustration that SEM Subsumes CCA
The illustration that SEM is a general linear model subsuming canonical correlation analysis (and its multivariate and univariate special cases) employs scores on seven variables (i.e., two in one set, and three in the other set) from the 301 cases in the Holzinger and Swineford (1939, pp. 81-91) data. These scores on ability batteries have classically been used as examples in both popular textbooks (Gorsuch, 1983, passim) and computer program manuals (Jöreskog & Sörbom, 1989, pp. 97-104), and thus are familiar to many readers.
Table 1 presents the bivariate correlation matrix for these data. As in all parametric analyses, a correlation or covariance matrix is the basis for all analyses; this matrix is partitioned into quadrants (see Table 1) honoring the variables' membership in criterion or predictor sets, and is then subjected to a principal components analysis (Thompson, 1984, in press-a).
__________________________
INSERT TABLE 1 ABOUT HERE.
__________________________
Appendix A presents the SPSS/LISREL computer program used to analyze the data. Table 2 presents the SPSS canonical correlation analysis of these same data.
__________________________
INSERT TABLE 2 ABOUT HERE.
__________________________
Table 3 presents the relevant portions of the LISREL analysis of the canonical correlation model for these data. The LISREL coefficients for the "gamma" matrix exactly match (within rounding error) the SPSS canonical function coefficients presented in Table 2. The only exception is that all the signs for the SEM second canonical function coefficients must be "reflected." "Reflecting" a function (changing all the signs on a given function, factor, or equation) is always permissible, because the scaling of psychological constructs is arbitrary. Thus, the SEM and the canonical analysis derived the same results. Since SEM can be employed to test a CCA model, SEM is an even more general case of the general linear model, quod erat demonstrandum.
__________________________
INSERT TABLE 3 ABOUT HERE.
__________________________
Heuristic Implications
There are a number of implications that can be drawn from the realization that a general linear model subsumes other methods as special cases. Specifically, all classical parametric methods are least squares procedures that implicitly or explicitly (a) use least squares weights (e.g., regression beta weights, standardized canonical function coefficients) to optimize explained variance and minimize model error variance, (b) focus on latent synthetic variables (e.g., the regression Y^ variable) created by applying the weights (e.g., beta weights) to scores on measured/observed variables (e.g., regression predictor variables), and (c) yield variance-accounted-for effect sizes analogous to r2 (e.g., R2, eta2, omega2). Thus, all classical analytic methods are correlational (Knapp, 1978; Thompson, 1988a).
Designs may be experimental or correlational, but all analyses are correlational. Thus, an effect size analogous to r2 can be computed in any parametric analysis (see Snyder and Lawson (1993), or Kirk (1996)).
The fact that all classical parametric methods use weights to then compute synthetic/latent variables by applying the weights to the measured/observed variables is obscured by the fact that most computer packages do not print the least squares weights that are actually invoked in ANOVA, for example, or when t-tests are conducted. Thus, some researchers unconsciously presume that such methods do not invoke optimal weighting systems.
The fact that all classical parametric methods use weights to then compute synthetic/latent variables by applying the weights to the measured/observed variables is also obscured by the inherently confusing language of statistics. As I have noted elsewhere, the weights in different analyses
...are all analogous, but are given different names in different analyses (e.g., beta weights in regression, pattern coefficients in factor analysis, discriminant function coefficients in discriminant analysis, and canonical function coefficients in canonical correlation analysis), mainly to obfuscate the commonalities of [all] parametric methods, and to confuse graduate students. (Thompson, 1992a, pp. 906-907)
If all standardized weights across analytic methods were called by the same name (e.g., beta weights), then researchers might (correctly) conclude that all analyses are part of the same general linear model.
Indeed, both the weight systems (e.g., regression equation, factor) and the synthetic variables (e.g., the regression Y^ variable) are also arbitrarily given different names across the analyses, again mainly so as to confuse the graduate students. Table 4 summarizes some of the elements of the very effective conspiracy.
__________________________
INSERT TABLE 4 ABOUT HERE.
__________________________
The present paper will employ this general linear model as a unifying conceptual framework for some of the arguments made herein. However, prior to presenting these views, a brief digression is required.
Predictive Discriminant Analysis (PDA) as a Hybrid GLM Offshoot
In the seminal work on discriminant analysis, Huberty (1994; see also Huberty and Barton (1989) and Huberty and Wisenbaker (1992)) thoughtfully distinguished two major applications: descriptive discriminant analysis (DDA) and predictive discriminant analysis (PDA). Put simply, DDA describes the differences on intervally-scaled "response" variables associated with a nominally-scaled variable, membership in different groups. PDA, on the other hand, uses intervally-scaled "response" variables to predict membership in different groups. Thus, the purpose of the analysis distinguishes the two methods (and these purposes subsequently determine which aspects of the results are relevant or irrelevant).
The drawing of a distinction between DDA and PDA is not mere statistical nit-picking. Instead, the relevant aspects of DDA and PDA results are completely different. For example, in PDA the "hit rate" (and which response variables most contribute to the hit rate) is the sina qua non of the analysis, while the weights are generally irrelevant as regards result interpretation. In DDA, on the other hand, the weights and the "structure" of the synthetic/latent variable scores are very important to interpretation, but the concept of hit rate becomes irrelevant.
The number of systems of weights (i.e., "functions," or "rules") also differs across DDA and PDA. In DDA, the number of linear discriminant functions (LDFs) is the number of groups minus one, or the number of response variables, whichever is smaller. In PDA, the number of linear classification functions (LCFs) is the number of groups. For example, with two groups and three response variables, in DDA there would be one LDF (and an associated set of scores on the synthetic variable, the discriminant scores). In the same case, in PDA there would be two LDFs (and associated sets of scores on the synthetic variables, the classification scores).
PDA is a hybrid offshoot of the general linear model, while DDA resides fully within the GLM nuclear family. Thus, the conclusions reached here based on GLM concepts may not apply to the PDA case.
When More Variables Can Hurt Study Effects
One powerful demonstration of PDA versus DDA dynamics involves a paradox. In any GLM analysis, more variables (e.g., more regression predictors) always lead to effect sizes (e.g., R2) that are equal to or greater than the effects associated with fewer variables. However, in PDA, more response variables can actually hurt the PDA hit rate.
The Table 5 data, drawn from the Holzinger and Swineford (1939) data described previously, can be analyzed to illustrate these dynamics. The Appendix B SPSS program conducts the relevant analyses.
__________________________
INSERT TABLE 5 ABOUT HERE.
__________________________
Table 6 presents the hit rates derived using three response variables as predictors using both LDF and LCF scores; these hit rates are both 66.4% ([40 + 31] / 107). [Normally only LCFs are used for classification purposes, even though SPSS incorrectly uses LDF scores for this purposes (Huberty & Lowman, 1997)]. Table 6 also presents the hit rates derived using four response variables as predictors using both LDF and LCF scores; these hit rates are both 63.6% ([38 + 30] / 107). Figure 1 presents the corresponding results in graphic form.
_______________________________________
INSERT TABLE 6 AND FIGURE 1 ABOUT HERE.
_______________________________________
Indeed, the hit rate differences with the use of three versus four response variables is even greater than the apparent difference of 71 versus 68 people, respectively, being correctly classified. In fact, as noted in Table 7, 9 persons were classified differently across the analyses using three versus four response variables, even though the net impact of using more predictors was a net loss in predictive accuracy of three hits. [If the same data were treated as reflecting a DDA case, the Wilks lambda effect size would be the same or better (i.e., a smaller lambda value) for four (0.8050684) as against three (0.8094909) response variables, as is always true in the GLM case.]
__________________________
INSERT TABLE 7 ABOUT HERE.
__________________________
Elsewhere I (Thompson, 1995b) have explained some of these counterintuitive dynamics by portraying a hypothetical set of results involving five response variables. Presume there were three "fence-riders," that is, cases very near the classification boundaries (arbitrarily cases #4, #11, and #51). Let's say with five predictor variables our initial lambda is .50, and let's say we add an additional, sixth response variable as a PDA predictor.
Clearly, having more predictive information always help us better explain data dynamics, or at least can't take away what we already know. This is reflected by the fact that the Wilks lambda value will always stay the same or get better (i.e., smaller) as we add predictor variables.
But this occurs only on the average, as reflected in on-the-average statistics such as lambda. While relative explanatory power will remain the same or improve on the average, at the case level each and every single case will not necessarily move toward its actual group's location when the additional sixth predictor variable is used. For example, let's say that all cases' positions except cases #4, #11, #51 and #43 remain fixed in essentially their initial locations and that group territorial boundaries also remain roughly unchanged.
If because the sixth predictor was especially useful in locating case #43, case #43 might move very far toward but not over the boundary that would have yielded a correct classification. Lambda would reflect this change by getting better (i.e., smaller), such as changing from .50 to perhaps .45. Cases #4, #11, and #51 might move slightly away from their actual group, because although the sixth predictor will either not change explanatory power or will provide more information on the average, it is still possible that the sixth predictor may provide misinformation about these three particular cases, resulting in their moving across their actual group boundary and becoming misclassified. This small movement will, of course, be reflected in lambda, which will correspondingly get only slightly worse (i.e., bigger), such as moving from .45 to .46. Yet even though on the average locations have gotten more accurate and lambda has consequently improved from the original .50 to the final .46, the number of cases correctly classified when using all six predictors will have gotten worse by a net classification-accuracy change of minus three cases. (Thompson, 1995b, p. 345, emphasis in original)
Huberty (1994) has noted that, "It is quite common to find the use of 'stepwise analyses' reported in empirically based journal articles" (p. 261). Huebner (1991, 1992) and Jorgenson, Jorgenson, Gillis and McCall (1993) are a few examples from among the many egregious reports of stepwise analyses.
Stepwise methods continue to be used, notwithstanding scathing indictments of many of these applications (cf. Huberty, 1989; Snyder, 1991). My own feelings are intimated by the title of one of my editorials, viz. "Why won't stepwise methods die?" (Thompson, 1989).
Three major problems with stepwise can be noted, and will be briefly summarized here. A more complete treatment is available in Thompson (1995c).
The consequences of these three problems are quite serious. As Cliff (1987, p. 185) noted, "most computer programs for [stepwise] multiple regression are positively satanic in their temptations toward Type I errors." He also suggested that, "a large proportion of the published results using this method probably present conclusions that are not supported by the data" (pp. 120-121).
Wrong Degrees of Freedom
First, most computer packages (and thus most researchers) use the wrong degrees of freedom in their statistical significance tests for stepwise methods, thus systematically always inflating the likelihood of obtaining statistically significant results. Degrees of freedom are the "coins" we pay to investigate the dynamics within our data. The statistical significance tests take into account both the number of coins we've chosen to spend and the number we have chosen to reserve.
The most rigorous tests occur when we spend few degrees of freedom and reserve many. Conversely, at the extreme, all models with no degrees of freedom reserved (i.e., degrees of freedom error =0) always fit the data perfectly. For example, the bivariate r2 with n=2 inherently is always 1.0, as long as both X and Y are variables. Similarly, the multiple regression R2 with two predictors variables and n=3 inherently must always be 1.0.
The computer packages conventionally charge degrees of freedom for the numerator (synonymously also called "model," "between," "regression," and "explained," to confuse the graduate students) that are a function of the number of response variables "entered" in the analysis at a given step. The remaining degrees of freedom (synonymously called "denominator," "residual," "error," "within," and "unexplained") are inversely related to the number of response variables "entered" in a given step.
Table 8 illustrates these dynamics for a study involving 2 steps of stepwise analysis, with k=3 groups and n=120 people. Table 8 compares the results for two steps of analysis using the degrees of freedom calculations employed by SPSS and other computer packages, labelled "Incorrect," with the same calculations employing the correct degrees of freedom.
__________________________
INSERT TABLE 8 ABOUT HERE.
__________________________
The differences in the analyses revolves around what "entered" means. The computer packages define "entered" or "used" as actually entered into the prediction equation. Thus, in step one the packages consider that only one predictor has been entered, while in step two the packages consider that two response variables have been entered.
However, in this example each and every one of the 50 response variables was "used" at each and every one of the three steps, to decide which variable to enter at each step. The 49 or 48 unselected response variables may not have been retained in the analysis, but each one was examined, and played with, and actually tasted, prior to the leftovers then being returned to the cafeteria display case.
This system of determining the degrees of freedom bill is analogous to only charging John Belushi in the movie Animal House for the food on his cafeteria tray, and charging nothing for what he has tasted and discarded. Clearly, this statistical package system of coinage is wrong. [Charging only for variables actually entered at each step would be appropriate, for example, if these response variables were randomly selected without first tasting each and every response variable.]
It is instructive to see how using the wrong degrees of freedom in the numerator of the statistical significance testing calculations, and the wrong denominator df in the calculations, both bias the tests in favor of getting statistical significance. Table 8 illustrates how dramatic the effect of using the wrong degrees of freedom can be.
After one step, the computer calculates that F(2,117) = 15.29841, with an associated probability of .0000012; the correct F(100,136) is 0.16751, with an associated probability of 1.00000. After the second step, the computer calculates that F(4,232) = 13.64322, with an associated probability of .0000945; the correct F(100,136) is 0.31991, with an associated probability of 1.00000. Obviously, the example illustrates that the correct and incorrect results can be night-vs-day different!
Three factors determine exactly how egregiously the use of the wrong degrees of freedom distorts the stepwise results. The distortions are increasingly serious as (a) sample size is smaller, (b) the number of steps is larger, and (c) the number of response variables available to be selected is larger.
Nonreplicability of Results
Second, stepwise methods tend to yield results that are sample-specific and do not generalize well to future studies. This is because stepwise requires a linear sequence of decisions, each of which is contingent upon all the previous decisions in the sequence. This is very much like walking through a maze--an incorrect decision at any point will lead to a cascade of subsequent decisions that each may themselves be wrong.
Stepwise considers all differences of any magnitudes between variance explained by the response variables to be exact and true. Since there are usually numerous combinations of the response variables, and credit for variance explained for each partition of the variables may be influenced by sampling error, any small amount of sampling error anywhere in a single response variable can lead to disastrously erroneous choices in the linear sequence of stepwise selection decisions.
Stepwise Does NOT Identify the Best Variable Set of a Given Size
Third, stepwise methods do not correctly identify the best set of predictors of a given response variable set size, k. For example, if one has 30 response variables, and does three steps of analysis, it is possible that the best predictor set of size k=3 will include none of the three variables selected after three steps of stepwise analysis of the same data, and that the three stepwise variables would also yield a lower effect size.
This may seem counter-intuitive, but upon reflection, it should be easy to see that in fact stepwise analysis does not seek to identify the best variable set of a certain size. Stepwise simply does not ask the question, "What is the best predictor set of a given size?" This question requires simultaneously considering all the combinations of the variables that are possible for a given set size. Stepwise analysis never simultaneously considers all the combinations of the predictor variables. Rather, at each step stepwise analysis takes the previously entered variables as a given, and then asks which one change in the predictor set will most improve the prediction.
Picking the best new variable in a sequence of selections is not the same as picking the best variable set of a given size. As Thompson (1995c) explained:
Suppose one was picking a basketball team consisting of five players. The stepwise selection strategy picks the best potential player first, then the second best player in the context of the characteristics of the previously-selected first player, and so forth.
An alternative strategy is an all-possible-subsets approach which asks, "which five potential players play together best as a team?". This team might conceivably contain exactly zero of the five players selected through the stepwise approach. Furthermore, this "best team" might be able to stomp the "stepwise team" by a considerable margin, because teams consisting of players of lesser abilities may still play together better as a team than players selected through a linear sequence of stepwise decisions. (pp. 528, 530, emphasis in original)
The Table 9 data provide a powerful heuristic. Table 10 presents an abridged printout for these data involving two steps of stepwise DDA, conducted using the Appendix C SPSS program. In this analysis the stepwise algorithm selects response variables X1 and X2, and the lambda value is .6553991 (F(4,232)=13.64322).
__________________________________
INSERT TABLES 9 AND 10 ABOUT HERE.
__________________________________
Compare the Table 10 results with those in Table 11. Table 11 presents the DDA results for all six possible combinations of the four response variables considered two at a time. Note that the best set of two variables (i.e., smallest lambda) involves response variables X3 and X4 (? = .6272538, F(4,232)=15.23292). The best variable set of size two contained neither of the two variables selected by the stepwise analysis!!!!!
___________________________
INSERT TABLE 11 ABOUT HERE.
___________________________
As noted previously, all univariate and multivariate methods apply weights to the measured variables to derive scores on the latent or synthetic variables that are actually the focus of all analyses. Consequently, if (and only if) noteworthy effects (e.g., R2, Rc2) are detected, it then becomes reasonable to consult the weights as part of the process of determining which response variables contributed to the detected effect. Indeed, some researchers have even taken the view that these weights (e.g., beta weights, standardized discriminant function coefficients) should be the sole basis for evaluating the importance of response variables (Harris, 1989).
Unfortunately, overinterpretation of GLM weights is a serious threat. The weights can be greatly influenced by which variables are included or are excluded from a given analysis. Furthermore, Cliff (1987, pp. 177-178) noted that weights for a given set of variables may vary widely across samples, and yet consistently still yield the same effect sizes (i.e., be what he called statistically "sensitive"). Clearly weights are not the sole story in interpretation.
Any interpretations of weights must be considered context-specific. Any change in the variables in the model can radically alter all of the weights. Too few researchers appreciate the potential magnitudes of these impacts.
The Table 12 data illustrate these dynamics. The analysis contrasts using DDA models with either three response variables (i.e., X1, X2, and X3) or four response variables (i.e., X1, X2, X3, and X4). The example can be framed as either adding one response variable to an analysis involving three response variables, or deleting one response variable from an analysis involving four. This DDA example involves variance-covariance matrices for each of three groups that are exactly equal (called "homogeneity"), so the results are not confounded by failure to meet one of the assumptions of the analysis.
___________________________
INSERT TABLE 12 ABOUT HERE.
___________________________
Table 13 presents an excerpt from an SPSS analysis of the Table 12 data conducted using the Appendix D computer program. Note the dramatic changes in the DDA standardized function coefficients. For example, with three response variables the first response variable, X1, had standardized function coefficients of 1.50086 and -.01817 on the two DDA functions. With four response variables X1 had standardized function coefficients of -.47343 and 1.22249 on the two DDA functions. Thus, the coefficients were quite variable in both magnitude and sign.
___________________________
INSERT TABLE 13 ABOUT HERE.
___________________________
These fluctuations are not problematic, if (and only if) the researcher has selected exactly the right model (i.e., has not made what statisticians call a model specification error). But as Pedhazur (1982) has noted, "The rub, however, is that the true model is seldom, if ever, known" (p. 229). And as Duncan (1975) has noted, "Indeed it would require no elaborate sophistry to show that we will never have the 'right' model in any absolute sense" (p. 101).
In other words, as a practical matter, the context-specificity of weights is always problematic, and the weights consequently must be interpreted cautiously. Some researchers acknowledge the vulnerability of the weights to sampling error influences (i.e., the so-called "bouncing beta" problem), but a more obvious concern is the context-specificity of the weights in the real-world context of full or partial model misspecification.
A response variable given a standardized weight of zero is being obliterated by the multiplicative weighting process, indicating either that (a) the variable has zero capacity to explain relationships among the variables or that (b) the variable has some explanatory capacity, but one or more other variables yield the same explanatory information and are arbitrarily (not wrongly, just arbitrarily) receiving all the credit for the variable's predictive power. Because a response variable may be assigned a standardized multiplicative weight of zero when (b) the variable has some explanatory capacity, but one or more other variables yield the same explanatory information and are arbitrarily (not wrongly, just arbitrarily) given all the credit for the variable's predictive power, it is essential to evaluate other coefficients in addition to standardized weights during interpretation, to determine the specific basis for the weighting.
Just as it would be incorrect to evaluate predictor variables in a regression analysis only by consulting beta weights (Cooley & Lohnes, 1971, p. 55; Thompson & Borrello, 1985), in any GLM analysis it would be inappropriate to only consult standardized weights during result interpretation (Borgen & Seling, 1978, p. 692; Kerlinger & Pedhazur, 1973, p. 344; Levine, 1977, p. 20; Meredith, 1964, p. 55, Thompson, 1997b). Yet, some researchers do exactly that (cf. Humphries-Wadsworth, 1998).
Under most circumstances standardized weights are not correlation coefficients. Thus, some of the weights in the Table 11 are less than -1 or are greater than +1. Structure coefficients, on the other hand, are always correlation coefficients, and reflect the linear relationship between scores on a given measured or observed variable with the scores on a given latent or synthetic variable. Thus, because synthetic variable are actually the focus of all parametric analyses, and because structure coefficients reveal the structure of these latent variables, the importance of structure coefficients seems obvious.
Three possible cases can be delineated. The three illustrations demonstrate that jointly considering both standardized weights and structure coefficients indicates to the researcher which case is present in a given analysis. Appendix E presents the SPSS computer program used to analyze the three heuristic data sets.
Case #1: Function and Structure Coefficients are Equal
In the special GLM case where measured variables are uncorrelated, the standardized weights in this case (and in this case only) are correlation coefficients. For example, in regression, if the predictor variables are uncorrelated, each predictor variable's beta weight equals that variable's product-moment correlation with the criterion variable. In discriminant analysis, the same principle applies if the "pooled" correlation matrix of the response variables indicates that the response variables are uncorrelated.
Table 14 presents a hypothetical DDA data set illustrating this case for a k=3 group problem involving scores of n=30 people on each of p=3 response variables. As indicated by the Table 15 excerpt from the SPSS output for these data, in this special case the standardized function coefficients exactly equal the respective structure coefficients of the response variables.
___________________________________
INSERT TABLES 14 AND 15 ABOUT HERE.
___________________________________
Case #2: Measured Variables with Near-zero Weights Still Important
As noted previously, measured variables may be assigned multiplicative weights of zero if the measured variable contains useful variance, but that variance is also present in some combination of the other measured variables. The researcher interpreting these results, especially if only standardized weights are interpreted, might erroneously conclude that such a response variable with a near-zero weight had essentially no utility in generating the observed effect. Instead, the result merely indicates that this variable is arbitrarily being denied credit for its potential contributions.
Table 16 presents a relevant heuristic DDA data set for this case involving k=3 groups and p=3 response variables. Table 17 presents an excerpt from the related SPSS analysis of the tabled data.
___________________________________
INSERT TABLES 16 AND 17 ABOUT HERE.
___________________________________
In this example, the standardized function coefficient on Function I for X3 was -.05507, while on the same function the other two response variables had standardized function coefficients of roughly +.95. Yet the squared structure coefficient (rS2 = .814312 = 66.3%) for X3 on the function indicates that X3 had more than twice the explanatory power as variables X1 (rS2 = .541412 = 29.3%) and X2 (rS2 = .564532 = 31.9%). Clearly, consulting only the function coefficients for this example would have resulted in a serious misinterpretation of results.
Case #3: "Suppressor" Effects
The previous case makes clear that a measured variable assigned a zero or near-zero weight may nevertheless be an important variable, as reflected in the variable having a large non-zero structure coefficient. However, although it may seem counter-intuitive, a measured/observed variable may also have a zero or near-zero structure coefficient, and still be very important in defining a detected effect, as reflected in the variable having a non-zero standardized weight. [That is, only measured variables with both near-zero weights and near-zero structure coefficients are useless in defining a given detected effect.]
Such a variable is classically termed a "suppressor" variable. However, although the name may feel pejorative, a "suppressor" variable actually increases the effect size, and so suppression is a good (and not a bad) thing. As defined by Pedhazur (1982, p. 104), in the related regression case, "A suppressor variable is a variable that has a zero, or close to zero, correlation with the criterion but is correlated with one or more than one of the predictor variables." Henard (1998) provides a nice overview of suppressor effects.
Suppressor effects are quite difficult to explain in an intuitive manner. But Horst (1966) gave an example that is relatively accessible. He described the multiple regression prediction of pilot training success during World War II using mechanical, numerical, and spatial ability scores, each measured with paper and pencil tests. The verbal scores had very low correlations with the dependent variable, but had larger correlations with the other two predictors, since they were all measured with paper and pencil tests, i.e., measurement artifacts inflate correlations among traits measured with similar methods. As Horst (1966, p. 355) noted, "Some verbal ability was necessary in order to understand the instructions and the items used to measure the other three abilities."
Including verbal ability scores in the regression equation in this example actually served to remove the contaminating influence of one predictor from the other predictors, which effectively increased the R2 value from what it would have been if only mechanical, numerical and spatial abilities had been used as predictors. The verbal ability variable had negative beta weights in the equation. As Horst (1966, p. 355) noted, "To include the verbal score with a negative weight served to suppress or subtract irrelevant ability, and to discount the scores [on the other predictors] of those who did well on the test simply because of their verbal ability rather than because of abilities required for success in pilot training." The fact that a measured variable unrelated to a measured criterion variable can still make important contributions in an analysis itself makes the very important point that the latent or synthetic variables analyzed in all parametric methods are always more than the sum of their constituent parts.
Table 18 presents a relevant heuristic DDA data set for this case involving k=3 groups and p=3 response variables. Table 19 presents an excerpt from the related SPSS analysis of the tabled data. As reported in Table 19, on Function I DDA response variable X3 had a near-zero structure coefficient (rS = -.03464), but a large non-zero standardized function coefficient (i.e., -1.58393). Indeed, on this function X3 had the largest absolute standardized function coefficient, since X1 and X2 had standardized function coefficients of +1.22956 and +1.21174, respectively.
___________________________________
INSERT TABLES 18 AND 19 ABOUT HERE.
___________________________________
Nature of Score Reliability
Misconceptions regarding the nature of reliability abound within the social sciences. For example, some researchers do not realize that, "Notwithstanding erroneous folkwisdom to the contrary, sometimes scores from shorter tests are more reliable than scores from longer tests" (Thompson, 1990, p. 586). In her important recent article, Vacha-Haase (1998a) cited the example of the Bem Sex-Role Inventory, noting that, "[i]n fact, the 20-item short-form of the Bem generally yields more reliable scores (rXX2 for the feminine scale ranging from .84 to .87) than does the 40-item long-form (rXX2 for the feminine scale ranging from .75 to .78)" (pp. 9-10).
Misconceptions regarding reliability flourish in part because
[a]lthough most programs in sociobehavioral sciences, especially doctoral programs, require a modicum of exposure to statistics and research design, few seem to require the same where measurement is concerned. Thus, many students get the impression that no special competencies are necessary for the development and use of measures... (Pedhazur & Schmelkin, 1991, pp. 2-3)
Empirical study of doctoral curricula confirms this impression (Aiken et al., 1990).
The most fundamental problem is that too few researchers act on a conscious recognition that reliability is a characteristic of scores or the data in hand, and not of tests. Test booklets are not impregnated with reliability during the printing process. The WISC that yields reliable scores for some adults on a given occasion of measurement will not necessarily do so when the same test is administered to first-graders.
Many researchers recognize these dynamics on some level, but unconscious paradigm influences constrain too many researchers from actively integrating this presumption into their actual analytic practice. The pernicious practice of saying, "the test is reliable," creates a language that unconsciously predisposes researchers against acting on a conscious realization that tests themselves are not reliable (Thompson, 1994c). Reinhardt (1996) provides an excellent relevant review of reliability coefficients, and the factors that impact score reliability.
As Rowley (1976, p. 53, emphasis added) argued, "It needs to be established that an instrument itself is neither reliable nor unreliable.... A single instrument can produce scores which are reliable, and other scores which are unreliable." Similarly, Crocker and Algina (1986, p. 144, emphasis added) argued that, "...A test is not 'reliable' or 'unreliable.' Rather, reliability is a property of the scores on a test for a particular group of examinees."
In another widely respected text, Gronlund and Linn (1990, p. 78, emphasis in original) noted,
Reliability refers to the results obtained with an evaluation instrument and not to the instrument itself.... Thus, it is more appropriate to speak of the reliability of the "test scores" or of the "measurement" than of the "test" or the "instrument."
And Eason (1991, p. 84, emphasis added) argued that:
Though some practitioners of the classical measurement paradigm [incorrectly] speak of reliability as a characteristic of tests, in fact reliability is a characteristic of data, albeit data generated on a given measure administered with a given protocol to given subjects on given occasions.
The subjects themselves impact the reliability of scores, and thus it becomes an oxymoron to speak of "the reliability of the test" without considering to whom the test was administered, or other facets of each individual measurement protocol. Reliability is driven by variance--typically, greater score variance leads to greater score reliability, and so more heterogeneous samples often lead to more variable scores, and thus to higher reliability. Therefore, the same measure, when administered to more heterogenous or to more homogeneous sets of subjects, will yield scores with differing reliability. As Dawis (1987, p. 486) observed, "[b]ecause reliability is a function of sample as well as of instrument, it should be evaluated on a sample from the intended target population--an obvious but sometimes overlooked point."
Our shorthand ways of speaking (e.g., language saying "the test is reliable") can itself cause confusion and lead to bad practice. As Pedhazur and Schmelkin (1991, p. 82, emphasis in original) observed, "Statements about the reliability of a measure are... inappropriate and potentially misleading." These telegraphic ways of speaking are not inherently problematic, but they often later become so when we come unconsciously to ascribe literal truth to our shorthand, rather than recognizing that our jargon is merely telegraphic and is not literally true. As noted elsewhere:
This is not just an issue of sloppy speaking--the problem is that sometimes we unconsciously come to think what we say or what we hear, so that sloppy speaking does sometimes lead to a more pernicious outcome, sloppy thinking and sloppy practice. (Thompson, 1992c, p. 436)
Implications for Practice
These views suggest at least three implications for research practice. These practices are, unfortunately, not yet normative within the social sciences.
Language Use. One fairly straightforward recommendation is that researchers should not use language saying that, "the test is reliable [or valid]," or that, "the reliability [or validity] of the test was .xx." Because on its face this language is inaccurate, and asserts untruth, it seems imprudent to use such language in scholarly discourse. The editorial policies of at least one journal commend better, correct practices:
Based on these considerations, use of wording such as "the reliability of the test" or "the validity of the test" will not be considered acceptable in the journal. Instead, authors should use language such as, "the scores in our study had a classical theory test-retest reliability coefficient of X," or "based on generalizability theory analysis, the scores in our study had a phi coefficient of X." Use of technically correct language will hopefully reinforce better practice. (Thompson, 1994c, p. 841)
Coefficient Reporting. Researchers also ought to routinely report the reliability coefficients for their own data. Many do not do so now, because they act under the pernicious misconception that tests are reliable, and are therefore invariant across administrations.
But it is sloppy practice to not calculate, report, and interpret the reliability of one's own scores for one's own data. As Pedhazur and Schmelkin (1991, p. 86, emphasis in original) argued:
Researchers who bother at all to report reliability estimates for the instruments they use (many do not) frequently report only reliability estimates contained in the manuals of the instruments or estimates reported by other researchers. Such information may be useful for comparative purposes, but it is imperative to recognize that the relevant reliability estimate is the one obtained for the sample used in the [present] study under consideration.
Unhappily, empirical studies indicate that such reports are infrequent (Meier & Davis, 1990; Willson, 1980) in most journals, although there are exceptions (Thompson & Snyder, in press).
In her important paper proposing "reliability generalization" methods to characterize (a) the mean and (b) the standard deviation of score reliabilities for a given instrument across studies, and to explore (c) the sources of variability in score reliabilities, Vacha-Haase noted a benefit from the routine reporting of score reliability even in substantive studies:
Furthermore, if authors of empirical studies routinely report reliability coefficients, even in substantive studies, the field will cumulate more evidence regarding the psychometric integrity of scores. Such practices would provide more fodder for reliability generalization analyses focusing upon the differential influences of various sources of measurement error. (Vacha-Haase, 1998a, p. 14)
Interpret Results in a Reliability Context. Effect sizes can and should be computed in all studies; Kirk (1996) and Snyder and Lawson (1993) provide excellent reviews of the many options. When and if these effects are deemed (a) noteworthy in magnitude and (b) replicable, then (and only then) these effect sizes should also be interpreted.
Score reliability is one of the several study features that impact detected effects. Score measurement errors always attenuate computed effects to some degree (Schneider & Darcy, 1984). This attenuation ought to be considered when interpreting reported effects. As I have noted elsewhere,
The failure to consider score reliability in substantive research may exact a toll on the interpretations within research studies. For example, we may conduct studies that could not possibly yield noteworthy effect sizes, given that score reliability inherently attenuates effect sizes. Or we may not accurately interpret the effect sizes in our studies if we do not consider the reliability of the scores we are actually analyzing. (Thompson, 1994c, p. 840)
As Pedhazur and Schmelkin (1991) noted, "probably very few methodological issues have generated as much controversy" (p. 198) as have the use and interpretation of statistical significance tests. These tests have proven surprisingly resistant to repeated efforts "to exorcise the null hypothesis" (Cronbach, 1975, p. 124). Especially noteworthy among the historical efforts to accomplish the exorcism have been works by Rozeboom (1960), Morrison and Henkel (1970), Carver (1978), Meehl (1978), Shaver (1985), and Oakes (1986).
More recently, a seemingly periodic series of articles on the extraordinary limits of statistical significance tests has been published in the American Psychologist (cf. Cohen, 1990, 1994; Kupfersmid, 1988; Rosenthal, 1991; Rosnow & Rosenthal, 1989). The entire Volume 61, Number 4 issue of the Journal of Experimental Education was devoted to these themes. Schmidt's (1996) APA Division 5 presidential address was published as the lead article in the second issue of the inagural volume of the new APA journal, Psycholgical Methods. The lead section (cf. Hunter, 1997) of the January, 1997 issue of Psychological Science was devoted to this controversy. The April, 1998 issue of Educational and Psychological Measurement featured two lengthy reviews (Levin, 1998; Thompson, 1998) of a major text (Harlow, Mulaik & Steiger, 1997) on the controversy. And the APA Task Force on Statistical Inference (Shea, 1996) has now been working for nearly two years on related recommendations for improving practices.
Illustrative condemnations of contemporary statistical testing practices can be noted. For example, Schmidt and Hunter (1997) recently argued that "Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution" (p. 37). Rozeboom (1997) was equally direct:
Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students... [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism... (p. 335)
But, without much question, two articles by the late Jacob Cohen (1990, 1994) have been the most influential. Roger Kirk (1996) characterized the two American Psychologist articles by Cohen as "classics," and argued that "the one individual most responsible for bringing the shortcomings of hypothesis testing to the attention of behavioral and educational researchers is Jacob Cohen" (p. 747).
This onslaught of criticism has provoked reactive advocacy for statistical tests (cf. Cortina & Dunlap, 1997; Frick, 1996; Greenwald, Gonzalez, Harris & Guthrie, 1996; Hagen, 1997; Robinson & Levin, 1997). Some of these treatments have been thoughtful, but others have been seriously flawed (see Thompson, in press-c, in press-d).
Yet, notwithstanding the long-term availability of these many publications, even today some researchers still do not understand what their statistical significance tests do and do not do. Empirical studies of researcher perceptions of test results confirm that researchers manifest these misconceptions (cf. Nelson, Rosenthal & Rosnow, 1986; Oakes, 1986; Rosenthal & Gaito, 1963; Zuckerman, Hodgins, Zuckerman & Rosenthal, 1993). Similarly, content reviews of the most widely-used statistics textbooks show that even our most distinguished methodologists do not have a good grasp on the meaning of statistical significance tests (Carver, 1978).
My own views have been articulated in various locations (e.g., Thompson, 1993, 1994d, 1997a, in press-a, in press-d). I believe that three other essays (Thompson, 1996, 1998, in press-b) are particularly noteworthy. And a short, public-domain ERIC Digest I published (Thompson, 1994b) may be very useful as a class handout.
I have never argued that significance tests should be banned, though obviously others have argued that view (cf. Carver, 1978; Schmidt & Hunter, 1997). As an author, I do report (without much excitement) the results of statistical significance tests. As an editor of three journals, I have accepted for publication manuscripts that report these tests.
Common Misconceptions Regarding Statistical Tests
In various locations I have criticized common misconceptions regarding the meaning and value of statistical tests (cf. Thompson, 1996, in press-b). Three of these I now briefly summarize here.
Statistical Significance Does Not Test Result Importance. Put simply, improbable events are not intrinsically interesting. Some highly improbable events, in fact, are completely inconsequential. In his classic hypothetical dialogue between two teachers, Shaver (1985, p. 58) poignantly illustrated the folly of equating result improbability with result importance:
Chris: ...I set the level of significance at .05, as my advisor suggested. So a difference that large would occur by chance less than five times in a hundred if the groups weren't really different. An unlikely occurrence like that surely must be important.
Jean: Wait a minute, Chris. Remember the other day when you went into the office to call home? Just as you completed dialing the number, your little boy picked up the phone to call someone. So you were connected and talking to one another without the phone ever ringing... Well, that must have been a truly important occurrence then?
Even more importantly, since the premises of statistical significance tests do not invoke human values, in valid logical argument statistical results therefore can not under any circumstances contain as part of their conclusions information about result value. As I have noted previously, "If the computer package did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results" (Thompson, 1993, p. 365). Thus, statistical tests cannot reasonably be used as an atavistic escape from responsibility for defending result importance (Thompson, 1993), or to maintain a mantle of feigned objectivity (Thompson, in press-b).
Statistical Significance Does Not Test Result Replicability. Social scientists seek to identify relationships that recur under stated conditions. Discovering analogs of cold fusion will make us extremely popular (free drinks, much dancing, etc.) at our next scholarly meeting, but we will eternally thereafter be shunned (no one will accept the drinks we attempt to buy for them, so much for the dancing, etc.) at all future conferences, once our results are discovered to be non-replicable. [So, only report non-replicable results at your last conference, immediately prior to retirement.]
Too many researchers, consciously or unconsciously, incorrectly assume that the p values calculated in statistical significance tests evaluate the probability that results will replicate (Carver, 1978, 1993). But statistical tests do not evaluate the probability that the sample statistics occur in the population as parameters (Cohen, 1994).
Instead, "pCALCULATED is the probability (0 to 1.0) of the sample statistics, given the sample size, and assuming the sample was derived from a population in which the null hypothesis (H0) is exactly true" (Thompson, 1996, p. 27). Obviously, knowing the probability of the sample is less interesting than knowing the probability of the population. Knowing the probability of population parameters would bear upon result replicability, since we would then know something about the population from which future researchers would also draw their samples.
But as Shaver (1993) argued so emphatically:
[A] test of statistical significance is not an indication of the probability that a result would be obtained upon replication of the study.... Carver's (1978) treatment should have dealt a death blow to this fallacy.... (p. 304)
And so Cohen (1994) concluded that the statistical significance test "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (p. 997).
Statistical Significance Does Not Solely Evaluate Effect Magnitude. Because various study features (including score reliability) impact calculated p values, pCALCULATED cannot be used as a satisfactory index of study effect size. As I have noted elsewhere,
The calculated p values in a given study are a function of several study features, but are particularly influenced by the confounded, joint influence of study sample size and study effect sizes. Because p values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single pCALCULATED, and 100 studies with the same single effect size could each have 100 different values for pCALCULATED. (Thompson, in press-b)
The recent fourth edition of the American Psychological Association style manual (APA, 1994) explicitly acknowledges that p values are not acceptable indices of effect:
Neither of the two types of probability values [statistical significance tests] reflects the importance or magnitude of an effect because both depend on sample size... You are [therefore] encouraged to provide effect-size information. (APA, 1994, p. 18, emphasis added)
Recommended Improvements in Statistical Testing Practices
In various locations (cf. Thompson, 1996, in press-b) I have advocated certain changed practices as regards the use of statistical tests. Five such suggested changes are now summarized here.
Effect Sizes Should Be Reported for All Tested Effects. The single most important potential improvement in analytic practice would be the regular and routine reporting of effect sizes in all studies. As noted previously, such reports are at least "encouraged" by the new APA (1994, p. 18) style manual.
However, empirical studies of articles published since 1994 in psychology, counseling, special education, and general education suggest that merely "encouraging" effect size reporting (APA, 1994) has not appreciably affected actual reporting practices (e.g., Kirk, 1996; Snyder & Thompson, in press; Thompson & Snyder, 1997, in press; Vacha-Haase & Nilson, in press). An on-going series of additional empirical studies of reporting practices has yielded similar results for yet more journals (Lance & Vacha-Haase, 1998; Ness & Vacha-Haase, 1998; Nillson & Vacha-Haase, 1998; Reetz & Vacha-Haase, 1998).
Effect sizes are important to report for at least two reasons. First, when these effects are noteworthy, these indices inform judgment regarding the practical or substantive significance of results (cf. Kirk, 1996). Second, reporting all effect sizes (even non-statistically significant effects, though some might not interpret them) facilitates the meta-analytic integration of findings across a given literature.
There are many effect sizes (e.g., "uncorrected," "corrected," standardized differences) that can be computed (cf. Kirk, 1996; Snyder & Lawson, 1993). In my view (Thompson, in press-b), arguments can be made that certain indices should be preferred over others. But the important point is that, as regards effect size reporting, it is generally better to report anything as against nothing, which is the effect size that most researchers currently report.
Of course, an effect size is no more magical than is statistical significance testing, for the two reasons noted by Zwick (1997). First, because human values are also not part of the calculation of an effect size, any more than values are part of the calculation of p, "largeness of effect does not guarantee practical importance any more than statistical significance does" (p. 4).
Second, some researchers have too rigidly adopted Cohen's (1988) definitions of small, medium and large effects, just as some researchers too rigidity adopted "?=.05" as their gold standard. Cohen (1988) only intended these as impressionistic characterizations of result typicality across a diverse literature. However, some empirical studies do suggest that the characterization is reasonably accurate (Glass, 1979; Olejnik, 1984), at least as regards a literature historically built with a bias against statistically non-significant results (Rosenthal, 1979).
In my view, editorial requirements (Vacha-Haase, 1998b) will ultimately be required to move the field to change analytic and reporting practices. Fortunately, editorial policies at some journals now require authors to report and interpret effect sizes. For example, the author guidelines of the Journal of Experimental Education indicate that "authors are required to report and interpret magnitude-of-effect measures in conjunction with every p value that is reported" (Heldref Foundation, 1997, pp. 95-96, emphasis added). I believe the EPM author guidelines are equally informed:
We will go further [than mere encouragement]. Authors reporting statistical significance will be required to both report and interpret effect sizes. However, these effect sizes may be of various forms, including standardized differences, or uncorrected (e.g., r2, R2, eta2) or corrected (e.g., adjusted R2, omega2) variance-accounted-for statistics. (Thompson, 1994c, p. 845, emphasis in original)
It is particularly noteworthy that editorial policies even at one APA journal now indicate that:
If an author decides not to present an effect size estimate along with the outcome of a significance test, I will ask the author to provide specific justification for why effect sizes are not reported. So far, I have not heard a good argument against presenting effect sizes. Therefore, unless there is a real impediment to doing so, you should routinely include effect size information in the papers you submit. (Murphy, 1997, p. 4)
Researchers Should More Frequently Employ Non-Nill Nulls. An important but overlooked (see Hagen, 1997; Thompson, in press-c) element of Cohen's (1994) classic article involved his striking criticism of the routine use of "nil" null hypotheses. Cohen (1994) defined a "nil" null hypothesis as a null specifying no differences (e.g., SD1-SD2 = 0) or zero correlations (e.g., R2=0).
Some researchers employ nil nulls because statistical theory does not easily accommodate the testing of some non-nil nulls. But in other cases researchers employ nil nulls because these nulls have been unconsciously accepted as traditional, because these nulls can be mindlessly formulated without consulting previous literature, or because most computer software defaults to tests of nil nulls (Thompson, 1998, in press-b, in press-c).
Unfortunately, when a statistical significance test presumes a nil null is true in the population, an untruth is posited. As Meehl (1978, p. 822) noted, "As I believe is generally recognized by statisticians today and by thoughtful social scientists, the null hypothesis, taken literally, is always false." Similarly, Hays (1981, p. 293) pointed out that "[t]here is surely nothing on earth that is completely independent of anything else [in the population]. The strength of association may approach zero, but it should seldom or never be exactly zero."
Highly respected statistician Roger Kirk (1996) put the point succinctly in his important recent article:
Because the null hypothesis is always false, a decision to reject it simply indicates that the research design had adequate power to detect a true state of affairs, which may or may not be a large effect or even a useful effect. It is ironic that a ritualistic adherence to null hypothesis significance testing has led researchers to focus on controlling the Type I error that cannot occur because all null hypotheses are false. (p. 747, emphasis added)
And a pCALCULATED value computed on the foundation of a false premise is inherently of somewhat limited utility.
There is a very important implication of the realization that the nil null is untrue in the population. As Hays (1981, p. 293) emphasized, because the nil null is untrue in the population, sample statistics should reflect some difference or some effect, and thus "virtually any study can be made to show significant results if one uses enough subjects." This means that
Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they're tired. (Thompson, 1992c, p. 436)
Statistical significance would be considerably more informative if researchers reviewed relevant previous research, and then constructed hypotheses that incorporated previous results.
Measurement Results Should be Tested with Non-Nil Nulls. There is growing recognition that some uses of statistical tests in measurement studies, as regards reliability or validity coefficients or construct validity tests of means, can be particularly misguided. For example, Abelson (1997) commented on statistical tests of measurement study results using nil null hypotheses:
And when a reliability coefficient is declared to be nonzero, that is the ultimate in stupefyingly vacuous information. What we really want to know is whether an estimated reliability is .50'ish or .80'ish. (Abelson, 1997, p. 121)
Fortunately, the author guidelines of some journals have become more enlightened as regards such practices:
Statistical tests of such coefficients in a measurement context make little sense. Either statistical significance tests using the [nil] null hypothesis of zero magnitude should be by-passed, or meaningful null hypotheses should be employed. (Thompson, 1994c, p. 844)
Researchers Should Provide Some Warrant That Results Are Replicable. Because evidence of result replicability is important (if we take science to be the business of cumulating knowledge across studies), because statistical significance tests do not evaluate result replicability (Cohen, 1994; Thompson, 1996, 1997b), other methods must and should be used for this purpose. It has been suggested that
As more researchers finally realize that statistical significance tests do not test the population, and therefore do not test replicability, researchers will increasingly emphasize evidence that instead is relevant to the issue of result replicability. (Vacha-Haase & Thompson, in press)
Many warrants are available, and in fact a single study might present several such warrants.
The most persuasive, and perhaps the only conclusive, evidence for result replicability is to actually replicate the study. And replication studies are important, and probably are somewhat undervalued in the social sciences (Robinson & Levin, 1997). However, many researchers (especially doctoral students working on dissertations and junior faculty seeking tenure) find themselves unable to replicate every study.
One potential warrant for replicability would involve prospectively formulating null hypotheses by reflectively consulting the effect sizes reported in previous related studies, and by prospectively interpreting study effects in the context of specific previous findings. In effect, virtually any study might be conducted and interpreted as a partial replication of previous inquiry. Another alternative warrant involves empirical investigation of replicability by conducting what I have termed (cf. Thompson, 1996) "internal" replicability analyses.
"Internal" replicability analyses empirically use the sample in hand to combine the participants in different ways to estimate how much the idiosyncracies of individuality within the sample have compromised generalizability. The major "internal" empirical replicability analyses are cross-validation, the jackknife, and the bootstrap (Diaconis & Efron, 1983); the logics are reviewed in more detail elsewhere (cf. Thompson, 1993, 1994d). "Internal" evidence for replicability is never as good as an actual replication (Robinson & Levin, 1997; Thompson, 1997a), but is certainly better than incorrectly presuming that statistical significance assures result replicability.
However, it must be emphasized that the inferential and the descriptive uses of these logics should not be confused (Thompson, 1993). For example, the inferential use of the bootstrap involves using the bootstrap to estimate a sampling distribution when the sampling distribution is not known or assumptions for the use of a known sampling distribution cannot be met (i.e., to conduct a different form of statistical significance test). The descriptive use of the bootstrap looks primarily at the variability in effect sizes or other parameter estimates across many different combinations of the participants. The software to conduct "internal" bootstrap analyses for statistics commonly used in the social sciences (cf. Elmore & Woehlke, 1988; Goodwin & Goodwin, 1985) is already widely available (e.g., Lunneborg (1987) for univariate applications, and Thompson (1988b, 1992a, 1995a) for multivariate applications).
Improved Language Use. In Thompson (1996), I suggested that when the null hypothesis is rejected, "such results ought to always be described as 'statistically significant,' and should never be described only as 'significant'" (pp. 28-29). My argument (Thompson, 1996, 1997a; but see Robinson & Levin, 1997) has been that the common meaning of "significant" has nothing to do with the statistical use of this term, and that the use of the complete phrase might help at least some in conveying that this technical phrase has nothing to do with result importance.
Carver (1993) eloquently made the same argument:
When trying to emulate the best principles of science, it seems important to say what we mean and to mean what we say. Even though many readers of scientific journals know that the word significant is supposed to mean statistically significant when it is used in this context, many readers do not know this. Why be unnecessarily confusing when clarity should be most important? (p. 288, emphasis in original)
After presenting a general linear model as a framework for discussion, the present paper reviewed five methodology errors that occur in educational research: (a) the use of stepwise methods; (b) the failure to consider in result interpretation the context specificity of analytic weights (e.g., regression beta weights, factor pattern coefficients, discriminant function coefficients, canonical function coefficients) that are part of all parametric quantitative analyses; (c) the failure to interpret both weights and structure coefficients as part of result interpretation; (d) the failure to recognize that reliability is a characteristic of scores, and not of tests; and (e) the incorrect interpretation of statistical significance and the related failure to report and interpret the effect sizes present in all quantitative analyses. In several cases small heuristic discriminant analysis data sets were presented to make more concrete and accessible the discussion of each of these five methodology errors.
However, of the various arenas for improvement, the one where I believe the most progress could be realized involves the use of statistical significance tests and the reporting of effect sizes. Yet this is where the most resistance has seemingly occurred. For example, Schmidt and Hunter (1997) recently argued that "logic-based arguments seem to have had only a limited impact... [perhaps due to] the virtual brainwashing in significance testing that all of us have undergone" (pp. 38-39). They also spoke of a "psychology of addiction to significance testing" (Schmidt & Hunter, 1997, p. 49).
Journal editor Loftus (1994), like others, has lamented that repeated publications of
these concerns never seem to attract much attention (much less impel action). They are carefully crafted and put forth for consideration, only to just kind of dissolve away in the vast acid bath of our existing methodological orthodoxy. (p. 1)
Another editor commented: "p values are like mosquitos" that apparently "have an evolutionary niche somewhere and [unfortunately] no amount of scratching, swatting or spraying will dislodge them" (Campbell, 1982, p. 698).
Similar comments have been made by non-editors. For example, Falk and Greenbaum (1995) noted that "A massive educational effort is required to... extinguish the mindless use of a procedure that dies hard" (p. 94). And Harris (1991) observed, "it is surprising that the dragon will not stay dead" (p. 375).
Fortunately, some slow, glacial progress in the incremental movement of the field was reflected in the APA (1994, p. 18) style manual "encouraging" the reporting of effect sizes. But enlightened editorial policies (e.g., Heldref Foundation, 1997; Murphy, 1997; Thompson, 1994c) now provide the strongest basis for cautious optimism.
Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented). In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 117-141). Mahwah, NJ: Erlbaum.
Aiken, L.S., West, S.G., Sechrest, L., Reno, R.R., with Roediger, H.L., Scarr, S., Kazdin, A.E., & Sherman, S.J. (1990). The training in statistics, methodology, and measurement in psychology. American Psychologist, 45, 721-734.
American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
Bagozzi, R.P. (1981). Canonical correlation analysis as a special case of a structural relations model. Multivariate Behavioral Research, 16, 437-454.
Borgen, F.H., & Seling, M.J. (1978). Uses of discriminant analysis following MANOVA: Multivariate statistics for multivariate purposes. Journal of Applied Psychology, 63(6), 689-697.
Campbell, N. (1982). Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology, 67, 691-700.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Carver, R. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.
Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt Brace Jovanovich.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cooley, W.W., & Lohnes, P.R. (1971). Multivariate data analysis. New York: John Wiley & Sons.
Cortina, J.M., & Dunlap, W.P. (1997). Logic and purpose of significance testing. Psychological Methods, 2, 161-172.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Cronbach, L.J. (1975). Beyond the two disciplines of psychology. American Psychologist, 30, 116-127.
Dawis, R.V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481-489.
Diaconis, P., & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.
____________
* Cited empirical studies of methodological practice are designated with asterisks.
Duncan, O.D. (1975). Introduction to structural equation models. New York: Academic Press.
Eason, S. (1991). Why generalizability theory yields better results than classical test theory: A primer with concrete examples. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI Press.
Elmore, P.B., & Woehlke, P.L. (1988). Statistical methods employed in American Educational Research Journal, Educational Researcher, and Review of Educational Research from 1978 to 1987. Educational Researcher, 17(9), 19-20.
*Emmons, N.J., Stallings, W.M., & Layne, B.H. (1990, April). Statistical methods used in American Educational Research Journal, Journal of Educational Psychology, and Sociology of Education from 1972 through 1987. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA. (ERIC Document Reproduction Service No. ED 319 797)
Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5(1), 75-98.
Fan, X. (1996). Canonical correlation analysis as a general analytic model. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 71-94). Greenwich, CT: JAI Press.
Fan, X. (1997). Canonical correlation analysis and structural equation modeling: What do they have in common? Structural Equation Modeling, 4, 65-79.
Fish, L.J. (1988). Why multivariate methods are usually vital. Measurement and Evaluation in Counseling and Development, 21, 130-137.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
Gage, N.L. (1985). Hard gains in the soft sciences: The case of pedagogy. Bloomington, IN: Phi Delta Kappa Center on Evaluation, Development, and Research.
Gall, M.D., Borg, W.R., & Gall, J.P. (1996). Educational research: An introduction (6th ed.). White Plains, NY: Longman.
*Glass, G.V (1979). Policy for the unpredictable (uncertainty research and policy). Educational Researcher, 8(9), 12-14.
Goodwin, L.D., & Goodwin, W.L. (1985). Statistical techniques in AERJ articles, 1979-1983: The preparation of graduate students to read the educational research literature. Educational Researcher, 14(2), 5-11.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.
Greenwald, A.G., Gonzalez, R., Harris, R.J., & Guthrie, D. (1996). Effect size and p-values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175-183.
Grimm, L.G., & Yarnold, P.R. (Eds.). (1995). Reading and understanding multivariate statistics. Washington, DC: American Psychological Association.
Gronlund, N.E., & Linn, R.L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.
Hagen, R.L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
*Hall, B.W., Ward, A.W., & Comer, C.B. (1988). Published educational research: An empirical study of its quality. Journal of Educational Research, 81, 182-189.
Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What if there were no significance tests?. Mahwah, NJ: Erlbaum.
Harris, M.J. (1991). Significance tests are not enough: The role of effect-size estimation in theory corroboration. Theory & Psychology, 1, 375-382.
Harris, R.J. (1989). A canonical cautionary. Multivariate Behavioral Research, 24, 17-39.
Hays, W. L. (1981). Statistics (3rd ed.). New York: Holt, Rinehart and Winston.
Heldref Foundation. (1997). Guidelines for contributors. Journal of Experimental Education, 65, 95-96.
Henard, D.H. (1998, January). Suppressor variable effects: Toward understanding an elusive data dynamic. Paper presented at the annual meeting of the Southwest Educational Research Association, Houston. (ERIC Document Reproduction Service No. ED forthcoming)
Holzinger, K. L. & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution (No. 48). Chicago: University of Chicago.
Horst, P. (1966). Psychological measurement and prediction. Belmont, CA: Wadsworth.
Huberty, C.J (1989). Problems with stepwise methods--better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1, pp. 43-70). Greenwich, CT: JAI Press.
Huberty, C.J (1994). Applied discriminant analysis. New York: Wiley and Sons.
Huberty, C.J, & Barton, R. (1989). An introduction to discriminant analysis. Measurement and Evaluation in Counseling and Development, 22, 158-168.
Huberty, C.J, & Lowman, L.L. (1997). Discriminant analysis via statistical packages. Educational and Psychological Measurement, 57, 759-784.
Huberty, C.J, & Wisenbaker, J. (1992). Discriminant analysis: Potential improvements in typical practice. In B. Thompson (Ed.), Advances in social science methodology (Vol. 2, pp. 169-208). Greenwich, CT: JAI Press.
Huebner, E. S. (1991). Correlates of life satisfaction in children. School Psychology Quarterly, 6, 103-111.
Huebner, E. S. (1992). Burnout among school psychologists: An exploratory investigation into its nature, extent, and correlates. School Psychology Quarterly, 7, 129-136.
Humphries-Wadsworth, T.M. (1998, April). Features of published analyses of canonical results. Paper presented at the annual meeting of the American Educational Research Association, San Diego. (ERIC Document Reproduction Service No. ED forthcoming)
Hunter, J.E. (1997). Needed: A ban on the significance test. Psychological Science, 8(1), 3-7.
Jöreskog, K.G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications (2nd ed.). Chicago: SPSS.
Jorgenson, C. B., Jorgenson, D. E., Gillis, M. K., & McCall, C. M. (1993). Validation of a screening instrument for young children with teacher assessment of school performance. School Psychology Quarterly, 8, 125-139.
Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.
*Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410-416.
Kupfersmid, J. (1988). Improving what is published: A model in search of an editor. American Psychologist, 43, 635-642.
*Lance, T., & Vacha-Haase, T. (1998, August). The Counseling Psychologist: Trends and usages of statistical significance testing. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
Levin, J.R. (1998). To test or not to test H0? Educational and Psychological Measurement, 58, 311-331.
Levine, M. S. (1977). Canonical analysis and factor comparison. Newbury Park, CA: Sage.
Loftus, G.R. (1994, August). Why psychology will never be a real science until we change the way we analyze data. Paper presented at the annual meeting of the American Psychological Association, Los Angeles.
Lunneborg, C.E. (1987). Bootstrap applications for the behavioral sciences. Seattle: University of Washington.
Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
*Meier, S.T., & Davis, S.R. (1990). Trends in reporting psychometric properties of scales used in counseling psychology research. Journal of Counseling Psychology, 37, 113-115.
Meredith, W. (1964). Canonical correlations with fallible data. Psychometrika, 29, 55-65.
Morrison, D.E., & Henkel, R.E. (Eds.). (1970). The significance test controversy. Chicago: Aldine.
Murphy, K.R. (1997). Editorial. Journal of Applied Psychology, 82, 3-5.
*Nelson, N., Rosenthal, R., & Rosnow, R.L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.
*Ness, C., & Vacha-Haase, T. (1998, August). Statistical significance reporting: Current trends and usages within Professional Psychology: Research and Practice. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
*Nillson, J., & Vacha-Haase, T. (1998, August). A review of statistical significance reporting in the Journal of Counseling Psychology. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
*Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
*Olejnik, S.F. (1984). Planning educational research: Determining the necessary sample size. Journal of Experimental Education, 53, 40-48.
Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction (2nd ed.). New York: Holt, Rinehart and Winston.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.
*Reetz, D., & Vacha-Haase, T. (1998, August). Trends and usages of statistical significance testing in adult development and aging research: A review of Psychology and Aging. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
Reinhardt, B. (1996). Factors affecting coefficient alpha: A mini Monte Carlo study. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 3-20). Greenwich, CT: JAI Press.
Robinson, D., & Levin, J. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21-26.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46, 1086-1087.
*Rosenthal, R. & Gaito, J. (1963). The interpretation of level of significance by psychological researchers. Journal og Psychology, 55, 33-38.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Rowley, G.L. (1976). The reliability of observational measures. American Educational Research Journal, 13, 51-59.
Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Rozeboom, W.W. (1997). Good science is abductive, not hypothetico-deductive. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 335-392). Mahwah, NJ: Erlbaum.
Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1(2), 115-129.
Schmidt, F.L., & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L.L. Harlow, S.A. Mulaik & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Erlbaum.
Schneider, A. L., & Darcy, R. E. (1984). Policy implications of using significance tests in evaluation research. Evaluation Review, 8, 573-582.
Shaver, J. (1985). Chance and nonsense. Phi Delta Kappan, 67(1), 57-60.
Shaver, J. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61, 293-316.
Shea, C. (1996). Psychologists debate accuracy of "significance test." Chronicle of Higher Education, 42(49), A12, A16.
Snyder, P. (1991). Three reasons why stepwise regression methods should not be used by researchers. In B. Thompson (Ed.), (1991). Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 99-105). Greenwich, CT: JAI Press.
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.
Snyder, P.A., & Thompson, B. (in press). Use of tests of statistical significance and other analytic choices in a school psychology journal: Review of practices and suggested alternatives. School Psychology Quarterly.
Thompson, B. (1984). Canonical correlation analysis: Uses and interpretation. Newbury Park, CA: Sage.
Thompson, B. (1985). Alternate methods for analyzing data from experiments. Journal of Experimental Education, 54, 50-55.
Thompson, B. (1988a, November). Common methodology mistakes in dissertations: Improving dissertation quality. Paper presented at the annual meeting of the Mid-South Educational Research Association, Louisville, KY. (ERIC Document Reproduction Service No. ED 301 595)
Thompson, B. (1988b). Program FACSTRAP: A program that computes bootstrap estimates of factor structure. Educational and Psychological Measurement, 48, 681-686.
Thompson, B. (1989). Why won't stepwise methods die?. Measurement and Evaluation in Counseling and Development, 21(4), 146-148.
Thompson, B. (1990). ALPHAMAX: A program that maximizes coefficient alpha by selective item deletion. Educational and Psychological Measurement, 50, 585-589.
Thompson, B. (1991). A primer on the logic and use of canonical correlation analysis. Measurement and Evaluation in Counseling and Development, 24(2), 80-95.
Thompson, B. (1992a). DISCSTRA: A computer program that computes bootstrap resampling estimates of descriptive discriminant analysis function and structure coefficients and group centroids. Educational and Psychological Measurement, 52, 905-911.
Thompson, B. (1992b). Misuse of ANCOVA and related "statistical control" procedures. Reading Psychology, 13, iii-xviii.
Thompson, B. (1992c). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434-438.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
Thompson, B. (1994a, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)
Thompson, B. (1994b). The concept of statistical significance testing (An ERIC/AE Clearinghouse Digest #EDO-TM-94-1). Measurement Update, 4(1), 5-6. (ERIC Document Reproduction Service No. ED 366 654)
Thompson, B. (1994c). Guidelines for authors. Educational and Psychological Measurement, 54(4), 837-847.
Thompson, B. (1994d). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157-176.
Thompson, B. (1994e, February). Why multivariate methods are usually vital in research: Some basic concepts. Paper presented as a Featured Speaker at the biennial meeting of the Southwestern Society for Research in Human Development, Austin, TX. (ERIC Document Reproduction Service No. ED 367 687)
Thompson, B. (1995a). Exploring the replicability of a study's results: Bootstrap statistics for the multivariate case. Educational and Psychological Measurement, 55, 84-94.
Thompson, B. (1995b). Review of Applied discriminant analysis by C.J Huberty. Educational and Psychological Measurement, 55, 340-350.
Thompson, B. (1995c). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26-30.
Thompson, B. (1997a). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26(5), 29-32.
Thompson, B. (1997b). The importance of structure coefficients in structural equation modeling confirmatory factor analysis. Educational and Psychological Measurement, 57, 5-19.
Thompson, B. (1998). Review of What if there were no significance tests? by L. Harlow, S. Mulaik & J. Steiger (Eds.). Educational and Psychological Measurement, 58, 332-344.
Thompson, B. (in press-a). Canonical correlation analysis. In L. Grimm & P. Yarnold (Eds.), Reading and understanding multivariate statistics (Vol. 2). Washington, DC: American Psychological Association.
Thompson, B. (in press-b). If statistical significance tests are broken/misused, what practices should supplement or replace them?. Theory & Psychology.
Thompson, B. (in press-c). In praise of brilliance, where that praise really belongs. American Psychologist.
Thompson, B. (in press-d). Why "encouraging" effect size reporting isn't working: The etiology of researcher resistance to changing practices. Journal of Psychology.
Thompson, B., & Borrello, G. M. (1985). The importance of structure coefficients in regression research. Educational and Psychological Measurement, 45, 203-209.
Thompson, B., & Daniel, L.G. (1996a). Factor analytic evidence for the construct validity of scores: An historical overview and some guidelines. Educational and Psychological Measurement, 56, 213-224.
Thompson, B., & Daniel, L.G. (1996b). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.
*Thompson, B., & Snyder, P.A. (1997). Statistical significance testing practices in the Journal of Experimental Education. Journal of Experimental Education, 66, 75-83.
*Thompson, B., & Snyder, P.A. (in press). Statistical significance and reliability analyses in recent JCD research articles. Journal of Counseling and Development.
Travers, R.M.W. (1983). How research has changed American schools: A history from 1840 to the present. Kalamazoo, MI: Mythos Press.
Tuckman, B.W. (1990). A proposal for improving the quality of published educational research. Educational Researcher, 19(9), 22-24.
Vacha-Haase, T. (1998a). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20.
Vacha-Haase, T. (1998b, August). A review of APA journals' editorial policies regarding statistical significance testing and effect size. Paper presented at the annual meeting of the American Psychological Association, San Francisco.
*Vacha-Haase, T., & Nilsson, J.E. (in press). Statistical significance reporting: Current trends and usages within MECD. Measurement and Evaluation in Counseling and Development.
Vacha-Haase, T., & Thompson, B. (in press). Further comments on statistical significance tests. Measurement and Evaluation in Counseling and Development.
*Vockell, E.L., & Asher, W. (1974). Perceptions of document quality and use by educational decision makers and researchers. American Educational Research Journal, 11, 249-258.
*Wandt, E. (1967). An evaluation of educational research published in journals (Report of the Committee on Evaluation of Research). Washington, DC: American Educational Research Association.
*Ward, A.W., Hall, B.W., & Schramm, C.E. (1975). Evaluation of published educational research: A national survey. American Educational Research Journal, 12, 109-128.
*Willson, V.L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5-10.
*Zuckerman, M., Hodgins, H.S., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists. Psychological Science, 4, 49-53.
Zwick, R. (1997, March). Would the abolition of significance testing lead to better science? Paper presented at the annual meeting of the American Educational Research Association, Chicago.
T6 T7 T2 T4 T20 T21 T22 T6 1.0000 .7332 | .1529 .1586 .3440 .3206 .4476 T7 .7332 1.0000 | .1394 .0772 .3367 .3020 .4698 T2 .1529 .1394 | 1.0000 .3398 .2812 .2433 .2812 T4 .1586 .0772 | .3398 1.0000 .3243 .3310 .3062 T20 .3440 .3367 | .2812 .3243 1.0000 .3899 .3947 T21 .3206 .3020 | .2433 .3310 .3899 1.0000 .3767 T22 .4476 .4698 | .2812 .3062 .3947 .3767 1.0000 Note. The variable labels for these seven variables are: T6 PARAGRAPH COMPREHENSION TEST T7 SENTENCE COMPLETION TEST T2 CUBES, SIMPLIFICATION OF BRIGHAM'S SPATIAL RELATIONS TEST T4 LOZENGES FROM THORNDIKE--SHAPES FLIPPED OVER THEN IDENTIFY TARGET T20 DEDUCTIVE MATH ABILITY T21 MATH NUMBER PUZZLES T22 MATH WORD PROBLEM REASONING
Table 2
Standardized Canonical Function Coefficients for the Table
1 Data
Derived Using the Appendix A SPSS/LISREL Program to Illustrate
That
SEM is the Most General Case of the General Linear Model
Standardized canonical coefficients for DEPENDENT variables Variable 1 2 T6 .44962 -1.40007 T7 .62246 1.33225 Standardized canonical coefficients for COVARIATES COVARIATE 1 2 T2 -.01468 .06704 T4 -.20012 -1.00653 T20 .34100 -.02762 T21 .26772 -.17401 T22 .73104 .35974
Table 3
LISREL "Gamma" Coefficients for the Table 1 Data
Derived
Using the Appendix A SPSS/LISREL Program to Illustrate That
SEM is the Most
General Case of the General Linear Model
GAMMA
T6 T7
ETA 1 0.44957 0.62250
GAMMA
T2 T4 T20 T21 T22
ETA 1 -0.01468 -0.20014 0.34100 0.26772 0.73104
GAMMA
T6 T7
ETA 1 0.44956 0.62251
ETA 2 1.40013 -1.33228
GAMMA
T2 T4 T20 T21 T22
ETA 1 -0.01469 -0.20014 0.34101 0.26771 0.73104
ETA 2 -0.06706 1.00653 0.02762 0.17402 -0.35972
Note. The LISREL coefficients for the "gamma" matrix exactly match (within rounding error) the canonical function coefficients presented previously. The only exception is that all the signs for the SEM second canonical function coefficients must be "reflected." "Reflecting" a function (changing all the signs on a given function, factor, or equation) is always permissible, because the scaling of psychological constructs is arbitrary. Thus, the SEM and the canonical analysis derived the same results.
Table 4
The Confusing Language of Statistics
(Intentionally Designed to
Confuse the Graduate Students)
Synthetic/
Standardized Weight Latent
Analysis Weightsa System Variable(s)
Multiple
Regression beta "equation" Yhat
Factor pattern "factor" factor
Analysis coefficients scores
Descriptive standardized "function" discriminant
Discriminant function -or- function
Analysis coefficients "rule" scores
Canonical standardized canonical
Correlation function "function" function
Analysis coefficients scores
aOf course, the term, "standardized weight", is an obvious oxymoron. A given weight is a constant applied to all the scores of all the cases/people on the observed/manifest/ measured variable, and therefore cannot be standardized. Instead, the weighting constant is applied to the measured variable in its standardized form, i.e., we should say "weight for the standardized measured variables" rather than "standardized weight".
Table 5
Holzinger and Swineford Data to Show
That More
Predictors May Actually Hurt Classification Accuracy
Seq ID GRADE T13 T17 T22 T16
1 2 7 285 12 21 100
2 3 7 159 1 18 95
3 9 7 265 18 18 105
4 14 7 211 8 22 103
5 16 7 211 5 34 102
6 18 7 189 13 16 100
7 20 7 207 3 47 107
8 22 7 194 8 19 96
9 25 7 244 6 20 99
10 28 7 163 12 24 106
11 30 7 310 10 20 101
12 34 7 121 3 18 92
13 44 7 167 11 22 112
14 46 7 100 4 25 58
15 47 7 240 6 20 103
16 50 7 226 4 39 109
17 51 7 196 8 18 96
18 52 7 218 7 18 92
19 58 7 151 15 25 102
20 66 7 142 3 13 95
21 68 7 172 10 32 110
22 71 7 181 9 27 107
23 74 7 153 15 21 99
24 75 7 141 14 19 107
25 76 7 195 10 19 103
26 78 7 186 7 30 109
27 79 7 215 10 15 103
28 81 7 165 11 22 108
29 83 7 233 2 28 100
30 85 7 203 8 24 103
31 202 7 195 9 22 106
32 203 7 228 1 43 101
33 205 7 160 9 35 99
34 208 7 333 16 45 118
35 213 7 154 3 19 106
36 225 7 236 21 29 116
37 226 7 219 6 23 104
38 230 7 189 1 7 99
39 232 7 143 2 27 94
40 235 7 162 3 16 100
41 236 7 205 6 27 101
42 239 7 112 3 18 90
43 244 7 137 0 24 105
44 245 7 214 4 26 100
45 250 7 120 3 28 112
46 252 7 165 1 10 101
47 253 7 137 1 15 89
48 256 7 214 4 28 97
49 257 7 223 5 23 106
50 263 7 205 5 35 103
51 264 7 180 6 36 97
52 268 7 130 3 14 103
53 269 7 220 4 31 113
54 277 7 149 1 21 96
55 86 8 207 19 37 112
56 88 8 217 24 20 106
57 89 8 191 10 27 109
58 90 8 208 9 17 98
59 106 8 260 17 41 104
60 112 8 148 11 34 105
61 118 8 271 11 34 113
62 120 8 175 10 24 111
63 126 8 180 11 21 96
64 131 8 247 20 26 101
65 132 8 119 2 28 91
66 133 8 234 14 44 113
67 134 8 172 23 26 99
68 137 8 177 11 25 93
69 139 8 208 18 34 107
70 140 8 227 9 13 108
71 143 8 259 16 23 107
72 148 8 196 7 39 96
73 150 8 248 17 32 110
74 151 8 255 26 34 112
75 153 8 206 11 16 105
76 155 8 238 16 49 102
77 158 8 227 18 15 101
78 160 8 197 6 25 100
79 165 8 195 9 29 91
80 282 8 241 1 27 115
81 283 8 230 4 26 103
82 284 8 200 11 8 108
83 285 8 246 16 33 109
84 287 8 227 11 48 109
85 288 8 168 11 28 104
86 289 8 224 13 43 104
87 290 8 189 7 38 110
88 297 8 199 8 30 108
89 298 8 249 15 50 119
90 299 8 212 7 29 102
91 304 8 210 5 27 104
92 311 8 198 7 34 107
93 312 8 237 6 18 108
94 313 8 206 15 50 107
95 315 8 215 5 27 101
96 317 8 183 9 18 113
97 318 8 187 8 35 109
98 322 8 220 7 26 109
99 323 8 178 8 27 103
100 324 8 150 6 8 102
101 329 8 235 6 18 101
102 338 8 206 26 37 113
103 341 8 174 7 46 105
104 342 8 162 9 29 96
105 343 8 228 1 39 104
106 345 8 204 7 25 112
107 351 8 186 25 39 109
Note. The variable labels are:
T13 SPEEDED DISCRIM STRAIGHT AND CURVED CAPS
T17 MEMORY OF OBJECT-NUMBER ASSOCIATION TARGETS
T22 MATH WORD PROBLEM REASONING
T16 MEMORY OF TARGET SHAPES
Table 6
Holzinger and Swineford Results to Show That
More
Predictors May Actually Hurt Classification Accuracy
--LDF and LCF
Score Classification Tables--
GRADE by LDFCL3 LDF classification 3 predictors
Count I
I
I Row
I 7I 8I Total
GRADE --------+------+------+
7 I 40I 14I 54
I I I 50.5
+------+------+
8 I 22I 31I 53
I I I 49.5
+------+------+
Column 62 45 107
Total 57.9 42.1 100.0
GRADE by LDFCL4 LDF classification 4 predictors
Count I
I
I Row
I 7I 8I Total
GRADE --------+------+------+
7 I 38I 16I 54
I I I 50.5
+------+------+
8 I 23I 30I 53
I I I 49.5
+------+------+
Column 61 46 107
Total 57.0 43.0 100.0
GRADE by LCFCL3 LCF classification 3 predictors
Count I
I
I Row
I 7I 8I Total
GRADE --------+------+------+
7 I 40I 14I 54
I I I 50.5
+------+------+
8 I 22I 31I 53
I I I 49.5
+------+------+
Column 62 45 107
Total 57.9 42.1 100.0
GRADE by LCFCL4 LCF classification 4 predictors
Count I
I
I Row
I 7I 8I Total
GRADE --------+------+------+
7 I 38I 16I 54
I I I 50.5
+------+------+
8 I 23I 30I 53
I I I 49.5
+------+------+
Column 61 46 107
Total 57.0 43.0 100.0
Table 7
Holzinger and Swineford Results to Show That
More
Predictors May Actually Hurt Classification Accuracy
--Both LDF and
LCF Actual Classifications--
Seq ID GRADE LDFCL3 LDFCL4 LCFCL3 LCFCL4
1 2 7 8 8 8 8
2 3 7 7 7 7 7
3 9 7 8 8 8 8
4 14 7 7 7 7 7
5 16 7 7 7 7 7
6 18 7 7 7 7 7
7 20 7 8 8 8 8
8 22 7 7 7 7 7
9 25 7 7 7 7 7
10 28 7 8 8 8 8
11 30 7 + 8 7 8 7
12 34 7 7 7 7 7
13 44 7 - 7 8 7 8
14 46 7 7 7 7 7
15 47 7 7 7 7 7
16 50 7 8 8 8 8
17 51 7 7 7 7 7
18 52 7 7 7 7 7
19 58 7 8 8 8 8
20 66 7 7 7 7 7
21 68 7 8 8 8 8
22 71 7 - 7 8 7 8
23 74 7 8 8 8 8
24 75 7 8 8 8 8
25 76 7 7 7 7 7
26 78 7 - 7 8 7 8
27 79 7 7 7 7 7
28 81 7 - 7 8 7 8
29 83 7 7 7 7 7
30 85 7 7 7 7 7
31 202 7 7 7 7 7
32 203 7 7 7 7 7
33 205 7 8 8 8 8
34 208 7 8 8 8 8
35 213 7 7 7 7 7
36 225 7 8 8 8 8
37 226 7 7 7 7 7
38 230 7 7 7 7 7
39 232 7 7 7 7 7
40 235 7 7 7 7 7
41 236 7 7 7 7 7
42 239 7 7 7 7 7
43 244 7 7 7 7 7
44 245 7 7 7 7 7
45 250 7 7 7 7 7
46 252 7 7 7 7 7
47 253 7 7 7 7 7
48 256 7 7 7 7 7
49 257 7 7 7 7 7
50 263 7 7 7 7 7
51 264 7 + 8 7 8 7
52 268 7 7 7 7 7
53 269 7 7 7 7 7
54 277 7 7 7 7 7
55 86 8 8 8 8 8
56 88 8 8 8 8 8
57 89 8 8 8 8 8
58 90 8 7 7 7 7
59 106 8 8 8 8 8
60 112 8 8 8 8 8
61 118 8 8 8 8 8
62 120 8 + 7 8 7 8
63 126 8 7 7 7 7
64 131 8 8 8 8 8
65 132 8 7 7 7 7
66 133 8 8 8 8 8
67 134 8 8 8 8 8
68 137 8 - 8 7 8 7
69 139 8 8 8 8 8
70 140 8 7 7 7 7
71 143 8 8 8 8 8
72 148 8 8 8 8 8
73 150 8 8 8 8 8
74 151 8 8 8 8 8
75 153 8 7 7 7 7
76 155 8 8 8 8 8
77 158 8 8 8 8 8
78 160 8 7 7 7 7
79 165 8 - 8 7 8 7
80 282 8 7 7 7 7
81 283 8 7 7 7 7
82 284 8 7 7 7 7
83 285 8 8 8 8 8
84 287 8 8 8 8 8
85 288 8 8 8 8 8
86 289 8 8 8 8 8
87 290 8 8 8 8 8
88 297 8 8 8 8 8
89 298 8 8 8 8 8
90 299 8 7 7 7 7
91 304 8 7 7 7 7
92 311 8 8 8 8 8
93 312 8 7 7 7 7
94 313 8 8 8 8 8
95 315 8 7 7 7 7
96 317 8 7 7 7 7
97 318 8 8 8 8 8
98 322 8 7 7 7 7
99 323 8 7 7 7 7
100 324 8 7 7 7 7
101 329 8 7 7 7 7
102 338 8 8 8 8 8
103 341 8 8 8 8 8
104 342 8 7 7 7 7
105 343 8 7 7 7 7
106 345 8 7 7 7 7
107 351 8 8 8 8 8
Note. The variable labels are:
LCFCL3 'LCF classification with 3 preds'
LCFCL4 'LCF classification with 4 preds'
LDFCL3 'LDF classification with 3 preds'
LDFCL4 'LDF classification with 4 preds'
For the present example, for both the 3 and the 4 response variable analyses, the LDF and the LCF scores classified all 107 persons into the same groups. This need not have happened, but will happen as the covariance matrices approach equality across groups.
However, in both the LDF and the LCF analyses, 9 persons were classified differently across these two analyses; these cases are underlined within the table. In both the LDF and the LCF analyses, the use of 4 rather than 3 response variables (a) correctly changed the predicted classification of 3 people (denoted with plus signs in the table), (b) incorrectly changed the predicted classification of 6 people (denoted with minus signs in the table), thus (c) resulting in a net worsening from using more information for prediction as regards 3 persons.
Table 8
Incorrect and Correct Statistical Tests
for Two Steps of
Stepwise Analysis
Involving k=3 Groups and n=120 People
Incorrect Step #1 Correct Step #1
For k=3, p=1 For k=3, p=50
df numerator = n - 1 = 2 df numerator = 2 * p = 100
df denominator = n - k = 117 df denominator = 2 (n-p-2)= 136
lambda = 0.79270
F exact = 1 - lambda n - k F exact = 1 -lambda.5 n-p-2
lambda k - 1 lambda.5 p
1 -0.79270 117 1-0.79270.5 136
0.79270 2 0.79270 100
1 -0.89034 136
0.89034 100
0.20730 58.5 0.10966 1.36
0.79270 0.89034
0.26151 58.5 0.12317 1.36
F exact = 15.29841 F exact = 0.16751
p calculated = .0000012 p calculated = 1.00000
Incorrect Step #2 Correct Step #2
For k=3, p=2 For k=3, p=50
df numerator = 2(k - 1) = 4 df numerator = 2 * p = 100
df denominator = 2(n-k-1)= 232 df denominator = 2 (n-p-2)= 136
lambda = 0.65540
F exact = 1 - lambda.5 n-k-1 F exact = 1 -lambda.5 n-p-2
lambda.5 k-1 lambda.5 p
1 -0.65540.5 232 1-0.65540.5 136
0.65540 4 0.65540 100
1 -0.80957 232 1 -0.80957 136
0.80957 4 0.80957 100
0.19043 58 0.19043 1.36
0.80957 0.80957
0.23523 58 0.23523 1.36
F exact = 13.64322 F exact = 0.31991
p calculated = .0000945 p calculated = 1.00000
Note. The formulae for degrees of freedom and F are presented by Tatsuoka (1971, pp. 88-89).
multivar.wk1 3/31/98
Table 9
Heuristic Data Illustrating That
Stepwise Methods Do
Not Identify the Best Variable Set
ID Grp X1 X2 X3
X4
1 1 30.202 46.146 36.393 44.268
2 1 36.268 44.816 46.370 42.663
3 1 39.381 30.775 32.532 31.966
4 1 32.511 26.201 35.776 40.843
5 1 42.809 39.137 40.845 47.970
6 1 54.841 32.072 32.474 52.689
7 1 32.669 51.460 55.332 40.989
8 1 36.884 45.926 29.255 44.400
9 1 49.781 42.148 43.681 37.719
10 1 51.618 44.373 41.579 48.125
11 1 51.375 43.457 55.160 35.306
12 1 55.102 46.903 44.780 44.669
13 1 33.286 38.660 39.553 32.117
14 1 31.384 41.336 36.259 44.751
15 1 50.000 50.275 61.363 33.207
16 1 39.322 56.273 55.674 34.216
17 1 41.290 47.550 38.913 63.592
18 1 48.098 45.198 38.960 58.692
19 1 61.910 27.474 38.298 46.657
20 1 50.028 51.954 50.832 44.419
21 1 34.585 44.304 36.311 46.899
22 1 57.834 49.899 49.276 50.643
23 1 49.760 29.312 44.098 61.037
24 1 26.010 60.816 58.574 31.081
25 1 23.075 57.059 48.307 40.710
26 1 34.310 44.277 34.315 52.634
27 1 54.714 41.616 51.413 52.284
28 1 60.945 43.890 44.886 40.360
29 1 44.667 52.236 53.525 51.628
30 1 48.442 57.685 57.240 34.324
31 1 38.796 49.830 34.957 45.241
32 1 47.693 43.561 28.529 52.057
33 1 44.497 53.306 41.543 46.079
34 1 55.224 62.785 58.527 32.167
35 1 50.654 26.676 40.851 30.122
36 1 42.632 54.313 49.072 34.758
37 1 50.753 54.410 45.739 59.575
38 1 43.564 42.998 39.366 51.515
39 1 34.850 58.913 64.975 39.955
40 1 50.408 43.214 43.598 59.859
41 2 47.213 37.836 44.151 50.418
42 2 34.168 33.221 29.149 46.838
43 2 58.639 27.033 48.206 52.029
44 2 38.730 49.495 48.813 48.258
45 2 51.596 53.009 51.326 45.759
46 2 62.621 39.735 52.727 71.905
47 2 51.737 37.667 45.013 38.552
48 2 43.922 61.284 55.784 55.129
49 2 42.726 54.703 54.281 37.671
50 2 44.939 48.408 36.004 64.368
51 2 42.050 59.340 61.987 63.012
52 2 37.950 63.446 55.519 35.175
53 2 46.938 56.395 65.436 48.823
54 2 59.976 53.046 51.431 54.273
55 2 59.651 46.707 58.262 48.909
56 2 61.465 36.292 45.301 63.513
57 2 51.051 46.853 51.258 43.695
58 2 40.534 43.357 40.944 50.941
59 2 48.756 53.468 56.950 39.971
60 2 69.683 38.471 49.262 37.572
61 2 46.532 48.917 49.324 62.440
62 2 47.390 33.825 28.706 53.079
63 2 45.617 69.776 56.763 51.743
64 2 56.300 47.684 57.178 51.941
65 2 36.826 69.819 62.206 60.214
66 2 55.413 49.488 48.629 43.843
67 2 52.831 56.210 56.712 45.976
68 2 53.087 46.471 48.024 43.155
69 2 47.221 57.142 52.413 48.072
70 2 54.653 57.012 51.724 48.850
71 2 51.779 65.569 66.259 46.466
72 2 46.009 52.845 48.452 54.614
73 2 52.968 48.023 50.156 50.077
74 2 43.296 45.937 45.162 58.516
75 2 55.779 55.454 59.676 23.961
76 2 55.410 62.863 58.090 48.973
77 2 51.454 57.612 54.929 45.531
78 2 48.538 44.353 49.021 49.085
79 2 62.931 45.867 53.116 54.326
80 2 68.626 47.541 49.993 70.532
81 3 40.113 52.329 50.289 49.856
82 3 63.539 41.711 46.398 59.927
83 3 45.115 61.546 65.551 61.702
84 3 36.029 43.581 38.991 45.273
85 3 51.691 31.516 41.387 55.789
86 3 66.255 59.021 45.930 63.253
87 3 54.119 53.613 57.157 56.673
88 3 49.996 64.174 63.878 61.408
89 3 60.048 59.992 61.433 41.806
90 3 46.350 50.215 59.540 57.780
91 3 49.121 60.275 44.200 69.682
92 3 48.088 68.394 59.637 51.042
93 3 52.787 59.393 61.506 46.042
94 3 44.986 41.866 39.170 43.529
95 3 55.269 68.011 59.191 60.153
96 3 50.261 47.608 44.830 54.833
97 3 56.321 57.470 59.734 51.043
98 3 50.766 49.361 54.050 50.134
99 3 65.540 45.512 58.401 54.444
100 3 47.305 63.725 55.889 44.630
101 3 61.232 52.462 59.623 49.975
102 3 43.688 54.287 54.662 44.419
103 3 74.301 49.445 45.461 64.624
104 3 46.216 55.011 43.794 70.389
105 3 50.882 46.326 42.779 48.925
106 3 48.898 58.229 56.452 60.881
107 3 60.911 60.077 62.039 62.825
108 3 60.918 49.582 43.208 48.960
109 3 49.932 65.463 79.812 53.265
110 3 55.415 61.860 64.733 49.648
111 3 66.505 36.375 41.958 60.718
112 3 59.574 52.291 63.181 60.637
113 3 62.806 42.934 51.890 57.537
114 3 55.761 68.426 60.399 52.615
115 3 73.150 46.255 38.224 77.559
116 3 56.814 60.450 64.211 40.352
117 3 50.092 65.513 44.826 54.327
118 3 65.086 58.518 62.482 48.116
119 3 57.997 66.886 58.486 63.017
120 3 73.867 46.347 70.118 61.087
Table 10
Heuristic Results Illustrating That
Stepwise Methods Do
Not Identify the Best Variable Set:
The DDA Stepwise
Results
AT STEP 1, X1 WAS INCLUDED IN THE ANALYSIS.
DEGREES OF FREEDOM SIGNIF.
BETWEEN GROUPS
WILKS' LAMBDA 0.79270 1 2 117.0
EQUIVALENT F 15.2988 2 117.0 0.0000
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
AT STEP 2, X2 WAS INCLUDED IN THE ANALYSIS.
DEGREES OF FREEDOM SIGNIF.
BETWEEN GROUPS
WILKS' LAMBDA 0.65540 2 2 117.0
EQUIVALENT F 13.6432 4 232.0 0.0000
CANONICAL DISCRIMINANT FUNCTIONS
: AFTER
: FUNCTION WILKS' LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
: 0 0.6553991 49.223 4 0.0000
: 1 0.9992265 0.90148E-01 1 0.7640
Note. These results were extracted from the output created by applying the Appendix C program to the Table 9 heuristic data.
Table 11
Heuristic Results Illustrating That
Stepwise Methods Do
Not Identify the Best Variable Set:
The DDA
All-Possible-Subsets Results
X1,X2
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.6553991 49.223 4 0.0000
0.9992265 0.90148E-01 1 0.7640
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.65540 13.64322 4.00 232.00 .000
.99923 .09057 1.00 117.00 .764
X1,X3
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.6961866 42.189 4 0.0000
0.9988321 0.13614 1 0.7122
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.69619 11.51286 4.00 232.00 .000
.99883 .13680 1.00 117.00 .712
X1,X4
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.7081264 40.208 4 0.0000
0.9991168 0.10294 1 0.7483
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.70813 10.92434 4.00 232.00 .000
.99912 .10343 1.00 117.00 .748
X2,X3
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.8094569 24.627 4 0.0001
0.9913438 1.0128 1 0.3142
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.80946 6.46606 4.00 232.00 .000
.99134 1.02162 1.00 117.00 .314
X2,X4
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.6966245 42.116 4 0.0000
0.9999445 0.64643E-02 1 0.9359
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.69662 11.49101 4.00 232.00 .000
.99994 .00649 1.00 117.00 .936
X3,X4
DDA LAMBDA CHI-SQUARED D.F. SIGNIFICANCE
0.6272538 54.336 4 0.0000
0.9973925 0.30417 1 0.5813
1-Way MANOVA Wilks L. F Hypoth. DF Error DF Sig. of F
.62725 15.23292 4.00 232.00 .000
.99739 .30588 1.00 117.00 .581
Note. In addition to illustrating that the stepwise selection of variables X1 and X2 as the first two variables is incorrect, since the lambda value for X3 and X4 is better (.62725 vs .65540), the tabled results also illustrate that DDA and a one-way MANOVA are the same analysis, even though the SPSS programmers made inconsistent choices of test statistics and the number of decimals to report across these two analyses.
Table 12
Heuristic Data Illustrating
the Context
Specificity of GLM Weights
ID/ Response Variablea
Stat. Grp X1 X2 X3 X4
1 1 4 3 7 19
2 1 4 4 4 17
3 1 3 5 3 17
4 1 2 6 4 19
5 1 2 7 7 17
6 1 4 8 12 12
7 1 3 5 7 12
8 2 5 1 6 12
9 2 5 2 3 10
10 2 4 3 2 10
11 2 3 4 3 12
12 2 3 5 6 10
13 2 5 6 11 5
14 2 4 3 6 5
15 3 6 2 5 7
16 3 6 3 2 5
17 3 5 4 1 5
18 3 4 5 2 7
19 3 4 6 5 5
20 3 6 7 10 0
21 3 5 4 5 0
M1 3.143 5.429 6.286 16.143
M2 4.143 3.429 5.286 9.143
M3 5.143 4.429 4.286 4.143
SD1 0.899 1.718 3.039 2.968b
SD2 0.899 1.718 3.039 2.968
SD3 0.899 1.718 3.039 2.968
Covariance matrix for group 1 (n=7)
X1 .8095
X2 -.5714 2.9524
X3 .9524 2.8571 9.2381
X4 -.6905 -2.4048 -5.8810 8.8095b
Covariance matrix for group 2 (n=7)
X1 .8095
X2 -.5714 2.9524
X3 .9524 2.8571 9.2381
X4 -.6905 -2.4048 -5.8810 8.8095
Covariance matrix for group 3 (n=7)
X1 .8095
X2 -.5714 2.9524
X3 .9524 2.8571 9.2381
X4 -.6905 -2.4048 -5.8810 8.8095
Pooled within-groups covariance matrix (n=21)c
X1 .8095
X2 -.5714 2.9524
X3 .9524 2.8571 9.2381
X4 -.6905 -2.4048 -5.8810 8.8095
aera9801.wk1 3/20/98
aThe "response variables" in a discriminant analysis are the intervally-scaled variables. In a DDA the response variables are the intervally-scaled criterion variables being predicted by group membership data. In a PDA the response variables are the intervally-scaled predictor variables predicting group membership.
bThe variance on the diagonal of the variance/covariance matrix is the square of the SD of the variable (e.g., 2.9682 = 8.8095), and the SD of the variable is the square root of the variance of the variable (e.g., 8.8095.5 = 2.968).
cBecause here the group sizes are equal and the variance-covariance matrices computed seperately "within" each group are also exactly equal (staticians call this "homogeneity" of the covariance matrices--it sounds more sophisticated than simply [clearly] saying these matrices are equal), the weighted average of the covariance matrices (called the "pooled" covariance matrix) also equals each of the three separate group covariance matrices.
Table 13
Heuristic Results Illustrating
the Context
Specificity of GLM Weights
Weights in the Context of 3 Response Variables
Standardized canonical discriminant function coefficients
Func 1 Func 2
X1 1.50086 -.01817
X2 1.25012 1.16078
X3 -1.37261 -.44995
Structure matrix
Func 1 Func 2
X1 .56076 -.60392*
X2 -.05557 .92134*
X3 -.16600 .17877*
Weights in the Context of 4 Response Variables
Standardized canonical discriminant function coefficients
Func 1 Func 2
X1 -.47343 1.22249
X2 -.12685 1.77579
X3 1.09588 -1.04760
X4 1.16456 .56180
Structure matrix
Func 1 Func 2
X1 -.34600* .05602
X2 .09855 .48590*
X3 .10242* -.01658
X4 .63238* .09130
Table 14
Heuristic Data Set #1 Illustrating
ID/ Response Variablea
Stat. Grp X1 X2 X3
1 1 0 13 13
2 1 4 6 18
3 1 2 9 33
4 1 6 4 8
5 1 8 3 13
6 1 10 3 25
7 1 12 4 30
8 1 14 6 20
9 1 18 13 25
10 1 16 9 5
11 2 1 14 9
12 2 5 7 14
13 2 3 10 29
14 2 7 5 4
15 2 9 4 9
16 2 11 4 21
17 2 13 5 26
18 2 15 7 16
19 2 19 14 21
20 2 17 10 1
21 3 3 11 10
22 3 7 4 15
23 3 5 7 30
24 3 9 2 5
25 3 11 1 10
26 3 13 1 22
27 3 15 2 27
28 3 17 4 17
29 3 21 11 22
30 3 19 7 2
M1 9.000 7.000 19.000
M2 10.000 8.000 15.000
M3 12.000 5.000 16.000
SD1 5.745 3.633 8.832
SD2 5.745 3.633 8.832
SD3 5.745 3.633 8.832
aThe "response variables" in a discriminant analysis are the intervally-scaled variables. In a DDA the response variables are the intervally-scaled criterion variables being predicted by group membership data. In a PDA the response variables are the intervally-scaled predictor variables predicting group membership.
Table 15
Heuristic Results #1 Illustrating
the Importance of
Both Function and Structure Coefficients
STANDARDIZED CANONICAL DISCRIMINANT FUNCTION COEFFICIENTS
FUNC 1 FUNC 2
X1 -0.50132 -0.42337
X2 0.86161 -0.32427
X3 0.07938 0.84594
STRUCTURE MATRIX
FUNC 1 FUNC 2
X1 -0.50132* -0.42337
X2 0.86161* -0.32427
X3 0.07938 0.84594*
Table 16
Heuristic Data Set #2 Illustrating
the Importance of
Both Function and Structure Coefficients
ID/ Response Variable
ID Grp X1 X2 X3
1 1 29.504 42.923 29.576
2 1 35.377 40.427 37.666
3 1 38.646 30.333 29.319
4 1 32.166 29.527 29.132
5 1 42.123 37.132 37.234
6 1 53.744 28.508 35.073
7 1 32.359 49.590 44.558
8 1 36.474 44.465 29.162
9 1 48.948 38.320 41.963
10 1 50.738 39.708 41.576
11 1 50.535 39.256 49.973
12 1 54.179 41.307 45.286
13 1 33.117 40.453 33.975
14 1 31.286 43.078 31.644
15 1 49.303 45.602 54.567
16 1 39.003 53.124 48.315
17 1 40.929 45.935 37.544
18 1 47.503 42.345 39.500
19 1 60.888 25.192 41.614
20 1 49.430 47.577 48.575
21 1 34.541 45.680 33.675
22 1 57.003 44.198 50.022
23 1 49.220 30.174 41.746
24 1 26.350 61.440 47.090
25 1 23.518 59.255 39.290
26 1 34.368 46.340 32.676
27 1 54.078 39.067 49.668
28 1 60.099 39.284 47.905
29 1 44.404 50.167 49.101
30 1 48.057 53.530 53.324
31 1 38.759 49.947 35.416
32 1 47.351 42.744 33.542
33 1 44.271 51.256 41.813
34 1 54.653 56.111 57.113
35 1 50.281 29.187 40.420
36 1 42.553 53.080 46.333
37 1 50.407 51.156 46.945
38 1 43.467 44.039 39.270
39 1 35.070 58.895 54.395
40 1 50.100 42.617 44.246
41 2 47.031 39.315 42.967
42 2 34.440 39.074 28.846
43 2 58.123 28.294 48.125
44 2 38.938 51.302 44.884
45 2 51.361 50.766 51.048
46 2 62.011 37.532 53.917
47 2 51.508 38.750 45.354
48 2 44.008 59.607 52.539
49 2 42.859 54.788 50.448
50 2 45.016 49.405 39.081
51 2 42.240 58.822 55.714
52 2 38.303 63.250 50.925
53 2 46.990 55.456 59.208
54 2 59.590 49.553 54.338
55 2 59.283 44.703 57.759
56 2 61.041 36.092 49.032
57 2 50.987 47.080 50.649
58 2 40.836 47.054 40.398
59 2 48.792 53.000 54.324
60 2 69.031 36.061 54.631
61 2 46.701 50.368 48.506
62 2 47.542 38.375 34.157
63 2 45.837 67.118 55.424
64 2 56.153 47.024 56.497
65 2 37.357 69.471 56.009
66 2 55.308 48.748 51.137
67 2 52.835 54.825 56.223
68 2 53.123 47.373 49.864
69 2 47.460 57.273 51.987
70 2 54.667 55.468 54.064
71 2 51.908 63.035 63.285
72 2 46.339 54.534 48.788
73 2 53.074 49.063 51.585
74 2 43.738 49.927 45.095
75 2 55.794 54.218 59.348
76 2 55.442 60.159 59.143
77 2 51.622 57.054 55.261
78 2 48.822 47.487 49.094
79 2 62.728 45.031 56.524
80 2 68.232 44.920 56.643
81 3 40.712 56.076 48.200
82 3 63.342 41.788 52.130
83 3 45.558 62.146 60.511
84 3 36.825 50.628 38.987
85 3 51.950 37.143 44.118
86 3 66.036 55.171 55.063
87 3 54.325 54.081 57.507
88 3 50.350 63.482 61.656
89 3 60.064 57.666 62.915
90 3 46.860 53.651 56.202
91 3 49.541 60.881 48.760
92 3 48.559 67.626 59.086
93 3 53.106 59.407 60.727
94 3 45.581 47.690 42.188
95 3 55.511 65.603 61.156
96 3 50.678 50.874 48.134
97 3 56.552 57.215 60.662
98 3 51.219 52.479 54.457
99 3 65.509 45.815 61.583
100 3 47.918 64.920 56.355
101 3 61.419 52.850 62.042
102 3 44.501 58.930 53.558
103 3 74.096 47.502 57.292
104 3 46.979 59.091 47.819
105 3 51.521 51.319 47.794
106 3 49.607 61.195 57.117
107 3 61.220 59.648 64.766
108 3 61.227 51.405 51.763
109 3 50.665 67.000 73.107
110 3 55.987 62.945 65.147
111 3 66.710 40.179 51.533
112 3 60.045 54.641 64.535
113 3 63.249 47.013 57.722
114 3 56.559 69.593 64.217
115 3 73.365 47.814 53.417
116 3 57.605 63.250 66.086
117 3 51.241 69.792 52.887
118 3 65.732 60.535 67.970
119 3 58.946 69.314 64.397
120 3 74.402 49.980 74.818
Table 17
Heuristic Results #2 Illustrating
the Importance of
Both Function and Structure Coefficients
STANDARDIZED CANONICAL DISCRIMINANT FUNCTION COEFFICIENTS
FUNC 1 FUNC 2
X1 0.93660 1.07729
X2 0.95259 1.43338
X3 -0.05507 -1.70996
STRUCTURE MATRIX
FUNC 1 FUNC 2
X1 0.54141* -0.28008
X2 0.56453* 0.24316
X3 0.81431* -0.55744
Table 18
Heuristic Data Set #3 Illustrating
the Importance of
Both Function and Structure Coefficients
ID/ Response Variable
ID Grp X1 X2 X3
1 1 31.107 41.920 44.130
2 1 37.386 43.111 55.702
3 1 40.301 29.292 40.991
4 1 32.981 21.197 33.741
5 1 43.659 38.767 49.266
6 1 56.148 36.705 51.915
7 1 33.099 46.916 53.419
8 1 37.419 42.930 36.720
9 1 50.786 44.536 56.564
10 1 52.673 47.540 56.989
11 1 52.384 46.404 65.522
12 1 56.199 51.374 61.670
13 1 33.543 33.607 37.093
14 1 31.563 35.508 33.774
15 1 50.840 52.470 68.917
16 1 39.741 53.984 57.161
17 1 41.753 45.846 44.518
18 1 48.821 46.360 49.784
19 1 63.104 34.040 55.862
20 1 50.748 53.881 60.467
21 1 34.687 39.388 34.741
22 1 58.809 55.013 64.993
23 1 50.412 30.512 48.941
24 1 25.678 52.286 45.763
25 1 22.628 47.189 35.085
26 1 34.290 38.953 31.661
27 1 55.465 44.953 60.297
28 1 61.929 49.927 61.620
29 1 45.002 51.401 55.457
30 1 48.912 58.554 62.578
31 1 38.876 46.354 36.464
32 1 48.112 43.765 37.997
33 1 44.788 52.322 46.752
34 1 55.896 66.513 69.875
35 1 51.101 27.581 43.654
36 1 42.750 52.288 49.343
37 1 51.167 55.825 53.826
38 1 43.702 41.100 40.248
39 1 34.634 53.552 54.444
40 1 50.778 44.174 48.748
41 2 47.442 37.269 44.752
42 2 33.891 26.956 21.808
43 2 59.232 30.951 53.661
44 2 38.516 45.224 42.184
45 2 51.874 54.318 55.888
46 2 63.316 45.484 62.852
47 2 52.006 38.689 47.471
48 2 43.839 59.299 53.661
49 2 42.588 52.051 49.779
50 2 44.862 46.513 37.011
51 2 41.847 56.389 55.251
52 2 37.563 58.752 47.927
53 2 46.886 55.382 60.616
54 2 60.409 57.514 61.021
55 2 60.062 50.884 63.752
56 2 61.939 41.016 53.500
57 2 51.123 47.324 51.388
58 2 40.200 39.241 34.206
59 2 48.717 53.027 54.776
60 2 70.405 46.587 63.022
61 2 46.339 47.253 45.489
62 2 47.215 32.182 27.762
63 2 45.367 68.085 54.977
64 2 56.457 50.116 58.726
65 2 36.227 64.334 51.076
66 2 55.522 51.540 52.288
67 2 52.817 57.229 57.121
68 2 53.035 47.277 48.144
69 2 46.942 55.647 48.891
70 2 54.621 58.601 54.281
71 2 51.618 66.050 63.898
72 2 45.624 50.589 43.209
73 2 52.831 48.600 48.886
74 2 42.785 42.318 36.503
75 2 55.743 57.369 59.765
76 2 55.355 64.758 60.197
77 2 51.246 57.704 53.225
78 2 48.200 42.878 42.936
79 2 63.135 50.564 58.436
80 2 69.042 54.695 61.503
81 3 39.424 47.331 38.259
82 3 63.734 46.500 52.782
83 3 44.596 58.825 55.308
84 3 35.115 36.477 23.724
85 3 51.373 30.868 35.705
86 3 66.467 65.097 58.124
87 3 53.858 54.360 54.014
88 3 49.569 63.348 57.775
89 3 59.997 63.364 63.516
90 3 45.747 47.449 47.539
91 3 48.617 58.883 41.743
92 3 47.527 66.677 53.470
93 3 52.395 59.477 56.346
94 3 44.285 38.227 29.425
95 3 54.960 69.305 58.946
96 3 49.757 46.332 39.322
97 3 56.022 58.921 57.115
98 3 50.216 48.169 45.791
99 3 65.529 50.497 60.166
100 3 46.575 61.215 47.174
101 3 60.970 55.521 58.002
102 3 42.729 49.800 39.596
103 3 74.471 57.862 57.693
104 3 45.310 51.507 33.811
105 3 50.109 44.526 34.282
106 3 48.046 55.814 45.142
107 3 60.504 62.806 59.641
108 3 60.510 52.098 43.801
109 3 49.050 63.455 63.262
110 3 54.709 62.040 56.101
111 3 66.204 40.738 42.938
112 3 58.976 53.925 55.458
113 3 62.231 45.499 46.641
114 3 54.781 68.140 51.342
115 3 72.817 52.917 44.863
116 3 55.839 60.360 52.383
117 3 48.707 62.249 32.418
118 3 64.259 61.493 55.595
119 3 56.829 66.812 47.966
120 3 73.147 52.237 62.332
Table 19
Heuristic Results #3 Illustrating
the Importance of
Both Function and Structure Coefficients
STANDARDIZED CANONICAL DISCRIMINANT FUNCTION COEFFICIENTS
FUNC 1 FUNC 2
X1 1.22956 0.28470
X2 1.21174 -0.20978
X3 -1.58393 0.89694
STRUCTURE MATRIX
FUNC 1 FUNC 2
X1 0.39129 0.82637*
X2 0.38294 0.39748*
X3 -0.03464 0.94557*
Figure 1
PDA Territorial Maps for the Table 5 Heuristic
Data
Illustrating That More Predictors
May Actually Hurt
Classification Accuracy
3 Predictor/Response Variables
ALL-GROUPS STACKED HISTOGRAM
CANONICAL DISCRIMINANT FUNCTION 1
16 + +
I I
I I
F I I
R 12 + +
E I 2 I
Q I 2 I
U I 2 I
E 8 + 1 +
N I 1 2 2 I
C I 2 2 2 1 2 2 I
Y I 1 122 2 1 2 1 I
4 + 1 211222 122 1 +
I 1 1 2 211211 121 1 2 2 2 I
I 1 1 1 211211 111 1 2 2 22 2 2 2 2 I
I 11 1 1211 112111111211121212221 2222122 1 22 I
----+---------+---------+---------+---------+---------X
-2.0 -1.0 .0 1.0 2.0 OUT
CLASS 1111111111111111111111111222222222222222222222222222222
CENTROIDS 1 2
4 Predictor/Response Variables
ALL-GROUPS STACKED HISTOGRAM
CANONICAL DISCRIMINANT FUNCTION 1
16 + +
I I
I I
F I I
R 12 + +
E I I
Q I 2 I
U I 2 I
E 8 + 2 +
N I 2 22 2 I
C I 2 22 2 I
Y I 222 12 22 I
4 + 122 12 112 2 +
I 21222211 112 2 2 2 I
I 1 1112 2 11111111 11121 2 2 22 2 2 2 I
I 111 111111111111111111111212212 222222 1 1 2 2 I
----+---------+---------+---------+---------+---------X
-2.0 -1.0 .0 1.0 2.0 OUT
CLASS 1111111111111111111111111222222222222222222222222222222
CENTROIDS 1 2
APPENDIX A
SPSS/LISREL Program Illustraing That
SEM is the Most General
Case of the General Linear Model
Using the Holzinger and Swineford (1939)
Data
TITLE 'CANLISRL.SPS Holzinger & Swineford (1939) Data**'. COMMENT***********************************************************. COMMENT Holzinger, K.J., & Swineford, F. (1939). A study in. COMMENT factor analysis: The stability of a bi-factor solution. COMMENT (No. 48). Chicago, IL: University of Chicago. COMMENT (data on pp. 81-91). COMMENT***********************************************************. SET BLANKS=SYSMIS UNDEFINED=WARN. DATA LIST FILE=abc FIXED RECORDS=2 TABLE /1 id 1-3 sex 4-4 ageyr 6-7 agemo 8-9 t1 11-12 t2 14-15 t3 17-18 t4 20-21 t5 23-24 t6 26-27 t7 29-30 t8 32-33 t9 35-36 t10 38-40 t11 42-44 t12 46-48 t13 50-52 t14 54-56 t15 58-60 t16 62-64 t17 66-67 t18 69-70 t19 72-73 t20 74-76 t21 78-79 /2 t22 11-12 t23 14-15 t24 17-18 t25 20-21 t26 23-24 . EXECUTE. COMPUTE SCHOOL=1. IF (ID GT 200)SCHOOL=2. IF (ID GE 1 AND ID LE 85)GRADE=7. IF (ID GE 86 AND ID LE 168)GRADE=8. IF (ID GE 201 AND ID LE 281)GRADE=7. IF (ID GE 282 AND ID LE 351)GRADE=8. IF (ID GE 1 AND ID LE 44)TRACK=2. IF (ID GE 45 AND ID LE 85)TRACK=1. IF (ID GE 86 AND ID LE 129)TRACK=2. IF (ID GE 130)TRACK=1. PRINT FORMATS SCHOOL TO TRACK(F1.0). VALUE LABELS SCHOOL(1)PASTEUR (2) GRANT-WHITE/ TRACK (1)JUNE PROMOTIONS (2)FEB PROMOTIONS/. VARIABLE LABELS T1 VIS PERCEPT TEST FROM SPEARMAN VPT, PART III T2 CUBES, SIMPLIFICATION OF BRIGHAM'S SPATIAL RELATIONS TEST T3 PAPER FORM BOARD--SHAPES THAT CAN BE COMBINED TO FORM A TARGET T4 LOZENGES THORNDIKE--SHAPES FLIPPED OVER THEN IDENTIFY TARGET T5 GENERAL INFORMATION VERBAL TEST T6 PARAGRAPH COMPREHENSION TEST T7 SENTENCE COMPLETION TEST T8 WORD CLASSIFICATION--WHICH WORD NOT BELONG IN SET T9 WORD MEANING TEST T10 SPEEDED ADDITION TEST T11 SPEEDED CODE TEST--TRANSFORM SHAPES INTO ALPHA WITH CODE T12 SPEEDED COUNTING OF DOTS IN SHAPE T13 SPEEDED DISCRIM STRAIGHT AND CURVED CAPS T14 MEMORY OF TARGET WORDS T15 MEMORY OF TARGET NUMBERS T16 MEMORY OF TARGET SHAPES T17 MEMORY OF OBJECT-NUMBER ASSOCIATION TARGETS T18 MEMORY OF NUMBER-OBJECT ASSOCIATION TARGETS T19 MEMORY OF FIGURE-WORD ASSOCIATION TARGETS T20 DEDUCTIVE MATH ABILITY T21 MATH NUMBER PUZZLES T22 MATH WORD PROBLEM REASONING T23 COMPLETION OF A MATH NUMBER SERIES T24 WOODY-MCCALL MIXED MATH FUNDAMENTALS TEST T25 REVISION OF T3--PAPER FORM BOARD T26 FLAGS--POSSIBLE SUBSTITUTE FOR T4 LOZENGES. SUBTITLE 'CCA ##############'. correlations variables=t6 t7 t2 t4 t20 t21 t22/ statistics=all . manova t6 t7 with t2 t4 t20 t21 t22/ print=signif(multiv eigen dimenr)/ discrim=stan cor alpha(.999)/design . SUBTITLE 'Function I 2nd Variate n=301 v=7'. execute . PRELIS /VARIABLES t2 (CO) t4 (CO) t20 (CO) t21 (CO) t22 (CO) t6 (CO) t7 (CO) /TYPE=CORRELATION /MATRIX=OUT(CR1) LISREL /"1b First Function n=301 v=7" /DA NI=7 NO=301 MA=KM /MATRIX=IN(CR1) /MO BE=ZE PS=ZE TD=ZE LX=ID LY=FU,FI TE=SY,FR GA=FU,FI PH=SY,FR NX=2 NY=5 NK=2 NE=1 /VA 1.0 PH(1,1) PH(2,2) /VA 1.0 LY(1,1) /FR LY(2,1) LY(3,1) LY(4,1) LY(5,1) /FR GA(1,1) GA(1,2) /OU SS FS SL=1 TM=1200 ND=5 SUBTITLE 'Function I 1st Variate n=301 v=7'. execute . PRELIS /VARIABLES t6 (CO) t7 (CO) t2 (CO) t4 (CO) t20 (CO) t21 (CO) t22 (CO) /TYPE=CORRELATION /MATRIX=OUT(CR2) LISREL /"1a First Function n=301 v=7" /DA NI=7 NO=301 MA=KM /MATRIX=IN(CR2) /MO BE=ZE PS=ZE TD=ZE LX=ID LY=FU,FI TE=SY,FR GA=FU,FI PH=SY,FR NX=5 NY=2 NK=5 NE=1 /VA 1.0 PH(1,1) PH(2,2) PH(3,3) PH(4,4) PH(5,5) /VA 1.0 LY(1,1) /FR LY(2,1) /FR GA(1,1) GA(1,2) GA(1,3) GA(1,4) GA(1,5) /OU SS FS SL=1 TM=1200 ND=5 SUBTITLE 'Function II 2nd Variate n=301 v=7'. execute . LISREL /"2b Second Function n=301 v=7" /DA NI=7 NO=301 MA=KM /MATRIX=IN(CR1) /MO BE=ZE PS=ZE TD=ZE LX=ID LY=FU,FI TE=SY,FR GA=FU,FI PH=SY,FR NX=2 NY=5 NK=2 NE=2 /VA 1.0 PH(1,1) PH(2,2) /VA 1.0 LY(1,1) LY(1,2) /VA 0.76757 LY(2,1) /VA 2.34225 LY(3,1) /VA 2.13559 LY(4,1) /VA 3.17417 LY(5,1) /FR LY(2,2) LY(3,2) LY(4,2) LY(5,2) /VA 0.06992 GA(1,1) /VA 0.09682 GA(1,2) /FR GA(2,1) GA(2,2) /OU SS FS SL=1 TM=1200 ND=5 SUBTITLE 'Function II 1st Variate n=301 v=7'. execute . LISREL /"2a Second Function n=301 v=7" /DA NI=7 NO=301 MA=KM /MATRIX=IN(CR2) /MO BE=ZE PS=ZE TD=ZE LX=ID LY=FU,FI TE=SY,FR GA=FU,FI PH=SY,FR NX=5 NY=2 NK=5 NE=2 /VA 1.0 PH(1,1) PH(2,2) PH(3,3) PH(4,4) PH(5,5) /VA 1.0 LY(1,1) LY(1,2) /VA 1.05093 LY(2,1) /FR LY(2,2) /VA -.00729 GA(1,1) /VA -.09934 GA(1,2) /VA 0.16926 GA(1,3) /VA 0.13288 GA(1,4) /VA 0.36285 GA(1,5) /FR GA(2,1) GA(2,2) GA(2,3) GA(2,4) GA(2,5) /OU SS FS SL=1 TM=1200 ND=5
APPENDIX B
SPSS Program for the Table 5 Actual Data Illustrating
That
More Predictors May Actually Hurt Classification Accuracy
TITLE 'AERA9804.SPS Holzinger & Swineford (1939) Data' COMMENT***********************************************************. COMMENT Holzinger, K.J., & Swineford, F. (1939). A study in. COMMENT factor analysis: The stability of a bi-factor solution. COMMENT (No. 48). Chicago, IL: University of Chicago. COMMENT (data on pp. 81-91). COMMENT***********************************************************. DATA LIST FILE=BT RECORDS=2 /1 ID 1-3 SEX 4 AGEYR 6-7 AGEMO 8-9 T1 11-12 T2 14-15 T3 17-18 T4 20-21 T5 23-24 T6 26-27 T7 29-30 T8 32-33 T9 35-36 T10 38-40 T11 42-44 T12 46-48 T13 50-52 T14 54-56 T15 58-60 T16 62-64 T17 66-67 T18 69-70 T19 72-73 T20 74-76 T21 78-79 /2 T22 11-12 T23 14-15 T24 17-18 T25 20-21 T26 23-24 COMPUTE SCHOOL=1 IF (ID GT 200)SCHOOL=2 IF (ID GE 1 AND ID LE 85)GRADE=7 IF (ID GE 86 AND ID LE 168)GRADE=8 IF (ID GE 201 AND ID LE 281)GRADE=7 IF (ID GE 282 AND ID LE 351)GRADE=8 IF (ID GE 1 AND ID LE 44)TRACK=2 IF (ID GE 45 AND ID LE 85)TRACK=1 IF (ID GE 86 AND ID LE 129)TRACK=2 IF (ID GE 130)TRACK=1 PRINT FORMATS SCHOOL TO TRACK(F1.0) VALUE LABELS SCHOOL(1)PASTEUR (2) GRANT-WHITE/ TRACK (1)JUNE PROMOTIONS (2)FEB PROMOTIONS/ VARIABLE LABELS T1 VIS PERCEPT TEST FROM SPEARMAN VPT, PART III T2 CUBES, SIMPLIFICATION OF BRIGHAM'S SPATIAL RELATIONS TEST T3 PAPER FORM BOARD--SHAPES THAT CAN BE COMBINED TO FORM A TARGET T4 LOZENGES THORNDIKE--SHAPES FLIPPED OVER THEN IDENTIFY TARGET T5 GENERAL INFORMATION VERBAL TEST T6 PARAGRAPH COMPREHENSION TEST T7 SENTENCE COMPLETION TEST T8 WORD CLASSIFICATION--WHICH WORD NOT BELONG IN SET T9 WORD MEANING TEST T10 SPEEDED ADDITION TEST T11 SPEEDED CODE TEST--TRANSFORM SHAPES INTO ALPHA WITH CODE T12 SPEEDED COUNTING OF DOTS IN SHAPE T13 SPEEDED DISCRIM STRAIGHT AND CURVED CAPS T14 MEMORY OF TARGET WORDS T15 MEMORY OF TARGET NUMBERS T16 MEMORY OF TARGET SHAPES T17 MEMORY OF OBJECT-NUMBER ASSOCIATION TARGETS T18 MEMORY OF NUMBER-OBJECT ASSOCIATION TARGETS T19 MEMORY OF FIGURE-WORD ASSOCIATION TARGETS T20 DEDUCTIVE MATH ABILITY T21 MATH NUMBER PUZZLES T22 MATH WORD PROBLEM REASONING T23 COMPLETION OF A MATH NUMBER SERIES T24 WOODY-MCCALL MIXED MATH FUNDAMENTALS TEST T25 REVISION OF T3--PAPER FORM BOARD T26 FLAGS--POSSIBLE SUBSTITUTE FOR T4 LOZENGES subtitle '0 PDA with 3 Predictor Variables **n=301' discriminant groups=grade(7,8)/ variables=t13 t17 t22/analysis=t13 t17 t22/ method=direct/priors=equal/save=scores(discrim)/ classify=pooled/ statistics=mean stddev gcov tcov corr boxm coef table/ plot=all select if (discrim1 lt -1.5 or discrim1 gt 1.5 or (discrim1 gt -.3 and discrim1 lt .3)) sort cases by grade id list variables=id grade t13 t17 t22 t16/ cases=999/format=numbered subtitle '1 PDA with 3 Predictor Variables **n=107' discriminant groups=grade(7,8)/ variables=t13 t17 t22/analysis=t13 t17 t22/ method=direct/priors=equal/save=class(LDFCL3)/ classify=pooled/ statistics=mean stddev gcov tcov corr boxm coef table/ plot=all subtitle '2 PDA with 4 Predictor Variables **n=107' discriminant groups=grade(7,8)/ variables=t13 t17 t22 t16/analysis=t13 t17 t22 t16/ method=direct/priors=equal/save=class(LDFCL4)/ classify=pooled/ statistics=mean stddev gcov tcov corr boxm coef table/ plot=all subtitle '3 Compare the 4 Sets of Classification Results!!' compute lcf31=(T13 * 0.1091137) + (T17 * -0.06245298) + (T22 * 0.1659288) + -12.84927 compute lcf32=(T13 * 0.1117800) + (T17 * 0.06471948) + (T22 * 0.2171317) + -15.91867 compute lcf41=(T13 * -0.008489698) + (T17 * -0.5090838) + (T22 * -0.09004268) + (T16 * 1.974350) + -97.20442 compute lcf42=(T13 * -0.007301202) + (T17 * -0.3875236) + (T22 * -0.04205625) + (T16 * 1.999159) + -102.4071 compute LCFCL3=8 if (lcf31 gt lcf32)LCFCL3=7 compute LCFCL4=8 if (lcf41 gt lcf42)LCFCL4=7 print formats LCFCL3 LCFCL4 (F1) variable labels lcf31 'Linear Class Function (LCF) score #1 3 preds' lcf32 'Linear Class Function (LCF) score #2 3 preds' lcf41 'Linear Class Function (LCF) score #1 4 preds' lcf42 'Linear Class Function (LCF) score #2 4 preds' LCFCL3 'LCF classification 3 preds' LCFCL4 'LCF classification 4 preds' LDFCL3 'LDF classification 3 preds' LDFCL4 'LDF classification 4 preds' list variables=id grade LDFCL3 LDFCL4 LCFCL3 LCFCL4/ cases=9999/format=numbered crosstabs grade by LDFCL3 crosstabs grade by LDFCL4 crosstabs grade by LCFCL3 crosstabs grade by LCFCL4 subtitle '1 LDFCL3 and LCFCL3 <>' temporary select if (LDFCL3 ne LCFCL3) list variables=id grade LDFCL3 LDFCL4 LCFCL3 LCFCL4/cases=99 subtitle '2 LDFCL4 and LCFCL4 <>' temporary select if (LDFCL4 ne LCFCL4) list variables=id grade LDFCL3 LDFCL4 LCFCL3 LCFCL4/cases=99 subtitle '3 LDFCL4 and LDFCL3 <>' temporary select if (LDFCL4 ne LDFCL3) list variables=id grade LDFCL3 LDFCL4 LCFCL3 LCFCL4/cases=99 subtitle '4 LCFCL4 and LCFCL3 <>' temporary select if (LCFCL4 ne LCFCL3) list variables=id grade LDFCL3 LDFCL4 LCFCL3 LCFCL4/cases=99
APPENDIX C
SPSS Program for the Table 9 Heuristic Data Illustrating
that Stepwise Methods Do Not Identify the Best Variable Set
title 'AERA9802.SPS *************************************'. data list file=abc records=1 table/ ID grp x1 to x4 (2F4,4F9.3) list variables=all/cases=99 . subtitle '1 Stepwise DDA ###############################'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x1 to x4/ method=wilks/maxsteps=2/ statistics=mean stddev gcov cov boxm/ . subtitle '2a Enter X1,X2 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x1 x2/ method=direct/ . subtitle '2b X1,X2 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x1 x2 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design . subtitle '3a Enter X1,X3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x1 x3/ method=direct/ . subtitle '3b X1,X3 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x1 x3 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design . subtitle '4a Enter X1,X4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x1 x4/ method=direct/ . subtitle '4b X1,X4 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x1 x4 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design . subtitle '5a Enter X2,X3 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x2 x3/ method=direct/ . subtitle '5b X2,X3 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x2 x3 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design . subtitle '6a Enter X2,X4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x2 x4/ method=direct/ . subtitle '6b X2,X4 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x2 x4 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design . subtitle '7a Enter X3,X4 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'. discriminant groups=grp(1,3)/variables=x1 to x4/analysis=x3 x4/ method=direct/ . subtitle '7b X3,X4 Show 1-way MANOVA is DDA !!!!!!!!!!!!'. manova x3 x4 by grp(1,3)/print=signif(multiv eigen dimenr)/ discrim(stan corr alpha(.999))/design .
APPENDIX D
SPSS for Windows Program
for the Table 12 Heuristic
Data Illustrating
the Context Specificity of GLM
Weights
set printback=listing blanks=sysmis undefined=warn . COMMENT 'AERA9801.SPS' . title 'Illustrate **Context Specificity** of GLM Weights' . data list file='c:\123\temp.prn' fixed records=1 table /1 id 1-3 grp 8 x1 14-15 x2 21-22 x3 28-29 x4 35-36 . list variables=all/cases=9999/ . subtitle '1 Discrim ***Smaller Variable Set***' . discriminant groups=grp(1,3)/variables=x1 to x3/ analysis=x1 to x3/ method=direct/priors=equal/save scores(dscr)/ plot=cases/classify=pooled/ statistics=mean stddev gcov cov corr boxm coef table. variable label dscr1 'Discriminant score Func I 3 predictors' dscr2 'Discriminant score Func II 3 predictors' . execute . subtitle '2 Discrim ###Larger Variable Set###' . discriminant groups=grp(1,3)/variables=x1 to x4/ analysis=x1 to x4/ method=direct/priors=equal/save scores(dscore)/ plot=cases/classify=pooled/ statistics=mean stddev gcov cov corr boxm coef table. variable label dscore1 'Discriminant score Func I 4 predictors' dscore2 'Discriminant score Func II 4 predictors' . execute .
APPENDIX E
SPSS Program for the Heuristic Data (Tables 14, 16, and
18)
Illustrating the Importance of Both Function and Structure
Coefficients
title 'AERA9803.SPS *************************************'. data list file=abc records=1 table/ ID 2-3 grp 8 x1 14-15 X2 21-22 X3 28-29 list variables=all/cases=99 . subtitle '1 Uncorrelated Response Variables #############'. discriminant groups=grp(1,3)/variables=x1 to x3/analysis=x1 to x3/ method=direct/ statistics=mean stddev gcov cov boxm/ .