Learning Team 2
CHAPTER 6
In everyday language we say that something is valid if it is sound, meaningful, or well grounded on principles or evidence. For example, we speak of a valid theory, a valid argument, or a valid reason. In legal terminology, lawyers say that something is valid if it is “executed with the proper formalities” (Black, 1979), such as a valid contract and a valid will. In each of these instances, people make judgments based on evidence of the meaningfulness or the veracity of something. Similarly, in the language of psychological assessment, validity is a term used in conjunction with the meaningfulness of a test score—what the test score truly means.
The Concept of Validity
Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context. More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test scores.1 An inference is a logical result or deduction. Characterizations of the validity of tests and test scores are frequently phrased in terms such as “acceptable” or “weak.” These terms reflect a judgment about how adequately the test measures what it purports to measure.
Inherent in a judgment of an instrument’s validity is a judgment of how useful the instrument is for a particular purpose with a particular population of people. As a shorthand, assessors may refer to a particular test as a “valid test.” However, what is really meant is that the test has been shown to be valid for a particular use with a particular population of testtakers at a particular time. No test or measurement technique is “universally valid” for all time, for all uses, with all types of testtaker populations. Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called into question. Further, to the extent that the validity of a test may diminish as the culture or the times change, the validity of a test may have to be re-established with the same as well as other testtaker populations.
JUST THINK . . .
Why is the phrase valid test sometimes misleading?
Validation is the process of gathering and evaluating evidence about validity. Both the test developer and the test user may play a role in the validation of a test for a specific purpose. It is the test developer’s responsibility to supply validity evidence in the test manual. It may sometimes be appropriate for test users to conduct their own validation studies with their own groups of testtakers. Such local validation studies may yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual. Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. For example, a local validation study would be necessary if the test user sought to transform a nationally standardized test into Braille for administration to blind and visually impaired testtakers. Local validation studies would also be necessary if a test user sought to use a test with a population of testtakers that differed in some significant way from the population on which the test was standardized.
JUST THINK . . .
Local validation studies require professional time and know-how, and they may be costly. For these reasons, they might not be done even if they are desirable or necessary. What would you recommend to a test user who is in no position to conduct such a local validation study but who nonetheless is contemplating the use of a test that requires one?
One way measurement specialists have traditionally conceptualized validity is according to three categories:
1. Content validity. This is a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
2. Criterion-related validity. This is a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
3. Construct validity. This is a measure of validity that is arrived at by executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
In this classic conception of validity, referred to as the trinitarian view (Guion, 1980), it might be useful to visualize construct validity as being “umbrella validity” because every other variety of validity falls under it. Why construct validity is the overriding variety of validity will become clear as we discuss what makes a test valid and the methods and procedures used in validation. Indeed, there are many ways of approaching the process of test validation, and these different plans of attack are often referred to as strategies. We speak, for example, of content validation strategies, criterion-related validation strategies, and construct validation strategies.
Trinitarian approaches to validity assessment are not mutually exclusive. That is, each of the three conceptions of validity provides evidence that, with other evidence, contributes to a judgment concerning the validity of a test. Stated another way, all three types of validity evidence contribute to a unified picture of a test’s validity. A test user may not need to know about all three. Depending on the use to which a test is being put, one type of validity evidence may be more relevant than another.
The trinitarian model of validity is not without its critics (Landy, 1986). Messick (1995), for example, condemned this approach as fragmented and incomplete. He called for a unitary view of validity, one that takes into account everything from the implications of test scores in terms of societal values to the consequences of test use. However, even in the so-called unitary view, different elements of validity may come to the fore for scrutiny, and so an understanding of those elements in isolation is necessary.
In this chapter we discuss content validity, criterion-related validity, and construct validity: three now-classic approaches to judging whether a test measures what it purports to measure. Let’s note at the outset that, although the trinitarian model focuses on three types of validity, you are likely to come across other varieties of validity in your readings. For example, you are likely to come across the term ecological validity. You may recall from Chapter 1 that the term ecological momentary assessment (EMA) refers to the in-the-moment and in-the-place evaluation of targeted variables (such as behaviors, cognitions, and emotions) in a natural, naturalistic, or real-life context. In a somewhat similar vein, the term ecological validity refers to a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured (typically a behavior, cognition, or emotion) is actually emitted. In essence, the greater the ecological validity of a test or other measurement procedure, the greater the generalizability of the measurement results to particular real-life circumstances.
Part of the appeal of EMA is that it does not have the limitations of retrospective self-report. Studies of the ecological validity of many tests or other assessment procedures are conducted in a natural (or naturalistic) environment, which is identical or similar to the environment in which a targeted behavior or other variable might naturally occur (see, for example, Courvoisier et al., 2012; Lewinski et al., 2014; Lo et al., 2015). However, in some cases, owing to the nature of the particular variable under study, such research may be retrospective in nature (see, for example, the 2014 Weems et al. study of memory for traumatic events).
Other validity-related terms that you will come across in the psychology literature are predictive validity and concurrent validity. We discuss these terms later in this chapter in the context of criterion-related validity. Yet another term you may come across is face validity (see Figure 6–1). In fact, you will come across that term right now . . .
Figure 6–1 Face Validity and Comedian Rodney Dangerfield. Rodney Dangerfield (1921–2004) was famous for complaining, “I don’t get no respect.” Somewhat analogously, the concept of face validity has been described as the “Rodney Dangerfield of psychometric variables” because it has “received little attention—and even less respect—from researchers examining the construct validity of psychological tests and measures” (Bornstein et al., 1994, p. 363). By the way, the tombstone of this beloved stand-up comic and film actor reads: “Rodney Dangerfield . . . There goes the neighborhood.” © Arthur Schatz/The Life Images Collection/Getty Images
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures. Face validity is a judgment concerning how relevant the test items appear to be. Stated another way, if a test definitely appears to measure what it purports to measure “on the face of it,” then it could be said to be high in face validity. A paper-and-pencil personality test labeled The Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted or an extraverted way in particular situations, may be perceived by respondents as a highly face-valid test. On the other hand, a personality test in which respondents are asked to report what they see in inkblots may be perceived as a test with low face validity. Many respondents would be left wondering how what they said they saw in the inkblots really had anything at all to do with personality.
In contrast to judgments about the reliability of a test and judgments about the content, construct, or criterion-related validity of a test, judgments about face validity are frequently thought of from the perspective of the testtaker, not the test user. A test’s lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test—with a consequential decrease in the testtaker’s cooperation or motivation to do his or her best. In a corporate environment, lack of face validity may lead to unwillingness of administrators or managers to “buy in” to the use of a particular test (see this chapter’s Meet an Assessment Professional). In a similar vein, parents may object to having their children tested with instruments that lack ostensible validity. Such concern might stem from a belief that the use of such tests will result in invalid conclusions.
MEET AN ASSESSMENT PROFESSIONAL
Meet Dr. Adam Shoemaker
In the “real world,” tests require buy-in from test administrators and candidates. While the reliability and validity of the test are always of primary importance, the test process can be short-circuited by administrators who don’t know how to use the test or who don’t have a good understanding of test theory. So at least half the battle of implementing a new testing tool is to make sure administrators know how to use it, accept the way that it works, and feel comfortable that it is tapping the skills and abilities necessary for the candidate to do the job.
Here’s an example: Early in my company’s history of using online assessments, we piloted a test that had acceptable reliability and criterion validity. We saw some strongly significant correlations between scores on the test and objective performance numbers, suggesting that this test did a good job of distinguishing between high and low performers on the job. The test proved to be unbiased and showed no demonstrable adverse impact against minority groups. However, very few test administrators felt comfortable using the assessment because most people felt that the skills that it tapped were not closely related to the skills needed for the job. Legally, ethically, and statistically, we were on firm ground, but we could never fully achieve “buy-in” from the people who had to administer the test.
On the other hand, we also piloted a test that showed very little criterion validity at all. There were no significant correlations between scores on the test and performance outcomes; the test was unable to distinguish between a high and a low performer. Still . . . the test administrators loved this test because it “looked” so much like the job. That is, it had high face validity and tapped skills that seemed to be precisely the kinds of skills that were needed on the job. From a legal, ethical, and statistical perspective, we knew we could not use this test to select employees, but we continued to use it to provide a “realistic job preview” to candidates. That way, the test continued to work for us in really showing candidates that this was the kind of thing they would be doing all day at work. More than a few times, candidates voluntarily withdrew from the process because they had a better understanding of what the job involved long before they even sat down at a desk.
Adam Shoemaker, Ph.D., Human Resources Consultant for Talent Acquisition, Tampa, Florida © Adam Shoemaker
The moral of this story is that as scientists, we have to remember that reliability and validity are super important in the development and implementation of a test . . . but as human beings, we have to remember that the test we end up using must also be easy to use and appear face valid for both the candidate and the administrator.
Read more of what Dr. Shoemaker had to say—his complete essay—through the Instructor Resources within Connect.
Used with permission of Adam Shoemaker.
JUST THINK . . .
What is the value of face validity from the perspective of the test user?
In reality, a test that lacks face validity may still be relevant and useful. However, if the test is not perceived as relevant and useful by testtakers, parents, legislators, and others, then negative consequences may result. These consequences may range from poor testtaker attitude to lawsuits filed by disgruntled parties against a test user and test publisher. Ultimately, face validity may be more a matter of public relations than psychometric soundness. Still, it is important nonetheless, and (much like Rodney Dangerfield) deserving of respect.
Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. For example, the universe of behavior referred to as assertive is very wide-ranging. A content-valid, paper-and-pencil test of assertiveness would be one that is adequately representative of this wide range. We might expect that such a test would contain items sampling from hypothetical situations at home (such as whether the respondent has difficulty in making her or his views known to fellow family members), on the job (such as whether the respondent has difficulty in asking subordinates to do what is required of them), and in social situations (such as whether the respondent would send back a steak not done to order in a fancy restaurant). Ideally, test developers have a clear (as opposed to “fuzzy”) vision of the construct being measured, and the clarity of this vision can be reflected in the content validity of the test (Haynes et al., 1995). In the interest of ensuring content validity, test developers strive to include key components of the construct targeted for measurement, and exclude content irrelevant to the construct targeted for measurement.
With respect to educational achievement tests, it is customary to consider a test a content-valid measure when the proportion of material covered by the test approximates the proportion of material covered in the course. A cumulative final exam in introductory statistics would be considered content-valid if the proportion and type of introductory statistics problems on the test approximates the proportion and type of introductory statistics problems presented in the course.
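The proportion-matching idea in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a standard procedure: the topic labels, the item counts, and the function names `topic_proportions` and `largest_coverage_gap` are all hypothetical, and a real content validation would also rest on expert judgment about the type of problem, not just counts.

```python
from collections import Counter


def topic_proportions(topics):
    """Return each topic's share of the list as a fraction of the total."""
    counts = Counter(topics)
    total = len(topics)
    return {topic: count / total for topic, count in counts.items()}


def largest_coverage_gap(course_topics, test_topics):
    """Return (topic, gap) for the topic whose share on the test
    differs most from its share in the course."""
    course = topic_proportions(course_topics)
    test = topic_proportions(test_topics)
    all_topics = set(course) | set(test)
    gaps = {t: abs(course.get(t, 0.0) - test.get(t, 0.0)) for t in all_topics}
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Hypothetical data: one label per problem presented in the course
# and one label per item on the cumulative final exam.
course_problems = ["descriptive"] * 40 + ["probability"] * 30 + ["inference"] * 30
exam_items = ["descriptive"] * 10 + ["probability"] * 5 + ["inference"] * 5

topic, gap = largest_coverage_gap(course_problems, exam_items)
print(topic, round(gap, 2))
```

A small gap for every topic would be consistent with the test approximating the course's coverage; how small is small enough remains a judgment call for the test developer.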
The early stages of a test being developed for use in the classroom—be it one classroom or those throughout the state or the nation—typically entail research exploring the universe of possible instructional objectives for the course. Included among the many possible sources of information on such objectives are course syllabi, course textbooks, teachers of the course, specialists who develop curricula, and professors and supervisors who train teachers in the particular subject area. From the pooled information (along with the judgment of the test developer), there emerges a test blueprint for the “structure” of the evaluation—that is, a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so forth (see Figure 6–2). In many instances the test blueprint represents the culmination of efforts to adequately sample the universe of content areas that conceivably could be sampled in such a test.2
Figure 6–2 Building a Test from a Test Blueprint. An architect’s blueprint usually takes the form of a technical drawing or diagram of a structure, sometimes written in white lines on a blue background. The blueprint may be thought of as a plan of a structure, typically detailed enough so that the structure could actually be constructed from it. Somewhat comparable to the architect’s blueprint is the test blueprint of a test developer. Seldom, if ever, on a blue background and written in white, it is nonetheless a detailed plan of the content, organization, and quantity of the items that a test will contain—sometimes complete with “weightings” of the content to be covered (He, 2011; Spray & Huang, 2000; Sykes & Hou, 2003). A test administered on a regular basis may require “item-pool management” to manage the creation of new items and the output of old items in a manner that is consistent with the test’s blueprint (Ariel et al., 2006; van der Linden et al., 2000). © John Rowley/Getty Images RF
JUST THINK . . .
A test developer is working on a brief screening instrument designed to predict student success in a psychological testing and assessment course. You are the consultant called upon to blueprint the content areas covered. Your recommendations?
For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment. Behavioral observation is one technique frequently used in blueprinting the content areas to be covered in certain types of employment tests. The test developer will observe successful veterans on that job, note the behaviors necessary for success on the job, and design the test to include a representative sample of those behaviors. Those same workers (as well as their supervisors and others) may subsequently be called on to act as experts or judges in rating the degree to which the content of the test is a representative sample of the required job-related skills. At that point, the test developer will want to know about the extent to which the experts or judges agree. A description of one such method for quantifying the degree of agreement between such raters can be found “online only” through the Instructor Resources within Connect (refer to OOBAL-6-B2).
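To make the idea of quantifying rater agreement concrete, here is a minimal sketch of one widely cited index, Lawshe's content validity ratio (CVR). Whether this is the method described in the online supplement is an assumption; the text above does not say. The CVR is computed per item as (n_e - N/2) / (N/2), where N is the number of judges and n_e is the number who rate the item "essential." The function name, the item labels, and the ratings below are hypothetical.

```python
def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR for one item: (n_e - N/2) / (N/2).

    Ranges from -1 (no rater says "essential") through 0 (exactly half do)
    to +1 (every rater does).
    """
    if n_raters <= 0 or not 0 <= n_essential <= n_raters:
        raise ValueError("need 0 <= n_essential <= n_raters and n_raters > 0")
    half = n_raters / 2
    return (n_essential - half) / half


# Hypothetical ratings: for each item, how many of 10 judges
# rated the skill it taps as essential to the job.
essential_counts = {"item_1": 9, "item_2": 5, "item_3": 2}

for item, n_essential in essential_counts.items():
    print(item, content_validity_ratio(n_essential, 10))
```

Items with a CVR near +1 reflect strong agreement that the content is essential; items at or below 0 are candidates for revision or removal. Any cutoff applied in practice depends on the number of judges and is not specified here.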
Culture and the relativity of content validity
Tests are often thought of as either valid or not valid. A history test, for example, either does or does not accurately measure one’s knowledge of historical fact. However, it is also true that what constitutes historical fact depends to some extent on who is writing the history. Consider, for example, a momentous event in the history of the world, one that served as a catalyst for World War I. Archduke Franz Ferdinand was assassinated on June 28, 1914, by a Serb named Gavrilo Princip (Figure 6–3). Now think about how you would answer the following multiple-choice item on a history test:
Figure 6–3 Cultural Relativity, History, and Test Validity. Austro-Hungarian Archduke Franz Ferdinand and his wife, Sophia, are pictured (left) as they left Sarajevo’s City Hall on June 28, 1914. Moments later, Ferdinand would be assassinated by Gavrilo Princip, shown in custody at right. The killing served as a catalyst for World War I and is discussed and analyzed in history textbooks in every language around the world. Yet descriptions of the assassin Princip in those textbooks—and ability test items based on those descriptions—vary as a function of culture. © Ingram Publishing RF
Gavrilo Princip was
a. a poet
b. a hero
c. a terrorist
d. a nationalist
e. all of the above
For various textbooks in the Bosnian region of the world, choice “e”—that’s right, “all of the above”—is the “correct” answer. Hedges (1997) observed that textbooks in areas of Bosnia and Herzegovina that were controlled by different ethnic groups imparted widely varying characterizations of the assassin. In the Serb-controlled region of the country, history textbooks—and presumably the tests constructed to measure students’ learning—regarded Princip as a “hero and poet.” By contrast, Croatian students might read that Princip was an assassin trained to commit a terrorist act. Muslims in the region were taught that Princip was a nationalist whose deed sparked anti-Serbian rioting.
JUST THINK . . .
The passage of time sometimes serves to place historical figures in a different light. How might the textbook descriptions of Gavrilo Princip have changed in these regions?
A history test considered valid in one classroom, at one time, and in one place will not necessarily be considered so in another classroom, at another time, and in another place. Consider a test containing the true-false item, “Colonel Claus von Stauffenberg is a hero.” Such an item is useful in illustrating the cultural relativity affecting item scoring. In 1944, von Stauffenberg, a German officer, was an active participant in a bomb plot to assassinate Germany’s leader, Adolf Hitler. When the plot (popularized in the film, Operation Valkyrie) failed, von Stauffenberg was executed and promptly vilified in Germany as a despicable traitor. Today, the light of history shines favorably on von Stauffenberg, and he is perceived as a hero in Germany. A German postage stamp with his face on it was issued to honor von Stauffenberg’s 100th birthday.
Politics is another factor that may well play a part in perceptions and judgments concerning the validity of tests and test items. In many countries throughout the world, a response that is keyed incorrect to a particular test item can lead to consequences far more dire than a deduction in points towards the total test score. Sometimes, even constructing a test with a reference to a taboo topic can have dire consequences for the test developer. For example, one Palestinian professor who included items pertaining to governmental corruption on an examination was tortured by authorities as a result (“Brother Against Brother,” 1997). Such scenarios bring new meaning to the term politically correct as it applies to tests, test items, and testtaker responses.
JUST THINK . . .
Commercial test developers who publish widely used history tests must maintain the content validity of their tests. What challenges do they face in doing so?
Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest—the measure of interest being the criterion. Two types of validity evidence are subsumed under the heading criterion-related validity. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently). Predictive validity is an index of the degree to which a test score predicts some criterion measure. Before we discuss each of these types of validity evidence in detail, it seems appropriate to raise (and answer) an important question.
What Is a Criterion?
We were first introduced to the concept of a criterion in Chapter 4, where, in the context of defining criterion-referenced assessment, we defined a criterion broadly as a standard on which a judgment or decision may be based. Here, in the context of our discussion of criterion-related validity, we will define a criterion just a bit more narrowly as the standard against which a test or a test score is evaluated. So, for example, if a test purports to measure the trait of athleticism, we might expect to employ “membership in a health club” or any generally accepted measure of physical fitness as a criterion in evaluating whether the athleticism test truly measures athleticism. Operationally, a criterion can be most anything: pilot performance in flying a Boeing 767, grade on examination in Advanced Hairweaving, number of days spent in psychiatric hospitalization; the list is endless. There are no hard-and-fast rules for what constitutes a criterion. It can be a test score, a specific behavior or group of behaviors, an amount of time, a rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol intoxication, and so on. Whatever the criterion, ideally it is relevant, valid, and uncontaminated. Let’s explain.
Characteristics of a criterion
An adequate criterion is relevant. By this we mean that it is pertinent or applicable to the matter at hand. We would expect, for example, that a test purporting to advise testtakers whether they share the interests of successful actors would have been validated using the interests of successful actors as a criterion.
An adequate criterion measure must also be valid for the purpose for which it is being used. If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid. If the criterion used is a rating made by a judge or a panel, then evidence should exist that the rating is valid. Suppose, for example, that a test purporting to measure depression is said to have been validated using as a criterion the diagnoses made by a blue-ribbon panel of psychodiagnosticians. A test user might wish to probe further regarding variables such as the credentials of the “blue-ribbon panel” (that is, their educational background, training, and experience) and the actual procedures used to validate a diagnosis of depression. Answers to such questions would help address the issue of whether the criterion (in this case, the diagnoses made by panel members) was indeed valid.
Ideally, a criterion is also uncontaminated. Criterion contamination is the term applied to a criterion measure that has been based, at least in part, on predictor measures. As an example, consider a hypothetical “Inmate Violence Potential Test” (IVPT) designed to predict a prisoner’s potential for violence in the cell block. In part, this evaluation entails ratings from fellow inmates, guards, and other staff in order to come up with a number that represents each inmate’s violence potential. After all of the inmates in the study have been given scores on this test, the study authors then attempt to validate the test by asking guards to rate each inmate on their violence potential. Because the guards’ opinions were used to formulate the inmate’s test score in the first place (the predictor variable), the guards’ opinions cannot be used as a criterion against which to judge the soundness of the test. If the guards’ opinions were used both as a predictor and as a criterion, then we would say that criterion contamination had occurred.
Here is another example of criterion contamination. Suppose that a team of researchers from a company called Ventura International Psychiatric Research (VIPR) just completed a study of how accurately a test called the MMPI-2-RF predicted psychiatric diagnosis in the psychiatric population of the Minnesota state hospital system. As we will see in Chapter 12, the MMPI-2-RF is, in fact, a widely used test. In this study, the predictor is the MMPI-2-RF, and the criterion is the psychiatric diagnosis that exists in the patient’s record. Further, let’s suppose that while all the data are being analyzed at VIPR headquarters, someone informs these researchers that the diagnosis for every patient in the Minnesota state hospital system was determined, at least in part, by an MMPI-2-RF test score. Should they still proceed with their analysis? The answer is no. Because the predictor measure has contaminated the criterion measure, it would be of little value to find, in essence, that the predictor can indeed predict itself.
When criterion contamination does occur, the results of the validation study cannot be taken seriously. There are no methods or statistics to gauge the extent to which criterion contamination has taken place, and there are no methods or statistics to correct for such contamination.
Now, let’s take a closer look at concurrent validity and predictive validity.
If test scores are obtained at about the same time as the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity. Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual’s present standing on a criterion. If, for example, scores (or classifications) made on the basis of a psychodiagnostic test were to be validated against a criterion of already diagnosed psychiatric patients, then the process would be one of concurrent validation. In general, once the validity of the inference from the test scores is established, the test may provide a faster, less expensive way to offer a diagnosis or a classification decision. A test with satisfactorily demonstrated concurrent validity may therefore be appealing to prospective users because it holds out the potential of savings of money and professional time.
Sometimes the concurrent validity of a particular test (let’s call it Test A) is explored with respect to another test (we’ll call it Test B). In such studies, prior research has satisfactorily demonstrated the validity of Test B, so the question becomes: “How well does Test A compare with Test B?” Here, Test B is used as the validating criterion. In some studies, Test A is either a brand-new test or a test being used for some new purpose, perhaps with a new population.
Here is a real-life example of a concurrent validity study in which a group of researchers explored whether a test validated for use with adults could be used with adolescents. The Beck Depression Inventory (BDI; Beck et al., 1961, 1979; Beck & Steer, 1993) and its revision, the Beck Depression Inventory-II (BDI-II; Beck et al., 1996) are self-report measures used to identify symptoms of depression and quantify their severity. Although the BDI had been widely used with adults, questions were raised regarding its appropriateness for use with adolescents. Ambrosini et al. (1991) conducted a concurrent validity study to explore the utility of the BDI with adolescents. They also sought to determine if the test could successfully differentiate patients with depression from those without depression in a population of adolescent outpatients. Diagnoses generated from the concurrent administration of an instrument previously validated for use with adolescents were used as the criterion validators. The findings suggested that the BDI is valid for use with adolescents.
JUST THINK . . .
What else might these researchers have done to explore the utility of the BDI with adolescents?
We now turn our attention to another form of criterion validity, one in which the criterion measure is obtained not concurrently but at some future time.
Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place. The intervening event may take varied forms, such as training, experience, therapy, medication, or simply the passage of time. Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure. Measures of the relationship between college admissions tests and freshman grade point averages, for example, provide evidence of the predictive validity of the admissions tests.
In settings where tests might be employed, such as a personnel agency, a college admissions office, or a warden’s office, a test’s high predictive validity can be a useful aid to decision-makers who must select successful students, productive workers, or good parole risks. Whether a test result is valuable in decision making depends on how well the test results improve selection decisions over decisions made without knowledge of test results. In an industrial setting where volume turnout is important, if the use of a personnel selection test can enhance productivity to even a small degree, then that enhancement will pay off year after year and may translate into millions of dollars of increased revenue. And in a clinical context, no price could be placed on a test that could save more lives from suicide by providing predictive accuracy over and above existing tests with respect to such acts. Unfortunately, the difficulties inherent in developing such tests are numerous and multifaceted (Mulvey & Lidz, 1984; Murphy, 1984; Petrie & Chamberlain, 1985).
When evaluating the predictive validity of a test, researchers must take into consideration the base rate of the occurrence of the variable in question, both as that variable exists in the general population and as it exists in the sample being studied. Generally, a base rate is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion). In psychometric parlance, a hit rate may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute. For example, hit rate could refer to the proportion of people accurately predicted to be able to perform work at the graduate school level or to the proportion of neurological patients accurately identified as having a brain tumor.
In like fashion, a miss rate may be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute. Here, a miss amounts to an inaccurate prediction. The category of misses may be further subdivided. A false positive is a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not. A false negative is a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.
To evaluate the predictive validity of a test, a test targeting a particular attribute may be administered to a sample of research subjects in which approximately half of the subjects possess or exhibit the targeted attribute and the other half do not. Evaluating the predictive validity of a test is essentially a matter of evaluating the extent to which use of the test results in an acceptable hit rate.
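The bookkeeping behind these terms can be sketched in a few lines of Python. The data below are hypothetical, and the sketch takes one reading of the definitions above: any correct classification counts as a hit, and the two kinds of misses are tallied separately.

```python
# Hypothetical predictions and outcomes: True means "has the attribute".
predicted = [True, True, False, False, True, False, True, False]
actual = [True, False, False, True, True, False, True, False]


def tally(predicted, actual):
    """Return counts of hits, false positives, and false negatives."""
    pairs = list(zip(predicted, actual))
    hits = sum(1 for p, a in pairs if p == a)
    false_pos = sum(1 for p, a in pairs if p and not a)  # predicted yes, actually no
    false_neg = sum(1 for p, a in pairs if a and not p)  # predicted no, actually yes
    return hits, false_pos, false_neg


hits, false_pos, false_neg = tally(predicted, actual)
n = len(actual)
base_rate = sum(actual) / n            # proportion who actually have the attribute
hit_rate = hits / n                    # proportion classified correctly
miss_rate = (false_pos + false_neg) / n
print(f"base rate={base_rate:.2f} hit rate={hit_rate:.2f} miss rate={miss_rate:.2f}")
```

In this made-up sample half of the subjects have the attribute, which matches the roughly even split described above for a validation sample.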
Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.
The validity coefficient
The validity coefficient is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure. The correlation coefficient computed from a score (or classification) on a psychodiagnostic test and the criterion score (or classification) assigned by psychodiagnosticians is one example of a validity coefficient. Typically, the Pearson correlation coefficient is used to determine the validity between the two measures. However, depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients could be used. For example, in correlating self-rankings of performance on some job with rankings made by job supervisors, the formula for the Spearman rho rank-order correlation would be employed.
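As a rough illustration, both coefficients mentioned above can be computed directly from their definitions. The function names and the sample scores are hypothetical; Spearman rho is obtained here by applying the Pearson formula to ranks, with tied values sharing their average rank.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks with 1 = smallest; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r computed on ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: test scores and a criterion measure for five people.
test_scores = [10, 12, 15, 18, 20]
criterion = [2.0, 2.5, 2.4, 3.5, 3.8]
print(round(pearson_r(test_scores, criterion), 2))
print(round(spearman_rho(test_scores, criterion), 2))
```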
Like the reliability coefficient and other correlational measures, the validity coefficient is affected by restriction or inflation of range. And as in other correlational studies, a key issue is whether the range of scores employed is appropriate to the objective of the correlational analysis. In situations where, for example, attrition in the number of subjects has occurred over the course of the study, the validity coefficient may be adversely affected.
The problem of restricted range can also occur through a self-selection process in the sample employed for the validation study. Thus, for example, if the test purports to measure something as technical or as dangerous as oil-barge firefighting skills, it may well be that the only people who reply to an ad for the position of oil-barge firefighter are those who are actually highly qualified for the position. Accordingly, the range of the distribution of scores on this test of oil-barge firefighting skills would be restricted. For less technical or dangerous positions, a self-selection factor might be operative if the test developer selects a group of newly hired employees to test (with the expectation that criterion measures will be available for this group at some subsequent date). However, because the newly hired employees have probably already passed some formal or informal evaluation in the process of being hired, there is a good chance that ability to do the job will be higher among this group than among a random sample of ordinary job applicants. Consequently, scores on the criterion measure that is later administered will tend to be higher than scores on the criterion measure obtained from a random sample of ordinary job applicants. Stated another way, the scores will be restricted in range.
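The effect of restricted range on a validity coefficient can be seen in a small simulation. Everything here is an assumption made for illustration: the scores are randomly generated, the underlying relationship is set arbitrarily, and "selection" is modeled by keeping only the top half of scorers on the test.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
n = 2000

# Simulated applicants: criterion performance depends partly on the test score.
test = [rng.gauss(0, 1) for _ in range(n)]
criterion = [0.6 * t + 0.8 * rng.gauss(0, 1) for t in test]

# Validity coefficient in the full, unselected sample.
r_full = pearson_r(test, criterion)

# Keep only the top half of test scorers, as if only they had been hired.
cutoff = sorted(test)[n // 2]
kept = [(t, c) for t, c in zip(test, criterion) if t >= cutoff]
r_restricted = pearson_r([t for t, _ in kept], [c for _, c in kept])

print(f"full sample r = {r_full:.2f}")
print(f"restricted sample r = {r_restricted:.2f}")
```

The coefficient computed on the selected half is smaller than the one computed on everyone, even though the relationship between test and criterion is the same for every simulated person.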
Whereas it is the responsibility of the test developer to report validation data in the test manual, it is the responsibility of test users to read carefully the description of the validation study and then to evaluate the suitability of the test for their specific purposes. What were the characteristics of the sample used in the validation study? How matched are those characteristics to the people for whom an administration of the test is contemplated? For a specific test purpose, are some subtests of a test more appropriate than the entire test?
How high should a validity coefficient be for a user or a test developer to infer that the test is valid? There are no rules for determining the minimum acceptable size of a validity coefficient. In fact, Cronbach and Gleser (1965) cautioned against the establishment of such rules. They argued that validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used. Essentially, the validity coefficient should be high enough to result in the identification and differentiation of testtakers with respect to target attribute(s), such as employees who are likely to be more productive, police officers who are less likely to misuse their weapons, and students who are more likely to be successful in a particular course of study.
Test users involved in predicting some criterion from test scores are often interested in the utility of multiple predictors. The value of including more than one predictor depends on a couple of factors. First, of course, each measure used as a predictor should have criterion-related predictive validity. Second, additional predictors should possess incremental validity, defined here as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.
Incremental validity may be used when predicting something like academic success in college. Grade point average (GPA) at the end of the first year may be used as a measure of academic success. A study of potential predictors of GPA may reveal that time spent in the library and time spent studying are highly correlated with GPA. How much sleep a student’s roommate allows the student to have during exam periods correlates with GPA to a smaller extent. What is the most accurate but most efficient way to predict GPA? One approach, employing the principles of incremental validity, is to start with the best predictor: the predictor that is most highly correlated with GPA. This may be time spent studying. Then, using multiple regression techniques, one would examine the usefulness of the other predictors.
Even though time in the library is highly correlated with GPA, it may not possess incremental validity if it overlaps too much with the first predictor, time spent studying. Said another way, if time spent studying and time in the library are so highly correlated with each other that they reflect essentially the same thing, then only one of them needs to be included as a predictor. Including both predictors will provide little new information. By contrast, the variable of how much sleep a student’s roommate allows the student to have during exams may have good incremental validity. This is so because it reflects a different aspect of preparing for exams (resting) from the first predictor (studying). Incremental validity has been used to improve the prediction of job performance for Marine Corps mechanics (Carey, 1994) and the prediction of child abuse (Murphy-Berman, 1994). In both instances, predictor measures were included only if they demonstrated that they could explain something about the criterion measure that was not already known from the other predictors.
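The GPA example can be made concrete with the standard two-predictor formula for the squared multiple correlation. The correlations below are invented for illustration and do not come from any study; the function names are likewise hypothetical. Incremental validity is taken here as the gain in explained variance when a second predictor is added to the first.

```python
def r_squared_two(r_y1, r_y2, r_12):
    """R^2 for a criterion regressed on two predictors, given the three
    pairwise correlations (criterion-1, criterion-2, predictor 1-2)."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_validity(r_y1, r_y2, r_12):
    """Extra variance explained by predictor 2 beyond predictor 1 alone."""
    return r_squared_two(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical correlations with GPA (all values are assumptions).
r_gpa_study = 0.60     # time spent studying, the first predictor
r_gpa_library = 0.55   # time in the library
r_gpa_sleep = 0.30     # sleep allowed by roommate during exams

# Hypothetical correlations of each candidate with the first predictor.
r_study_library = 0.90  # heavy overlap with studying
r_study_sleep = 0.10    # nearly independent of studying

gain_library = incremental_validity(r_gpa_study, r_gpa_library, r_study_library)
gain_sleep = incremental_validity(r_gpa_study, r_gpa_sleep, r_study_sleep)
print(f"library adds {gain_library:.4f}, sleep adds {gain_sleep:.4f}")
```

With these assumed values, library time adds almost nothing despite its high correlation with GPA, whereas sleep adds noticeably more because it overlaps so little with studying.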
Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior. Intelligence is a construct that may be invoked to describe why a student performs well in school. Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor. Other examples of constructs are job satisfaction, personality, bigotry, clerical aptitude, depression, motivation, self-esteem, emotional adjustment, potential dangerousness, executive potential, creativity, and mechanical comprehension, to name but a few.
Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance. The researcher investigating a test’s construct validity must formulate hypotheses about the expected behavior of high scorers and low scorers on the test. These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure. If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory. If high scorers and low scorers on the test do not behave as predicted, the investigator will need to reexamine the nature of the construct itself or hypotheses made about it. One possible reason for obtaining results contrary to those predicted by the theory is that the test simply does not measure the construct. An alternative explanation could lie in the theory that generated hypotheses about the construct. The theory may need to be reexamined.
In some instances, the reason for obtaining contrary findings can be traced to the statistical procedures used or to the way the procedures were executed. One procedure may have been more appropriate than another, given the particular assumptions. Thus, although confirming evidence contributes to a judgment that a test is a valid measure of a construct, evidence to the contrary can also be useful. Contrary evidence can provide a stimulus for the discovery of new facets of the construct as well as alternative methods of measurement.
Traditionally, construct validity has been viewed as the unifying concept for all validity evidence (American Educational Research Association et al., 1999). As we noted at the outset, all types of validity evidence, including evidence from the content- and criterion-related varieties of validity, come under the umbrella of construct validity. Let’s look at the types of evidence that might be gathered.
Evidence of Construct Validity
A number of procedures may be used to provide different kinds of evidence that a test has construct validity. The various techniques of construct validation may provide evidence, for example, that
· the test is homogeneous, measuring a single construct;
· test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted;
· test scores obtained after some event or the mere passage of time (or, posttest scores) differ from pretest scores as theoretically predicted;
· test scores obtained by people from distinct groups vary as predicted by the theory;
· test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.
A brief discussion of each type of construct validity evidence and the procedures used to obtain it follows.
Evidence of homogeneity
When describing a test and its items, homogeneity refers to how uniform a test is in measuring a single concept. A test developer can increase test homogeneity in several ways. Consider, for example, a test of academic achievement that contains subtests in areas such as mathematics, spelling, and reading comprehension. The Pearson r could be used to correlate average subtest scores with the average total test score. Subtests that in the test developer’s judgment do not correlate very well with the test as a whole might have to be reconstructed (or eliminated) lest the test not measure the construct academic achievement. Correlations between subtest scores and total test score are generally reported in the test manual as evidence of homogeneity.
One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously (such as a true-false test) is by eliminating items that do not show significant correlation coefficients with total test scores. If all test items show significant, positive correlations with total test scores and if high scorers on the test tend to pass each item more than low scorers do, then each item is probably measuring the same construct as the total test. Each item is contributing to test homogeneity.
The homogeneity of a test in which items are scored on a multipoint scale can also be improved. For example, some attitude and opinion questionnaires require respondents to indicate level of agreement with specific statements by responding, for example, strongly agree, agree, disagree, or strongly disagree. Each response is assigned a numerical score, and items that do not show significant Spearman rank-order correlation coefficients are eliminated. If all test items show significant, positive correlations with total test scores, then each item is most likely measuring the same construct that the test as a whole is measuring (and is thereby contributing to the test’s homogeneity). Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items (Novick & Lewis, 1967).
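For readers who want to see how coefficient alpha is obtained, the following sketch applies the usual formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), to a small table of hypothetical responses. The function name and the data are assumptions made for illustration.

```python
from statistics import variance


def coefficient_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical responses: 5 respondents x 4 items on a 4-point agreement scale.
responses = [
    [4, 4, 3, 4],
    [3, 3, 3, 2],
    [2, 2, 1, 2],
    [4, 3, 4, 4],
    [1, 2, 1, 1],
]
print(round(coefficient_alpha(responses), 2))
```

Values closer to 1 indicate that the items vary together, which is what a homogeneous test is expected to show.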
As a case study illustrating how a test’s homogeneity can be improved, consider the Marital Satisfaction Scale (MSS; Roach et al., 1981). Designed to assess various aspects of married people’s attitudes toward their marital relationship, the MSS contains an approximately equal number of items expressing positive and negative sentiments with respect to marriage. For example, My life would seem empty without my marriage and My marriage has “smothered” my personality. In one stage of the development of this test, subjects indicated how much they agreed or disagreed with the various sentiments in each of 73 items by marking a 5-point scale that ranged from strongly agree to strongly disagree. Based on the correlations between item scores and total score, the test developers elected to retain 48 items with correlation coefficients greater than .50, thus creating a more homogeneous instrument.
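The retention rule described here (keep items whose correlation with the total score exceeds .50) can be sketched as follows. The responses are hypothetical and far smaller than the 73-item pool in the study, and the total score in this sketch includes the item being evaluated, which is an assumption rather than something the source specifies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def retain_items(responses, threshold=0.50):
    """Return indexes of items whose item-total correlation exceeds threshold."""
    totals = [sum(person) for person in responses]
    items = list(zip(*responses))
    return [i for i, item in enumerate(items)
            if pearson_r(list(item), totals) > threshold]


# Hypothetical 5-point ratings: 5 respondents x 3 items.
responses = [
    [1, 2, 3],
    [2, 2, 5],
    [3, 3, 1],
    [4, 5, 4],
    [5, 5, 2],
]
kept = retain_items(responses)
print("items retained:", kept)
```

In this toy data set the third item tracks the total score only weakly and is dropped, leaving a shorter but more homogeneous set of items.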
Item-analysis procedures have also been employed in the quest for test homogeneity. One item-analysis procedure focuses on the relationship between testtakers’ scores on individual items and their score on the entire test. Each item is analyzed with respect to how high scorers versus low scorers responded to it. If it is an academic test and if high scorers on the entire test for some reason tended to get that particular item wrong while low scorers on the test as a whole tended to get the item right, the item is obviously not a good one. The item should be eliminated in the interest of test homogeneity, among other considerations. If the test is one of marital satisfaction, and if individuals who score high on the test as a whole respond to a particular item in a way that would indicate that they are not satisfied whereas people who tend not to be satisfied respond to the item in a way that would indicate that they are satisfied, then again the item should probably be eliminated or at least reexamined for clarity.
JUST THINK . . .
Is it possible for a test to be too homogeneous in item content?
Although test homogeneity is desirable because it assures us that all the items on the test tend to be measuring the same thing, it is not the be-all and end-all of construct validity. Knowing that a test is homogeneous contributes no information about how the construct being measured relates to other constructs. It is therefore important to report evidence of a test’s homogeneity along with other evidence of construct validity.
Evidence of changes with age
Some constructs are expected to change over time. Reading rate, for example, tends to increase dramatically year by year from age 6 to the early teens. If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct. For example, if children in grades 6, 7, 8, and 9 took a test of eighth-grade vocabulary, then we would expect that the total number of items scored as correct from all the test protocols would increase as a function of the higher grade level of the testtakers.
Some constructs lend themselves more readily than others to predictions of change over time. Thus, although we may be able to predict that a gifted child’s scores on a test of reading skills will increase over the course of the testtaker’s years of elementary and secondary education, we may not be able to predict with such confidence how a newlywed couple will score through the years on a test of marital satisfaction. This fact does not relegate a construct such as marital satisfaction to a lower stature than reading ability. Rather, it simply means that measures of marital satisfaction may be less stable over time or more vulnerable to situational events (such as in-laws coming to visit and refusing to leave for three months) than is reading ability. Evidence of change over time, like evidence of test homogeneity, does not in itself provide information about how the construct relates to other constructs.
Evidence of pretest–posttest changes
Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of construct validity. Some of the more typical intervening experiences responsible for changes in test scores are formal education, a course of therapy or medication, and on-the-job experience. Of course, depending on the construct being measured, almost any intervening life experience could be predicted to yield changes in score from pretest to posttest. Reading an inspirational book, watching a TV talk show, undergoing surgery, serving a prison sentence, or the mere passage of time may each prove to be a potent intervening variable.
Returning to our example of the Marital Satisfaction Scale, one investigator cited in Roach et al. (1981) compared scores on that instrument before and after a sex therapy treatment program. Scores showed a significant change between pretest and posttest. A second posttest given eight weeks later showed that scores remained stable (suggesting the instrument was reliable), whereas the pretest–posttest measures were still significantly different. Such changes in scores in the predicted direction after the treatment program contribute to evidence of the construct validity for this test.
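A pretest–posttest comparison of this kind is commonly evaluated with a paired-samples t statistic. The source does not say which test the investigator used, so the sketch below is only one plausible way to do it, and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pretest, posttest):
    """Paired-samples t statistic and degrees of freedom."""
    diffs = [post - pre for pre, post in zip(pretest, posttest)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical satisfaction scores for six people before and after treatment.
pretest = [150, 142, 160, 155, 138, 147]
posttest = [162, 150, 171, 158, 149, 160]

t, df = paired_t(pretest, posttest)
print(f"t({df}) = {t:.2f}")
```

The resulting t would then be compared with a critical value for the stated degrees of freedom to judge whether the change is statistically significant.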
JUST THINK . . .
Might it have been advisable to have simultaneous testing of a matched group of couples who did not participate in sex therapy and simultaneous testing of a matched group of couples who did not consult divorce attorneys? In both instances, would there have been any reason to expect any significant changes in the test scores of these two control groups?
We would expect a decline in marital satisfaction scores if a pretest were administered to a sample of couples shortly after they took their nuptial vows and a posttest were administered shortly after members of the couples consulted their respective divorce attorneys sometime within the first five years of marriage. The experimental group in this study would consist of couples who consulted a divorce attorney within the first five years of marriage. The design of such pretest–posttest research ideally should include a control group to rule out alternative explanations of the findings.
Evidence from distinct groups
Also referred to as the method of contrasted groups, one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group. The rationale here is that if a test is a valid measure of a particular construct, then test scores from groups of people who would be presumed to differ with respect to that construct should have correspondingly different test scores. Consider in this context a test of depression wherein the higher the test score, the more depressed the testtaker is presumed to be. We would expect individuals psychiatrically hospitalized for depression to score higher on this measure than a random sample of Walmart shoppers.
Now, suppose it was your intention to provide construct validity evidence for the Marital Satisfaction Scale by showing differences in scores between distinct groups. How might you go about doing that?
Roach and colleagues (1981) proceeded by identifying two groups of married couples, one relatively satisfied in their marriage, the other not so satisfied. The groups were identified by ratings by peers and professional marriage counselors. A t test on the difference between the two groups’ mean scores on the test was significant (p < .01), which is evidence to support the notion that the Marital Satisfaction Scale is indeed a valid measure of the construct marital satisfaction.
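The contrasted-groups comparison can be sketched with an independent-samples t statistic. The scores below are hypothetical, not data from Roach et al. (1981), and the pooled-variance form of the test is assumed because the source does not say which variant was used.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scale scores for couples rated as satisfied vs. dissatisfied.
satisfied = [188, 195, 201, 179, 192, 185]
dissatisfied = [142, 155, 138, 160, 149, 151]

t, df = independent_t(satisfied, dissatisfied)
print(f"t({df}) = {t:.2f}")
```

A large t in the predicted direction is what the method of contrasted groups looks for: the group presumed to have more of the construct scores higher on the test.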
In a bygone era, the method many test developers used to create distinct groups was deception. For example, if it had been predicted that more of the construct would be exhibited on the test in question if the subject felt highly anxious, an experimental situation might be designed to make the subject feel highly anxious. Virtually any feeling state the theory called for could be induced by an experimental scenario that typically involved giving the research subject some misinformation. However, given the ethical constraints of contemporary psychologists and the reluctance of academic institutions and other sponsors of research to condone deception in human research, the method of obtaining distinct groups by creating them through the dissemination of deceptive information is frowned upon (if not prohibited) today.
Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same (or a similar) construct. Thus, if scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this would be an example of convergent evidence.3
Convergent evidence for validity may come not only from correlations with tests purporting to measure an identical construct but also from correlations with measures purporting to measure related constructs. Consider, for example, a new test designed to measure the construct test anxiety. Generally speaking, we might expect high positive correlations between this new test and older, more established measures of test anxiety. However, we might also expect more moderate correlations between this new test and measures of general anxiety.
Roach et al. (1981) provided convergent evidence of the construct validity of the Marital Satisfaction Scale by computing a validity coefficient between scores on it and scores on the Marital Adjustment Test (Locke & Wallace, 1959). The validity coefficient of .79 provided additional evidence of their instrument’s construct validity.
A validity coefficient showing little (that is, a statistically nonsignificant) relationship between scores on the test being construct-validated and scores on other tests or variables with which, in theory, they should not be correlated provides discriminant evidence of construct validity (also known as discriminant validity). In the course of developing the Marital Satisfaction Scale (MSS), its authors correlated scores on that instrument with scores on the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1964). Roach et al. (1981) hypothesized that high correlations between these two instruments would suggest that respondents were probably not answering items on the MSS entirely honestly but instead were responding in socially desirable ways. But the correlation between the MSS and the social desirability measure did not prove to be significant, so the test developers concluded that social desirability could be ruled out as a primary factor in explaining the meaning of MSS test scores.
In 1959 an experimental technique useful for examining both convergent and discriminant validity evidence was presented in Psychological Bulletin. This rather technical procedure was called the multitrait-multimethod matrix. A detailed description of it, along with an illustration, can be found in OOBAL-6-B1. Here, let’s simply point out that multitrait means “two or more traits” and multimethod means “two or more methods.” The multitrait-multimethod matrix (Campbell & Fiske, 1959) is the matrix or table that results from correlating variables (traits) within and between methods. Values for any number of traits (such as aggressiveness or extraversion) as obtained by various methods (such as behavioral observation or a personality test) are inserted into the table, and the resulting matrix of correlations provides insight with respect to both the convergent and the discriminant validity of the methods used.4
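A bare-bones version of such a matrix can be assembled by correlating every pair of trait-by-method measures and labeling each cell by what it is expected to show. The scores, the labels, and the helper names below are hypothetical; this is a simplified sketch, not the full procedure described by Campbell and Fiske (1959).

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cell_type(a, b):
    """Classify a pair of (trait, method) measures."""
    same_trait, same_method = a[0] == b[0], a[1] == b[1]
    if same_trait and not same_method:
        return "convergent (same trait, different method)"
    if same_method:
        return "discriminant (different traits, same method)"
    return "discriminant (different traits, different methods)"


# Hypothetical scores for six people, keyed by (trait, method).
measures = {
    ("aggressiveness", "observation"): [3, 7, 5, 9, 2, 6],
    ("aggressiveness", "personality test"): [4, 8, 5, 9, 3, 5],
    ("extraversion", "observation"): [6, 5, 8, 4, 7, 3],
    ("extraversion", "personality test"): [7, 4, 9, 5, 6, 2],
}

for a, b in combinations(measures, 2):
    r = pearson_r(measures[a], measures[b])
    print(f"{a} x {b}: r = {r:.2f}  [{cell_type(a, b)}]")
```

Convergent cells are expected to hold the higher correlations, and discriminant cells the lower ones.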
Both convergent and discriminant evidence of construct validity can be obtained by the use of factor analysis. Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ. In psychometric research, factor analysis is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. In such studies, the purpose of the factor analysis may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests. In general, factor analysis is conducted on either an exploratory or a confirmatory basis. Exploratory factor analysis typically entails “estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation” (Floyd & Widaman, 1995, p. 287). By contrast, in confirmatory factor analysis, researchers test the degree to which a hypothetical model (which includes factors) fits the actual data.
A term commonly employed in factor analysis is factor loading, which is “a sort of metaphor. Each test is thought of as a vehicle carrying a certain amount of one or more abilities” (Tyler, 1965, p. 44). Factor loading in a test conveys information about the extent to which the factor determines the test score or scores. A new test purporting to measure bulimia, for example, can be factor-analyzed with other known measures of bulimia, as well as with other kinds of measures (such as measures of intelligence, self-esteem, general anxiety, anorexia, or perfectionism). High factor loadings by the new test on a “bulimia factor” would provide convergent evidence of construct validity. Moderate to low factor loadings by the new test with respect to measures of other eating disorders such as anorexia would provide discriminant evidence of construct validity.
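To give a feel for what a loading is, the sketch below extracts the first principal component of a small, invented correlation matrix and scales it into loadings. This is a principal-component approximation found by power iteration, not a full exploratory or confirmatory factor analysis, and the three measures and their correlations are assumptions.

```python
from math import sqrt


def first_component_loadings(corr, iterations=200):
    """Loadings of each measure on the first principal component."""
    n = len(corr)
    v = [1.0] * n
    for _ in range(iterations):
        # Multiply by the correlation matrix, then rescale to unit length.
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Eigenvalue for the converged unit vector v (Rayleigh quotient).
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return [x * sqrt(eigenvalue) for x in v]


# Hypothetical correlations among three measures.
labels = ["new bulimia test", "established bulimia measure", "intelligence test"]
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

loadings = first_component_loadings(corr)
for label, loading in zip(labels, loadings):
    print(f"{label}: {loading:.2f}")
```

In this made-up example the two bulimia measures load heavily on the same component while the intelligence test does not, which is the pattern that would count as convergent and discriminant evidence.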
Factor analysis frequently involves technical procedures so complex that few contemporary researchers would attempt to conduct one without the aid of sophisticated software. But although the actual data analysis has become work for computers, humans still tend to be very much involved in the naming of factors once the computer has identified them. Thus, for example, suppose a factor analysis identified a common factor being measured by two hypothetical instruments, a “Bulimia Test” and an “Anorexia Test.” This common factor would have to be named. One factor analyst looking at the data and the items of each test might christen the common factor an eating disorder factor. Another factor analyst examining exactly the same materials might label the common factor a body weight preoccupation factor. A third analyst might name the factor a self-perception disorder factor. Which of these is correct?
From a statistical perspective, it is simply impossible to say what the common factor should be named. Naming factors that emerge from a factor analysis has more to do with knowledge, judgment, and verbal abstraction ability than with mathematical expertise. There are no hard-and-fast rules. Factor analysts exercise their own judgment about what factor name best communicates the meaning of the factor. Further, even the criteria used to identify a common factor, as well as related technical matters, can be a matter of debate, if not heated controversy.
Factor analysis is a subject rich in technical complexity. Its uses and applications can vary as a function of the research objectives as well as the nature of the tests and the constructs under study. Factor analysis is the subject of our Close-Up in Chapter 9. More immediately, our Close-Up here brings together much of the information imparted so far in this chapter to provide a “real life” example of the test validation process.
The Preliminary Validation of a Measure of Individual Differences in Constructive Versus Unconstructive Worry*
Establishing validity is an important step in the development of new psychological measures. The development of a questionnaire that measures individual differences in worry called the Constructive and Unconstructive Worry Questionnaire (CUWQ; McNeill & Dunlop, 2016) provides an illustration of some of the steps in the test validation process.
Prior to the development of this questionnaire, research on worry had shown that the act of worrying can lead to both positive outcomes (such as increased work performance; Perkins & Corr, 2005) and negative outcomes (such as insomnia; Carney & Waters, 2006). Importantly, findings suggested that the types of worrying thoughts that lead to positive outcomes (which are referred to by the test authors as constructive worry) may differ from the types of worrying thoughts that lead to negative outcomes (referred to as unconstructive worry). However, a review of existing measures of individual differences in worry suggested that none of the measures were made to distinguish people’s tendency to worry constructively from their tendency to worry unconstructively. Since the ability to determine whether individuals are predominantly worrying constructively or unconstructively holds diagnostic and therapeutic benefits, the test authors set out to fill this gap and develop a new questionnaire that would be able to capture both these dimensions of the worry construct.
During the first step of questionnaire development, the creation of an item pool, it was important to ensure the questionnaire would have good content validity. That is, the items would need to adequately sample the variety of characteristics of constructive and unconstructive worry. Based on the test authors’ definition of these two constructs, a literature review was conducted and a list of potential characteristics of constructive versus unconstructive worry was created. This list of characteristics was used to develop a pool of 40 items. These 40 items were cross checked by each author, as well as one independent expert, to ensure that each item was unique and concise. A review of the list as a whole was conducted to ensure that it covered the full range of characteristics identified by the literature review. This process resulted in the elimination of 11 of the initial items, leaving a pool of 29 items. Of the 29 items in total, 13 items were expected to measure the tendency to worry constructively, and the remaining 16 items were expected to measure the tendency to worry unconstructively.
Next, drawing from the theoretical background behind the test authors’ definition of constructive and unconstructive worry, a range of criteria that should be differentially related to one’s tendency to worry constructively versus unconstructively were selected. More specifically, it was hypothesized that the tendency to worry unconstructively would be positively related to trait-anxiety (State Trait Anxiety Inventory (STAI-T); Spielberger et al., 1970) and amount of worry one experiences (e.g., Worry Domains Questionnaire (WDQ); Stöber & Joormann, 2001). In addition, this tendency to worry unconstructively was hypothesized to be negatively related to one’s tendency to be punctual and one’s actual performance of risk-mitigating behaviors. The tendency to worry constructively, on the other hand, was hypothesized to be negatively related to trait-anxiety and amount of worry, and positively related to one’s tendency to be punctual and one’s performance of risk-mitigating behaviors. Identification of these criteria prior to data collection would pave the way for the test authors to conduct an evaluation of the questionnaire’s criterion-based construct-validity in the future.
Upon completion of item pool construction and criterion identification, two studies were conducted. In Study 1, data from 295 participants from the United States was collected on the 29 newly developed worry items, plus two criterion-based measures, namely trait-anxiety and punctuality. An exploratory factor analysis was conducted, and the majority of the 29 items grouped together into a two-factor solution (as expected). The items predicted to capture a tendency to worry constructively loaded strongly on one factor, and the items predicted to capture a tendency to worry unconstructively loaded strongly on the other factor. However, 11 out of the original 29 items either did not load strongly on either factor, or they cross-loaded onto the other factor to a moderate extent. To increase construct validity through increased homogeneity of the two scales, these 11 items were removed from the final version of the questionnaire. The 18 items that remained included eight that primarily loaded on the factor labeled as constructive worry and ten that primarily loaded on the factor labeled as unconstructive worry.
A confirmatory factor analysis on these 18 items showed a good model fit. However, this analysis does not prove that these two factors actually captured the tendencies to worry constructively and unconstructively. To test the construct validity of these factor scores, the relations of the unconstructive and constructive worry factors with both trait-anxiety (Spielberger et al., 1970) and the tendency to be punctual were examined. Results supported the hypotheses and supported an assumption of criterion-based construct validity. That is, as hypothesized, scores on the constructive worry factor were negatively associated with trait-anxiety and positively associated with the tendency to be punctual. Scores on the Unconstructive Worry factor were positively associated with trait-anxiety and negatively associated with the tendency to be punctual.
To further test the construct validity of this newly developed measure, a second study was conducted. In Study 2, 998 Australian residents of wildfire-prone areas responded to the 18 (final) worry items from Study 1, plus two additional items, respectively, capturing two additional criteria. These two additional criteria were (1) the amount of worry one tends to experience as captured by two existing worry questionnaires, namely the Worry Domains Questionnaire (Stöber & Joormann, 2001) and the Penn State Worry Questionnaire (Meyer et al., 1990), and (2) the performance of risk-mitigating behaviors that reduce the risk of harm or property damage resulting from a potential wildfire threat. A confirmatory factor analysis on this second data set supported the notion that constructive worry versus unconstructive worry items were indeed capturing separate constructs in a homogeneous manner. Furthermore, as hypothesized, the constructive worry factor was positively associated with the performance of wildfire risk-mitigating behaviors, and negatively associated with the amount of worry one experiences. The unconstructive worry factor, on the other hand, was negatively associated with the performance of wildfire risk-mitigating behaviors, and positively associated with the amount of worry one experiences. This provided further evidence of criterion-based construct validity.
There are several ways in which future studies could provide additional evidence of construct validity of the CUWQ. For one, both studies reported above looked at the two scales’ concurrent criterion-based validity, but not at their predictive criterion-based validity. Future studies could focus on filling this gap. For example, since both constructs are hypothesized to predict the experience of anxiety (which was confirmed by the scales’ relationships with trait-anxiety in Study 1), they should predict the likelihood of an individual being diagnosed with an anxiety disorder in the future, with unconstructive worry being a positive predictor and constructive worry being a negative predictor. Furthermore, future studies could provide additional evidence of construct validity by testing whether interventions, such as therapy aimed at reducing unconstructive worry, can lead to a reduction in scores on the unconstructive worry scale over time. Finally, it is important to note that all validity testing to date has been conducted in samples from the general population, so the test should be further tested in samples from a clinical population of pathological worriers before test validity in this population can be assumed. The same applies to the use of the questionnaire in samples from non-US/Australian populations.
*This Close-Up was guest-authored by Ilona M. McNeill of The University of Melbourne, and Patrick D. Dunlop of The University of Western Australia.
JUST THINK . . .
What might be an example of a valid test used in an unfair manner?
Validity, Bias, and Fairness
In the eyes of many laypeople, questions concerning the validity of a test are intimately tied to questions concerning the fair use of tests and the issues of bias and fairness. Let us hasten to point out that validity, fairness in test use, and test bias are three separate issues. It is possible, for example, for a valid test to be used fairly or unfairly.
For the general public, the term bias as applied to psychological and educational tests may conjure up many meanings having to do with prejudice and preferential treatment (Brown et al., 1999). For federal judges, the term bias as it relates to items on children’s intelligence tests is synonymous with “too difficult for one group as compared to another” (Sattler, 1991). For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement.
Psychometricians have developed the technical means to identify and remedy bias, at least in the mathematical sense. As a simple illustration, consider a test we will call the “flip-coin test” (FCT). The “equipment” needed to conduct this test is a two-sided coin. One side (“heads”) has the image of a profile and the other side (“tails”) does not. The FCT would be considered biased if the instrument (the coin) were weighted so that either heads or tails appears more frequently than by chance alone. If the test in question were an intelligence test, the test would be considered biased if it were constructed so that people who had brown eyes consistently and systematically obtained higher scores than people with green eyes—assuming, of course, that in reality people with brown eyes are not generally more intelligent than people with green eyes. Systematic is a key word in our definition of test bias. We have previously looked at sources of random or chance variation in test scores. Bias implies systematic variation.
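The flip-coin illustration can be made concrete with a small calculation. The following is a minimal sketch, not a procedure taken from the text: it assumes we record the number of heads in a fixed number of flips and ask how probable a result at least that extreme would be if the coin were fair. The function name and the flip counts are hypothetical.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value under a coin with P(heads) = p.

    Sums the probabilities of every outcome that is no more likely
    than the observed number of heads.
    """
    def pmf(k: int) -> float:
        return comb(flips, k) * p**k * (1 - p) ** (flips - k)

    observed = pmf(heads)
    # Small tolerance so that ties with the observed outcome are included.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))


# Hypothetical data: two coins, each flipped 100 times.
p_near_even = two_sided_binomial_p(52, 100)   # 52 heads: consistent with chance
p_lopsided = two_sided_binomial_p(70, 100)    # 70 heads: hard to attribute to chance
print(p_near_even, p_lopsided)
```

A small p-value here would point to systematic rather than random variation, which is the sense of bias used in the passage above.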
Another illustration: Let’s suppose we need to hire 50 secretaries and so we place an ad in the newspaper. In response to the ad, 200 people reply, including 100 people who happen to have brown eyes and 100 people who happen to have green eyes. Each of the 200 applicants is individually administered a hypothetical test we will call the “Test of Secretarial Skills” (TSS). Logic tells us that eye color is probably not a relevant variable with respect to performing the duties of a secretary. We would therefore have no reason to believe that green-eyed people are better secretaries than brown-eyed people or vice versa. We might reasonably expect that, after the tests have been scored and the selection process has been completed, an approximately equivalent number of brown-eyed and green-eyed people would have been hired (or, approximately 25 brown-eyed people and 25 green-eyed people). But what if it turned out that 48 green-eyed people were hired and only 2 brown-eyed people were hired? Is this evidence that the TSS is a biased test?
Although the answer to this question seems simple on the face of it (“Yes, the test is biased because they should have hired 25 and 25!”), a truly responsible answer to this question would entail statistically troubleshooting the test and the entire selection procedure (see Berk, 1982). One reason some tests have been found to be biased has more to do with the design of the research study than the design of the test. For example, if there are too few testtakers in one of the groups (such as the minority group, literally), this methodological problem will make it appear as if the test is biased when in fact it may not be. A test may justifiably be deemed biased if some portion of its variance stems from some factor(s) that are irrelevant to performance on the criterion measure; as a consequence, one group of testtakers will systematically perform differently from another. Prevention during test development is the best cure for test bias, though a procedure called estimated true score transformations represents one of many available post hoc remedies (Mueller, 1949; see also Reynolds & Brown, 1984).
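As one small piece of such statistical troubleshooting, we can ask how likely the 48-to-2 split in the TSS example would be if eye color played no role at all. The sketch below assumes, purely for illustration, that the 50 hires were drawn at random from the 200 applicants; the function name is hypothetical. A tiny probability shows only that the split is unlikely to be a chance result. It does not show whether the cause lies in the test, the raters, or some other part of the selection procedure.

```python
from math import comb


def hypergeom_tail(total: int, group: int, hired: int, observed: int) -> float:
    """P(at least `observed` of the `hired` belong to `group`)
    when the hires are a simple random draw from `total` applicants."""
    denom = comb(total, hired)
    upper = min(group, hired)
    favorable = sum(
        comb(group, k) * comb(total - group, hired - k)
        for k in range(observed, upper + 1)
    )
    return favorable / denom


# Numbers from the TSS example: 200 applicants, 100 green-eyed, 50 hired,
# 48 of the hires green-eyed.
p_chance = hypergeom_tail(total=200, group=100, hired=50, observed=48)
print(p_chance)
```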
A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale. Simply stated, a rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. Thus, for example, a leniency error (also known as a generosity error) is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading. From your own experience during course registration, you might be aware that a section of a particular course will quickly be filled if it is being taught by a professor with a reputation for leniency errors in end-of-term grading. As another possible example of a leniency or generosity error, consider comments in the “Twittersphere” after a high-profile performance of a popular performer. Intuitively, one would expect more favorable (and forgiving) ratings of the performance from die-hard fans of the performer, regardless of the actual quality of the performance as rated by more objective reviewers. The phenomenon of leniency and severity in ratings can be found in almost any setting in which ratings are rendered. In psychotherapy settings, for example, it is not unheard of for supervisors to be a bit too generous or too lenient in their ratings of their supervisees.
Reviewing the literature on psychotherapy supervision and supervision in other disciplines, Gonsalvez and Crowe (2014) concluded that raters’ judgments of psychotherapy supervisees’ competency are compromised by leniency errors. In an effort to remedy the state of affairs, they offered a series of concrete suggestions including a list of specific competencies to be evaluated, as well as when and how such evaluations for competency should be conducted.
JUST THINK . . .
What factor do you think might account for the phenomenon of raters whose ratings always seem to fall victim to the central tendency error?
At the other extreme is a severity error. Movie critics who pan just about everything they review may be guilty of severity errors. Of course, that is only true if they review a wide range of movies that might consensually be viewed as good and bad.
Another type of error might be termed a central tendency error. Here the rater, for whatever reason, exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme. Consequently, all of this rater’s ratings would tend to cluster in the middle of the rating continuum.
One way to overcome what might be termed restriction-of-range rating errors (central tendency, leniency, severity errors) is to use rankings, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale. By using rankings instead of ratings, the rater (now the “ranker”) is forced to select first, second, third choices, and so forth.
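The three restriction-of-range errors share a signature: a rater's scores show little spread and sit near one region of the scale. The following is a rough sketch of how that pattern might be flagged. The 1-to-7 scale, the spread cutoff, and the division of the scale into thirds are all illustrative assumptions, not established psychometric criteria, and a flagged pattern is only a prompt for closer review, since the ratees may genuinely be alike.

```python
from statistics import mean, pstdev


def describe_rater(ratings, low=1, high=7, spread_cutoff=1.0):
    """Label one rater's pattern on a low..high scale using illustrative cutoffs."""
    if pstdev(ratings) >= spread_cutoff:
        return "uses the range"
    third = (high - low) / 3
    avg = mean(ratings)
    if avg >= high - third:      # clustered in the top third of the scale
        return "possible leniency"
    if avg <= low + third:       # clustered in the bottom third of the scale
        return "possible severity"
    return "possible central tendency"


# Hypothetical ratings of six ratees by four different raters.
labels = [
    describe_rater([6, 7, 7, 6, 7, 6]),
    describe_rater([1, 2, 1, 2, 1, 1]),
    describe_rater([4, 4, 3, 4, 5, 4]),
    describe_rater([1, 7, 3, 6, 2, 5]),
]
print(labels)
```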
Halo effect describes the fact that, for some raters, some ratees can do no wrong. More specifically, a halo effect may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior. Just for the sake of example—and not for a moment because we believe it is even in the realm of possibility—let’s suppose Lady Gaga consented to write and deliver a speech on multivariate analysis. Her speech probably would earn much higher all-around ratings if given before the founding chapter of the Lady Gaga Fan Club than if delivered before and rated by the membership of, say, the Royal Statistical Society. This would be true even in the highly improbable case that the members of each group were equally savvy with respect to multivariate analysis. We would expect the halo effect to be operative at full power as Lady Gaga spoke before her diehard fans.
Criterion data may also be influenced by the rater’s knowledge of the ratee’s race or sex (Landy & Farr, 1980). Males have been shown to receive more favorable evaluations than females in traditionally masculine occupations. Except in highly integrated situations, ratees tend to receive higher ratings from raters of the same race (Landy & Farr, 1980). Returning to our hypothetical Test of Secretarial Skills (TSS) example, a particular rater may have had particularly great—or particularly distressing—prior experiences with green-eyed (or brown-eyed) people and so may be making extraordinarily high (or low) ratings on that irrational basis.
Training programs to familiarize raters with common rating errors and sources of rater bias have shown promise in reducing rating errors and increasing measures of reliability and validity. Lecture, role playing, discussion, watching oneself on videotape, and computer simulation of different situations are some of the many techniques that could be brought to bear in such training programs. We revisit the subject of rating and rating error in our discussion of personality assessment later. For now, let’s take up the issue of test fairness.
In contrast to questions of test bias, which may be thought of as technically complex statistical problems, issues of test fairness tend to be rooted more in thorny issues involving values (Halpern, 2000). Thus, although questions of test bias can sometimes be answered with mathematical precision and finality, questions of fairness can be grappled with endlessly by well-meaning people who hold opposing points of view. With that caveat in mind, and with exceptions most certainly in the offing, we will define fairness in a psychometric context as the extent to which a test is used in an impartial, just, and equitable way.
Some uses of tests are patently unfair in the judgment of any reasonable person. During the cold war, the government of what was then called the Soviet Union used psychiatric tests to suppress political dissidents. People were imprisoned or institutionalized for verbalizing opposition to the government. Apart from such blatantly unfair uses of tests, what constitutes a fair and an unfair use of tests is a matter left to various parties in the assessment enterprise. Ideally, the test developer strives for fairness in the test development process and in the test’s manual and usage guidelines. The test user strives for fairness in the way the test is actually used. Society strives for fairness in test use by means of legislation, judicial decisions, and administrative regulations.
Fairness as applied to tests is a difficult and complicated subject. However, it is possible to discuss some rather common misunderstandings regarding what are sometimes perceived as unfair or even biased tests. Some tests, for example, have been labeled “unfair” because they discriminate among groups of people. The reasoning here goes something like this: “Although individual differences exist, it is a truism that all people are created equal. Accordingly, any differences found among groups of people on any psychological trait must be an artifact of an unfair or biased test.” Because this belief is rooted in faith as opposed to scientific evidence (in fact, it flies in the face of scientific evidence), it is virtually impossible to refute. One either accepts it on faith or does not.
We would all like to believe that people are equal in every way and that all people are capable of rising to the same heights given equal opportunity. A more realistic view would appear to be that each person is capable of fulfilling a personal potential. Because people differ so obviously with respect to physical traits, one would be hard put to believe that psychological differences found to exist between individuals, and groups of individuals, are purely a function of inadequate tests. Again, although a test is not inherently unfair or biased simply because it is a tool by which group differences are found, the use of the test data, like the use of any data, can be unfair.
Another misunderstanding of what constitutes an unfair or biased test is that it is unfair to administer to a particular population a standardized test that did not include members of that population in the standardization sample. In fact, the test may well be biased, but that must be determined by statistical or other means. The sheer fact that no members of a particular group were included in the standardization sample does not in itself invalidate the test for use with that group.
A final source of misunderstanding is the complex problem of remedying situations where bias or unfair test usage has been found to occur. In the area of selection for jobs, positions in universities and professional schools, and the like, a number of different preventive measures and remedies have been attempted. As you read about the tools used in these attempts in this chapter’s Everyday Psychometrics, form your own opinions regarding what constitutes a fair use of employment and other tests in a selection process.
Adjustment of Test Scores by Group Membership: Fairness in Testing or Foul Play?
Any test, regardless of its psychometric soundness, may be knowingly or unwittingly used in a way that has an adverse impact on one or another group. If such adverse impact is found to exist and if social policy demands some remedy or an affirmative action program, then psychometricians have a number of techniques at their disposal to create change. Table 1 lists some of these techniques.
Psychometric Techniques for Preventing or Remedying Adverse Impact and/or Instituting an Affirmative Action Program
Some of these techniques may be preventive if employed in the test development process, and others may be employed with already established tests. Some of these techniques entail direct score manipulation; others, such as banding, do not. Preparation of this table benefited from Sackett and Wilk (1994), and their work should be consulted for more detailed consideration of the complex issues involved.
|Technique||Description|
|Addition of Points||A constant number of points is added to the test score of members of a particular group. The purpose of the point addition is to reduce or eliminate observed differences between groups.|
|Differential Scoring of Items||This technique incorporates group membership information, not in adjusting a raw score on a test but in deriving the score in the first place. The application of the technique may involve the scoring of some test items for members of one group but not scoring the same test items for members of another group. This technique is also known as empirical keying by group.|
|Elimination of Items Based on Differential Item Functioning||This procedure entails removing from a test any items found to inappropriately favor one group’s test performance over another’s. Ideally, the intent of the elimination of certain test items is not to make the test easier for any group but simply to make the test fairer. Sackett and Wilk (1994) put it this way: “Conceptually, rather than asking ‘Is this item harder for members of Group X than it is for Group Y?’ these approaches ask ‘Is this item harder for members of Group X with true score Z than it is for members of Group Y with true score Z?’”|
|Differential Cutoffs||Different cutoffs are set for members of different groups. For example, a passing score for members of one group is 65, whereas a passing score for members of another group is 70. As with the addition of points, the purpose of differential cutoffs is to reduce or eliminate observed differences between groups.|
|Separate Lists||Different lists of testtaker scores are established by group membership. For each list, test performance of testtakers is ranked in top-down fashion. Users of the test scores for selection purposes may alternate selections from the different lists. Depending on factors such as the allocation rules in effect and the equivalency of the standard deviation within the groups, the separate-lists technique may yield effects similar to those of other techniques, such as the addition of points and differential cutoffs. In practice, the separate list is popular in affirmative action programs where the intent is to overselect from previously excluded groups.|
|Within-Group Norming||Used as a remedy for adverse impact if members of different groups tend to perform differentially on a particular test, within-group norming entails the conversion of all raw scores into percentile scores or standard scores based on the test performance of one’s own group. In essence, an individual testtaker is being compared only with other members of his or her own group. When race is the primary criterion of group membership and separate norms are established by race, this technique is known as race-norming.|
|Banding||The effect of banding of test scores is to make equivalent all scores that fall within a particular range or band. For example, thousands of raw scores on a test may be transformed to a stanine having a value of 1 to 9. All scores that fall within each of the stanine boundaries will be treated by the test user as either equivalent or subject to some additional selection criteria. A sliding band (Cascio et al., 1991) is a modified banding procedure wherein a band is adjusted (“slid”) to permit the selection of more members of some group than would otherwise be selected.|
|Preference Policies||In the interest of affirmative action, reverse discrimination, or some other policy deemed to be in the interest of society at large, a test user might establish a policy of preference based on group membership. For example, if a municipal fire department sought to increase the representation of female personnel in its ranks, it might institute a test-related policy designed to do just that. A key provision in this policy might be that when a male and a female earn equal scores on the test used for hiring, the female will be hired.|
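Two of the techniques in the table, within-group norming and banding, are at bottom simple score transformations, and a short sketch may help show what each does to the numbers. The code below is illustrative only. The group labels and scores are made up, the standard-score version of within-group norming is used rather than percentiles, and the fixed band width of 5 points is an assumption standing in for stanines or any other banding scheme.

```python
from statistics import mean, pstdev


def within_group_z(scores_by_group):
    """Convert raw scores to standard scores using each group's own mean and SD."""
    normed = {}
    for group, scores in scores_by_group.items():
        m, sd = mean(scores), pstdev(scores)
        normed[group] = [(s - m) / sd for s in scores]
    return normed


def band(score, width=5):
    """Fixed-width banding: scores sharing a band index are treated as equivalent."""
    return score // width


# Hypothetical raw scores for two groups whose means differ by 10 points.
raw = {"Group X": [60, 70, 80], "Group Y": [50, 60, 70]}
normed = within_group_z(raw)
print(normed)                 # both groups receive the same standard scores

print(band(71), band(74))     # same band, so treated as equivalent
print(band(75))               # next band up
```

After within-group norming, a raw score of 70 in Group X and a raw score of 60 in Group Y both sit at their own group's mean, which is exactly the sense in which a testtaker is compared only with members of his or her own group.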
Although psychometricians have the tools to institute special policies through manipulations in test development, scoring, and interpretation, there are few clear guidelines in this controversial area (Brown, 1994; Gottfredson, 1994, 2000; Sackett & Wilk, 1994). The waters are further muddied by the fact that some of the guidelines seem to have contradictory implications. For example, although racial preferment in employee selection (disparate treatment) is unlawful, the use of valid and unbiased selection procedures virtually guarantees disparate impact. This state of affairs will change only when racial disparities in job-related skills and abilities are minimized (Gottfredson, 1994).
In 1991, Congress enacted legislation effectively barring employers from adjusting testtakers’ scores for the purpose of making hiring or promotion decisions. Section 106 of the Civil Rights Act of 1991 made it illegal for employers “in connection with the selection or referral of applicants or candidates for employment or promotion to adjust the scores of, use different cutoffs for, or otherwise alter the results of employment-related tests on the basis of race, color, religion, sex, or national origin.”
The law prompted concern on the part of many psychologists who believed it would adversely affect various societal groups and might reverse social gains. Brown (1994, p. 927) forecast that “the ramifications of the Act are more far-reaching than Congress envisioned when it considered the amendment and could mean that many personality tests and physical ability tests that rely on separate scoring for men and women are outlawed in employment selection.” Arguments in favor of group-related test-score adjustment have been made on philosophical as well as technical grounds. From a philosophical perspective, increased minority representation is socially valued to the point that minority preference in test scoring is warranted. In the same vein, minority preference is viewed both as a remedy for past societal wrongs and as a contemporary guarantee of proportional workplace representation. From a more technical perspective, it is argued that some tests require adjustment in scores because (1) the tests are biased, and a given score on them does not necessarily carry the same meaning for all testtakers; and/or (2) “a particular way of using a test is at odds with an espoused position as to what constitutes fair use” (Sackett & Wilk, 1994, p. 931).
In contrast to advocates of test-score adjustment are those who view such adjustments as part of a social agenda for preferential treatment of certain groups. These opponents of test-score adjustment reject the subordination of individual effort and ability to group membership as criteria in the assignment of test scores (Gottfredson, 1988, 2000). Hunter and Schmidt (1976, p. 1069) described the unfortunate consequences for all parties involved in a college selection situation wherein poor-risk applicants were accepted on the basis of score adjustments or quotas. With reference to the employment setting, Hunter and Schmidt (1976) described one case in which entrance standards were lowered so more members of a particular group could be hired. However, many of these new hires did not pass promotion tests—with the result that the company was sued for discriminatory promotion practice. Yet another consideration concerns the feelings of “minority applicants who are selected under a quota system but who also would have been selected under unqualified individualism and must therefore pay the price, in lowered prestige and self-esteem” (Jensen, 1980, p. 398).
A number of psychometric models of fairness in testing have been presented and debated in the scholarly literature (Hunter & Schmidt, 1976; Petersen & Novick, 1976; Schmidt & Hunter, 1974; Thorndike, 1971). Despite a wealth of research and debate, a long-standing question in the field of personnel psychology remains: “How can group differences on cognitive ability tests be reduced while retaining existing high levels of reliability and criterion-related validity?”
According to Gottfredson (1994), the answer probably will not come from measurement-related research because differences in scores on many of the tests in question arise principally from differences in job-related abilities. For Gottfredson (1994, p. 963), “the biggest contribution personnel psychologists can make in the long run may be to insist collectively and candidly that their measurement tools are neither the cause of nor the cure for racial differences in job skills and consequent inequalities in employment.”
Beyond the workplace and personnel psychology, what role, if any, should measurement play in promoting diversity? As Haidt et al. (2003) reflected, there are several varieties of diversity, some perceived as more valuable than others. Do we need to develop more specific measures designed, for example, to discourage “moral diversity” while encouraging “demographic diversity”? These types of questions have implications in a number of areas from academic admission policies to immigration.
JUST THINK . . .
How do you feel about the use of various procedures to adjust test scores on the basis of group membership? Are these types of issues best left to measurement experts?
If performance differences are found between identified groups of people on a valid and reliable test used for selection purposes, some hard questions may have to be dealt with if the test is to continue to be used. Is the problem due to some technical deficiency in the test, or is the test in reality too good at identifying people of different levels of ability? Regardless, is the test being used fairly? If so, what might society do to remedy the skill disparity between different groups as reflected on the test?
Our discussion of issues of test fairness and test bias may seem to have brought us far afield of the seemingly cut-and-dried, relatively nonemotional subject of test validity. However, the complex issues accompanying discussions of test validity, including issues of fairness and bias, must be wrestled with by us all. For further consideration of the philosophical issues involved, we refer you to the solitude of your own thoughts and the reading of your own conscience.
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:
· hit rate
· slope bias
Dr. Zak Case Study, PSYCH/655 Version 4
University of Phoenix Material
Dr. Zak Case Study
Read the following case study. Use the information in the case study to answer the accompanying follow-up questions. Although questions 1 & 2 have short answers, you should prepare a 150- to 200-word response for each of the remaining questions.
Dr. Zak developed a test to measure depression. He sampled 100 university students to take his five-item test. The group of students consisted of 30 men and 70 women. In this group, four persons were African American, six persons were Hispanic, and one person was Asian. Zak’s Miraculous Test of Depression is printed below:
1. I feel depressed: Yes No
2. I have been sad for the last two weeks: Yes No
3. I have seen changes in my eating and sleeping: Yes No
4. I don’t feel that life is going to get better: Yes No
5. I feel happy most of the day: Yes No
Yes = 1; No = 0
The mean on this test is 3.5 with a standard deviation of .5.
1. Sally scores 1.5 on this test. How many standard deviations is Sally from the mean? (Show your calculations)
Given a mean of 3.5, a standard deviation of .5, and Sally’s score of 1.5, first subtract the mean from her score: 1.5 – 3.5 = –2. Then divide by the standard deviation: –2 / .5 = –4. Sally’s score is therefore 4 standard deviations below the mean (z = –4).
(1.5 – 3.5) / (.5) = – 4
2. Billy scores 5. What is his standard score?
To determine the standard score, we calculate the z score: the raw score (5) minus the mean (3.5), divided by the standard deviation (.5). First, 5 – 3.5 = 1.5; then 1.5 / .5 = 3. Billy therefore has a standard score of 3.
(5 – 3.5) / (.5) = 3
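The two calculations above can be checked with a short script. This is a minimal sketch; the function name `z_score` and the variable names are illustrative choices, not part of the case study.

```python
def z_score(raw, mean, sd):
    """Return how many standard deviations `raw` lies from `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd


# Values taken from the case study: mean = 3.5, SD = .5
MEAN, SD = 3.5, 0.5

sally = z_score(1.5, MEAN, SD)  # negative: below the mean
billy = z_score(5, MEAN, SD)    # positive: above the mean

print(f"Sally: z = {sally}")
print(f"Billy: z = {billy}")
```

A negative z score indicates a raw score below the mean, and a positive z score indicates a raw score above it.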
3. What scale of measurement is Dr. Zak using? Do you think Dr. Zak’s choice of scaling is appropriate? Why or why not? What are your suggestions?
4. Do you think Dr. Zak has a good sample on which to norm his test? Why or why not? What are your suggestions?
5. What other items do you think need to be included in Dr. Zak’s domain sampling?
Depression is a serious illness, and if it is not assessed thoroughly it can worsen and lead to severe consequences. This test would benefit from more items, because additional items can provide further insight into how serious a respondent’s depression is and help gauge its level. Additional items could include: “It is difficult for me to concentrate,” “I have trouble making decisions,” “My sleep patterns are poor and I have difficulty relaxing,” “I often feel nervous or anxious,” “Thoughts of suicide have come into my mind,” “I have lost interest in sex,” “Depression is present in my family history,” and “I often feel sad or worthless.”
One issue with this test is that people experience different forms of depression, and some cases are less severe than others. Many factors contribute to the development of depression, including changes in hormones, genetic predispositions, relationships, and internal and external stressors. The more items the test includes, the more information can be gathered to help produce useful results.
6. Suggest changes to this test to make it better. Justify your reason for each suggestion supporting each reason with psychometric principles from the text book or other materials used in your course.
7. Dr. Zak also gave his students the Beck Depression Inventory (BDI). The correlation between his test and the BDI was r =.14. Evaluate this correlation. What does this correlation tell you about the relationship between these two instruments?
Cohen, R. J., Swerdlik, M. E, & Sturman, E. D. (2013). Psychological testing and assessment: An introduction to tests and measurement (8th ed.). New York, NY: McGraw-Hill.
Copyright © 2017, 2015, 2013 by University of Phoenix. All rights reserved.
Modules Chapter 5 wk2 p655
C H A P T E R 5
In everyday conversation, reliability is a synonym for dependability or consistency. We speak of the train that is so reliable you can set your watch by it. If we’re lucky, we have a reliable friend who is always there for us in a time of need.
Broadly speaking, in the language of psychometrics reliability refers to consistency in measurement. And whereas in everyday conversation reliability always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent.
It is important for us, as users of tests and consumers of information about tests, to know how reliable tests and other measurement procedures are. But reliability is not an all-or-none matter. A test may be reliable in one context and unreliable in another. There are different types and degrees of reliability. A reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance. In this chapter, we explore different kinds of reliability coefficients, including those for measuring test-retest reliability, alternate-forms reliability, split-half reliability, and inter-scorer reliability.
The Concept of Reliability
Recall from our discussion of classical test theory that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error. In its broadest sense, error refers to the component of the observed test score that does not have to do with the testtaker’s ability. If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:
$$X = T + E$$
A statistic useful in describing sources of test score variability is the variance ($\sigma^2$), the standard deviation squared. This statistic is useful because it can be broken into components. Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. If $\sigma^2$ represents the total variance, $\sigma^2_T$ the true variance, and $\sigma^2_E$ the error variance, then the relationship of the variances can be expressed as
$$\sigma^2 = \sigma^2_T + \sigma^2_E$$
In this equation, the total variance in an observed distribution of test scores ($\sigma^2$) equals the sum of the true variance ($\sigma^2_T$) plus the error variance ($\sigma^2_E$). The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests. Because error variance may increase or decrease a test score by varying amounts, consistency of the test score, and thus the reliability, can be affected.
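The variance decomposition can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the distributions, their parameters, and every variable name are hypothetical. In real testing, true scores are never observed directly, so reliability has to be estimated rather than computed this way.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N = 10_000
# Hypothetical true scores (T) with variance near 100 and random error (E)
# with variance near 25; both distributions are assumptions for illustration.
true_scores = [random.gauss(50, 10) for _ in range(N)]
errors = [random.gauss(0, 5) for _ in range(N)]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_total = statistics.pvariance(observed)

# Reliability: proportion of total variance attributable to true variance.
reliability = var_true / var_total

print(f"true variance:  {var_true:.1f}")
print(f"error variance: {var_error:.1f}")
print(f"total variance: {var_total:.1f}")
print(f"reliability:    {reliability:.2f}")
```

With these assumed parameters the ratio should come out near 100 / (100 + 25) = .80. The total variance equals the sum of the two components only approximately in any finite sample, because simulated true scores and errors are never perfectly uncorrelated.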
In general, the term measurement error refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured. To illustrate, consider an English-language test on the subject of 12th-grade algebra being administered, in English, to a sample of 12th-grade students, newly arrived to the United States from China. The students in the sample are all known to be “whiz kids” in algebra. Yet for some reason, all of the students receive failing grades on the test. Do these failures indicate that these students really are not “whiz kids” at all? Possibly. But a researcher looking for answers regarding this outcome would do well to evaluate the English-language skills of the students. Perhaps this group of students did not do well on the algebra test because they could neither read nor understand what was required of them. In such an instance, the fact that the test was written and administered in English could have contributed in large part to the measurement error in this evaluation. Stated another way, although the test was designed to evaluate one variable (knowledge of algebra), scores on it may have been more reflective of another variable (knowledge of and proficiency in English language). This source of measurement error (the fact that the test was written and administered in English) could have been eliminated by translating the test and administering it in the language of the testtakers.
Measurement error, much like error in general, can be categorized as being either systematic or random. Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. Examples of random error that could conceivably affect test scores range from unanticipated events happening in the immediate vicinity of the test environment (such as a lightning strike or a spontaneous “occupy the university” rally), to unanticipated physical events happening within the testtaker (such as a sudden and unexpected surge in the testtaker’s blood sugar or blood pressure).
JUST THINK . . .
What might be a source of random error inherent in all the tests an assessor administers in his or her private office?
In contrast to random error, systematic error refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. For example, a 12-inch ruler may be found to be, in actuality, a tenth of one inch longer than 12 inches. All of the 12-inch measurements previously taken with that ruler were systematically off by one-tenth of an inch; that is, anything measured to be exactly 12 inches with that ruler was, in reality, 12 and one-tenth inches. In this example, it is the measuring instrument itself that has been found to be a source of systematic error. Once a systematic error becomes known, it becomes predictable as well as fixable. Note also that a systematic source of error does not affect score consistency. So, for example, suppose a measuring instrument such as the official weight scale used on The Biggest Loser television program consistently underweighed by 5 pounds everyone who stepped on it. Regardless of this (systematic) error, the relative standings of all of the contestants weighed on that scale would remain unchanged. A scale underweighing all contestants by 5 pounds simply amounts to a constant being subtracted from every “score.” Although weighing contestants on such a scale would not yield a true (or valid) weight, such a systematic error source would not change the variability of the distribution or affect the measured reliability of the instrument. In the end, the individual crowned “the biggest loser” would indeed be the contestant who lost the most weight; it’s just that he or she would actually weigh 5 pounds more than the weight measured by the show’s official scale. Now moving from the realm of reality television back to the realm of psychological testing and assessment, let’s take a closer look at the source of some error variance commonly encountered during testing and assessment.
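The point about a constant error can be verified numerically. In the sketch below the contestant weights are made-up values and the names are hypothetical; only the 5-pound underweighing comes from the example above.

```python
import math
import statistics

# Hypothetical true weights in pounds (illustrative values only).
true_weights = [182.0, 240.5, 310.0, 275.25, 198.0]

# A scale that underweighs everyone by 5 pounds subtracts a constant.
measured = [w - 5 for w in true_weights]


def rank_order(values):
    """Return contestant indices ordered from lightest to heaviest."""
    return sorted(range(len(values)), key=lambda i: values[i])


same_variance = math.isclose(
    statistics.pvariance(true_weights), statistics.pvariance(measured)
)
same_ranking = rank_order(true_weights) == rank_order(measured)
mean_shift = statistics.mean(true_weights) - statistics.mean(measured)

print(f"variance unchanged: {same_variance}")
print(f"ranking unchanged:  {same_ranking}")
print(f"mean shifted by:    {mean_shift} pounds")
```

The mean shifts by the constant, but the variance and the relative standings do not, which is why such an error leaves measured consistency untouched even though every individual weight is wrong.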
JUST THINK . . .
What might be a source of systematic error inherent in all the tests an assessor administers in his or her private office?
Sources of Error Variance
Sources of error variance include test construction, administration, scoring, and/or interpretation.
One source of variance during test construction is item sampling or content sampling , terms that refer to variation among items within a test as well as to variation among items between tests. Consider two or more tests designed to measure a specific skill, personality attribute, or body of knowledge. Differences are sure to be found in the way the items are worded and in the exact content sampled. Each of us has probably walked into an achievement test setting thinking “I hope they ask this question” or “I hope they don’t ask that question.” If the only questions on the examination were the ones we hoped would be asked, we might achieve a higher score on that test than on another test purporting to measure the same thing. The higher score would be due to the specific content sampled, the way the items were worded, and so on. The extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance. From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.
Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance. Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee’s face. A wad of gum on the seat of the chair may make itself known only after the testtaker sits down on it. Other environment-related variables include the instrument used to enter responses and even the writing surface on which responses are entered. A pencil with a dull or broken point can make it difficult to blacken the little grids. The writing surface on a school desk may be riddled with heart carvings, the legacy of past years’ students who felt compelled to express their eternal devotion to someone now long forgotten. External to the test environment in a global sense, the events of the day may also serve as a source of error. So, for example, test results may vary depending upon whether the testtaker’s country is at war or at peace (Gil et al., 2016). A variable of interest when evaluating a patient’s general level of suspiciousness or fear is the patient’s home neighborhood and lifestyle. Especially in patients who live in and must cope daily with an unsafe neighborhood, what is actually adaptive fear and suspiciousness can be misinterpreted by an interviewer as psychotic paranoia (Wilson et al., 2016).
Other potential sources of error variance during test administration are testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance. It is even conceivable that significant changes in the testtaker’s body weight could be a source of error variance. Weight gain and obesity are associated with a rise in fasting glucose level—which in turn is associated with cognitive impairment. In one study that measured performance on a cognitive task, subjects with high fasting glucose levels made nearly twice as many errors as subjects whose fasting glucose level was in the normal range (Hawkins et al., 2016).
Examiner-related variables are potential sources of error variance. The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. They might convey information about the correctness of a response through head nodding, eye movements, or other nonverbal gestures. In the course of an interview to evaluate a patient’s suicidal risk, highly religious clinicians may be more inclined than their moderately religious counterparts to conclude that such risk exists (Berman et al., 2015). Clearly, the level of professionalism exhibited by examiners is a source of error variance.
Test scoring and interpretation
In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.
Manuals for individual intelligence tests tend to be very explicit about scoring criteria, lest examinees’ measured intelligence vary as a function of who is doing the testing and scoring. In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses. In one test of creativity, examinees might be given the task of creating as many things as they can out of a set of blocks. Here, it is the examiner’s task to determine which block constructions will be awarded credit and which will not. For a behavioral measure of social skills in an inpatient psychiatric service, the scorers or raters might be asked to rate patients with respect to the variable “social relatedness.” Such a behavioral measure might require the rater to check yes or no to items like Patient says “Good morning” to at least two staff members.
JUST THINK . . .
Can you conceive of a test item on a rating scale requiring human judgment that all raters will score the same 100% of the time?
Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally still are confronted by situations where an examinee’s response lies in a gray area. The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations). Subjectivity in scoring can even enter into behavioral assessment. Consider the case of two behavior observers given the task of rating one psychiatric inpatient on the variable of “social relatedness.” On an item that asks simply whether two staff members were greeted in the morning, one rater might judge the patient’s eye contact and mumbling of something to two staff members to qualify as a yes response. The other observer might feel strongly that a no response to the item is appropriate. Such problems in scoring agreement can be addressed through rigorous training designed to make the consistency, or reliability, of various scorers as nearly perfect as can be.
Other sources of error
Surveys and polls are two tools of assessment commonly used by researchers who study public opinion. In the political arena, for example, researchers trying to predict who will win an election may sample opinions from representative voters and then draw conclusions based on their data. However, in the “fine print” of those conclusions is usually a disclaimer that the conclusions may be off by plus or minus a certain percent. This fine print is a reference to the margin of error the researchers estimate to exist in their study. The error in such research may be a result of sampling error—the extent to which the population of voters in the study actually was representative of voters in the election. The researchers may not have gotten it right with respect to demographics, political party affiliation, or other factors related to the population of voters. Alternatively, the researchers may have gotten such factors right but simply did not include enough people in their sample to draw the conclusions that they did. This brings us to another type of error, called methodological error. So, for example, the interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.
Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship. As Moffitt et al. (1997) observed, “Because partner abuse usually occurs in private, there are only two persons who ‘really’ know what goes on behind closed doors: the two members of the couple” (p. 47). Potential sources of nonsystematic error in such an assessment situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting. A number of studies (O’Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested that underreporting or overreporting of perpetration of abuse also may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors and overreport abuse if they are seeking help. Males may underreport abuse because of embarrassment and social desirability factors and overreport abuse if they are attempting to justify the report.
Just as the amount of abuse one partner suffers at the hands of the other may never be known, so the amount of test variance that is true relative to error may never be known. A so-called true score, as Stanley (1971, p. 361) put it, is “not the ultimate fact in the book of the recording angel.” Further, the utility of the methods used for estimating true versus error variance is a hotly debated matter (see Collins, 1996; Humphreys, 1996; Williams & Zimmerman, 1996a, 1996b). Let’s take a closer look at such estimates and how they are derived.
Test-Retest Reliability Estimates
A ruler made from the highest-quality steel can be a very reliable instrument of measurement. Every time you measure something that is exactly 12 inches long, for example, your ruler will tell you that what you are measuring is exactly 12 inches long. The reliability of this instrument of measurement may also be said to be stable over time. Whether you measure the 12 inches today, tomorrow, or next year, the ruler is still going to measure 12 inches as 12 inches. By contrast, a ruler constructed of putty might be a very unreliable instrument of measurement. One minute it could measure some known 12-inch standard as 12 inches, the next minute it could measure it as 14 inches, and a week later it could measure it as 18 inches. One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.
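As a rough illustration of correlating pairs of scores from two administrations, the sketch below computes a Pearson correlation by hand. The scores are invented for the example, and the names `pearson_r`, `time1`, and `time2` are hypothetical; the text does not prescribe a particular computing procedure.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores for the same six testtakers on two administrations.
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]

r = pearson_r(time1, time2)
print(f"test-retest reliability estimate: r = {r:.2f}")
```

Each position in the two lists must refer to the same person; the estimate describes how well testtakers keep their relative standing from one administration to the next.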
As time passes, people change. For example, people may learn new things, forget some things, and acquire new skills. It is generally the case (although there are exceptions) that, as the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases. The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower. When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability .
An estimate of test-retest reliability from a math test might be low if the testtakers took a math tutorial before the second test was administered. An estimate of test-retest reliability from a personality profile might be low if the testtaker suffered some emotional trauma or received counseling during the intervening period. A low estimate of test-retest reliability might be found even when the interval between testings is relatively brief. This may well be the case when the testings occur during a time of great developmental change with respect to the variables they are designed to assess. An evaluation of a test-retest reliability coefficient must therefore extend beyond the magnitude of the obtained coefficient. If we are to come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-retest reliability estimate must extend to a consideration of possible intervening factors between test administrations.
An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste). However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.
Taking a broader perspective, psychological science, and science in general, demands that the measurements obtained by one experimenter be replicable by other experimenters using the same instruments of measurement and following the same procedures. However, as observed in this chapter’s Close-Up, a replicability problem of epic proportions appears to be brewing.
Psychology’s Replicability Crisis*
In the mid-2000s, academic scientists became concerned that science was not being performed rigorously enough to prevent spurious results from reaching consensus within the scientific community. In other words, they worried that scientific findings, although peer-reviewed and published, were not replicable by independent parties. Since that time, hundreds of researchers have endeavored to determine if there is really a problem, and if there is, how to curb it. In 2015, a group of researchers called the Open Science Collaboration attempted to redo 100 psychology studies that had already been peer-reviewed and published in leading journals (Open Science Collaboration, 2015). Their results, published in the journal Science, indicated that, depending on the criteria used, only 40–60% of replications found the same results as the original studies. This low replication rate helped confirm that science indeed had a problem with replicability, the seriousness of which is reflected in the term replicability crisis.
Why and how did this crisis of replicability emerge? Here it will be argued that the major causal factors are (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies. Let’s consider each of these factors.
Lack of Published Replication Attempts
Journals have long preferred to publish novel results instead of replications of previous work. In fact, a recent study found that only 1.07% of the published psychological scientific literature sought to directly replicate previous work (Makel et al., 2012). Academic scientists, who depend on publication in order to progress in their careers, respond to this bias by focusing their research on unexplored phenomena instead of replications. The implications for science are dire. Replication by independent parties provides for confidence in a finding, reducing the likelihood of experimenter bias and statistical anomaly. Indeed, had scientists been as focused on replication as they were on hunting down novel results, the field would likely not be in crisis now.
Editorial Preference for Positive over Negative Findings
Journals prefer positive over negative findings. “Positive” in this context does not refer to how upbeat, beneficial, or heart-warming the study is. Rather, positive refers to whether the study concluded that an experimental effect existed. Stated another way, and drawing on your recall from that class you took in experimental methods, positive findings typically entail a rejection of the null hypothesis. In essence, from the perspective of most journals, rejecting the null hypothesis as a result of a research study is a newsworthy event. By contrast, accepting the null hypothesis might just amount to “old news.”
The fact that journals are more apt to publish positive rather than negative studies has consequences in terms of the types of studies that even get submitted for publication. Studies submitted for publication typically report the existence of an effect rather than the absence of one. The vast majority of studies that actually get published also report the existence of an effect. Those studies designed to disconfirm reports of published effects are few-and-far-between to begin with, and may not be deemed publishable even when they are conducted and submitted to a journal for review. The net result is that scientists, policy-makers, judges, and anyone else who has occasion to rely on published research may have a difficult time determining the actual strength and robustness of a reported finding.
Questionable Research Practices (QRPs)
In this admittedly nonexhaustive review of factors contributing to the replicability crisis, the third factor is QRPs. Included here are questionable scientific practices that do not rise to the level of fraud but still introduce error into bodies of scientific evidence. For example, a recent survey of psychological scientists found that nearly 60% of the respondents reported that they decided to collect more data after peeking to see if their already-collected data had reached statistical significance (John et al., 2012). While this procedure may seem relatively benign, it is not. Imagine you are trying to determine if a nickel is fair, or weighted toward heads. Rather than establishing the number of flips you plan on performing prior to your “test,” you just start flipping and from time-to-time check how many times the coin has come up heads. After a run of five heads, you notice that your weighted-coin hypothesis is looking strong and decide to stop flipping. The nonindependence between the decision to collect data and the data themselves introduces bias. Over the course of many studies, such practices can seriously undermine a body of research.
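The bias introduced by peeking can be demonstrated with a simulation of a perfectly fair coin. The sketch below is one possible illustration, not the procedure used in the cited survey: the choice of 100 flips, a peek after every 10 flips, a one-sided exact binomial test, and an alpha of .05 are all assumptions, as are the function names.

```python
import math
import random


def p_value_heads(heads, flips):
    """One-sided exact binomial p-value: P(at least `heads` heads | fair coin)."""
    tail = sum(math.comb(flips, k) for k in range(heads, flips + 1))
    return tail / 2**flips


def fixed_n_rejects(rng, n=100, alpha=0.05):
    """Flip a fair coin n times, then test once at the end."""
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return p_value_heads(heads, n) < alpha


def peeking_rejects(rng, max_n=100, step=10, alpha=0.05):
    """Flip a fair coin, test after every `step` flips, stop at the first 'hit'."""
    heads = 0
    for flips in range(1, max_n + 1):
        heads += rng.random() < 0.5
        if flips % step == 0 and p_value_heads(heads, flips) < alpha:
            return True
    return False


rng = random.Random(42)  # fixed seed so the run is repeatable
TRIALS = 2000

# The coin is fair in every trial, so each rejection is a false positive.
fixed_rate = sum(fixed_n_rejects(rng) for _ in range(TRIALS)) / TRIALS
peeking_rate = sum(peeking_rejects(rng) for _ in range(TRIALS)) / TRIALS

print(f"false-positive rate, fixed number of flips: {fixed_rate:.3f}")
print(f"false-positive rate, peeking and stopping:  {peeking_rate:.3f}")
```

Because the coin is fair throughout, every "significant" result is spurious. The fixed-sample procedure stays at or below the nominal alpha, while stopping at the first favorable peek produces false positives noticeably more often.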
There are many other sorts of QRPs. For example, one variety entails the researcher failing to report all of the research undertaken in a research program and then selectively reporting only the studies that confirm a particular hypothesis. With only the published study in hand, and without access to the researchers’ records, it would be difficult if not impossible for the research consumer to discern important milestones in the chronology of the research (such as what studies were conducted in what sequence, and what measurements were taken).
One proposed remedy for such QRPs is preregistration (Eich, 2014). Preregistration involves publicly committing to a set of procedures prior to carrying out a study. Using such a procedure, there can be no doubt as to the number of observations planned, and the number of measures anticipated. In fact, there are now several websites that allow researchers to preregister their research plans. It is also increasingly common for academic journals to demand preregistration (or at least a good explanation for why the study wasn’t preregistered). Alternatively, some journals award special recognition to studies that were preregistered so that readers can have more confidence in the replicability of the reported findings.
Lessons Learned from the Replicability Crisis
The replicability crisis represents an important learning opportunity for scientists and students. Prior to such replicability issues coming to light, it was typically assumed that science would simply self-correct over the long run. This means that at some point in time, the nonreplicable study would be exposed as such, and the scientific record would somehow be straightened out. Of course, while some self-correction does occur, it occurs neither fast enough nor often enough, nor in sufficient magnitude. The stark reality is that unreliable findings that reach general acceptance can stay in place for decades before they are eventually disconfirmed. And even when such long-standing findings are proven incorrect, there is no mechanism in place to alert other scientists and the public of this fact.
Traditionally, science has only been admitted into courtrooms if an expert attests that the science has reached “general acceptance” in the scientific community from which it comes. However, in the wake of science’s replicability crisis, it is not at all uncommon for findings to meet this general acceptance standard. Sadly, the standard may be met even if the findings from the subject study are questionable at best, or downright inaccurate at worst. Fortunately, another legal test has been created in recent years (Chin, 2014). In this test, judges are asked to play a gatekeeper role and only admit scientific evidence if it has been properly tested, has a sufficiently low error rate, and has been peer-reviewed and published. In this latter test, judges can ask more sensible questions, such as whether the study has been replicated and if the testing was done using a safeguard like preregistration.
Spurred by the recognition of a crisis of replicability, science is moving to right both past and potential wrongs. As previously noted, there are now mechanisms in place for preregistration of experimental designs and growing acceptance of the importance of doing so. Further, organizations that provide for open science (e.g., easy and efficient preregistration) are receiving millions of dollars in funding to provide support for researchers seeking to perform more rigorous research. Moreover, replication efforts, beyond even that of the Open Science Collaboration, are becoming more common (Klein et al., 2013). Overall, it appears that most scientists now recognize replicability as a concern that needs to be addressed with meaningful changes to what has constituted “business-as-usual” for so many years.
Effectively addressing the replicability crisis is important for any profession that relies on scientific evidence. Within the field of law, for example, science is used every day in courtrooms throughout the world to prosecute criminal cases and adjudicate civil disputes. Everyone from a criminal defendant facing capital punishment to a major corporation arguing that its violent video games did not promote real-life violence may rely at some point in a trial on a study published in a psychology journal. Appeals are sometimes limited. Costs associated with legal proceedings are often prohibitive. With a momentous verdict in the offing, none of the litigants has the luxury of time—which might amount to decades, if at all—for the scholarly research system to self-correct.
When it comes to psychology’s replicability crisis, there is good and bad news. The bad news is that it is real, and that it has existed, perhaps, since scientific studies were first published. The good news is that the problem has finally been recognized, and constructive steps are being taken to address it.
Used with permission of Jason Chin.
*This Close-Up was guest-authored by Jason Chin of the University of Toronto.
Parallel-Forms and Alternate-Forms Reliability Estimates
If you have ever taken a makeup exam in which the questions were not all the same as on the test initially given, you have had experience with different forms of a test. And if you have ever wondered whether the two forms of the test were really equivalent, you have wondered about the alternate-forms or parallel-forms reliability of the test. The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
Although frequently used interchangeably, there is a difference between the terms alternate forms and parallel forms. Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures. The term parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty. The term alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
JUST THINK . . .
You missed the midterm examination and have to take a makeup exam. Your classmates tell you that they found the midterm impossibly difficult. Your instructor tells you that you will be taking an alternate form, not a parallel form, of the original test. How do you feel about that?
Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: (1) Two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice). An additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient. Testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test.3
Developing alternate forms of tests can be time-consuming and expensive. Imagine what might be involved in trying to create sets of equivalent items and then getting the same people to sit for repeated administrations of an experimental test! On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways. For example, it minimizes the effect of memory for the content of a previously administered form of the test.
JUST THINK . . .
From the perspective of the test user, what are other possible advantages of having alternate or parallel forms of the same test?
Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits (alternate forms, parallel forms, or otherwise) to reflect that stability. As an example, we expect that there will be, and in fact there is, a reasonable degree of stability in scores on intelligence tests. Conversely, we might expect relatively little stability in scores obtained on a measure of state anxiety (anxiety felt at the moment).
An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency. There are different methods of obtaining internal consistency estimates of reliability. One such method is the split-half estimate.
Split-Half Reliability Estimates
An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps:
· Step 1. Divide the test into equivalent halves.
· Step 2. Calculate a Pearson r between scores on the two halves of the test.
· Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).
When it comes to calculating split-half reliability coefficients, there’s more than one way to split a test—but there are some ways you should never split a test. Simply dividing the test in the middle is not recommended because it’s likely that this procedure would spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.
One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability.4 Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty. In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other, or as nearly equal as humanly possible, in format, stylistic, statistical, and related aspects.
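Steps 1 and 2 of the split-half procedure can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the data matrix is hypothetical (5 testtakers by 6 right/wrong items), and the helper names `odd_even_half_scores` and `pearson_r` are invented for this sketch. It performs the odd-even split and then correlates the two half-test totals; the resulting half-test correlation would still need the Step 3 adjustment.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_half_scores(item_scores):
    """Step 1: split each testtaker's items into odd and even halves.

    item_scores holds one row per testtaker and one column per item,
    with item 1 in the first column. Returns two lists of half-test
    totals, one entry per testtaker.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return odd_totals, even_totals


# Hypothetical data: 5 testtakers x 6 dichotomously scored items.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

odd, even = odd_even_half_scores(scores)
r_hh = pearson_r(odd, even)  # Step 2: correlate the two halves
print(odd, even, round(r_hh, 3))
```

A random split or a content-matched split would change only how the columns are assigned to the two halves; the correlation step stays the same.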
Step 2 in the procedure entails the computation of a Pearson r, which requires little explanation at this point. However, the third step requires the use of the Spearman–Brown formula.
The Spearman–Brown formula
The Spearman–Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items. Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown ($r_{SB}$) formula is
$$r_{SB} = \frac{n\,r_{xy}}{1 + (n - 1)\,r_{xy}}$$
where $r_{SB}$ is equal to the reliability adjusted by the Spearman–Brown formula, $r_{xy}$ is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.
By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. The symbol $r_{hh}$ stands for the Pearson r of scores in the two half tests:
$$r_{SB} = \frac{2\,r_{hh}}{1 + r_{hh}}$$
Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test. Table 5–1 shows half-test correlations presented alongside adjusted reliability estimates for the whole test. You can see that all the adjusted correlations are higher than the unadjusted correlations. This is so because Spearman–Brown estimates are based on a test that is twice as long as the original half test. For the data from the kindergarten pupils, for example, a half-test reliability of .718 is estimated to be equivalent to a whole-test reliability of .836.
Table 5–1. Odd-Even Reliability Coefficients before and after the Spearman–Brown Adjustment*
|Grade|Half-Test Correlation (unadjusted r)|Whole-Test Estimate ($r_{SB}$)|
|K|.718|.836|
*For scores on a test of mental ability
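The Spearman–Brown adjustment is simple enough to check by hand or in code. The following sketch assumes the standard general form of the formula, n times r divided by 1 plus (n minus 1) times r; the function names are invented for illustration. It reproduces the kindergarten example from the text and shows that the same function handles shortening a test when n is less than 1.

```python
def spearman_brown(r_xy: float, n: float) -> float:
    """General Spearman-Brown estimate.

    r_xy: reliability (Pearson r) of the original-length test.
    n: items in the revised test divided by items in the original test.
    """
    return n * r_xy / (1 + (n - 1) * r_xy)


def split_half_adjusted(r_hh: float) -> float:
    """Whole-test estimate from a half-test correlation (n = 2)."""
    return spearman_brown(r_hh, n=2)


whole_test = split_half_adjusted(0.718)
print(round(whole_test, 3))  # 0.836

# Halving the whole test (n = 0.5) recovers the half-test value.
print(round(spearman_brown(whole_test, n=0.5), 3))  # 0.718
```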
If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test’s reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations. For example, the test administrator may have only limited time with a particular testtaker or group of testtakers. Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.
JUST THINK . . .
What are other situations in which a reduction in test size or the time it takes to administer a test might be desirable? What are the arguments against reducing test size?
A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured. If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability. Another alternative would be to abandon this relatively unreliable instrument and locate—or develop—a suitable alternative. The reliability of the instrument could also be raised in some way. For example, the reliability of the instrument might be raised by creating new items, clarifying the test’s instructions, or simplifying the scoring rules.
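The number of items needed for a target reliability follows from solving the general Spearman–Brown formula for n, which gives n = r_desired(1 − r_xy) / [r_xy(1 − r_desired)]. The sketch below applies that rearrangement; the function names and the 20-item test with a reliability of .60 are hypothetical values chosen for illustration, and the result holds only if the added items are equivalent to the originals.

```python
from math import ceil


def length_factor(r_xy: float, r_desired: float) -> float:
    """Spearman-Brown solved for n: how many times longer the test must be."""
    return (r_desired * (1 - r_xy)) / (r_xy * (1 - r_desired))


def items_needed(original_items: int, r_xy: float, r_desired: float) -> int:
    """Total items in the lengthened test, rounded up to a whole item."""
    return ceil(original_items * length_factor(r_xy, r_desired))


# Hypothetical 20-item test with reliability .60; target reliability .80.
n = length_factor(0.60, 0.80)
total_items = items_needed(20, 0.60, 0.80)
print(round(n, 2), total_items)  # 2.67 54

# A low starting reliability makes lengthening impractical:
# going from .30 to .90 would require a test about 21 times as long.
print(round(length_factor(0.30, 0.90), 1))  # 21.0
```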
Internal consistency estimates of reliability, such as that obtained by use of the Spearman–Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests. The impact of test characteristics on reliability is discussed in detail later in this chapter.
Other Methods of Estimating Internal Consistency
In addition to the Spearman–Brown formula, other methods used to obtain estimates of internal consistency reliability include formulas developed by Kuder and Richardson (1937) and Cronbach (1951). Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait. As an adjective used to describe test items, homogeneity (derived from the Greek words homos, meaning “same,” and genos, meaning “kind”) is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.
In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that assesses knowledge only of ultra high definition (UHD) television repair skills could be expected to be more homogeneous in content than a general electronics repair test. The former test assesses only one area whereas the latter assesses several, such as knowledge not only of UHD televisions but also of digital video recorders, Blu-Ray players, MP3 players, satellite radio receivers, and so forth.
The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation. Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested. Testtakers with the same score on a more heterogeneous test may have quite different abilities.
Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable.5
The Kuder–Richardson formulas
Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability. The most widely known of the many formulas they collaborated on is their Kuder–Richardson formula 20, or KR-20, so named because it was the 20th formula developed in a series. Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Table 5–2 summarizes items on a sample heterogeneous test (the HERT), and Table 5–3 summarizes HERT performance for 20 testtakers. Assuming the difficulty level of all the items on the test to be about the same, would you expect a split-half (odd-even) estimate of reliability to be fairly high or low? How would the KR-20 reliability estimate compare with the odd-even estimate of reliability: would it be higher or lower?
Table 5–2. Content Areas Sampled for 18 Items of the Hypothetical Electronics Repair Test (HERT)
|Item Number|Content Area|
|3|Digital video recorder (DVR)|
|4|Digital video recorder (DVR)|
|11|Compact disc player|
|12|Compact disc player|
|13|Satellite radio receiver|
|14|Satellite radio receiver|
Table 5–3. Performance on the 18-Item HERT by Item for 20 Testtakers
|Item Number|Number of Testtakers Correct|
We might guess that, because the content areas sampled for the 18 items from this “Hypothetical Electronics Repair Test” are ordered in a manner whereby odd and even items tap the same content area, the odd-even reliability estimate will probably be quite high. Because of the great heterogeneity of content areas when taken as a whole, it could reasonably be predicted that the KR-20 estimate of reliability will be lower than the odd-even one. How is KR-20 computed? The following formula may be used:
$$r_{KR20} = \left(\frac{k}{k - 1}\right)\left(1 - \frac{\sum pq}{\sigma^2}\right)$$
where $r_{KR20}$ stands for the Kuder–Richardson formula 20 reliability coefficient, k is the number of test items, $\sigma^2$ is the variance of total test scores, p is the proportion of testtakers who pass the item, q is the proportion of people who fail the item, and $\sum pq$ is the sum of the pq products over all items. For this particular example, k equals 18. Based on the data in Table 5–3, $\sum pq$ can be computed to be 3.975. The variance of total test scores is 5.26. Thus, $r_{KR20}$ = (18/17)(1 − 3.975/5.26) = .259.
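The HERT arithmetic can be verified with a short sketch. It assumes the standard KR-20 form, k/(k − 1) times (1 − Σpq/σ²); the function names are invented for illustration, and the second function, which works from a raw matrix of 0/1 item scores, uses the population variance (dividing by the number of testtakers) so that the total-score variance is on the same footing as the p × q item variances.

```python
def kr20_from_summary(k: int, sum_pq: float, total_variance: float) -> float:
    """KR-20 from the number of items, the sum of p*q, and total-score variance."""
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr20(item_scores) -> float:
    """KR-20 from a testtaker-by-item matrix of 0/1 scores."""
    n_people = len(item_scores)
    k = len(item_scores[0])
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_people  # proportion passing
        sum_pq += p * (1 - p)                              # q = 1 - p
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (divide by N, not N - 1).
    total_variance = sum((t - mean_total) ** 2 for t in totals) / n_people
    return kr20_from_summary(k, sum_pq, total_variance)


# Summary values reported in the text for the 18-item HERT.
print(round(kr20_from_summary(18, 3.975, 5.26), 3))  # 0.259
```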
An approximation of KR-20 can be obtained by the use of the 21st formula in the series developed by Kuder and Richardson, a formula known as (you guessed it) KR-21. The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty. Let’s add that this assumption is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.
Numerous modifications of Kuder–Richardson formulas have been proposed through the years. The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α−20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.
Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967), coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items. The formula for coefficient alpha is
$$r_{\alpha} = \left(\frac{k}{k - 1}\right)\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right)$$
where $r_{\alpha}$ is coefficient alpha, k is the number of items, $\sigma_i^2$ is the variance of one item, $\sum \sigma_i^2$ is the sum of variances of each item, and $\sigma^2$ is the variance of the total test scores.
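As a rough illustration of the alpha formula, the sketch below computes it from a small matrix of item scores. The ratings are hypothetical (4 testtakers by 3 items on a 7-point scale) and the function names are invented for this example. Population variances are used throughout; because the same divisor appears in the item variances and the total-score variance, using sample variances instead would give the same ratio.

```python
def variance(values) -> float:
    """Population variance (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def coefficient_alpha(item_scores) -> float:
    """Coefficient alpha from a testtaker-by-item matrix of scores."""
    k = len(item_scores[0])
    item_variances = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 7-point ratings: 4 testtakers x 3 items.
ratings = [
    [4, 5, 6],
    [2, 3, 3],
    [6, 6, 7],
    [3, 2, 4],
]

alpha = coefficient_alpha(ratings)
print(round(alpha, 3))  # 0.966
```

When every item is scored 0 or 1, each item variance equals p × q, so this function returns the same value as KR-20.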
Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003). Essentially, this formula yields an estimate of the mean of all possible test-retest, split-half coefficients. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.
Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are. Here, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). It is possible, however, to conceive of data sets that would yield a negative value of alpha (Streiner, 2003b). Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001). Also, a myth about alpha is that “bigger is always better.” As Streiner (2003b) pointed out, a value of alpha above .90 may be “too high” and indicate redundancy in the items.
In contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating “perfect dissimilarity.” In practice, most reliability coefficients—regardless of the specific type of reliability they are measuring—range in value from 0 to 1. This is generally true, although it is possible to conceive of exceptional cases in which data sets yield an r with a negative value.
Average proportional distance (APD)
A relatively new measure for evaluating the internal consistency of a test is the average proportional distance (APD) method (Sturman et al., 2009). Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach’s alpha), the APD is a measure that focuses on the degree of difference that exists between item scores. Accordingly, we define the average proportional distance method as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
To illustrate how the APD is calculated, consider the (hypothetical) “3-Item Test of Extraversion” (3-ITE). As conveyed by the title of the 3-ITE, it is a test that has only three items. Each of the items is a sentence that somehow relates to extraversion. Testtakers are instructed to respond to each of the three items with reference to the following 7-point scale: 1 = Very strongly disagree, 2 = Strongly disagree, 3 = Disagree, 4 = Neither Agree nor Disagree, 5 = Agree, 6 = Strongly agree, and 7 = Very strongly agree.
Typically, in order to evaluate the inter-item consistency of a scale, the APD would be calculated for a group of testtakers. However, for the purpose of illustrating the calculations of this measure, let’s look at how the APD would be calculated for one testtaker. Yolanda scores 4 on Item 1, 5 on Item 2, and 6 on Item 3. Based on Yolanda’s scores, the APD would be calculated as follows:
· Step 1: Calculate the absolute difference between scores for all of the items.
· Step 2: Average the difference between scores.
· Step 3: Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one.
So, for the 3-ITE, here is how the calculations would look using Yolanda’s test scores:
· Step 1: Absolute difference between Items 1 and 2 = 1
· Absolute difference between Items 1 and 3 = 2
· Absolute difference between Items 2 and 3 = 1
· Step 2: In order to obtain the average difference (AD), add up the absolute differences in Step 1 and divide by the number of items as follows:
$$AD = \frac{1 + 2 + 1}{3} = \frac{4}{3} \approx 1.33$$
· Step 3: To obtain the average proportional distance (APD), divide the average difference by 6 (the 7 response options in our ITE scale minus 1). Using Yolanda’s data, we would divide 1.33 by 6 to get .22. Thus, the APD for the ITE is .22. But what does this mean?
The general “rule of thumb” for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value between .2 and .25 is in the acceptable range. A calculated APD above .25 is suggestive of problems with the internal consistency of the test. These guidelines are based on the assumption that items measuring a single construct such as extraversion should ideally be correlated with one another in the .6 to .7 range. Let’s add that the expected inter-item correlation may vary depending on the variables being measured, so the ideal correlation values are not set in stone. In the case of the 3-ITE, the data for our one subject suggest that the scale has acceptable internal consistency. Of course, in order to make any meaningful conclusions about the internal consistency of the 3-ITE, the instrument would have to be tested with a large sample of testtakers.
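The three APD steps applied to Yolanda's scores can be sketched as follows. The function name is invented for illustration, and one detail is an assumption: the average is taken over the number of pairwise differences. For a three-item test there are exactly three pairs, so this matches the text's division by the number of items, but the two counts differ for longer tests.

```python
from itertools import combinations


def average_proportional_distance(item_scores, response_options: int) -> float:
    """APD for one testtaker's item scores.

    Step 1: absolute difference for every pair of items.
    Step 2: average those differences (here, over the number of pairs).
    Step 3: divide by the number of response options minus one.
    """
    differences = [abs(a - b) for a, b in combinations(item_scores, 2)]
    average_difference = sum(differences) / len(differences)
    return average_difference / (response_options - 1)


# Yolanda's scores on the 3-ITE, which uses a 7-point response scale.
apd = average_proportional_distance([4, 5, 6], response_options=7)
print(round(apd, 2))  # 0.22
```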
One potential advantage of the APD method over using Cronbach’s alpha is that the APD index is not connected to the number of items on a measure. Cronbach’s alpha will be higher when a measure has more than 25 items (Cortina, 1993). Perhaps the best course of action when evaluating the internal consistency of a given measure is to analyze and integrate the information using several indices, including Cronbach’s alpha, mean inter-item correlations, and the APD.
Before proceeding, let’s emphasize that all indices of reliability provide an index that is a characteristic of a particular group of test scores, not of the test itself (Caruso, 2000; Yin & Fan, 2000). Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with various factors, such as the sample of testtakers from which the data were drawn. A reliability index published in a test manual might be very impressive. However, keep in mind that the reported reliability was achieved with a particular group of testtakers. If a new group of testtakers is sufficiently different from the group of testtakers on whom the reliability studies were done, the reliability coefficient may not be as impressive and may even be unacceptable.
Measures of Inter-Scorer Reliability
When being evaluated, we usually would like to believe that the results would be the same no matter who is doing the evaluating.6 For example, if you take a road test for a driver’s license, you would like to believe that whether you pass or fail is solely a matter of your performance behind the wheel and not a function of who is sitting in the passenger’s seat. Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than of anything else. This was demonstrated back in 1912, when researchers presented one pupil’s English composition to a convention of teachers, and volunteers graded the papers. The grades ranged from a low of 50% to a high of 98% (Starch & Elliott, 1912). Concerns about inter-scorer reliability are as relevant today as they were back then (Chmielewski et al., 2015; Edens et al., 2015; Penney et al., 2016). With this as background, it can be appreciated that certain tests lend themselves to scoring in a way that is more consistent than with other tests. It is meaningful, therefore, to raise questions about the degree of consistency, or reliability, that exists between scorers of a particular test.
Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. Reference to levels of inter-scorer reliability for a particular test may be published in the test’s manual or elsewhere. If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. A responsible test developer who is unable to create a test that can be scored with a reasonable degree of consistency by trained scorers will go back to the drawing board to discover the reason for this problem. If, for example, the problem is a lack of clarity in scoring criteria, then the remedy might be to rewrite the scoring criteria section of the manual to include clearly written scoring rules. Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy (Smith, 1986).
Inter-scorer reliability is often used when coding nonverbal behavior. For example, a researcher who wishes to quantify some aspect of nonverbal behavior, such as depressed mood, would start by composing a checklist of behaviors that constitute depressed mood (such as looking downward and moving slowly). Accordingly, each subject would be given a depressed mood score by a rater. Researchers try to guard against such ratings being products of the rater’s individual biases or idiosyncrasies in judgment. This can be accomplished by having at least one other individual observe and rate the same behaviors. If consensus can be demonstrated in the ratings, the researchers can be more confident regarding the accuracy of the ratings and their conformity with the established rating system.
JUST THINK . . .
Can you think of a measure in which it might be desirable for different judges, scorers, or raters to have different views on what is being judged, scored, or rated?
Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability. In this chapter’s Everyday Psychometrics section, the nature of the relationship between the specific method used and the resulting estimate of diagnostic reliability is considered in greater detail.
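As a rough illustration of the correlational approach just described, the following sketch computes a Pearson r between the scores that two scorers assign to the same set of essays. The scores and the names `scorer_a` and `scorer_b` are hypothetical and are assumed here only for the sake of the example; they are not data from any study cited in this chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one scorer's scores do not vary")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores given by two scorers to the same six essays.
scorer_a = [78, 85, 62, 90, 71, 88]
scorer_b = [75, 88, 65, 92, 70, 84]

inter_scorer_r = pearson_r(scorer_a, scorer_b)
print(f"coefficient of inter-scorer reliability: {inter_scorer_r:.2f}")
```

A correlation of this kind reflects how similarly the two scorers rank and space the essays. It would not, by itself, reveal whether one scorer is consistently more lenient than the other.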
The Importance of the Method Used for Estimating Reliability*
As noted throughout this text, reliability is extremely important in its own right and is also a necessary, but not sufficient, condition for validity. However, researchers often fail to understand that the specific method used to obtain reliability estimates can lead to large differences in those estimates, even when other factors (such as subject sample, raters, and specific reliability statistic used) are held constant. A published study by Chmielewski et al. (2015) highlighted the substantial influence that differences in method can have on estimates of inter-rater reliability.
As one might expect, high levels of diagnostic (inter-rater) reliability are vital for the accurate diagnosis of psychiatric/psychological disorders. Diagnostic reliability must be acceptably high in order to accurately identify risk factors for a disorder that are common to subjects in a research study. Without satisfactory levels of diagnostic reliability, it becomes nearly impossible to accurately determine the effectiveness of treatments in clinical trials. Low diagnostic reliability can also lead to improper information regarding how a disorder changes over time. In applied clinical settings, unreliable diagnoses can result in ineffective patient care—or worse. The utility and validity of a particular diagnosis itself can be called into question if expert diagnosticians cannot, for whatever reason, consistently agree on who should and should not be so diagnosed. In sum, high levels of diagnostic reliability are essential for establishing diagnostic validity (Freedman, 2013; Nelson-Gray, 1991).
The official nomenclature of psychological/psychiatric diagnoses in the United States is the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013), which provides explicit diagnostic criteria for all mental disorders. A perceived strength of recent versions of the DSM is that disorders listed in the manual can be diagnosed with a high level of inter-rater reliability (Hyman, 2010; Nathan & Langenbucher, 1999), especially when trained professionals use semistructured interviews to assign those diagnoses. However, the field trials for the newest version of the manual, the DSM-5, demonstrated a mean kappa of only .44 (Regier et al., 2013), which is considered a “fair” level of agreement that is only moderately greater than chance (Cicchetti, 1994; Fleiss, 1981). Moreover, DSM-5 kappas were much lower than those from previous versions of the manual which had been in the “excellent” range. As one might expect, given the assumption that psychiatric diagnoses are reliable, the results of the DSM-5 field trials caused considerable controversy and led to numerous criticisms of the new manual (Frances, 2012; Jones, 2012). Interestingly, several diagnoses, which were unchanged from previous versions of the manual, also demonstrated low diagnostic reliability suggesting that the manual itself was not responsible for the apparent reduction in reliability. Instead, differences in the methods used to obtain estimates of inter-rater reliability in the DSM-5 Field Trials, compared to estimates for previous versions of the manual, may have led to the lower observed diagnostic reliability.
Prior to DSM-5, estimates of DSM inter-rater reliability were largely derived using the audio-recording method. In the audio-recording method, one clinician interviews a patient and assigns diagnoses. Then a second clinician, who does not know what diagnoses were assigned, listens to an audio-recording (or watches a video-recording) of the interview and independently assigns diagnoses. These two sets of ratings are then used to calculate inter-rater reliability coefficients (such as kappa). However, in recent years, several researchers have made the case that the audio-recording method might inflate estimates of diagnostic reliability for a variety of reasons (Chmielewski et al., 2015; Kraemer et al., 2012). First, if the interviewing clinician decides the patient they are interviewing does not meet diagnostic criteria for a disorder, they typically do not ask about any remaining symptoms of the disorder (this is a feature of semistructured interviews designed to reduce administration times). However, it also means that the clinician listening to the audio-tape, even if they believe the patient might meet diagnostic criteria for a disorder, does not have all the information necessary to assign a diagnosis and therefore is forced to agree that no diagnosis is present. Second, only the interviewing clinician can follow up patient responses with further questions or obtain clarification regarding symptoms to help them make a decision. Third, even when semistructured interviews are used it is possible that two highly trained clinicians might obtain different responses from a patient if they had each conducted their own interview. In other words, the patient may volunteer more or perhaps even different information to one of the clinicians for any number of reasons. 
All of the above result in the audio- or video-recording method artificially constraining the information provided to the clinicians to be identical, which is unlikely to occur in actual research or clinical settings. As such, this method does not allow for truly independent ratings and therefore likely results in overestimates of what would be obtained if separate interviews were conducted.
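To make the kappa coefficient mentioned above more concrete, here is a minimal sketch of Cohen's kappa for two clinicians who each assign a present/absent diagnosis to the same patients. The ratings are invented for illustration and are not taken from Chmielewski et al. (2015) or from the DSM-5 field trials. Kappa compares the observed proportion of agreement with the proportion of agreement expected by chance from each clinician's base rates.

```python
from collections import Counter

def cohens_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two raters who classified the same cases."""
    n = len(ratings_1)
    if n == 0 or n != len(ratings_2):
        raise ValueError("need two non-empty rating lists of equal length")
    # Proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Agreement expected by chance, from each rater's category proportions.
    counts_1 = Counter(ratings_1)
    counts_2 = Counter(ratings_2)
    categories = set(counts_1) | set(counts_2)
    expected = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is perfect")
    return (observed - expected) / (1 - expected)

# Hypothetical diagnoses for ten patients (1 = disorder present, 0 = absent).
clinician_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
clinician_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohens_kappa(clinician_1, clinician_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the clinicians agree on 8 of 10 patients, yet kappa is noticeably lower than .80 because a good share of that agreement would be expected by chance alone.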
In the test-retest method, separate independent interviews are conducted by two different clinicians, with neither clinician knowing what occurred during the other interview. These interviews are conducted over a time frame short enough that true change in diagnostic status is highly unlikely, making this method similar to the dependability method of assessing reliability (Chmielewski & Watson, 2009). Because diagnostic reliability is intended to assess the extent to which a patient would receive the same diagnosis at different hospitals or clinics—or, alternatively, the extent to which different studies are recruiting similar patients—the test-retest method provides a more meaningful, realistic, and ecologically valid estimate of diagnostic reliability.
Chmielewski et al. (2015) examined the influence of method on estimates of reliability by using both the audio-recording and test-retest methods in a large sample of psychiatric patients. The authors analyzed DSM-IV diagnoses because of the long-standing claims in the literature that they were reliable and the fact that structured interviews had not yet been created for the DSM-5. They carefully selected a one-week test-retest interval, based on theory and research, to minimize the likelihood that true diagnostic change would occur while substantially reducing memory effects and patient fatigue which might exist if the interviews were conducted immediately after each other. Clinicians in the study were at least master’s level and underwent extensive training that far exceeded the training of clinicians in the vast majority of research studies. The same pool of clinicians and patients was used for the audio-recording and test-retest methods. Diagnoses were assigned using the Structured Clinical Interview for DSM-IV (SCID-I/P; First et al., 2002), which is widely considered the gold-standard diagnostic interview in the field. Finally, patients completed self-report measures which were examined to ensure patients’ symptoms did not change over the one-week interval.
Diagnostic (inter-rater) reliability using the audio-recording method was very high (mean kappa = .80) and would be considered “excellent” by traditional standards (Cicchetti, 1994; Fleiss, 1981). Moreover, estimates of diagnostic reliability were equivalent or superior to previously published values for the DSM-5. However, estimates of diagnostic reliability obtained from the test-retest method were substantially lower (mean kappa = .47) and would be considered only “fair” by traditional standards. Moreover, approximately 25% of the disorders demonstrated “poor” diagnostic reliability. Interestingly, this level of diagnostic reliability was very similar to that observed in the DSM-5 Field Trials (mean kappa = .44), which also used the test-retest method (Regier et al., 2013). It is important to note these large differences in estimates of diagnostic reliability emerged despite the fact that (1) the same highly trained master’s-level clinicians were used for both methods; (2) the SCID-I/P, which is considered the “gold standard” in diagnostic interviews, was used; (3) the same patient sample was used; and (4) patients’ self-report of their symptoms was very stable (or, patients were experiencing their symptoms the same way during both interviews) and any changes in self-report were unrelated to diagnostic disagreements between clinicians. These results suggest that the reliability of diagnoses is far lower than commonly believed. Moreover, the results demonstrate the substantial influence that method has on estimates of diagnostic reliability even when other factors are held constant.
Used with permission of Michael Chmielewski.
*This Everyday Psychometrics was guest-authored by Michael Chmielewski of Southern Methodist University and was based on an article by Chmielewski et al. (2015), published in the Journal of Abnormal Psychology (copyright © 2015 by the American Psychological Association). The use of this information does not imply endorsement by the publisher.
Using and Interpreting a Coefficient of Reliability
We have seen that, with respect to the test itself, there are basically three approaches to the estimation of reliability: (1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency. The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.
Another question that is linked in no trivial way to the purpose of the test is, “How high should the coefficient of reliability be?” Perhaps the best “short answer” to this question is: “On a continuum relative to the purpose and importance of the decisions to be made on the basis of scores on the test.” Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others. If a test score carries with it life-or-death implications, then we need to hold that test to some high standards—including relatively high standards with regard to coefficients of reliability. If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability. As a rule of thumb, it may be useful to think of reliability coefficients in a way that parallels many grading systems: In the .90s rates a grade of A (with a value of .95 or higher for the most important types of decisions), in the .80s rates a B (with below .85 being a clear B−), and anywhere from .65 through the .70s rates a weak, “barely passing” grade that borders on failing (and unacceptable). Now, let’s get a bit more technical with regard to the purpose of the reliability coefficient.
The Purpose of the Reliability Coefficient
If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument’s test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice. If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5–1, then a number of reliability coefficients would have to be calculated.
Figure 5–1. Sources of Variance in a Hypothetical Test. In this hypothetical situation, 5% of the variance has not been identified by the test user. It is possible, for example, that this portion of the variance could be accounted for by transient error, a source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time. Then again, this 5% of the error may be due to other factors that are yet to be identified.
Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items. Table 5–4 summarizes the different kinds of error variance that are reflected in different reliability coefficients.
Table 5–4. Summary of Reliability Types
| Type of Reliability | Purpose | Typical Uses | Number of Testing Sessions | Sources of Error Variance | Statistical Procedures |
| Test-retest | To evaluate the stability of a measure | When assessing the stability of various personality traits | 2 | Administration | Pearson r or Spearman rho |
| Alternate-forms | To evaluate the relationship between different forms of a measure | When there is a need for different forms of a test (e.g., makeup tests) | 1 or 2 | Test construction or administration | Pearson r or Spearman rho |
| Internal consistency | To evaluate the extent to which items on a scale relate to one another | When evaluating the homogeneity of a measure (or, all items are tapping a single construct) | 1 | Test construction | Pearson r between equivalent test halves with Spearman–Brown correction, or Kuder–Richardson for dichotomous items, or coefficient alpha for multipoint items, or APD |
| Inter-scorer | To evaluate the level of agreement between raters on a measure | Interviews or coding of behavior. Used when researchers need to show that there is consensus in the way that different raters view a particular behavior pattern (and hence no observer bias). | 1 | Scoring and interpretation | Cohen’s kappa, Pearson r, or Spearman rho |
The Nature of the Test
Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether (1) the test items are homogeneous or heterogeneous in nature; (2) the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) the range of test scores is or is not restricted; (4) the test is a speed or a power test; and (5) the test is or is not criterion-referenced.
Some tests present special problems regarding the measurement of their reliability. For example, a number of psychological tests have been developed for use with infants to help identify children who are developing slowly or who may profit from early intervention of some sort. Measuring the internal consistency reliability or the inter-scorer reliability of such tests is accomplished in much the same way as it is with other tests. However, measuring test-retest reliability presents a unique problem. The abilities of the very young children being tested are fast-changing. It is common knowledge that cognitive development during the first months and years of life is both rapid and uneven. Children often grow in spurts, sometimes changing dramatically in as little as days (Hetherington & Parke, 1993). The child tested just before and again just after a developmental advance may perform very differently on the two testings. In such cases, a marked change in test score might be attributed to error when in reality it reflects a genuine change in the testtaker’s skills. The challenge in gauging the test-retest reliability of such tests is to do so in such a way that it is not spuriously lowered by the testtaker’s actual developmental changes between testings. In attempting to accomplish this, developers of such tests may design test-retest reliability studies with very short intervals between testings, sometimes as little as four days.
Homogeneity versus heterogeneity of test items
Recall that a test is said to be homogeneous in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.
Dynamic versus static characteristics
Whether what is being measured by the test is dynamic or static is also a consideration in obtaining an estimate of reliability. A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. If, for example, one were to take hourly measurements of the dynamic characteristic of anxiety as manifested by a stockbroker throughout a business day, one might find the measured level of this characteristic to change from hour to hour. Such changes might even be related to the magnitude of the Dow Jones average. Because the true amount of anxiety presumed to exist would vary with each assessment, a test-retest measure would be of little help in gauging the reliability of the measuring instrument. Therefore, the best estimate of reliability would be obtained from a measure of internal consistency. Contrast this situation to one in which hourly assessments of this same stockbroker are made on a trait, state, or ability presumed to be relatively unchanging (a static characteristic), such as intelligence. In this instance, obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate.
JUST THINK . . .
Provide another example of both a dynamic characteristic and a static characteristic that a psychological test could measure.
Restriction or inflation of range
In using and interpreting a coefficient of reliability, the issue variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher. Refer back to Figure 3–17 on page 111 (Two Scatterplots Illustrating Unrestricted and Restricted Ranges) for a graphic illustration.
Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis. Consider, for example, a published educational test designed for use with children in grades 1 through 6. Ideally, the manual for this test should contain not one reliability value covering all the testtakers in grades 1 through 6 but instead reliability values for testtakers at each grade level. Here’s another example: A corporate personnel officer employs a certain screening test in the hiring process. For future testing and hiring purposes, this personnel officer maintains reliability data with respect to scores achieved by job applicants—as opposed to hired employees—in order to avoid restriction of range effects in the data. This is so because the people who were hired typically scored higher on the test than any comparable group of applicants.
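The effect of restriction of range can be sketched with a small simulation. The code below assumes a hypothetical pool of 5,000 applicants with standardized scores on two administrations of a screening test that correlate about .80 in the full pool, and a hypothetical hiring cutoff of one standard deviation above the mean on the first administration. All of these values are assumptions chosen for illustration. The correlation is then computed once for all applicants and once for the "hired" subgroup only.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

random.seed(0)          # fixed seed so the sketch is repeatable
population_r = 0.80     # assumed correlation in the full applicant pool
n_applicants = 5000

# Standardized scores on the first and second administrations.
first = [random.gauss(0, 1) for _ in range(n_applicants)]
second = [population_r * s + sqrt(1 - population_r ** 2) * random.gauss(0, 1)
          for s in first]

# Restrict the sample to "hired" applicants: first score above the cutoff.
cutoff = 1.0
hired = [(a, b) for a, b in zip(first, second) if a > cutoff]
hired_first = [a for a, _ in hired]
hired_second = [b for _, b in hired]

full_r = pearson_r(first, second)
restricted_r = pearson_r(hired_first, hired_second)
print(f"all applicants: r = {full_r:.2f}")
print(f"hired only:     r = {restricted_r:.2f}")
```

Under these assumptions the coefficient for the hired subgroup comes out clearly lower than the coefficient for the full applicant pool, which is the pattern the personnel officer in the example is trying to avoid.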
Speed tests versus power tests
When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a power test. By contrast, a speed test generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly. In practice, however, the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test. Score differences on a speed test are therefore based on performance speed because items attempted tend to be correct.
A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula.
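The Spearman–Brown adjustment mentioned above can be sketched as a short function. The general form estimates the reliability of a test lengthened by a factor n as n·r / (1 + (n − 1)·r); for a split-half coefficient, n = 2. The function and parameter names below are illustrative assumptions, as is the half-test correlation of .70.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Estimate reliability after changing test length by `length_factor`.

    With the default factor of 2, this steps a half-test correlation up
    to an estimate for the whole test.
    """
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)

# Hypothetical correlation between two separately timed half tests.
r_half = 0.70
r_whole = spearman_brown(r_half)
print(f"half-test r = {r_half:.2f}, adjusted whole-test estimate = {r_whole:.2f}")
```

With a half-test correlation of .70, the adjusted estimate for the full-length test is about .82.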
Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit. If a speed test is administered once and some measure of internal consistency, such as the Kuder–Richardson or a split-half correlation, is calculated, the result will be a spuriously high reliability coefficient. To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example.
When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency.
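The odd-even example can be checked with a few lines of code. The sketch assumes a hypothetical group of testtakers who complete between 40 and 90 items in order and answer every completed item correctly; the range of 40 to 90 is an assumption made only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical group: one testtaker at each completion count from 40 to 90.
items_completed = list(range(40, 91))

# Items are completed in order and all are correct, so a testtaker who
# completes k items has ceil(k / 2) odd items and floor(k / 2) even items right.
odd_correct = [(k + 1) // 2 for k in items_completed]
even_correct = [k // 2 for k in items_completed]

odd_even_r = pearson_r(odd_correct, even_correct)
print(f"odd-even correlation: {odd_even_r:.3f}")
```

The two half scores can differ by at most one point for any testtaker, so the correlation is pushed toward 1.00 regardless of how consistent each person's response speed actually is.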
Under the same scenario, a Kuder–Richardson reliability coefficient would yield a similar coefficient that would also be, well, equally useless. Recall that KR-20 reliability is based on the proportion of testtakers correct (p) and the proportion of testtakers incorrect (q) on each item. In the case of a speed test, it is conceivable that p would equal 1.0 and q would equal 0 for many of the items. Toward the end of the test—when many items would not even be attempted because of the time limit—p might equal 0 and q might equal 1.0. For many, if not a majority, of the items, then, the product pq would equal or approximate 0. When 0 is substituted in the KR-20 formula for Σ pq, the reliability coefficient is 1.0 (a meaningless coefficient in this instance).
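A similar sketch illustrates the KR-20 argument. It uses the standard KR-20 expression, (k / (k − 1)) × (1 − Σpq / σ²), where k is the number of items and σ² is the variance of total scores. The data are hypothetical: a 100-item speed test taken by 51 testtakers who complete between 40 and 90 items in order, with every completed item correct and every unattempted item scored as incorrect.

```python
def kr20(responses):
    """KR-20 for a list of rows of 0/1 item scores (one row per testtaker)."""
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # Population variance of total scores, matching the p*q item variances.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_people  # proportion correct
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

n_items = 100
# Hypothetical speed-test data: the first `done` items are correct, the rest
# were never reached and are scored 0.
responses = [[1] * done + [0] * (n_items - done) for done in range(40, 91)]

# Count items on which everyone agrees (p = 1 or p = 0), so that pq = 0.
n_people = len(responses)
zero_pq_items = sum(
    1 for j in range(n_items)
    if sum(row[j] for row in responses) in (0, n_people)
)

coefficient = kr20(responses)
print(f"items with pq = 0: {zero_pq_items} of {n_items}")
print(f"KR-20 = {coefficient:.3f}")
```

In this invented data set half of the items contribute nothing to Σpq, and the resulting coefficient is very high even though it says nothing about the consistency of response speed.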
Criterion-referenced tests
A criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective. Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion. For example, the would-be pilot masters on-ground skills before attempting to master in-flight skills. Scores on criterion-referenced tests tend to be interpreted in pass–fail (or, perhaps more accurately, “master-failed-to-master”) terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.
Traditional techniques of estimating reliability employ measures that take into account scores on the entire test. Recall that a test-retest reliability estimate is based on the correlation between the total scores on two administrations of the same test. In alternate-forms reliability, a reliability estimate is based on the correlation between the two total scores on the two forms. In split-half reliability, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman–Brown formula to obtain a reliability estimate of the whole test. Although there are exceptions, such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests. To understand why, recall that reliability is defined as the proportion of total variance ($\sigma^2$) attributable to true variance ($\sigma^2_{th}$). Total variance in a test score distribution equals the sum of the true variance plus the error variance ($\sigma^2_e$): $\sigma^2 = \sigma^2_{th} + \sigma^2_e$.
A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.
As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance. Therefore, traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted. An example might be a situation in which the same test is being used at different stages in some program—training, therapy, or the like—and so variability in scores could reasonably be expected. Statistical techniques useful in determining the reliability of criterion-referenced tests are discussed in great detail in many sources devoted to that subject (e.g., Hambleton & Jurgensen, 1990).
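The dependence of a traditional reliability estimate on score variability can be illustrated with simple arithmetic. The sketch below treats reliability as true variance divided by the sum of true variance and error variance, holds a hypothetical error variance constant at 4, and lets the true variance shrink, as it might when nearly all examinees on a mastery test reach the criterion. The variance values are assumptions chosen for illustration.

```python
def reliability(true_variance, error_variance):
    """Proportion of total variance that is attributable to true variance."""
    return true_variance / (true_variance + error_variance)

error_variance = 4.0  # hypothetical, held constant across the three cases
results = {}
for true_variance in (36.0, 4.0, 1.0):
    results[true_variance] = reliability(true_variance, error_variance)
    print(f"true variance {true_variance:>4}: "
          f"reliability = {results[true_variance]:.2f}")
```

The measurement error is the same in all three cases, yet the coefficient falls from .90 to .50 to .20 as individual differences shrink.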
The True Score Model of Measurement and Alternatives to It
Thus far—and throughout this book, unless specifically stated otherwise—the model we have assumed to be operative is classical test theory (CTT) , also referred to as the true score (or classical) model of measurement. CTT is the most widely used and accepted model in the psychometric literature today—rumors of its demise have been greatly exaggerated (Zickar & Broadfoot, 2009). One of the reasons it has remained the most widely used model has to do with its simplicity, especially when one considers the complexity of other proposed models of measurement. Comparing CTT to IRT, for example, Streiner (2010) mused, “CTT is much simpler to understand than IRT; there aren’t formidable-looking equations with exponentiations, Greek letters, and other arcane symbols” (p. 185). Additionally, the CTT notion that everyone has a “true score” on a test has had, and continues to have, great intuitive appeal. Of course, exactly how to define this elusive true score has been a matter of sometimes contentious debate. For our purposes, we will define true score as a value that according to classical test theory genuinely reflects an individual’s ability (or trait) level as measured by a particular test. Let’s emphasize here that this value is indeed very test dependent. A person’s “true score” on one intelligence test, for example, can vary greatly from that same person’s “true score” on another intelligence test. Similarly, if “Form D” of an ability test contains items that the testtaker finds to be much more difficult than those on “Form E” of that test, then there is a good chance that the testtaker’s true score on Form D will be lower than that on Form E. The same holds for true scores obtained on different tests of personality. One’s true score on one test of extraversion, for example, may not bear much resemblance to one’s true score on another test of extraversion. 
Comparing a testtaker’s scores on two different tests purporting to measure the same thing requires a sophisticated knowledge of the properties of each of the two tests, as well as some rather complicated statistical procedures designed to equate the scores.
Another aspect of the appeal of CTT is that its assumptions allow for its application in most situations (Hambleton & Swaminathan, 1985). The fact that CTT assumptions are rather easily met and therefore applicable to so many measurement situations can be advantageous, especially for the test developer in search of an appropriate model of measurement for a particular application. Still, in psychometric parlance, CTT assumptions are characterized as “weak”—this precisely because its assumptions are so readily met. By contrast, the assumptions in another model of measurement, item response theory (IRT), are more difficult to meet. As a consequence, you may read of IRT assumptions being characterized in terms such as “strong,” “hard,” “rigorous,” and “robust.” A final advantage of CTT over any other model of measurement has to do with its compatibility and ease of use with widely used statistical techniques (as well as most currently available data analysis software). Factor analytic techniques, whether exploratory or confirmatory, are all “based on the CTT measurement foundation” (Zickar & Broadfoot, 2009, p. 52).
For all of its appeal, measurement experts have also listed many problems with CTT. For starters, one problem with CTT has to do with its assumption concerning the equivalence of all items on a test; that is, all items are presumed to be contributing equally to the score total. This assumption is questionable in many cases, and particularly questionable when doubt exists as to whether the scaling of the instrument in question is genuinely interval level in nature. Another problem has to do with the length of tests that are developed using a CTT model. Whereas test developers favor shorter rather than longer tests (as do most testtakers), the assumptions inherent in CTT favor the development of longer rather than shorter tests. For these reasons, as well as others, alternative measurement models have been developed. Below we briefly describe domain sampling theory and generalizability theory. We will then describe in greater detail item response theory (IRT), a measurement model that some believe is a worthy successor to CTT (Borsboom, 2005; Harvey & Hammer, 1999).
Domain sampling theory and generalizability theory
The 1950s saw the development of a viable alternative to CTT. It was originally referred to as domain sampling theory and is better known today in one of its many modified forms as generalizability theory. As set forth by Tryon (1957), the theory of domain sampling rebels against the concept of a true score existing with respect to the measurement of psychological constructs. Whereas those who subscribe to CTT seek to estimate the portion of a test score that is attributable to error, proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In domain sampling theory, a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test. In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.
In one modification of domain sampling theory called generalizability theory, a “universe score” replaces that of a “true score” (Shavelson et al., 1989). Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration. According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model. Cronbach (1970) explained as follows:
“What is Mary’s typing ability?” This must be interpreted as “What would Mary’s word processing score on this be if a large number of measurements on the test were collected and averaged?” The particular test score Mary earned is just one out of a universe of possible observations. If one of these scores is as acceptable as the next, then the mean, called the universe score and symbolized here by Mp (mean for person p), would be the most appropriate statement of Mary’s performance in the type of situation the test represents.
The universe is a collection of possible measures “of the same kind,” but the limits of the collection are determined by the investigator’s purpose. If he needs to know Mary’s typing ability on May 5 (for example, so that he can plot a learning curve that includes one point for that day), the universe would include observations on that day and on that day only. He probably does want to generalize over passages, testers, and scorers—that is to say, he would like to know Mary’s ability on May 5 without reference to any particular passage, tester, or scorer… .
The person will ordinarily have a different universe score for each universe. Mary’s universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May… . Some testers call the average over a large number of comparable observations a “true score”; e.g., “Mary’s true typing rate on 3-minute tests.” Instead, we speak of a “universe score” to emphasize that what score is desired depends on the universe being considered. For any measure there are many “true scores,” each corresponding to a different universe.
When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is “accurate,” or “reliable,” or “generalizable.” And since the observations then also agree with each other, we say that they are “consistent” and “have little error variance.” To have so many terms is confusing, but not seriously so. The term most often used in the literature is “reliability.” The author prefers “generalizability” because that term immediately implies “generalization to what?” … There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance, or leave two or more sources entangled. (Cronbach, 1970, pp. 153–154)
How can these ideas be applied? Cronbach and his colleagues suggested that tests be developed with the aid of a generalizability study followed by a decision study. A generalizability study examines how generalizable scores from a particular test are if the test is administered in different situations. Stated in the language of generalizability theory, a generalizability study examines how much of an impact different facets of the universe have on the test score. Is the test score affected by group as opposed to individual administration? Is the test score affected by the time of day in which the test is administered? The influence of particular facets on the test score is represented by coefficients of generalizability. These coefficients are similar to reliability coefficients in the true score model.
After the generalizability study is done, Cronbach et al. (1972) recommended that test developers do a decision study, which involves the application of information from the generalizability study. In the decision study, developers examine the usefulness of test scores in helping the test user make decisions. In practice, test scores are used to guide a variety of decisions, from placing a child in special education to hiring new employees to discharging mental patients from the hospital. The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use. Why is this so important? Cronbach (1970) noted:
The decision that a student has completed a course or that a patient is ready for termination of therapy must not be seriously influenced by chance errors, temporary variations in performance, or the tester’s choice of questions. An erroneous favorable decision may be irreversible and may harm the person or the community. Even when reversible, an erroneous unfavorable decision is unjust, disrupts the person’s morale, and perhaps retards his development. Research, too, requires dependable measurement. An experiment is not very informative if an observed difference could be accounted for by chance variation. Large error variance is likely to mask a scientifically important outcome. Taking a better measure improves the sensitivity of an experiment in the same way that increasing the number of subjects does. (p. 152)
Generalizability theory has not replaced CTT. Perhaps one of its chief contributions has been its emphasis on the fact that a test’s reliability does not reside within the test itself. From the perspective of generalizability theory, a test’s reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.
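The generalizability-study and decision-study ideas described above can be sketched for the simplest possible case: one facet (items) crossed with persons. The sketch below is only an illustration under stated assumptions. The ratings are invented, the function names `g_study` and `g_coefficient` are hypothetical, and estimating variance components from ANOVA mean squares is one common approach rather than a procedure taken from Cronbach et al. (1972).

```python
from statistics import mean

def g_study(scores):
    """Estimate variance components for a persons x items design.

    `scores` holds one row per person and one column per item
    (one observation per cell). Returns (var_person, var_item, var_residual).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[j] for row in scores) for j in range(n_i)]

    # Sums of squares for persons, items, and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Sampling error can push an estimate below zero; clip at zero.
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    return var_p, var_i, ms_res

def g_coefficient(var_p, var_res, n_items):
    """Relative generalizability coefficient for a test of n_items items."""
    return var_p / (var_p + var_res / n_items)

# Hypothetical data: 5 persons rated on 4 items.
scores = [
    [4, 5, 3, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]
var_p, var_i, var_res = g_study(scores)

# Decision-study question: how dependable would scores be
# if the test had 2, 4, or 8 items?
for n in (2, 4, 8):
    print(n, round(g_coefficient(var_p, var_res, n), 3))
```

The first function plays the role of the generalizability study (how much do persons and items each contribute to score variation?), and the loop at the end plays the role of a very small decision study. With more facets, such as scorers or occasions, the same logic applies but the decomposition has more terms.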
Item response theory (IRT)
Another alternative to the true score model is item response theory (IRT; Lord & Novick, 1968; Lord, 1980). The procedures of item response theory provide a way to model the probability that a person with X ability will be able to perform at a level of Y. Stated in terms of personality assessment, it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it. Because so often the psychological or educational construct being measured is physically unobservable (stated another way, is latent) and because the construct being measured may be a trait (it could also be something else, such as an ability), a synonym for IRT in the academic literature is latent-trait theory. Let’s note at the outset, however, that IRT is not a term used to refer to a single theory or method. Rather, it refers to a family of theories and methods (and quite a large family at that), with many other names used to distinguish specific approaches. There are well over a hundred varieties of IRT models. Each model is designed to handle data with certain assumptions and data characteristics.
Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item’s level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics. “Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended. In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills. The characteristic of difficulty as applied to a test item may also refer to physical difficulty, that is, how hard or easy it is for a person to engage in a particular activity. Consider in this context three items on a hypothetical “Activities of Daily Living Questionnaire” (ADLQ), a true–false questionnaire designed to tap the extent to which respondents are physically able to participate in activities of daily living. Item 1 of this test is “I am able to walk from room to room in my home.” Item 2 is “I require assistance to sit, stand, and walk.” Item 3 is “I am able to jog one mile a day, seven days a week.” With regard to difficulty related to mobility, the respondent who answers true to item 1 and false to item 2 may be presumed to have more mobility than the respondent who answers false to item 1 and true to item 2. In classical test theory, each of these items might be scored with 1 point awarded to responses indicative of mobility and 0 points for responses indicative of a lack of mobility. Within IRT, however, responses indicative of mobility (as opposed to a lack of mobility or impaired mobility) may be assigned different weights. A true response to item 1 may therefore earn more points than a false response to item 2, and a true response to item 3 may earn more points than a true response to item 1.
In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured. Consider two more ADLQ items: item 4, “My mood is generally good”; and item 5, “I am able to walk one block on flat ground.” Which of these two items do you think would be more discriminating in terms of the respondent’s physical abilities? If you answered “item 5” then you are correct. And if you were developing this questionnaire within an IRT framework, you would probably assign differential weight to the value of these two items. Item 5 would be given more weight for the purpose of estimating a person’s level of physical activity than item 4. Again, within the context of classical test theory, all items of the test might be given equal weight and scored, for example, 1 if indicative of the ability being measured and 0 if not indicative of that ability.
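To make difficulty and discrimination concrete, the sketch below uses the two-parameter logistic (2PL) model, one commonly used member of the IRT family. The text does not name a specific model, so the choice of 2PL is an assumption, as are the function name `p_endorse` and every parameter value assigned to the ADLQ-style items.

```python
import math

def p_endorse(theta, a, b):
    """Two-parameter logistic item response function.

    theta: the person's level on the latent trait
    a:     item discrimination (how sharply the item separates people)
    b:     item difficulty (trait level at which P = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical difficulty values: walking room to room is "easy",
# jogging a mile every day is "hard".
walk_room = {"a": 1.5, "b": -1.0}
jog_mile = {"a": 1.5, "b": 2.0}

# Hypothetical discrimination values: a mood item says little about
# mobility (low a); a walking item says a lot (high a).
mood_item = {"a": 0.3, "b": 0.0}
walk_block = {"a": 2.0, "b": 0.0}

for theta in (-2.0, 0.0, 2.0):
    print(
        theta,
        round(p_endorse(theta, **walk_room), 2),
        round(p_endorse(theta, **jog_mile), 2),
    )

# How much does the endorsement probability change between a
# low-mobility (theta = -1) and a high-mobility (theta = +1) respondent?
mood_gap = p_endorse(1.0, **mood_item) - p_endorse(-1.0, **mood_item)
walk_gap = p_endorse(1.0, **walk_block) - p_endorse(-1.0, **walk_block)
print(round(mood_gap, 2), round(walk_gap, 2))
```

In this sketch a harder item has a lower endorsement probability at every trait level, and a more discriminating item shows a larger gap between low and high scorers, which is the sense in which item 5 would carry more information about physical ability than item 4.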
A number of different IRT models exist to handle data resulting from the administration of tests with various characteristics and in various formats. For example, there are IRT models designed to handle data resulting from the administration of tests with dichotomous test items (test items or questions that can be answered with only one of two alternative responses, such as true–false, yes–no, or correct–incorrect questions). There are IRT models designed to handle data resulting from the administration of tests with polytomous test items (test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct). Other IRT models exist to handle other types of data.
In general, latent-trait models differ in some important ways from CTT. For example, in CTT, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models. As Allen and Yen (1979, p. 240) have pointed out, “Latent-trait theories propose models that describe how the latent trait influences performance on each test item. Unlike test scores or true scores, latent traits theoretically can take on values from −∞ to +∞ [negative infinity to positive infinity].” Some IRT models have very specific and stringent assumptions about the underlying distribution. In one group of IRT models developed by the Danish mathematician Georg Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test. A shorthand reference to these types of models is “Rasch,” so reference to the Rasch model is a reference to an IRT model with very specific assumptions about the underlying distribution.
The psychometric advantages of IRT have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers. It is a model that in recent years has found increasing application in standardized tests, professional licensing examinations, and questionnaires used in behavioral and social sciences (De Champlain, 2010). However, the mathematical sophistication of the approach has made it out of reach for many everyday users of tests such as classroom teachers or “mom and pop” employers (Reise & Henson, 2003). To learn more about the approach that Roid (2006) once characterized as having fostered “new rules of measurement” for ability testing, ask your instructor to access the Instructor Resources within Connect and check out OOBAL-5-B2, “Item Response Theory (IRT).” More immediately, you can meet a “real-life” user of IRT in this chapter’s Meet an Assessment Professional feature.
MEET AN ASSESSMENT PROFESSIONAL
Meet Dr. Bryce B. Reeve
I use my skills and training as a psychometrician to design questionnaires and studies to capture the burden of cancer and its treatment on patients and their families… . The types of questionnaires I help to create measure a person’s health-related quality of life (HRQOL). HRQOL is a multidimensional construct capturing such domains as physical functioning, mental well-being, and social well-being. Different cancer types and treatments for those cancers may have different impact on the magnitude and which HRQOL domain is affected. All cancers can impact a person’s mental health with documented increases in depressive symptoms and anxiety… . There may also be positive impacts of cancer as some cancer survivors experience greater social well-being and appreciation of life. Thus, our challenge is to develop valid and precise measurement tools that capture these changes in patients’ lives. Psychometrically strong measures also allow us to evaluate the impact of new behavioral or pharmacological interventions developed to improve quality of life. Because many patients in our research studies are ill, it is important to have very brief questionnaires to minimize their burden responding to a battery of questionnaires.
… we … use both qualitative and quantitative methodologies to design … HRQOL instruments. We use qualitative methods like focus groups and cognitive interviewing to make sure we have captured the experiences and perspectives of cancer patients and to write questions that are comprehendible to people with low literacy skills or people of different cultures. We use quantitative methods to examine how well individual questions and scales perform for measuring the HRQOL domains. Specifically, we use classical test theory, factor analysis, and item response theory (IRT) to: (1) develop and refine questionnaires; (2) identify the performance of instruments across different age groups, males and females, and cultural/racial groups; and (3) to develop item banks which allow for creating standardized questionnaires or administering computerized adaptive testing (CAT).
Bryce B. Reeve, Ph.D., U.S. National Cancer Institute © Bryce B. Reeve/National Institute of Health
I use IRT models to get an in-depth look as to how questions and scales perform in our cancer research studies. [Using IRT], we were able to reduce a burdensome 21-item scale down to a brief 10-item scale… .
Differential item function (DIF) is a key methodology to identify … biased items in questionnaires. I have used IRT modeling to examine DIF in item responses on many HRQOL questionnaires. It is especially important to evaluate DIF in questionnaires that have been translated to multiple languages for the purpose of conducting international research studies. An instrument may be translated to have the same words in multiple languages, but the words themselves may have entirely different meaning to people of different cultures. For example, researchers at the University of Massachusetts found Chinese respondents gave lower satisfaction ratings of their medical doctors than non-Chinese. In a review of the translation, the “Excellent” response category translated into Chinese as “God-like.” IRT modeling gives me the ability to not only detect DIF items, but the flexibility to correct for bias as well. I can use IRT to look at unadjusted and adjusted IRT scores to see the effect of the DIF item without removing the item from the scale if the item is deemed relevant… .
The greatest challenges I found to greater application or acceptance of IRT methods in health care research are the complexities of the models themselves and lack of easy-to-understand resources and tools to train researchers. Many researchers have been trained in classical test theory statistics, are comfortable interpreting these statistics, and can use readily available software to generate easily familiar summary statistics, such as Cronbach’s coefficient α or item-total correlations. In contrast, IRT modeling requires an advanced knowledge of measurement theory to understand the mathematical complexities of the models, to determine whether the assumptions of the IRT models are met, and to choose the model from within the large family of IRT models that best fits the data and the measurement task at hand. In addition, the supporting software and literature are not well adapted for researchers outside the field of educational testing.
Read more of what Dr. Reeve had to say—his complete essay—through the Instructor Resources within Connect.
Used with permission of Bryce B. Reeve.
Reliability and Individual Scores
The reliability coefficient helps the test developer build an adequate measuring instrument, and it helps the test user select a suitable test. However, the usefulness of the reliability coefficient does not end with test construction and selection. By employing the reliability coefficient in the formula for the standard error of measurement, the test user now has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score.
The Standard Error of Measurement
The standard error of measurement, often abbreviated as SEM or $SE_M$, provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement. In general, the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
To illustrate the utility of the SEM, let’s revisit The Rochester Wrenchworks (TRW) and reintroduce Mary (from Cronbach’s excerpt earlier in this chapter), who is now applying for a job as a word processor. To be hired at TRW as a word processor, a candidate must be able to word-process accurately at the rate of 50 words per minute. The personnel office administers a total of seven brief word-processing tests to Mary over the course of seven business days. In words per minute, Mary’s scores on each of the seven tests are as follows:
52 55 39 56 35 50 54
If you were in charge of hiring at TRW and you looked at these seven scores, you might logically ask, “Which of these scores is the best measure of Mary’s ‘true’ word-processing ability?” And more to the point, “Which is her ‘true’ score?”
The “true” answer to this question is that we cannot conclude with absolute certainty from the data we have exactly what Mary’s true word-processing ability is. We can, however, make an educated guess. Our educated guess would be that her true word-processing ability is equal to the mean of the distribution of her word-processing scores plus or minus a number of points accounted for by error in the measurement process. We do not know how many points are accounted for by error in the measurement process. The best we can do is estimate how much error entered into a particular test score.
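The starting point for that educated guess, the mean of Mary’s seven scores, can be computed directly. The sketch below uses only the scores listed above; the variable names are illustrative. The sample standard deviation is included to show how much her observed scores spread out. Reading it as an estimate of measurement error would assume that the seven tests were equivalent and that nothing about Mary changed across the seven days, which the text does not establish.

```python
from statistics import mean, stdev

# Mary's seven word-processing scores, in words per minute.
mary_wpm = [52, 55, 39, 56, 35, 50, 54]

avg = mean(mary_wpm)      # point estimate of her typical performance
spread = stdev(mary_wpm)  # sample standard deviation of the seven scores

print(round(avg, 1), round(spread, 1))  # prints: 48.7 8.3
```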
The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the standard error of measurement as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests. Also known as the standard error of a score and denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel. In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained. But where on the continuum of possible scores is this obtained score? If the standard deviation for the distribution of test scores is known (or can be calculated) and if an estimate of the reliability of the test is known (or can be calculated), then an estimate of the standard error of a particular score (or, the standard error of measurement) can be determined by the following formula:
$$\sigma_{\text{meas}} = \sigma\sqrt{1 - r_{xx}}$$
where σmeas is equal to the standard error of measurement, σ is equal to the standard deviation of test scores by the group of testtakers, and rxx is equal to the reliability coefficient of the test. The standard error of measurement allows us to estimate, with a specific level of confidence, the range in which the true score is likely to exist.
If, for example, a spelling test has a reliability coefficient of .84 and a standard deviation of 10, then
$$\sigma_{\text{meas}} = 10\sqrt{1 - .84} = 10\sqrt{.16} = 4$$
In order to use the standard error of measurement to estimate the range of the true score, we make an assumption: If the individual were to take a large number of equivalent tests, scores on those tests would tend to be normally distributed, with the individual’s true score as the mean. Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if an individual took additional equivalent tests:
· approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1σmeas of the true score;
· approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2σmeas of the true score;
· approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3σmeas of the true score.
Of course, we don’t know the true score for any individual testtaker, so we must estimate it. The best estimate available of the individual’s true score on the test is the test score already obtained. Thus, if a student achieved a score of 50 on one spelling test and if the test had a standard error of measurement of 4, then—using 50 as the point estimate—we can be:
· 68% (actually, 68.26%) confident that the true score falls within 50 ± 1σmeas (or between 46 and 54, including 46 and 54);
· 95% (actually, 95.44%) confident that the true score falls within 50 ± 2σmeas (or between 42 and 58, including 42 and 58);
· 99% (actually, 99.74%) confident that the true score falls within 50 ± 3σmeas (or between 38 and 62, including 38 and 62).
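The spelling-test figures used above (standard deviation 10, reliability .84, observed score 50) can be reproduced with a few lines of code. This is a minimal sketch of the relationship between reliability, the standard error of measurement, and the score bands; the function names are illustrative, not part of any testing library.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability), for reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sem, k):
    """Band of observed score plus or minus k standard errors of measurement."""
    return observed - k * sem, observed + k * sem

sem = standard_error_of_measurement(sd=10, reliability=0.84)

# Bands of 1, 2, and 3 SEMs around an observed score of 50.
for k, label in ((1, "68%"), (2, "95%"), (3, "99%")):
    low, high = score_band(50, sem, k)
    print(f"{label}: {low:.0f} to {high:.0f}")
```

The same function shows the inverse relationship between reliability and the SEM: with the standard deviation held at 15, raising the reliability from .64 to .96 lowers the SEM from 9 to 3.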
The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability. If the standard deviation of a test is held constant, then the smaller the σmeas, the more reliable the test will be; as rxx increases, the σmeas decreases. For example, when a reliability coefficient equals .64 and σ equals 15, the standard error of measurement equals 9:
$$\sigma_{\text{meas}} = 15\sqrt{1 - .64} = 15\sqrt{.36} = 9$$
With a reliability coefficient equal to .96 and σ still equal to 15, the standard error of measurement decreases to 3:
$$\sigma_{\text{meas}} = 15\sqrt{1 - .96} = 15\sqrt{.04} = 3$$
In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores. For example, intelligence tests are given as part of the assessment of individuals for intellectual disability. One of the criteria for mental retardation is an IQ score of 70 or below (when the mean is 100 and the standard deviation is 15) on an individually administered intelligence test (American Psychiatric Association, 1994). One question that could be asked about these tests is how scores that are close to the cutoff value of 70 should be treated. Specifically, how high above 70 must a score be for us to conclude confidently that the individual is unlikely to be retarded? Is 72 clearly above the retarded range, so that if the person were to take a parallel form of the test, we could be confident that the second score would be above 70? What about a score of 75? A score of 79?
Useful in answering such questions is an estimate of the amount of error in an observed test score. The standard error of measurement provides such an estimate. Further, the standard error of measurement is useful in establishing what is called a confidence interval: a range or band of test scores that is likely to contain the true score.
Consider an application of a confidence interval with one hypothetical measure of adult intelligence. The manual for the test provides a great deal of information relevant to the reliability of the test as a whole as well as more specific reliability-related information for each of its subtests. As reported in the manual, the standard deviation is 3 for the subtest scaled scores and 15 for IQ scores. Across all of the age groups in the normative sample, the average reliability coefficient for the Full Scale IQ (FSIQ) is .98, and the average standard error of measurement for the FSIQ is 2.3.
Knowing an individual testtaker’s FSIQ score and his or her age, we can calculate a confidence interval. For example, suppose a 22-year-old testtaker obtained a FSIQ of 75. The test user can be 95% confident that this testtaker’s true FSIQ falls in the range of 70 to 80. This is so because the 95% confidence interval is set by taking the observed score of 75 plus or minus 1.96 multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this information in hand, the 95% confidence interval is calculated as follows:
$$75 \pm 1.96\,\sigma_{\text{meas}} = 75 \pm 1.96(2.37) = 75 \pm 4.645$$
The calculated interval of 4.645 is rounded to the nearest whole number, 5. We can therefore be 95% confident that this testtaker’s true FSIQ on this particular test of intelligence lies somewhere in the range of the observed score of 75 plus or minus 5, or somewhere in the range of 70 to 80.
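The same calculation can be applied to the scores near the cutoff of 70 that were raised earlier (72, 75, and 79). The sketch below assumes the standard error of measurement of 2.37 quoted for a 22-year-old testtaker and derives the 1.96 multiplier from the standard normal distribution instead of hard-coding it. The function name is illustrative, and the intervals are left unrounded, whereas the text rounds the margin to 5.

```python
from statistics import NormalDist

def confidence_interval(observed, sem, level=0.95):
    """Two-sided confidence interval around an observed score."""
    # For level = 0.95 this is the 97.5th percentile, about 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z * sem
    return observed - margin, observed + margin

SEM_FSIQ = 2.37  # value quoted in the text for a 22-year-old testtaker

for score in (72, 75, 79):
    low, high = confidence_interval(score, SEM_FSIQ)
    print(score, round(low, 2), round(high, 2))
```

Comparing each lower bound with the cutoff of 70 gives one way to think about how far above the cutoff a score must be before the interval no longer reaches it. How to act on a borderline interval remains a matter of professional judgment, not arithmetic.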
In the interest of increasing your SEM “comfort level,” consider the data presented in Table 5–5. These are SEMs for selected age ranges and selected types of IQ measurements as reported in the Technical Manual for the Stanford-Binet Intelligence Scales, fifth edition (SB5). When presenting these and related data, Roid (2003c, p. 65) noted: “Scores that are more precise and consistent have smaller differences between true and observed scores, resulting in lower SEMs.” Given this, just think: What hypotheses come to mind regarding SB5 IQ scores at ages 5, 10, 15, and 80+?
Table 5–5 Standard Errors of Measurement of SB5 IQ Scores at Ages 5, 10, 15, and 80+
|IQ Type|Age 5|Age 10|Age 15|Age 80+|
|Full Scale IQ|2.12|2.60|2.12|2.12|
|Abbreviated Battery IQ|4.24|5.20|4.50|3.00|
The standard error of measurement can be used to set the confidence interval for a particular score or to determine whether a score is significantly different from a criterion (such as the cutoff score of 70 described previously). But the standard error of measurement cannot be used to compare scores. So, how do test users compare scores?
The Standard Error of the Difference Between Two Scores
Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next. The amount of error in a specific test score is embodied in the standard error of measurement. But scores can change from one testing to the next for reasons other than error.
True differences in the characteristic being measured can also affect test scores. These differences may be of great interest, as in the case of a personnel officer who must decide which of many applicants to hire. Indeed, such differences may be hoped for, as in the case of a psychotherapy researcher who hopes to prove the effectiveness of a particular approach to therapy. Comparisons between scores are made using the standard error of the difference, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant. As you are probably aware from your course in statistics, custom in the field of psychology dictates that if the probability is more than 5% that the difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference. A more rigorous standard is the 1% standard. Applying the 1% standard, no statistically significant difference would be deemed to exist unless the observed difference could have occurred by chance alone less than one time in a hundred.
The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
1. How did this individual’s performance on test 1 compare with his or her performance on test 2?
2. How did this individual’s performance on test 1 compare with someone else’s performance on test 1?
3. How did this individual’s performance on test 1 compare with someone else’s performance on test 2?
As you might have expected, when comparing scores achieved on the different tests, it is essential that the scores be converted to the same scale. The formula for the standard error of the difference between two scores is
$$\sigma_{\text{diff}} = \sqrt{\sigma_{\text{meas}_1}^2 + \sigma_{\text{meas}_2}^2}$$
where σdiff is the standard error of the difference between two scores, $\sigma_{\text{meas}_1}^2$ is the squared standard error of measurement for test 1, and $\sigma_{\text{meas}_2}^2$ is the squared standard error of measurement for test 2. If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes
$$\sigma_{\text{diff}} = \sigma\sqrt{2 - r_1 - r_2}$$
where r1 is the reliability coefficient of test 1, r2 is the reliability coefficient of test 2, and σ is the standard deviation. Note that both tests would have the same standard deviation because they must be on the same scale (or be converted to the same scale) before a comparison can be made.
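The substitution that links the two forms of the formula is short enough to write out. It relies on the assumption stated above that both tests share the same standard deviation σ, together with the definition of the standard error of measurement, σmeas = σ√(1 − r):

```latex
\begin{align*}
\sigma_{\text{meas}_1}^2 &= \sigma^2 (1 - r_1), \qquad
\sigma_{\text{meas}_2}^2 = \sigma^2 (1 - r_2) \\
\sigma_{\text{diff}} &= \sqrt{\sigma_{\text{meas}_1}^2 + \sigma_{\text{meas}_2}^2} \\
&= \sqrt{\sigma^2 (1 - r_1) + \sigma^2 (1 - r_2)} \\
&= \sigma \sqrt{2 - r_1 - r_2}
\end{align*}
```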
The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores. This also makes good sense: If two scores each contain error such that in each case the true score could be higher or lower, then we would want the two scores to be further apart before we conclude that there is a significant difference between them.
The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean. If we wish to be 95% confident that the two scores are different, we would want them to be separated by 2 standard errors of the difference. A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.
As an illustration of the use of the standard error of the difference between two scores, consider the situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer in this hypothetical situation decides to use a new published test we will call the Safety-Mindedness Test (SMT) to screen applicants for the position. After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT. The personnel officer narrows the search for the vice president to the two highest scorers on the SMT: Moe, who scored 125, and Larry, who scored 134. Assuming the measured reliability of this test to be .92 and its standard deviation to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:
$$\sigma_{\text{diff}} = 14\sqrt{2 - .92 - .92} = 14\sqrt{.16} = 5.6$$
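$$\sigma_{\mathrm{diff}} = \sigma\sqrt{2 - r_1 - r_2} = 14\sqrt{2 - .92 - .92} = 14\sqrt{.16} = 5.6$$
Note that in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test.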
What does this standard error of the difference mean? For any standard error of the difference, we can be:
· 68% confident that two scores differing by 1σdiff represent true score differences;
· 95% confident that two scores differing by 2σdiff represent true score differences;
· 99.7% confident that two scores differing by 3σdiff represent true score differences.
Applying this information to the standard error of the difference just computed for the SMT, we see that the personnel officer can be:
· 68% confident that two scores differing by 5.6 represent true score differences;
· 95% confident that two scores differing by 11.2 represent true score differences;
· 99.7% confident that two scores differing by 16.8 represent true score differences.
The difference between Larry’s and Moe’s scores is only 9 points, not a large enough difference for the personnel officer to conclude with 95% confidence that the two individuals have true scores that differ on this test. Stated another way: If Larry and Moe were to take a parallel form of the SMT, then the personnel officer could not be 95% confident that, at the next testing, Larry would again outperform Moe. The personnel officer in this example would have to resort to other means to decide whether Moe, Larry, or someone else would be the best candidate for the position (Curly has been patiently waiting in the wings).
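The arithmetic in the SMT example can be checked with a short script. This is only a sketch: the function name `standard_error_of_difference` and the variable names are illustrative choices, and the "2 standard errors for 95% confidence" threshold is the rule of thumb stated in the text rather than an exact critical value.

```python
import math


def standard_error_of_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the same scale.

    sd is the common standard deviation; r1 and r2 are the reliability
    coefficients of the two tests being compared.
    """
    return sd * math.sqrt(2 - r1 - r2)


# Values from the SMT example. Both scores come from the same test,
# so the same reliability coefficient is used twice.
sed = standard_error_of_difference(sd=14, r1=0.92, r2=0.92)
difference = 134 - 125  # Larry's score minus Moe's score

# Rule of thumb from the text: about 2 standard errors for 95% confidence.
needed_for_95 = 2 * sed
significant_at_95 = difference >= needed_for_95

print(round(sed, 1), round(needed_for_95, 1), significant_at_95)
# prints: 5.6 11.2 False
```

The 9-point gap falls short of the 11.2 points required, which matches the conclusion drawn for Larry and Moe.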
JUST THINK . . .
With all of this talk about Moe, Larry, and Curly, please tell us that you have not forgotten about Mary. You know, Mary from the Cronbach quote on page 165—yes, that Mary. Should she get the job at TRW? If your instructor thinks it would be useful to do so, do the math before responding.
As a postscript to the preceding example, suppose Larry got the job primarily on the basis of data from our hypothetical SMT. And let’s further suppose that it soon became all too clear that Larry was the hands-down absolute worst vice president of safety that the company had ever seen. Larry spent much of his time playing practical jokes on fellow corporate officers, and he spent many of his off-hours engaged in his favorite pastime, flagpole sitting. The personnel officer might then have very good reason to question how well the instrument called the Safety-Mindedness Test truly measured safety-mindedness. Or, to put it another way, the personnel officer might question the validity of the test. Not coincidentally, the subject of test validity is taken up in the next chapter.
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:
Modules chapter 3 wk2 p655
C H A P T E R 3
A Statistics Refresher
From the red-pencil number circled at the top of your first spelling test to the computer printout of your college entrance examination scores, tests and test scores touch your life. They seem to reach out from the paper and shake your hand when you do well and punch you in the face when you do poorly. They can point you toward or away from a particular school or curriculum. They can help you to identify strengths and weaknesses in your physical and mental abilities. They can accompany you on job interviews and influence a job or career choice.
JUST THINK . . .
For most people, test scores are an important fact of life. But what makes those numbers so meaningful? In general terms, what information, ideally, should be conveyed by a test score?
In your role as a student, you have probably found that your relationship to tests has been primarily that of a testtaker. But as a psychologist, teacher, researcher, or employer, you may find that your relationship with tests is primarily that of a test user—the person who breathes life and meaning into test scores by applying the knowledge and skill to interpret them appropriately. You may one day create a test, whether in an academic or a business setting, and then have the responsibility for scoring and interpreting it. In that situation, or even from the perspective of one who would take that test, it’s essential to understand the theory underlying test use and the principles of test-score interpretation.
Test scores are frequently expressed as numbers, and statistical tools are used to describe, make inferences from, and draw conclusions about numbers. 1 In this statistics refresher, we cover scales of measurement, tabular and graphic presentations of data, measures of central tendency, measures of variability, aspects of the normal curve, and standard scores. If these statistics-related terms look painfully familiar to you, we ask your indulgence and ask you to remember that overlearning is the key to retention. Of course, if any of these terms appear unfamiliar, we urge you to learn more about them. Feel free to supplement the discussion here with a review of these and related terms in any good elementary statistics text. The brief review of statistical concepts that follows can in no way replace a sound grounding in basic statistics gained through an introductory course in that subject.
Scales of Measurement
We may formally define measurement as the act of assigning numbers or symbols to characteristics of things (people, events, whatever) according to rules. The rules used in assigning numbers are guidelines for representing the magnitude (or some other characteristic) of the object being measured. Here is an example of a measurement rule: Assign the number 12 to all lengths that are exactly the same length as a 12-inch ruler. A scale is a set of numbers (or other symbols) whose properties model empirical properties of the objects to which the numbers are assigned. 2
JUST THINK . . .
What is another example of a measurement rule?
There are various ways in which a scale can be categorized. One way of categorizing a scale is according to the type of variable being measured. Thus, a scale used to measure a continuous variable might be referred to as a continuous scale, whereas a scale used to measure a discrete variable might be referred to as a discrete scale. A continuous scale exists when it is theoretically possible to divide any of the values of the scale. A distinction must be made, however, between what is theoretically possible and what is practically desirable. The units into which a continuous scale will actually be divided may depend on such factors as the purpose of the measurement and practicality. In measurement to install venetian blinds, for example, it is theoretically possible to measure by the millimeter or even by the micrometer. But is such precision necessary? Most installers do just fine with measurement by the inch.
As an example of measurement using a discrete scale, consider mental health research that presorted subjects into one of two discrete groups: (1) previously hospitalized and (2) never hospitalized. Such a categorization scale would be characterized as discrete because it would not be accurate or meaningful to categorize any of the subjects in the study as anything other than “previously hospitalized” or “not previously hospitalized.”
JUST THINK . . .
The scale with which we are all perhaps most familiar is the common bathroom scale. How are a psychological test and a bathroom scale alike? How are they different? Your answer may change as you read on.
JUST THINK . . .
Assume the role of a test creator. Now write some instructions to users of your test that are designed to reduce to the absolute minimum any error associated with test scores. Be sure to include instructions regarding the preparation of the site where the test will be administered.
Measurement always involves error. In the language of assessment, error refers to the collective influence of all of the factors on a test score or measurement beyond those specifically measured by the test or measurement. As we will see, there are many different sources of error in measurement. Consider, for example, the score someone received on a test in American history. We might conceive of part of the score as reflecting the testtaker’s knowledge of American history and part of the score as reflecting error. The error part of the test score may be due to many different factors. One source of error might have been a distracting thunderstorm going on outside at the time the test was administered. Another source of error was the particular selection of test items the instructor chose to use for the test. Had a different item or two been used in the test, the testtaker’s score on the test might have been higher or lower. Error is very much an element of all measurement, and it is an element for which any theory of measurement must surely account.
Measurement using continuous scales always involves error. To illustrate why, let’s go back to the scenario involving venetian blinds. The length of the window measured to be 35.5 inches could, in reality, be 35.7 inches. The measuring scale is conveniently marked off in grosser gradations of measurement. Most scales used in psychological and educational assessment are continuous and therefore can be expected to contain this sort of error. The number or score used to characterize the trait being measured on a continuous scale should be thought of as an approximation of the “real” number. Thus, for example, a score of 25 on some test of anxiety should not be thought of as a precise measure of anxiety. Rather, it should be thought of as an approximation of the real anxiety score had the measuring instrument been calibrated to yield such a score. In such a case, perhaps the score of 25 is an approximation of a real score of, say, 24.7 or 25.44.
It is generally agreed that there are four different levels or scales of measurement. Within these levels or scales of measurement, assigned numbers convey different kinds of information. Accordingly, certain statistical manipulations may or may not be appropriate, depending upon the level or scale of measurement. 3
JUST THINK . . .
Acronyms like noir are useful memory aids. As you continue in your study of psychological testing and assessment, create your own acronyms to help remember related groups of information. Hey, you may even learn some French in the process.
The French word for black is noir (pronounced “‘nwăre”). We bring this up here only to call attention to the fact that this word is a useful acronym for remembering the four levels or scales of measurement. Each letter in noir is the first letter of the successively more rigorous levels: n stands for nominal, o for ordinal, i for interval, and r for ratio scales.
Nominal scales are the simplest form of measurement. These scales involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories. For example, in the specialty area of clinical psychology, a nominal scale in use for many years is the Diagnostic and Statistical Manual of Mental Disorders. Each disorder listed in that manual is assigned its own number. In a past version of that manual, the version really does not matter for the purposes of this example, the number 303.00 identified alcohol intoxication, and the number 307.00 identified stuttering. But these numbers were used exclusively for classification purposes and could not be meaningfully added, subtracted, ranked, or averaged. Hence, the middle number between these two diagnostic codes, 305.00, did not identify an intoxicated stutterer.
Individual test items may also employ nominal scaling, including yes/no responses. For example, consider the following test items:
Instructions: Answer either yes or no.
Are you actively contemplating suicide? __________
Are you currently under professional care for a psychiatric disorder? _______
Have you ever been convicted of a felony? _______
JUST THINK . . .
What are some other examples of nominal scales?
In each case, a yes or no response results in the placement into one of a set of mutually exclusive groups: suicidal or not, under care for psychiatric disorder or not, and felon or not. Arithmetic operations that can legitimately be performed with nominal data include counting for the purpose of determining how many cases fall into each category and a resulting determination of proportion or percentages. 4
JUST THINK . . .
What are some other examples of interval scales?
Like nominal scales, ordinal scales permit classification. However, in addition to classification, rank ordering on some characteristic is also permissible with ordinal scales. In business and organizational settings, job applicants may be rank-ordered according to their desirability for a position. In clinical settings, people on a waiting list for psychotherapy may be rank-ordered according to their need for treatment. In these examples, individuals are compared with others and assigned a rank (perhaps 1 to the best applicant or the most needy wait-listed client, 2 to the next, and so forth).
Although he may have never used the term ordinal scale, Alfred Binet, a developer of the intelligence test that today bears his name, believed strongly that the data derived from an intelligence test are ordinal in nature. He emphasized that what he tried to do with his test was not to measure people (as one might measure a person’s height), but merely to classify (and rank) people on the basis of their performance on the tasks. He wrote:
I have not sought . . . to sketch a method of measuring, in the physical sense of the word, but only a method of classification of individuals. The procedures which I have indicated will, if perfected, come to classify a person before or after such another person, or such another series of persons; but I do not believe that one may measure one of the intellectual aptitudes in the sense that one measures a length or a capacity. Thus, when a person studied can retain seven figures after a single audition, one can class him, from the point of his memory for figures, after the individual who retains eight figures under the same conditions, and before those who retain six. It is a classification, not a measurement . . . we do not measure, we classify. (Binet, cited in Varon, 1936, p. 41)
Assessment instruments applied to the individual subject may also use an ordinal form of measurement. The Rokeach Value Survey uses such an approach. In that test, a list of personal values—such as freedom, happiness, and wisdom—are put in order according to their perceived importance to the testtaker (Rokeach, 1973). If a set of 10 values is rank ordered, then the testtaker would assign a value of “1” to the most important and “10” to the least important.
Ordinal scales imply nothing about how much greater one ranking is than another. Even though ordinal scales may employ numbers or “scores” to represent the rank ordering, the numbers do not indicate units of measurement. So, for example, the performance difference between the first-ranked job applicant and the second-ranked applicant may be small while the difference between the second- and third-ranked applicants may be large. On the Rokeach Value Survey, the value ranked “1” may be handily the most important in the mind of the testtaker. However, ordering the values that follow may be difficult to the point of being almost arbitrary.
JUST THINK . . .
What are some other examples of ordinal scales?
Ordinal scales have no absolute zero point. In the case of a test of job performance ability, every testtaker, regardless of standing on the test, is presumed to have some ability. No testtaker is presumed to have zero ability. Zero is without meaning in such a test because the number of units that separate one testtaker’s score from another’s is simply not known. The scores are ranked, but the actual number of units separating one score from the next may be many, just a few, or practically none. Because there is no zero point on an ordinal scale, the ways in which data from such scales can be analyzed statistically are limited. One cannot average the qualifications of the first- and third-ranked job applicants, for example, and expect to come out with the qualifications of the second-ranked applicant.
In addition to the features of nominal and ordinal scales, interval scales contain equal intervals between numbers. Each unit on the scale is exactly equal to any other unit on the scale. But like ordinal scales, interval scales contain no absolute zero point. With interval scales, we have reached a level of measurement at which it is possible to average a set of measurements and obtain a meaningful result.
Scores on many tests, such as tests of intelligence, are analyzed statistically in ways appropriate for data at the interval level of measurement. The difference in intellectual ability represented by IQs of 80 and 100, for example, is thought to be similar to that existing between IQs of 100 and 120. However, if an individual were to achieve an IQ of 0 (something that is not even possible, given the way most intelligence tests are structured), that would not be an indication of zero (the total absence of) intelligence. Because interval scales contain no absolute zero point, a presumption inherent in their use is that no testtaker possesses none of the ability or trait (or whatever) being measured.
In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point. All mathematical operations can meaningfully be performed because there exist equal intervals between the numbers on the scale as well as a true or absolute zero point.
In psychology, ratio-level measurement is employed in some types of tests and test items, perhaps most notably those involving assessment of neurological functioning. One example is a test of hand grip, where the variable measured is the amount of pressure a person can exert with one hand (see Figure 3–1). Another example is a timed test of perceptual-motor ability that requires the testtaker to assemble a jigsaw-like puzzle. In such an instance, the time taken to successfully complete the puzzle is the measure that is recorded. Because there is a true zero point on this scale (that is, 0 seconds), it is meaningful to say that a testtaker who completes the assembly in 30 seconds has taken half the time of a testtaker who completed it in 60 seconds. In this example, it is meaningful to speak of a true zero point on the scale—but in theory only. Why? Just think . . .
Figure 3–1 Ratio-Level Measurement in the Palm of One’s Hand. Pictured above is a dynamometer, an instrument used to measure strength of hand grip. The examinee is instructed to squeeze the grips as hard as possible. The squeezing of the grips causes the gauge needle to move and reflect the number of pounds of pressure exerted. The highest point reached by the needle is the score. This is an example of ratio-level measurement. Someone who can exert 10 pounds of pressure (and earns a score of 10) exerts twice as much pressure as a person who exerts 5 pounds of pressure (and earns a score of 5). On this test it is possible to achieve a score of 0, indicating a complete lack of exerted pressure. Although it is meaningful to speak of a score of 0 on this test, we have to wonder about its significance. How might a score of 0 result? One way would be if the testtaker genuinely had paralysis of the hand. Another way would be if the testtaker was uncooperative and unwilling to comply with the demands of the task. Yet another way would be if the testtaker was attempting to malinger or “fake bad” on the test. Ratio scales may provide us “solid” numbers to work with, but some interpretation of the test data yielded may still be required before drawing any “solid” conclusions. © BanksPhotos/Getty Images RF
JUST THINK . . .
What are some other examples of ratio scales?
No testtaker could ever obtain a score of zero on this assembly task. Stated another way, no testtaker, not even The Flash (a comic-book superhero whose power is the ability to move at superhuman speed), could assemble the puzzle in zero seconds.
Measurement Scales in Psychology
The ordinal level of measurement is most frequently used in psychology. As Kerlinger (1973, p. 439) put it: “Intelligence, aptitude, and personality test scores are, basically and strictly speaking, ordinal. These tests indicate with more or less accuracy not the amount of intelligence, aptitude, and personality traits of individuals, but rather the rank-order positions of the individuals.” Kerlinger allowed that “most psychological and educational scales approximate interval equality fairly well,” though he cautioned that if ordinal measurements are treated as if they were interval measurements, then the test user must “be constantly alert to the possibility of gross inequality of intervals” (pp. 440–441).
Why would psychologists want to treat their assessment data as interval when those data would be better described as ordinal? Why not just say that they are ordinal? The attraction of interval measurement for users of psychological tests is the flexibility with which such data can be manipulated statistically. “What kinds of statistical manipulation?” you may ask.
In this chapter we discuss the various ways in which test data can be described or converted to make those data more manageable and understandable. Some of the techniques we’ll describe, such as the computation of an average, can be used if data are assumed to be interval- or ratio-level in nature but not if they are ordinal- or nominal-level. Other techniques, such as those involving the creation of graphs or tables, may be used with ordinal- or even nominal-level data.
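One way to keep the four levels straight is to treat them as an ordered hierarchy and record the lowest level at which each operation is described as meaningful. The sketch below does that. The names `Scale`, `MINIMUM_LEVEL`, and `is_meaningful` are hypothetical, and the table is a simplification of the discussion above, not an exhaustive rulebook.

```python
from enum import IntEnum


class Scale(IntEnum):
    """The four levels of measurement, ordered as in the acronym noir."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest level at which each operation is treated as meaningful in the text.
# This mapping is an illustrative summary, not a complete list.
MINIMUM_LEVEL = {
    "count": Scale.NOMINAL,       # how many cases fall into each category
    "proportion": Scale.NOMINAL,  # share of cases in each category
    "rank": Scale.ORDINAL,        # order cases on some characteristic
    "mean": Scale.INTERVAL,       # averaging requires equal intervals
    "ratio": Scale.RATIO,         # "twice as much" requires a true zero
}


def is_meaningful(operation, scale):
    """Return True if the operation is meaningful at the given scale level."""
    return scale >= MINIMUM_LEVEL[operation]


print(is_meaningful("mean", Scale.ORDINAL))   # prints: False
print(is_meaningful("mean", Scale.INTERVAL))  # prints: True
```

Because each level inherits the properties of the levels below it, a single comparison against the minimum level is enough.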
Suppose you have magically changed places with the professor teaching this course and that you have just administered an examination that consists of 100 multiple-choice items (where 1 point is awarded for each correct answer). The distribution of scores for the 25 students enrolled in your class could theoretically range from 0 (none correct) to 100 (all correct). A distribution may be defined as a set of test scores arrayed for recording or study. The 25 scores in this distribution are referred to as raw scores. As its name implies, a raw score is a straightforward, unmodified accounting of performance that is usually numerical. A raw score may reflect a simple tally, as in number of items responded to correctly on an achievement test. As we will see later in this chapter, raw scores can be converted into other types of scores. For now, let’s assume it’s the day after the examination and that you are sitting in your office looking at the raw scores listed in Table 3–1. What do you do next?
Table 3–1 Data from Your Measurement Course Test
|Student|Score (number correct)|
JUST THINK . . .
In what way do most of your instructors convey test-related feedback to students? Is there a better way they could do this?
One task at hand is to communicate the test results to your class. You want to do that in a way that will help students understand how their performance on the test compared to the performance of other students. Perhaps the first step is to organize the data by transforming it from a random listing of raw scores into something that immediately conveys a bit more information. Later, as we will see, you may wish to transform the data in other ways.
The data from the test could be organized into a distribution of the raw scores. One way the scores could be distributed is by the frequency with which they occur. In a frequency distribution, all scores are listed alongside the number of times each score occurred. The scores might be listed in tabular or graphic form. Table 3–2 lists the frequency of occurrence of each score in one column and the score itself in the other column.
Table 3–2 Frequency Distribution of Scores from Your Test
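A simple frequency distribution is straightforward to build in code. The sketch below uses Python's standard `collections.Counter`. The scores are invented for illustration and are not the data from Table 3–1, and the function name is a hypothetical choice.

```python
from collections import Counter


def simple_frequency_distribution(scores):
    """Return (score, frequency) pairs, highest score first."""
    counts = Counter(scores)
    return sorted(counts.items(), reverse=True)


# Hypothetical raw scores, NOT the data from Table 3-1.
scores = [72, 85, 72, 90, 61, 85, 72, 58]

for score, frequency in simple_frequency_distribution(scores):
    print(score, frequency)
```

Each distinct score appears once, paired with the number of times it occurred, which is the layout described for Table 3–2.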
Often, a frequency distribution is referred to as a simple frequency distribution to indicate that individual scores have been used and the data have not been grouped. Another kind of frequency distribution used to summarize data is a grouped frequency distribution. In a grouped frequency distribution, test-score intervals, also called class intervals, replace the actual test scores. The number of class intervals used and the size or width of each class interval (or, the range of test scores contained in each class interval) are for the test user to decide. But how?
In most instances, a decision about the size of a class interval in a grouped frequency distribution is made on the basis of convenience. Of course, virtually any decision will represent a trade-off of sorts. A convenient, easy-to-read summary of the data is the trade-off for the loss of detail. To what extent must the data be summarized? How important is detail? These types of questions must be considered. In the grouped frequency distribution in Table 3–3, the test scores have been grouped into 12 class intervals, where each class interval is equal to 5 points. 5 The highest class interval (95–99) and the lowest class interval (40–44) are referred to, respectively, as the upper and lower limits of the distribution. Here, the need for convenience in reading the data outweighs the need for great detail, so such groupings of data seem logical.
Table 3–3 A Grouped Frequency Distribution
|Class Interval|f (frequency)|
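Grouping scores into class intervals can be sketched in a few lines. The example assumes non-negative integer scores and intervals that start on multiples of the interval width (as with 40–44 and 95–99 above). The scores and the function name are hypothetical.

```python
from collections import Counter


def grouped_frequency_distribution(scores, width=5):
    """Return ((lower, upper), frequency) rows, highest interval first.

    Assumes non-negative integer scores. Each interval starts on a
    multiple of `width`, e.g. 40-44, 45-49, ... for width=5.
    """
    counts = Counter((score // width) * width for score in scores)
    lowest, highest = min(counts), max(counts)
    rows = []
    for lower in range(highest, lowest - 1, -width):
        # Intervals with no scores are kept so the table has no gaps.
        rows.append(((lower, lower + width - 1), counts.get(lower, 0)))
    return rows


# Hypothetical raw scores, NOT the data from Table 3-1.
scores = [72, 85, 72, 90, 61, 85, 72, 58]

for (lower, upper), frequency in grouped_frequency_distribution(scores):
    print(f"{lower}-{upper}", frequency)
```

Widening the interval gives a shorter, easier-to-read table at the cost of detail, which is the trade-off described above.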
Frequency distributions of test scores can also be illustrated graphically. A graph is a diagram or chart composed of lines, points, bars, or other symbols that describe and illustrate data. With a good graph, the place of a single score in relation to a distribution of test scores can be understood easily. Three kinds of graphs used to illustrate frequency distributions are the histogram, the bar graph, and the frequency polygon (Figure 3–2). A histogram is a graph with vertical lines drawn at the true limits of each test score (or class interval), forming a series of contiguous rectangles. It is customary for the test scores (either the single scores or the midpoints of the class intervals) to be placed along the graph’s horizontal axis (also referred to as the abscissa or X-axis) and for numbers indicative of the frequency of occurrence to be placed along the graph’s vertical axis (also referred to as the ordinate or Y-axis). In a bar graph, numbers indicative of frequency also appear on the Y-axis, and reference to some categorization (e.g., yes/no/maybe, male/female) appears on the X-axis. Here the rectangular bars typically are not contiguous. Data illustrated in a frequency polygon are expressed by a continuous line connecting the points where test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).
Figure 3–2 Graphic Illustrations of Data from Table 3–3 A histogram (a), a bar graph (b), and a frequency polygon (c) all may be used to graphically convey information about test performance. Of course, the labeling of the bar graph and the specific nature of the data conveyed by it depend on the variables of interest. In (b), the variable of interest is the number of students who passed the test (assuming, for the purpose of this illustration, that a raw score of 65 or higher had been arbitrarily designated in advance as a passing grade). Returning to the question posed earlier—the one in which you play the role of instructor and must communicate the test results to your students—which type of graph would best serve your purpose? Why? As we continue our review of descriptive statistics, you may wish to return to your role of professor and formulate your response to challenging related questions, such as “Which measure(s) of central tendency shall I use to convey this information?” and “Which measure(s) of variability would convey the information best?”
Graphic representations of frequency distributions may assume any of a number of different shapes (Figure 3–3). Regardless of the shape of graphed data, it is a good idea for the consumer of the information contained in the graph to examine it carefully—and, if need be, critically. Consider, in this context, this chapter’s Everyday Psychometrics.
Figure 3–3 Shapes That Frequency Distributions Can Take
As we discuss in detail later in this chapter, one graphic representation of data of particular interest to measurement professionals is the normal or bell-shaped curve. Before getting to that, however, let’s return to the subject of distributions and how we can describe and characterize them. One way to describe a distribution of test scores is by a measure of central tendency.
Measures of Central Tendency
A measure of central tendency is a statistic that indicates the average or midmost score between the extreme scores in a distribution. The center of a distribution can be defined in different ways. Perhaps the most commonly used measure of central tendency is the arithmetic mean (or, more simply, mean), which is referred to in everyday language as the “average.” The mean takes into account the actual numerical value of every score. In special instances, such as when there are only a few scores and one or two of the scores are extreme in relation to the remaining ones, a measure of central tendency other than the mean may be desirable. Other measures of central tendency we review include the median and the mode. Note that, in the formulas to follow, the standard statistical shorthand called “summation notation” (summation meaning “the sum of”) is used. The Greek uppercase letter sigma, Σ, is the symbol used to signify “sum”; if X represents a test score, then the expression Σ X means “add all the test scores.”
The arithmetic mean
The arithmetic mean, denoted by the symbol $\bar{X}$ (and pronounced “X bar”), is equal to the sum of the observations (or test scores, in this case) divided by the number of observations. Symbolically written, the formula for the arithmetic mean is
$$\bar{X} = \frac{\sum X}{n}$$
where n equals the number of observations or test scores. The arithmetic mean is typically the most appropriate measure of central tendency for interval or ratio data when the distributions are believed to be approximately normal. An arithmetic mean can also be computed from a frequency distribution. The formula for doing this is
$$\bar{X} = \frac{\sum (fX)}{n}$$
where Σ(fX) means “multiply the frequency of each score by its corresponding score and then sum.” An estimate of the arithmetic mean may also be obtained from a grouped frequency distribution using the same formula, where X is equal to the midpoint of the class interval. Table 3–4 illustrates a calculation of the mean from a grouped frequency distribution. After doing the math you will find that, using the grouped data, a mean of 71.8 (which may be rounded to 72) is calculated. Using the raw scores, a mean of 72.12 (which also may be rounded to 72) is calculated. Frequently, the choice of statistic will depend on the required degree of precision in measurement.
Table 3–4 Calculating the Arithmetic Mean from a Grouped Frequency Distribution
|Class Interval|f|X (midpoint of class interval)|fX|
| |Σ f = 25| |Σ (fX) = 1,795|
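To estimate the arithmetic mean of this grouped frequency distribution,
$$\bar{X} = \frac{\sum (fX)}{n} = \frac{1{,}795}{25} = 71.80$$
To calculate the mean of this distribution using raw scores,
$$\bar{X} = \frac{\sum X}{n} = \frac{1{,}803}{25} = 72.12$$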
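The two routes to the mean can be compared in code. The sketch below is illustrative only. The function names are hypothetical, and the scores and class intervals are invented, so the only figures taken from the text are the totals Σ(fX) = 1,795 and n = 25.

```python
def mean_from_raw(scores):
    """Arithmetic mean: sum of the scores divided by their number."""
    return sum(scores) / len(scores)


def mean_from_grouped(rows):
    """Estimate the mean from ((lower, upper), frequency) rows.

    Each interval is represented by its midpoint X, and the estimate
    is the sum of f * X divided by the total frequency n.
    """
    n = sum(frequency for _, frequency in rows)
    total = sum(frequency * (lower + upper) / 2
                for (lower, upper), frequency in rows)
    return total / n


# Hypothetical raw scores and the same scores grouped into 5-point
# class intervals. These are NOT the data from Tables 3-1 to 3-4.
scores = [72, 85, 72, 90, 61, 85, 72, 58]
rows = [((90, 94), 1), ((85, 89), 2), ((70, 74), 3),
        ((60, 64), 1), ((55, 59), 1)]

print(mean_from_raw(scores))    # prints: 74.375
print(mean_from_grouped(rows))  # prints: 75.125

# The totals reported in Table 3-4 give the grouped estimate in the text.
print(1795 / 25)                # prints: 71.8
```

The grouped estimate differs slightly from the raw-score mean because every score in an interval is treated as if it sat at the midpoint. The same effect accounts for the gap between 71.8 and 72.12 in the text.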
JUST THINK . . .
Imagine that a thousand or so engineers took an extremely difficult pre-employment test. A handful of the engineers earned very high scores but the vast majority did poorly, earning extremely low scores. Given this scenario, what are the pros and cons of using the mean as a measure of central tendency for this test?
Consumer (of Graphed Data), Beware!
One picture is worth a thousand words, and one purpose of representing data in graphic form is to convey information at a glance. However, although two graphs may be accurate with respect to the data they represent, their pictures—and the impression drawn from a glance at them—may be vastly different. As an example, consider the following hypothetical scenario involving a hamburger restaurant chain we’ll call “The Charred House.”
The Charred House chain serves very charbroiled, microscopically thin hamburgers formed in the shape of little triangular houses. In the 10-year period since its founding in 1993, the company has sold, on average, 100 million burgers per year. On the chain’s tenth anniversary, The Charred House distributes a press release proudly announcing “Over a Billion Served.”
Reporters from two business publications set out to research and write a feature article on this hamburger restaurant chain. Working solely from sales figures as compiled from annual reports to the shareholders, Reporter 1 focuses her story on the differences in yearly sales. Her article is entitled “A Billion Served—But Charred House Sales Fluctuate from Year to Year,” and its graphic illustration is reprinted here.
Quite a different picture of the company emerges from Reporter 2’s story, entitled “A Billion Served—And Charred House Sales Are as Steady as Ever,” and its accompanying graph. The latter story is based on a diligent analysis of comparable data for the same number of hamburger chains in the same areas of the country over the same time period. While researching the story, Reporter 2 learned that yearly fluctuations in sales are common to the entire industry and that the annual fluctuations observed in the Charred House figures were—relative to other chains—insignificant.
Compare the graphs that accompanied each story. Although both are accurate insofar as they are based on the correct numbers, the impressions they are likely to leave are quite different.
Incidentally, custom dictates that the intersection of the two axes of a graph be at 0 and that all the points on the Y-axis be in equal and proportional intervals from 0. This custom is followed in Reporter 2’s story, where the first point on the ordinate is 10 units more than 0, and each succeeding point is also 10 more units away from 0. However, the custom is violated in Reporter 1’s story, where the first point on the ordinate is 95 units more than 0, and each succeeding point increases only by 1. The fact that the custom is violated in Reporter 1’s story should serve as a warning to evaluate pictorial representations of data all the more critically.
The median , defined as the middle score in a distribution, is another commonly used measure of central tendency. We determine the median of a distribution of scores by ordering the scores in a list by magnitude, in either ascending or descending order. If the total number of scores ordered is an odd number, then the median will be the score that is exactly in the middle, with one-half of the remaining scores lying above it and the other half of the remaining scores lying below it. When the total number of scores ordered is an even number, then the median can be calculated by determining the arithmetic mean of the two middle scores. For example, suppose that 10 people took a preemployment word-processing test at The Rochester Wrenchworks (TRW) Corporation. They obtained the following scores, presented here in descending order:
66 65 61 59 53 52 41 36 35 32
The median of these data would be calculated by obtaining the average (or, the arithmetic mean) of the two middle scores, 53 and 52 (which would be equal to 52.5). The median is an appropriate measure of central tendency for ordinal, interval, and ratio data. The median may be a particularly useful measure of central tendency in cases where relatively few scores fall at the high end of the distribution or relatively few scores fall at the low end of the distribution.
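The ordering rule just described can be sketched as a small Python function. The function name is an illustrative choice, and the scores are the ten TRW scores listed above.

```python
def median(scores):
    """Return the middle score, or the mean of the two middle scores."""
    ordered = sorted(scores)          # order the scores by magnitude
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: a single middle score exists
        return ordered[middle]
    # even count: average the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2

trw_scores = [66, 65, 61, 59, 53, 52, 41, 36, 35, 32]
print(median(trw_scores))  # 52.5
```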
Suppose not 10 but rather tens of thousands of people had applied for jobs at The Rochester Wrenchworks. It would be impractical to find the median by simply ordering the data and finding the midmost scores, so how would the median score be identified? For our purposes, the answer is simply that there are advanced methods for doing so. There are also techniques for identifying the median in other sorts of distributions, such as a grouped frequency distribution and a distribution wherein various scores are identical. However, instead of delving into such new and complex territory, let’s resume our discussion of central tendency and consider another such measure.
The most frequently occurring score in a distribution of scores is the mode. As an example, determine the mode for the following scores obtained by another TRW job applicant, Bruce. The scores reflect the number of words Bruce word-processed in seven 1-minute trials:
43 34 45 51 42 31 51
It is TRW policy that new hires must be able to word-process at least 50 words per minute. Now, place yourself in the role of the corporate personnel officer. Would you hire Bruce? The most frequently occurring score in this distribution of scores is 51. If hiring guidelines gave you the freedom to use any measure of central tendency in your personnel decision making, then it would be your choice as to whether or not Bruce is hired. You could hire him and justify this decision on the basis of his modal score (51). You also could not hire him and justify this decision on the basis of his mean score (below the required 50 words per minute). Ultimately, whether Rochester Wrenchworks will be Bruce’s new home away from home will depend on other job-related factors, such as the nature of the job market in Rochester and the qualifications of competing applicants. Of course, if company guidelines dictate that only the mean score be used in hiring decisions, then a career at TRW is not in Bruce’s immediate future.
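As a minimal sketch of how the two measures lead to different decisions, the following Python snippet compares Bruce's modal and mean scores with the 50-words-per-minute requirement. The variable names are illustrative only.

```python
import statistics

bruce_trials = [43, 34, 45, 51, 42, 31, 51]
REQUIRED_WPM = 50  # TRW hiring threshold stated in the text

modal_score = statistics.mode(bruce_trials)   # most frequently occurring score
mean_score = statistics.mean(bruce_trials)    # sum of scores divided by 7

print(modal_score, modal_score >= REQUIRED_WPM)          # 51 True
print(round(mean_score, 2), mean_score >= REQUIRED_WPM)  # 42.43 False
```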
Distributions that contain a tie for the designation “most frequently occurring score” can have more than one mode. Consider the following scores—arranged in no particular order—obtained by 20 students on the final exam of a new trade school called the Home Study School of Elvis Presley Impersonators:
51 49 51 50 66 52 53 38 17 66 33 44 73 13 21 91 87 92 47 3
These scores are said to have a bimodal distribution because there are two scores (51 and 66) that occur with the highest frequency (of two). Except with nominal data, the mode tends not to be a very commonly used measure of central tendency. Unlike the arithmetic mean, which has to be calculated, the value of the modal score is not calculated; one simply counts and determines which score occurs most frequently. Because the mode is arrived at in this manner, the modal score may be totally atypical (for instance, one at an extreme end of the distribution) and nonetheless occur with the greatest frequency. In fact, it is theoretically possible for a bimodal distribution to have two modes, each of which falls at the high or the low end of the distribution, thus violating the expectation that a measure of central tendency should be . . . well, central (or indicative of a point at the middle of the distribution).
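Counting frequencies by hand is error-prone with 20 scores, so here is a short Python sketch that finds every mode of the exam scores listed above. It relies on `statistics.multimode` from the standard library (Python 3.8 or later).

```python
import statistics
from collections import Counter

exam_scores = [51, 49, 51, 50, 66, 52, 53, 38, 17, 66,
               33, 44, 73, 13, 21, 91, 87, 92, 47, 3]

modes = statistics.multimode(exam_scores)            # all scores tied for most frequent
highest_frequency = max(Counter(exam_scores).values())

print(modes, highest_frequency)  # [51, 66] 2
```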
Even though the mode is not calculated in the sense that the mean is calculated, and even though the mode is not necessarily a unique point in a distribution (a distribution can have two, three, or even more modes), the mode can still be useful in conveying certain types of information. The mode is useful in analyses of a qualitative or verbal nature. For example, when assessing consumers’ recall of a commercial by means of interviews, a researcher might be interested in which word or words were mentioned most by interviewees.
The mode can convey a wealth of information in addition to the mean. As an example, suppose you wanted an estimate of the number of journal articles published by clinical psychologists in the United States in the past year. To arrive at this figure, you might total the number of journal articles accepted for publication written by each clinical psychologist in the United States, divide by the number of psychologists, and arrive at the arithmetic mean. This calculation would yield an indication of the average number of journal articles published. Whatever that number would be, we can say with certainty that it would be more than the mode. It is well known that most clinical psychologists do not write journal articles. The mode for publications by clinical psychologists in any given year is zero. In this example, the arithmetic mean would provide us with a precise measure of the average number of articles published by clinicians. However, what might be lost in that measure of central tendency is that, proportionately, very few of all clinicians do most of the publishing. The mode (in this case, a mode of zero) would provide us with a great deal of information at a glance. It would tell us that, regardless of the mean, most clinicians do not publish.
Because the mode is not calculated in a true sense, it is a nominal statistic and cannot legitimately be used in further calculations. The median is a statistic that takes into account the order of scores and is itself ordinal in nature. The mean, an interval-level statistic, is generally the most stable and useful measure of central tendency.
JUST THINK . . .
Devise your own example to illustrate how the mode, and not the mean, can be the most useful measure of central tendency.
Measures of Variability
Variability is an indication of how scores in a distribution are scattered or dispersed. As Figure 3–4 illustrates, two or more distributions of test scores can have the same mean even though differences in the dispersion of scores around the mean can be wide. In both distributions A and B, test scores could range from 0 to 100. In distribution A, we see that the mean score was 50 and the remaining scores were widely distributed around the mean. In distribution B, the mean was also 50 but few people scored higher than 60 or lower than 40.
Figure 3–4 Two Distributions with Differences in Variability
Statistics that describe the amount of variation in a distribution are referred to as measures of variability. Some measures of variability include the range, the interquartile range, the semi-interquartile range, the average deviation, the standard deviation, and the variance.
The range of a distribution is equal to the difference between the highest and the lowest scores. We could describe distribution B of Figure 3–4, for example, as having a range of 20 if we knew that the highest score in this distribution was 60 and the lowest score was 40 (60 − 40 = 20). With respect to distribution A, if we knew that the lowest score was 0 and the highest score was 100, the range would be equal to 100 − 0, or 100. The range is the simplest measure of variability to calculate, but its potential use is limited. Because the range is based entirely on the values of the lowest and highest scores, one extreme score (if it happens to be the lowest or the highest) can radically alter the value of the range. For example, suppose distribution B included a score of 90. The range of this distribution would now be equal to 90 − 40, or 50. Yet, in looking at the data in the graph for distribution B, it is clear that the vast majority of scores tend to be between 40 and 60.
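The sensitivity of the range to a single extreme score is easy to see in code. In this sketch the individual scores in `distribution_b` are hypothetical; only the lowest score (40), the highest score (60), and the added score of 90 come from the example above.

```python
def score_range(scores):
    """Range = highest score minus lowest score."""
    return max(scores) - min(scores)

# Hypothetical scores standing in for distribution B (lowest 40, highest 60).
distribution_b = [40, 45, 48, 50, 50, 52, 55, 60]

print(score_range(distribution_b))          # 20
print(score_range(distribution_b + [90]))   # 50: one extreme score changes the range
```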
JUST THINK . . .
Devise two distributions of test scores to illustrate how the range can overstate or understate the degree of variability in the scores.
As a descriptive statistic of variation, the range provides a quick but gross description of the spread of scores. When its value is based on extreme scores in a distribution, the resulting description of variation may be understated or overstated. Better measures of variation include the interquartile range and the semi-interquartile range.
The interquartile and semi-interquartile ranges
A distribution of test scores (or any other data, for that matter) can be divided into four parts such that 25% of the test scores occur in each quarter. As illustrated in Figure 3–5 , the dividing points between the four quarters in the distribution are the quartiles . There are three of them, respectively labeled Q1, Q2, and Q3. Note that quartile refers to a specific point whereas quarter refers to an interval. An individual score may, for example, fall at the third quartile or in the third quarter (but not “in” the third quartile or “at” the third quarter). It should come as no surprise to you that Q2 and the median are exactly the same. And just as the median is the midpoint in a distribution of scores, so are quartiles Q1 and Q3 the quarter-points in a distribution of scores. Formulas may be employed to determine the exact value of these points.
Figure 3–5 A Quartered Distribution
The interquartile range is a measure of variability equal to the difference between Q3 and Q1. Like the median, it is an ordinal statistic. A related measure of variability is the semi-interquartile range , which is equal to the interquartile range divided by 2. Knowledge of the relative distances of Q1 and Q3 from Q2 (the median) provides the seasoned test interpreter with immediate information as to the shape of the distribution of scores. In a perfectly symmetrical distribution, Q1 and Q3 will be exactly the same distance from the median. If these distances are unequal then there is a lack of symmetry. This lack of symmetry is referred to as skewness, and we will have more to say about that shortly.
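A minimal Python sketch of these two measures follows, reusing the ten TRW word-processing scores from earlier in the chapter. Several conventions exist for locating quartiles in a small data set; this sketch assumes the default method of `statistics.quantiles`, so other software may report slightly different values for Q1 and Q3.

```python
import statistics

scores = [66, 65, 61, 59, 53, 52, 41, 36, 35, 32]

# quantiles(..., n=4) returns the three cut points Q1, Q2, and Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)

interquartile_range = q3 - q1
semi_interquartile_range = interquartile_range / 2

print(q1, q2, q3)
print(interquartile_range, semi_interquartile_range)
print(q2 == statistics.median(scores))  # Q2 coincides with the median
```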
The average deviation
Another tool that could be used to describe the amount of variability in a distribution is the average deviation, or AD for short. Its formula is $AD = \frac{\Sigma |x|}{n}$
The lowercase italic x in the formula signifies a score’s deviation from the mean. The value of x is obtained by subtracting the mean from the score (X − mean = x). The bars on each side of x indicate that it is the absolute value of the deviation score (ignoring the positive or negative sign and treating all deviation scores as positive). All the deviation scores are then summed and divided by the total number of scores (n) to arrive at the average deviation. As an exercise, calculate the average deviation for the following distribution of test scores:
85 100 90 95 80
Begin by calculating the arithmetic mean. Next, obtain the absolute value of each of the five deviation scores and sum them. As you sum them, note what would happen if you did not ignore the plus or minus signs: All the deviation scores would then sum to 0. Divide the sum of the deviation scores by the number of measurements (5). Did you obtain an AD of 6? The AD tells us that the five scores in this distribution varied, on average, 6 points from the mean.
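The exercise can be checked with a short Python sketch that follows the same steps: compute the mean, take each deviation, and average the absolute values. The variable names are illustrative.

```python
scores = [85, 100, 90, 95, 80]

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]     # x = X - mean, signs retained

signed_sum = sum(deviations)                # signed deviations cancel out
average_deviation = sum(abs(d) for d in deviations) / len(scores)

print(mean, signed_sum, average_deviation)  # 90.0 0.0 6.0
```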
JUST THINK . . .
After reading about the standard deviation, explain in your own words how an understanding of the average deviation can provide a “stepping-stone” to better understanding the concept of a standard deviation.
The average deviation is rarely used. Perhaps this is so because the deletion of algebraic signs renders it a useless measure for purposes of any further operations. Why, then, discuss it here? The reason is that a clear understanding of what an average deviation measures provides a solid foundation for understanding the conceptual basis of another, more widely used measure: the standard deviation. Keeping in mind what an average deviation is, what it tells us, and how it is derived, let’s consider its more frequently used “cousin,” the standard deviation.
The standard deviation
Recall that, when we calculated the average deviation, the problem of the sum of all deviation scores around the mean equaling zero was solved by employing only the absolute value of the deviation scores. In calculating the standard deviation, the same problem must be dealt with, but we do so in a different way. Instead of using the absolute value of each deviation score, we use the square of each deviation score. With each deviation score squared, the sign of any negative deviation becomes positive. Because all the deviation scores are squared, we know that our calculations won’t be complete until we go back and obtain the square root of whatever value we reach.
We may define the standard deviation as a measure of variability equal to the square root of the average squared deviations about the mean. More succinctly, it is equal to the square root of the variance. The variance is equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean. The formula used to calculate the variance ($s^2$) using deviation scores is $s^2 = \frac{\Sigma (X - \bar{X})^2}{n}$
Simply stated, the variance is calculated by squaring and summing all the deviation scores and then dividing by the total number of scores. The variance can also be calculated in other ways. For example: From raw scores, first calculate the summation of the raw scores squared, divide by the number of scores, and then subtract the mean squared. The result is $s^2 = \frac{\Sigma X^2}{n} - \bar{X}^2$
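To see that the deviation-score and raw-score routes agree, here is a minimal Python sketch. It uses the five scores from the average deviation exercise (85, 100, 90, 95, 80) rather than the Table 3–1 data, and it divides by n as described above.

```python
import math

scores = [85, 100, 90, 95, 80]
n = len(scores)
mean = sum(scores) / n

# Deviation-score method: average of the squared deviations about the mean.
variance_from_deviations = sum((x - mean) ** 2 for x in scores) / n

# Raw-score method: mean of the squared scores minus the squared mean.
variance_from_raw = sum(x ** 2 for x in scores) / n - mean ** 2

standard_deviation = math.sqrt(variance_from_deviations)

print(variance_from_deviations, variance_from_raw)  # 50.0 50.0
print(round(standard_deviation, 2))                 # 7.07
```

With much larger scores the raw-score formula can lose precision in floating-point arithmetic, which is one reason software usually prefers the deviation-score form.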
The variance is a widely used measure in psychological research. To make meaningful interpretations, the test-score distribution should be approximately normal. We’ll have more to say about “normal” distributions later in the chapter. At this point, think of a normal distribution as a distribution with the greatest frequency of scores occurring near the arithmetic mean. Correspondingly fewer and fewer scores relative to the mean occur on both sides of it.
For some hands-on experience with—and to develop a sense of mastery of—the concepts of variance and standard deviation, why not allot the next 10 or 15 minutes to calculating the standard deviation for the test scores shown in Table 3–1 ? Use both formulas to verify that they produce the same results. Using deviation scores, your calculations should look similar to these:
Using the raw-scores formula, your calculations should look similar to these:
In both cases, the standard deviation is the square root of the variance ($s^2$). According to our calculations, the standard deviation of the test scores is 14.10. If s = 14.10, then 1 standard deviation unit is approximately equal to 14 units of measurement or (with reference to our example and rounded to a whole number) to 14 test-score points. The test data did not provide a good normal curve approximation. Test professionals would describe these data as “positively skewed.” Skewness, as well as related terms such as negatively skewed and positively skewed, is covered in the next section. Once you are “positively familiar” with terms like positively skewed, you’ll appreciate all the more the section later in this chapter entitled “The Area Under the Normal Curve.” There you will find a wealth of information about test-score interpretation in the case when the scores are not skewed, that is, when the test scores are approximately normal in distribution.
The symbol for standard deviation has variously been represented as s, S, SD, and the lowercase Greek letter sigma (σ). One custom (the one we adhere to) has it that s refers to the sample standard deviation and σ refers to the population standard deviation. The number of observations in the sample is n, and the denominator n − 1 is sometimes used to calculate what is referred to as an “unbiased estimate” of the population value (though it’s actually only less biased; see Hopkins & Glass, 1978). Unless n is 10 or less, the use of n or n − 1 tends not to make a meaningful difference.
Whether the denominator is more properly n or n − 1 has been a matter of debate. Lindgren (1983) has argued for the use of n − 1, in part because this denominator tends to make correlation formulas simpler. By contrast, most texts recommend the use of n − 1 only when the data constitute a sample; when the data constitute a population, n is preferable. For Lindgren (1983), it doesn’t matter whether the data are from a sample or a population. Perhaps the most reasonable convention is to use n either when the entire population has been assessed or when no inferences to the population are intended. So, when considering the examination scores of one class of students—including all the people about whom we’re going to make inferences—it seems appropriate to use n.
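The practical effect of the two denominators can be sketched with Python's standard library, where `statistics.pstdev` divides by n and `statistics.stdev` divides by n − 1. The five scores are borrowed from the average deviation exercise purely for illustration.

```python
import statistics

scores = [85, 100, 90, 95, 80]

sd_n = statistics.pstdev(scores)          # denominator n
sd_n_minus_1 = statistics.stdev(scores)   # denominator n - 1

print(round(sd_n, 2), round(sd_n_minus_1, 2))  # 7.07 7.91
```

With only five scores the two values differ noticeably; the gap shrinks as n grows.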
Having stated our position on the n versus n − 1 controversy, our formula for the population standard deviation follows. In this formula, $\bar{X}$ represents a sample mean and M a population mean: $\sigma = \sqrt{\frac{\Sigma (X - M)^2}{n}}$
The standard deviation is a very useful measure of variation because each individual score’s distance from the mean of the distribution is factored into its computation. You will come across this measure of variation frequently in the study and practice of measurement in psychology.
Distributions can be characterized by their skewness , or the nature and extent to which symmetry is absent. Skewness is an indication of how the measurements in a distribution are distributed. A distribution has a positive skew when relatively few of the scores fall at the high end of the distribution. Positively skewed examination results may indicate that the test was too difficult. More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores. A distribution has a negative skew when relatively few of the scores fall at the low end of the distribution. Negatively skewed examination results may indicate that the test was too easy. In this case, more items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution. (Refer to Figure 3–3 for graphic examples of skewed distributions.)
The term skewed carries with it negative implications for many students. We suspect that skewed is associated with abnormal, perhaps because the skewed distribution deviates from the symmetrical or so-called normal distribution. However, the presence or absence of symmetry in a distribution (skewness) is simply one characteristic by which a distribution can be described. Consider in this context a hypothetical Marine Corps Ability and Endurance Screening Test administered to all civilians seeking to enlist in the U.S. Marines. Now look again at the graphs in Figure 3–3 . Which graph do you think would best describe the resulting distribution of test scores? (No peeking at the next paragraph before you respond.)
No one can say with certainty, but if we had to guess, then we would say that the Marine Corps Ability and Endurance Screening Test data would look like graph C, the positively skewed distribution in Figure 3–3. We say this assuming that a level of difficulty would have been built into the test to ensure that relatively few assessees would score at the high end of the distribution. Most of the applicants would probably score at the low end of the distribution. All of this is quite consistent with the advertised objective of the Marines, who are only looking for a few good men. You know: the few, the proud. Now, a question regarding this positively skewed distribution: Is the skewness a good thing? A bad thing? An abnormal thing? In truth, it is probably none of these things; it just is. By the way, although they may not advertise it as much, the Marines are also looking for (an unknown quantity of) good women. But here we are straying a bit too far from skewness.
Various formulas exist for measuring skewness. One way of gauging the skewness of a distribution is through examination of the relative distances of quartiles from the median. In a positively skewed distribution, Q3 − Q2 will be greater than the distance of Q2 − Q1. In a negatively skewed distribution, Q3 − Q2 will be less than the distance of Q2 − Q1. In a distribution that is symmetrical, the distances from Q1 and Q3 to the median are the same.
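The quartile comparison can be sketched in Python as follows. The function name and the three score lists are hypothetical, and the quartiles come from the default method of `statistics.quantiles`, which is only one of several conventions.

```python
import statistics

def skew_direction(scores):
    """Classify skew by comparing quartile distances from the median."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    upper, lower = q3 - q2, q2 - q1
    if upper > lower:
        return "positive"
    if upper < lower:
        return "negative"
    return "symmetrical"

# Hypothetical score sets: a hard test, an easy test, and an evenly spread one.
hard_test = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15, 30]
easy_test = [100 - x for x in hard_test]
even_test = list(range(1, 12))

print(skew_direction(hard_test))   # positive
print(skew_direction(easy_test))   # negative
print(skew_direction(even_test))   # symmetrical
```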
The term testing professionals use to refer to the steepness of a distribution in its center is kurtosis. To the root kurtic is added one of the prefixes platy-, lepto-, or meso- to describe the peakedness/flatness of three general types of curves (Figure 3–6). Distributions are generally described as platykurtic (relatively flat), leptokurtic (relatively peaked), or, somewhere in the middle, mesokurtic. Distributions that have high kurtosis are characterized by a high peak and “fatter” tails compared to a normal distribution. In contrast, lower kurtosis values indicate a distribution with a rounded peak and thinner tails. Many methods exist for measuring kurtosis. According to the original definition, the normal bell-shaped curve (see graph A from Figure 3–3) would have a kurtosis value of 3. In other methods of computing kurtosis, a normal distribution would have kurtosis of 0, with positive values indicating higher kurtosis and negative values indicating lower kurtosis. It is important to keep the different methods of calculating kurtosis in mind when examining the values reported by researchers or computer programs. So, given that this can quickly become an advanced-level topic and that this book is of a more introductory nature, let’s move on. It’s time to focus on a type of distribution that happens to be the standard against which all other distributions (including all of the kurtic ones) are compared: the normal distribution.
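To make the two reporting conventions concrete, here is a small Python sketch. It assumes the common moment-based definition (the fourth central moment divided by the squared variance), which the text does not spell out; under that assumption a normal distribution scores 3, and subtracting 3 gives the version on which a normal distribution scores 0. The function names and data are illustrative.

```python
def kurtosis(scores):
    """Moment-based kurtosis: fourth central moment over squared variance."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n   # variance (denominator n)
    m4 = sum((x - mean) ** 4 for x in scores) / n   # fourth central moment
    return m4 / m2 ** 2

def excess_kurtosis(scores):
    """Same measure shifted so that a normal distribution scores 0."""
    return kurtosis(scores) - 3

flat_scores = [1, 2, 3, 4, 5]  # hypothetical, evenly spread scores
print(round(kurtosis(flat_scores), 2), round(excess_kurtosis(flat_scores), 2))
```

A program that reports −1.3 and one that reports 1.7 for the same data may therefore simply be using different conventions.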
Figure 3–6 The Kurtosis of Curves
JUST THINK . . .
Like skewness, reference to the kurtosis of a distribution can provide a kind of “shorthand” description of a distribution of test scores. Imagine and describe the kind of test that might yield a distribution of scores that form a platykurtic curve.
The Normal Curve
Before delving into the statistical, a little bit of the historical is in order. Development of the concept of a normal curve began in the middle of the eighteenth century with the work of Abraham DeMoivre and, later, the Marquis de Laplace. At the beginning of the nineteenth century, Karl Friedrich Gauss made some substantial contributions. Through the early nineteenth century, scientists referred to it as the “Laplace-Gaussian curve.” Karl Pearson is credited with being the first to refer to the curve as the normal curve, perhaps in an effort to be diplomatic to all of the people who helped develop it. Somehow the term normal curve stuck—but don’t be surprised if you’re sitting at some scientific meeting one day and you hear this distribution or curve referred to as Gaussian.
Theoretically, the normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center. From the center it tapers on both sides approaching the X-axis asymptotically (meaning that it approaches, but never touches, the axis). In theory, the distribution of the normal curve ranges from negative infinity to positive infinity. The curve is perfectly symmetrical, with no skewness. If you folded it in half at the mean, one side would lie exactly on top of the other. Because it is symmetrical, the mean, the median, and the mode all have the same exact value.
Why is the normal curve important in understanding the characteristics of psychological tests? Our Close-Up provides some answers.
The Area Under the Normal Curve
The normal curve can be conveniently divided into areas defined in units of standard deviation. A hypothetical distribution of National Spelling Test scores with a mean of 50 and a standard deviation of 15 is illustrated in Figure 3–7. In this example, a score equal to 1 standard deviation above the mean would be equal to 65 (+1s = 50 + 15 = 65).
Figure 3–7 The Area Under the Normal Curve
The Normal Curve and Psychological Tests
Scores on many psychological tests are often approximately normally distributed, particularly when the tests are administered to large numbers of subjects. Few, if any, psychological tests yield precisely normal distributions of test scores (Micceri, 1989). As a general rule (with ample exceptions), the larger the sample size and the wider the range of abilities measured by a particular test, the more the graph of the test scores will approximate the normal curve. A classic illustration of this was provided by E. L. Thorndike and his colleagues (1927). They compiled intelligence test scores from several large samples of students. As you can see in Figure 1 , the distribution of scores closely approximated the normal curve.
Figure 1 Graphic Representation of Thorndike et al. Data The solid line outlines the distribution of intelligence test scores of sixth-grade students (N = 15,138). The dotted line is the theoretical normal curve (Thorndike et al., 1927).
Following is a sample of more varied examples of the wide range of characteristics that psychologists have found to be approximately normal in distribution.
· The strength of handedness in right-handed individuals, as measured by the Waterloo Handedness Questionnaire (Tan, 1993).
· Scores on the Women’s Health Questionnaire, a scale measuring a variety of health problems in women across a wide age range (Hunter, 1992).
· Responses of both college students and working adults to a measure of intrinsic and extrinsic work motivation (Amabile et al., 1994).
· The intelligence-scale scores of girls and women with eating disorders, as measured by the Wechsler Adult Intelligence Scale–Revised and the Wechsler Intelligence Scale for Children–Revised (Ranseen & Humphries, 1992).
· The intellectual functioning of children and adolescents with cystic fibrosis (Thompson et al., 1992).
· Decline in cognitive abilities over a one-year period in people with Alzheimer’s disease (Burns et al., 1991).
· The rate of motor-skill development in developmentally delayed preschoolers, as measured by the Vineland Adaptive Behavior Scale (Davies & Gavin, 1994).
· Scores on the Swedish translation of the Positive and Negative Syndrome Scale, which assesses the presence of positive and negative symptoms in people with schizophrenia (von Knorring & Lindstrom, 1992).
· Scores of psychiatrists on the Scale for Treatment Integration of the Dually Diagnosed (people with both a drug problem and another mental disorder); the scale examines opinions about drug treatment for this group of patients (Adelman et al., 1991).
· Responses to the Tridimensional Personality Questionnaire, a measure of three distinct personality features (Cloninger et al., 1991).
· Scores on a self-esteem measure among undergraduates (Addeo et al., 1994).
In each case, the researchers made a special point of stating that the scale under investigation yielded something close to a normal distribution of scores. Why? One benefit of a normal distribution of scores is that it simplifies the interpretation of individual scores on the test. In a normal distribution, the mean, the median, and the mode take on the same value. For example, if we know that the average score for intellectual ability of children with cystic fibrosis is a particular value and that the scores are normally distributed, then we know quite a bit more. We know that the average is the most common score and the score below and above which half of all the scores fall. Knowing the mean and the standard deviation of a scale and that it is approximately normally distributed tells us that (1) approximately two-thirds of all testtakers’ scores are within a standard deviation of the mean and (2) approximately 95% of the scores fall within 2 standard deviations of the mean.
The characteristics of the normal curve provide a ready model for score interpretation that can be applied to a wide range of test results.
Before reading on, take a minute or two to calculate what a score exactly at 3 standard deviations below the mean would be equal to. How about a score exactly at 3 standard deviations above the mean? Were your answers 5 and 95, respectively? The graph tells us that 99.74% of all scores in these normally distributed spelling-test data lie between ±3 standard deviations. Stated another way, 99.74% of all spelling test scores lie between 5 and 95. This graph also illustrates the following characteristics of all normal distributions.
· 50% of the scores occur above the mean and 50% of the scores occur below the mean.
· Approximately 34% of all scores occur between the mean and 1 standard deviation above the mean.
· Approximately 34% of all scores occur between the mean and 1 standard deviation below the mean.
· Approximately 68% of all scores occur within ±1 standard deviation of the mean.
· Approximately 95% of all scores occur within ±2 standard deviations of the mean.
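The percentages listed above can be checked numerically. The following is a minimal sketch using Python's standard library; the mean of 50 and standard deviation of 15 are inferred from the spelling-test answers of 5 and 95 given in the text, and the function name `area_between` is only an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_between(z_low, z_high):
    """Proportion of a normal distribution lying between two z values."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

mean_to_plus_1 = area_between(0, 1)   # roughly 0.34
within_1_sd = area_between(-1, 1)     # roughly 0.68
within_2_sd = area_between(-2, 2)     # roughly 0.95
within_3_sd = area_between(-3, 3)     # roughly 0.997

# Spelling-test scale assumed from the text: mean 50, standard deviation 15.
mean, sd = 50, 15
three_sd_below = mean - 3 * sd        # 5
three_sd_above = mean + 3 * sd        # 95

for label, value in [("mean to +1 SD", mean_to_plus_1),
                     ("within 1 SD", within_1_sd),
                     ("within 2 SD", within_2_sd),
                     ("within 3 SD", within_3_sd)]:
    print(f"{label}: {value:.4f}")
```

The computed areas differ slightly in the last decimal place from rounded textbook values, which are usually obtained by summing rounded table entries.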
A normal curve has two tails. The area on the normal curve between 2 and 3 standard deviations above the mean is referred to as a tail . The area between −2 and −3 standard deviations below the mean is also referred to as a tail. Let’s digress here momentarily for a “real-life” tale of the tails to consider along with our rather abstract discussion of statistical concepts.
As observed in a thought-provoking article entitled “Two Tails of the Normal Curve,” an intelligence test score that falls within the limits of either tail can have momentous consequences in terms of the tale of one’s life:
Individuals who are mentally retarded or gifted share the burden of deviance from the norm, in both a developmental and a statistical sense. In terms of mental ability as operationalized by tests of intelligence, performance that is approximately two standard deviations from the mean (or, IQ of 70–75 or lower or IQ of 125–130 or higher) is one key element in identification. Success at life’s tasks, or its absence, also plays a defining role, but the primary classifying feature of both gifted and retarded groups is intellectual deviance. These individuals are out of sync with more average people, simply by their difference from what is expected for their age and circumstance. This asynchrony results in highly significant consequences for them and for those who share their lives. None of the familiar norms apply, and substantial adjustments are needed in parental expectations, educational settings, and social and leisure activities. (Robinson et al., 2000, p. 1413)
Robinson et al. (2000) convincingly demonstrated that knowledge of the areas under the normal curve can be quite useful to the interpreter of test data. This knowledge can tell us not only something about where the score falls among a distribution of scores but also something about a person and perhaps even something about the people who share that person’s life. This knowledge might also convey something about how impressive, average, or lackluster the individual is with respect to a particular discipline or ability. For example, consider a high-school student whose score on a national, well-respected spelling test is close to 3 standard deviations above the mean. It’s a good bet that this student would know how to spell words like asymptotic and leptokurtic.
Just as knowledge of the areas under the normal curve can instantly convey useful information about a test score in relation to other test scores, so can knowledge of standard scores.
Simply stated, a standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation. Why convert raw scores to standard scores?
Raw scores may be converted to standard scores because standard scores are more easily interpretable than raw scores. With a standard score, the position of a testtaker’s performance relative to other testtakers is readily apparent.
Different systems for standard scores exist, each unique in terms of its respective mean and standard deviation. We will briefly describe z scores, T scores, stanines, and some other standard scores. First for consideration is the type of standard score scale that may be thought of as the zero plus or minus one scale. This is so because it has a mean set at 0 and a standard deviation set at 1. Raw scores converted into standard scores on this scale are more popularly referred to as z scores.
A z score results from the conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution. Let’s use an example from the normally distributed “National Spelling Test” data in Figure 3–7 (mean of 50, standard deviation of 15) to demonstrate how a raw score is converted to a z score. We’ll convert a raw score of 65 to a z score by using the formula

$$ z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{15} = \frac{15}{15} = 1 $$
In essence, a z score is equal to the difference between a particular raw score and the mean divided by the standard deviation. In the preceding example, a raw score of 65 was found to be equal to a z score of +1. Knowing that someone obtained a z score of 1 on a spelling test provides context and meaning for the score. Drawing on our knowledge of areas under the normal curve, for example, we would know that only about 16% of the other testtakers obtained higher scores. By contrast, knowing simply that someone obtained a raw score of 65 on a spelling test conveys virtually no usable information because information about the context of this score is lacking.
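As a small illustration of the conversion just described, the sketch below computes the z score for a raw score of 65 and the proportion of testtakers expected to score higher. It assumes the spelling-test mean of 50 and standard deviation of 15 implied by the text; the function name `z_score` is hypothetical.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Number of standard deviation units a raw score lies from the mean."""
    return (raw - mean) / sd

# Assumed scale for the "National Spelling Test": mean 50, SD 15.
z = z_score(65, 50, 15)

# Proportion of a normal distribution falling above this z score.
proportion_higher = 1 - NormalDist().cdf(z)

print(f"z = {z:+.2f}")
print(f"proportion scoring higher = {proportion_higher:.4f}")
```

The second value comes out near .16, which is the "about 16%" figure cited in the text.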
In addition to providing a convenient context for comparing scores on the same test, standard scores provide a convenient context for comparing scores on different tests. As an example, consider that Crystal’s raw score on the hypothetical Main Street Reading Test was 24 and that her raw score on the (equally hypothetical) Main Street Arithmetic Test was 42. Without knowing anything other than these raw scores, one might conclude that Crystal did better on the arithmetic test than on the reading test. Yet more informative than the two raw scores would be the two z scores.
Converting Crystal’s raw scores to z scores based on the performance of other students in her class, suppose we find that her z score on the reading test was 1.32 and that her z score on the arithmetic test was −0.75. Thus, although her raw score in arithmetic was higher than in reading, the z scores paint a different picture. The z scores tell us that, relative to the other students in her class (and assuming that the distribution of scores is relatively normal), Crystal performed above average on the reading test and below average on the arithmetic test. An interpretation of exactly how much better she performed could be obtained by reference to tables detailing distances under the normal curve as well as the resulting percentage of cases that could be expected to fall above or below a particular standard deviation point (or z score).
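The reference to normal-curve tables can be approximated in code. This sketch, which assumes (as the text does) that the class distributions are roughly normal, converts Crystal's two z scores into the percentage of cases expected to fall below each one. The helper name `percent_below` is an illustrative choice.

```python
from statistics import NormalDist

def percent_below(z):
    """Percentage of a normal distribution expected to fall below z."""
    return 100 * NormalDist().cdf(z)

reading_pct = percent_below(1.32)      # Crystal's reading z score
arithmetic_pct = percent_below(-0.75)  # Crystal's arithmetic z score

print(f"reading: above about {reading_pct:.1f}% of classmates")
print(f"arithmetic: above about {arithmetic_pct:.1f}% of classmates")
```

Under the normality assumption, the reading score sits near the 91st percentile and the arithmetic score near the 23rd.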
If the scale used in the computation of z scores is called a zero plus or minus one scale, then the scale used in the computation of T scores can be called a fifty plus or minus ten scale; that is, a scale with a mean set at 50 and a standard deviation set at 10. Devised by W. A. McCall (1922, 1939) and named a T score in honor of his professor E. L. Thorndike, this standard score system is composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean. Thus, for example, a raw score that fell exactly at 5 standard deviations below the mean would be equal to a T score of 0, a raw score that fell at the mean would be equal to a T of 50, and a raw score 5 standard deviations above the mean would be equal to a T of 100. One advantage in using T scores is that none of the scores is negative. By contrast, in a z score distribution, scores can be positive and negative; this can make further computation cumbersome in some instances.
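The fifty plus or minus ten scale amounts to a linear rescaling of z. The sketch below reproduces the three anchor points named in the text; the function name `t_score` is hypothetical.

```python
def t_score(z):
    """Convert a z score to a T score (mean 50, standard deviation 10)."""
    return 50 + 10 * z

# Anchor points from the text: -5 SD, the mean, and +5 SD.
for z in (-5, 0, 5):
    print(f"z = {z:+d}  ->  T = {t_score(z)}")
```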
Other Standard Scores
Numerous other standard scoring systems exist. Researchers during World War II developed a standard score with a mean of 5 and a standard deviation of approximately 2. Divided into nine units, the scale was christened a stanine , a term that was a contraction of the words standard and nine.
Stanine scoring may be familiar to many students from achievement tests administered in elementary and secondary school, where test scores are often represented as stanines. Stanines are different from other standard scores in that they take on whole values from 1 to 9, which represent a range of performance that is half of a standard deviation in width ( Figure 3–8 ). The 5th stanine indicates performance in the average range, from 1/4 standard deviation below the mean to 1/4 standard deviation above the mean, and captures the middle 20% of the scores in a normal distribution. The 4th and 6th stanines are also 1/2 standard deviation wide and capture the 17% of cases below and above (respectively) the 5th stanine.
Figure 3–8 Stanines and the Normal Curve
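One way to picture the stanine scale is as a lookup over half-standard-deviation bands. The sketch below assumes a normal distribution and cutoffs at z = ±0.25, ±0.75, ±1.25, and ±1.75, which follow from the band widths described above. Assigning a score that falls exactly on a cutoff to the higher stanine is an arbitrary convention chosen here, not something the text specifies.

```python
from bisect import bisect_right
from statistics import NormalDist

# Boundaries between the nine stanines, in z-score units.
CUTOFFS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]

def stanine(z):
    """Map a z score to a stanine from 1 to 9."""
    return bisect_right(CUTOFFS, z) + 1

# Share of a normal distribution captured by the 5th and 6th stanines.
std_normal = NormalDist()
fifth_share = std_normal.cdf(0.25) - std_normal.cdf(-0.25)
sixth_share = std_normal.cdf(0.75) - std_normal.cdf(0.25)

print(stanine(0.0), stanine(0.5), stanine(-3.0), stanine(3.0))
print(f"5th stanine: {fifth_share:.1%}, 6th stanine: {sixth_share:.1%}")
```

The computed shares come out close to the 20% and 17% figures given in the text.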
Another type of standard score is employed on tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Raw scores on those tests are converted to standard scores such that the resulting distribution has a mean of 500 and a standard deviation of 100. If the letter A is used to represent a standard score from a college or graduate school admissions test whose distribution has a mean of 500 and a standard deviation of 100, then the following is true:

$$ A = 100z + 500 $$
Have you ever heard the term IQ used as a synonym for one’s score on an intelligence test? Of course you have. What you may not know is that what is referred to variously as IQ, deviation IQ, or deviation intelligence quotient is yet another kind of standard score. For most IQ tests, the distribution of raw scores is converted to IQ scores, whose distribution typically has a mean set at 100 and a standard deviation set at 15. Let’s emphasize typically because there is some variation in standard scoring systems, depending on the test used. The typical mean and standard deviation for IQ tests results in approximately 95% of deviation IQs ranging from 70 to 130, which is 2 standard deviations below and above the mean. In the context of a normal distribution, the relationship of deviation IQ scores to the other standard scores we have discussed so far (z, T, and A scores) is illustrated in Figure 3–9 .
Figure 3–9 Some Standard Score Equivalents Note that the values presented here for the IQ scores assume that the intelligence test scores have a mean of 100 and a standard deviation of 15. This is true for many, but not all, intelligence tests. If a particular test of intelligence yielded scores with a mean other than 100 and/or a standard deviation other than 15, then the values shown for IQ scores would have to be adjusted accordingly.
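The equivalences among these scales can be sketched as a single linear conversion from z. The example assumes the typical values stated in the text (T: 50 and 10; A: 500 and 100; deviation IQ: 100 and 15); a test with a different mean or standard deviation would need different constants. The function name `from_z` is hypothetical.

```python
def from_z(z, mean, sd):
    """Linear transformation of a z score onto a scale with the given mean and SD."""
    return mean + sd * z

# (mean, standard deviation) for each scale, as typically defined.
SCALES = {"T": (50, 10), "A": (500, 100), "IQ": (100, 15)}

for z in (-2, -1, 0, 1, 2):
    row = {name: from_z(z, m, s) for name, (m, s) in SCALES.items()}
    print(f"z = {z:+d}: T = {row['T']}, A = {row['A']}, IQ = {row['IQ']}")
```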
Standard scores converted from raw scores may involve either linear or nonlinear transformations. A standard score obtained by a linear transformation is one that retains a direct numerical relationship to the original raw score. The magnitude of differences between such standard scores exactly parallels the differences between corresponding raw scores. Sometimes scores may undergo more than one transformation. For example, the creators of the SAT did a second linear transformation on their data to convert z scores into a new scale that has a mean of 500 and a standard deviation of 100.
A nonlinear transformation may be required when the data under consideration are not normally distributed yet comparisons with normal distributions need to be made. In a nonlinear transformation, the resulting standard score does not necessarily have a direct numerical relationship to the original, raw score. As the result of a nonlinear transformation, the original distribution is said to have been normalized.
Normalized standard scores
Many test developers hope that the test they are working on will yield a normal distribution of scores. Yet even after very large samples have been tested with the instrument under development, skewed distributions result. What should be done?
One alternative available to the test developer is to normalize the distribution. Conceptually, normalizing a distribution involves “stretching” the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale that is technically referred to as a normalized standard score scale .
Normalization of a skewed distribution of scores may also be desirable for purposes of comparability. One of the primary advantages of a standard score on one test is that it can readily be compared with a standard score on another test. However, such comparisons are appropriate only when the distributions from which they derived are the same. In most instances, they are the same because the two distributions are approximately normal. But if, for example, distribution A were normal and distribution B were highly skewed, then z scores in these respective distributions would represent different amounts of area subsumed under the curve. A z score of −1 with respect to normally distributed data tells us, among other things, that about 84% of the scores in this distribution were higher than this score. A z score of −1 with respect to data that were very positively skewed might mean, for example, that only 62% of the scores were higher.
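To make the contrast between linear and normalized standard scores concrete, here is a minimal sketch. It is not the procedure of any particular test publisher: it assumes one common approach, in which each raw score is replaced by its percentile rank (using a midpoint convention for ties) and that rank is then converted to the z score with the same percentile rank in a normal distribution. The data and function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def linear_z(scores):
    """Ordinary z scores: a linear transformation that keeps the skew."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def normalized_z(scores):
    """Normalized z scores via percentile ranks (one assumed approach)."""
    n = len(scores)
    std_normal = NormalDist()
    result = []
    for x in scores:
        below = sum(1 for y in scores if y < x)
        equal = sum(1 for y in scores if y == x)
        percentile_rank = (below + 0.5 * equal) / n  # always between 0 and 1
        result.append(std_normal.inv_cdf(percentile_rank))
    return result

# Hypothetical, positively skewed raw scores.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
linear = linear_z(skewed)
normalized = normalized_z(skewed)

for raw, lz, nz in zip(skewed, linear, normalized):
    print(f"raw {raw:>2}: linear z = {lz:+.2f}, normalized z = {nz:+.2f}")
```

In this example the extreme raw score of 12 receives a much larger linear z score than normalized z score, which illustrates how the "stretching" pulls the long tail in toward the shape of a normal curve.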
JUST THINK . . .
Apply what you have learned about frequency distributions, graphing frequency distributions, measures of central tendency, measures of variability, and the normal curve and standard scores to the question of the data listed in Table 3–1 . How would you communicate the data from Table 3–1 to the class? Which type of frequency distribution might you use? Which type of graph? Which measure of central tendency? Which measure of variability? Might reference to a normal curve or to standard scores be helpful? Why or why not?
For test developers intent on creating tests that yield normally distributed measurements, it is generally preferable to fine-tune the test according to difficulty or other relevant variables so that the resulting distribution will approximate the normal curve. That usually is a better bet than attempting to normalize skewed distributions. This is so because there are technical cautions to be observed before attempting normalization. For example, transformations should be made only when there is good reason to believe that the test sample was large enough and representative enough and that the failure to obtain normally distributed scores was due to the measuring instrument.
Correlation and Inference
Central to psychological testing and assessment are inferences (deduced conclusions) about how some things (such as traits, abilities, or interests) are related to other things (such as behavior). A coefficient of correlation (or correlation coefficient ) is a number that provides us with an index of the strength of the relationship between two things. An understanding of the concept of correlation and an ability to compute a coefficient of correlation is therefore central to the study of tests and measurement.
The Concept of Correlation
Simply stated, correlation is an expression of the degree and direction of correspondence between two things. A coefficient of correlation (r) expresses a linear relationship between two (and only two) variables, usually continuous in nature. It reflects the degree of concomitant variation between variable X and variable Y. The coefficient of correlation is the numerical index that expresses this relationship: It tells us the extent to which X and Y are “co-related.”
The meaning of a correlation coefficient is interpreted by its sign and magnitude. If a correlation coefficient were a person asked “What’s your sign?,” it wouldn’t answer anything like “Leo” or “Pisces.” It would answer “plus” (for a positive correlation), “minus” (for a negative correlation), or “none” (in the rare instance that the correlation coefficient was exactly equal to zero). If asked to supply information about its magnitude, it would respond with a number anywhere at all between −1 and +1. And here is a rather intriguing fact about the magnitude of a correlation coefficient: It is judged by its absolute value. This means that to the extent that we are impressed by correlation coefficients, a correlation of −.99 is every bit as impressive as a correlation of +.99. To understand why, you need to know a bit more about correlation.
“Ahh . . . a perfect correlation! Let me count the ways.” Well, actually there are only two ways. The two ways to describe a perfect correlation between two variables are as either +1 or −1. If a correlation coefficient has a value of +1 or −1, then the relationship between the two variables being correlated is perfect—without error in the statistical sense. And just as perfection in almost anything is difficult to find, so too are perfect correlations. It’s challenging to try to think of any two variables in psychological work that are perfectly correlated. Perhaps that is why, if you look in the margin, you are asked to “just think” about it.
JUST THINK . . .
Can you name two variables that are perfectly correlated? How about two psychological variables that are perfectly correlated?
If two variables simultaneously increase or simultaneously decrease, then those two variables are said to be positively (or directly) correlated. The height and weight of normal, healthy children ranging in age from birth to 10 years tend to be positively or directly correlated. As children get older, their height and their weight generally increase simultaneously. A positive correlation also exists when two variables simultaneously decrease. For example, the less a student prepares for an examination, the lower that student’s score on the examination. A negative (or inverse) correlation occurs when one variable increases while the other variable decreases. For example, there tends to be an inverse relationship between the number of miles on your car’s odometer (mileage indicator) and the number of dollars a car dealer is willing to give you on a trade-in allowance; all other things being equal, as the mileage increases, the number of dollars offered on trade-in decreases. And by the way, we all know students who use cell phones during class to text, tweet, check e-mail, or otherwise be engaged with their phone at a questionably appropriate time and place. What would you estimate the correlation to be between such daily, in-class cell phone use and test grades? See Figure 3–10 for one such estimate (and kindly refrain from sharing the findings on Facebook during class).
Figure 3–10 Cell Phone Use in Class and Class Grade This may be the “wired” generation, but some college students are clearly more wired than others. They seem to be on their cell phones constantly, even during class. Their gaze may be fixed on Mech Commander when it should more appropriately be on Class Instructor. Over the course of two semesters, Chris Bjornsen and Kellie Archer (2015) studied 218 college students, each of whom completed a questionnaire on their cell phone usage right after class. Correlating the questionnaire data with grades, the researchers reported that cell phone usage during class was significantly, negatively correlated with grades.
If a correlation is zero, then no linear relationship exists between the two variables. And some might consider “perfectly no correlation” to be a third variety of perfect correlation; that is, a perfect noncorrelation. After all, just as it is nearly impossible in psychological work to identify two variables that have a perfect correlation, so it is nearly impossible to identify two variables that have a zero correlation. Most of the time, two variables will be fractionally correlated. The fractional correlation may be extremely small but seldom “perfectly” zero.
JUST THINK . . .
Bjornsen & Archer (2015) discussed the implications of their cell phone study in terms of the effect of cell phone usage on student learning, student achievement, and post-college success. What would you anticipate those implications to be?
JUST THINK . . .
Could a correlation of zero between two variables also be considered a “perfect” correlation? Can you name two variables that have a correlation that is exactly zero?
As we stated in our introduction to this topic, correlation is often confused with causation. It must be emphasized that a correlation coefficient is merely an index of the relationship between two variables, not an index of the causal relationship between two variables. If you were told, for example, that from birth to age 9 there is a high positive correlation between hat size and spelling ability, would it be appropriate to conclude that hat size causes spelling ability? Of course not. The period from birth to age 9 is a time of maturation in all areas, including physical size and cognitive abilities such as spelling. Intellectual development parallels physical development during these years, and a relationship clearly exists between physical and mental growth. Still, this doesn’t mean that the relationship between hat size and spelling ability is causal.
Although correlation does not imply causation, there is an implication of prediction. Stated another way, if we know that there is a high correlation between X and Y, then we should be able to predict—with various degrees of accuracy, depending on other factors—the value of one of these variables if we know the value of the other.
The Pearson r
Many techniques have been devised to measure correlation. The most widely used of all is the Pearson r , also known as the Pearson correlation coefficient and the Pearson product-moment coefficient of correlation. Devised by Karl Pearson ( Figure 3–11 ), r can be the statistical tool of choice when the relationship between the variables is linear and when the two variables being correlated are continuous (that is, they can theoretically take any value). Other correlational techniques can be employed with data that are discontinuous and where the relationship is nonlinear. The formula for the Pearson r takes into account the relative position of each test score or measurement with respect to the mean of the distribution.
Figure 3–11 Karl Pearson (1857–1936) Karl Pearson’s name has become synonymous with correlation. History records, however, that it was actually Sir Francis Galton who should be credited with developing the concept of correlation (Magnello & Spies, 1984). Galton experimented with many formulas to measure correlation, including one he labeled r. Pearson, a contemporary of Galton’s, modified Galton’s r, and the rest, as they say, is history. The Pearson r eventually became the most widely used measure of correlation.
A number of formulas can be used to calculate a Pearson r. One formula requires that we convert each raw score to a standard score and then multiply each pair of standard scores. The mean of these products is then calculated, and that mean is the value of the Pearson r. Even from this simple verbal conceptualization of the Pearson r, it can be seen that the sign of the resulting r would be a function of the sign and the magnitude of the standard scores used. If, for example, negative standard score values for measurements of X always corresponded with negative standard score values for Y scores, the resulting r would be positive (because the product of two negative values is positive). Similarly, if positive standard score values on X always corresponded with positive standard score values on Y, the resulting correlation would also be positive. However, if positive standard score values for X corresponded with negative standard score values for Y and vice versa, then an inverse relationship would exist and so a negative correlation would result. A zero or near-zero correlation could result when some products are positive and some are negative.
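The formula used to calculate a Pearson r from raw scores is

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}} $$

This formula has been simplified for shortcut purposes. One such shortcut is a deviation formula employing “little x,” or x, in place of $X - \bar{X}$ and “little y,” or y, in place of $Y - \bar{Y}$:

$$ r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}} $$

Another formula for calculating a Pearson r is

$$ r = \frac{N \sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{N \sum X^2 - \left(\sum X\right)^2}\,\sqrt{N \sum Y^2 - \left(\sum Y\right)^2}} $$

Although this formula looks more complicated than the previous deviation formula, it is easier to use. Here N represents the number of paired scores; ΣXY is the sum of the products of the paired X and Y scores; ΣX is the sum of the X scores; ΣY is the sum of the Y scores; ΣX² is the sum of the squared X scores; and ΣY² is the sum of the squared Y scores. Similar results are obtained with the use of each formula.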
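The claim that the different formulas yield similar results can be checked directly. The sketch below implements three routes to the Pearson r: the mean of the products of paired standard scores, the deviation-score formula, and the raw-score computational formula. The data (hours studied and quiz scores) are hypothetical, and the standard-score version assumes the standard deviation is computed by dividing by N.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson_from_z(xs, ys):
    """Mean of the products of paired z scores (SD computed with N)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / len(products)

def pearson_deviation(xs, ys):
    """Deviation-score formula: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    mx, my = mean(xs), mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def pearson_raw(xs, ys):
    """Raw-score computational formula using N and the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data: hours studied and quiz score.
hours = [1, 2, 3, 4, 5]
quiz = [2, 4, 5, 4, 5]

r_z = pearson_from_z(hours, quiz)
r_dev = pearson_deviation(hours, quiz)
r_raw = pearson_raw(hours, quiz)
print(f"{r_z:.4f} {r_dev:.4f} {r_raw:.4f}")
```

All three functions return the same value for these data, apart from floating-point rounding.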
The next logical question concerns what to do with the number obtained for the value of r. The answer is that you ask even more questions, such as “Is this number statistically significant, given the size and nature of the sample?” or “Could this result have occurred by chance?” At this point, you will need to consult tables of significance for Pearson r—tables that are probably in the back of your old statistics textbook. In those tables you will find, for example, that a Pearson r of .899 with an N = 10 is significant at the .01 level (using a two-tailed test). You will recall from your statistics course that significance at the .01 level tells you, with reference to these data, that a correlation such as this could have been expected to occur merely by chance only one time or less in a hundred if X and Y are not correlated in the population. You will also recall that significance at either the .01 level or the (somewhat less rigorous) .05 level provides a basis for concluding that a correlation does indeed exist. Significance at the .05 level means that the result could have been expected to occur by chance alone five times or less in a hundred.
The value obtained for the coefficient of correlation can be further interpreted by deriving from it what is called a coefficient of determination , or r². The coefficient of determination is an indication of how much variance is shared by the X- and the Y-variables. The calculation of r² is quite straightforward. Simply square the correlation coefficient and multiply by 100; the result is equal to the percentage of the variance accounted for. If, for example, you calculated r to be .9, then r² would be equal to .81. The number .81 tells us that 81% of the variance is accounted for by the X- and Y-variables. The remaining variance, equal to 100(1 − r²), or 19%, could presumably be accounted for by chance, error, or otherwise unmeasured or unexplainable factors.7
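The arithmetic for the coefficient of determination is brief enough to show in a few lines. This sketch uses the r of .9 from the example in the text; the function name is hypothetical, and results are rounded only to avoid floating-point noise in the printed values.

```python
def variance_accounted_for(r):
    """Return (percent shared variance, percent remaining) for a correlation r."""
    r_squared = r ** 2
    shared = 100 * r_squared
    remaining = 100 * (1 - r_squared)
    return round(shared, 2), round(remaining, 2)

shared_pct, remaining_pct = variance_accounted_for(0.9)
print(f"shared: {shared_pct}%, remaining: {remaining_pct}%")
```

Note that the sign of r drops out when it is squared, so a correlation of −.9 yields the same percentages.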
Before moving on to consider another index of correlation, let’s address a logical question sometimes raised by students when they hear the Pearson r referred to as the product-moment coefficient of correlation. Why is it called that? The answer is a little complicated, but here goes.
In the language of psychometrics, a moment describes a deviation about a mean of a distribution. Individual deviations about the mean of a distribution are referred to as deviates. Deviates are referred to as the first moments of the distribution. The second moments of the distribution are the moments squared. The third moments of the distribution are the moments cubed, and so forth. The computation of the Pearson r in one of its many formulas entails multiplying corresponding standard scores on two measures. One way of conceptualizing standard scores is as the first moments of a distribution. This is because standard scores are deviates about a mean of zero. A formula that entails the multiplication of two corresponding standard scores can therefore be conceptualized as one that entails the computation of the product of corresponding moments. And there you have the reason r is called product-moment correlation. It’s probably all more a matter of psychometric trivia than anything else, but we think it’s cool to know. Further, you can now understand the rather “high-end” humor contained in the cartoon (below).
The Spearman Rho
The Pearson r enjoys such widespread use and acceptance as an index of correlation that if for some reason it is not used to compute a correlation coefficient, mention is made of the statistic that was used. There are many alternative ways to derive a coefficient of correlation. One commonly used alternative statistic is variously called a rank-order correlation coefficient , a rank-difference correlation coefficient , or simply Spearman’s rho . Developed by Charles Spearman, a British psychologist ( Figure 3–12 ), this coefficient of correlation is frequently used when the sample size is small (fewer than 30 pairs of measurements) and especially when both sets of measurements are in ordinal (or rank-order) form. Special tables are used to determine whether an obtained rho coefficient is or is not significant.
Figure 3–12 Charles Spearman (1863–1945) Charles Spearman is best known as the developer of the Spearman rho statistic and the Spearman-Brown prophecy formula, which is used to “prophesize” the accuracy of tests of different sizes. Spearman is also credited with being the father of a statistical method called factor analysis, discussed later in this text.
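The text does not give a formula for rho, so the sketch below uses the standard rank-difference formula, rho = 1 − 6Σd² / (n(n² − 1)), which applies when both variables are already ranked and there are no tied ranks. The two sets of ranks (say, two judges ranking the same five essays) are hypothetical.

```python
def spearman_rho(ranks_x, ranks_y):
    """Rank-difference formula for Spearman's rho (assumes no tied ranks)."""
    n = len(ranks_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical ranks assigned by two judges to the same five essays.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

rho = spearman_rho(judge_a, judge_b)
print(f"rho = {rho:.2f}")
```

With tied ranks, a different computation (for example, a Pearson r calculated on the ranks) would be needed.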
Graphic Representations of Correlation
One type of graphic representation of correlation is referred to by many names, including a bivariate distribution , a scatter diagram , a scattergram , or (our favorite) a scatterplot . A scatterplot is a simple graphing of the coordinate points for values of the X-variable (placed along the graph’s horizontal axis) and the Y-variable (placed along the graph’s vertical axis). Scatterplots are useful because they provide a quick indication of the direction and magnitude of the relationship, if any, between the two variables. Figures 3–13 and 3–14 offer a quick course in eyeballing the nature and degree of correlation by means of scatterplots. To distinguish positive from negative correlations, note the direction of the curve. And to estimate the strength or magnitude of the correlation, note the degree to which the points form a straight line.
Figure 3–13 Scatterplots and Correlations for Positive Values of r
Figure 3–14 Scatterplots and Correlations for Negative Values of r
Scatterplots are useful in revealing the presence of curvilinearity in a relationship. As you may have guessed, curvilinearity in this context refers to an “eyeball gauge” of how curved a graph is. Remember that a Pearson r should be used only if the relationship between the variables is linear. If the graph does not appear to take the form of a straight line, the chances are good that the relationship is not linear ( Figure 3–15 ). When the relationship is nonlinear, other statistical tools and techniques may be employed. 8
Figure 3–15 Scatterplot Showing a Nonlinear Correlation
A graph also makes the spotting of outliers relatively easy. An outlier is an extremely atypical point located at a relatively long distance—an outlying distance—from the rest of the coordinate points in a scatterplot ( Figure 3–16 ). Outliers stimulate interpreters of test data to speculate about the reason for the atypical score. For example, consider an outlier on a scatterplot that reflects a correlation between hours each member of a fifth-grade class spent studying and their grades on a 20-item spelling test. And let’s say that one student studied for 10 hours and received a failing grade. This outlier on the scatterplot might raise a red flag and compel the test user to raise some important questions, such as “How effective are this student’s study skills and habits?” or “What was this student’s state of mind during the test?”
Figure 3–16 Scatterplot Showing an Outlier
In some cases, outliers are simply the result of administering a test to a very small sample of testtakers. In the example just cited, if the test were given statewide to fifth-graders and the sample size were much larger, perhaps many more low scorers who put in large amounts of study time would be identified.
As is the case with very low raw scores or raw scores of zero, outliers can sometimes help identify a testtaker who did not understand the instructions, was not able to follow the instructions, or was simply oppositional and did not follow the instructions. In other cases, an outlier can provide a hint of some deficiency in the testing or scoring procedures.
People who have occasion to use or make interpretations from graphed data need to know if the range of scores has been restricted in any way. To understand why this is so necessary to know, consider Figure 3–17 . Let’s say that graph A describes the relationship between Public University entrance test scores for 600 applicants (all of whom were later admitted) and their grade point averages at the end of the first semester. The scatterplot indicates that the relationship between entrance test scores and grade point average is both linear and positive. But what if the admissions officer had accepted only the applications of the students who scored within the top half or so on the entrance exam? To a trained eye, this scatterplot (graph B) appears to indicate a weaker correlation than that indicated in graph A—an effect attributable exclusively to the restriction of range. Graph B is less a straight line than graph A, and its direction is not as obvious.
Figure 3–17 Two Scatterplots Illustrating Unrestricted and Restricted Ranges
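The effect of restricting the range can also be seen numerically. The sketch below simulates 600 hypothetical applicants whose entrance test scores and grade point averages (both in standardized units) correlate at about .60 in the population, then recomputes the correlation using only the top half of test scorers. The numbers, the random seed, and the helper name `pearson_r` are assumptions made for illustration. They are not data from the figure.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 600

# Simulated standardized test scores and GPAs; the 0.6 / 0.8 weights
# give a population correlation of about .60 (an assumed value).
test = [rng.gauss(0, 1) for _ in range(n)]
gpa = [0.6 * t + 0.8 * rng.gauss(0, 1) for t in test]

# "Graph A": all 600 applicants.
r_full = pearson_r(test, gpa)

# "Graph B": only applicants at or above the median test score.
cut = statistics.median(test)
top = [(t, g) for t, g in zip(test, gpa) if t >= cut]
r_top = pearson_r([t for t, _ in top], [g for _, g in top])

print(f"all applicants: r = {r_full:.2f}")
print(f"top half only:  r = {r_top:.2f}")
```

Nothing about the underlying relationship changes between the two computations. Only the spread of test scores is reduced, and the correlation in the restricted group comes out lower for that reason alone.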
Generally, the best estimate of the correlation between two variables is most likely to come not from a single study alone but from analysis of the data from several studies. One option to facilitate understanding of the research across a number of studies is to present the range of statistical values calculated from a number of different studies of the same phenomenon. Viewing all of the data from a number of studies that attempted to determine the correlation between variable X and variable Y, for example, might lead the researcher to conclude that “The correlation between variable X and variable Y ranges from .73 to .91.” Another option might be to combine statistically the information across the various studies; that is what is done using a statistical technique called meta-analysis. Using this technique, researchers raise (and strive to answer) the question: “Combined, what do all of these studies tell us about the matter under study?” For example, Imtiaz et al. (2016) used meta-analysis to draw some conclusions regarding the relationship between cannabis use and physical health. Colin (2015) used meta-analysis to study the correlations of use-of-force decisions among American police officers.
Meta-analysis may be defined as a family of techniques used to statistically combine information across studies to produce single estimates of the data under study. The estimates derived, referred to as effect size, may take several different forms. In most meta-analytic studies, effect size is typically expressed as a correlation coefficient. Meta-analysis facilitates the drawing of conclusions and the making of statements like, “the typical therapy client is better off than 75% of untreated individuals” (Smith & Glass, 1977, p. 752), there is “about 10% increased risk for antisocial behavior among children with incarcerated parents, compared to peers” (Murray et al., 2012), and “GRE and UGPA [undergraduate grade point average] are generalizably valid predictors of graduate grade point average, 1st-year graduate grade point average, comprehensive examination scores, publication citation counts, and faculty ratings” (Kuncel et al., 2001, p. 162).
MEET AN ASSESSMENT PROFESSIONAL
Meet Dr. Joni L. Mihura
Hi, my name is Joni Mihura, and my research expertise is in psychological assessment, with a special focus on the Rorschach. To tell you a little about me, I was the only woman* to serve on the Research Council for John E. Exner’s Rorschach Comprehensive System (CS) until he passed away in 2006. Due to the controversy around the Rorschach’s validity, I began reviewing the research literature to ensure I was teaching my doctoral students valid measures to assess their clients. That is, the controversy about the Rorschach has not been that it is a completely invalid test—the critics have endorsed several Rorschach scales as valid for their intended purpose—the main problem that they have highlighted is that only a small proportion of its scales had been subjected to “meta-analysis,” a systematic technique for summarizing the research literature. To make a long story short, I eventually published my review of the Rorschach literature in the top scientific review journal in psychology (Psychological Bulletin) in the form of systematic reviews and meta-analyses of the 65 main Rorschach CS variables (Mihura et al., 2013), therefore making the Rorschach the psychological test with the most construct validity meta-analyses for its scales!
My meta-analyses also resulted in two other pivotal events. They formed the backbone for a new scientifically based Rorschach system of which I am a codeveloper, the Rorschach Performance Assessment System (R-PAS; Meyer et al., 2011), and they resulted in the Rorschach critics removing the “moratorium” they had recommended for the Rorschach (Garb, 1999) for the scales they deemed had solid support in our meta-analyses (Wood et al., 2015; also see our reply, Mihura et al., 2015).
I’m very excited to talk with you about meta-analysis. First, to set the stage, let’s take a step back and look at what you might have experienced so far when reading about psychology. When students take their first psychology course, they are often surprised how much of the field is based on research findings rather than just “common sense.” Even so, because undergraduate textbooks have numerous topics about which they cannot cite all of the research, it can appear that the textbook is relying on just one or two studies as the “proof.” Therefore, you might be surprised just how many psychological research studies actually exist! Conducting a quick search in the PsycINFO database shows that over a million psychology journal articles are classified as empirical studies—and that excludes chapters, theses, dissertations, and many other studies not listed in PsycINFO.
Joni L. Mihura, Ph.D. is Associate Professor of Psychology at the University of Toledo in Toledo, Ohio © Joni L. Mihura, Ph.D.
But, good news or bad news, a significant challenge with many research studies is how to summarize results. The classic example of such a dilemma and the eventual solution is a fascinating one that comes from the psychotherapy literature. In 1952, Hans Eysenck published a classic article entitled “The Effects of Psychotherapy: An Evaluation,” in which he summarized the results of a few studies and concluded that psychotherapy doesn’t work! Wow! This finding had the potential to shake the foundation of psychotherapy and even ban its existence. After all, Eysenck had cited research that suggested that the longer a person was in therapy, the worse-off they became. Notwithstanding the psychotherapists and the psychotherapy enterprise, Eysenck’s publication had sobering implications for people who had sought help through psychotherapy. Had they done so in vain? Was there really no hope for the future? Were psychotherapists truly ill-equipped to do things like reduce emotional suffering and improve people’s lives through psychotherapy?
In the wake of this potentially damning article, several psychologists—and in particular Hans H. Strupp—responded by pointing out problems with Eysenck’s methodology. Other psychologists conducted their own reviews of the psychotherapy literature. Somewhat surprisingly, after reviewing the same body of research literature on psychotherapy, various psychologists drew widely different conclusions. Some researchers found strong support for the efficacy of psychotherapy. Other researchers found only modest support for the efficacy of psychotherapy. Yet other researchers found no support for it at all.
How can such different conclusions be drawn when the researchers are reviewing the same body of literature? A comprehensive answer to this important question could fill the pages of this book. Certainly, one key element of the answer to this question had to do with a lack of systematic rules for making decisions about including studies, as well as lack of a widely acceptable protocol for statistically summarizing the findings of the various studies. With such rules and protocols absent, it would be all too easy for researchers to let their preexisting biases run amok. The result was that many researchers “found” in their analyses of the literature what they believed to be true in the first place.
A fortuitous by-product of such turmoil in the research community was the emergence of a research technique called “meta-analysis.” Literally, “an analysis of analyses,” meta-analysis is a tool used to systematically review and statistically summarize the research findings for a particular topic. In 1977, Mary Lee Smith and Gene V. Glass published the first meta-analysis of psychotherapy outcomes. They found strong support for the efficacy of psychotherapy. Subsequently, others tried to challenge Smith and Glass’ findings. However, the systematic rigor of their meta-analytic technique produced findings that were consistently replicated by others. Today there are thousands of psychotherapy studies, and many meta-analysts ready to research specific, therapy-related questions (like “What type of psychotherapy is best for what type of problem?”).
What does all of this mean for psychological testing and assessment? Meta-analytic methodology can be used to glean insights about specific tools of assessment, and testing and assessment procedures. However, meta-analyses of information related to psychological tests bring new challenges owing, for example, to the sheer number of articles to be analyzed, the many variables on which tests differ, and the specific methodology of the meta-analysis. Consider, for example, that multiscale personality tests may contain over 50, and sometimes over 100, scales that each need to be evaluated separately. Furthermore, some popular multiscale personality tests, like the MMPI-2 and Rorschach, have had over a thousand research studies published on them. The studies typically report findings that focus on varied aspects of the test (such as the utility of specific test scales, or other indices of test reliability or validity). In order to make the meta-analytic task manageable, meta-analyses for multiscale tests will typically focus on one or another of these characteristics or indices.
In sum, a thoughtful meta-analysis of research on a specific topic can yield important insights of both theoretical and applied value. A meta-analytic review of the literature on a particular psychological test can even be instrumental in the formulation of revised ways to score the test and interpret the findings (just ask Meyer et al., 2011). So, the next time a question about psychological research arises, students are advised to respond to that question with their own question, namely “Is there a meta-analysis on that?”
Used with permission of Joni L. Mihura.
*I have also edited the Handbook of Gender and Sexuality in Psychological Assessment (Brabender & Mihura, 2016).
A key advantage of meta-analysis over simply reporting a range of findings is that, in meta-analysis, more weight can be given to studies that have larger numbers of subjects. This weighting process results in more accurate estimates (Hunter & Schmidt, 1990). Other advantages of meta-analysis are: (1) meta-analyses can be replicated; (2) the conclusions of meta-analyses tend to be more reliable and precise than the conclusions from single studies; (3) there is more focus on effect size rather than statistical significance alone; and (4) meta-analysis promotes evidence-based practice, which may be defined as professional practice that is based on clinical and research findings (Sánchez-Meca & Marin-Martinez, 2010). Despite these and other advantages, meta-analysis is, at least to some degree, art as well as science (Hall & Rosenthal, 1995). The value of any meta-analytic investigation is very much a matter of the skill and ability of the meta-analyst (Kavale, 1995), and use of an inappropriate meta-analytic method can lead to misleading conclusions (Kisamore & Brannick, 2008).
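The difference between reporting a range and weighting by sample size can be illustrated with a short Python sketch. The four studies below are hypothetical, and the sample-size-weighted mean correlation shown is only one simple weighting scheme. It is offered as an illustration of the idea, not as the full procedure used in any of the sources cited above.

```python
# Hypothetical studies of the correlation between variable X and variable Y.
# Each entry is (correlation coefficient r, number of subjects n).
studies = [
    (0.73, 40),
    (0.91, 25),
    (0.80, 400),
    (0.78, 150),
]

rs = [r for r, _ in studies]

# Option 1: simply report the range of values across studies.
lowest, highest = min(rs), max(rs)

# Unweighted mean: every study counts equally, regardless of size.
unweighted_mean = sum(rs) / len(rs)

# Option 2: weight each correlation by its sample size, so that
# larger studies contribute more to the combined estimate.
total_n = sum(n for _, n in studies)
weighted_mean = sum(r * n for r, n in studies) / total_n

print(f"range: {lowest:.2f} to {highest:.2f}")
print(f"unweighted mean r: {unweighted_mean:.3f}")
print(f"sample-size-weighted mean r: {weighted_mean:.3f}")
```

In this made-up example the small study reporting .91 pulls the unweighted mean upward, while the weighted mean stays closer to the value found in the largest study.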
It may be helpful at this time to review this statistics refresher to make certain that you indeed feel “refreshed” and ready to continue. We will build on your knowledge of basic statistical principles in the chapters to come, and it is important to build on a rock-solid foundation.
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations: