Research Methods in Psychology

Please upload your responses here to the review questions (#3, 6, 7, 8, 9, 10) at the end of Chapter 2, found on page 72.

Beth Morling – Research Methods in Psychology_ Evaluating a World of Information.pdf


Research Methods in Psychology EVALUATING A WORLD OF INFORMATION





W. W. Norton & Company has been independent since its founding in 1923, when William Warder Norton and Mary D. Herter Norton first published lectures delivered at the People’s Institute, the adult education division of New York City’s Cooper Union. The firm soon expanded its program beyond the Institute, publishing books by celebrated academics from America and abroad. By midcentury, the two major pillars of Norton’s publishing program, trade books and college texts, were firmly established. In the 1950s, the Norton family transferred control of the company to its employees, and today, with a staff of four hundred and a comparable number of trade, college, and professional titles published each year, W. W. Norton & Company stands as the largest and oldest publishing house owned wholly by its employees.

Copyright © 2018, 2015, 2012 by W. W. Norton & Company, Inc.

All rights reserved
Printed in Canada

Editor: Sheri L. Snavely
Project Editor: David Bradley
Editorial Assistant: Eve Sanoussi
Manuscript/Development Editor: Betsy Dilernia
Managing Editor, College: Marian Johnson
Managing Editor, College Digital Media: Kim Yi
Production Manager: Jane Searle
Media Editor: Scott Sugarman
Associate Media Editor: Victoria Reuter
Media Assistant: Alex Trivilino
Marketing Manager, Psychology: Ashley Sherwood
Design Director and Text Design: Rubina Yeh
Photo Editor: Travis Carr
Photo Researcher: Dena Digilio Betz
Permissions Manager: Megan Schindel
Composition: CodeMantra
Illustrations: Electragraphics
Manufacturing: Transcontinental Printing

Permission to use copyrighted material is included in the Credits section beginning on page 603.

Library of Congress Cataloging-in-Publication Data

Names: Morling, Beth, author.
Title: Research methods in psychology : evaluating a world of information / Beth Morling, University of Delaware.
Description: Third Edition. | New York : W. W. Norton & Company, [2017] | Revised edition of the author’s Research methods in psychology, [2015] | Includes bibliographical references and index.
Identifiers: LCCN 2017030401 | ISBN 9780393617542 (pbk.)
Subjects: LCSH: Psychology—Research—Methodology—Textbooks. | Psychology, Experimental—Textbooks.
Classification: LCC BF76.5 .M667 2017 | DDC 150.72—dc23
LC record available at

Text-Only ISBN 978-0-393-63017-6

W. W. Norton & Company, Inc., 500 Fifth Avenue, New York, NY 10110
W. W. Norton & Company Ltd., 15 Carlisle Street, London W1D 3BS

1 2 3 4 5 6 7 8 9 0

For my parents


Brief Contents

PART I Introduction to Scientific Reasoning
CHAPTER 1 Psychology Is a Way of Thinking
CHAPTER 2 Sources of Information: Why Research Is Best and How to Find It
CHAPTER 3 Three Claims, Four Validities: Interrogation Tools for Consumers of Research

PART II Research Foundations for Any Claim
CHAPTER 4 Ethical Guidelines for Psychology Research
CHAPTER 5 Identifying Good Measurement

PART III Tools for Evaluating Frequency Claims
CHAPTER 6 Surveys and Observations: Describing What People Do
CHAPTER 7 Sampling: Estimating the Frequency of Behaviors and Beliefs

PART IV Tools for Evaluating Association Claims
CHAPTER 8 Bivariate Correlational Research
CHAPTER 9 Multivariate Correlational Research

PART V Tools for Evaluating Causal Claims
CHAPTER 10 Introduction to Simple Experiments
CHAPTER 11 More on Experiments: Confounding and Obscuring Variables
CHAPTER 12 Experiments with More Than One Independent Variable

PART VI Balancing Research Priorities
CHAPTER 13 Quasi-Experiments and Small-N Designs
CHAPTER 14 Replication, Generalization, and the Real World

Statistics Review Descriptive Statistics
Statistics Review Inferential Statistics
Presenting Results APA-Style Reports and Conference Posters
Appendix A Random Numbers and How to Use Them
Appendix B Statistical Tables


About the Author

BETH MORLING is Professor of Psychology at the University of Delaware. She attended Carleton College in Northfield, Minnesota, and received her Ph.D. from the University of Massachusetts at Amherst. Before coming to Delaware, she held positions at Union College (New York) and Muhlenberg College (Pennsylvania). In addition to teaching research methods at Delaware almost every semester, she also teaches undergraduate cultural psychology, a seminar on the self-concept, and a graduate course in the teaching of psychology. Her research in the area of cultural psychology explores how cultural practices shape people’s motivations. Dr. Morling has been a Fulbright scholar in Kyoto, Japan, and was the Delaware State Professor of the Year (2014), an award from the Council for Advancement and Support of Education (CASE) and the Carnegie Foundation for the Advancement of Teaching.



Students in the psychology major plan to pursue a tremendous variety of careers, not just careers as psychology researchers. So they sometimes ask: Why do we need to study research methods when we want to be therapists, social workers, teachers, lawyers, or physicians? Indeed, many students anticipate that research methods will be “dry,” “boring,” and irrelevant to their future goals. This book was written with these very students in mind: students who are taking their first course in research methods (usually sophomores) and who plan to pursue a wide variety of careers. Most of the students who take the course will never become researchers themselves, but they can learn to systematically navigate the research information they will encounter in empirical journal articles as well as in online magazines, print sources, blogs, and tweets.

I used to tell students that by conducting their own research, they would be able to read and apply research later, in their chosen careers. But the literature on learning transfer leads me to believe that the skills involved in designing one’s own studies will not easily transfer to understanding and critically assessing studies done by others. If we want students to assess how well a study supports its claims, we have to teach them to assess research. That is the approach this book takes.

Students Can Develop Research Consumer Skills

To be a systematic consumer of research, students need to know what to prioritize when assessing a study. Sometimes random samples matter, and sometimes they do not. Sometimes we ask about random assignment and confounds, and sometimes we do not. Students benefit from having a set of systematic steps to help them prioritize their questioning when they interrogate quantitative information. To provide that, this book presents a framework of three claims and four validities, introduced in Chapter 3. One axis of the framework is the three kinds of claims researchers (as well as journalists, bloggers, and commentators) might make: frequency claims (some percentage of people do X), association claims (X is associated with Y), and causal claims (X changes Y). The second axis of the framework is the four validities that are generally agreed upon by methodologists: internal, external, construct, and statistical.

The three claims, four validities framework provides a scaffold that is reinforced throughout. The book shows how almost every term, technique, and piece of information fits into the basic framework.

The framework also helps students set priorities when evaluating a study. Good quantitative reasoners prioritize different validity questions depending on the claim. For example, for a frequency claim, we should ask about measurement (construct validity) and sampling techniques (external validity), but not about random assignment or confounds, because the claim is not a causal one. For a causal claim, we prioritize internal validity and construct validity, but external validity is generally less important.
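
As an informal illustration, the prioritization described above can be sketched as a small lookup table. This is only a sketch under stated assumptions: the names `Claim`, `Validity`, `PRIORITY_VALIDITIES`, and `priority_validities` are hypothetical and do not come from the book, and only the two pairings named in the paragraph above (frequency and causal claims) are filled in. The entry for association claims is deliberately left out, because this passage does not say which validities to prioritize for them.

```python
from enum import Enum


class Claim(Enum):
    FREQUENCY = "frequency"      # some percentage of people do X
    ASSOCIATION = "association"  # X is associated with Y
    CAUSAL = "causal"            # X changes Y


class Validity(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"
    CONSTRUCT = "construct"
    STATISTICAL = "statistical"


# Hypothetical lookup table: only the pairings stated in the passage are included.
PRIORITY_VALIDITIES = {
    Claim.FREQUENCY: (Validity.CONSTRUCT, Validity.EXTERNAL),
    Claim.CAUSAL: (Validity.INTERNAL, Validity.CONSTRUCT),
}


def priority_validities(claim):
    """Return the validities to ask about first, or None if the passage does not say."""
    return PRIORITY_VALIDITIES.get(claim)


if __name__ == "__main__":
    for claim in Claim:
        priorities = priority_validities(claim)
        if priorities is None:
            print(f"{claim.value}: not specified in this passage")
        else:
            print(f"{claim.value}: " + ", ".join(v.value for v in priorities))
```

Keeping the table separate from the function makes it easy to see that the priorities depend only on the type of claim, which is the point the framework makes.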

Through engagement with a consumer-focused research methods course, students become systematic interrogators. They start to ask more appropriate and refined questions about a study. By the end of the course, students can clearly explain why a causal claim needs an experiment to support it. They know how to evaluate whether a variable has been measured well. They know when it’s appropriate to call for more participants in a study. And they can explain when a study must have a representative sample and when such a sample is not needed.

What About Future Researchers?

This book can also be used to teach the flip side of the question: How can producers of research design better studies? The producer angle is presented so that students will be prepared to design studies, collect data, and write papers in courses that prioritize these skills. Producer skills are crucial for students headed for Ph.D. study, and they are sometimes required by advanced coursework in the undergraduate major.

Such future researchers will find sophisticated content, presented in an accessible, consistent manner. They will learn the difference between mediation (Chapter 9) and moderation (Chapters 8 and 9), an important skill in theory building and theory testing. They will learn how to design and interpret factorial designs, even up to three-way interactions (Chapter 12). And in the common event that a student-run study fails to work, one chapter helps them explore the possible reasons for a null effect (Chapter 11). This book provides the basic statistical background, ethics coverage, and APA-style notes for guiding students through study design and execution.

Organization

The fourteen chapters are arranged in six parts. Part I (Chapters 1–3) includes introductory chapters on the scientific method and the three claims, four validities framework. Part II (Chapters 4–5) covers issues that matter for any study: research ethics and good measurement. Parts III–V (Chapters 6–12) correspond to each of the three claims (frequency, association, and causal). Part VI (Chapters 13–14) focuses on balancing research priorities.

Most of the chapters will be familiar to veteran instructors, including chapters on measurement, experimentation, and factorial designs. However, unlike some methods books, this one devotes two full chapters to correlational research (one on bivariate and one on multivariate studies), which help students learn how to interpret, apply, and interrogate different types of association claims, one of the common types of claims they will encounter.

There are three supplementary chapters, on Descriptive Statistics, Inferential Statistics, and APA-Style Reports and Conference Posters. These chapters provide a review for students who have already had statistics and provide the tools they need to create research reports and conference posters.

Two appendices—Random Numbers and How to Use Them, and Statistical Tables—provide reference tools for students who are conducting their own research.

Support for Students and Instructors

The book’s pedagogical features emphasize active learning and repetition of the most important points. Each chapter begins with high-level learning objectives, the major skills students should expect to remember even “a year from now.” Important terms in a chapter are introduced in boldface. The Check Your Understanding questions at the end of each major section provide basic questions that let students revisit key concepts as they read. Each chapter ends with multiple-choice Review Questions for retrieval practice, and a set of Learning Actively exercises that encourage students to apply what they learned. (Answers are provided at the end of the book.) A master table of the three claims and four validities appears inside the book’s front cover to remind students of the scaffold for the course.

I believe the book works pedagogically because it spirals through the three claims, four validities framework, building in repetition and depth. Although each chapter addresses the usual core content of research methods, students are always reminded of how a particular topic helps them interrogate the key validities. The interleaving of content should help students remember and apply this questioning strategy in the future.

I have worked with W. W. Norton to design a support package for fellow instructors and students. The online Interactive Instructor’s Guide offers in-class activities, models of course design, homework and final assignments, and chapter-by-chapter teaching notes, all based on my experience with the course. The book is accompanied by other ancillaries to assist both new and experienced research methods instructors, including a new InQuizitive online assessment tool, a robust test bank with over 750 questions, updated lecture and active learning slides, and more; for a complete list, see p. xix.


Teachable Examples on the Everyday Research Methods Blog

Students and instructors can find additional examples of psychological science in the news on my regularly updated blog, Everyday Research Methods (www; no password or registration required). Instructors can use the blog for fresh examples to use in class, homework, or exams. Students can use the entries as extra practice in reading about research studies in psychology in the popular media. Follow me on Twitter to get the latest blog updates (@bmorling).

Changes in the Third Edition

Users of the first and second editions will be happy to learn that the basic organization, material, and descriptions in the text remain the same. The third edition provides several new studies and recent headlines. Inclusion of these new examples means that instructors who assign the third edition can also use their favorite illustrations from past editions as extra examples while teaching.

In my own experience teaching the course, I found that students could often master concepts in isolation, but they struggled to bring them all together when reading a real study. Therefore, the third edition adds new Working It Through sections in several chapters (Chapters 3, 4, 5, 8, and 11). Each one works through a single study in depth, so students can observe how the chapter’s central concepts are integrated and applied. For instance, in Chapter 4, they can see how ethics concepts can be applied to a recent study that manipulated Facebook newsfeeds. The Working It Through material models the process students will probably use on longer class assignments.

Also new in the third edition, every figure has been redrawn to make it more visually appealing and readable. In addition, selected figures are annotated to help students learn how to interpret graphs and tables.

Finally, W. W. Norton’s InQuizitive online assessment tool is available with the third edition. InQuizitive helps students apply concepts from the textbook to practice examples, providing specific feedback on incorrect responses. Some questions require students to interpret tables and figures; others require them to apply what they’re learning to popular media articles.

Here is a detailed list of the changes made to each chapter.



1. Psychology Is a Way of Thinking

The heading structure is the same as in the second edition, with some updated examples. I replaced the facilitated communication example (still an excellent teaching example) with one on the Scared Straight program meant to keep adolescents out of the criminal justice system, based on a reviewer’s recommendation.

2. Sources of Information: Why Research Is Best and How to Find It

I simplified the coverage of biases of intuition. Whereas the second edition separated cognitive biases from motivated reasoning, the biases are now presented more simply. In addition, this edition aims to be clearer on the difference between the availability heuristic and the present/present bias. I also developed the coverage of Google Scholar.

3. Three Claims, Four Validities: Interrogation Tools for Consumers of Research

The three claims, four validities framework is the same, keeping the best teachable examples from the second edition and adding new examples from recent media. In response to my own students’ confusion, I attempted to clarify the difference between the type of study conducted (correlational or experimental) and the claims made about it. To this end, I introduced the metaphor of a gift, in which a journalist might “wrap” a correlational study in a fancy, but inappropriate, causal claim.

When introducing the three criteria for causation, I now emphasize that covariance is about the study’s results, while temporal precedence and internal validity are determined from the study’s method.

Chapter 3 includes the first new Working It Through section.

4. Ethical Guidelines for Psychology Research

I updated the section on animal research and removed the full text of APA Standard 8. There’s a new figure on the difference between plagiarism and paraphrasing, and a new example of research fabrication (the notorious, retracted Lancet article on vaccines and autism). A new Working It Through section helps students assess the ethics of a recent Facebook study that manipulated people’s newsfeeds.

5. Identifying Good Measurement

This chapter retains many of the same teaching examples as the second edition. For clarity, I changed the discriminant validity example so the correlation is only weak (not both weak and negative). A new Working It Through section helps students apply the measurement concepts to a self-report measure of gratitude in relationships.

6. Surveys and Observations: Describing What People Do

Core examples are the same, with a new study illustrating the effect of leading questions (a poll on attitudes toward voter ID laws). Look for the new “babycam” example in the Learning Actively exercises.

7. Sampling: Estimating the Frequency of Behaviors and Beliefs

Look for new content on MTurk and other Internet-based survey panels. I updated the statistics on cell-phone-only populations, which change yearly. Finally, I added clarity on the difference between cluster and stratified samples and explained sample weighting.

I added the new keyword nonprobability sample to work in parallel with the term probability sample. A new table (Table 7.3) helps students group related terms.



8. Bivariate Correlational Research

This chapter keeps most of the second edition examples. It was revised to better show that association claims are separate from correlational methods. Look for improved moderator examples in this chapter. These new examples, I hope, will communicate to students that moderators change the relationship between variables; they do not necessarily reflect the level of one of the variables.

9. Multivariate Correlational Research

I replaced both of the main examples in this chapter. The new example of cross-lag panel design, on parental overpraise and child narcissism, has four time periods (rather than two), better representing contemporary longitudinal studies. In the multiple regression section, the recess example is replaced with one on adolescents in which watching sexual TV content predicts teen pregnancy. The present regression example is student-friendly and also has stronger effect sizes.

Look for an important change in Figure 9.13 aimed to convey that a moderator can be thought of as vulnerability. My own students tend to think something is a moderator when the subgroup is simply higher on one of the variables. For example, boys might watch more violent TV content and be higher on aggression, but that’s not the same as a moderator. Therefore, I have updated the moderator column with the moderator “parental discussion.” I hope this will help students come up with their own moderators more easily.

10. Introduction to Simple Experiments

The red/green ink example was replaced with a popular study on notetaking, comparing the effects of taking notes in longhand or on laptops. There is also a new example of pretest/posttest designs (a study on mindfulness training). Students sometimes are surprised when a real-world study has multiple dependent variables, so I’ve highlighted that more in the third edition. Both of the chapter’s opening examples have multiple dependent variables.

I kept the example on pasta bowl serving size. However, after Chapter 10 was typeset, some researchers noticed multiple statistical inconsistencies in several publications from Wansink’s lab (for one summary of the issues, see the Chronicle of Higher Education article, “Spoiled Science”). At the time of writing, the pasta study featured in Chapter 10 has not been identified as problematic. Nevertheless, instructors might wish to engage students in a discussion of these issues.

11. More on Experiments: Confounding and Obscuring Variables

The content is virtually the same, with the addition of two Working It Through sections. The first one is to show students how to work through Table 11.1 using the mindfulness study from Chapter 10. This is important because after seeing Table 11.1, students sometimes think their job is to find the flaw in any study. In fact, most published studies do not have major internal validity flaws. The second Working It Through shows students how to analyze a null result.

12. Experiments with More Than One Independent Variable

Recent work has suggested that context-specific memory effects are not robust, so I replaced the Godden and Baddeley factorial example on context-specific learning with one comparing the memory of child chess experts to adults.



13. Quasi-Experiments and Small-N Designs

I replaced the Head Start study for two reasons. First, I realized it’s not a good example of a nonequivalent control group posttest-only design, because it actually included a pretest! Second, the regression to the mean effect it meant to illustrate is rare and difficult to understand. In exchange, there is a new study on the effects of walking by a church.

In the small-N design section, I provided fresh examples of multiple baseline design and alternating treatment designs. I also replaced the former case study example (split-brain studies) with the story of H.M. Not only is H.M.’s story compelling (especially as told through the eyes of his friend and researcher Suzanne Corkin), the brain anatomy required to understand this example is also simpler than that of split-brain studies, making it more teachable.

14. Replication, Generalization, and the Real World

A significant new section and table present the so-called “replication crisis” in psychology. In my experience, students are extremely engaged in learning about these issues. There’s a new example of a field experiment, a study on the effect of radio programs on reconciliation in Rwanda.

Supplementary Chapters

In the supplementary chapter on inferential statistics, I replaced the section on randomization tests with a new section on confidence intervals. The next edition of the book may transition away from null hypothesis significance testing to emphasize the “New Statistics” of estimation and confidence intervals. I welcome feedback from instructors on this potential change.




Working on this textbook has been rewarding and enriching, thanks to the many people who have smoothed the way. To start, I feel fortunate to have collaborated with an author-focused company and an all-around great editor, Sheri Snavely. Through all three editions, she has been both optimistic and realistic, as well as savvy and smart. She also made sure I got the most thoughtful reviews possible and that I was supported by an excellent staff at Norton: David Bradley, Jane Searle, Rubina Yeh, Eve Sanoussi, Victoria Reuter, Alex Trivilino, Travis Carr, and Dena Digilio Betz. My developmental editor, Betsy Dilernia, found even more to refine in the third edition, making the language, as well as each term, figure, and reference, clear and accurate.

I am also thankful for the support and continued enthusiasm I have received from the Norton sales management team: Michael Wright, Allen Clawson, Ashley Sherwood, Annie Stewart, Dennis Fernandes, Dennis Adams, Katie Incorvia, Jordan Mendez, Amber Watkins, Shane Brisson, and Dan Horton. I also wish to thank the science and media specialists for their creativity and drive to ensure my book reaches a wide audience, and that all the media work for instructors and students.

I deeply appreciate the support of many colleagues. My former student Patrick Ewell, now at Kenyon College, served as a sounding board for new examples and authored the content for InQuizitive. Eddie Brummelman and Stefanie Nelemans provided additional correlations for the cross-lag panel design in Chapter 9. My friend Carrie Smith authored the Test Bank for the past two editions and has made it an authentic measure of quantitative reasoning (as well as sending me things to blog about). Catherine Burrows carefully checked and revised the Test Bank for the third edition. Many thanks to Sarah Ainsworth, Reid Griggs, Aubrey McCarthy, Emma McGorray, and Michele M. Miller for carefully and patiently fact-checking every word in this edition. My student Xiaxin Zhong added DOIs to all the references and provided page numbers for the Check Your Understanding answers. Thanks, as well, to Emily Stanley and Jeong Min Lee, for writing and revising the questions that appear in the Coursepack created for the course management systems. I’m grateful to Amy Corbett and Kacy Pula for reviewing the questions in InQuizitive. Thanks to my students Matt Davila-Johnson and Jeong Min Lee for posing for photographs in Chapters 5 and 10.

The book’s content was reviewed by a cadre of talented research method professors, and I am grateful to each of them. Some were asked to review; others cared enough to send me comments or examples by e-mail. Their students are lucky to have them in the classroom, and my readers will benefit from the time they spent in improving this book:

Eileen Josiah Achorn, University of Texas, San Antonio
Sarah Ainsworth, University of North Florida
Kristen Weede Alexander, California State University, Sacramento
Leola Alfonso-Reese, San Diego State University
Cheryl Armstrong, Fitchburg State University
Jennifer Asmuth, Susquehanna University
Kristin August, Rutgers University, Camden
Jessica L. Barnack-Tavlaris, The College of New Jersey
Gordon Bear, Ramapo College
Margaret Elizabeth Beier, Rice University
Jeffrey Berman, University of Memphis
Brett Beston, McMaster University
Alisa Beyer, Northern Arizona University
Julie Boland, University of Michigan
Marina A. Bornovalova, University of South Florida
Caitlin Brez, Indiana State University
Shira Brill, California State University, Northridge
J. Corey Butler, Southwest Minnesota State University
Ricardo R. Castillo, Santa Ana College
Alexandra F. Corning, University of Notre Dame
Kelly A. Cotter, California State University, Stanislaus
Lisa Cravens-Brown, The Ohio State University
Victoria Cross, University of California, Davis
Matthew Deegan, University of Delaware
Kenneth DeMarree, University at Buffalo
Jessica Dennis, California State University, Los Angeles
Nicole DeRosa, SUNY Upstate Golisano Children’s Hospital
Rachel Dinero, Cazenovia College
Dana S. Dunn, Moravian College
C. Emily Durbin, Michigan State University
Russell K. Espinoza, California State University, Fullerton
Patrick Ewell, Kenyon College
Iris Firstenberg, University of California, Los Angeles
Christina Frederick, Sierra Nevada College
Alyson Froehlich, University of Utah
Christopher J. Gade, University of California, Berkeley
Timothy E. Goldsmith, University of New Mexico
Jennifer Gosselin, Sacred Heart University
AnaMarie Connolly Guichard, California State University, Stanislaus
Andreana Haley, University of Texas, Austin
Edward Hansen, Florida State University
Cheryl Harasymchuk, Carleton University
Richard A. Hullinger, Indiana State University
Deborah L. Hume, University of Missouri
Kurt R. Illig, University of St. Thomas
Jonathan W. Ivy, Pennsylvania State University, Harrisburg
W. Jake Jacobs, University of Arizona
Matthew D. Johnson, Binghamton University
Christian Jordan, Wilfrid Laurier University
Linda Juang, San Francisco State University
Victoria A. Kazmerski, Penn State Erie, The Behrend College
Heejung Kim, University of California, Santa Barbara
Greg M. Kim-Ju, California State University, Sacramento
Ari Kirshenbaum, Ph.D., St. Michael’s College
Kerry S. Kleyman, Metropolitan State University
Penny L. Koontz, Marshall University
Christina M. Leclerc, Ph.D., State University of New York at Oswego
Ellen W. Leen-Feldner, University of Arkansas
Carl Lejuez, University of Maryland
Marianne Lloyd, Seton Hall University
Stella G. Lopez, University of Texas, San Antonio
Greg Edward Loviscky, Pennsylvania State University
Sara J. Margolin, Ph.D., The College at Brockport, State University of New York
Azucena Mayberry, Texas State University
Christopher Mazurek, Columbia College
Peter Mende-Siedlecki, University of Delaware
Molly A. Metz, Miami University
Dr. Michele M. Miller, University of Illinois Springfield
Daniel C. Molden, Northwestern University
J. Toby Mordkoff, University of Iowa
Elizabeth Morgan, Springfield College
Katie Mosack, University of Wisconsin, Milwaukee
Erin Quinlivan Murdoch, George Mason University
Stephanie C. Payne, Texas A&M University
Anita Pedersen, California State University, Stanislaus
Elizabeth D. Peloso, University of Pennsylvania
M. Christine Porter, College of William and Mary
Joshua Rabinowitz, University of Michigan
Elizabeth Riina, Queens College, City University of New York
James R. Roney, University of California, Santa Barbara
Richard S. Rosenberg, Ph.D., California State University, Long Beach
Carin Rubenstein, Pima Community College
Silvia J. Santos, California State University, Dominguez Hills
Pamela Schuetze, Ph.D., The College at Buffalo, State University of New York
John N. Schwoebel, Ph.D., Utica College
Mark J. Sciutto, Muhlenberg College
Elizabeth A. Sheehan, Georgia State University
Victoria A. Shivy, Virginia Commonwealth University
Leo Standing, Bishop’s University
Harold W. K. Stanislaw, California State University, Stanislaus
Kenneth M. Steele, Appalachian State University
Mark A. Stellmack, University of Minnesota, Twin Cities
Eva Szeli, Arizona State University
Lauren A. Taglialatela, Kennesaw State University
Alison Thomas-Cottingham, Rider University
Chantal Poister Tusher, Georgia State University
Allison A. Vaughn, San Diego State University
Simine Vazire, University of California, Davis
Jan Visser, University of Groningen
John L. Wallace, Ph.D., Ball State University
Shawn L. Ward, Le Moyne College
Christopher Warren, California State University, Long Beach
Shannon N. Whitten, University of Central Florida
Jelte M. Wicherts, Tilburg University
Antoinette R. Wilson, University of California, Santa Cruz
James Worthley, University of Massachusetts, Lowell
Charles E. (Ted) Wright, University of California, Irvine
Guangying Wu, The George Washington University
David Zehr, Plymouth State University
Peggy Mycek Zoccola, Ohio University

I have tried to make the best possible improvements from all of these capable reviewers.

My life as a teaching professor has been enriched during the last few years because of the friendship and support of my students and colleagues at the University of Delaware, colleagues I see each year at the SPSP conference, and all the faculty I see regularly at the National Institute for the Teaching of Psychology, affectionately known as NITOP.

Three teenage boys will keep a person both entertained and humbled; thanks to Max, Alek, and Hugo for providing their services. I remain grateful to my mother-in-law, Janet Pochan, for cheerfully helping on the home front. Finally, I want to thank my husband Darrin for encouraging me and for always having the right wine to celebrate (even if it’s only Tuesday).

Beth Morling

Media Resources for Instructors and Students




INTERACTIVE INSTRUCTOR’S GUIDE
Beth Morling, University of Delaware
The Interactive Instructor’s Guide contains hundreds of downloadable resources and teaching ideas, such as a discussion of how to design a course that best utilizes the textbook, sample syllabus and assignments, and chapter-by-chapter teaching notes and suggested activities.

POWERPOINTS
The third edition features three types of PowerPoints. The Lecture PowerPoints provide an overview of the major headings and definitions for each chapter. The Art Slides contain a complete set of images. And the Active Learning Slides provide the author’s favorite in-class activities, as well as reading quizzes and clicker questions. Instructors can browse the Active Learning Slides to select activities that supplement their classes.

TEST BANK
C. Veronica Smith, University of Mississippi, and Catherine Burrows, University of Miami
The Test Bank provides over 750 questions using an evidence-centered approach designed in collaboration with Valerie Shute of Florida State University and Diego Zapata-Rivera of the Educational Testing Service. The Test Bank contains multiple-choice and short-answer questions classified by section, Bloom’s taxonomy, and difficulty, making it easy for instructors to construct tests and quizzes that are meaningful and diagnostic. The Test Bank is available in Word RTF, PDF, and ExamView® Assessment Suite formats.

INQUIZITIVE
Patrick Ewell, Kenyon College
InQuizitive allows students to practice applying terminology in the textbook to numerous examples. It can guide the students with specific feedback for incorrect answers to help clarify common mistakes. This online assessment tool gives students the repetition they need to fully understand the material without cutting into valuable class time. InQuizitive provides practice in reading tables and figures, as well as identifying the research methods used in studies from popular media articles, for an integrated learning experience.

EVERYDAY RESEARCH METHODS BLOG
The Research Methods in Psychology blog offers more than 150 teachable moments from the web, curated by Beth Morling and occasional guest contributors. Twice a month, the author highlights examples of psychological science in the news. Students can connect these recent stories with textbook concepts. Instructors can use blog posts as examples in lecture or assign them as homework. All entries are searchable by chapter.

COURSEPACK
Emily Stanley, University of Mary Washington, and Jeong Min Lee, University of Delaware
The Coursepack presents students with review opportunities that employ the text’s analytical framework. Each chapter includes quizzes based on the Norton Assessment Guidelines, Chapter Outlines created by the textbook author and based on the Learning Objectives in the text, and review flashcards. The APA-style guidelines from the textbook are also available in the Coursepack for easy access.







Preface ix

Media Resources for Instructors and Students xix

PART I Introduction to Scientific Reasoning


Psychology Is a Way of Thinking 5

Research Producers, Research Consumers 6 Why the Producer Role Is Important 6

Why the Consumer Role Is Important 7

The Benefits of Being a Good Consumer 8

How Scientists Approach Their Work 10 Scientists Are Empiricists 10

Scientists Test Theories: The Theory-Data Cycle 11

Scientists Tackle Applied and Basic Problems 16

Scientists Dig Deeper 16

Scientists Make It Public: The Publication Process 17

Scientists Talk to the World: From Journal to Journalism 17

Chapter Review 22




Sources of Information: Why Research Is Best and How to Find It 25

The Research vs. Your Experience 26 Experience Has No Comparison Group 26

Experience Is Confounded 29

Research Is Better Than Experience 29

Research Is Probabilistic 31

The Research vs. Your Intuition 32 Ways That Intuition Is Biased 32

The Intuitive Thinker vs. the Scientific Reasoner 38

Trusting Authorities on the Subject 39 Finding and Reading the Research 42 Consulting Scientific Sources 42

Finding Scientific Sources 44

Reading the Research 46

Finding Research in Less Scholarly Places 48

Chapter Review 53


Three Claims, Four Validities: Interrogation Tools for Consumers of Research 57

Variables 58 Measured and Manipulated Variables 58

From Conceptual Variable to Operational Definition 59

Three Claims 61 Frequency Claims 62

Association Claims 63

Causal Claims 66

Not All Claims Are Based on Research 68

Interrogating the Three Claims Using the Four Big Validities 68 Interrogating Frequency Claims 69

Interrogating Association Claims 71

Interrogating Causal Claims 74

Prioritizing Validities 79

Review: Four Validities, Four Aspects of Quality 80

WORKING IT THROUGH Does Hearing About Scientists’ Struggles Inspire Young Students? 81

Chapter Review 83


PART II Research Foundations for Any Claim


Ethical Guidelines for Psychology Research 89

Historical Examples 89 The Tuskegee Syphilis Study Illustrates Three Major Ethics Violations 89

The Milgram Obedience Studies Illustrate a Difficult Ethical Balance 92

Core Ethical Principles 94 The Belmont Report: Principles and Applications 94

Guidelines for Psychologists: The APA Ethical Principles 98 Belmont Plus Two: APA’s Five General Principles 98

Ethical Standards for Research 99

Ethical Decision Making: A Thoughtful Balance 110

WORKING IT THROUGH Did a Study Conducted on Facebook Violate Ethical Principles? 111

Chapter Review 113


Identifying Good Measurement 117

Ways to Measure Variables 118 More About Conceptual and Operational Variables 118

Three Common Types of Measures 120

Scales of Measurement 122

Reliability of Measurement: Are the Scores Consistent? 124 Introducing Three Types of Reliability 125

Using a Scatterplot to Quantify Reliability 126

Using the Correlation Coefficient r to Quantify Reliability 128

Reading About Reliability in Journal Articles 131

Validity of Measurement: Does It Measure What It’s Supposed to Measure? 132

Measurement Validity of Abstract Constructs 133

Face Validity and Content Validity: Does It Look Like a Good Measure? 134

Criterion Validity: Does It Correlate with Key Behaviors? 135

Convergent Validity and Discriminant Validity: Does the Pattern Make Sense? 139

The Relationship Between Reliability and Validity 142


Review: Interpreting Construct Validity Evidence 143

WORKING IT THROUGH How Well Can We Measure the Amount of Gratitude Couples Express to Each Other? 145

Chapter Review 147

PART III Tools for Evaluating Frequency Claims


Surveys and Observations: Describing What People Do 153

Construct Validity of Surveys and Polls 153 Choosing Question Formats 154

Writing Well-Worded Questions 155

Encouraging Accurate Responses 159

Construct Validity of Behavioral Observations 165 Some Claims Based on Observational Data 165

Making Reliable and Valid Observations 169

Chapter Review 175


Sampling: Estimating the Frequency of Behaviors and Beliefs 179

Generalizability: Does the Sample Represent the Population? 179 Populations and Samples 180

When Is a Sample Biased? 182

Obtaining a Representative Sample: Probability Sampling Techniques 186

Settling for an Unrepresentative Sample: Nonprobability Sampling Techniques 191

Interrogating External Validity: What Matters Most? 193

In a Frequency Claim, External Validity Is a Priority 193

When External Validity Is a Lower Priority 194

Larger Samples Are Not More Representative 196

Chapter Review 198


PART IV Tools for Evaluating Association Claims


Bivariate Correlational Research 203

Introducing Bivariate Correlations 204

Review: Describing Associations Between Two Quantitative Variables 205

Describing Associations with Categorical Data 207

A Study with All Measured Variables Is Correlational 209

Interrogating Association Claims 210 Construct Validity: How Well Was Each Variable Measured? 210

Statistical Validity: How Well Do the Data Support the Conclusion? 211

Internal Validity: Can We Make a Causal Inference from an Association? 221

External Validity: To Whom Can the Association Be Generalized? 226

WORKING IT THROUGH Are Parents Happier Than People with No Children? 231

Chapter Review 233


Multivariate Correlational Research 237

Reviewing the Three Causal Criteria 238

Establishing Temporal Precedence with Longitudinal Designs 239

Interpreting Results from Longitudinal Designs 239

Longitudinal Studies and the Three Criteria for Causation 242

Why Not Just Do an Experiment? 242

Ruling Out Third Variables with Multiple-Regression Analyses 244 Measuring More Than Two Variables 244

Regression Results Indicate If a Third Variable Affects the Relationship 247

Adding More Predictors to a Regression 251

Regression in Popular Media Articles 252

Regression Does Not Establish Causation 254

Getting at Causality with Pattern and Parsimony 256 The Power of Pattern and Parsimony 256

Pattern, Parsimony, and the Popular Media 258


Mediation 259 Mediators vs. Third Variables 261

Mediators vs. Moderators 262

Multivariate Designs and the Four Validities 264

Chapter Review 266

PART V Tools for Evaluating Causal Claims


Introduction to Simple Experiments 273

Two Examples of Simple Experiments 273 Example 1: Taking Notes 274

Example 2: Eating Pasta 275

Experimental Variables 276 Independent and Dependent Variables 277

Control Variables 278

Why Experiments Support Causal Claims 278 Experiments Establish Covariance 279

Experiments Establish Temporal Precedence 280

Well-Designed Experiments Establish Internal Validity 281

Independent-Groups Designs 287 Independent-Groups vs. Within-Groups Designs 287

Posttest-Only Design 287

Pretest/Posttest Design 288

Which Design Is Better? 289

Within-Groups Designs 290 Repeated-Measures Design 290

Concurrent-Measures Design 291

Advantages of Within-Groups Designs 292

Covariance, Temporal Precedence, and Internal Validity in Within-Groups Designs 294

Disadvantages of Within-Groups Designs 296

Is Pretest/Posttest a Repeated-Measures Design? 297

Interrogating Causal Claims with the Four Validities 298 Construct Validity: How Well Were the Variables Measured and Manipulated? 298

External Validity: To Whom or What Can the Causal Claim Generalize? 301

Statistical Validity: How Well Do the Data Support the Causal Claim? 304

Internal Validity: Are There Alternative Explanations for the Results? 306

Chapter Review 307



More on Experiments: Confounding and Obscuring Variables 311

Threats to Internal Validity: Did the Independent Variable Really Cause the Difference? 312

The Really Bad Experiment (A Cautionary Tale) 312

Six Potential Internal Validity Threats in One-Group, Pretest/Posttest Designs 314

Three Potential Internal Validity Threats in Any Study 322

With So Many Threats, Are Experiments Still Useful? 325

WORKING IT THROUGH Did Mindfulness Training Really Cause GRE Scores to Improve? 328

Interrogating Null Effects: What If the Independent Variable Does Not Make a Difference? 330

Perhaps There Is Not Enough Between-Groups Difference 332

Perhaps Within-Groups Variability Obscured the Group Differences 335

Sometimes There Really Is No Effect to Find 342

WORKING IT THROUGH Will People Get More Involved in Local Government If They Know They’ll Be Publicly Honored? 344

Null Effects May Be Published Less Often 345

Chapter Review 346


Experiments with More Than One Independent Variable 351

Review: Experiments with One Independent Variable 351

Experiments with Two Independent Variables Can Show Interactions 353

Intuitive Interactions 353

Factorial Designs Study Two Independent Variables 355

Factorial Designs Can Test Limits 356

Factorial Designs Can Test Theories 358

Interpreting Factorial Results: Main Effects and Interactions 360

Factorial Variations 370 Independent-Groups Factorial Designs 370

Within-Groups Factorial Designs 370

Mixed Factorial Designs 371

Increasing the Number of Levels of an Independent Variable 371

Increasing the Number of Independent Variables 373

Identifying Factorial Designs in Your Reading 378 Identifying Factorial Designs in Empirical Journal Articles 379

Identifying Factorial Designs in Popular Media Articles 379

Chapter Review 383


PART VI Balancing Research Priorities


Quasi-Experiments and Small-N Designs 389

Quasi-Experiments 389

Two Examples of Independent-Groups Quasi-Experiments 390

Two Examples of Repeated-Measures Quasi-Experiments 392

Internal Validity in Quasi-Experiments 396

Balancing Priorities in Quasi-Experiments 404

Are Quasi-Experiments the Same as Correlational Studies? 405

Small-N Designs: Studying Only a Few Individuals 406 Research on Human Memory 407

Disadvantages of Small-N Studies 410

Behavior-Change Studies in Applied Settings: Three Small-N Designs 411

Other Examples of Small-N Studies 417

Evaluating the Four Validities in Small-N Designs 418

Chapter Review 420


Replication, Generalization, and the Real World 425

To Be Important, a Study Must Be Replicated 425 Replication Studies 426

The Replication Debate in Psychology 430

Meta-Analysis: What Does the Literature Say? 433

Replicability, Importance, and Popular Media 436

To Be Important, Must a Study Have External Validity? 438 Generalizing to Other Participants 438

Generalizing to Other Settings 439

Does a Study Have to Be Generalizable to Many People? 440

Does a Study Have to Take Place in a Real-World Setting? 447

Chapter Review 453


Statistics Review: Descriptive Statistics 457
Statistics Review: Inferential Statistics 479
Presenting Results: APA-Style Reports and Conference Posters 505
Appendix A: Random Numbers and How to Use Them 545
Appendix B: Statistical Tables 551
Areas Under the Normal Curve (Distribution of z) 551
Critical Values of t 557
Critical Values of F 559
r to z’ Conversion 564
Critical Values of r 565
Glossary 567
Answers to End-of-Chapter Questions 577
Review Question 577
Guidelines for Selected Learning Actively Exercises 578
References 589
Credits 603
Name Index 607
Subject Index 611




Introduction to Scientific Reasoning

Your Dog Hates Hugs, 2016

Mindfulness May Improve Test Scores Scientific American, 2013


Psychology Is a Way of Thinking

Thinking back to your introductory psychology course, what do you remember learning? You might remember that dogs can be trained to salivate at the sound of a bell or that people in a group fail to call for help when the room fills up with smoke. Or perhaps you recall studies in which people administered increasingly stronger electric shocks to an innocent man although he seemed to be in distress. You may have learned what your brain does while you sleep or that you can’t always trust your memories. But how come you didn’t learn that “we use only 10% of our brain” or that “hitting a punching bag can make your anger go away”?

The reason you learned some principles, and not others, is because psychological science is based on studies—on research—by psychologists. Like other scientists, psychologists are empiricists. Being an empiricist means basing one’s conclusions on systematic observations. Psychologists do not simply think intuitively about behavior, cognition, and emotion; they know what they know because they have conducted studies on people and animals acting in their natural environments or in specially designed situations. Research is what tells us that most people will administer electric shock to an innocent man in certain situations, and it also tells us that people’s brains are usually fully engaged—not just 10%. If you are to think like a psychologist, then you must think like a researcher, and taking a course in research methods is crucial to your understanding of psychology.

This book explains the types of studies psychologists conduct, as well as the potential strengths and limitations of each type of study. You will learn not only how to plan your own studies but also how to find research, read about it, and ask questions about it. While gaining a greater appreciation for the rigorous standards psychologists maintain in their research, you’ll find out how to be a systematic and critical consumer of psychological science.

A year from now, you should still be able to:

1. Explain what it means to reason empirically.

2. Appreciate how psychological research methods help you become a better producer of information as well as a better consumer of information.

3. Describe five practices that psychological scientists engage in.

RESEARCH PRODUCERS, RESEARCH CONSUMERS

Some psychology students are fascinated by the research process and intend to become producers of research. Perhaps they hope to get a job studying brain anatomy, documenting the behavior of dolphins or monkeys, administering personality questionnaires, observing children in a school setting, or analyzing data. They may want to write up their results and present them at research meetings. These students may dream about working as research scientists or professors.

Other psychology students may not want to work in a lab, but they do enjoy reading about the structure of the brain, the behavior of dolphins or monkeys, the personalities of their fellow students, or the behavior of children in a school setting. They are interested in being consumers of research information—reading about research so they can later apply it to their work, hobbies, relationships, or personal growth. These students might pursue careers as family therapists, teachers, entrepreneurs, guidance counselors, or police officers, and they expect psychology courses to help them in these roles.

In practice, many psychologists engage in both roles. When they are planning their research and creating new knowledge, they study the work of others who have gone before them. Furthermore, psychologists in both roles require a curiosity about behavior, emotion, and cognition. Research producers and consumers also share a commitment to the practice of empiricism—to answer psychological questions with direct, formal observations, and to communicate with others about what they have learned.

Why the Producer Role Is Important

For your future coursework in psychology, it is important to know how to be a producer of research. Of course, students who decide to go to graduate school for psychology will need to know all about research methods. But even if you do not plan to do graduate work in psychology, you will probably have to write a paper following the style guidelines of the American Psychological Association (APA) before you graduate, and you may be required to do research as part of a course lab section. To succeed, you will need to know how to randomly assign people to groups, how to measure attitudes accurately, or how to interpret results from a graph. The skills you acquire by conducting research can teach you how psychological scientists ask questions and how they think about their discipline.


As part of your psychology studies, you might even work in a research lab as an undergraduate (Figure 1.1). Many psychology professors are active researchers, and if you are offered the opportunity to get involved in their laboratories, take it! Your faculty supervisor may ask you to code behaviors, assign participants to different groups, graph an outcome, or write a report. Doing so will give you your first taste of being a research producer. Although you will be supervised closely, you will be expected to know the basics of conducting research. This book will help you understand why you have to protect the anonymity of your participants, use a coding book, or flip a coin to decide who goes in which group. By participating as a research producer, you can expect to deepen your understanding of psychological inquiry.
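The coin flip mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not a procedure taken from the book: the function name `randomly_assign`, the group labels, and the participant IDs are all hypothetical. It also swaps the literal coin flip for a shuffle-and-deal approach, which keeps the groups equal in size while still leaving each person's group to chance.

```python
import random


def randomly_assign(participants, groups=("treatment", "control"), seed=None):
    """Shuffle participants, then deal them into groups in turn.

    Dealing after a shuffle keeps group sizes within one of each other,
    which a separate coin flip for each person would not guarantee.
    """
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for index, person in enumerate(shuffled):
        assignment[groups[index % len(groups)]].append(person)
    return assignment


# Hypothetical participant IDs, used only for illustration
participants = [f"P{n:02d}" for n in range(1, 11)]
for group, members in randomly_assign(participants, seed=42).items():
    print(group, members)
```

Because chance alone decides who lands in which group, the researcher's preferences and the participants' own characteristics cannot systematically sort people into one condition.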

Why the Consumer Role Is Important

Although it is important to understand the psychologist’s role as a producer of research, most psychology majors do not eventually become researchers. Regardless of the career you choose, however, becoming a savvy consumer of information is essential. In your psychology courses, you will read studies published by psychologists in scientific journals. You will need to develop the ability to read about research with curiosity—to understand it, learn from it, and ask appropriate questions about it.

Think about how often you encounter news stories or look up information on the Internet. Much of the time, the stories you read and the websites you visit will present information based on research. For example, during an election year, Americans may come across polling information in the media almost every day. Many online newspapers have science sections that include stories on the latest research. Entire websites are dedicated to psychology-related topics, such as treatments for autism, subliminal learning tapes, or advice for married couples. Magazines such as Scientific American, Men’s Health, and Parents summarize research for their readers. While some of the research—whether online or printed—is accurate and useful, some of it is dubious, and some is just plain wrong. How can you tell the good research information from the bad? Understanding research methods enables you to ask the appropriate questions so you can evaluate information correctly. Research methods skills apply not only to research studies but also to many of the other types of information you are likely to encounter in daily life.

FIGURE 1.1 Producers of research. As undergraduates, some psychology majors work alongside faculty members as producers of information.


Finally, being a smart consumer of research could be crucial to your future career. Even if you do not plan to be a researcher—if your goal is to be a social worker, a teacher, a sales representative, a human resources professional, an entrepreneur, or a parent—you will need to know how to interpret published research with a critical eye. Clinical psychologists, social workers, and family therapists must read research to know which therapies are the most effective. In fact, licensure in these helping professions requires knowing the research behind evidence-based treatments—that is, therapies that are supported by research. Teachers also use research to find out which teaching methods work best. And the business world runs on quantitative information: Research is used to predict what sales will be like in the future, what consumers will buy, and whether investors will take risks or lie low. Once you learn how to be a consumer of information—psychological or otherwise—you will use these skills constantly, no matter what job you have.

In this book, you will often see the phrase “interrogating information.” A consumer of research needs to know how to ask the right questions, determine the answers, and evaluate a study on the basis of those answers. This book will teach you systematic rules for interrogating research information.

The Benefits of Being a Good Consumer

What do you gain by being a critical consumer of information? Imagine, for example, that you are a correctional officer at a juvenile detention center, and you watch a TV documentary about a crime-prevention program called Scared Straight. The program arranges for teenagers involved in the criminal justice system to visit prisons, where selected prisoners describe the stark, violent realities of prison life (Figure 1.2). The idea is that when teens hear about how tough it is in prison, they will be scared into the “straight,” law-abiding life. The program makes a lot of sense to you. You are considering starting a partnership between the residents of your detention center and the state prison system.

FIGURE 1.2 Scared straight. Although it makes intuitive sense that young people would be scared into good behavior by hearing from current prisoners, such intervention programs have actually been shown to cause an increase in criminal offenses.

However, before starting the partnership, you decide to investigate the efficacy of the program by reviewing some research that has been conducted about it. You learn that despite the intuitive appeal of the Scared Straight approach, the program doesn’t work—in fact, it might even cause criminal activity to get worse! Several published articles have reported the results of randomized, controlled studies in which young adults were assigned to either a Scared Straight program or a control program. The researchers then collected criminal records for 6–12 months. None of the studies showed that Scared Straight attendees committed fewer crimes, and most studies found an increase in crime among participants in the Scared Straight programs, compared to the controls (Petrosino, Turpin-Petrosino, & Finckenauer, 2000). In one case, Scared Straight attendees had committed 20% more crimes than the control group.

At first, people considering such a program might think: If this program helps even one person, it’s worth it. However, we always need empirical evidence to test the efficacy of our interventions. A well-intentioned program that seems to make sense might actually be doing harm. In fact, if you investigate further, you’ll find that the U.S. Department of Justice officially warns that such programs are ineffective and can harm youth, and the Juvenile Justice and Delinquency Prevention Act of 1974 was amended to prohibit youth in the criminal justice system from interactions with adult inmates in jails and prisons.

Being a skilled consumer of information can inform you about other programs that might work. For example, in your quest to become a better student, suppose you see this headline: “Mindfulness may improve test scores.” The practice of mindfulness involves attending to the present moment, on purpose, with a nonjudgmental frame of mind (Kabat-Zinn, 2013). In a mindful state, people simply observe and let go of thoughts rather than elaborating on them. Could the practice of mindfulness really improve test scores? A study conducted by Michael Mrazek and his colleagues assigned people to take either a 2-week mindfulness training course or a 2-week nutrition course (Mrazek, Franklin, Phillips, Baird, & Schooler, 2013). At the end of the training, only the people who had practiced mindfulness showed improved GRE scores (compared to their scores beforehand). Mrazek’s group hypothesized that mindfulness training helps people attend to an academic task without being distracted. They were better, it seemed, at keeping their minds from wandering. The research evidence you read about here appears to support the use of mindfulness for improving test scores.

By understanding the research methods and results of this study, you might be convinced to take a mindfulness-training course similar to the one used by Mrazek and his colleagues. And if you were a teacher or tutor, you might consider advising your students to practice some of the focusing techniques. (Chapter 10 returns to this example and explains why the Mrazek study stands up to interrogation.) Your skills in research methods will help you become a better consumer of studies like this one, so you can decide when the research supports some programs (such as mindfulness for study skills) but not others (such as Scared Straight for criminal behavior).


1. Explain what the consumer of research and producer of research roles have in common, and describe how they differ.

2. What kinds of jobs would use consumer-of-research skills? What kinds of jobs would use producer-of-research skills?

HOW SCIENTISTS APPROACH THEIR WORK

Psychological scientists are identified not by advanced degrees or white lab coats; they are defined by what they do and how they think. The rest of this chapter will explain the fundamental ways psychologists approach their work. First, they act as empiricists in their investigations, meaning that they systematically observe the world. Second, they test theories through research and, in turn, revise their theories based on the resulting data. Third, they take an empirical approach to both applied research, which directly targets real-world problems, and basic research, which is intended to contribute to the general body of knowledge. Fourth, they go further: Once they have discovered an effect, scientists plan further research to test why, when, or for whom an effect works. Fifth, psychologists make their work public: They submit their results to journals for review and respond to the opinions of other scientists. Another aspect of making work public involves sharing findings of psychological research with the popular media, who may or may not get the story right.

Scientists Are Empiricists

Empiricists do not base conclusions on intuition, on casual observations of their own experience, or on what other people say. Empiricism, also referred to as the empirical method or empirical research, involves using evidence from the senses (sight, hearing, touch) or from instruments that assist the senses (such as thermometers, timers, photographs, weight scales, and questionnaires) as the basis for conclusions. Empiricists aim to be systematic and rigorous, and to make their work independently verifiable by other observers or scientists. In Chapter 2, you will learn more about why empiricism is considered the most reliable basis for conclusions when compared with other forms of reasoning, such as experience or intuition. For now, we’ll focus on some of the practices in which empiricists engage.

❯❯ For more on the contrast between empiricism and intuition, experience, and authority, see Chapter 2, pp. 26–31.

1. See pp. 6–7. 2. See pp. 7–8.

Scientists Test Theories: The Theory-Data Cycle

In the theory-data cycle, scientists collect data to test, change, or update their theories. Even if you have never been in a formal research situation, you have probably tested ideas and hunches of your own by asking specific questions that are grounded in theory, making predictions, and reflecting on data.

For example, let’s say you need to take your bike to work later, so you check the weather forecast on your tablet (Figure 1.3). The application opens, but you see a blank screen. What could be wrong? Maybe your entire device is on the blink: Do the other apps work? When you test them, you find your calculator is working, but not your e-mail. In fact, it looks as if only the apps that need wireless are not working. Your wireless indicator looks low, so you ask your roommate, sitting nearby, “Are you having wifi problems?” If she says no, you might try resetting your device’s wireless connection.

Notice the series of steps in this process. First, you asked a particular set of questions, all of which were guided by your theory about how such devices work. The questions (Is it the tablet as a whole? Is it only the wifi?) reflected your theory that the weather app requires a working electronic device as well as a wireless connection. Because you were operating under this theory, you chose not to ask other kinds of questions (Has a warlock cursed my tablet? Does my device have a bacterial infection?). Your theory set you up to ask certain questions and not others. Next, your questions led you to specific predictions, which you tested by collecting data. You tested your first idea about the problem (My device can’t run any apps) by making a specific prediction (If I test any application, it won’t work). Then you set up a situation to test your prediction (Does the calculator work?). The data (The calculator does work) told you your initial prediction was wrong. You used that outcome to change your idea about the problem (It’s only the wireless-based apps that aren’t working). And so on. When you take systematic steps to solve a problem, you are participating in something similar to what scientists do in the theory-data cycle.
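The steps in the tablet example can be laid out as a short Python sketch. It is only an illustration under stated assumptions: the function name `run_theory_data_cycle`, the app names, and the way each idea is reduced to a single prediction are hypothetical simplifications, not part of the original example. Each idea about the problem yields one prediction, the prediction is compared with the observed data, and an idea the data do not support is set aside in favor of the next one.

```python
def run_theory_data_cycle(observations):
    """Test each idea in turn by comparing its prediction with the data.

    `observations` maps an app name to True (it works) or False (it does not).
    Returns a list of (idea, supported) pairs in the order the ideas were tested.
    """
    # Each entry: (idea about the problem, app to test, what the idea predicts)
    ideas = [
        ("The device cannot run any apps", "calculator", False),
        ("Only apps that need wireless are failing", "email", False),
    ]
    log = []
    for idea, app, predicted_to_work in ideas:
        observed_to_work = observations[app]  # collect the data
        supported = observed_to_work == predicted_to_work
        log.append((idea, supported))
        if supported:
            break  # keep this idea for now and act on it
    return log


# Data from the example: the calculator works, but e-mail and weather do not
observations = {"weather": False, "calculator": True, "email": False}
for idea, supported in run_theory_data_cycle(observations):
    verdict = "supported" if supported else "not supported"
    print(f"{idea}: {verdict} by the data")
```

Note that the sketch reports an idea as supported, not proved: a later observation, such as a roommate with the same wifi trouble, could still lead you to revise it.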


A classic example from the psychological study of attachment can illustrate the way researchers similarly use data to test their theories. You’ve probably observed that animals form strong attachments to their caregivers. If you have a dog, you know he’s extremely happy to see you when you come home, wagging his tail and jumping all over you. Human babies, once they are able to crawl, may follow their parents or caregivers around, keeping close to them. Baby monkeys exhibit similar behavior, spending hours clinging tightly to the mother’s fur. Why do animals form such strong attachments to their caregivers?

FIGURE 1.3 Troubleshooting a tablet. Troubleshooting an electronic device is a form of engaging in the theory-data cycle.

12 CHAPTER 1 Psychology Is a Way of Thinking

One theory, referred to as the cupboard theory of mother-infant attachment, is that a mother is valuable to a baby mammal because she is a source of food. The baby animal gets hungry, gets food from the mother by nursing, and experiences a pleasant feeling (reduced hunger). Over time, the sight of the mother is associated with pleasure. In other words, the mother acquires positive value for the baby because she is the “cupboard” from which food comes. If you’ve ever assumed your dog loves you only because you feed it, your beliefs are consistent with the cupboard theory.

An alternative theory, proposed by psychologist Harry Harlow (1958), is that hunger has little to do with why a baby monkey likes to cling to the warm, fuzzy fur of its mother. Instead, babies are attached to their mothers because of the comfort of cozy touch. This is the contact comfort theory. (In addition, it provides a less cynical view of why your dog is so happy to see you!)

In the natural world, a mother provides both food and contact comfort at once, so when the baby clings to her, it is impossible to tell why. To test the alternative theories, Harlow had to separate the two influences: food and contact comfort. The only way he could do so was to create “mothers” of his own. He built two monkey foster “mothers,” the only mothers his lab-reared baby monkeys ever had. One of the mothers was made of bare wire mesh with a bottle of milk built in. This wire mother offered food, but not comfort. The other mother was covered with fuzzy terrycloth and was warmed by a lightbulb suspended inside, but she had no milk. This cloth mother offered comfort, but not food.

Note that this experiment sets up three possible outcomes. The contact comfort theory would be supported if the babies spent most of their time clinging to the cloth mother. The cupboard theory would be supported if the babies spent most of their time clinging to the wire mother. Neither theory would be supported if monkeys divided their time equally between the two mothers.

When Harlow put the baby monkeys in the cages with the two mothers, the evidence in favor of the contact comfort theory was overwhelming. Harlow’s data showed that the little monkeys would cling to the cloth mother for 12–18 hours a day (Figure 1.4). When they were hungry, they would climb down, nurse from the wire mother, and then at once go back to the warm, cozy cloth mother. In short, Harlow used the two theories to make two specific predictions about how the monkeys would interact with each mother. Then he used the data he recorded (how much time the monkeys spent on each mother) to support only one of the theories. The theory-data cycle in action!

FIGURE 1.4 The contact comfort theory. As the theory hypothesized, Harlow’s baby monkeys spent most of their time on the warm, cozy cloth mother, even though she did not provide any food.

13How Scientists Approach Their Work


A theory is a set of statements that describes general principles about how variables relate to one another. For example, Harlow’s theory, which he developed in light of extensive observations of primate babies and mothers, was about the overwhelming importance of bodily contact (as opposed to simple nourishment) in forming attachments. Contact comfort, not food, was the primary basis for a baby’s attachment to its mother. This theory led Harlow to investigate particular kinds of questions—he chose to pit contact comfort against food in his research. The theory meant that Harlow also chose not to study unrelated questions, such as the babies’ food preferences or sleeping habits.

The theory not only led to the questions; it also led to specific hypotheses about the answers. A hypothesis, or prediction, is the specific outcome the researcher expects to observe in a study if the theory is accurate. Harlow’s hypothesis related to the way the baby monkeys would interact with two kinds of mothers he created for the study. He predicted that the babies would spend more time on the cozy mother than the wire mother. Notably, a single theory can lead to a large number of hypotheses because a single study is not sufficient to test the entire theory; it is intended to test only part of it. Most researchers test their theories with a series of empirical studies, each designed to test an individual hypothesis.

Data are a set of observations. (Harlow’s data were the amount of time the baby monkeys stayed on each mother.) Depending on whether the data are consistent with hypotheses based on a theory, the data may either support or challenge the theory. Data that match the theory’s hypotheses strengthen the researcher’s confidence in the theory. When the data do not match the theory’s hypotheses, however, those results indicate that the theory needs to be revised or the research design needs to be improved. Figure 1.5 shows how these steps work as a cycle.

FIGURE 1.5 The theory-data cycle. Theory leads researchers to pose particular research questions, which lead to an appropriate research design. In the context of the design, researchers formulate hypotheses. Researchers then collect and analyze data, which feed back into the cycle. Supporting data strengthen the theory. Nonsupporting data lead to revised theories or improved research design.



In scientific practice, some theories are better than others. The best theories are supported by data from studies, are falsifiable, and are parsimonious.

Good Theories Are Supported by Data. The most important feature of a scientific theory is that it is supported by data from research studies. In this respect, the contact comfort theory of infant attachment turned out to be better than the cupboard theory because it was supported by the data. Clearly, primate babies need food, but food is not the source of their emotional attachments to their mothers. In this way, good theories, like Harlow’s, are consistent with our observations of the world. More importantly, scientists need to conduct multiple studies, using a variety of methods, to address different aspects of their theories. A theory that is supported by a large quantity and variety of evidence is a good theory.

Good Theories Are Falsifiable. A second important feature of a good scientific theory is falsifiability. A theory must lead to hypotheses that, when tested, could actually fail to support the theory. Harlow’s theory was falsifiable: If the monkeys had spent more time on the wire mother than the cloth mother, the contact comfort theory would have been shown to be incorrect. Similarly, Mrazek’s mindfulness study could have falsified the researchers’ theory: If students in the mindfulness training group had shown lower GRE scores than those in the nutrition group, their theory of mindfulness and attention would not have been supported.

In contrast, some dubious therapeutic techniques have been based on theories that are not falsifiable. Here’s an example. Some therapists practice facilitated communication (FC), believing they can help people with developmental disorders communicate by gently guiding their clients’ hands over a special keyboard. In simple but rigorous empirical tests, the facilitated messages have been shown to come from the therapist, not the client (Twachtman-Cullen, 1997). Such studies demonstrated FC to be ineffective. However, FC’s supporters don’t accept these results. The empirical method introduces skepticism, which, the supporters say, breaks down trust between the therapist and client and shows a lack of faith in people with disabilities. Therefore, these supporters hold a belief about FC that is not falsifiable. To be truly scientific, researchers must take risks, including being prepared to accept data indicating their theory is not supported. Even practitioners must be open to such risk, so they can use techniques that actually work. For another example of an unfalsifiable claim, see Figure 1.6.

FIGURE 1.6 An example of a theory that is not falsifiable. Certain people might wear a tinfoil hat, operating under the idea that the hat wards off government mental surveillance. But like most conspiracy theories, this notion of remote government mindreading is not falsifiable. If the government has been shown to read people’s minds, the theory is supported. But if there is no physical evidence, that also supports the theory because if the government does engage in such surveillance, it wouldn’t leave a detectable trace of its secret operations.


Good Theories Have Parsimony. A third important feature of a good scientific theory is that it exhibits parsimony. Theories are supposed to be simple. If two theories explain the data equally well, most scientists will opt for the simpler, more parsimonious theory.

Parsimony sets a standard for the theory-data cycle. As long as a simple theory predicts the data well, there should be no need to make the theory more complex. Harlow’s theory was parsimonious because it posed a simple explanation for infant attachment: Contact comfort drives attachment more than food does. As long as the data continue to support the simple theory, the simple theory stands. However, when the data contradict the theory, the theory has to change in order to accommodate the data. For example, over the years, psychologists have collected data showing that baby monkeys do not always form an attachment to a soft, cozy mother. If monkeys are reared in complete social isolation during their first, critical months, they seem to have problems forming attachments to anyone or anything. Thus, the contact comfort theory had to change a bit to emphasize the importance of contact comfort for attachment especially in the early months of life. The theory is slightly less parsimonious now, but it does a better job of accommodating the data.


The word prove is not used in science. Researchers never say they have proved their theories. At most, they will say that some data support or are consistent with a theory, or they might say that some data are inconsistent with or complicate a theory. But no single confirming finding can prove a theory (Figure 1.7). New information might require researchers, tomorrow or the next day, to change and improve current ideas. Similarly, a single, disconfirming finding does not lead researchers to scrap a theory entirely. The disconfirming study may itself have been designed poorly. Or perhaps the theory needs to be modified, not discarded. Rather than thinking of a theory as proved or disproved by a single study, scientists evaluate their theories based on the weight of the evidence, for and against. Harlow’s theory of attachment could not be “proved” by the single study involving wire and cloth mothers. His laboratory conducted dozens of individual studies to rule out alternative explanations and test the theory’s limits.

❮❮ For more on weight of the evidence, see Chapter 14, p. 436.

FIGURE 1.7 Scientists don’t say “prove.” When you see the word prove in a headline, be skeptical. No single study can prove a theory once and for all. A more scientifically accurate headline would be: “Study Supports the Hypothesis that Hiking Improves Mental Health.” (Source: Netburn, 2015.)


Scientists Tackle Applied and Basic Problems

The empirical method can be used for both applied and basic research questions. Applied research is done with a practical problem in mind; the researchers conduct their work in a particular real-world context. An applied research study might ask, for example, if a school district’s new method of teaching language arts is working better than the former one. It might test the efficacy of a treatment for depression in a sample of trauma survivors. Applied researchers might be looking for better ways to identify those who are likely to do well at a particular job, and so on.

Basic research, in contrast, is not intended to address a specific, practical problem; the goal is to enhance the general body of knowledge. Basic researchers might want to understand the structure of the visual system, the capacity of human memory, the motivations of a depressed person, or the limitations of the infant attachment system. Basic researchers do not just gather facts at random; in fact, the knowledge they generate may be applied to real-world issues later on.

Translational research is the use of lessons from basic research to develop and test applications to health care, psychotherapy, or other forms of treatment and intervention. Translational research represents a dynamic bridge from basic to applied research. For example, basic research on the biochemistry of cell membranes might be translated into a new drug for schizophrenia. Or basic research on how mindfulness changes people’s patterns of attention might be translated into a study skills intervention. Figure 1.8 shows the interrelationship of the three types of research.

FIGURE 1.8 Basic, applied, and translational research. Basic researchers may not have an applied context in mind, and applied researchers may be less familiar with basic theories and principles. Translational researchers attempt to translate the findings of basic research into applied areas. Example questions: Basic research: “What parts of the brain are active when experienced meditators are meditating?” Translational research: “In a laboratory study, can meditation lessons improve college students’ GRE scores?” Applied research: “Has our school’s new meditation program helped students focus longer on their math lessons?”

Scientists Dig Deeper

Psychological scientists rarely conduct a single investigation and then stop. Instead, each study leads them to ask a new question. Scientists might start with a simple effect, such as the effect of comfort on attachment, and then ask, “Why does this occur?” “When does this happen the most?” “For whom does this apply?” “What are the limits?”

Mrazek and his team did not stop after only one study of mindfulness training and GRE performance. They dug deeper. They also asked whether mindfulness training was especially helpful for people whose minds wander the most. In other studies, they investigated if mindfulness training influenced skills such as people’s insight about their own memory (Baird, Mrazek, Phillips, & Schooler, 2014). And they have contrasted mindfulness with mind-wandering, attempting to find both the benefits and the costs of mind-wandering (Baird et al., 2012). This research team has conducted many related studies of how people can and cannot control their own attention.

Scientists Make It Public: The Publication Process

When scientists want to tell the scientific world about the results of their research, they write a paper and submit it to a scientific journal. Like magazines, journals usually come out every month and contain articles written by various qualified contributors. But unlike popular newsstand magazines, the articles in a scientific journal are peer-reviewed. The journal editor sends the submission to three or four experts on the subject. The experts tell the editor about the work’s virtues and flaws, and the editor, considering these reviews, decides whether the paper deserves to be published in the journal.

The peer-review process in the field of psychology is rigorous. Peer reviewers are kept anonymous, so even if they know the author of the article professionally or personally, they can feel free to give an honest assessment of the research. They comment on how interesting the work is, how novel it is, how well the research was done, and how clear the results are. Ultimately, peer reviewers are supposed to ensure that the articles published in scientific journals contain innovative, well-done studies. When the peer-review process works, research with major flaws does not get published. However, the process continues even after a study is published. Other scientists can cite an article and do further work on the same subject. Moreover, scientists who find flaws in the research (perhaps overlooked by the peer reviewers) can publish letters, commentaries, or competing studies. Through publishing their work, scientists make the process of their research transparent, and the scientific community evaluates it.

Scientists Talk to the World: From Journal to Journalism

One goal of this textbook is to teach you how to interrogate information about psychological science that you find not only in scientific journals, but also in more mainstream sources that you encounter in daily life. Psychology’s scientific journals are read primarily by other scientists and by psychology students; the general public almost never reads them. Journalism, in contrast, includes the kinds of news and commentary that most of us read or hear on television, in magazines and newspapers, and on Internet sites: articles in Psychology Today and Men’s Health, topical blogs, relationship advice columns, and so on. These sources are usually written by journalists or laypeople, not scientists, and they are meant to reach the general public; they are easy to access, and understanding their content does not require specialized education.

How does the news media find out about the latest scientific findings? A journalist might become interested in a particular study by reading the current issue of a scientific journal or by hearing scientists talk about their work at a conference. The journalist turns the research into a news story by summarizing it for a popular audience, giving it an interesting headline, and writing about it using nontechnical terms. For example, the journal article by Mrazek and his colleagues on the effect of mindfulness on GRE scores was summarized by a journalist in the magazine Scientific American (Nicholson, 2013).


Psychologists can benefit when journalists publicize their research. By reading about psychological research in the newspaper, the general public can learn what psychologists really do. Those who read or hear the story might also pick up important tips for living: They might understand their children or themselves better; they might set different goals or change their habits. These important benefits of science writing depend on two things, however. First, journalists need to report on the most important scientific stories, and second, they must describe the research accurately.

Is the Story Important? When journalists report on a study, have they chosen research that has been conducted rigorously, that tests an important question, and that has been peer-reviewed? Or have they chosen a study simply because it is cute or eye-catching? Sometimes journalists do follow important stories, especially when covering research that has already been published in a selective, peer-reviewed journal. But sometimes journalists choose the sensational story over the important one.

For example, one spring, headlines such as “Your dog hates hugs” and “You need to stop hugging your dog, study finds” began popping up in newsfeeds. Of course, this topic is clickbait, and dozens of news outlets shocked readers and listeners with these claims. However, the original claim had been made by a psychology professor who had merely reported some data in a blog post. The study he conducted had not been peer-reviewed or published in an empirical journal. The author had simply coded some Internet photographs of people hugging their dogs; according to the author, 82% of the dogs in the sample were showing signs of stress (Coren, 2016). Journalists should not have run with this story before it had been peer-reviewed. Scientific peer reviewers might have criticized the study because it didn’t include a comparison group of photos of dogs that weren’t being hugged.


The author also left out important details, such as how the photographs were selected and whether the dogs’ behavior actually meant they were stressed. In this case, journalists were quick to publish a headline that was sensational, but not necessarily important.

Is the Story Accurate? Even when journalists report on reliable, important research, they don’t always get the story right. Some science writers do an excellent, accurate job of summarizing the research, but not all of them do (Figure 1.9). Perhaps the journalist does not have the scientific training, the motivation, or the time before deadline to understand the original science very well. Maybe the journalist dumbs down the details of a study to make it more accessible to a general audience. And sometimes a journalist wraps up the details of a study with a more dramatic headline than the research can support.

FIGURE 1.9 Getting it right. Cartoonist Jorge Cham parodies what can happen when journalists report on scientific research. Here, an original study reported a relationship between two variables. Although the University Public Relations Office relates the story accurately, the strength of the relationship and its implications become distorted with subsequent retellings, much like a game of “telephone.”


Media coverage of a phenomenon called the “Mozart effect” provides an example of how journalists might misrepresent science when they write for a popular audience (Spiegel, 2010). In 1993, researcher Frances Rauscher found that when students heard Mozart music played for 10 minutes, they performed better on a subsequent spatial intelligence test when compared with students who had listened to silence or to a monotone speaking voice (Rauscher, Shaw, & Ky, 1993). Rauscher said in a radio interview, “What we found was that the students who had listened to the Mozart sonata scored significantly higher on the spatial temporal task.” However, Rauscher added, “It’s very important to note that we did not find effects for general intelligence . . . just for this one aspect of intelligence. It’s a small gain and it doesn’t last very long” (Spiegel, 2010). But despite the careful way the scientists described their results, the media that reported on the story exaggerated its importance:

The headlines in the papers were less subtle than her findings: “Mozart makes you smart” was the general idea. . . . But worse, says Rauscher, was that her very modest finding started to be wildly distorted. “Generalizing these results to children is one of the first things that went wrong. Somehow or another the myth started exploding that children that listen to classical music from a young age will do better on the SAT, they’ll score better on intelligence tests in general, and so forth.” (Spiegel, 2010)

Perhaps because the media distorted the effects of that first study, a small industry sprang up, recording child-friendly sonatas for parents and teachers (Figure 1.10). However, according to research conducted since the first study was published, the effect of listening to Mozart on people’s intelligence test scores is not very strong, and it applies to most music, not just Mozart (Pietschnig, Voracek, & Formann, 2010).

FIGURE 1.10 The Mozart effect. Journalists sometimes misrepresent research findings. Exaggerated reports of the Mozart effect even inspired a line of consumer products for children.

The journalist Ben Goldacre (2011) catalogs examples of how journalists and the general public misinterpret scientific data when they write about it for a popular audience. Some journalists create dramatic stories about employment statistics that show, for example, a 0.9% increase in unemployment claims. Journalists may conclude that these small increases show an upward trend when, in fact, they may simply reflect sampling error. Another example comes from a happiness survey of 5,000 people in the United Kingdom. Local journalists picked up on tiny city-to-city differences, creating headlines about, for instance, how the city of Edinburgh is the “most miserable place in the country.” But the differences the survey found between the various places were not statistically significant (Goldacre, 2008). Even though there were slight differences in happiness from Edinburgh to London, the differences were small enough to be caused by random variation. The researcher who conducted the study said, “I tried to explain issues of [statistical] significance to the journalists who interviewed me. Most did not want to know” (Goldacre, 2008).
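The idea that small city-to-city differences can arise from random variation alone can be illustrated with a short simulation. The following is only a minimal sketch, not the analysis used by Goldacre or by the survey's researcher. It assumes 10 hypothetical cities with 500 respondents each (chosen only so the total matches the survey's 5,000 people), and happiness ratings drawn from a normal distribution with the same true mean (7) and standard deviation (2) in every city. These numbers and the function name `simulate_city_means` are illustrative assumptions.

```python
import random
import statistics


def simulate_city_means(n_cities=10, n_per_city=500,
                        true_mean=7.0, sd=2.0, seed=1):
    """Return one sample mean per city.

    Every city shares the SAME true mean (an assumption of this sketch),
    so any difference between the returned means is sampling error.
    """
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    means = []
    for _ in range(n_cities):
        ratings = [rng.gauss(true_mean, sd) for _ in range(n_per_city)]
        means.append(statistics.mean(ratings))
    return means


means = simulate_city_means()

# Gap between the "happiest" and "most miserable" simulated city.
gap = max(means) - min(means)

# Standard error of a single city's mean: sd / sqrt(n).
standard_error = 2.0 / 500 ** 0.5

print("City sample means:", [round(m, 2) for m in means])
print("Largest gap between cities:", round(gap, 2))
print("Standard error of one city's mean:", round(standard_error, 2))
```

Even though every simulated city has the same true level of happiness, the sample means are not identical, so one city always ends up ranked last. A headline built on that ranking would be reporting sampling error, not a real difference.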

How can you prevent being misled by a journalist’s coverage of science? One idea is to find the original source, which you’ll learn to do in Chapter 2. Reading the original scientific journal article is the best way to get the full story. Another approach is to maintain a skeptical mindset when it comes to popular sources. Chapter 3 explains how to ask the right questions before you allow yourself to accept the journalist’s claim.

1. See the discussion of Harlow’s monkey experiment on p. 13. 2. See p. 16. 3. See p. 15. 4. See p. 17. 5. See pp. 18–21.


1. What happens to a theory when the data do not support the theory’s hypotheses? What happens to a theory when the data do support the theory’s hypotheses?

2. Explain the difference between basic research and applied research, and describe how the two interact.

3. Why can’t theories be proved in science?

4. When scientists publish their data, what are the benefits?

5. Describe two ways journalists might distort the science they attempt to publicize.

❮❮ To learn about sampling error, see Chapter 7, pp. 196–197.


Research Producers, Research Consumers

• Some students need skills as producers of research; they develop the ability to work in research laboratories and make new discoveries.

• Some students need skills as consumers of research; they need to be able to find, read, and evaluate the research behind important policies, therapies, and workplace decisions.

• Having good consumer-of-research skills means being able to evaluate the evidence behind the claims of a salesperson, journalist, or researcher, and making better, more informed decisions by asking the right questions.

How Scientists Approach Their Work

• As scientists, psychologists are empiricists; they base their conclusions on systematic, unbiased observations of the world.

• Using the theory-data cycle, researchers propose theories, make hypotheses (predictions), and collect data. A good scientific theory is supported by data, is falsifiable, and is parsimonious. A researcher might say that a theory is well supported or well established, rather than proved, meaning that most of the data have confirmed the theory and very little data have disconfirmed it.

• Applied researchers address real-world problems, and basic researchers work for general understanding. Translational researchers attempt to translate the findings of basic research into applied areas.

• Scientists usually follow up an initial study with more questions about why, when, and for whom a phenomenon occurs.

• The publication process is part of worldwide scientific communication. Scientists publish their research in journals, following a peer-review process that leads to sharper thinking and improved communication. Even after publication, published work can be approved or criticized by the scientific community.

• Journalists are writers for the popular media who are skilled at transforming scientific studies for the general public, but they don’t always get it right. Think critically about what you read online, and when in doubt, go directly to the original source—peer-reviewed research.

Summary Thinking like a psychologist means thinking like a scientist, and thinking like a scientist involves thinking about the empirical basis for what we believe.


Key Terms

evidence-based treatment, p. 8
empiricism, p. 10
theory, p. 13
hypothesis, p. 13
data, p. 13
falsifiability, p. 14
parsimony, p. 15
weight of the evidence, p. 15
applied research, p. 16
basic research, p. 16
translational research, p. 16
journal, p. 17
journalism, p. 18

23Learning Actively

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 1.

Review Questions

1. Which of the following jobs most likely involves producer-of-research skills rather than consumer-of-research skills?

a. Police officer

b. University professor

c. Physician

d. Journalist

2. To be an empiricist, one should:

a. Base one’s conclusions on direct observations.

b. Strive for parsimony.

c. Be sure that one’s research can be applied in a real-world setting.

d. Discuss one’s ideas in a public setting, such as on social media.

3. A statement, or set of statements, that describes general principles about how variables relate to one another is a(n) ______.

a. prediction

b. hypothesis

c. empirical observation

d. theory

4. Why is publication an important part of the empirical method?

a. Because publication enables practitioners to read the research and use it in applied settings.

b. Because publication contributes to making empirical observations independently verifiable.

c. Because journalists can make the knowledge available to the general public.

d. Because publication is the first step of the theory-data cycle.

5. Which of the following research questions best illustrates an example of basic research?

a. Has our company’s new marketing campaign led to an increase in sales?

b. How satisfied are our patients with the sensitivity of the nursing staff?

c. Does wearing kinesio-tape reduce joint pain?

d. Can 2-month-old human infants tell the difference between four objects and six objects?

Learning Actively

1. To learn more about the theory-data cycle, look in the textbooks from your other psychology courses for examples of theories. In your introductory psychology book, you might look up the James-Lange theory or the Cannon-Bard theory of emotion. You could look up Piaget’s theory of cognitive development, the Young-Helmholtz theory of color vision, or the stage theory of memory. How do the data presented in your textbook show support for the theory? Does the textbook present any data that do not support the theory?

2. Go to an online news website and find a headline that is reporting the results of a recently published study. Read the story, and ask: Has the research in the story been published yet? Does the journalist mention the name of a journal in which the results appeared? Or has the study only been presented at a research conference? Then, use the Internet to find examples of how other journalists have covered the same story. What variation do you notice in their stories?

3. See what you can find online that has been written about the Mozart effect, about whether people should hug their dogs, or whether people should begin a mindfulness practice in their lives. Does the source you found discuss research evidence? Does the source provide the names of scientists and the journals in which data have been published? On the downside, does the coverage suggest that you purchase a product or that science has “proved” the effectiveness of a certain behavior or technique?

Houston’s “Rage Room” a Smash as Economy Struggles The Guardian, 2016

Six Great Ways to Vent Your Frustrations, n.d.


Sources of Information: Why Research Is Best and How to Find It

Have you ever looked online for a stress-relief technique? You might have found aggressive games such as Kick the Buddy or downloaded an app such as Vent. Maybe you’ve considered a for-profit “rage room” that lets you destroy plates, computers, or teddy bears. Perhaps a friend has suggested posting your complaints publicly and anonymously on Yik Yak. But does venting anger really make people feel better? Does expressing aggression make aggression go away?

Many sources of information promote the idea that venting your frustrations works. You might try one of the venting apps yourself and feel good while you’re using it. Or you may hear from guidance counselors, friends, or online sources that venting negative feelings is a healthy way to manage anger. But is it accurate to base your conclusions on what authorities—even well-meaning ones—say? Should you believe what everyone else believes? Does it make sense to base your convictions on your own personal experience?

This chapter discusses three sources of evidence for people’s beliefs—experience, intuition, and authority—and compares them to a superior source of evidence: empirical research. We will focus on evaluating a particular type of response to the question about handling anger: the idea of cathartically releasing bottled-up tension by hitting a punching bag, screaming, or expressing your emotions (Figure 2.1). Is catharsis a healthy way to deal with feelings of anger and frustration? How could you find credible research on this subject if you wanted to read about it? And why should you trust the conclusions of researchers instead of those based on your own experience or intuition?

A year from now, you should still be able to:

1. Explain why all scientists, including psychologists, value research-based conclusions over beliefs based on experience, intuition, or authority.

2. Locate research-based information, and read it with a purpose.

THE RESEARCH VS. YOUR EXPERIENCE

When we need to decide what to believe, our own experiences are powerful sources of information. “I’ve used tanning beds for 10 years. No skin cancer yet!” “My knee doesn’t give out as much when I use kinesio-tape.” “When I’m mad, I feel so much better after I vent my feelings online.” Often, too, we base our opinions on the experiences of friends and family. For instance, suppose you’re considering buying a new car. You want the most reliable one, so after consulting Consumer Reports, you decide on a Honda Fit, a top-rated car based on its objective road testing and a survey of 1,000 Fit owners. But then you hear about your cousin’s Honda Fit, which is always in the shop. Why shouldn’t you trust your own experience—or that of someone you know and trust—as a source of information?

Experience Has No Comparison Group

There are many reasons not to base beliefs solely on personal experience, but perhaps the most important is that when we do so, we usually don’t take a comparison group into account. Research, by contrast, asks the critical question: Compared to what? A comparison group enables us to compare what would happen both with and without the thing we are interested in—both with and without tanning beds, online games, or kinesio-tape (Figure 2.2).

Here’s a troubling example of why a comparison group is so important. Centuries ago, Dr. Benjamin Rush drained blood from people’s wrists or ankles as part of a “bleeding,” or bloodletting, cure for illness (Eisenberg, 1977). The practice emerged from the belief that too much blood was the cause of illness. To restore an “appropriate” balance, a doctor might remove up to 100 ounces of blood from a patient over the course of a week. Of course, we now know that draining blood is one of the last things a doctor would want to do to a sick patient. Why did Dr. Rush, one of the most respected physicians of his time, keep on using such a practice? Why did he believe bloodletting was a cure?

FIGURE 2.2 Your own experience. You may think you feel better when you wear kinesio-tape. But does placing stretchy tape on your body really reduce pain, prevent injury, or improve performance?

FIGURE 2.1 Anger management. Some people believe that venting physically or emotionally is the best way to work through anger. But what does the research suggest?


In those days, a doctor who used the bleeding cure would have noticed that some of his patients recovered and some died; it was the doctor’s personal experience. Every patient’s recovery from yellow fever after bloodletting seemed to support Rush’s theory that the treatment worked. But Dr. Rush never set up a systematic comparison because doctors in the 1700s were not collecting data on their treatments. To test the bleeding cure, doctors would have had to systematically count death rates among patients who were bled versus those who received some comparison treatment (or no treatment). How many people were bled and how many were not? Of each group, how many died and how many recovered? Putting all the records together, the doctors could have come to an empirically derived conclusion about the effectiveness of bloodletting.

Suppose, for example, Dr. Rush had kept records and found that 20 patients who were bled recovered, and 10 patients who refused the bleeding treatment recovered. At first, it might look like the bleeding cure worked; after all, twice as many bled patients as untreated patients improved. But you need to know all the numbers—the number of bled patients who died and the number of untreated patients who died, in addition to the number of patients in each group who recovered. Tables 2.1, 2.2, and 2.3 illustrate how we need all the data to draw the correct conclusion. In the first example (Table 2.1), there is no relationship at all between treatment and improvement. Although twice as many bled patients as untreated patients recovered, twice as many bled patients as untreated patients died, too. If you calculate the percentages, the recovery rate among people who were bled was 20%, and the recovery rate among people who were not treated was also 20%: The proportions are identical. (Remember, these data were invented for purposes of illustration.)

TABLE 2.1

Baseline Comparisons
At first, it looks like more patients who were bled survived (20 vs. 10), but when we divide by the total numbers of patients, survival rates were the same.

                                                          Bled      Not bled
Number of patients who recovered                          20        10
Number of patients who died                               80        40
(Number recovered divided by total number of patients)    20/100    10/50
Percentage recovered                                      20%       20%

TABLE 2.2

One Value Decreased
If we change the value in one cell (marked *), survival rates change, and the bleeding cure is very ineffective.

                                                          Bled      Not bled
Number of patients who recovered                          20        10
Number of patients who died                               80        1*
(Number recovered divided by total number of patients)    20/100    10/11
Percentage recovered                                      20%       91%

TABLE 2.3

One Value Increased
If we change the value in the same cell (marked *), now the bleeding cure looks effective.

                                                          Bled      Not bled
Number of patients who recovered                          20        10
Number of patients who died                               80        490*
(Number recovered divided by total number of patients)    20/100    10/500
Percentage recovered                                      20%       2%


To reach the correct conclusion, we need to know all the values, including the number of untreated patients who died. Table 2.2 shows an example of what might happen if the value in only that cell changes. In this case, the number of untreated patients who died is much lower, so the treatment is shown to have a negative effect. Only 20% of the treated patients recovered, compared with 91% of the untreated patients. In contrast, if the number in the fourth cell were increased drastically, as in Table 2.3, the treatment would be shown to have a positive effect. The recovery rate among bled patients is still 20%, but the recovery rate among untreated patients is a mere 2%.

Notice that in all three tables, changing only one value leads to dramatically different results. Drawing conclusions about a treatment—bloodletting, ways of venting anger, or using stretchy tape—requires comparing data systematically from all four cells: the treated/improved cell, the treated/unimproved cell, the untreated/improved cell, and the untreated/unimproved cell. These comparison cells show the relative rate of improvement when using the treatment, compared with no treatment.
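The four-cell comparison can also be worked through in a few lines of code. What follows is only an illustrative sketch in Python, using the invented counts from Tables 2.1, 2.2, and 2.3. The names Group, recovery_rate, and compare are hypothetical helpers introduced here for illustration; they are not part of any study or software described in this chapter.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """One column of the table: counts for bled or untreated patients."""
    recovered: int
    died: int

    @property
    def recovery_rate(self) -> float:
        # Number recovered divided by total number of patients in the group.
        return self.recovered / (self.recovered + self.died)


def compare(bled: Group, untreated: Group) -> str:
    """Describe what the four cells, taken together, suggest."""
    diff = bled.recovery_rate - untreated.recovery_rate
    if abs(diff) < 1e-9:
        return "no difference"
    return "bleeding looks helpful" if diff > 0 else "bleeding looks harmful"


# Invented illustrative counts; only the untreated "died" cell changes.
tables = {
    "Table 2.1": (Group(20, 80), Group(10, 40)),
    "Table 2.2": (Group(20, 80), Group(10, 1)),
    "Table 2.3": (Group(20, 80), Group(10, 490)),
}

for name, (bled, untreated) in tables.items():
    print(
        f"{name}: bled {bled.recovery_rate:.0%}, "
        f"untreated {untreated.recovery_rate:.0%} -> {compare(bled, untreated)}"
    )
```

The point of the sketch is that the "20 recovered vs. 10 recovered" comparison never changes, yet the conclusion flips depending on a cell that personal experience rarely records.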

Because Dr. Rush bled every patient, he never had the chance to see how many would recover without the bleeding treatment (Figure 2.3). Similarly, when you rely on personal experience to decide what is true, you usually don’t have a systematic comparison group because you’re observing only one “patient”: yourself. The tape you’ve been using may seem to be working, but what would have happened to your knee pain without it? Maybe it would have felt fine anyway. Or perhaps you try an online brain-training course and get higher grades later that semester. But what kind of grades would you have gotten if you hadn’t taken the course? Or you might think using the Kick the Buddy game makes you feel better when you’re angry, but would you have felt better anyway, even if you had played a nonviolent game? What if you had done nothing and just let a little time pass?

Basing conclusions on personal experience is problematic because daily life usually doesn’t include comparison experiences. In contrast, basing conclusions on systematic data collection has the simple but tremendous advantage of providing a comparison group. Only a systematic comparison can show you whether your knee improves when you use a special tape (compared with when you do not), or whether your anger goes away when you play a violent online game (compared with doing nothing).

FIGURE 2.3 Bloodletting in the eighteenth century. Describe how Dr. Rush’s faulty attention to information led him to believe the bleeding treatment was effective.


Experience Is Confounded

Another problem with basing conclusions on personal experience is that in everyday life, too much is going on at once. Even if a change has occurred, we often can’t be sure what caused it. When a patient treated by Dr. Rush got better, that patient might also have been stronger to begin with, or may have been eating special foods or drinking more fluids. Which one caused the improvement? When you notice a difference in your knee pain after using kinesio-tape, maybe you also took it easy that day or used a pain reliever. Which one caused your knee pain to improve? If you play Kick the Buddy, it provides violent content, but you might also be distracting yourself or increasing your heart rate. Is it these factors, or the game’s violence that causes you to feel better after playing it?

In real-world situations, there are several possible explanations for an outcome. In research, these alternative explanations are called confounds. Confounded can also mean confused. Essentially, a confound occurs when you think one thing caused an outcome but in fact other things changed, too, so you are confused about what the cause really was. You might think online brain-training exercises are making your grades better than last year, but because you were also taking different classes and have gained experience as a student, you can’t determine which of these factors (or combination of factors) caused the improvement.

What can we do about confounds like these? For a personal experience, it is hard to isolate variables. Think about the last time you had an upset stomach. Which of the many things you ate that day made you sick? Or your allergies—which of the blossoming spring plants are you allergic to? In a research setting, though, scientists can use careful controls to be sure they are changing only one factor at a time.

Research Is Better Than Experience

What happens when scientists set up a systematic comparison that controls for potential confounds? For example, by using controlled, systematic comparisons, several groups of researchers have tested the hypothesis that venting anger is beneficial (e.g., Berkowitz, 1973; Bushman, Baumeister, & Phillips, 2001; Feshbach, 1956; Lohr, Olatunji, Baumeister, & Bushman, 2007). One such study was conducted by researcher Brad Bushman (2002). To examine the effect of venting, or catharsis, Bushman systematically compared the responses of angry people who were allowed to vent their anger with the responses of those who did not vent their anger.

❮❮ For more on confounds and how to avoid them in research designs, see Chapter 10, pp. 281–286.

First, Bushman needed to make people angry. He invited 600 undergraduates to arrive, one by one, to a laboratory setting, where each student wrote a political essay. Next, each essay was shown to another person, called Steve, who was actually a confederate, an actor playing a specific role for the experimenter. Steve insulted the writer by criticizing the essay, calling it “the worst essay I’ve ever read,” among other unflattering comments. (Bushman knew this technique made students angry because he had used it in previous studies, in which students whose essays were criticized reported feeling angrier than those whose essays were not criticized.)

Bushman then randomly divided the angry students into three groups, to systematically compare the effects of venting and not venting anger. Group 1 was instructed to sit quietly in the room for 2 minutes. Group 2 was instructed to punch a punching bag for 2 minutes, having been told it was a form of exercise. Group 3 was instructed to punch a punching bag for 2 minutes while imagining Steve’s face on it. (This was the important catharsis group.) Finally, all three groups of students were given a chance to get back at Steve. In the course of playing a quiz game with him, students had the chance to blast Steve’s ears with a loud noise. (Because Steve was a confederate, he didn’t actually hear the noises, but the students thought he did.)

Which group gave Steve the loudest, longest blasts of noise? The catharsis hypothesis predicts that Group 3 should have calmed down the most, and as a result, this group should not have blasted Steve with very much noise. This group, however, gave Steve the loudest noise blasts of all! Compared with the other two groups, those who vented their anger at Steve through the punching bag continued to punish him when they had the chance. In contrast, Group 2, those who hit the punching bag for exercise, subjected him to less noise (not as loud or as long). Those who sat quietly for 2 minutes punished Steve the least of all. So much for the catharsis hypothesis. When the researchers set up the comparison groups, they found the opposite result: People’s anger subsided more quickly when they sat in a room quietly than if they tried to vent it. Figure 2.4 shows the study results in graph form.

Notice the power of systematic comparison here. In a controlled study, researchers can set up the conditions to include at least one comparison group. Contrast the researcher’s larger view with the more subjective view, in which each person consults only his or her own experience. For example, if you had asked some of the students in the catharsis group whether using the punching bag helped their anger subside, they could only consider their own, idiosyncratic experiences. When Bushman looked at the pattern overall—taking into account all three groups—the results indicated that the catharsis group still felt the angriest. The researcher thus has a privileged view—the view from the outside, including all possible comparison groups. In contrast, when you are the one acting in the situation, yours is a view from the inside, and you only see one possible condition.

[Graph. Vertical axis: Subsequent aggression to partner (z score), from –0.25 (less than average) to 0.25 (more than average). Horizontal axis: Group 1, sit quietly; Group 2, punching bag (exercise); Group 3, punching bag (Steve’s face).]

FIGURE 2.4 Results from controlled research on the catharsis hypothesis. In this study, after Steve (the confederate) insulted all the students in three groups by criticizing their essays, those in Group 1 sat quietly for 2 minutes, Group 2 hit a punching bag while thinking about exercise, and Group 3 hit a punching bag while imagining Steve’s face on it. Later, students in all three groups had the chance to blast Steve with loud noise. (Source: Adapted from Bushman, 2002, Table 1.)

Researchers can also control for potential confounds. In Bushman’s study, all three groups felt equally angry at first. Bushman even separated the effects of aggression only (using the punching bag for exercise) from the effects of aggression toward the person who made the participant mad (using the punching bag as a stand-in for Steve). In real life, these two effects—exercise and the venting of anger—would usually occur at the same time.

Bushman’s study is, of course, only one study on catharsis, and scientists always dig deeper. In other studies, researchers have made people angry, presented them with an opportunity to vent their anger (or not), and then watched their behavior. Research results have repeatedly indicated that people who physically express their anger at a target actually become more angry than when they started. Thus, practicing aggression only seems to teach people how to be aggressive (Berkowitz, 1973; Bushman et al., 2001; Feshbach, 1956; Geen & Quanty, 1977; Lohr et al., 2007; Tavris, 1989).

The important point is that the results of a single study, such as Bushman’s, are certainly better evidence than experience. In addition, consistent results from several similar studies mean that scientists can be confident in the findings. As more and more studies amass evidence on the subject, theories about how people can effectively regulate their anger gain increasing support. Finally, psychologist Todd Kashdan applied this research when he was interviewed for a story about the “rage room” concept, in which people pay to smash objects. He advised the journalist that “it just increases your arousal and thus makes you even more angry. What you really need is to reduce or learn to better manage that arousal” (Dart, 2016).

Research Is Probabilistic

Although research is usually more accurate than individual experience, sometimes our personal stories contradict the research results. Personal experience is powerful, and we often let a single experience distract us from the lessons of more rigorous research. Should you disagree with the results of a study when your own experience is different? Should you continue to play online games when you’re angry because you believe they work for you? Should you disregard Consumer Reports because your cousin had a terrible experience with her Honda Fit?

❮❮ For more on the value of conducting multiple studies, see Chapter 14, pp. 425–433.

At times, your experience (or your cousin’s) may be an exception to what the research finds. In such cases, you may be tempted to conclude: The research must be wrong. However, behavioral research is probabilistic, which means that its findings are not expected to explain all cases all of the time. Instead, the conclusions of research are meant to explain a certain proportion (preferably a high proportion) of the possible cases. In practice, this means scientific conclusions are based on patterns that emerge only when researchers set up comparison groups and test many people. Your own experience is only one point in that overall pattern. Thus, for instance, even though bloodletting does not cure illness, some sick patients did recover after being bled. Those exceptional patients who recovered do not change the conclusion derived from all of the data. And even though your cousin’s Honda needed a lot of repairs, her case is only one out of 1,001 Fit owners, so it doesn’t invalidate the general trend. Similarly, just because there is a strong general trend (that Honda Fits are reliable), it doesn’t mean your Honda will be reliable too. The research may suggest there is a strong probability your Honda will be reliable, but the prediction is not perfect.


1. What are two general problems with basing beliefs on experience? How does empirical research work to correct these problems?

2. What does it mean to say that research is probabilistic?

1. See pp. 26–31. 2. See pp. 31–32.

THE RESEARCH VS. YOUR INTUITION

Personal experience is one way we might reach a conclusion. Another is intuition—using our hunches about what seems “natural,” or attempting to think about things “logically.” While we may believe our intuition is a good source of information, it can lead us to make less effective decisions.

Ways That Intuition Is Biased

Humans are not scientific thinkers. We might be aware of our potential to be biased, but we often are too busy, or not motivated enough, to correct and control for these biases. What’s worse, most of us think we aren’t biased at all! Fortunately, the formal processes of scientific research help prevent these biases from affecting our decisions. Here are five examples of biased reasoning.


One example of a bias in our thinking is accepting a conclusion just because it makes sense or feels natural. We tend to believe good stories—even ones that are false. For example, to many people, bottling up negative emotions seems unhealthy, and expressing anger is sensible. As with a pimple or a boiling kettle of water, it might seem better to release the pressure. One of the early proponents of catharsis was the neurologist Sigmund Freud, whose models of mental distress focused on the harmful effects of suppressing one’s feelings and the benefits of expressing them. Some biographers have speculated that Freud’s ideas were influenced by the industrial technology of his day (Gay, 1989). Back then, engines used the power of steam to create vast amounts of energy. If the steam was too compressed, it could have devastating effects on a machine. Freud seems to have reasoned that the human psyche functions the same way. Catharsis makes a good story, because it draws on a metaphor (pressure) that is familiar to most people.

The Scared Straight program is another commonsense story that turned out to be wrong. As you read in Chapter 1, such programs propose that when teenagers susceptible to criminal activity hear about the difficulties of prison from actual inmates, they will be scared away from committing crimes in the future. It certainly makes sense that impressionable young people would be frightened and deterred by such stories. However, research has consistently found that Scared Straight programs are ineffective; in fact, they sometimes even cause more crime. The intuitive appeal of such programs is strong (which accounts for why many communities still invest in them), but the research warns against them. One psychologist estimated that the widespread use of the program in New Jersey might have “caused 6,500 kids to commit crimes they otherwise would not have committed” (Wilson, 2011, p. 138). Faulty intuition can even be harmful.

Sometimes a good story will turn out to be accurate, of course, but it’s important to be aware of the limitations of intuition. When empirical evidence contradicts what your common sense tells you, be ready to adjust your beliefs on the basis of the research. Automatically believing a story that may seem to make sense can lead you astray.


Another bias in thinking is the availability heuristic, which states that things that pop up easily in our mind tend to guide our thinking (Tversky & Kahneman, 1974). When events or memories are vivid, recent, or memorable, they come to mind more easily, leading us to overestimate how often things happen.

Here’s a scary headline: “Woman dies in Australian shark attack.” Dramatic news like this might prompt us to change our vacation plans. If we rely on our intuition, we might think shark attacks are truly common. However, a closer look at the frequency of reported shark attacks reveals that they are incredibly rare. Being killed by a shark (1 in 3.7 million) is less likely than dying from the flu (1 in 63) or in a bathtub (1 in 800,000; Ropeik, 2010).
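To see how far apart these risks are, the odds quoted above (Ropeik, 2010) can be put on a common scale. This is a rough sketch in Python; the dictionary and variable names are made up for illustration, and the calculation simply divides one "1 in N" figure by another.

```python
# "1 in N" odds of dying from each cause, as quoted in the text (Ropeik, 2010).
one_in = {
    "shark attack": 3_700_000,
    "bathtub": 800_000,
    "flu": 63,
}

# How many times more likely each cause is than a fatal shark attack.
relative_to_shark = {
    cause: one_in["shark attack"] / n for cause, n in one_in.items()
}

for cause, ratio in relative_to_shark.items():
    print(f"{cause}: about {ratio:,.1f} times as likely as a fatal shark attack")
```

By these figures, dying from the flu is tens of thousands of times more likely than dying in a shark attack, even though the shark attack is the one that makes headlines.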

Why do people make this mistake? Death by shark attack is certainly more memorable and vivid than getting the flu or taking a bath, so people talk about it more. It comes to mind easily, and we inflate the associated risk. In contrast, more common methods of dying don’t get much press. Nevertheless, we are too busy (or too lazy) to think beyond the easy answer. We decide the answer that comes to mind easily must be the correct one. We avoid swimming in the ocean, but neglect to get the flu vaccine.

The availability heuristic might lead us to wrongly estimate the number of something or how often something happens. For example, if you visited my campus, you might see some women wearing a headcovering (hijab), and conclude there are lots of Muslim women here. The availability heuristic could lead you to overestimate, simply because Muslim women stand out visually. People who practice many other religions do not stand out, so you may underestimate their frequency.

Our attention can be inordinately drawn to certain instances, leading to overestimation. A professor may complain that “everybody” uses a cell phone during his class, when in fact only one or two students do so; it’s just that their annoying behavior stands out. You might overestimate how often your kid sister leaves her bike out in the rain, only because it’s harder to notice the times she put it away. When driving, you may complain that you always hit the red lights, only because you spend more time at them; you don’t notice the green lights you breeze through. What comes to mind easily can bias our conclusions about how often things happen (Figure 2.5).


The availability heuristic leads us to overestimate events, such as how frequently people encounter red lights or die in shark attacks. A related problem prevents us from seeing the relationship between an event and its outcome. When deciding if there’s a pattern, for example, between bleeding a patient and the patient’s recovery, or between using kinesio-tape and feeling better, people forget to seek out the information that isn’t there.

In the story “Silver Blaze,” the fictional detective Sherlock Holmes investigates the theft of a prize racehorse. The horse was stolen at night while two stable hands and their dog slept, undisturbed, nearby. Holmes reflects on the dog’s “curious” behavior that night. When the other inspectors protest that “the dog did nothing in the night-time,” Holmes replies, “That was the curious incident.” Because the dog did not bark, Holmes deduces that the horse was stolen by someone familiar to the dog at the stable (Doyle, 1892/2002, p. 149; see Gilbert, 2005). Holmes solves the crime because he notices the absence of something.

FIGURE 2.5 The availability heuristic. Look quickly: Which color candy is most common in this bowl? You might have guessed yellow, red, or orange, because these colors are easier to see—an example of the availability heuristic. Blue is the most prevalent, but it doesn’t stand out in this context.

When testing relationships, we often fail to look for absences; in contrast, it is easy to notice what is present. This tendency, referred to as the present/present bias, is a name for our failure to consider appropriate comparison groups (discussed earlier). Dr. Rush may have fallen prey to the present/present bias when he was observing the effects of bloodletting on his patients. He focused on patients who did receive the treatment and did recover (the first cell in Table 2.1 where bleeding treatment was “present” and the recovery was also “present”). He did not fully account for the untreated patients or those who did not recover (the other three cells back in Table 2.1 in which treatment was “absent” or recovery was “absent”).

Did you ever find yourself thinking about a friend and then get a text message or phone call from him? “I must be psychic!” you think. No; it’s just the present/present bias in action. You noticed the times when your thoughts coincided with a text message and concluded there was a psychic relationship. But you forgot to consider all the times you thought of people who didn’t subsequently text you or the times when people texted you when you weren’t thinking about them.

In the context of managing anger, the present/present bias means we will easily notice the times we did express frustration at the gym, at the dog, or in an e-mail, and subsequently felt better. In other words, we notice the times when both the treatment (venting) and the desired outcome (feeling better) are present but are less likely to notice the times when we didn’t express our anger and just felt better anyway; in other words, the treatment was absent but the outcome was still present (Table 2.4). When thinking intuitively, we tend to focus only on experiences that fall in the present/present cell, the instances in which catharsis seemed to work. But if we think harder and look at the whole picture, we would conclude catharsis doesn’t work well at all.

The availability heuristic plays a role in the present/present bias because instances in the “present/present” cell of a comparison stand out. But the present/present bias adds the tendency to ignore “absent” cells, which are essential for testing relationships. To avoid the present/present bias, scientists train themselves always to ask: Compared to what?

TABLE 2.4

The Present/Present Bias

                                 Treatment present        Treatment absent
                                 (expressed anger)        (did not express anger)
Felt better (outcome present)    5 Present/present        10 Absent/present
Felt worse (outcome absent)      10 Present/absent        5 Absent/absent

Note: The number in each cell represents the number of times the two events coincided. We are more likely to focus on the times when two factors were both present or two events occurred at the same time (the present/present cell), rather than on the full pattern of our experiences.
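The counts in Table 2.4 can be turned into rates to show what the full pattern says. The snippet below is a minimal sketch in Python; the labels "vented" and "did not vent" and the helper rate_felt_better are hypothetical names chosen for illustration.

```python
# Cell counts from Table 2.4: (treatment, outcome) -> number of occasions.
counts = {
    ("vented", "felt better"): 5,         # present/present
    ("vented", "felt worse"): 10,         # present/absent
    ("did not vent", "felt better"): 10,  # absent/present
    ("did not vent", "felt worse"): 5,    # absent/absent
}


def rate_felt_better(treatment: str) -> float:
    """Share of occasions with this treatment that ended in feeling better."""
    better = counts[(treatment, "felt better")]
    worse = counts[(treatment, "felt worse")]
    return better / (better + worse)


vented = rate_felt_better("vented")
not_vented = rate_felt_better("did not vent")

print(f"Felt better after venting:     {vented:.0%}")
print(f"Felt better without venting:   {not_vented:.0%}")
```

Looking only at the present/present cell, venting seems to have "worked" five times. Once the other three cells are counted, feeling better turns out to be more common on the occasions without venting.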



During an election season, you might check opinion polls for your favorite candidate. What if your candidate lags behind in the first opinion poll you see? If you’re like most people, you will keep looking until you find a poll in which your candidate has the edge (Wolfers, 2014).

The tendency to look only at information that agrees with what we already believe is called the confirmation bias. We “cherry-pick” the information we take in—seeking and accepting only the evidence that supports what we already think. A lyric by the songwriter Paul Simon captures this well: “A man sees what he wants to see and disregards the rest.”

One study specifically showed how people select only their preferred evidence. The participants took an IQ test and then were told their IQ was either high or low. Shortly afterward, they all had a chance to look at some magazine articles about IQ tests. Those who were told their IQ was low spent more time looking at articles that criticized the validity of IQ tests, whereas those who were told their IQ was high spent more time looking at articles that supported IQ tests as valid measures of intelligence (Frey & Stahlberg, 1986). They all wanted to think they were smart, so they analyzed the available information in biased ways that supported this belief. People keep their beliefs intact (in this case, the belief that they are smart) by selecting only the kinds of evidence they want to see.

One way we enact the confirmation bias is by asking questions that are likely to give the desired or expected answers. Take, for example, a study in which the researchers asked students to interview fellow undergraduates (Snyder & Swann, 1978). Half the students were given the goal of deciding whether their target person was extroverted, and the other half were given the goal of deciding whether their target person was introverted.

Before the interview, the students selected their interview questions from a prepared list. As it turned out, when the students were trying to find out whether their target was extroverted, they chose questions such as “What would you do if you wanted to liven things up at a party?” and “What kind of situations do you seek out if you want to meet new people?” You can see the problem: Even introverts will look like extroverts when they answer questions like these. The students were asking questions that would tend to confirm that their targets were extroverted. The same thing happened with the students who were trying to find out if their target was introverted. They chose questions such as “In what situations do you wish you could be more outgoing?” and “What factors make it hard for you to really open up to people?” Again, in responding to these questions, wouldn’t just about anybody seem introverted? Later, when the students asked these questions of real people, the targets gave answers that supported the expectations. The researchers asked some judges to listen in on what the targets said during the interviews. Regardless of their personality, the targets who were being tested for extroversion acted extroverted, and the targets who were being tested for introversion acted introverted.

Unlike the hypothesis-testing process in the theory-data cycle (see Chapter 1), confirmation bias operates in a way that is decidedly not scientific. If interviewers were testing the hypothesis that their target was an extrovert, they asked the questions that would confirm that hypothesis and did not ask questions that might disconfirm that hypothesis. Indeed, even though the students could have chosen neutral questions (such as “What do you think the good and bad points of acting friendly and open are?”), they hardly ever did. In follow-up studies, Snyder and Swann found that student interviewers chose hypothesis-confirming questions even if they were offered a big cash prize for being the most objective interviewer, suggesting that even when people are trying to be accurate, they cannot always be.

Without scientific training, we are not very rigorous in gathering evidence to test our ideas. Psychological research has repeatedly found that when people are asked to test a hypothesis, they tend to seek the evidence that supports their expectations (Copeland & Snyder, 1995; Klayman & Ha, 1987; Snyder & Campbell, 1980; Snyder & White, 1981). As a result, people tend to gather only a certain kind of information, and then they conclude that their beliefs are supported. This bias is one reason clinical psychologists and other therapists are required to get a research methods education (Figure 2.6).


Even though we read about the biased ways people think (such as in a research methods textbook like this one), we nevertheless conclude that those biases do not apply to us. We have what’s called a bias blind spot, the belief that we are unlikely to fall prey to the other biases previously described (Pronin, Gilovich, & Ross, 2004; Pronin, Lin, & Ross, 2002). Most of us think we are less biased than others, so when we notice our own view of a situation is different from that of somebody else, we conclude that “I’m the objective one here” and “you are the biased one.”

FIGURE 2.6 Confirmation bias. This therapist suspects her client has an anxiety disorder. What kinds of questions should she be asking that would both potentially confirm and potentially disconfirm her hypothesis?

In one study, researchers interviewed U.S. airport travelers, most of whom said the average American is much more biased than themselves (Pronin et al., 2002). For example, the travelers said that while most others would take personal credit for successes, the travelers themselves would not. Respondents believed other Americans would say a person is smart and competent, just because he is nice; however, they themselves do not have this bias. People believed other Americans would tend to “blame the victim” of random violence for being in the wrong place at the wrong time, even though they would do no such thing themselves (Figure 2.7).

The bias blind spot might be the sneakiest of all of the biases in human thinking. It makes us trust our faulty reasoning even more. In addition, it can make it difficult for us to initiate the scientific theory-data cycle. We might say, “I don’t need to test this conclusion; I already know it is correct.” Part of learning to be a scientist is learning not to use feelings of confidence as evidence for the truth of our beliefs. Rather than thinking what they want to, scientists use data.

The Intuitive Thinker vs. the Scientific Reasoner

When we think intuitively rather than scientifically, we make mistakes. Because of our biases, we tend to notice and actively seek information that confirms our ideas. To counteract your own biases, try to adopt the empirical mindset of a researcher. Recall from Chapter 1 that empiricism involves basing beliefs on systematic information from the senses. Now we have an additional nuance for what it means to reason empirically: To be an empiricist, you must also strive to interpret the data you collect in an objective way; you must guard against common biases.

Researchers—scientific reasoners—create comparison groups and look at all the data. Rather than base their theories on hunches, researchers dig deeper and generate data through rigorous studies. Knowing they should not simply go along with the story everyone believes, they train themselves to test their intuition with systematic, empirical observations. They strive to ask questions objectively and collect potentially disconfirming evidence, not just evidence that confirms their hypotheses. Keenly aware that they have biases, scientific reasoners allow the data to speak more loudly than their own confidently held—but possibly biased—ideas. In short, while researchers are not perfect reasoners themselves, they have trained themselves to guard against the many pitfalls of intuition—and they draw more accurate conclusions as a result.

FIGURE 2.7 The bias blind spot. A physician who receives a free gift from a pharmaceutical salesperson might believe she won’t be biased by it, but she may also believe other physicians will be persuaded by such gifts to prescribe the drug company’s medicines.

39Trusting Authorities on the Subject


1. This section described several ways in which intuition is biased. Can you name all five?

2. Why might the bias blind spot be the most sneaky of all the intuitive reasoning biases?

3. Do you think you can improve your own reasoning by simply learning about these biases? How?

1. See pp. 32–38. 2. See pp. 37–38. 3. Answers will vary.

TRUSTING AUTHORITIES ON THE SUBJECT

You might have heard statements like these: “We only use 10% of our brains” and “People are either right-brained or left-brained.” People—even those we trust—make such claims as if they are facts. However, you should be cautious about basing your beliefs on what everybody says—even when the claim is made by someone who is (or claims to be) an authority. In that spirit, how reliable is the advice of guidance counselors, TV talk show hosts, or psychology professors? All these people have some authority—as cultural messengers, as professionals with advanced degrees, as people with significant life experience. But should you trust them?

Let’s consider this example of anger management advice from a person with a master’s degree in psychology, several published books on anger management, a thriving workshop business, and his own website. He’s certainly an authority on the subject, right? Here is his advice:

Punch a pillow or a punching bag. Yell and curse and moan and holler. . . . If you are angry at a particular person, imagine his or her face on the pillow or punching bag, and vent your rage. . . . You are not hitting a person, you are hitting the ghost of that person . . . a ghost alive in you that must be exorcised in a concrete, physical way. (Lee, 1993, p. 96)

Knowing what you know now, you probably do not trust John Lee’s advice. In fact, this is a clear example of how a self-proclaimed “expert” might be wrong.

Before taking the advice of authorities, ask yourself about the source of their ideas. Did the authority systematically and objectively compare different conditions, as a researcher would do? Or maybe they have read the research and are interpreting it for you; they might be practitioners who are basing their conclusions on empirical evidence. In this respect, an authority with a scientific degree may be better able to accurately understand and interpret scientific evidence (Figure 2.8). If you know this is the case—in other words, if an authority refers to research evidence—their advice might be worthy of attention. However, authorities can also base their advice on their own experience or intuition, just like the rest of us. And they, too, might present only the studies that support their own side.

Keep in mind, too, that not all research is equally reliable. The research an expert uses to support his or her argument might have been conducted poorly. In the rest of this book, you will learn how to interrogate others’ research and form conclusions about its quality. Also, the research someone cites to support an argument may not accurately and appropriately support that particular argument. In Chapter 3, you’ll learn more about what kinds of research support different kinds of claims. Figure 2.9 shows a concept map illustrating the sources of information reviewed in this chapter. Conclusions based on research, outlined in black on the concept map, are the most likely to be correct.

FIGURE 2.8 Which authority to believe? Jenny McCarthy (left), an actress and celebrity, claims that giving childhood vaccines later in life would prevent autism disorders. Dr. Paul Offit (right), a physician-scientist who has both reviewed and conducted scientific research on childhood vaccines, says that early vaccines save lives and that there is no link between vaccination and autism diagnosis.


[Figure 2.9 concept map. Labels recoverable from the figure: beliefs can be based on experience (no comparison; has confounds), on intuition (good story, availability, present/present bias, confirmation bias, bias blind spot), on authority (“This, I believe…”; could be the authority’s personal experience; could be based on the authority’s research), or on research. Research appears in scientific sources written by psychologists (journal articles, both empirical articles and review articles; chapters in edited books; full-length books) and in other sources written by psychologists, journalists, or laypeople for a popular audience (trade books; magazines and newspaper articles, look for research; wikis).]

FIGURE 2.9 A concept map showing sources of information. People’s beliefs can come from several sources. You should base your beliefs about psychological phenomena on research, rather than experience, intuition, or authority. Research can be found in a variety of sources, some more dependable than others. Ways of knowing that are mentioned in outlined boxes are more trustworthy.



1. When would it be sensible to accept the conclusions of authority figures? When might it not?

1. See p. 40. When authorities base their conclusions on well-conducted research (rather than experience or intuition), it may be reasonable to accept them.

FINDING AND READING THE RESEARCH

In order to base your beliefs on empirical evidence rather than on experience, intuition, or authority, you will, of course, need to read about that research. But where do you find it? What if you wanted to read studies on venting anger? How would you locate them?

Consulting Scientific Sources

Psychological scientists usually publish their research in three kinds of sources. Most often, research results are published as articles in scholarly journals. In addition, psychologists may describe their research in single chapters within edited books. Some researchers also write full-length scholarly books.


Scientific journals come out monthly or quarterly, as magazines do. Unlike popular magazines, however, scientific journals usually do not have glossy, colorful covers or advertisements. You are most likely to find scientific journals in college or university libraries or in online academic databases, which are generally available through academic libraries. For example, the study by Bushman (2002) described earlier was published in the journal Personality and Social Psychology Bulletin.

Journal articles are written for an audience of other psychological scientists and psychology students. They can be either empirical articles or review articles. Empirical journal articles report, for the first time, the results of an (empirical) research study. Empirical articles contain details about the study’s method, the statistical tests used, and the results of the study. Figure 2.10 is an example of an empirical journal article.

Does Venting Anger Feed or Extinguish the Flame? Catharsis, Rumination, Distraction, Anger, and Aggressive Responding

Brad J. Bushman
Iowa State University

Does distraction or rumination work better to diffuse anger? Catharsis theory predicts that rumination works best, but empirical evidence is lacking. In this study, angered participants hit a punching bag and thought about the person who had angered them (rumination group) or thought about becoming physically fit (distraction group). After hitting the punching bag, they reported how angry they felt. Next, they were given the chance to administer loud blasts of noise to the person who had angered them. There also was a no punching bag control group. People in the rumination group felt angrier than did people in the distraction or control groups. People in the rumination group were also most aggressive, followed respectively by people in the distraction and control groups. Rumination increased rather than decreased anger and aggression. Doing nothing at all was more effective than venting anger. These results directly contradict catharsis theory.

The belief in the value of venting anger has become widespread in our culture. In movies, magazine articles, and even on billboards, people are encouraged to vent their anger and “blow off steam.” For example, in the movie Analyze This, a psychiatrist (played by Billy Crystal) tells his New York gangster client (played by Robert De Niro), “You know what I do when I’m angry? I hit a pillow. Try that.” The client promptly pulls out his gun, points it at the couch, and fires several bullets into the pillow. “Feel better?” asks the psychiatrist. “Yeah, I do,” says the gunman. In a Vogue magazine article, female model Shalom concludes that boxing helps her release pent-up anger. She said,

I found myself looking forward to the chance to pound out the frustrations of the week against Carlos’s (her trainer) mitts. Let’s face it: A personal boxing trainer has advantages over a husband or lover. He won’t look at you accusingly and say, “I don’t know where this irritation is coming from.” . . . Your boxing trainer knows it’s in there. And he wants you to give it to him. (“Fighting Fit,” 1993, p. 179)

In a New York Times Magazine article about hate crimes, Andrew Sullivan writes, “Some expression of prejudice serves a useful purpose. It lets off steam; it allows natural tensions to express themselves incrementally; it can siphon off conflict through words, rather than actions” (Sullivan, 1999, p. 113). A large billboard in Missouri states, “Hit a Pillow, Hit a Wall, But Don’t Hit Your Kids!”

Catharsis Theory

The theory of catharsis is one popular and authoritative statement that venting one’s anger will produce a positive improvement in one’s psychological state. The word catharsis comes from the Greek word katharsis, which literally translated means a cleansing or purging. According to catharsis theory, acting aggressively or even viewing aggression is an effective way to purge angry and aggressive feelings.

Sigmund Freud believed that repressed negative emotions could build up inside an individual and cause psychological symptoms, such as hysteria (nervous outbursts). Breuer and Freud (1893-1895/1955) proposed that the treatment of hysteria required the discharge of the emotional state previously associated with trauma. They claimed that for interpersonal traumas, such as

Author’s Note: I would like to thank Remy Reinier for her help scanning photo IDs of students and photographs from health magazines. I also would like to thank Angelica Bonacci for her helpful comments on an early draft of this article. Correspondence concerning this article should be addressed to Brad J. Bushman, Department of Psychology, Iowa State University, Ames, IA 50011-3180; e-mail: bushman@

PSPB, Vol. 28 No. 6, June 2002 724-731 © 2002 by the Society for Personality and Social Psychology, Inc.

FIGURE 2.10 Bushman’s empirical article on catharsis. The first page is shown here, as it appeared in Personality and Social Psychology Bulletin. The inset shows how the article appears in an online search in that journal. Clicking “Full Text pdf” takes you to the article shown. (Source: Bushman, 2002.)

Review journal articles provide a summary of all the published studies that have been done in one research area. A review article by Anderson and his colleagues (2010), for example, summarizes 130 studies on the effects of playing violent video games on the aggressive behavior of children. Sometimes a review article uses a quantitative technique called meta-analysis, which combines the results of many studies and gives a number that summarizes the magnitude, or the effect size, of a relationship. In the Anderson review (2010), the authors computed the average effect size across all the studies. This technique is valued by psychologists because it weighs each study proportionately and does not allow cherry-picking particular studies.

❯❯ For a full discussion of meta-analysis, see Chapter 14, pp. 433–437.
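To make the idea of averaging effect sizes concrete, here is a minimal Python sketch. The four studies and their values are invented for illustration and are not taken from the Anderson review. Weighting each study by its number of participants is only one simple way to weigh studies proportionately; it is an assumption made here, and published meta-analyses may use other schemes, such as inverse-variance weights.

```python
# Hypothetical studies: (effect size r, number of participants).
# These values are made up for illustration only.
studies = [
    (0.10, 50),
    (0.25, 200),
    (0.15, 100),
    (0.30, 150),
]


def weighted_mean_effect(studies):
    """Average the effect sizes, weighting each study by its sample size."""
    total_n = sum(n for _, n in studies)
    return sum(r * n for r, n in studies) / total_n


def unweighted_mean_effect(studies):
    """Plain average that treats every study as equally informative."""
    return sum(r for r, _ in studies) / len(studies)


weighted = weighted_mean_effect(studies)
unweighted = unweighted_mean_effect(studies)

print(f"Sample-size-weighted mean effect: {weighted:.2f}")
print(f"Unweighted mean effect:           {unweighted:.2f}")
```

Because every study enters the calculation, no single study can be singled out to tell the story, and in this sketch larger studies count for more than smaller ones.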

Before being published in a journal, both empirical articles and review articles must be peer-reviewed (see Chapter 1). Both types are considered the most prestigious forms of publication because they have been rigorously peer-reviewed.


An edited book is a collection of chapters on a common topic; each chapter is written by a different contributor. For example, Michaela Riedeger and Kathrin Klipker published a chapter entitled “Emotional Regulation in Adolescence” in an edited book, The Handbook of Emotion Regulation (2014). There are over 30 chapters, all written by different researchers. The editor, James Gross, invited all the other authors to contribute. Generally, a book chapter is not the first place a study is reported; instead, the scientist is summarizing a collection of research and explaining the theory behind it. Edited book chapters can therefore be a good place to find a summary of a set of research a particular psychologist has done. (In this sense, book chapters are similar to review articles in journals.) Chapters are not peer-reviewed as rigorously as empirical journal articles or review articles. However, the editor of the book is careful to invite only experts—researchers who are intimately familiar with the empirical evidence on a topic—to write the chapters. The audience for these chapters is usually other psychologists and psychology students (Figure 2.11).


In some other disciplines (such as anthropology, art history, or English), full-length books are a common way for scholars to publish their work. However, psychologists do not write many full-length scientific books for an audience of other psychologists. Those books that have been published are most likely to be found in academic libraries. (Psychologists may also write full-length books for a general audience, as discussed below.)

FIGURE 2.11 The variety of scientific sources. You can read about research in empirical journal articles, review journal articles, edited books, and full-length books.

Finding Scientific Sources

You can find trustworthy, scientific sources on psychological topics by starting with the tools in your college or university’s library. The library’s reference staff can be extremely helpful in teaching you how to find appropriate articles or chapters. Working on your own, you can use databases such as PsycINFO and Google Scholar to conduct searches.


One comprehensive tool for sorting through the vast number of psychological research articles is a search engine and database called PsycINFO; it is maintained and updated weekly. Doing a search in PsycINFO is like using Google, but instead of searching the Internet, it searches only sources in psychology, plus a few sources from related disciplines, including communication, marketing, and education. PsycINFO’s database includes more than 2.5 million records, mostly peer-reviewed articles.

PsycINFO has many advantages. It can show you all the articles written by a single author (e.g., “Brad Bushman”) or under a single keyword (e.g., “autism”). It tells you whether each source was peer-reviewed. One of the best features of PsycINFO is that it shows other articles that have cited each target article (listed under “Cited by”) and other articles each target article has cited (listed under “References”). If you’ve found a great article for your project in PsycINFO, the “Cited by” and “References” lists can be helpful for finding more papers just like it.

The best way to learn to use PsycINFO is to simply try it yourself. Or, a reference librarian can show you the basic steps in a few minutes.

One disadvantage is that you cannot use PsycINFO unless your college or university library subscribes to it. Another challenge—true for any search—is translating your curiosity into the right keywords. Sometimes the search you run will give you too many results to sort through easily. Other times your search words won’t yield the kinds of articles you were expecting to see. Table 2.5 presents some strategies for turning your questions into successful searches.

TABLE 2.5 

Tips for Turning Your Question into a Successful Database Search

1. Find out how psychologists talk about your question. Use the Thesaurus tool in the PsycINFO search window to help you find the proper search term:

Example question: Do eating disorders happen more frequently in families that eat dinner together?

Instead of “eating disorders,” you may need to be more specific. The Thesaurus tool suggests “binge-eating disorder” or “binge eating.”

Instead of “eating dinner together,” you may need to be more broad. Thesaurus terms include “family environment” and “home environment.”

Example question: What motivates people to study?

Search terms to try: “achievement motivation,” “academic achievement motivation,” “academic self concept,” “study habits,” “homework,” “learning strategies.”

Example question: Is the Mozart effect real?

Search terms to try: “Mozart-effect,” “music,” “performance,” “cognitive processes,” “reasoning.”

2. An asterisk can help you get all related terms: Example: “adolescen*” searches for “adolescence” and “adolescents” and “adolescent.”

3. If you get too few hits, combine terms using “or” (or gives you more):

Example: “anorexia” or “bulimia” or “eating disorder.”

Example: “false memory” or “early memory.”

4. If you get too many hits, restrict using “and” or by using “not”:

Example: “anorexia” and “adolescen*.”

Example: “repressed memory” and “physical abuse.”

Example: “repressed memory” not “physical abuse.”

5. Did you find a suitable article? Great! Find similar others by looking through that article’s References and by clicking on Cited by to find other researchers who have used it.
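The logic behind tips 2 through 4 can be sketched with ordinary set operations in Python. This is only an illustration of how "or" widens a search while "and" and "not" narrow it; it is not a description of how PsycINFO works internally. The five records and the `search` function are made up for the example.

```python
import re

# A tiny, made-up set of records standing in for a database of article titles.
records = {
    1: "anorexia in adolescents: a family study",
    2: "bulimia and body image in adults",
    3: "adolescence and sleep habits",
    4: "repressed memory and physical abuse",
    5: "repressed memory in laboratory settings",
}


def search(term):
    """Return the set of record IDs whose title matches the term.

    A term ending in * matches any word that starts with the letters before it.
    Any other term must appear in the title as an exact phrase.
    """
    hits = set()
    for record_id, title in records.items():
        if term.endswith("*"):
            prefix = term[:-1]
            words = re.findall(r"[a-z]+", title)
            if any(word.startswith(prefix) for word in words):
                hits.add(record_id)
        elif term in title:
            hits.add(record_id)
    return hits


wildcard = search("adolescen*")                                   # tip 2: asterisk
broader = search("anorexia") | search("bulimia")                  # tip 3: "or" gives more
narrower = search("anorexia") & search("adolescen*")              # tip 4: "and" restricts
excluded = search("repressed memory") - search("physical abuse")  # tip 4: "not" restricts

print(sorted(wildcard), sorted(broader), sorted(narrower), sorted(excluded))
```

In this toy example, "or" returns every record that matches either term, while "and" keeps only the records that match both and "not" drops the records that match the second term.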



If you want to find empirical research but don’t have access to PsycINFO, you can try the free tool Google Scholar. It works like the regular Google search engine, except the search results are only in the form of empirical journal articles and scholarly books. In addition, by visiting the User Profile for a particular scientist, you can see all of that person’s publications. The User Profile list is updated automatically, so you can easily view each scientist’s most recent work, as well as his or her most cited publications.

One disadvantage of Google Scholar is that it doesn’t let you limit your search to specific fields (such as the abstract). In addition, it doesn’t categorize the articles it finds, for example, as peer-reviewed or not, whereas PsycINFO does. And while PsycINFO indexes only psychology articles, Google Scholar contains articles from all scholarly disciplines. It may take more time for you to sort through the articles it returns because the output of a Google Scholar search is less well organized.

When you find a good source in Google Scholar, you might be able to immediately access a PDF file of the article for free. If not, then look up whether your university library offers it. You can also request a copy of the article through your college’s interlibrary loan office, or possibly by visiting the author’s university home page.

Reading the Research

Once you have found an empirical journal article or chapter, then what? You might wonder how to go about reading the material. At first glance, some journal articles contain an array of statistical symbols and unfamiliar terminology. Even the titles of journal articles and chapters can be intimidating. Take this one, for example: “Object Substitution Masking Interferes with Semantic Processing: Evidence from Event-Related Potentials” (Reiss & Hoffman, 2006). How is a student supposed to read this sort of thing? It helps to know what you will find in an article and to read with a purpose.


Most empirical journal articles (those that report the results of a study for the first time) are written in a standard format, as recommended by the Publication Manual of the American Psychological Association (APA, 2010). Most empirical journal articles include certain sections in the same order: abstract, introduction, method, results, discussion, and references. Each section contains a specific kind of information. (For more on empirical journal articles, see Presenting Results: APA-Style Reports at the end of this book.)

Abstract. The abstract is a concise summary of the article, about 120 words long. It briefly describes the study’s hypotheses, method, and major results. When you are collecting articles for a project, the abstracts can help you quickly decide whether each article describes the kind of research you are looking for, or whether you should move on to the next article.


Introduction. The introduction is the first section of regular text, and the first paragraphs typically explain the topic of the study. The middle paragraphs lay out the background for the research. What theory is being tested? What have past studies found? Why is the present study important? Pay attention to the final paragraph, which states the specific research questions, goals, or hypotheses for the current study.

Method. The Method section explains in detail how the researchers conducted their study. It usually contains subsections such as Participants, Materials, Procedure, and Apparatus. An ideal Method section gives enough detail that if you wanted to repeat the study, you could do so without having to ask the authors any questions.

Results. The Results section describes the quantitative and, as relevant, qualitative results of the study, including the statistical tests the authors used to analyze the data. It usually provides tables and figures that summarize key results. Although you may not understand all the statistics used in the article (especially early in your psychology education), you might still be able to understand the basic findings by looking at the tables and figures.

Discussion. The opening paragraph of the Discussion section generally summarizes the study’s research question and methods and indicates how well the results of the study supported the hypotheses. Next, the authors usually discuss the study’s importance: Perhaps their hypothesis was new, or the method they used was a creative and unusual way to test a familiar hypothesis, or the participants were unlike others who had been studied before. In addition, the authors may discuss alternative explanations for their data and pose interesting questions raised by the research.

References. The References section contains a full bibliographic listing of all the sources the authors cited in writing their article, enabling interested readers to locate these studies. When you are conducting a literature search, reference lists are excellent places to look for additional articles on a given topic. Once you find one relevant article, the reference list for that article will contain a treasure trove of related work.


Here’s some surprising advice: Don’t read every word of every article, from beginning to end. Instead, read with a purpose. In most cases, this means asking two questions as you read: (1) What is the argument? (2) What is the evidence to support the argument? The obvious first step toward answering these questions is to read the abstract, which provides an overview of the study. What should you read next?

Empirical articles are stories from the trenches of the theory-data cycle (see Figure 1.5 in Chapter 1). Therefore, an empirical article reports on data that are generated to test a hypothesis, and the hypothesis is framed as a test of a particular theory. After reading the abstract, you can skip to the end of the introduction to find the primary goals and hypotheses of the study. After reading the goals and hypotheses, you can read the rest of the introduction to learn more about the theory that the hypotheses are testing. Another place to find information about the argument of the paper is the first paragraph of the Discussion section, where most authors summarize the key results of their study and state how well the results supported their hypotheses.

Once you have a sense of what the argument is, you can look for the evidence. In an empirical article, the evidence is contained in the Method and Results sections. What did the researchers do, and what results did they find? How well do these results support their argument (i.e., their hypotheses)?


While empirical journal articles use predetermined headings such as Method, Results, and Discussion, authors of chapters and review articles usually create headings that make sense for their particular topic. Therefore, a way to get an overview of a chapter or review article is by reading each heading.

As you read these sources, again ask: What is the argument? What is the evidence? The argument will be the purpose of the chapter or review article—the author’s stance on the issue. In a review article or chapter, the argument often presents an entire theory (whereas an empirical journal article usually tests only one part of a theory). Here are some examples of arguments you might find in chapters or review articles:

• Playing violent video games causes children to be more aggressive (Anderson et al., 2010).

• While speed reading is possible, it comes at the cost of comprehension of the text (Rayner, Schotter, Masson, Potter, & Treiman, 2016).

• “Prolonged exposure therapy” is effective for treating most people who suffer from posttraumatic stress disorder, though many therapists do not yet use this therapy with their clients (Foa, Gillihan, & Bryant, 2013).

In a chapter or review article, the evidence is the research that the author reviews. How much previous research has been done? What have the results been? How strong are the results? What do we still need to know? With practice, you will get better at reading efficiently. You’ll learn to categorize what you read as argument or evidence, and you will be able to evaluate how well the evidence supports the argument.

Finding Research in Less Scholarly Places

Reading about research in its original form is the best way to get a thorough, accurate, and peer-reviewed report of scientific evidence. There are other sources for reading about psychological research, too, such as nonacademic books written for the general public, websites, and popular newspapers and magazines. These can be good places to read about psychological research, as long as you choose and read your sources carefully.


If you browse through the psychology section in a bookstore, you will mostly find what are known as trade books about psychology, written for a general audience (Figure 2.12). Unlike the scientific sources we’ve covered, these books are written for people who do not have a psychology degree. They are written to help people, to inform, to entertain, and to make money for their authors.

The language in trade books is much more readable than the language in most journal articles. Trade books can also show how psychology applies to your everyday life, and in this way they can be useful. But how well do trade books reflect current research in psychology? Are they peer-reviewed? Do they contain the best research, or do they simply present an uncritical summary of common sense, intuition, or the author’s own experience?

One place to start is flipping to the end of the book, where you should find footnotes or references documenting the research studies on which the arguments are based. For example, The Secret Life of Pronouns, by psychologist James Pennebaker (2011), contains 54 pages of notes, mostly citations to research discussed in the rest of the book. Gabriele Oettingen’s Rethinking Positive Thinking (2014) contains 17 pages of citations and notes. A book related to this chapter’s theme, Anger: The Misunderstood Emotion (Tavris, 1989), contains 25 pages of references. These are examples of trade books based on research that are written by psychologists for a general audience.

In contrast, if you flip to the end of some other trade books, you may not find any references or notes. For example, The Everything Guide to Narcissistic Personality Disorder, by Cynthia Lechan and Barbara Leff (2011), suggests a handful of books, but includes no reference section. Healing ADD: The Breakthrough Program that Allows You to See and Heal the 6 Types of ADD, by Daniel Amen (2002), cites no research. The book Why Mars and Venus Collide: Improving Relationships by Understanding How Men and Women Cope Differently with Stress, by John Gray (2008), has four pages of references. Four pages is better than nothing but seems a little light, given that literally thousands of journal articles have been devoted to the scientific study of gender differences.

FIGURE 2.12 Finding research in a popular bookstore. You can find some good descriptions of psychology at your local bookstore. Be sure to choose books that contain a long list of scientific sources in their reference section.

Vast, well-conducted bodies of literature exist on such topics as self-esteem, ADHD, gender differences, mental illnesses, and coping with stress, but some authors ignore this scientific literature and instead rely on hand-selected anecdotes from their own clinical practice. So if you find a book that claims to be about psychology but does not have any references, consider it to be light entertainment (at best) or irresponsible (at worst). By now, you know that you can do better.


Wikis can provide quick, easy-to-read facts about almost any topic. What kind of animal is a narwhal? What years were the Hunger Games movies released? How many Grammy awards has Shakira received? Wikis are democratic encyclopedias. Anybody can create a new entry, anybody can contribute to the content of a page, and anybody can log in and add details to an entry. Theoretically, wikis are self-correcting: If one user posts an incorrect fact, another user would come along and correct it.

If you’re like most students, you’ve used wikis for research even though you’ve been warned not to. If you use Wikipedia for psychology research, for example, sometimes you will find a full review of a psychological phenomenon; sometimes you won’t. Searching for the term catharsis provides an illustration. If you look up that term on Wikipedia, the first article that comes up is not related to psychology at all; it’s about the role of catharsis in classical drama.

You probably know about other disadvantages. First, wikis are not comprehensive in their coverage: You cannot read about a topic if no one has created a page for it. Second, although wiki pages might include references, these references are not a comprehensive list; they are idiosyncratic, representing the preferences of wiki contributors. Third, the details on the pages might be incorrect, and they will stay incorrect until somebody else fixes them. Finally, vandalism is a potential problem (sometimes people intentionally insert errors into pages); however, Wikipedia has developed digital robots to detect and delete the most obvious errors, often within seconds (Nasaw, 2012).

Now that more scientists make a point of contributing to them, wikis may become more comprehensive and accurate (Association for Psychological Science, n.d.). But be careful. Wikis may be written only by a small, enthusiastic, and not necessarily expert group of contributors. Although Wikipedia may be your first hit from a Google search, you should always double-check the information found there. And be aware that many psychology professors do not accept wikis as sources in academic assignments.


Overall, popular media coverage is good for psychology. Journalists play an important role in telling the public about exciting findings in psychological science. Psychological research is covered in online magazines (such as Slate and Vox), in news outlets, and in podcasts and blogs. Some outlets, such as Psychology Today and the Hidden Brain podcast, are devoted exclusively to covering social science research for a popular audience (Figure 2.13).

Chapter 1 explained that journalists who specialize in science writing are trained to faithfully represent journal articles for a popular audience, but journalists who are not trained in science writing might not correctly summarize a journal article. They may oversimplify things and even make claims that the study did not support. When you read popular media stories, plan to use your skills as a consumer of information to read the content critically. Delve into the topic the journalist is covering by using PsycINFO or Google Scholar to locate the original article and read the research at its source.

FIGURE 2.13 Examples of sources for reading about psychological science, directed at a popular audience. A variety of sources cover psychological science in a reader-friendly format.


1. How are empirical journal articles different from review journal articles? How is each type of article different from a chapter in an edited book?

2. What two guiding questions can help you read any academic research source?

3. Describe the advantages and disadvantages of using PsycINFO and Google Scholar.

4. If you encounter a psychological trade book, what might indicate that the information it contains is research-based?

1. See pp. 42–44. 2. See p. 47. 3. See pp. 45–46. 4. See pp. 49–50.


Summary

People’s beliefs can be based on their own experience, their intuition, on authorities, or on controlled research. Of these, research information is the most accurate source of knowledge.

The Research vs. Your Experience

• Beliefs based on personal experience may not be accurate. One reason is that personal experience usually does not involve a comparison group. In contrast, research explicitly asks: Compared to what?

• In addition, personal experience is often confounded. In daily life, many things are going on at once, and it is impossible to know which factor is responsible for a particular outcome. In contrast, researchers can closely control for confounding factors.

• Research has an advantage over experience because researchers design studies that include appropriate comparison groups.

• Conclusions based on research are probabilistic. Research findings are not able to predict or explain all cases all the time; instead, they aim to predict or explain a high proportion of cases. Individual exceptions to research findings will not nullify the results.

The Research vs. Your Intuition

• Intuition is a flawed source of information because it is affected by biases in thinking. People are likely to accept the explanation of a story that makes sense intuitively, even if it is not true.

• People can overestimate how often something happens if they consider only readily available thoughts, those that come to mind most easily.

• People find it easier to notice what is present than what is absent. When people forget to look at the information that would falsify their belief, they may see relationships that aren’t there.

• Intuition is also subject to confirmation bias. We tend to focus on the data that support our ideas and criticize or discount data that disagree. We ask leading questions whose answers are bound to confirm our initial ideas.

• We all seem to have a bias blind spot and believe we are less biased than everyone else.

• Scientific researchers are aware of their potential for biased reasoning, so they create special situations in which they can systematically observe behavior. They create comparison groups, consider all the data, and allow the data to change their beliefs.

Trusting Authorities on the Subject

• Authorities may attempt to convince us to accept their claims. If their claims are based on their own experience or intuition, we should probably not accept them. If they use well-conducted studies to support their claims, we can be more confident about taking their advice.

Finding and Reading the Research

• Tools for finding research in psychology include the online database PsycINFO, available through academic libraries. You can also use Google Scholar or the websites of researchers.

• Journal articles, chapters in edited books, and full-length books should be read with a purpose by asking: What is the theoretical argument? What is the evidence, that is, what do the data say?

• Trade books, wikis, and popular media articles can be good sources of information about psychology research, but they can also be misleading. Such sources should be evaluated by asking whether they are based on research and whether the coverage is comprehensive, accurate, and responsible.

Key Terms

comparison group, p. 26
confound, p. 29
confederate, p. 29
probabilistic, p. 31
availability heuristic, p. 33
present/present bias, p. 35
confirmation bias, p. 36
bias blind spot, p. 37
empirical journal article, p. 42
review journal article, p. 42
meta-analysis, p. 42
effect size, p. 44

Review Questions

1. Destiny concluded that her new white noise machine helped her fall asleep last night. She based this conclusion on personal experience, which might have confounds. In this context, a confound means:

a. Another thing might have also occurred last night to help Destiny fall asleep.

b. Destiny’s experience has left her puzzled or confused.

c. Destiny has not compared last night to times she didn’t use the white noise machine.

d. Destiny will have trouble thinking of counterexamples.

2. What does it mean to say that research is probabilistic?

a. Researchers refer to the probability that their theories are correct.

b. Research predicts all possible results.

c. Research conclusions are meant to explain a certain proportion of possible cases, but may not explain all.

d. If there are exceptions to a research result, it means the theory is probably incorrect.

3. After two students from his school commit suicide, Marcelino concludes that the most likely cause of death in teenagers is suicide. In fact, suicide is not the most likely cause of death in teens. What happened?

a. Marcelino was probably a victim of the bias blind spot.

b. Marcelino was probably influenced by the availability heuristic; he was too influenced by cases that came easily to mind.

c. Marcelino thought about too many examples of teens who died from other causes besides suicide.

d. Marcelino did not consider possible confounds.

4. When is it a good idea to base conclusions on the advice of authorities?

a. When authorities have an advanced degree, such as a Ph.D. or a master’s degree.

b. When authorities based their advice on research that systematically and objectively compares different conditions.

c. It is never a good idea to base conclusions on the advice of authorities.

d. When authorities have several years of experience in their specialty area.

5. Which of the following is not a place where psychological scientists publish their research?

a. Scientific journals

b. Online podcasts

c. Chapters in edited books

d. Full-length books

6. In reading an empirical journal article, what are the two questions you should be asking as you read?

a. What is the argument? What is the evidence to support the argument?

b. Why was this research done? Were there any significant findings?

c. How reputable is (are) the author(s)? Did the findings include support for the hypotheses?

d. How does this research relate to other research? What are ways to extend this research further?



Learning Actively

1. Each of the examples below is a statement, based on experience, that does not take a comparison group into account:

a. A bath at bedtime helps my baby sleep better.

b. My meditation practice has made me feel more peaceful.

c. The GRE course I took really improved my scores!

For each statement: (a) Ask: Compared to what? Write a comparison group question that would help you evaluate the conclusion. (b) Get all the information. Draw a 2×2 matrix for systematically comparing outcomes. (c) Address confounds. Assuming there is a relationship, write down possible confounds for the proposed relationship.

Example: “Since I cut sugar from their diets, I’ve noticed the campers in my cabin are much more cooperative!”

(a) Compared to what? Would the campers have improved anyway, without the change in diet?

(b) A systematic comparison should be set up as follows:

                                             Sugar cut from diet      Sugar not cut from diet
                                             (treatment present)      (treatment absent)
Kids are cooperative (outcome present)
Kids are not cooperative (outcome absent)

(c) Possible confounds: What other factor might have changed at the same time as the low-sugar diet and also caused more cooperativeness? Possible confounds include that the campers may simply have gotten used to camp and settled down. Maybe the new swimming program started at the same time and tired the campers out.

2. Using what you have learned in this chapter, write a sentence or two explaining why the reasoning reflected in each of the following statements is sound or unsound. (a) What are you being asked to believe in each case? (b) What further information might you need to determine the accuracy of the speaker’s conclusions? (c) On what is the speaker basing her claim—experience, intuition, or authority?

a. “I’m positive my cousin has an eating disorder! She’s always eating diet bars.”

b. A friend tells you, “I read something cool in the paper this morning: They said violent video games don’t cause aggression when they are played cooperatively as team games. They were talking about some research somebody did.”

c. “It’s so clear that our candidate won that debate! Did you hear all the zingers he delivered?”

d. “I read online that doing these special puzzles every day helps grow your brain. It’s been proven by neuropsychology.”

e. “Binge drinking is totally normal on my campus. Everybody does it almost every weekend.”

f. “I’m afraid of flying—planes are so dangerous!”

g. “Decluttering your closets makes a huge difference in your happiness. I did it last week, and I feel so much happier when I am in my room.”

h. “Wow—look at how many happy couples got married after meeting on! I think I’m going to try it, too.”

3. Finding sources on PsycINFO means figuring out the right search terms. Use the PsycINFO Thesaurus tool to find keywords that will help you do searches on these research questions. Table 2.5 has some suggestions for turning research questions into helpful searches.

a. Are adults with autism more violent than typical adults?

b. Does having PTSD put you at risk for alcoholism?

c. Can eating more protein make you smarter?

d. How do we measure narcissism?

e. What kinds of managers do employees like the best?

4. Choose one of the search terms you worked on in Question 3. Try doing the same search using three platforms: a general search engine, then Google Scholar, then PsycINFO.

a. What kind of information did you get from the general search engine you chose? Were the results based on research, or do you see more commercial websites or blogs? How might you refine your search to get more research-based hits?

b. Which of the three search platforms is the easiest to use when you want a general overview of a topic? Which platforms will give you the most up-to-date research? Which of the three makes it easiest to know if information has been peer-reviewed?
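The 2×2 comparison from Question 1 can also be tallied in a few lines of code. The following is a minimal Python sketch, not part of the exercise itself: the `observations` list is made-up data for illustration only, and the helper names `tally_2x2` and `outcome_rate` are hypothetical.

```python
from collections import Counter

def tally_2x2(observations):
    """Count cases in each cell of a 2x2 matrix.

    Each observation is a (treatment_present, outcome_present) pair of booleans.
    """
    return Counter(observations)

def outcome_rate(cells, treatment_present):
    """Proportion of cases with the outcome present, within one treatment column."""
    present = cells[(treatment_present, True)]
    absent = cells[(treatment_present, False)]
    total = present + absent
    return present / total if total else float("nan")

# Made-up data for illustration only: (sugar_cut, cooperative)
observations = (
    [(True, True)] * 6 + [(True, False)] * 2 +
    [(False, True)] * 3 + [(False, False)] * 5
)

cells = tally_2x2(observations)
rate_with_treatment = outcome_rate(cells, True)      # 6 of 8 campers
rate_without_treatment = outcome_rate(cells, False)  # 3 of 8 campers
print(f"Cooperative with sugar cut:    {rate_with_treatment:.2f}")
print(f"Cooperative without sugar cut: {rate_without_treatment:.2f}")
```

Only when all four cells are filled in can the two rates be compared, which is the "Compared to what?" question in numerical form. The sketch says nothing about confounds; those still have to be reasoned through separately.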


People with Higher Incomes Spend Less Time Socializing Huffington Post, 2016

Stories Told of Brilliant Scientists Affect Kids’ Interest in the Field National Public Radio, 2016

72% of the World Smiled Yesterday, 2016


Three Claims, Four Validities: Interrogation Tools for Consumers of Research

Articles about psychology research written for a general audience regularly appear in the popular media. Headlines about psychology attract readers because people are interested in such topics as happiness, social interaction, and school achievement, among many others. As a psychology student, you’re probably interested in these subjects as well. But to what extent should you believe the information you read online? Journalists who write about psychological science should simply report what the researchers did and why the study was important, but sometimes they misrepresent or overstate the research findings. They may do so either unintentionally, because they lack the appropriate training to properly critique the findings, or intentionally, to draw readers’ attention. Even an empirical journal article could overstate a study’s findings. When writers make unsupported claims about what a particular study means, it’s kind of like wrapping an unexciting gift in fancy paper (Figure 3.1).

Your research methods course will help you understand both popular and research-based articles at a more sophisticated level. You can learn how to raise the appropriate questions for interrogating the study that is being used to support a writer’s claims about human behavior. By extension, the skills you use to evaluate information behind the research will also help you plan your own studies if you intend to become a producer of information.

A year from now, you should still be able to:

1. Differentiate the three types of claims: frequency, association, and causal.

2. Ask appropriate questions to help you interrogate each of the four big validities: construct validity, statistical validity, external validity, and internal validity.

3. Explain which validities are most relevant for each of the three types of claims.

4. State which kind of research study is required to support a causal claim.

Think of this chapter as a scaffold. All the research information in later chapters will have a place in the framework of three claims and four validities presented here. The three types of claims—frequency claims, association claims, and causal claims—make statements about variables or about relationships between variables. Therefore, learning some basics about variables comes first.

VARIABLES

Variables are the core unit of psychological research. A variable, as the word implies, is something that varies, so it must have at least two levels, or values.

Take this headline: “72% of the world smiled yesterday.” Here, “whether a person smiled yesterday” is the variable, and its levels are a person smiling yesterday and a person not smiling yesterday. Similarly, the study that inspired the statement “People with higher incomes spend less time socializing” contains two variables: income (whose levels might be low, medium, and high) and the amount of time people spend socializing (with levels ranging from 0 to 7 evenings per week). In contrast, if a study concluded that “15% of Americans smoke,” nationality is not a variable because everyone in the study is American. In this example, nationality would be a constant, not a variable. A constant is something that could potentially vary but that has only one level in the study in question. (In this example, “smoking” would be a variable, and its levels would be smoker and nonsmoker.)
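The distinction between a variable and a constant can be checked mechanically by counting how many levels actually appear in a study's data. The following is a minimal Python sketch under stated assumptions: the `study` records are made up for illustration, and the helper names `levels` and `is_variable` are hypothetical.

```python
def levels(records, name):
    """Return the set of distinct values (levels) observed for one column."""
    return {record[name] for record in records}

def is_variable(records, name):
    """A variable has at least two levels in this study; a constant has one."""
    return len(levels(records, name)) >= 2

# Made-up records for illustration only.
study = [
    {"nationality": "American", "smoking": "smoker"},
    {"nationality": "American", "smoking": "nonsmoker"},
    {"nationality": "American", "smoking": "nonsmoker"},
]

for column in ("nationality", "smoking"):
    kind = "variable" if is_variable(study, column) else "constant"
    print(f"{column}: {sorted(levels(study, column))} -> {kind}")
```

Note that the check is about the study in question: nationality could vary in principle, but it has only one level in these records, so here it is a constant.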

Measured and Manipulated Variables

The researchers in any study either measure or manipulate each variable. The distinction is important because some claims are tested with measured variables, while other claims must be tested with both measured and manipulated variables. A measured variable is one whose levels are simply observed and recorded. Some variables, such as height and IQ, are measured using familiar tools (a ruler, a test). Other variables, such as gender and hair color, are also said to be “measured.” To measure abstract variables, such as depression and stress, researchers might devise a special set of questions to represent the various levels. In each case, measuring a variable is a matter of recording an observation, a statement, or a value as it occurs naturally.

In contrast, a manipulated variable is a variable a researcher controls, usually by assigning study participants to the different levels of that variable. For example, a researcher might give some participants 10 milligrams of a medication, others 20 mg, and still others 30 mg. Or a researcher might assign some people to take a test in a room with many other people and assign others to take the test alone. In both examples, the participants could end up at any of the levels because the researchers do the manipulating, assigning participants to be at one level of the variable or another.

FIGURE 3.1 Studies may not match the claims made about them. Journalists and researchers sometimes make claims about the meaning of research results. When a study has been “wrapped up” in a dramatic headline, the research inside might not live up to the expectations. Does the study support the headline in which a writer has wrapped it?

[Figure: the headline “Music Lessons Enhance IQ” and a graph of increase in score for Groups 1 through 4.]
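Assigning participants to the levels of a manipulated variable, as in the medication example, might look like the following minimal Python sketch. Shuffling the participants before assigning them is an assumption made here for illustration; the passage says only that the researcher does the assigning. The participant IDs and the helper name `assign_to_levels` are hypothetical.

```python
import random

def assign_to_levels(participants, levels, seed=None):
    """Assign participants to levels in shuffled, round-robin order.

    Shuffling is an assumption of this sketch, not something the text specifies.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {person: levels[i % len(levels)] for i, person in enumerate(shuffled)}

participants = [f"P{n}" for n in range(1, 10)]  # hypothetical participant IDs
dose_levels = ["10 mg", "20 mg", "30 mg"]
assignment = assign_to_levels(participants, dose_levels, seed=1)

for person in participants:
    print(person, assignment[person])
```

Because the researcher's procedure, not the participant, determines the level, any participant could have ended up at any dose. That is what separates a manipulated variable from a measured one.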

Some variables cannot be manipulated; they can only be measured. Age can’t be manipulated because researchers can’t assign people to be older or younger; they can only measure what age they already are. IQ is another variable that can’t be manipulated; researchers cannot assign some people to have a high IQ and others to have a low IQ; they can only measure each person’s IQ. Even if the researchers choose the 10% of people with the highest IQ and the 10% with the lowest IQ, it is still a measured variable because people cannot be assigned to the highest or lowest 10%.

Other variables cannot be manipulated because it would be unethical to do so. For example, in a study on the long-term effects of elementary education, you could not ethically assign children to “high-quality school” and “low-quality school” conditions. Nor could you ethically assign people to conditions that put their physical or emotional well-being at risk. (For a complete discussion of ethical guidelines in research, see Chapter 4.)

Some variables, however, can be either manipulated or measured, depending on the goals of a study. If childhood extracurricular activities were the variable of interest, you could measure whether children already take music or drama lessons, or you could manipulate this variable if you assigned some children to take music lessons and others to take drama lessons. If you wanted to study hair color, you could measure this trait by recording whether people have, for instance, blond, brown, or black hair. You could also manipulate this variable if you assigned some willing people to dye their hair one color or the other.

From Conceptual Variable to Operational Definition

Each variable in a study can be described in two ways (Table 3.1). When researchers are discussing their theories and when journalists write about research, they use concept-level language. Conceptual variables are abstract concepts, such as “spending time socializing” and “school achievement.” A conceptual variable is sometimes called a construct. Conceptual variables must be carefully defined at the theoretical level, and these definitions are called conceptual definitions. To test hypotheses, researchers have to do something specific in order to gather data. When testing their hypotheses with empirical research, they create operational definitions of variables, also known as operational variables, or operationalizations. To operationalize means to turn a concept of interest into a measured or manipulated variable.

For example, a researcher’s interest in the conceptual variable “spending time socializing” might be operationalized as a structured question, in which people tell an interviewer how often they spend an evening alone, socialize with friends, or see relatives in a typical week. Alternatively, the same concept might be operationalized by having people keep a diary for one month, recording which nights they spent with relatives or friends and which nights they were alone.

Sometimes this operationalization step is simple and straightforward. For example, a researcher interested in a conceptual variable such as “weight gain” in laboratory rats would probably just weigh them. Or a researcher who was interested in the conceptual variable “income” might operationalize this variable by asking each person about their total income last year. In these two cases, the researcher can operationalize the conceptual variable of interest quite easily.

Often, however, concepts researchers wish to study are difficult to see, touch, or feel, so they are also harder to operationalize. Examples are personality traits, states such as “argumentativeness,” and behavior judgments such as “attempted suicide.” The more abstract nature of these conceptual variables does not stop psychologists from operationalizing them; it just makes studying them a little harder. In such cases, researchers spend extra time clarifying and defining the conceptual variables they plan to study. They might develop creative or elegant operational definitions to capture the variable of interest.

Most often, variables are stated at the conceptual level. To discover how the variable “school achievement” was operationalized, you need to ask: How did the researchers measure “school achievement” in this study? To determine how a variable such as “frequency of worrying” was operationalized, ask: How did researchers measure “worrying” in this research? Figure 3.2 shows how the first variable might be operationalized.

TABLE 3.1 Describing Variables

Variable: Car ownership
Operational definition: Researchers asked people to circle “I own a car” or “I do not” on their questionnaire.
Levels: 2 levels: own a car or not
Measured or manipulated: Measured

Variable: Expressing gratitude to romantic partner
Operational definition: Researchers asked people in relationships the extent to which they agree with items such as “I tell my partner often that s/he is the best.”
Levels: 7 levels, from 1 (strongly disagree) to 7 (strongly agree)
Measured or manipulated: Measured

Variable: Type of story told about a scientist
Operational definition: Researchers assigned participants to read stories about Einstein and Curie, which related either their work struggles or their achievements.
Levels: 2 levels: a story about a scientist’s struggles and a story about a scientist’s achievements
Measured or manipulated: Manipulated

Variable: What time children eat dinner
Operational definition: Using a daily food diary, researchers had children write down what time they ate dinner each evening.
Levels: Researchers divided children into two groups: those who ate dinner between 2 p.m. and 8 p.m., and those who ate after 8 p.m.
Measured or manipulated: Measured

FIGURE 3.2 Operationalizing “school achievement.” A single conceptual variable can be operationalized in a number of ways.

[Figure: the conceptual variable “School achievement” linked to three operational variables: a self-report questionnaire (“What grades do you get?” with the options All As, Mostly As and Bs, Mostly Bs, Mostly Bs and Cs), checking records, and teachers’ observations.]

1. What is the difference between a variable and its levels? What might be the levels of the variable “favorite color”?

2. Explain why some variables can only be measured, not manipulated. Can “gender” be a manipulated variable? Can “frequency of worrying” be a manipulated variable?

3. What is the difference between a conceptual variable and the operational definition of a variable? How might the conceptual variables “frequency of worrying,” “intelligence,” and “stress” be operationalized by a researcher?

1. See p. 58. 2. See pp. 58–59. “Gender” is probably not a manipulated variable, but “frequency of worrying” might be manipulated if researchers assigned some people to purposely worry about something. 3. See pp. 59–61.

THREE CLAIMS

A claim is an argument someone is trying to make. Internet bloggers might make claims based on personal experience or observation (“The media coverage of congressional candidates has been sexist”). Politicians might make claims based on rhetoric (“I am the candidate of change!”). Literature scholars make claims based on textual evidence (“Based on my reading of the text, I argue that the novel Frankenstein reflects a fear of technology”). In this textbook, we focus on claims made by journalists or researchers: claims that are based on empirical research. Recall from Chapters 1 and 2 that psychologists use systematic observations, or data, to test and refine theories and claims. A psychologist might claim, based on data he or she has collected, that a certain percentage of teens attempted suicide last year, or that higher-income people spend less time socializing, or that music lessons can improve a child’s IQ.

Notice the different wording in the boldface headlines in Table 3.2. In particular, the first claim merely gives a percentage of teens who texted while driving; this is a frequency claim. The claim in the middle, about single people eating fewer vegetables, is an association claim: It suggests that the two variables go together, but does not claim that being single causes people to eat fewer vegetables or that eating fewer vegetables causes people to be single. The last boldface claim, however, is a causal claim: The verb enhance indicates that the music lessons actually cause improved IQ. The kind of claim a psychological scientist makes must be backed up by the right kind of study. How can you identify the types of claims researchers make, and how can you evaluate whether their studies are able to support each type of claim? If you conduct research yourself, how will you know what kind of study will support the type of claim you wish to make?

Frequency Claims

Two Out of Five Americans Say They Worry Every Day
Just 15% of Americans Smoke
72% of the World Smiled Yesterday
4 in 10 Teens Admit to Texting While Driving

Frequency claims describe a particular rate or degree of a single variable. In the first example above, “two out of five” is the frequency of worrying among people in the United States. In the second example, “15%” is the rate (the proportion) of American adults who smoke. These headlines claim how frequent or common something is. Claims that mention the percentage of a variable, the number of people who engage in some activity, or a certain group’s level on a variable can all be called frequency claims.

TABLE 3.2 Examples of Each Type of Claim

Frequency claims: 4 in 10 teens admit to texting while driving; 42% of Europeans never exercise; Middle school kids see 2–4 alcohol ads a day

Association claims: Single people eat fewer vegetables; Angry Twitter communities linked to heart deaths; Girls more likely to be compulsive texters; Suffering a concussion could triple the risk of suicide

Causal claims: Music lessons enhance IQ; Babysitting may prime brain for parenting; Family meals curb eating disorders; Why sleep deprivation makes you crabby

The best way to identify frequency claims is that they focus on only one variable, such as frequency of worrying, rate of smiling, or amount of texting. In addition, in studies that support frequency claims, the variables are always measured, not manipulated. In the examples above, the researchers have measured the frequency of worrying by using a questionnaire or an interview and have reported the results.

Some reports give a list of single-variable results, all of which count as frequency claims. Take, for example, the recent report from Gallup stating that 72% of the world smiled yesterday (, 2016). The same report also found that 51% of people said they learned something interesting yesterday. These are two separate frequency claims: they each measured a single variable one at a time. The researchers were not trying to show an association between these single variables; the report did not claim the people who learned something interesting were more likely to smile. It simply stated that a certain percentage of the world’s people smiled and a certain percentage learned something interesting.

Association Claims

People with Higher Incomes Spend Less Time Socializing
Romantic Partners Who Express Gratitude Are Three Times More Likely to Stay Together
People Who Multitask the Most Are the Worst at It
A Late Dinner Is Not Linked to Childhood Obesity, Study Shows

These headlines are all examples of association claims. An association claim argues that one level of a variable is likely to be associated with a particular level of another variable. Variables that are associated are sometimes said to correlate, or covary, meaning that when one variable changes, the other variable tends to change, too. More simply, they may be said to be related.

Notice that there are two variables in each example above. In the first, the variables are income and spending time socializing: Having a higher income is associated with spending less time socializing (and therefore having a lower income goes with spending more time socializing). In the second example, the variables are the frequency of expressing gratitude and the likelihood of staying together: More frequent gratitude goes with a longer relationship.

An association claim states a relationship between at least two variables. In order to support an association claim, the researcher usually measures the two variables and determines whether they’re associated. This type of study, in which the variables are measured and the relationship between them is tested, is called a correlational study. Therefore, when you unwrap an association claim, you should find a correlational study supporting it (Figure 3.3).

❮❮ For more on correlation patterns, see Chapter 8.

There are three basic types of associations among variables: positive associations, negative associations, and zero associations.


The headline “Romantic partners who express gratitude are three times more likely to stay together” is an association in which high goes with high and low goes with low, and it’s called a positive association, or positive correlation. Stated another way, high scores on gratitude go with staying together longer, and low scores on gratitude go with a shorter time together.

One way to represent an association is to use a scatterplot, a graph in which one variable is plotted on the y-axis and the other variable is plotted on the x-axis; each dot represents one participant in the study, measured on the two variables. Figure 3.4 shows what scatterplots of the associations in three of the example headlines would look like. (Data are fabricated for illustration purposes, and numbers are arbitrary units.) Notice that the dots in Figure 3.4A form a cloud of points, as opposed to a straight line. If you drew a straight line through the center of the cloud of points, however, the line would incline upward; in other words, the mathematical slope of the line would be positive.


The study behind the claim “People who multitask the most are the worst at it” obtained a negative association. In a negative association (or negative correlation), high goes with low and low goes with high. In other words, high rates of multitasking go with a low ability to multitask, and low rates of multitasking go with a high ability to multitask.

A scatterplot representing this association would look something like the one in Figure 3.4B. Each dot represents a person who has been measured on two variables. However, in this example, a line drawn through the cloud of points would slope downward; it would have a negative slope.

Keep in mind that the word negative refers only to the slope; it does not mean the association is somehow bad. In this example, the reverse of the association (that people who multitask the least are the best at it) is another way to phrase this negative association. To avoid this kind of confusion, some people prefer the term inverse association.


The study behind the headline “A late dinner is not linked to childhood obesity, study shows” is an example of a zero association, or no association between the variables (zero correlation). In a scatterplot, both early and late levels of dinner time are associated with all levels of obesity (Figure 3.4C). This cloud of points has no slope; more specifically, a line drawn through it would be nearly horizontal, and a horizontal line has a slope of zero.

FIGURE 3.3 Correlational studies support association claims. When a journalist makes an association claim, it’s usually based on a correlational study, in which two or more variables were measured.
[Figure labels: “People with higher incomes spend less time socializing.”; “Correlational study.”]

FIGURE 3.4 Scatterplots showing three types of associations. (A) Positive association: “Romantic partners who express gratitude are more likely to stay together.” (B) Negative association: “People who multitask the most are the worst at it.” (C) Zero association: “A late dinner is not linked to childhood obesity, study shows.” Data are fabricated for illustration purposes.
[Axes: (A) x: amount of expressed gratitude, y: relationship longevity. (B) x: frequency of multitasking, y: skill at multitasking. (C) x: time of dinner (early to late), y: childhood weight (less obese to more obese).]
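The link between the direction of an association and the slope of a line drawn through the cloud of points can be sketched in a few lines of Python. This is only an illustration: the data are fabricated (as in Figure 3.4), the helper names `slope` and `describe_association` are hypothetical, and the `tolerance` used to call a slope "nearly horizontal" is an arbitrary assumption, not a statistical test.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of the straight line through a cloud of (x, y) points."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator

def describe_association(xs, ys, tolerance=0.1):
    """Label the association by the sign of the slope (tolerance is arbitrary)."""
    b = slope(xs, ys)
    if b > tolerance:
        return "positive"
    if b < -tolerance:
        return "negative"
    return "zero"

# Fabricated data in arbitrary units, one value per participant.
gratitude = [1, 3, 5, 7, 9, 11, 13]
longevity = [2, 3, 3, 6, 5, 8, 9]

multitask_frequency = [0, 1, 2, 3, 4, 5]
multitask_skill = [9, 8, 8, 5, 4, 2]

dinner_hour = [17, 18, 19, 20, 21, 22]
childhood_weight = [4, 6, 5, 5, 6, 4]

print(describe_association(gratitude, longevity))                  # positive
print(describe_association(multitask_frequency, multitask_skill))  # negative
print(describe_association(dinner_hour, childhood_weight))         # zero
```

The sign of the slope says only which way the line tilts; it says nothing about how tightly the points cluster around the line, which is a separate question of strength.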


Some association claims are useful because they help us make predictions. Which couples are going to stay together the longest? Who’s likely to have poor multitasking skill? With a positive or negative association, if we know the level of one variable, we can more accurately guess, or predict, the level of the other variable. Note that the word predict, as used here, does not necessarily mean predicting into the future. It means predicting in a mathematical sense: using the association to make our estimates more accurate.

To return to the headlines, according to the positive association described in the first example, if we know how much gratitude a couple is showing, we can predict how long they will stay together, and if a couple expresses a lot of gratitude, we might predict they’ll be together a long time. According to the negative association in the second example, if we know someone spends a lot of time multitasking, we can predict she’ll be less skilled at it. Are these predictions going to be perfect? No; they will usually be off by a certain margin. The stronger the relationship between the two variables, the more accurate our prediction will be; the weaker the relationship between the two variables, the less accurate our prediction will be. But if two variables are even somewhat correlated, it helps us make better predictions than if we didn’t know about this association.

Both positive and negative associations can help us make predictions, but zero associations cannot. If we wanted to predict whether or not a child will be obese, we could not do so just by knowing what time he or she eats dinner because these two variables are not correlated. With a zero correlation, we cannot predict the level of one variable from the level of the other.
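To make the idea of "predicting in a mathematical sense" concrete, the following Python sketch compares two ways of guessing each participant's score: always guessing the group average, versus using a straight line fitted to the association. Everything here is an assumption made for illustration: the data are fabricated, the function names are hypothetical, and the error is computed on the same data used to fit the line, which flatters the prediction somewhat.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_abs_error(predictions, actual):
    """Average distance between each prediction and the true score."""
    return mean(abs(p - a) for p, a in zip(predictions, actual))

def prediction_errors(xs, ys):
    """Return (error when guessing the average, error when using the association)."""
    slope, intercept = fit_line(xs, ys)
    guess_average = [mean(ys)] * len(ys)
    use_association = [intercept + slope * x for x in xs]
    return mean_abs_error(guess_average, ys), mean_abs_error(use_association, ys)

# Fabricated data, arbitrary units.
gratitude = [1, 3, 5, 7, 9, 11, 13]
longevity = [2, 3, 3, 6, 5, 8, 9]          # positive association with gratitude

dinner_hour = [17, 18, 19, 20, 21, 22]
childhood_weight = [4, 6, 5, 5, 6, 4]      # zero association with dinner time

baseline, informed = prediction_errors(gratitude, longevity)
print(informed < baseline)   # True: knowing gratitude shrinks the error

baseline0, informed0 = prediction_errors(dinner_hour, childhood_weight)
print(abs(informed0 - baseline0) < 1e-9)   # True: dinner time does not help
```

With the positive association, knowing one variable reduces the average error; with the zero association, the fitted line is flat, so it predicts the group average for everyone and offers no improvement.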

Causal Claims

Music Lessons Enhance IQ
Stories Told of Brilliant Scientists Affect Kids’ Interest in the Field
Pressure to Be Available 24/7 on Social Media Causes Teen Anxiety
Family Meals Curb Teen Eating Disorders

Whereas an association claim merely notes a relationship between two variables, a causal claim goes even further, arguing that one of the variables is responsible for changing the other. Note that each of the causal claims above has two variables, just like association claims: music lessons and IQ; type of story told about brilliant scientists and interest in the field; social media and anxiety; family meals and eating disorders. In addition, like association claims, the causal claims above suggest that the two variables in question covary: Those who take music lessons have higher IQs than those who don’t; children who hear certain types of stories about scientists are more interested in the field than those who hear other types of stories.

❯❯ For more on predictions from correlations, see Chapter 8, pp. 212–213.

Causal claims start with a positive or negative association. Music lessons are positively associated with IQ, and social media pressure is associated with anxiety. Occasionally you might also see a causal claim based on a zero association; it would report lack of cause. For example, you might read that vaccines do not cause autism or that daycare does not cause behavior problems.

Causal claims, however, go beyond a simple association between the two variables. They use language suggesting that one variable causes the other: verbs such as cause, enhance, affect, and change. In contrast, association claims use verbs such as link, associate, correlate, predict, tie to, and being at risk for. In Table 3.3, notice the difference between these types of verbs and verb phrases. Causal verbs tend to be more exciting; they are active and forceful, suggesting that one variable comes first in time and acts on the other variable. It’s not surprising, then, that journalists may be tempted to describe family dinners as curbing eating disorders, for example, because it makes a better story than family meals just being associated with eating disorders.

Here’s another important point: A causal claim that contains tentative language (could, may, seem, suggest, sometimes, potentially) is still considered a causal claim. If the first headline read “Music lessons may enhance IQ,” it would be more tentative, but you should still treat it as a causal claim. The verb enhance makes it a causal claim, regardless of any softening or qualifying language.

Advice is also a causal claim; it implies that if you do X, then Y will happen. For example: “Best way to deal with jerks? Give them the cold shoulder.” “Want to boost brain power? Do yoga.”

Causal claims are a step above association claims. Because they make a stronger statement, we hold them to higher standards. To move from the simple language of association to the language of causality, a study has to satisfy three criteria. First, it must establish that the two variables (the causal variable and the outcome variable) are correlated; the relationship cannot be zero. Second, it must show that the causal variable came first and the outcome variable came later. Third, it must establish that no other explanations exist for the relationship. Therefore, when we unwrap a causal claim, we must be sure the study inside can support it. Later in this chapter, you will learn that only one type of study, an experiment, can enable researchers to support a causal claim because it meets all three criteria.


TABLE 3.3 Verb Phrases That Distinguish Association and Causal Claims

Association claim verbs: is linked to; is at higher risk for; is associated with; is correlated with; prefers; are more/less likely to; may predict; is tied to; goes with

Causal claim verbs: causes; affects; may curb; exacerbates; changes; may lead to; makes; sometimes makes; hurts; promotes; reduces; prevents; distracts; fights; worsens; increases; trims; adds
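The verb-based rule of thumb can be sketched as a toy Python function. This is a rough keyword heuristic offered only as an illustration, not a reliable classifier: the word lists are adapted from the verb phrases above, the names `classify_claim`, `CAUSAL_VERBS`, and `ASSOCIATION_PHRASES` are hypothetical, and the function cannot tell a verb from a noun spelled the same way (such as "change"). It deliberately ignores softening words such as "may," because tentative language does not change the type of claim.

```python
import re

# Word lists adapted from the verb phrases in the text; matching is crude.
CAUSAL_VERBS = (
    "cause", "enhance", "affect", "change", "curb", "exacerbate",
    "lead to", "leads to", "make", "hurt", "promote", "reduce",
    "prevent", "distract", "fight", "worsen", "increase", "trim", "add",
)
ASSOCIATION_PHRASES = (
    "linked to", "at higher risk for", "associated with", "correlated with",
    "prefer", "more likely to", "less likely to", "predict", "tied to",
    "goes with", "go with",
)

def _contains(headline, phrases):
    """True if any phrase (optionally ending in 's') appears as whole words."""
    text = headline.lower()
    return any(re.search(r"\b" + re.escape(p) + r"s?\b", text) for p in phrases)

def classify_claim(headline):
    """Return 'causal', 'association', or None when no listed verb is found."""
    if _contains(headline, CAUSAL_VERBS):
        return "causal"
    if _contains(headline, ASSOCIATION_PHRASES):
        return "association"
    return None

print(classify_claim("Music lessons may enhance IQ"))                      # causal
print(classify_claim("Angry Twitter communities linked to heart deaths"))  # association
print(classify_claim("People with higher incomes spend less time socializing"))  # None
```

The last headline shows the limit of the shortcut: it is an association claim, but it uses none of the listed phrases, so a reader still has to count the variables and judge the wording.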

Not All Claims Are Based on Research

Besides the types of claims mentioned above, you may also encounter stories in the popular media that are not based on research, even if they are related to psychology. For instance:

12-Year-Old’s Insight on Autism and Vaccines Goes Viral
Living in the Shadow of Huntington’s Disease
Baby Born Without Skull in the Back of His Head Defies Odds

Such headlines do not report the results of research. They may describe a person’s solution to a problem, an inspiring story, or an expert’s advice, but they don’t say anything about the frequency of a problem or what research has shown to work. The Huntington’s piece is about a single person’s experience with an inherited condition that involves the breakdown of nerve cells in the brain, but it is not suggesting a treatment. The account of the baby’s survival is uplifting, but it doesn’t report systematic research about the outcomes of babies with different birth conditions. Stories like these show what can happen, but not how often it happens, or when, or why.

These kinds of headlines may be interesting, and they might be related to psychology, but they are not frequency, association, or causal claims, in which a writer summarizes the results of a poll, survey, or other research study. Such anecdotes are about isolated experiences, not empirical studies. And as you read in Chapter 2, experience is not as good a source of information as empirical research.


1. How many variables are there in a frequency claim? An association claim? A causal claim?

2. Which part of speech in a claim can help you differentiate between association and causal claims?

3. How are causal claims special, compared with the other two claim types?

4. What are the three criteria causal claims must satisfy?

1. See pp. 62–64 and 66–67. 2. The verbs matter; see pp. 66–67 and Table 3.3. 3. See pp. 66–67. 4. See p. 67.

INTERROGATING THE THREE CLAIMS USING THE FOUR BIG VALIDITIES

You now have the tools to differentiate the three major claims you’ll encounter in research journals and the popular media, but your job is just beginning. Once you identify the kind of claim a writer is making, you need to ask targeted questions as a critically minded consumer of information. The rest of this chapter will sharpen your ability to evaluate the claims you come across, using what we’ll call the four big validities: construct validity, external validity, statistical validity, and internal validity. Validity refers to the appropriateness of a conclusion or decision, and in general, a valid claim is reasonable, accurate, and justifiable. In psychological research, however, we do not say a claim is simply “valid.” Instead, psychologists specify which of the validities they are applying. As a psychology student, you will learn to pause before you declare a study to be “valid” or “not valid,” and to specify which of the four big validities the study has achieved.

Although the focus for now is on how to evaluate other people’s claims based on the four big validities, you’ll also be using this same framework if you plan to conduct your own research. Depending on whether you decide to test a frequency claim, an association claim, or a causal claim, it is essential to plan your research carefully, emphasizing the validities that are most important for your goals.

Interrogating Frequency Claims

To evaluate how well a study supports a frequency claim, you will focus on two of the big validities: construct validity and external validity. You may decide to ask about statistical validity, too.

Construct validity refers to how well a conceptual variable is operationalized. When evaluating the construct validity of a frequency claim, the question is how well the researchers measured their variables. Consider this claim: “4 in 10 teens admit to texting while driving.” There are several ways to measure this variable. You could ask teenagers to tell you on an online survey how often they engage in text messaging while they’re behind the wheel. You could stand near an intersection and record the behaviors of teenage drivers. You could even use cell phone records to see if a text was sent at the same time a person was known to be driving. In other words, there are a number of ways to operationalize such a variable, and some are better than others.

When you ask how well a study measured or manipulated a variable, you are interrogating the construct validity: how accurately a researcher has operationalized each variable, be it smiling, exercising, texting, gender identity, body mass index, or self-esteem. For example, you would expect a study on obesity rates to use an accurate scale to weigh participants. Similarly, you should expect a study about texting among teenagers to use an accurate measure, and observing behavior is probably a better way than casually asking, “Have you ever texted while driving?” To ensure construct validity, researchers must establish that each variable has been measured reliably (meaning the measure yields similar scores on repeated testings) and that different levels of a variable accurately correspond to true differences in, say, depression or happiness. (For more detail on construct validity, see Chapter 5.)



The next important questions to ask about frequency claims concern generalizability: How did the researchers choose the study’s participants, and how well do those participants represent the intended population? Consider the example “72% of the world smiled yesterday.” Did Gallup researchers survey every one of the world’s 7 billion people to come up with this number? Of course not. They surveyed only a small sample of people. Next you ask: Which people did they survey, and how did they choose their participants? Did they include only people in major urban areas? Did they ask only college students from each country? Or did they attempt to randomly select people from every region of the world?

Such questions address the study’s external validity: how well the results of a study generalize to, or represent, people or contexts besides those in the original study. If Gallup researchers had simply asked people who visited the Gallup website whether they smiled yesterday, and 72% of them said they did, the researchers could not claim that 72% of the entire world smiled. They could not even argue that 72% of Gallup website visitors smiled, because the people who choose to answer such questions may not be an accurate representation. Indeed, to claim that 72% of the world smiled yesterday, the researchers would have needed to ensure that the participants in the sample adequately represented all people in the world, a daunting task! Gallup’s Global Emotions Report states that their sample included adults in each of 140 countries who were interviewed by phone or in person (, 2016). The researchers attempted to obtain representative samples in each country (excluding very remote or politically unstable areas of certain countries).


Researchers use statistics to analyze their data. Statistical validity, also called statistical conclusion validity, is the extent to which a study’s statistical conclusions are accurate and reasonable. How well do the numbers support the claim?

Statistical validity questions will vary depending on the claim. Asking about the statistical validity of a frequency claim involves reminding yourself that the number associated with the claim is an estimate, and it has a specific amount of error associated with it. The percentage reported in a frequency claim is usually accompanied by a margin of error of the estimate. This is a statistical figure, based on sample size for the study, that attempts to include the true value in the population. For example, in the report about how many teenagers text while driving, the Centers for Disease Control’s 41% value was accompanied by this note: “The margin of error is +/–2.6 percentage points” (CDC, n.d.). The margin of error helps us describe how well our sample estimates the true percentage. Specifically, the range, 38.4–43.6%, is highly likely to contain the true percentage of teens who text while driving.
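The arithmetic behind that range, and the way a margin of error depends on sample size, can be sketched in Python. The first function simply reproduces the calculation from the CDC example. The second uses a common approximation for a simple random sample at roughly 95% confidence (z = 1.96); that formula, the function names, and the sample sizes shown are assumptions for illustration only, and the text does not say how the CDC computed its figure.

```python
import math

def interval_from_margin(estimate_pct, margin_pct):
    """Range that is highly likely to contain the true population percentage."""
    return (round(estimate_pct - margin_pct, 1),
            round(estimate_pct + margin_pct, 1))

def approximate_margin(proportion, sample_size, z=1.96):
    """Approximate margin of error, in percentage points.

    Assumes a simple random sample and roughly 95% confidence (z = 1.96);
    real surveys often use more complex designs and formulas.
    """
    return 100 * z * math.sqrt(proportion * (1 - proportion) / sample_size)

# The CDC example from the text: 41% with a margin of +/-2.6 points.
cdc_range = interval_from_margin(41.0, 2.6)
print(cdc_range)  # (38.4, 43.6)

# Hypothetical sample sizes, to show that larger samples shrink the margin.
for n in (500, 1000, 2000):
    print(n, round(approximate_margin(0.41, n), 1))
```

The loop illustrates why the margin of error is described as being based on sample size: under this approximation, quadrupling the sample cuts the margin in half.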

❯❯ For more on the procedures that researchers use to ensure external validity, see Chapter 7, pp. 186–191.

Interrogating Association Claims

As mentioned earlier, studies that are able to support association claims are called correlational studies: They measure two variables instead of just one. Such studies describe how these variables are related to each other. To interrogate an association claim, you ask how well the correlational study behind the claim supports construct, external, and statistical validities.

To support an association claim, a researcher measures two variables, so you have to assess the construct validity of each variable. For the headline “People who multitask the most are the worst at it,” you should ask how well the researchers measured the frequency of multitasking and how well they measured the ability to multitask. The first variable, frequency of multitasking, could be measured accurately by asking people to document their day or by observing people during the day and recording times when they are multitasking. The second variable, ability to multitask, could be measured accurately using a computer-scored exercise that involves doing two things at once; a less accurate measure would be obtained by asking people how good they are at multitasking.

In any study, measuring variables is a fundamental strength or weakness—and construct validity questions assess how well such measurements were conducted. If you conclude one of the variables was measured poorly, you would not be able to trust the conclusions related to that variable. However, if you conclude the construct validity in the study was excellent, you can have more confidence in the association claim being reported.


You might also interrogate the external validity of an association claim by asking whether it can generalize to other populations, as well as to other contexts, times, or places. For the association between expressing gratitude and relationship length, you would ask whether the results from this study’s participants, 194 California college students currently in a romantic relationship, would generalize to other people and settings. Would the same results be obtained if all of the participants were midwestern couples 45 or older? You can evaluate generalizability to other contexts by asking, for example, whether the link between gratitude and relationship length also exists in friendships. (Table 3.4 summarizes the four big validities used in this text.)


When applied to an association claim, statistical validity is the extent to which the statistical conclusions are accurate and reasonable. One aspect of statistical validity is strength: How strong is the association? Some associations, such as the association between height and shoe size, are quite strong. People who are tall almost always have larger feet than people who are short, so if you predict shoe size from height, you will predict fairly accurately. Other associations, such as the association between height and income, might be very weak. In fact, because of a stereotype that favors tall people (tall people are more admired in North America), taller people do earn more money than short people, but the relationship is not very strong. Though you can predict income from height, your prediction will be less accurate than predicting shoe size from height.

Another question worth interrogating is the statistical significance of a particular association. Some associations obtained in a study might simply be due to chance connections in that particular sample caused by a few individuals. However, if an association is statistically significant, it is probably not due to chance characteristics in that one sample. For example, because the association between gratitude and relationship length is statistically significant, it means the association is probably not a chance result from that sample alone.
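One way to build intuition for "probably not due to chance" is a permutation test, sketched below in Python. This is a swapped-in illustration, not the analysis used in the gratitude study: the data are fabricated, the function names are hypothetical, and the number of shuffles and the random seed are arbitrary choices. The idea is to shuffle one variable many times, which breaks any real link between the two, and count how often a shuffled sample shows an association at least as strong as the one observed.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Correlation coefficient: the direction and strength of an association."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffled samples whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)          # break any real link between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

# Fabricated data, arbitrary units.
gratitude = [1, 3, 5, 7, 9, 11, 13]
longevity = [2, 3, 3, 6, 5, 8, 9]
dinner_hour = [17, 18, 19, 20, 21, 22]
childhood_weight = [4, 6, 5, 5, 6, 4]

p_gratitude = permutation_p_value(gratitude, longevity)
p_dinner = permutation_p_value(dinner_hour, childhood_weight)
print(p_gratitude < 0.05)  # True: chance alone rarely produces so strong an association
print(p_dinner < 0.05)     # False: a zero association is what chance usually produces
```

A small value means that shuffled, chance-only samples almost never look like the real one, which is the sense in which a statistically significant association is "probably not a chance result from that sample alone."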


As you might imagine, evaluating statistical validity can be complicated. Full training in how to interrogate statistical validity requires a separate, semester-long statistics class. This book introduces you to the basics, and we will focus mainly on asking about statistical significance and the strength of an effect.

It’s also worth mentioning that statistical validity involves two kinds of mistakes. First, a study might mistakenly conclude, based on the results from a sample of people, that there is an association between two variables (e.g., gratitude and relationship length), when there really is no association in the full population. Careful researchers try to minimize the chances of making this kind of mistake, known as a “false positive,” or Type I error. They want to increase the chances of finding associations only when they are really there.

Second, a study might mistakenly conclude from a sample that there is no association between two variables (e.g., dinner time and obesity), when there really is an association in the full population. Careful researchers try to minimize the chances of making this kind of mistake, too; it’s known as a “miss,” or Type II error. Obviously, they want to reduce the chances of missing associations that are really there.

❯❯ For more about association strength and statistical significance, see Chapter 8, pp. 205–207 and pp. 214–217.

TABLE 3.4 The Four Big Validities

Construct validity: How well the variables in a study are measured or manipulated. The extent to which the operational variables in a study are a good approximation of the conceptual variables.

External validity: The extent to which the results of a study generalize to some larger population (e.g., whether the results from this sample of children apply to all U.S. schoolchildren), as well as to other times or situations (e.g., whether the results based on this type of music apply to other types of music).

Statistical validity: The extent to which the data support the conclusions. Among many other questions, it is important to ask about the strength of an association and its statistical significance (the probability that the results could have been obtained by chance if there really is no relationship).

Internal validity: In a relationship between one variable (A) and another (B), the extent to which A, rather than some other variable (C), is responsible for changes in B.

In sum, when you come across an association claim, you should ask about three validities: construct, external, and statistical. You can ask how well the two variables were measured (construct validity). You can ask whether you can generalize the result to a population (external validity). And you can evaluate the strength and significance of the association (statistical validity).

Table 3.5 gives an overview of the three claims, four validities framework. Before reading about how to interrogate causal claims, use the table to review what we’ve covered so far.

❮❮ For more on Type I and Type II errors, see Statistics Review: Inferential Statistics, pp. 484–490.


TABLE 3.5 Interrogating the Three Types of Claims Using the Four Big Validities

Supporting study
Frequency claims: Usually based on a survey or poll, but can come from other types of studies
Association claims: Usually supported by a correlational study
Causal claims: Must be supported by an experimental study

Construct validity
Frequency claims: How well has the researcher measured the variable in question?
Association claims: How well has the researcher measured each of the two variables in the association?
Causal claims: How well has the researcher measured or manipulated the variables in the study?

Statistical validity
Frequency claims: What is the margin of error of the estimate?
Association claims: What is the effect size? How strong is the association? Is the association statistically significant? If the study finds a relationship, what is the probability the researcher’s conclusion is a false positive? If the study finds no relationship, what is the probability the researcher is missing a true relationship?
Causal claims: What is the effect size? Is there a difference between groups, and how large is it? Is the difference statistically significant?

Internal validity
Frequency claims: Frequency claims are usually not asserting causality, so internal validity is not relevant.
Association claims: People who make association claims are not asserting causality, so internal validity is not relevant to interrogate. A researcher should avoid making a causal claim from a simple association, however (see Chapter 8).
Causal claims: Was the study an experiment? Does the study achieve temporal precedence? Does the study control for alternative explanations by randomly assigning participants to groups? Does the study avoid several internal validity threats (see Chapters 10 and 11)?

External validity
Frequency claims: To what populations, settings, and times can we generalize this estimate? How representative is the sample; was it a random sample?
Association claims: To what populations, settings, and times can we generalize this association claim? How representative is the sample? To what other problems might the association be generalized?
Causal claims: To what populations, settings, and times can we generalize this causal claim? How representative is the sample? How representative are the manipulations and measures?

74 CHAPTER 3 Three Claims, Four Validities: Interrogation Tools for Consumers of Research

Interrogating Causal Claims

An association claim says that two variables are related, but a causal claim goes beyond, saying that one variable causes the other. Instead of using such verb phrases as is associated with, is related to, and is linked to, causal claims use directional verbs such as affects, leads to, and reduces. When you interrogate such a claim, your first step will be to make sure it is backed up by research that fulfills the three criteria for causation: covariance, temporal precedence, and internal validity.


Of course, one variable usually cannot be said to cause another variable unless the two are related. Covariance, the extent to which two variables are observed to go together, is determined by the results of a study. It is the first criterion a study must satisfy in order to establish a causal claim. But to justify using a causal verb, the study must not only have results showing that two variables are associated. The research method must also satisfy two additional criteria: temporal precedence and internal validity (Table 3.6).

To say that one variable has temporal precedence means it comes first in time, before the other variable. To make the claim “Music lessons enhance IQ,” a study must show that the music lessons came first and the higher IQ came later. Although this statement might seem obvious, it is not always so. In a simple association, it might be the case that music lessons made the children smart, but it is also possible that children who start out smart are more likely to want to take music lessons. It’s not always clear which one came first. Similarly, to make the claim “Pressure to be available 24/7 on social media causes teen anxiety,” the study needs to show that social media pressure came first and the anxiety came later.

Another criterion, called internal validity, or the third-variable criterion, is an indication of a study’s ability to eliminate alternative explanations for the association. For example, to say “Music lessons enhance IQ” is to claim that music lessons cause increased IQ. But an alternative explanation could be that certain kinds of parents both encourage academic achievement (leading to higher IQ scores) and encourage their kids to take music lessons. In other words, there could be an internal validity problem: It is a certain type of parent, not the music lessons, that causes these children to score higher on IQ tests. In Chapter 2 you read that basing conclusions on personal experience is subject to confounds. Such confounds are also called internal validity problems.

TABLE 3.6 Three Criteria for Establishing Causation Between Variable A and Variable B

Covariance: The study’s results show that as A changes, B changes; e.g., high levels of A go with high levels of B, and low levels of A go with low levels of B.

Temporal precedence: The study’s method ensures that A comes first in time, before B.

Internal validity: The study’s method ensures that there are no plausible alternative explanations for the change in B; A is the only thing that changed.


What kind of study can satisfy all three criteria for causal claims? Usually, to support a causal claim, researchers must conduct a well-designed experiment, in which one variable is manipulated and the other is measured.

Experiments are considered the gold standard of psychological research because of their potential to support causal claims. In daily life, people tend to use the word experiment casually, referring to any trial of something to see what happens (“Let’s experiment and try making the popcorn with olive oil instead”). In science, including psychology, an experiment is more than just “a study.” When psychologists conduct an experiment, they manipulate the variable they think is the cause and measure the variable they think is the effect (or outcome). In the context of an experiment, the manipulated variable is called the independent variable and the measured variable is called the dependent variable. To support the claim that music lessons enhance IQ, the researchers in that study would have had to manipulate the music lessons variable and measure the IQ variable.

Remember: To manipulate a variable means to assign participants to be at one level or the other. In the music example, the researchers might assign some children to take music lessons, others to take another kind of lesson, and a third group to take no lessons. In an actual study that tested this claim in Toronto, Canada, researcher Glen Schellenberg (2004) manipulated the music lesson variable by having some children take music lessons (either keyboard or voice lessons), others take drama lessons, and still others take no lessons. After several months of lessons, he measured the IQs of all the children. At the conclusion of his study, Schellenberg found that the children who took keyboard and voice lessons gained an average of 2.7 IQ points more than the children who took drama lessons or no lessons (Figure 3.5). This was a statistically significant gain, and the result therefore established the first part of a causal claim: covariance.

❮❮ For examples of independent and dependent variables, see Chapter 10, p. 277.

FIGURE 3.5 Interrogating a causal claim. What key features of Schellenberg’s study of music lessons and IQ made it possible for him to claim that music lessons increase children’s IQ? (Source: Schellenberg, 2004.)

Fig. 1. Mean increase in full-scale IQ (Wechsler Intelligence Scale for Children–Third Edition) for each group of 6-year-olds who completed the study. Error bars show standard errors.

A Study’s Method Can Establish Temporal Precedence and Internal Validity. Why does the method of manipulating one variable and measuring the other help scientists make causal claims? For one thing, manipulating the independent variable (the causal variable) ensures that it comes first. By manipulating music lessons and measuring IQ, Schellenberg ensured temporal precedence in his study.

In addition, when researchers manipulate a variable, they have the potential to control for alternative explanations; that is, they can ensure internal validity. When Schellenberg was investigating whether music lessons could enhance IQ, he did not want the children in the music lessons groups to have more involved parents than those in the drama lessons group or the no-lessons group because then parental involvement would have been a plausible alternative explanation for why the music lessons enhanced IQ. He didn’t want the children in the music lessons groups to come from a different school district than those in the drama lessons group or the no-lessons group because then the school curriculum or teacher quality might have been an alternative explanation.

Therefore, Schellenberg used a technique called random assignment to ensure that the children in all the groups were as similar as possible. He used a method, such as rolling a die, to decide whether each child in his study would take keyboard lessons, voice lessons, drama lessons, or no lessons. Only by randomly assigning children to one of the groups could Schellenberg ensure those who took music lessons were as similar as possible, in every other way, to those who took drama lessons or no lessons. Random assignment increased internal validity by allowing Schellenberg to control for potential alternative explanations. He also designed the experiment so the children were engaged in their respective lessons for the same amount of time, over the same number of weeks. These methodology choices secured the study’s internal validity.
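The logic of random assignment can be sketched in a short simulation. This is an illustration only, not Schellenberg's actual procedure: the sample of 120 children, the made-up `parental_involvement` scores, and the helper name `randomly_assign` are all assumptions introduced for the example. The sketch shuffles the children and deals them into the four groups, then reports each group's average on a background variable that the researcher never controlled directly.

```python
import random
from statistics import mean

GROUPS = ["keyboard", "voice", "drama", "no lessons"]

def randomly_assign(children, groups=GROUPS, seed=None):
    """Shuffle the children, then deal them into groups in turn.

    Chance alone decides who lands in which group, and group sizes
    differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(children)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for i, child in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(child)
    return assignment

# Hypothetical sample: 120 children with invented parental involvement scores.
data_rng = random.Random(1)
children = [
    {"id": i, "parental_involvement": data_rng.gauss(50, 10)}
    for i in range(120)
]

assignment = randomly_assign(children, seed=2)

# With random assignment, the group averages on this background
# variable should differ only by chance.
for group, members in assignment.items():
    avg = mean(child["parental_involvement"] for child in members)
    print(f"{group:>10}: n = {len(members)}, mean involvement = {avg:.1f}")
```

Because chance decides group membership, any remaining differences among the group averages are unsystematic, and they tend to shrink as the groups get larger. The same reasoning applies to background variables the researcher never thought to measure.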

Schellenberg’s experiment met all three criteria of causation. The results showed covariance, and the method established temporal precedence and internal validity. Therefore, he was justified in making a causal claim from his data. His study can be used to support the claim that music lessons really do enhance—cause an increase in—IQ.


Let’s use two other examples to illustrate how to interrogate causal claims made by writers and journalists.

❯❯ For more on how random assignment helps ensure that experimental groups are similar, see Chapter 10, pp. 284–286.

FIGURE 3.6 Only experiments should be wrapped in causal language. When a journalist or researcher makes a causal claim, you need to be sure the right kind of study—an experiment—was conducted.

Do Family Meals Really Curb Eating Disorders? To interrogate the causal claim “Family meals curb teen eating disorders,” we start by asking about covariance in the study behind this claim. Is there an association between family meals and eating disorders? Yes: The news report says 26% of girls who ate with their families fewer than five times a week had eating-disordered behavior (e.g., the use of laxatives or diuretics, or self-induced vomiting), and only 17% of girls who ate with their families five or more times a week engaged in eating-disordered behaviors (Warner, 2008). The two variables are associated.
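The size of this association can be worked out from the two percentages quoted in the news report. The short sketch below uses only those two figures; the variable names are illustrative. Because the report gives no group sizes, the sketch cannot say whether the difference is statistically significant.

```python
# Percentages quoted in the news report (Warner, 2008).
fewer_than_five_meals = 0.26  # girls who ate with family fewer than five times a week
five_or_more_meals = 0.17     # girls who ate with family five or more times a week

# Two simple ways to describe how far apart the groups are.
difference = fewer_than_five_meals - five_or_more_meals
ratio = fewer_than_five_meals / five_or_more_meals

print(f"Difference: {difference * 100:.0f} percentage points")  # Difference: 9 percentage points
print(f"Ratio: {ratio:.2f}")                                    # Ratio: 1.53
```

These numbers describe covariance only. They say nothing about which variable came first or about third variables.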

What about temporal precedence? Did the researchers make sure family meals had increased before the eating disorders decreased? The best way to ensure temporal precedence is to assign some families to have more meals together than others. Sure, families who eat more meals together may have fewer daughters with eating-disordered behavior, but the temporal precedence is not clear from this association. In fact, one of the symptoms of an eating disorder is embarrassment about eating in front of others, so perhaps the eating disorder came first and the decreased family meals came second. Daughters with eating disorders may simply find excuses to avoid eating with their families.

Internal validity is a problem here, too. Without an experiment, we cannot rule out many alternative, third-variable explanations. Perhaps girls from single-parent families are less likely to eat with their families and are vulnerable to eating disorders, whereas girls who live with both parents are not. Maybe high-achieving girls are too busy to eat with their families and are also more susceptible to eating-disordered behavior. These are just two possible alternative explanations. Only a well-run experiment could have controlled for these internal validity problems (the alternative explanations), using random assignment to ensure that the girls who had frequent family dinners and those who had less-frequent family dinners were comparable in all other ways: high versus low scholastic achievement, single-parent versus two-parent households, and so on. However, it would be impractical and probably unethical to conduct such an experiment.

Although the study’s authors reported the findings appropriately, the journalist wrapped the study’s results in an eye-catching causal conclusion by saying that family dinners curb eating disorders (Figure 3.6). The journalist should probably have wrapped the study in the association claim, “Family dinners are linked to eating disorders.”

Does Social Media Pressure Cause Teen Anxiety? Another example of a dubious causal claim is this headline: “Pressure to be available 24/7 on social media causes teen anxiety.” In the story, the journalist reported on a study that measured two variables in a set of teenagers; one variable was social media use (especially the degree of pressure to respond to texts and posts) and the other was level of anxiety (Science News, 2015). The researchers found that those who felt pressure to respond immediately also had higher levels of anxiety.

Let’s see if this study’s design is adequate to support the journalist’s conclusion that social media pressure causes anxiety in teenagers. The study certainly does have covariance; the results showed that teens who felt more pressure to respond immediately to social media were also more anxious. However, this was a correlational study, in which both variables were measured at the same time, so there was no temporal precedence. We cannot know if the pressure to respond to social media increased first, thereby leading to increased anxiety, or if teens who were already anxious expressed their anxiety through social media use, by putting pressure on themselves to respond immediately.

[Figure 3.6 graphic. Labels: “Family dinners curb eating disorders.”; “Correlational study”; scatterplot axes “Frequency of multitasking” and “Skill at multitasking.”]

In addition, this study did not rule out possible alternative explanations (internal validity) because it was not an experiment. Several outside variables could potentially correlate with both anxiety and responding immediately to social media. One might be that teens who are involved in athletics are more relaxed (because exercise can reduce anxiety) and less engaged in social media (because busy schedules limit their time). Another might be that certain teenagers are vulnerable to emotional disorders in general; they are already more anxious and feel more pressure about the image they’re presenting on social media (Figure 3.7).
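A small simulation can show how a third variable produces an association between two variables that have no causal link to each other. Everything in the sketch below is invented for illustration: the 2,000 simulated teens, the size of the athletics effect, and the helper name `pearson_r`. It does not use data from the study described above. In this simulated world, athletics lowers both social media pressure and anxiety, and pressure has no effect on anxiety at all.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
teens = []
for _ in range(2000):
    athlete = rng.random() < 0.5
    shift = -2.0 if athlete else 0.0          # athletics lowers both scores
    pressure = 5.0 + shift + rng.gauss(0, 1)  # does not depend on anxiety
    anxiety = 5.0 + shift + rng.gauss(0, 1)   # does not depend on pressure
    teens.append((athlete, pressure, anxiety))

# Correlation across all teens, ignoring athletics.
r_overall = pearson_r([p for _, p, _ in teens], [a for _, _, a in teens])

def r_within(is_athlete):
    """Correlation among athletes only, or among non-athletes only."""
    subgroup = [(p, a) for athlete, p, a in teens if athlete == is_athlete]
    return pearson_r([p for p, _ in subgroup], [a for _, a in subgroup])

r_athletes = r_within(True)
r_non_athletes = r_within(False)

print(f"All teens:     r = {r_overall:.2f}")
print(f"Athletes only: r = {r_athletes:.2f}")
print(f"Non-athletes:  r = {r_non_athletes:.2f}")
```

Across all teens the two variables are clearly correlated, yet within each athletics group the correlation is close to zero. A correlational study that never measured athletics would see only the overall association and could not tell it apart from a causal effect.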

An experiment could potentially rule out such alternative explanations. In this example, though, conducting an experiment would be hard. A researcher cannot randomly assign teens to be concerned about social media or to be anxious. Because the research was not enough to support a causal claim, the journalist should have packaged the description of the study under an association claim headline: “Social media pressure and teen anxiety are linked.”


A study can support a causal claim only if the results demonstrate covariance, and only if it used the experimental method, thereby establishing temporal precedence and internal validity. Therefore, internal validity is one of the most important validities to evaluate for causal claims. Besides internal validity, the other three validities discussed in this chapter—construct validity, statistical validity, and external validity—should be interrogated, too.

Construct Validity of Causal Claims. Take the headline, “Music lessons enhance IQ.” First, we could ask about the construct validity of the measured variable in this study. How well was IQ measured? Was an established IQ test administered by trained testers? Then we would need to interrogate the construct validity of the manipulated variable as well. In operationalizing manipulated variables, researchers must create a specific task or situation that will represent each level of the variable. In the current example, how well did the researchers manipulate music lessons? Did students take private lessons for several weeks or have a single group lesson?

❯❯ For more on how researchers use data to check the construct validity of their manipulations, see Chapter 10, pp. 298–301.

External Validity of Causal Claims. We could ask, in addition, about external validity. If the study used children in Toronto, Canada, as participants, do the results generalize to Japanese children? Do the results generalize to rural Canadian children? If Japanese students or rural students take music lessons, will their IQs go up, too? What about generalization to other settings: could the results generalize to other music lessons? Would flute lessons and violin lessons also work? (In Chapters 10 and 14, you’ll learn more about how to evaluate the external validity of experiments and other studies.)

FIGURE 3.7 Support for a causal claim? Without conducting an experiment, researchers cannot support the claim that social media pressure causes teen anxiety.

Statistical Validity of Causal Claims. We can also interrogate statistical validity. To start, we would ask: How strong is the relationship between music lessons and IQ? In our example study, participants who took music lessons gained 7 points in IQ, whereas students who did not gained an average of 4.3 points in IQ, a net gain of about 2.7 IQ points (Schellenberg, 2004). Is this a large gain? (In this case, the difference between these two groups is about 0.35 of a standard deviation, which, as you’ll learn, is considered a moderate difference between the groups.) Next, asking whether the differences among the lessons groups were statistically significant helps ensure that the covariance criterion was met; it helps us be more sure that the difference is not just due to a chance difference in this sample alone. (In Chapter 10, you’ll learn more about interrogating the statistical validity of causal claims.)
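The arithmetic behind this effect size can be sketched in a few lines. The two mean gains (7 and 4.3 points) and the figure of 0.35 standard deviations come from the passage above. The standard deviation itself is not reported there, so the value below is back-calculated from those numbers and should be read as an assumption, as should the helper name `standardized_difference`.

```python
def standardized_difference(mean_difference, standard_deviation):
    """Express a difference between group means in standard deviation units."""
    return mean_difference / standard_deviation

music_gain = 7.0        # mean IQ gain, music lessons groups
comparison_gain = 4.3   # mean IQ gain, drama and no-lessons groups

net_gain = music_gain - comparison_gain
print(f"Net gain: {net_gain:.1f} IQ points")  # Net gain: 2.7 IQ points

# Assumption: a standard deviation implied by the 0.35 figure, not a reported value.
implied_sd = net_gain / 0.35
print(f"Implied SD of gains: {implied_sd:.1f} IQ points")  # Implied SD of gains: 7.7 IQ points

d = standardized_difference(net_gain, implied_sd)
print(f"Standardized difference: {d:.2f}")  # Standardized difference: 0.35
```

Dividing by the standard deviation puts the gain on a common scale, which is what allows a 2.7-point difference to be described as small, moderate, or large.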

Prioritizing Validities

Which of the four validities is the most important? It depends. When researchers plan studies to test hypotheses and support claims, they usually find it impossible to conduct a study that satisfies all four validities at once. Depending on their goals, sometimes researchers don’t even try to satisfy some of them. They decide what their priorities are, and so will you, when you participate in producing your own research.

External validity, for instance, is not always possible to achieve, and sometimes it may not be the researcher’s priority. As you’ll learn in Chapter 7, to be able to generalize results from a sample to a wide population requires a representative sample from that population. Consider the Schellenberg study on music lessons and IQ. Because he was planning to test a causal claim, he wanted to emphasize internal validity, so he focused on making his different groups (music lessons, drama lessons, or no lessons) absolutely equivalent. He was not prioritizing external validity and did not try to sample children from all over Canada. However, his study is still important and interesting because it used an internally valid experimental method, even though it did not achieve external validity. Furthermore, even though he used a sample of children from Toronto, there may be no theoretical reason to assume music lessons would not improve the IQs of rural children, too. Future research could confirm that music works for other groups of children just as well, but there is no obvious reason to expect otherwise.

❮❮ For more on determining the strength of a relationship between variables, see Statistics Review: Descriptive Statistics, pp. 468–472.

In contrast, if some researchers were conducting a telephone survey and did want to generalize its results to the entire Canadian population (to maximize external validity), they would have to randomly select Canadians from all ten provinces. One approach would be using a random-digit telephone dialing system to call people in their homes, but this technology is expensive. When researchers do use formal, randomly sampled polls, they often have to pay the polling company a fee to administer each question. Therefore, a researcher who wants to evaluate, say, depression levels in a large population may be economically forced to use a short questionnaire or survey. A 2-item measure might not be as good as a 15-item measure, but the longer one would cost more. In this example, the researcher might sacrifice some construct validity in order to achieve external validity.
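Random sampling, which serves external validity, is a different operation from random assignment, which serves internal validity. The sketch below illustrates a simple random sample drawn from a sampling frame. The frame of 10,000 phone records, the placeholder province labels, and the helper name `simple_random_sample` are all hypothetical; a real random-digit dialing system works differently.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n records without replacement; every record has the same chance."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: phone records tagged with placeholder province labels.
provinces = [f"Province {k}" for k in range(1, 11)]
frame_rng = random.Random(3)
frame = [
    {"phone_id": i, "province": frame_rng.choice(provinces)}
    for i in range(10_000)
]

sample = simple_random_sample(frame, 500, seed=4)

# Because selection is random, every province is very likely to be represented.
provinces_reached = {person["province"] for person in sample}
print(f"Sampled {len(sample)} people from {len(provinces_reached)} provinces")
```

Random sampling decides who is in the study at all, so it bears on whether results generalize to the population. Random assignment decides which condition each participant experiences, so it bears on whether the groups are comparable.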

You’ll learn more about these priorities in Chapter 14. The point, for now, is simply that in the course of planning and conducting a study, researchers weigh the pros and cons of methodology choices and decide which validities are most important. When you read about a study, you should not necessarily conclude it is faulty just because it did not meet one of the validities.


1. What question(s) would you use to interrogate a study’s construct validity?

2. In your own words, describe at least two things that statistical validity addresses.

3. Define external validity, using the term generalize in your definition.

4. Why is a correlational study not able to support a causal claim?

5. Why don’t researchers usually aim to achieve all four of the big validities at once?

1. See pp. 69, 71, and 78. 2. See pp. 71–72. 3. See pp. 70–71 and 78–79. 4. See pp. 74–75 and Table 3.4. 5. See pp. 79–80.

REVIEW: FOUR VALIDITIES, FOUR ASPECTS OF QUALITY

As a review, let’s apply the four validities discussed in this chapter to another headline from a popular news source: “Stories told of brilliant scientists affect kids’ interest in the field.” The journalist’s story was reported on the radio (Vedantam, 2016), and the original research was published in the Journal of Educational Psychology (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016). Should we consider this a well-designed study? How well does it hold up on each of the four validities? At this stage in the course, your focus should be on asking the right questions for each validity. In later chapters, you will also learn how to evaluate the answers to those questions. You can see how we might interrogate this study by reading the Working It Through section.

Does Hearing About Scientists’ Struggles Inspire Young Students? Educational psychologists conducted a study in which children were told different kinds of stories about scientists, and the researchers tracked the children’s interest in being scientists themselves (Lin-Siegler et al., 2016). We will work through this example to demonstrate how to apply the concepts from Chapter 3.


What kind of claim is in the headline?

“Stories told of brilliant scientists affect kids’ interest in the field” (Vedantam, 2016).

This is a causal claim because affect is a causal verb.

What are the variables in the headline?

One variable is the type of story told. The other variable is kids’ level of interest in the field.

Which validities should we interrogate for this claim?

We should interrogate a causal claim on all four validities, especially internal validity.

Construct validity

How well did the researchers manipulate “Stories told of brilliant scientists”?

The journalist reports that “some kids were told very conventional genius stories [such as] Albert Einstein, brilliant physicist, won the Nobel Prize. He’s such a genius. Others were told how hard Einstein had to struggle. At one point, apparently Einstein was having such trouble working out the math in his theories that he turned to his friend Max Planck and said, this is driving me crazy. Can you please help me with the math?” (Vedantam, 2016).

These two stories do seem to differ in focus in the way the researchers intended, with one emphasizing genius and the other Einstein’s troubles. The manipulation seems well done.




How well did they measure “kids’ interest in the field”?

The journalist lists several ways interest in the field was operationalized, reporting, “Lin-Siegler and her colleagues then measured how well students who read the struggle stories did in science tests. She also measured how engaged the students felt about science and how much persistence they demonstrated when they faced obstacles” (Vedantam, 2016).

Classroom science tests seem a clear way to operationalize performance. We might want to find out more about how engagement and persistence were measured, such as the kinds of items used. Note that science performance and motivation may not be the same as what the journalist’s headline called it, “interest in the field.”

Statistical validity

How large was the difference in science performance and motivation between the different story groups?

Was the difference in science performance and motivation between the different story groups statistically significant?

The journalist interviewed one of the researchers, who was quoted as saying, “people who read struggle stories improved their science grades significantly more than people who read the achievement story” (Vedantam, 2016).

This quote indicates the difference was statistically significant, but does not mention how far apart the groups’ science grades were. Although the headline indicated “interest in the field” was a main outcome of the study, only science grades (not motivation beliefs) showed a statistically significant difference.

Internal validity

Was this study an experiment? Are there alternative explanations, other than the struggle aspect of the science stories, that could have caused the improvement in students’ grades?

The fact that students heard one of two stories suggests the study is probably an experiment, in which some students heard about struggles and others heard about genius. The journalist reports that both stories were about the same scientist, but they differed in focus. The journalist did not indicate if students were randomly assigned to the different stories or not.

If researchers randomly assigned students to stories, we can assume the students who heard struggle stories and those who heard achievement stories were equivalent in background, ability level, original interest in science, gender, age, and so on. If the stories were the same length, reading level, and tone, we can assume it was the struggle aspect of the story, rather than anything else, that caused grades to improve.

External validity

Can we generalize from the 9th and 10th graders in this sample to other students?

The journalist does not indicate whether the people in this study were a representative sample.

The study’s ability to generalize to students in other cities or other grades is unknown. However, when researchers conduct experiments to support causal claims, their priority is usually internal validity, not external validity.

Would students also benefit from hearing about the struggles of people in other fields, such as law or art?

The subjects in this study heard only about scientists, not other fields.

We don’t yet know if the pattern generalizes to other disciplines. A future study could test this idea.


Summary

The three claims, four validities framework enables you to systematically evaluate any study you read, in a journal article or a popular media story. It can also guide you in making choices about research you might conduct yourself.

Interrogating the Three Claims Using the Four Big Validities

• To interrogate a frequency claim, ask questions about the study’s construct validity (quality of the measurements), external validity (generalizability to a larger population), and statistical validity (degree of error in the percentage estimate).

• To interrogate an association claim, ask about its construct, external, and statistical validity. Statistical validity addresses the strength of a relationship, and whether or not a finding is statistically significant.

• To interrogate a causal claim, ask whether the study conducted was an experiment, which is the only way to establish internal validity and temporal precedence. If it was an experiment, further assess internal validity by asking whether the study was designed with any confounds, and whether the researchers used random assignment for making participant groups. You can also ask about the study’s construct, external, and statistical validity.

• Researchers cannot usually achieve all four validities at once in an experiment, so they prioritize them. Their interest in making a causal statement means they may sacrifice external validity to ensure internal validity.

Variables

• Variables, concepts of interest that vary, form the core of psychological research. A variable has at least two levels.

• Variables can be measured or manipulated.

• Variables in a study can be described in two ways: as conceptual variables (elements of a theory) and as operational definitions (specific measures or manipulations in order to study them).

Three Claims

• As a consumer of information, you will identify three types of claims that researchers, journalists, and other writers make about research: frequency, association, and causal claims.

• Frequency claims make arguments about the level of a single, measured variable in a group of people.

• Association claims argue that two variables are related to each other. An association can be positive, negative, or zero. Association claims are usually supported by correlational studies, in which all variables are measured. When you know how two variables are associated, you can use one to predict the other.

• Causal claims state that one variable is responsible for changes in the other variable. To support a causal claim, a study must meet three criteria—covariance, temporal precedence, and internal validity—which is accomplished only by an experimental study.



Key Terms

variable, p. 58 level, p. 58 constant, p. 58 measured variable, p. 58 manipulated variable, p. 58 conceptual variable, p. 59 construct, p. 59 conceptual definition, p. 59 operational definition, p. 59 operational variable, p. 59 operationalize, p. 59 claim, p. 61

frequency claim, p. 62 association claim, p. 63 correlate, p. 63 correlational study, p. 64 positive association, p. 64 scatterplot, p. 64 negative association, p. 64 zero association, p. 64 causal claim, p. 66 validity, p. 69 construct validity, p. 69 generalizability, p. 70

external validity, p. 70 statistical validity, p. 70 margin of error of the estimate, p. 70 Type I error, p. 72 Type II error, p. 73 covariance, p. 74 temporal precedence, p. 74 internal validity, p. 74 experiment, p. 75 independent variable, p. 75 dependent variable, p. 75 random assignment, p. 76

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 3.

Review Questions

1. Which of the following variable or variables is manipulated, rather than measured? (Could be more than one.)

a. Number of pairs of shoes owned, in pairs.

b. A person’s height, in cm.

c. Amount of aspirin a researcher gives a person to take, either 325 mg or 500 mg.

d. Degree of happiness, rated on a scale from 1 to 10.

e. Type of praise a researcher uses in groups of dogs: verbal praise or a clicking sound paired with treats.

2. Which of the following headlines is an association claim?

a. Chewing gum can improve your mood and focus.

b. Handling money decreases helpful behavior in young children.

c. Workaholism is tied to psychiatric disorders.

d. Eating kiwis may help you fall asleep.

3. Which of the following headlines is a frequency claim?

a. Obese kids are less sensitive to tastes.

b. 80% of women feel dissatisfied with how their bodies look.

c. Feeling fat? Maybe Facebook is to blame.

d. Daycare and behavior problems are not linked.

4. Which of the following headlines is a causal claim?

a. Taking a deep breath helps minimize high blood pressure, anxiety, and depression.

b. Younger people can’t read emotions on wrinkled faces.

c. Strange but true: Babies born in the autumn are more likely to live to 100.

d. Check the baby! Many new moms show signs of OCD.

5. Which validity would you be interrogating by asking: How well did the researchers measure sensitivity to tastes in this study?

a. Construct validity

b. Statistical validity

c. External validity

d. Internal validity

6. Which validity would you be interrogating by asking: How did the researchers get their sample of people for this survey?

a. Construct validity

b. Statistical validity

c. External validity

d. Internal validity


7. In most experiments, trade-offs are made between validities because it is not possible to achieve all four at once. What is the most common trade-off?

a. Internal and external validity.

b. Construct and statistical validity.

c. Statistical and internal validity.

d. External and statistical validity.

Learning Actively

1. For each boldfaced variable below, indicate the variable’s levels, whether the variable is measured or manipulated, and how you might describe the variable conceptually and operationally.





Example:

Variable in context: A questionnaire study asks for various demographic information, including participants’ level of education.

Conceptual variable: Level of education

Operationalization: Asking participants to circle their highest level of education from this list: High school diploma, Some college, College degree, Graduate degree

Levels: High school diploma, Some college, College degree, Graduate degree


A questionnaire study asks about anxiety, measured on a 20-item Spielberger Trait Anxiety Inventory.

A study of readability has people read a passage of text printed in one of two fonts: sans-serif or serif.

A study of school achievement asks each participant to report his or her SAT score, as a measure of college readiness.

A researcher studying self-control and blood sugar levels gives participants one of two glasses of sweet-tasting lemonade: one has sugar, the other is sugar-free.

2. Imagine you encounter each of the following headlines. What questions would you ask if you wanted to understand more about the quality of the study behind the headline? For each question, indicate which of the four validities it is addressing. Follow the model in the Working It Through section.

a. Chewing gum can improve your mood and focus.

b. Workaholism is tied to psychiatric disorders.

c. 80% of women feel dissatisfied with how their bodies look.

3. Suppose you want to test the causal claim about chewing gum improving your mood and focus. How could you design an experiment to test this claim? What would the variables be? Would each be manipulated or measured? What results would you expect? Sketch a graph of the outcomes you would predict. Would your experiment satisfy the three criteria for supporting a causal statement?


Research Foundations for Any Claim





CHAPTER 4 Ethical Guidelines for Psychology Research

No matter what type of claim researchers are investigating, they are obligated (by law, by morality, and by today’s social norms) to treat the participants in their research with kindness, respect, and fairness. In the 21st century, researchers are expected to follow basic ethical principles in the treatment of humans and other animals. Researchers are also expected to produce research that is meaningful and helpful to society. How can we know when a study is conducted ethically? This chapter introduces the criteria for evaluating whether a set of research was conducted appropriately.

A year from now, you should still be able to:

1. Define the three ethical principles of the Belmont Report and describe how each one is applied. Recognize the similarities between the Belmont Report’s principles and the five APA Ethical Principles.

2. Describe the procedures that are in place to protect human participants and animal subjects in research.

3. Articulate some of the ways that ethical decision making requires balancing priorities, such as research risks versus benefits, rights of individual participants versus societal gains, and free participation versus coercion.

HISTORICAL EXAMPLES

In the past, researchers may have held different ideas about the ethical treatment of study participants. Two examples of research, one from medicine and one from psychology, follow. The first one clearly illustrates several ethics violations. The second demonstrates the difficult balance of priorities researchers might face when evaluating a study’s ethics.

The Tuskegee Syphilis Study Illustrates Three Major Ethics Violations

In the late 1920s and early 1930s, about 35% of poor Black men living in the southern United States were infected with syphilis. Because the disease was largely untreatable at the time, it interfered with their ability to work, contribute to society, and climb their way out of poverty. The available treatment involved infusions of toxic metals; when it worked at all, this method had serious, even fatal, side effects (CDC, 2016). In 1932, the U.S. Public Health Service (PHS), cooperating with the Tuskegee (Alabama) Institute, began a study of 600 Black men. About 400 were already infected with syphilis, and about 200 were not. The researchers wanted to study the effects of untreated syphilis on the men’s health over the long term. At the time, no treatment was a reasonable choice because the risky methods available in 1932 were not likely to work (Jones, 1993). The men were recruited in their community churches and schools, and many of them were enthusiastic about participating in a project that would give them access to medical care for the first time in their lives (Reverby, 2009). However, there is little evidence that the men were told the study was actually about syphilis.

Early in the project, the researchers decided to follow the men infected with syphilis until each one had died, to obtain valuable data on how the disease progresses when untreated. The study lasted 40 years, during which the researchers made a long series of unethical choices (Figure 4.1). Infected men were told they had “bad blood” instead of syphilis. The researchers told them they were being treated, and all of them were required to come to the Tuskegee clinic for evaluation and testing. But they were never given any beneficial treatment. At one point, in fact, the researchers had to conduct a painful, potentially dangerous spinal tap procedure on every participant, in order to follow the progression of the disease. To ensure that they would come in for the procedure, the researchers lied, telling the men it was a “special free treatment” for their illness (Jones, 1993).

As the project continued, 250 of the men registered to join the U.S. Armed Forces, which were then engaged in World War II. As part of the draft process, the men were diagnosed (again) with syphilis and told to reenlist after they had been treated. Instead of letting the men follow these instructions, however, the researchers interfered by preventing them from being treated. As a result, the men could not serve in the armed forces or receive subsequent G.I. benefits (Final Report of the Tuskegee Study Ad Hoc Advisory Panel, 1973).

In 1943, the PHS approved the use of penicillin for treating syphilis, yet the Tuskegee Institute did not provide information about this new cure to the participants in their study. In 1968, PHS employee Peter Buxtun raised concerns with officials at the CDC. However, the researchers decided to proceed as before. The study continued until 1972, when Buxtun told the story to the Associated Press (Gray, 1998; Heller, 1972), and the study was widely condemned. Over the years, many men got sicker, and dozens died. Several men inadvertently infected their partners, thereby causing, in some cases, congenital syphilis in their children (Jones, 1993; Reverby, 2009).

FIGURE 4.1 The Tuskegee Syphilis Study. A doctor takes a blood sample from a participant. What unethical decisions were made by the researchers who conducted this study?

In 1974, the families of the participants reached a settlement in a lawsuit against the U.S. government. In 1997, President Bill Clinton formally apologized to the survivors on behalf of the nation (Figure 4.2). Nonetheless, the Tuskegee Syphilis Study has contributed to an unfortunate legacy. As a result of this study, some African Americans are suspicious of government health services and research participation (McCallum, Arekere, Green, Katz, & Rivers, 2006).


The researchers conducting this infamous study made a number of choices that are unethical from today’s perspective. Later writers have identified these choices as falling into three distinct categories (Childress, Meslin, & Shapiro, 2005; Gray, 1998). First, the men were not treated respectfully. The researchers lied to them about the nature of their participation and withheld information (such as penicillin as a cure for the disease); in so doing, they did not give the men a chance to make a fully informed decision about participating in the study. If they had known in advance the true nature of the study, some might still have agreed to participate but others might not. After the men died, the doctors offered a generous burial fee to the families, but mainly so they could be sure of doing autopsy studies. These low-income families may have felt coerced into agreeing to an autopsy only because of the large payment.

Second, the men in the study were harmed. They and their families were not told about a treatment for a disease that, in the later years of the study, could be easily cured. (Many of the men were illiterate and thus unable to learn about the penicillin cure on their own.) They were also subjected to painful and dangerous tests. Third, the researchers targeted a disadvantaged social group in this study. Syphilis affects people from all ethnicities and social backgrounds, yet all the men in this study were poor and African American (Gray, 1998; Jones, 1993).

FIGURE 4.2 An official apology in 1997. The U.S. government issued an apology to survivors of the Tuskegee Syphilis Study.


The Milgram Obedience Studies Illustrate a Difficult Ethical Balance

The Tuskegee Syphilis Study provides several clear examples of ethics violations, but decisions about ethical matters are usually more nuanced. Social psychologist Stanley Milgram’s series of studies on obedience to authority, conducted in the early 1960s, illustrates some of the difficulties of ethical decision making.

Imagine yourself as a participant in one of Milgram’s studies. You are told there will be two participants: you, the “teacher,” and another participant, the “learner.” As teacher, your job is to punish the learner when he makes mistakes in a learning task. The learner slips into a cubicle where you can’t see him, and the session begins (Milgram, 1963, 1974).

As the study goes on, you are told to punish the learner for errors by administering electric shocks at increasingly higher intensities, as indicated on an imposing piece of equipment in front of you: the “shock generator.” At first, while receiving the low-voltage shocks, the learner does not complain. But he keeps making mistakes on a word association test he is supposed to be learning, and you are required by the rules of the study to deliver shocks that are 15 volts higher after each mistake (Figure 4.3). As the voltage is increased, the learner begins to grunt with pain. At about 120 volts, the learner shouts that the shocks are very painful and says he wants to quit the experiment. At 300 volts, the learner screams that he will no longer respond to the learning task; he stops responding. The experimenter, sitting behind you in a white lab coat, tells you to keep delivering shocks, 15 volts more each time, until the machine indicates you’re delivering 450-volt shocks. Whereas before the learner screamed in pain with each new shock, after 300 volts you now hear nothing from him. You can’t tell whether he is even conscious in his cubicle.

If you protest (and you probably would), the experimenter behind you says calmly, “Continue.” If you protest again, the experimenter says, again calmly, “The experiment requires that you continue,” or even, “You have no choice, you must go on.” What would you do now?

You may believe you would have refused to obey the demands of this inhumane experimenter. However, in the original study, fully 65% of the participants obeyed, following the experimenter’s instructions and delivering the 450-volt shock to the learner. Only two or three participants (out of hundreds) refused to give even the first, 15-volt shock. Virtually all participants subjected another person to one or more electric shocks, or at least they thought they did. Fortunately, the learner was actually a confederate of the experimenter; he was a paid actor playing a role, and he did not actually receive any shocks. The participants did not know this; they thought the learner was an innocent, friendly man.

FIGURE 4.3 The Milgram obedience studies. In one version, the experimenter (right) and the true participant, or “teacher” (left) help connect the “learner” to the electrodes that would supposedly shock him. Was it ethical for the researchers to invent this elaborate situation, which ultimately caused participants so much stress?

Milgram conducted 18 or more variations of this study. Each time, 40 new participants were asked to deliver painful shocks to the learner. In one variation, the learner mentioned he had a heart condition; this made no difference, and the level of obedience remained at about 65%. In another variation, the learner sat right in the room with the teacher-participant; this time the obedience level dropped to 40% (Figure 4.4). Another time, the experimenter supervised the situation from down the hall, giving his instructions (“Continue,” “The experiment requires that you continue”) over the phone. The obedience level also dropped, and only 20% of participants delivered all the shocks.


Was Milgram acting ethically in conducting this research? One psychologist at the time criticized the study because it was extremely stressful to the teacher-participants (Baumrind, 1964). Milgram relayed this observation from one of his research assistants:

I observed a mature and initially poised businessman enter the laboratory smiling and confident. Within 20 minutes he was reduced to a twitching, nervous wreck, who was rapidly approaching a point of nervous collapse. He constantly pulled on his earlobe, and twisted his hands. At one point he pushed his fist into his forehead and muttered, “Oh, God, let’s stop it.” And yet he continued to respond to every word of the experimenter, and obeyed to the very end. (Milgram, 1963, p. 377)

Was it ethical or unethical to put unsuspecting volunteers through such a stressful experience?

Some writers have questioned how Milgram’s participants coped with their involvement over time. In an interview after the study, the participants were debriefed; they were carefully informed about the study’s hypotheses. They shook hands with the learner, who reassured them he was unharmed. However, in order to avoid influencing potential future participants, the debriefing never mentioned that the learner did not receive shocks (Perry, 2013). In interviews years later, some participants reported worrying for weeks about the learner’s welfare (Perry, 2013).

Milgram claimed that his results (65% obedience) surprised him (Milgram, 1974; but see Perry, 2013). Experts at the time predicted that only 1–2% of people would obey the experimenter up to 450 volts. After the first variation of the study, however, Milgram knew what kind of behavior to expect, and he had already seen firsthand the stress the participants were under. Once he knew that many of the people in the study would experience anxiety and stress, Milgram might have taken steps to stop, or modify, the procedure, and yet he did not.

FIGURE 4.4 Balancing ethical concerns. In one variation of the Milgram obedience studies, participants were required to force the learner’s arm onto a (fake) electric plate. How do you balance potential harm to participants with the benefit to society in this research?

An ethical debate about the Milgram studies must also weigh the lessons learned, and Milgram himself emphasized their social impact. Some argue they contributed crucial lessons about obedience to authority and the “power of the situation,” lessons we would not have learned without his research (Blass, 2002). The research may have benefitted individual participants: Milgram had an associate call some of the participants at home, months later, to ask about their current state of well-being. Some of them felt they had learned something important. For example, one participant reported: “What appalled me was that I could possess this capacity for obedience and compliance. . . . I hope I can deal more effectively with future conflicts of values I encounter” (Milgram, 1974, p. 54). Thus, there is a fundamental conundrum in deciding whether this research is ethical: trying to balance the potential risks to participants and the value of the knowledge gained. In cases like the Milgram studies, it is not an easy decision.

1. What three categories of ethics violations are illustrated by the Tuskegee Syphilis Study?

2. What concerns have been raised against the Milgram obedience studies?

1. See pp. 89–91. 2. See pp. 92–94.

CORE ETHICAL PRINCIPLES

Organizations around the world have developed formal statements of ethics. Following World War II, the Nuremberg Trials revealed the horror of medical experiments conducted on concentration camp victims in Nazi-occupied Europe and resulted in the Nuremberg Code. Although it is not a formal law in any nation, the ten-point Nuremberg Code influences the ethical research laws of many countries (Shuster, 1997). In addition, many national leaders have signed the Declaration of Helsinki, which guides ethics in medical research and practice. Within the United States, ethical systems are also based on the Belmont Report, which defines the ethical guidelines researchers should follow. All of these ethical statements are grounded in the same core principles.

The Belmont Report: Principles and Applications

In 1976, a commission of physicians, ethicists, philosophers, scientists, and other citizens gathered at the Belmont Conference Center in Elkridge, Maryland, at the request of the U.S. Congress. They got together for an intensive discussion of basic ethical principles researchers should follow when conducting research with human participants. The commission was created partly in response to the serious ethics violations of the Tuskegee Syphilis Study (Jonsen, 2005). The contributors produced a short document called the Belmont Report, which outlines three main principles for guiding ethical decision making: respect for persons, beneficence, and justice. Each principle has standard applications. The guidelines are intended for use in many disciplines, including medicine, sociology, anthropology, and basic biological research, as well as psychology.


In the Belmont Report, the principle of respect for persons includes two provisions. First, individuals potentially involved in research should be treated as autonomous agents: They should be free to make up their own minds about whether they wish to participate in a research study. Applying this principle means that every participant is entitled to the precaution of informed consent; each person learns about the research project, considers its risks and benefits, and decides whether to participate.

In obtaining informed consent, researchers are not allowed to mislead people about the study’s risks and benefits. Nor may they coerce or unduly influence a person into participating; doing so would violate the principle of respect for persons. Coercion is an implicit or explicit suggestion that those who do not participate will suffer a negative consequence; for example, a professor implying that students’ grades will be lower if they don’t participate in a particular study. Undue influence is offering an incentive too attractive to refuse, such as an irresistible amount of money in exchange for participating. The report notes that financially poor individuals may be more easily swayed into participating if a research study provides a large payment.

The second application of respect for persons states that some people have less autonomy, so they are entitled to special protection when it comes to informed consent. For example, children, people with intellectual or developmental disabilities, and prisoners should be protected, according to the Belmont Report. Children and certain other individuals might be unable to give informed consent because they do not understand the procedures involved well enough to make a responsible decision (Figure 4.5). Prisoners are especially susceptible to coercion, according to the Belmont Report, because they may perceive requests to participate in research as demands, rather than as invitations. All these populations should be treated with special consideration.


To comply with the principle of beneficence, researchers must take precautions to protect participants from harm and to ensure their well-being. To apply this principle, researchers need to carefully assess the risks and benefits of the study they plan to conduct. In addition, they must consider how the community might benefit or be harmed. Will a community gain something of value from the knowledge this research is producing? Will there be costs to a community if this research is not conducted?

FIGURE 4.5 Vulnerable populations in research. Why might children be considered a vulnerable population that requires special ethical consideration?


The Tuskegee Syphilis Study failed to treat the participants in accordance with the principle of beneficence. The researchers harmed participants through risky and invasive medical tests, and they harmed the participants’ families by exposing them to untreated syphilis. The researchers also withheld benefits from the men in the study. Today, researchers may not withhold treatments that are known to be helpful to study participants. For example, if preliminary results indicate, halfway through a study, that a treatment is advantageous for an experimental group, the researcher must give the participants in the control group the opportunity to receive that treatment, too.

A potential risk is having people’s personal information (their behavior, mental health information, or private reactions) revealed to others. To prevent harm, researchers usually make participant information either anonymous or confidential. In an anonymous study, researchers do not collect any potentially identifying information, including names, birthdays, photos, and so on. Anonymous online surveys will even strip away the identifiers of the computer used. In a confidential study, researchers collect some identifying information (for contacting people at a later date if needed), but prevent it from being disclosed. They may save data in encrypted form or store people’s names separately from their other data.
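As a rough illustration of the confidential approach, the Python sketch below stores names separately from responses and links the two only through a random participant code. The function name, field names, and sample records are hypothetical, and the sketch leaves out encryption; it shows one possible arrangement, not a procedure prescribed by any ethics guideline.

```python
import secrets


def separate_identifiers(records, id_fields=("name",)):
    """Split each record into identifying info and study data.

    Returns (key_file, data_file). The two are linked only by a random
    participant code, so they can be stored in different places.
    """
    key_file = {}   # code -> identifying fields (kept apart from the data)
    data_file = []  # study responses, labeled only by code
    for record in records:
        code = secrets.token_hex(4)
        while code in key_file:  # regenerate on the rare collision
            code = secrets.token_hex(4)
        key_file[code] = {f: record[f] for f in id_fields if f in record}
        responses = {k: v for k, v in record.items() if k not in id_fields}
        data_file.append({"participant": code, **responses})
    return key_file, data_file


# Hypothetical survey responses, invented for illustration only.
records = [
    {"name": "Participant A", "stress_rating": 7},
    {"name": "Participant B", "stress_rating": 3},
]
key_file, data_file = separate_identifiers(records)
print(data_file)  # ratings labeled with random codes, no names
```

An anonymous study would skip the key file entirely, because no identifying fields are collected in the first place.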

Risks and benefits are generally easy to assess when it comes to physical health, the type measured in medical research. Is a person’s health getting worse or better? Is the community going to be healthier because of this research, or not? In contrast, some psychological studies can expose participants to emotional or psychological harm, such as anxiety, stress, depression, or mental strain, and these may be harder to evaluate.

Consider the participants in the Milgram studies, who were clearly experiencing stress. How might you assess the harm done in this situation? Would you measure the way participants felt at that time? Would you ask how they felt about it a year later? Would you measure what they say about their own stress, or what an observer would say? Just as it’s hard to evaluate emotional or psychological harm, it is difficult to evaluate how damaging a study like Milgram’s might be. However, the principle of beneficence demands that researchers consider such risks (and benefits) before beginning each study. As a point of reference, some institutions ask researchers to estimate how stressful a study’s situation would be compared with the normal stresses of everyday life.

The other side of the balance—the benefits of psychological research to the community—may not be easy to assess either. One could argue that Milgram’s results are valuable, but their value is impossible to quantify in terms of lives or dollars saved. Nevertheless, to apply the principle of beneficence, researchers must attempt to predict the risks and benefits of their research—to both participants and the larger community.


The principle of justice calls for a fair balance between the kinds of people who participate in research and the kinds of people who benefit from it. For example, if a research study discovers that a procedure is risky or harmful, the participants, unfortunately, “bear the burden” of that risk, while other people (those not in the study) are able to benefit from the research results (Kimmel, 2007). The Tuskegee Syphilis Study illustrates a violation of this principle of justice: Anybody, regardless of race or income, can contract syphilis and benefit from research on it, but the participants in the study, who bore the burden of untreated syphilis, were all poor, African American men. Therefore, these participants bore an undue burden of risk.

When the principle of justice is applied, it means that researchers might first ensure that the participants involved in a study are representative of the kinds of people who would also benefit from its results. If researchers decide to study a sample from only one ethnic group or only a sample of institutionalized individuals, they must demonstrate that the problem they are studying is especially prevalent in that ethnic group or in that type of institution. For example, it might violate the justice principle if researchers studied a group of prisoners mainly because they were convenient. However, it might be perfectly acceptable to study only institutionalized people for a study on tuberculosis because tuberculosis is particularly prevalent in institutions, where people live together in a confined area.


Just as panels of judges interpret a country’s laws, panels of people interpret the guidelines in the Belmont Report (Jonsen, 2005). Most universities and research hospitals have committees that decide whether research and practice are complying with ethical guidelines. In the United States, federally funded agencies must follow the Common Rule, which describes detailed ways the Belmont Report should be applied in research (U.S. Department of Health and Human Services, 2009). For example, it explains informed consent procedures and ways to approve research before it is conducted.

At many colleges and universities, policies require anyone involved in research with human participants (professors, graduate students, undergraduates, or research staff) to be trained in ethically responsible research. Perhaps your institution requires you to complete online training, such as the course Responsible Conduct of Research, administered by the CITI program. By learning the material in this chapter, you will be better prepared for CITI courses, if you’re required to take them.


1. Name and describe the three main principles of the Belmont Report.

2. Each principle in the Belmont Report has a particular application. The principle of respect for persons has its application in the informed consent process. What are the applications of the other two principles?

1. See pp. 94–95 for principles and pp. 95–97 for definitions. 2. See pp. 95–97.


GUIDELINES FOR PSYCHOLOGISTS: THE APA ETHICAL PRINCIPLES

In addition to the Belmont Report, local policies, and federal laws, American psychologists can consult another layer of ethical principles and standards written by the American Psychological Association (2002), the Ethical Principles of Psychologists and Code of Conduct (Figure 4.6). This broad set of guidelines governs the three most common roles of psychologists: research scientists, educators, and practitioners (usually as therapists). Psychological associations in other countries have similar codes of ethics, and other professions have codes of ethics as well (Kimmel, 2007).

Belmont Plus Two: APA’s Five General Principles

The APA outlines five general principles for guiding individual aspects of ethical behavior. These principles are intended to protect not only research participants, but also students in psychology classes and clients of professional therapists. As you can see in Table 4.1, three of the APA principles (A, D, and E in the table) are identical to the three main principles of the Belmont Report (beneficence, justice, and respect for persons). Another principle is fidelity and responsibility (e.g., a clinical psychologist teaching in a university may not serve as a therapist to one of his or her classroom students, and psychologists must avoid sexual relationships with their students or clients). The last APA principle is integrity (e.g., professors are obligated to teach accurately, and therapists are required to stay current on the empirical evidence for therapeutic techniques).

FIGURE 4.6 The APA website. The full text of the APA’s ethical principles can be found on the website.


TABLE 4.1 The Belmont Report’s Basic Principles and the APA’s Five General Principles Compared

Belmont Report principle | APA general principle | Definition

Beneficence | A. Beneficence and nonmaleficence | Treat people in ways that benefit them. Do not cause suffering. Conduct research that will benefit society.

(no counterpart) | B. Fidelity and responsibility | Establish relationships of trust; accept responsibility for professional behavior (in research, teaching, and clinical practice).

(no counterpart) | C. Integrity | Strive to be accurate, truthful, and honest in one’s role as researcher, teacher, or practitioner.

Justice | D. Justice | Strive to treat all groups of people fairly. Sample research participants from the same populations that will benefit from the research. Be aware of biases.

Respect for persons | E. Respect for people’s rights and dignity | Recognize that people are autonomous agents. Protect people’s rights, including the right to privacy, the right to give consent for treatment or research, and the right to have participation treated confidentially. Understand that some populations may be less able to give autonomous consent, and take precautions against coercing such people.

Note: The principles in boldface are shared by both documents and specifically involve the treatment of human participants in research. The APA guidelines are broader; they apply not only to how psychologists conduct research, but also to how they teach and conduct clinical practice.

Ethical Standards for Research

In addition to the five general principles, the APA lists ten specific ethical standards. These standards are similar to enforceable rules or laws. Psychologist members of the APA who violate any of these standards can lose their professional license or may be disciplined in some other way by the association.

Of its ten ethical standards, Ethical Standard 8 is the one most relevant in a research methods book; it is written specifically for psychologists in their role as researchers. (The other standards are more relevant to their roles as therapists, consultants, and teachers.) The next sections outline the details of the APA’s Ethical Standard 8, noting how it works together with other layers of guidance a researcher must follow.

The website of the APA Ethics Office provides the full text of the APA’s ethics documents. If you are considering becoming a therapist or counselor someday, you may find it interesting to read the other ethical standards that were written specifically for practitioners.


An institutional review board (IRB) is a committee responsible for interpreting ethical principles and ensuring that research using human participants is conducted ethically. Most colleges and universities, as well as research hospitals, have an IRB. In the United States, IRBs are mandated by federal law. If an institution uses federal money (such as government grants) to carry out research projects, a designated IRB is required. However, in the United States, research conducted by private businesses does not have to use an IRB or follow any particular ethical guidelines (though businesses may write their own ethics policies).

An IRB panel in the U.S. includes five or more people, some of whom must come from specified backgrounds. At least one member must be a scientist, one has to have academic interests outside the sciences, and one (or more) should be a community member who has no ties to the institution (such as a local pastor, a community leader, or an interested citizen). In addition, when the IRB discusses a proposal to use prison participants, one member must be recruited as a designated prisoner advocate. The IRB must consider particular questions for any research involving children. IRBs in most other countries follow similar mandates for their composition.
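The composition rules in the preceding paragraph can be read as a simple checklist. The following is a minimal sketch in Python, not an official tool: the `Member` class, its field names, and the `composition_problems` function are hypothetical names invented for illustration, and the checks cover only the requirements stated above (five or more members, a scientist, a member with academic interests outside the sciences, an unaffiliated community member, and a prisoner advocate when prison participants are involved).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One IRB panel member; all field names are hypothetical labels."""
    name: str
    scientist: bool = False            # has a scientific background
    nonscience_academic: bool = False  # academic interests outside the sciences
    community_member: bool = False     # no ties to the institution
    prisoner_advocate: bool = False    # designated advocate for prison research


def composition_problems(panel, involves_prisoners=False):
    """Return the composition requirements (as described in the text) that the panel fails."""
    problems = []
    if len(panel) < 5:
        problems.append("fewer than five members")
    if not any(m.scientist for m in panel):
        problems.append("no scientist")
    if not any(m.nonscience_academic for m in panel):
        problems.append("no member with academic interests outside the sciences")
    if not any(m.community_member for m in panel):
        problems.append("no unaffiliated community member")
    if involves_prisoners and not any(m.prisoner_advocate for m in panel):
        problems.append("no prisoner advocate")
    return problems


# A made-up five-person panel used only to exercise the checks.
panel = [
    Member("A", scientist=True),
    Member("B", scientist=True),
    Member("C", nonscience_academic=True),
    Member("D", community_member=True),
    Member("E", scientist=True),
]

print(composition_problems(panel))                           # []
print(composition_problems(panel, involves_prisoners=True))  # ['no prisoner advocate']
```

The same panel passes for an ordinary proposal but is flagged once the proposal involves prison participants, which mirrors the way the text describes the advocate as an additional, proposal-specific requirement.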

At regular meetings, the IRB reviews proposals from individual scientists. Before conducting a study, researchers must fill out a detailed application describing their study, its risks and benefits (to both participants and society), its procedures for informed consent, and its provisions for protecting people’s privacy. The IRB then reviews each application.

Different IRBs have different procedures. In some universities, when a study is judged to be of little or no risk (such as a completely anonymous questionnaire), the IRB might not meet to discuss it in person. In most institutions, though, any study that poses risks to humans or that involves vulnerable populations must be reviewed by an in-person IRB meeting. In many cases, IRB oversight provides a neutral, multiperspective judgment on any study’s ethicality. An effective IRB should not permit research that violates people’s rights, research that poses unreasonable risk, or research that lacks a sound rationale. However, an effective IRB should not obstruct research, either. It should not prevent controversial (but still ethical) research questions from being investigated. Ideally, the IRB attempts to balance the welfare of research participants and the researchers’ goal of contributing important knowledge to the field.


As mentioned earlier, informed consent is the researcher’s obligation to explain the study to potential participants in everyday language and give them a chance to decide whether to participate. In most studies, informed consent is obtained by providing a written document that outlines the procedures, risks, and benefits of the research, including a statement about any treatments that are experimental. Everyone who wishes to participate signs two copies of the document—one for the researcher to store, and one for the participant to take home.

In certain circumstances, the APA standards (and other federal laws that govern research) indicate that informed consent procedures are not necessary. Specifically, researchers may not be required to have participants sign informed consent forms if the study is not likely to cause harm and if it takes place in an educational setting. Written informed consent might not be needed when participants answer a completely anonymous questionnaire in which their answers are not linked to their names in any way. Written consent may not be required when the study involves naturalistic observation of participants in low-risk public settings, such as a museum, classroom, or mall, where people can reasonably expect to be observed by others anyway. The individual institution’s regulations determine whether written informed consent is necessary in such situations, and those studies still must be approved by an IRB. However, the IRB will allow the researcher to proceed without obtaining formal, written consent forms from every participant. Nevertheless, researchers are always ethically obliged to respect participants’ rights.

According to Ethical Standard 8 (and most other ethical guidelines), obtaining informed consent also involves informing people whether the data they provide in a research study will be treated as private and confidential. Nonconfidential data might put participants at some risk. For example, in the course of research people might report on their health status, political attitudes, test scores, or study habits—information they might not want others to know. Therefore, informed consent procedures ordinarily outline which parts of the data are confidential and which, if any, are not. If data are to be treated as confidential, researchers agree to remove names and other identifiers. Such things as handwriting, birthdays, and photographs might reveal personal data, and researchers must be careful to protect that information if they have promised to do so. At many institutions, confidentiality procedures are not optional. Many institutions require researchers to store any identifiable data in a locked area or on secure computers.


You may have read about psychological research in which the researchers lied to participants. Consider some of the studies you’ve learned about in your psychology courses. In the Milgram obedience studies described earlier, the participants did not know the learner was not really being shocked. In another study, an experimental confederate posed as a thief, stealing money from a person’s bag while an unsuspecting bystander sat reading at a table (Figure 4.7). In some versions of this study, the “thief,” the “victim,” and a third person who sat calmly nearby, pretending to read, were all experimental confederates. That makes three confederates and a fake crime, all in one study (Shaffer, Rogel, & Hendrick, 1975).

Even in the most straightforward study, participants are not told about all the comparison conditions. For example, in the study described in Chapter 3, some participants might have been aware they were reading a story about a scientist, but didn’t know others were reading a different story (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016). All these studies contained an element of deception. Researchers withheld some details of the study from participants—deception through omission; in some cases, they actively lied to them—deception through commission.

FIGURE 4.7 Deception in research. A study on bystander action staged a theft at a library table.


Consider how these studies might have turned out if there had been no such deception. Suppose the researchers had said, “We’re going to see whether you’re willing to help prevent a theft. Wait here. In a few moments, we will stage a theft and see what you do.” Or “We want to know whether reading about Einstein’s struggles will make you more motivated in science class. Ready?” Obviously, the data would be useless. Deceiving research participants by lying to them or by withholding information is, in many cases, necessary in order to obtain meaningful results.

Is deception ethical? In a deception study, researchers must still uphold the principle of respect for persons by informing participants of the study’s activities, risks, and benefits. The principle of beneficence also applies: What are the ethical costs and benefits of doing the study with deception, compared with the ethical costs of not doing it this way? It’s important to find out what kinds of situational factors influence someone’s willingness to report a theft and to test hypotheses about what motivates students in school. Because most people consider these issues to be important, some researchers argue that the gain in knowledge seems worth the cost of lying (temporarily) to the participants (Kimmel, 1998). Even then, the APA principles and federal guidelines require researchers to avoid using deceptive research designs except as a last resort and to debrief participants after the study.

Despite such arguments, some psychologists believe that deception undermines people’s trust in the research process and should never be used in a study design (Ortmann & Hertwig, 1997). Still others suggest deception is acceptable in certain circumstances (Bröder, 1998; Kimmel, 1998; Pittenger, 2002). Researchers have investigated how undergraduates respond to participating in a study that uses deception. Results indicate students usually tolerate minor deception and even some discomfort or stress, considering them necessary parts of research. When students do find deception to be stressful, these negative effects are diminished when the researchers fully explain the deception in a debriefing session (Bröder, 1998; Sharpe, Adair, & Roese, 1992; Smith & Richardson, 1983).


When researchers have used deception, they must spend time after the study talking with each participant in a structured conversation. In a debriefing session, the researchers describe the nature of the deception and explain why it was necessary. Emphasizing the importance of their research, they attempt to restore an honest relationship with the participant. As part of the debriefing process, the researcher describes the design of the study, thereby giving the participant some insight about the nature of psychological science.

Nondeceptive studies often include a debriefing session, too. At many universities, all student participants in research receive a written description of the study’s goals and hypotheses, along with references for further reading. The intention is to make participation in research a worthwhile educational experience, so students can learn more about the research process in general, understand how their participation fits into the larger context of theory testing, and learn how their participation might benefit others. In debriefing sessions, researchers might also offer to share results with the participants. Even months after their participation, people can request a summary of the study’s results.


Most discussions of ethical research focus on protection and respect for participants, and rightly so. However, the publication process also involves ethical decision making. As an example, it is considered ethical to publish one’s results. After participants have spent their time in a study, it is only fair to make the results known publicly for the benefit of society. Psychologists must also treat their data and their sources accurately.

Data Fabrication (Standard 8.10) and Data Falsification. Two forms of research misconduct involve manipulating results. Data fabrication occurs when, instead of recording what really happened in a study (or sometimes instead of running a study at all), researchers invent data that fit their hypotheses. Data falsification occurs when researchers influence a study’s results, perhaps by selectively deleting observations from a data set or by influencing their research subjects to act in the hypothesized way.

A recent case exemplifies both of these breaches. In 2012, social psychologist Diederik Stapel was fired from his job as a professor at Tilburg University in the Netherlands because he fabricated data in dozens of his studies (Stapel Investigation, 2012). Three graduate students became suspicious of his actions and bravely informed their department head. Soon thereafter, committees at the three universities where he had worked began documenting years of fraudulent data collection by Stapel. In written statements, he admitted that at first, he changed occasional data points (data falsification), but that later he found himself typing in entire datasets to fit his and his students’ hypotheses (data fabrication). The scientific journals that published his fraudulent data have retracted over 58 articles to date.

Creating fabricated or falsified data is clearly unethical and has far-reaching consequences. Scientists use data to test their theories, and they can do so only if they know that previously reported data are true and accurate. When people fabricate data, they mislead others about the actual state of support for a theory. Fabricated data might inspire other researchers to spend time (and, often, grant money) following a false lead or to be more confident in theories than they should be. In the case of Stapel, the fraud cast a shadow over the careers of the graduate students and coauthors he worked with. Even though investigators stated the collaborators did not know about or participate in the fabrication, Stapel’s collaborators subsequently found many of their own published papers on the retraction list. Psychologists are concerned that Stapel’s fraud could potentially harm psychology’s reputation, even though psychology as a field is not uniquely vulnerable to fraud (Stroebe, Postmes, & Spears, 2012).

❮❮ For more on the theory-data cycle, see Chapter 1, pp. 11–13.

The costs were especially high for a fraudulent study that suggested a link between the measles, mumps, rubella (MMR) vaccine and autism (Wakefield et al., 1998, cited in Sathyanarayana Rao & Andrade, 2011). The study, though conducted on only 12 children, was discussed worldwide among frightened parents. Some parents refuse to vaccinate their children, even though the paper has been retracted from the journal The Lancet because the authors admitted fraud (Figure 4.8). Even now, there are measles outbreaks in the U.K. and the U.S. attributable to inadequate vaccination rates.

Why might a researcher fabricate or falsify data? In many universities, the reputations, income, and promotions of professors are based on their publications and their influence in the field. In such high-pressure circumstances, the temptation might be great to delete contradictory data or create supporting data (Stroebe, Postmes, & Spears, 2012). In addition, some researchers may simply be convinced of their own hypotheses and believe that any data that do not support their predictions must be inaccurate. Writing about his first instance of falsification, Stapel said: “I changed an unexpected 2 into a 4 . . . I looked at the [office] door. It was closed. When I saw the new results, the world had returned to being logical” (quoted in Borsboom & Wagenmakers, 2013). Unethical scientists may manipulate their data to coincide with their intuition rather than with formal observations, as a true empiricist would.

❯❯ To review the quality of different sources of information, see Chapter 2, pp. 42–52 and Figure 2.9.


FIGURE 4.8 Fabricated and falsified data. This paper on vaccines was retracted from The Lancet after the authors admitted to fabricating results, selectively reporting data (falsification), and failing to report their financial interest. The cost of this fraud can be measured in loss of life from reduced vaccination rates and increased rates of diseases like measles. (Source: Wakefield et al., 1998, cited in Sathyanarayana Rao & Andrade, 2011. Originally published in The Lancet.)


Most recent cases of research fraud have been detected not by the peer review process, but by people who work with the perpetrator (Stroebe, Postmes, & Spears, 2012). If colleagues or students of a researcher in the U.S. suspect such misconduct, they may report it to the scientist’s institution. If the research project is federally funded, suspected misconduct can be reported to the Office of Research Integrity, a branch of the Department of Health and Human Services, which then has the obligation to investigate.

Plagiarism (Standard 8.11). Another form of research misconduct is plagiarism, usually defined as representing the ideas or words of others as one’s own. A formal definition, provided by the U.S. Office of Science and Technology Policy, states that plagiarism is “the appropriation of another person’s ideas, processes, results, or words without giving appropriate credit” (Federal Register, 2000). Academics and researchers consider plagiarism a violation of ethics because it is unfair for a researcher to take credit for another person’s intellectual property: It is a form of stealing.

To avoid plagiarism, a writer must cite the sources of all ideas that are not his or her own, to give appropriate credit to the original authors. Psychologists usually follow the style guidelines for citations in the Publication Manual of the American Psychological Association (APA, 2010). When a writer describes or paraphrases another person’s ideas, the writer must cite the original author’s last name and the year of publication. Writers must be careful not to paraphrase the original text too closely; failure to put the original source in one’s own words is a form of plagiarism. To avoid plagiarizing when using another person’s exact words, the writer puts quotation marks around the quoted text and indicates the page number where the quotation appeared in the original source. Complete source citations are included in the References section of the publication for all quoted or paraphrased works. Figure 4.9 presents examples of these guidelines.
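The citation pattern described above (author last names and year for a paraphrase, plus a page number for a direct quotation) is mechanical enough to sketch in code. The Python below is only an illustration under stated assumptions: the function names `join_authors` and `in_text_citation` are invented here, and the sketch reproduces just the parenthetical format used in this chapter's examples. It does not implement the rest of the Publication Manual, such as rules for abbreviating long author lists.

```python
def join_authors(last_names):
    """Join author last names the way this chapter's parenthetical citations do."""
    names = list(last_names)
    if not names:
        raise ValueError("at least one author is required")
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} & {names[1]}"
    # Three or more authors: commas between names, ", &" before the last one.
    return ", ".join(names[:-1]) + ", & " + names[-1]


def in_text_citation(last_names, year, page=None):
    """Build a parenthetical citation; pass `page` only for a direct quotation."""
    parts = [join_authors(last_names), str(year)]
    if page is not None:
        parts.append(f"p. {page}")
    return "(" + ", ".join(parts) + ")"


authors = ["Lin-Siegler", "Ahn", "Chen", "Fang", "Luna-Lucero"]

# Paraphrase: authors and year only.
print(in_text_citation(authors, 2016))
# (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016)

# Direct quotation: add the page number (the quotation marks go around the quoted text itself).
print(in_text_citation(authors, 2016, page=317))
# (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016, p. 317)
```

A helper like this handles only the form of the citation. Whether a paraphrase is sufficiently in the writer's own words is a judgment no formatting function can make.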

Plagiarism is a serious offense—not only in published work by professional researchers, but also in papers students submit for college courses. Every university and college has plagiarism policies that prohibit students from copying the words or ideas of others without proper credit. Students who plagiarize in their academic work are subject to disciplinary action—including expulsion, in some cases.


In some branches of psychology, research is conducted almost entirely on animal subjects: rats, mice, cockroaches, sea snails, dogs, rabbits, cats, chimpanzees, and others. The ethical debates surrounding animal research can be just as complex as those for human participants. Most people have a profound respect for animals and compassion for their well-being. Psychologists and nonpsychologists alike want to protect animals from undue suffering.

❮❮ For further discussion of plagiarism in writing and the APA guidelines, see Presenting Results at the end of this book.

FIGURE 4.9 Avoiding plagiarism. Writers must cite direct quotations using quote marks, author name, year of publication, and page number. Paraphrasing must be cited with the author name and year. Paraphrasing that is too close to the original is plagiarism, even when it is cited.

Lin-Siegler, X. D., Ahn, J., Chen, J., Fang, A., & Luna-Lucero, M. (2016). Even Einstein struggled: Effects of learning about great scientists’ struggles on high school students’ motivation to learn science. Journal of Educational Psychology, 108, 314–328.

... emphasizing the achievements of these scientists, or (c) the control condition, providing more content instruction in physics that the students were studying in school. Students in the struggle story condition perceived scientists as individuals, like themselves, who needed to overcome obstacles to succeed. In contrast, students in the achievement story condition expressed views that scientists are innately talented individuals who are endowed with a special aptitude for science. Learning about scientists’ struggles not only sparked interest among students who initially displayed little interest for science but also improved students’ retention of theoretical material and performance in solving more complex tasks based on the lesson material.

Correct presentation of a direct quotation from the article: The authors explained that “students in the struggle story condition perceived scientists as individuals, like themselves, who needed to overcome obstacles to succeed. In contrast, students in the achievement story condition expressed views that scientists are innately talented individuals who are endowed with a special aptitude for science” (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016, p. 317).

Correct presentation of a paraphrase: The authors explained that the two stories led students to think about scientists in different ways. When they’d read a struggle story, they tended to think scientists were regular people. But when they’d read an achievement story, they tended to think scientists were more special and talented (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016).

Paraphrasing too close to the original. The absence of quote marks around the highlighted areas presents the authors’ words as the writer’s own; passage is plagiarized even though it is cited: The authors explained that students in the struggle condition considered scientists as individuals like themselves. On the other hand, students in the achievement story condition thought that scientists are innately talented individuals with a special aptitude for science (Lin-Siegler, Ahn, Chen, Fang, & Luna-Lucero, 2016).

Legal Protection for Laboratory Animals. In Standard 8.09, the APA lists ethical guidelines for the care of animals in research laboratories. Psychologists who use animals in research must care for them humanely, must use as few animals as possible, and must be sure the research is valuable enough to justify using animal subjects.

In addition to these APA standards, psychologists must follow federal and local laws for animal care and protection. In the United States, the Animal Welfare Act (AWA) outlines standards and guidelines for the treatment of animals (Animal Welfare Act, 1966). The AWA applies to many species of animals in research laboratories and other contexts, including zoos and pet stores.

The AWA requires relevant research institutions to have a local board called the Institutional Animal Care and Use Committee (IACUC, pronounced “EYE-a-kuk”). Similar to an IRB, the IACUC must approve any animal research project before it can begin (Animal Welfare Act, 1966). It must contain at least three members: a veterinarian, a practicing scientist who is familiar with the goals and procedures of animal research, and a member of the local community who is unconnected with the institution. The IACUC requires researchers to submit an extensive protocol specifying how animals will be treated and protected. The IACUC application also includes the scientific justification for the research: Applicants must demonstrate that the proposed study has not already been done and explain why the research is important. The AWA does not cover mice, rats, and birds, but such species are included in the oversight of IACUC boards.

After approving a research project, the IACUC monitors the treatment of animals throughout the research process. It inspects the labs every 6 months. If a laboratory violates a procedure outlined in the proposal, the IACUC or a government agency can stop the experiment, shut the lab down, or discontinue government funding. In European countries and Canada, similar laws apply.

Animal Care Guidelines and the Three Rs. Animal researchers in the United States use the resources of the Guide for the Care and Use of Laboratory Animals, which focuses on what’s known as the Three Rs: replacement, refinement, and reduction (National Research Council, 2011).

• Replacement means researchers should find alternatives to animals in research when possible. For example, some studies can use computer simulations instead of animal subjects.

• Refinement means researchers must modify experimental procedures and other aspects of animal care to minimize or eliminate animal distress.

• Reduction means researchers should adopt experimental designs and procedures that require the fewest animal subjects possible.

In addition, the manual provides guidelines for housing facilities, diet, and other aspects of animal care in research. The guide indicates which species must be housed in social groups and specifies cage sizes, temperature and humidity ranges, air quality, lighting and noise conditions, sanitation procedures, and enrichments such as toys and bedding.


Attitudes of Scientists and Students Toward Animal Research. In surveys, the majority of psychology students and faculty support the use of animals in research (Plous, 1996a). Nationally, about 47% of Americans favor the use of animals in research; the more education people have, the more likely they are to back it (Pew Research Center, 2015). In fact, when people read about the requirements stated in the AWA, they become more supportive of animal research (Metzger, 2015). In other words, people seem to favor animal research more if they know it protects the welfare of animal subjects.

Attitudes of Animal Rights Groups. Since the mid-1970s in the United States, some groups have increased their visibility and have assumed a more extreme position, arguing for animal rights rather than animal welfare (Figure 4.10). Groups such as People for the Ethical Treatment of Animals (PETA), as well as other groups, both mainstream and marginal, violent and nonviolent, have tried to discover and expose cruelty to animals in research laboratories.

Animal rights groups generally base their activities on one of two arguments (Kimmel, 2007). First, they may believe animals are just as likely as humans to experience suffering. They feel humans should not be elevated above other animals: Because all kinds of animals can suffer, all of them should be protected from painful research procedures. According to this view, a certain type of research with animals could be allowed, but only if it might also be permitted with human participants.

Second, some groups also believe animals have inherent rights, equal to those of humans. These activists argue that most researchers do not treat animals as creatures with rights; instead, animals are treated as resources to be used and discarded (Kimmel, 2007). In a way, this argument draws on the principle of justice, as outlined in the Belmont Report and the APA Ethical Principles: Animal rights activists do not believe animals should unduly bear the burden of research that benefits a different species (humans). Both arguments lead animal rights groups to conclude that many research practices using animals are morally wrong. Some activists accuse researchers who study animals of conducting cruel and unethical experiments (Kimmel, 2007).

The members of these groups may be politically active, vocal, and sincerely devoted to the protection of animals. In a survey, Herzog (1993) concluded they are “intelligent, articulate, and sincere . . . [and] eager to discuss their views about the treatment of animals” with a scientist (quoted in Kimmel, 2007, p. 118). Consistent with this view, Plous (1998) polled animal rights activists and found most to be open to compromise via a respectful dialogue with animal researchers.

Ethically Balancing Animal Welfare, Animal Rights, and Animal Research. Given the laws governing animal welfare and given the broad awareness (if not universal endorsement) of animal rights arguments, you can be sure that today’s research with animals in psychological science is not conducted lightly or irresponsibly. On the contrary, though research with animals is widespread, animal researchers are generally thoughtful and respectful of animal welfare.

FIGURE 4.10 A poster opposing animal research.

Animal researchers defend their use of animal subjects with three primary arguments. The first and central argument is that animal research has resulted in numerous benefits to humans and animals alike (Figure 4.11). Animal research has contributed countless valuable lessons about psychology, biology, and neuroscience, including discoveries about basic processes of vision, the organization of the brain, the course of infection, disease prevention, and therapeutic drugs. Animal research has made fundamental contributions to both basic and applied science, for both humans and animals. Therefore, as outlined in the Belmont Report and APA Ethical Principles, ethical thinking means that research scientists and the public must evaluate the costs and benefits of research projects, in terms of both the subjects used and the potential outcomes.

Second, supporters argue that animal researchers are sensitive to animal welfare. They think about the pain and suffering of animals in their studies and take steps to avoid or reduce it. The IACUC oversight process and the Guide for the Care and Use of Laboratory Animals help ensure that animals are treated with care. Third, researchers have successfully reduced the number of animals they need to use because of new procedures that do not require animal testing (Kimmel, 2007). Some animal researchers even believe animal rights groups have exaggerated (or fabricated, in some cases) the cruelty of animal research (Coile & Miller, 1984) and that some activists have largely ignored the valuable scientific and medical discoveries that have resulted from animal research.

FIGURE 4.11 How do researchers achieve an ethical balance between concern for animal welfare and the benefits to society from research using animals?


1. What are the five ethical principles outlined by the APA? Which two are not included in the three principles of the Belmont Report?

2. Name several ways the Animal Welfare Act, IACUC boards, and the Guide for the Care and Use of Laboratory Animals influence animal research.

1. See p. 98 and Table 4.1. 2. See pp. 105–109.


ETHICAL DECISION MAKING: A THOUGHTFUL BALANCE

Ethical decision making, as you have learned, does not involve simple yes-or-no decisions; it requires a balance of priorities. When faced with a study that could possibly harm human participants or animals, researchers (and their IRBs) consider the potential benefits of the research: Will it contribute something important to society? Many people believe that research with some degree of risk is justified, if the benefit from the knowledge gained from the results is great. In contrast, if the risk to participants becomes too high, the new knowledge may not be valuable enough to justify the harm.

Another example of this careful balance comes from the way researchers implement the informed consent process. On the one hand, researchers may want to demonstrate their gratitude and respect for participants by compensating them with money or some other form of reward or credit. Paying participants might help ensure that the samples represent a variety of populations, as the principle of justice requires, because some people might not participate in a study without a financial incentive. On the other hand, if the rewards researchers offer are too great, they could tip the balance. If monetary rewards become too influential, potential participants may no longer be able to give free consent.

Although in some cases it is easy to conduct important research that has a low degree of risk to participants, other ethical decisions are extremely difficult. Researchers try to balance respect for animal subjects and human participants, protections from harm, benefits to society, and awareness of justice. As this chapter has emphasized, they do not weigh the factors in this balance alone. Influenced by IRBs, IACUCs, peers, and sociocultural norms, they strive to conduct research that is valuable to society and to do it in an ethical manner.

Ethical research practice is not performed according to a set of permanent rules. It is an evolving and dynamic process that takes place in historical and cultural contexts. Researchers refine their ethical decision making in response to good and bad experiences, changing social norms (even public opinion), and scientific discoveries. By following ethical principles, researchers make it more likely that their work will benefit, and be appreciated by, the general public.

The Working It Through section shows how ethical principles can be applied to a controversial research example.


1. Give some examples from the preceding discussion of how the ethical practice of research balances priorities.

1. Answers will vary.


Did a Study Conducted on Facebook Violate Ethical Principles?

A few years ago, researchers from Facebook and Cornell University collaborated to test the effect of emotional contagion through online social networks (Kramer, Guillory, & Hancock, 2014). Emotional contagion is the tendency for emotions to spread in face-to-face interactions. When people express happiness, people around them become happier, too. Researchers randomly selected over 600,000 Facebook users and withheld certain posts from their newsfeeds. From one group, they withheld posts with positive emotion words (such as happy and love). From a second group, they withheld posts at random (not altering their emotional content). A third group had negative posts withheld. The researchers measured how many positive and negative emotion words people used on their own Facebook timelines. The results showed the group who’d seen fewer positive posts tended to use fewer positive and more negative emotion words on their own pages. The effect size was extremely small, but the researchers concluded that emotional contagion can happen even through online text.

After the media publicized the study’s results, commentators raised an alarm. Facebook manipulated people’s newsfeeds? Is that ethical? We can organize the critiques of this study around the topics in Chapter 4.


Institutional review board

Was the study reviewed by an IRB?

The study’s lead scientist was employed by Facebook, and as a private company, Facebook is not required to follow federal ethical guidelines such as the Common Rule.

The other two scientists had the study reviewed by Cornell University’s IRB, as required. The Cornell IRB decided the study did not fall under its program because the data had been collected by Facebook.

This example highlights that private businesses sometimes conduct research on people who use their products and such research might not be reviewed for ethics.





Informed consent

Did Facebook users get to decide if they wanted to participate?

The study’s authors reported that when people create a Facebook account, they agree to a Data Use Policy, and this constituted informed consent.

Not all critics agreed. The journal in which the study was published attached an Editorial Statement of Concern stating that, though it had agreed to publish the paper, it was concerned that the study did not allow participants to opt out.

Deception and debriefing

Were participants told in full about the study after they participated?

Participants were not told their newsfeeds might have been manipulated for research purposes.

Participants were deceived through omission of information.

In addition, people were not debriefed afterwards; even now, people cannot find out whether they had participated in this study or not.

If an IRB had considered this study in advance, they would have evaluated it, first, in terms of respect for persons.

The application of respect for persons is informed consent. Participants did not consent to this particular study.

Although people did not provide informed consent for this particular study, informed consent might not be deemed necessary when a study takes place in a public place where people can reasonably expect to be observed. Do you think the Facebook study falls into the same category?

An IRB would also ask about beneficence: Did the research harm anyone, and did it benefit society?

The study itself demonstrated that people felt worse when positive emotion posts were removed.

The researchers argued that the study benefited society. Social media plays a role in most people’s daily lives, and emotions are linked to well-being.

People may have suffered a bit, but was their distress any greater than it might have been in daily life? (Perhaps not, because the effect size was so small.)

In addition, some argued that Facebook already manipulates newsfeeds. For example, your own newsfeed’s stories and posts have been selected according to a computerized algorithm.

The results did show that social media is a source of emotional contagion that could potentially improve public health. Do you find this study to be beneficial to society?

An IRB would consider whether the principle of justice was met. Were the people who participated in the study representative of the people who can benefit from its findings?

The study randomly selected hundreds of thousands of people who read Facebook in English.

Because the sample was selected at random, it appears that the people who “bore the burden” of research participation were the same types who could benefit from its findings. The principle of justice has probably been met.


Summary

Whether psychologists are testing a frequency, association, or causal claim, they strive to conduct their research ethically. Psychologists are guided by standard ethical principles as they plan and conduct their research.

Historical Examples

• The Tuskegee Syphilis Study, which took place in the U.S. during the 1930s through the 1970s, illustrates the ethics violations of harming people, not asking for consent, and targeting a particular group in research.

• The Milgram obedience studies illustrate the gray areas in ethical research, including how researchers define harm to participants and how they balance the importance of a study with the harm it might do.

Core Ethical Principles

• Achieving an ethical balance in research is guided by standards and laws. Many countries’ ethical policies are governed by the Nuremberg Code and the Declaration of Helsinki. In the U.S., federal ethical policies are based on the Common Rule, which is grounded in the Belmont Report.

• The Belmont Report outlines three main principles for research: respect for persons, beneficence, and justice. Each principle has specific applications in the research setting.

• Respect for persons involves the process of informed consent and the protection of special groups in research, such as children and prisoners.

• Beneficence involves the evaluation of risks and benefits, to participants in the study and to society as a whole.

• Justice involves the way participants are selected for the research. One group of people should not bear an undue burden for research participation, and participants should be representative of the groups that will also benefit from the research.

Guidelines for Psychologists: The APA Ethical Principles

• The APA guides psychologists by providing a set of principles and standards for research, teaching, and other professional roles.

• The APA’s five general principles include the three Belmont Report principles, plus two more: the principle of fidelity and responsibility, and the principle of integrity.

• The APA’s Ethical Standard 8 provides enforceable guidelines for researchers to follow. It includes specific information for informed consent, institutional review boards, deception, debriefing, research misconduct, and animal research.

Ethical Decision Making: A Thoughtful Balance

• For any type of claim psychologists are investigating, ethical decision making requires balancing a variety of priorities.

• Psychologists must balance benefits to society with risks to research participants, and balance compensation for participants with undue coercion for their involvement.



Key Terms

debriefed, p. 93
principle of respect for persons, p. 95
informed consent, p. 95
principle of beneficence, p. 95
anonymous study, p. 96
confidential study, p. 96
principle of justice, p. 96
institutional review board (IRB), p. 99
deception, p. 101
data fabrication, p. 103
data falsification, p. 103
plagiarism, p. 105

Review Questions

1. Which of the following is not one of the three principles of the Belmont Report?

a. Respect for persons

b. Justice

c. Beneficence

d. Fidelity and responsibility

2. In a study of a new drug for asthma, a researcher finds that the group receiving the drug is doing much better than the control group, whose members are receiving a placebo. Which principle of the Belmont Report requires the researcher to also give the control group the opportunity to receive the new drug?

a. Informed consent

b. Justice

c. Beneficence

d. Respect for persons

3. In order to study a sample of participants from only one ethnic group, researchers must first demonstrate that the problem being studied is especially prevalent in that ethnic group. This is an application of which principle from the Belmont Report?

a. Respect for persons

b. Beneficence

c. Special protection

d. Justice

4. Following a study using deception, how does the researcher attempt to restore an honest relationship with the participant?

a. By apologizing to the participant and offering monetary compensation for any discomfort or stress.

b. By debriefing each participant in a structured conversation.

c. By reassuring the participant that all names and identifiers will be removed from the data.

d. By giving each participant a written description of the study’s goals and hypotheses, along with references for further reading.

5. What type of research misconduct involves representing the ideas or words of others as one’s own?

a. Plagiarism

b. Obfuscation

c. Suppression

d. Data falsification

6. Which of the following is not one of the Three R’s provided by the Guide for the Care and Use of Laboratory Animals?

a. Reduction

b. Replacement

c. Restoration

d. Refinement



Learning Actively

1. A developmental psychologist applies to an institutional review board (IRB), proposing to observe children ages 2–10 playing in the local McDonald’s play area. Because the area is public, the researcher does not plan to ask for informed consent from the children’s parents. What ethical concerns exist for this study? What questions might an IRB ask?

2. A social psychologist plans to hand out surveys in her 300-level undergraduate class. The survey asks about students’ study habits. The psychologist does not ask the students to put their names on the survey; instead, students will put completed surveys into a large box at the back of the room. Because of the low risk involved in participation and the anonymous nature of the survey, the researcher requests to be exempted from formal informed consent procedures. What ethical concerns exist for this study? What questions might an IRB ask?

3. Consider the use of deception in psychological research. Does participation in a study involving deception (such as the Milgram obedience studies) necessarily cause harm? Recall that when evaluating the risks and benefits of a study, the researcher considers both the participants in the study and society as a whole (anyone who might be affected by the research). What might be some of the costs and benefits to participants who are deceived? What might be some of the costs and benefits to society of studies involving deception?

4. Use the Internet to look up your college’s definition of plagiarism. Does it match the one given in APA Ethical Standard 8.11? If not, what does it exclude or add? What are the consequences for plagiarism at your college?

5. Use the Internet to find out the procedures of the IRB at your college. According to your college’s policies, do undergraduates like you need special ethics training before they can conduct research? Does research conducted in a research methods class need formal IRB approval? Does your college categorize studies that are “exempt” from IRB review versus “Expedited” versus “Full board” review? If so, what kinds of studies are considered exempt?

Gratitude Is for Lovers Greater Good, 2013

Who Are the Happiest People in the World? Gallup, 2016

Can Money Buy You Happiness? Wall Street Journal, 2014


Identifying Good Measurement

Whether studying the number of polar bears left in the Arctic Circle, the strength of a bar of steel, the number of steps people take each day, or the level of human happiness, every scientist faces the challenge of measurement. When researchers test theories or pursue empirical questions, they have to systematically observe the phenomena by collecting data. Such systematic observations require measurements, and these measurements must be good ones, or else they are useless.

Measurement in psychological research can be particularly challenging. Many of the phenomena psychologists are interested in (motivation, emotion, thinking, reasoning) are difficult to measure directly. Happiness, the topic of much research, is a good example of a construct that could be hard to assess. Is it really possible to quantify how happy people are? Are the measurements accurate? Before testing, for example, whether people who make more money are happier, we might ask whether we can really measure happiness. Maybe people misrepresent their level of well-being, or maybe people aren’t aware of how happy they are. How do we evaluate who is really happy and who isn’t? This chapter explains how to ask questions about the quality of a study’s measures: the construct validity of quantifications of things like happiness, gratitude, or wealth. Construct validity, remember, refers to how well a study’s variables are measured or manipulated.

Construct validity is a crucial piece of any psychological research study, whether for frequency, association, or causal claims. This chapter focuses on the construct validity of measured variables. You will learn, first, about different ways researchers operationalize measured variables. Then you’ll learn how you can evaluate the reliability and validity of those measurements. The construct validity of manipulated variables is covered in Chapter 10.

A year from now, you should still be able to:

1. Interrogate the construct validity of a study’s variables.

2. Describe the kinds of evidence that support the construct validity of a measured variable.

WAYS TO MEASURE VARIABLES

The process of measuring variables involves some key decisions. As researchers decide how they should operationalize each variable in a study, they choose among three common types of measures: self-report, observational, and physiological. They also decide on the most appropriate scale of measurement for each variable they plan to investigate.

More About Conceptual and Operational Variables

In Chapter 3, you learned about operationalization, the process of turning a construct of interest into a measured or manipulated variable. Much psychological research requires two definitions of each variable. The conceptual definition, or construct, is the researcher’s definition of the variable in question at a theoretical level. The operational definition represents a researcher’s specific decision about how to measure or manipulate the conceptual variable.


Let’s take the variable “happiness,” for example. One research team, led by Ed Diener, began the study of happiness by developing a precise conceptual definition. Specifically, Diener’s team reasoned that the word happiness might have a variety of meanings, so they explicitly limited their interest to “subjective well-being” (or well-being from a person’s own perspective).

After defining happiness at the conceptual level, Diener and his colleagues developed an operational definition. Because they were interested in people’s perspectives on their own well-being, they chose to operationalize subjective well-being, in part, by asking people to report on their own happiness in a questionnaire format. The researchers decided people should use their own criteria to describe what constitutes a “good life” (Pavot & Diener, 1993). They worded their questions so people could think about the interpretation of life satisfaction that was appropriate for them. These researchers operationally defined, or measured, subjective well-being by asking people to respond to five items about their satisfaction with life using a 7-point scale; 1 corresponded to “strongly disagree” and 7 corresponded to “strongly agree”:

1. In most ways my life is close to my ideal.

2. The conditions of my life are excellent.

3. I am satisfied with my life.

4. So far I have gotten the important things I want in life.

5. If I could live my life over, I would change almost nothing.

❯❯ For a review of measured and manipulated variables, see Chapter 3, pp. 58–59.

The unhappiest people would get a total score of 5 on this self-report scale because they would answer “strongly disagree,” or 1, to all five items (1 + 1 + 1 + 1 + 1 = 5). The happiest people would get a total score of 35 on this scale because they would answer “strongly agree,” or 7, to all five items (7 + 7 + 7 + 7 + 7 = 35). Those at the neutral point would score 20—right in between satisfied and dissatisfied (4 + 4 + 4 + 4 + 4 = 20). Diener and Diener (1996) reported some data based on this scale, concluding that most people are happy, meaning most people scored above 20. For example, 63% of high school and college students scored above 20 in one study, and 72% of disabled adults scored above 20 in another study.
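The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text; the function and constant names (`score_life_satisfaction`, `is_above_neutral`, `NEUTRAL_SCORE`) are hypothetical and do not come from Diener's team.

```python
# Illustrative sketch of the scoring arithmetic described in the text.
# All names here are hypothetical; they are not part of any published scoring tool.

NUM_ITEMS = 5          # the scale has five items
MIN_RESPONSE = 1       # 1 = "strongly disagree"
MAX_RESPONSE = 7       # 7 = "strongly agree"
NEUTRAL_SCORE = 4 * NUM_ITEMS  # answering 4 to every item gives 20


def score_life_satisfaction(responses):
    """Add up five responses (each from 1 to 7) into a total from 5 to 35."""
    if len(responses) != NUM_ITEMS:
        raise ValueError("expected exactly five responses")
    for response in responses:
        if not MIN_RESPONSE <= response <= MAX_RESPONSE:
            raise ValueError("each response must be between 1 and 7")
    return sum(responses)


def is_above_neutral(responses):
    """Return True when the total score is higher than the neutral point of 20."""
    return score_life_satisfaction(responses) > NEUTRAL_SCORE


print(score_life_satisfaction([1, 1, 1, 1, 1]))  # lowest possible total: 5
print(score_life_satisfaction([7, 7, 7, 7, 7]))  # highest possible total: 35
print(score_life_satisfaction([4, 4, 4, 4, 4]))  # neutral point: 20
print(is_above_neutral([5, 6, 4, 5, 6]))         # total is 26, so True
```

A respondent counts as "happy" in the sense used by Diener and Diener (1996) when `is_above_neutral` returns True for that person's five answers.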

In choosing this operational definition of subjective well-being, the research team started with only one possible measure, even though there are many other ways to study this concept. Another way to measure happiness is to use a single question called the Ladder of Life (Cantril, 1965). The question goes like this:

Imagine a ladder with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally stand at this time?

On this measure, participants respond by giving a value between 0 and 10. The Gallup polling organization uses the Ladder of Life scale in its daily Gallup-Healthways Well-Being Index.

You might be thinking one of these operational definitions seems like a better measure of happiness than the other. Which one do you think is best? We’ll see that they both do a good job of measuring the construct. Diener’s research team and Gallup have both learned their measures of happiness are accurate because they have collected data on them, as we’ll see later in this chapter.


To study conceptual variables other than happiness, researchers follow a similar process: They start by stating a definition of their construct (the conceptual variable) and then create an operational definition. For example, to measure the association between wealth and happiness, researchers need to measure not only happiness, but also wealth. They might operationally define wealth by asking about salary in dollars, by asking for bank account balances, or even by observing the kind of car people drive.

Consider another variable that has been studied in research on relationships: gratitude toward one’s partner. Researchers who measure gratitude toward a relationship partner might operationalize it by asking people how often they thank their partner for something they did. Or they might ask people how appreciative they usually feel. Even a simple variable such as gender must be operationalized. As Table 5.1 shows, any conceptual variable can be operationalized in a number of ways. In fact, operationalizations are one place where creativity comes into the research process, as researchers work to develop new and better measures of their constructs.

Three Common Types of Measures

The types of measures psychological scientists typically use to operationalize variables generally fall into three categories: self-report, observational, and physiological.


A self-report measure operationalizes a variable by recording people’s answers to questions about themselves in a questionnaire or interview. Diener’s five-item scale and the Ladder of Life question are both examples of self-report measures about life satisfaction. Similarly, asking people how much they appreciate their partner and asking about gender identity are both self-report measures. If stress was the variable being studied, researchers might ask people to self-report on the frequency of specific events they’ve experienced in the past year, such as marriage, divorce, or moving (e.g., Holmes & Rahe, 1967).

In research on children, self-reports may be replaced with parent reports or teacher reports. These measures ask parents or teachers to respond to a series of questions, such as describing the child’s recent life events, the words the child knows, or the child’s typical classroom behaviors. (Chapter 6 discusses situations when self-report measures are likely to be accurate and when they might be biased.)

TABLE 5.1 Variables and Operational Definitions

Each variable is followed by two possible operational definitions.

Gratitude toward one’s relationship partner
• Asking people if they agree with the statement: “I appreciate my partner.”
• Watching couples interact and counting how many times they thank each other.

Gender
• Asking people to report on a survey whether they identify as male or female.
• In phone interviews, a researcher guesses gender through the sound of the person’s voice.

Wealth
• Asking people to report their income on various ranges (less than $20,000, between $20,000 and 50,000, and more than $50,000).
• Coding the value of a car from 1 (older, lower-status vehicle) to 5 (new, high-status vehicle in good condition).

Intelligence
• An IQ test that includes problem-solving items, memory and vocabulary questions, and puzzles.
• Recording brain activity while people solve difficult problems.

Well-being (happiness)
• 10-point Ladder of Life scale.
• Diener’s 5-item subjective well-being scale.


An observational measure, sometimes called a behavioral measure, operationalizes a variable by recording observable behaviors or physical traces of behaviors. For example, a researcher could operationalize happiness by observing how many times a person smiles. Intelligence tests can be considered observational measures, because the people who administer such tests in person are observing people’s intelligent behaviors (such as being able to correctly solve a puzzle or quickly detect a pattern). Coding how much a person’s car cost would be an observational measure of wealth (Piff, Stancato, Côté, Mendoza-Denton, & Keltner, 2012).

Observational measures may record physical traces of behavior. Stress behaviors could be measured by counting the number of tooth marks left on a person’s pencil, or a researcher could measure stressful events by consulting public legal records to document whether people have recently married, divorced, or moved. (Chapter 6 addresses how an observer’s ratings of behavior might be accurate and how they might be biased.)


A physiological measure operationalizes a variable by recording biological data, such as brain activity, hormone levels, or heart rate. Physiological measures usually require the use of equipment to amplify, record, and analyze biological data. For example, moment-to-moment happiness has been measured using facial electromyography (EMG)—a way of electronically recording tiny movements in the muscles in the face. Facial EMG can be said to detect a happy facial expression because people who are smiling show particular patterns of muscle movement around the eyes and cheeks.

Other constructs might be measured using a brain scanning technique called functional magnetic resonance imaging, or fMRI. In a typical fMRI study, people engage in a carefully structured series of psychological tasks (such as looking at three types of photos or playing a series of rock-paper-scissors games) while lying in an MRI machine. The MRI equipment records and codes the relative changes in blood flow in particular regions of the brain, as shown in Figure 5.1. When more blood flows to a brain region while people perform a certain task, researchers conclude that brain area is activated because of the patterns on the scanned images.

FIGURE 5.1 Images from fMRI scans showing brain activity. In this study of how people respond to rewards and losses, the researchers tracked blood flow patterns in the brain when people had either won, lost, or tied a rock-paper-scissors game played with a computer. They found that many regions of the brain respond more to wins than to losses, as indicated by the highlighted regions. (Source: Vickery, Chun, & Lee, 2011.)

Some research indicates a way fMRI might be used to measure intelligence in the future. Specifically, the brains of people with higher intelligence are more efficient at solving complex problems; their fMRI scans show relatively less brain activity for complex problems (Deary, Penke, & Johnson, 2010). Therefore, future researchers may be able to use the efficiency of brain activity as a physiological measure of intelligence. A physiological measure from a century ago turned out to be flawed: People used head circumference to measure intelligence, under the mistaken impression that smarter brains would be stored inside larger skulls (Gould, 1996).

A physiological way to operationalize stress might be to measure the amount of the hormone cortisol released in saliva because people under stress show higher levels of cortisol (Carlson, 2009). Skin conductance, an electronic recording of the activity in the sweat glands of the hands or feet, is another way to measure stress physiologically. People under more stress have more activity in these glands. Another physiological measure used in psychology research is the detection of electrical patterns in the brain using electroencephalography (EEG).


A single construct can be operationalized in several ways, from self-report to behavioral observation to physiological measures. Many people erroneously believe physiological measures are the most accurate, but even their results have to be validated by using other measures. For instance, as mentioned above, researchers used fMRI to learn that the brain works more efficiently relative to level of intelligence. But how was participant intelligence established in the first place? Before doing the fMRI scans, the researchers gave the participants an IQ test, an observational measure (Deary et al., 2010). Similarly, researchers might trust an fMRI pattern to indicate when a person is genuinely happy. However, the only way a researcher could know that some pattern of brain activity was associated with happiness is by asking each person how happy he or she feels (a self-report measure) at the same time the brain scan was being done. As you’ll learn later in this chapter, it’s best when self-report, observational, and physiological measures show similar patterns of results.

Scales of Measurement

All variables must have at least two levels (see Chapter 3). The levels of operational variables, however, can be coded using different scales of measurement.


Operational variables are primarily classified as categorical or quantitative. The levels of categorical variables, as the term suggests, are categories. (Categorical variables are also called nominal variables.) Examples are sex, whose levels are male and female; and species, whose levels in a study might be rhesus macaque, chimpanzee, and bonobo. A researcher might decide to assign numbers to the levels of a categorical variable (e.g., using “1” to represent rhesus macaques, “2” for chimps, and “3” for bonobos) during the data-entry process. However, the numbers do not have numerical meaning: a bonobo is different from a chimpanzee, but being a bonobo (“3”) is not quantitatively “higher” than being a chimpanzee (“2”).

In contrast, the levels of quantitative variables are coded with meaningful numbers. Height and weight are quantitative because they are measured in numbers, such as 170 centimeters or 65 kilograms. Diener’s scale of subjective well-being is quantitative too, because a score of 35 represents more happiness than a score of 7. IQ score, level of brain activity, and amount of salivary cortisol are also quantitative variables.
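One way to see the difference between categorical and quantitative coding is to try averaging the numbers. The short Python sketch below uses made-up data (four animals and three heights) and hypothetical variable names; it is meant only to illustrate the point in the text, not to describe any real study.

```python
from collections import Counter
from statistics import mean

# Made-up sample: the species of four animals in a hypothetical study.
animals = ["rhesus macaque", "chimpanzee", "bonobo", "bonobo"]

# Two equally arbitrary ways to enter species as numbers during data entry.
coding_a = {"rhesus macaque": 1, "chimpanzee": 2, "bonobo": 3}
coding_b = {"bonobo": 1, "rhesus macaque": 2, "chimpanzee": 3}

codes_a = [coding_a[species] for species in animals]
codes_b = [coding_b[species] for species in animals]

# The "average species" depends entirely on which arbitrary coding was used,
# so it tells us nothing about the animals themselves.
print(mean(codes_a), mean(codes_b))

# Counting how many animals fall in each category works under any coding.
category_counts = Counter(animals)
print(category_counts["bonobo"])

# For a quantitative variable, the numbers carry meaning, so an average does too.
heights_cm = [170, 160, 180]  # made-up heights in centimeters
print(mean(heights_cm))
```

Because the two codings give different averages for the same four animals, the numerals of a categorical variable are best treated as labels, whereas the mean height is a meaningful summary.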


For certain kinds of statistical purposes, researchers may need to further classify a quantitative variable in terms of ordinal, interval, or ratio scale.

An ordinal scale of measurement applies when the numerals of a quantitative variable represent a ranked order. For example, a bookstore’s website might display the top 10 best-selling books. We know that the #1 book sold more than the #2 book, and that #2 sold more than #3, but we don’t know whether the number of books that separates #1 and #2 is equal to the number of books that separates #2 and #3. In other words, the intervals may be unequal. Maybe the first two rankings are only 10 books apart, and the second two rankings are 150,000 books apart. Similarly, a professor might use the order in which exams were turned in to operationalize how fast students worked. This represents ordinal data because the fastest exams are on the bottom of the pile—ranked 1. However, this variable has not quantified how much faster each exam was turned in, compared with the others.
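The bookstore example can be sketched in Python to show what a rank order keeps and what it throws away. The sales figures and book titles below are invented so that the gaps match the ones mentioned in the text (10 books and 150,000 books); `rank_by_sales` is a hypothetical helper.

```python
def rank_by_sales(sales):
    """Return a dict mapping each title to its rank (1 = best seller)."""
    ordered_titles = sorted(sales, key=sales.get, reverse=True)
    return {title: rank for rank, title in enumerate(ordered_titles, start=1)}


# Invented sales figures for three books.
sales = {"Book A": 200_010, "Book B": 200_000, "Book C": 50_000}

ranks = rank_by_sales(sales)
print(ranks)  # Book A is ranked 1, Book B is ranked 2, Book C is ranked 3

# The ranks are one step apart, but the underlying gaps in sales are not equal.
ordered_sales = sorted(sales.values(), reverse=True)
gaps = [ordered_sales[i] - ordered_sales[i + 1] for i in range(len(ordered_sales) - 1)]
print(gaps)  # gap between #1 and #2, then gap between #2 and #3
```

Someone who sees only `ranks` cannot recover `gaps`, which is what it means to say the intervals on an ordinal scale may be unequal.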

An interval scale of measurement applies to the numerals of a quantitative variable that meet two conditions: First, the numerals represent equal intervals (distances) between levels, and second, there is no “true zero” (a person can get a score of 0, but the 0 does not really mean “nothing”). An IQ test is an interval scale: the distance between IQ scores of 100 and 105 represents the same as the distance between IQ scores of 105 and 110. However, a score of 0 on an IQ test does not mean a person has “no intelligence.” Body temperature in degrees Celsius is another example of an interval scale: the intervals between levels are equal; however, a temperature of 0 degrees does not mean a person has “no temperature.” Most researchers assume questionnaire scales like Diener’s (scored from 1 = strongly disagree to 7 = strongly agree) are interval scales. They do not have a true zero but we assume the distances between numerals, from 1 to 7, are equivalent. Because they do not have a true zero, interval scales cannot allow a researcher to say things like “twice as hot” or “three times happier.”

Finally, a ratio scale of measurement applies when the numerals of a quantitative variable have equal intervals and when the value of 0 truly means “none” or “nothing” of the variable being measured. On a knowledge test, a researcher might measure how many items people answer correctly. If people get a 0, it truly represents “nothing correct” (0 answers correct). A researcher might measure how frequently people blink their eyes in a stressful situation; number of eyeblinks is a ratio scale because 0 would represent zero eyeblinks. Because ratio scales do have a true zero, one can meaningfully say something like “Miguel answered twice as many problems as Diogo.” Table 5.2 summarizes all the above variations.
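A short Python sketch can show why "twice as" statements work on ratio scales but not on interval scales. The numbers are invented, and the Fahrenheit conversion is brought in here only as an illustration (the text itself mentions Celsius); the idea is that a ratio is trustworthy only if it survives a change of units.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def seconds_to_milliseconds(seconds):
    """Convert a duration from seconds to milliseconds."""
    return seconds * 1000


# Interval scale: three invented temperatures in degrees Celsius.
cool_c, warm_c, hot_c = 10, 20, 30

# Equal intervals hold in both units.
print(hot_c - warm_c == warm_c - cool_c)
print(celsius_to_fahrenheit(hot_c) - celsius_to_fahrenheit(warm_c)
      == celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cool_c))

# But the ratio changes with the unit, because 0 degrees is not "no temperature".
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
print(ratio_celsius, ratio_fahrenheit)  # 2.0 in Celsius, but not 2.0 in Fahrenheit

# Ratio scale: invented response times, where 0 truly means "no time at all".
slow_s, fast_s = 2.0, 1.0
ratio_seconds = slow_s / fast_s
ratio_milliseconds = seconds_to_milliseconds(slow_s) / seconds_to_milliseconds(fast_s)
print(ratio_seconds == ratio_milliseconds)  # the "twice as long" claim survives
```

The temperature ratio of 2.0 disappears when the same readings are expressed in Fahrenheit, whereas the response-time ratio is the same in seconds and milliseconds, which is why only the second comparison supports a "twice as" statement.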


TABLE 5.2 Measurement Scales for Operational Variables

Categorical: Levels are categories. Examples: Nationality. Type of music. Kind of phone people use.

Quantitative: Levels are coded with meaningful numbers.

  Ordinal: A quantitative variable in which numerals represent a rank order. Distance between subsequent numerals may not be equal. Examples: Order of finishers in a swimming race. Ranking of 10 movies from most to least favorite.

  Interval: A quantitative variable in which subsequent numerals represent equal distances, but there is no true zero. Examples: IQ score. Shoe size. Degree of agreement on a 1–7 scale.

  Ratio: A quantitative variable in which numerals represent equal distances and zero represents “none” of the variable being measured. Examples: Number of exam questions answered correctly. Number of seconds to respond to a computer task. Height in cm.
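The difference between interval and ratio scales can be made concrete with a short calculation. The sketch below is only an illustration: the two temperatures are made-up values, and the kelvin conversion is brought in from outside this chapter as an example of a temperature scale that does have a true zero.

```python
# Illustrative sketch: the temperatures are made-up values, and kelvin is
# used here only as an example of a scale with a true zero.

def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = warm_c - cool_c

# Dividing the Celsius numerals suggests "twice as hot" ...
naive_ratio = warm_c / cool_c

# ... but on a scale with a true zero the ratio is far smaller.
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference = {difference} degrees")
print(f"ratio of Celsius numerals = {naive_ratio}")
print(f"ratio on the kelvin scale = {true_ratio:.3f}")
```

The ratio of the Celsius numerals is 2, yet the same two temperatures differ by only a few percent once they are expressed on a scale whose zero means “none.” This is why “twice as hot” is not a meaningful statement for an interval scale.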


1. Explain why a variable will usually have only one conceptual definition but can have multiple operational definitions.

2. Name the three common ways in which researchers operationalize their variables.

3. In your own words, describe the difference between categorical and quantitative variables. Come up with new examples of variables that would fit the definition of ordinal, interval, and ratio scales.

1. See pp. 118–120. 2. See pp. 120–122. 3. See pp. 122–124.

RELIABILITY OF MEASUREMENT: ARE THE SCORES CONSISTENT?

Now that we’ve established different types of operationalizations, we can ask the important construct validity question: How do you know if a study’s operationalizations are good ones? The construct validity of a measure has two aspects. Reliability refers to how consistent the results of a measure are, and validity concerns whether the operationalization is measuring what it is supposed to measure. Both are important, and the first step is reliability.

Introducing Three Types of Reliability

Before deciding on the measures to use in a study, researchers collect their own data or review data collected by others. They use data because establishing the reliability of a measure is an empirical question. A measure’s reliability is just what the word suggests: whether or not researchers can rely on a particular score. If an operationalization is reliable, it will yield a consistent pattern of scores every time.

Reliability can be assessed in three ways, depending on how a variable was operationalized, and all three involve consistency in measurement. With test-retest reliability, the researcher gets consistent scores every time he or she uses the measure. With interrater reliability, consistent scores are obtained no matter who measures the variable. With internal reliability (also called internal consistency), a study participant gives a consistent pattern of answers, no matter how the researcher has phrased the question.


To illustrate test-retest reliability, let’s suppose a sample of people took an IQ test today. When they take it again 1 month later, the pattern of scores should be consistent: People who scored the highest at Time 1 should also score the highest at Time 2. Even if all the scores from Time 2 have increased since Time 1 (due to practice or training), the pattern should be consistent: The highest-scoring Time 1 people should still be the highest-scoring people at Time 2. Test-retest reliability can apply whether the operationalization is self-report, observational, or physiological, but it’s most relevant when researchers are measuring constructs (such as intelligence, personality, or gratitude) they expect to be relatively stable. Happy mood, for example, may reasonably fluctuate from month to month or year to year for a particular person, so less consistency would be expected in this variable.


With interrater reliability, two or more independent observers will come up with consistent (or very similar) findings. Interrater reliability is most relevant for observational measures. Suppose you are assigned to observe the number of times each child smiles in 1 hour at a daycare playground. Your lab partner is assigned to sit on the other side of the playground and make his own count of the same children’s smiles. If, for one child, you record 12 smiles during the first hour, and your lab partner also records 12 smiles in that hour for the same child, there is interrater reliability. Any two observers watching the same children at the same time should agree about which child has smiled the most and which child has smiled the least.



The third kind of reliability, internal reliability, applies only to self-report scales with multiple items. Suppose a sample of people take Diener’s five-item subjective well-being scale. The questions on his scale are worded differently, but each item is intended to be a measure of the same construct. Therefore, people who agree with the first item on the scale should also agree with the second item (as well as with Items 3, 4, and 5). Similarly, people who disagree with the first item should also disagree with Items 2, 3, 4, and 5. If the pattern is consistent across items in this way, the scale has internal reliability.

Using a Scatterplot to Quantify Reliability

Before using a particular measure in a study they are planning, researchers collect data to see if it is reliable. Researchers may use two statistical devices for data analysis: scatterplots (see Chapter 3) and the correlation coefficient r (discussed below). In fact, evidence for reliability is a special example of an association claim: the association between one version of the measure and another, between one coder and another, or between an earlier time and a later time.

Here’s an example of how correlations are used to document reliability. Years ago, when people thought smarter people had larger heads, they may have tried to use head circumference as an operationalization of intelligence. Would this measure be reliable? Probably. Suppose you record the head circumference, in centimeters, for everyone in a classroom, using an ordinary tape measure. To see if the measurements were reliable, you could measure all the heads twice (test-retest reliability) or you could measure them first, and then have someone else measure them (interrater reliability).

Figure 5.2 shows how the results of such a measurement might look, in the form of a data table and a scatterplot. In the scatterplot, the first measurements of head circumference for four students are plotted on the y-axis. The circumferences as measured the second time (whether by you again, as in test-retest, or by a second observer, as in interrater) are plotted on the x-axis. In this scatterplot, each dot represents a person measured twice.

We would expect the two measurements of head circumference to be about the same for each person. They are, so the dots on the scatterplot all fall almost exactly on the sloping line that would indicate perfect agreement. The two measures won’t always be exactly the same because there is likely to be some measurement error that will lead to slightly different scores even for the same person (such as variations in the tape measure placement for each trial).
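To see how small measurement errors can coexist with strong agreement, consider the sketch below. The four pairs of head circumferences are hypothetical values chosen for illustration (they are not the data in Figure 5.2), and `pearson_r` is a helper written for this example rather than a function from any particular statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical head circumferences (cm) for four students, measured twice.
first = [50.0, 56.0, 61.0, 65.0]
second = [51.0, 55.5, 61.5, 64.0]

# Measurement error: how far apart the two readings are for each person.
errors = [round(b - a, 1) for a, b in zip(first, second)]
largest_error = max(abs(e) for e in errors)

r = pearson_r(first, second)
print("per-person differences (cm):", errors)
print(f"largest difference = {largest_error} cm, r = {r:.2f}")
```

No pair of readings matches exactly, yet the people with larger heads at the first measurement also have larger heads at the second, so the points would fall close to the line of perfect agreement.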


In a different scenario, suppose ten young children are being observed at a playground. Two independent observers, Mark and Matt, rate how happy each child appears to be, on a scale of 1 to 10. They later compare notes to see how well their ratings agree. From these notes, they create a scatterplot, plotting Observer Mark’s ratings on the x-axis and Observer Matt’s ratings on the y-axis.

If the data looked like those in Figure 5.3A, the ratings would have high interrater reliability. Both Mark and Matt rate Jay’s happiness as 9, making him one of the happiest kids on the playground. Observer Mark rates Jackie a 2, one of the least happy kids; Observer Matt agrees, rating her a 3, and so on. The two observers do not show perfect agreement, but there are no great disagreements about the happiest and least happy kids. Again, the points are scattered around the plot a bit, but they hover close to the sloping line that would indicate perfect agreement.

In contrast, suppose the data looked like Figure 5.3B, which shows much less agreement. Here, the two observers are Mark and Peter, and they are watching the same children at the same time, but Mark gives Jay a rating of 9 and Peter thinks he rates only a 6. Mark considers Jackie’s behavior to be shy and withdrawn and rates her a 2, but Peter thinks she seems calm and content and rates her a 7. Here the interrater reliability would be considered unacceptably low. One reason could be that the observers did not have a clear enough operational definition of “happiness” to work with. Another reason could be that one or both of the coders have not been trained well enough yet.

A scatterplot can thus be a helpful tool for visualizing the agreement between two administrations of the same measurement (test-retest reliability) or between two coders (interrater reliability). Using a scatterplot, you can see whether the two ratings agree (if the dots are close to a straight line drawn through them) or whether they disagree (if the dots scatter widely from a straight line drawn through them).

FIGURE 5.2 Two measurements of head circumference. (A) The data for four participants in table form. (B) The same data presented in a scatterplot. Head circumference (cm) is shown as a first measurement and a second measurement. Callouts: Kurt’s first measurement was 75; Kurt’s second measurement was 80.

Using the Correlation Coefficient r to Quantify Reliability

Scatterplots can provide a picture of a measure’s reliability. However, a more common and efficient way to see if a measure is reliable is to use the correlation coefficient. Researchers can use a single number, called a correlation coefficient, or r, to indicate how close the dots, or points, on a scatterplot are to a line drawn through them.

Notice that the scatterplots in Figure 5.4 differ in two important ways. One difference is that the scattered clouds of points slope in different directions. In Figure 5.4A and Figure 5.4B the points slope upward from left to right, in Figure 5.4C they slope downward, and in Figure 5.4D they do not slope up or down at all. This slope is referred to as the direction of the relationship, and the slope direction can be positive, negative, or zero; that is, sloping up, sloping down, or not sloping at all.

The other way the scatterplots differ is that in some, the dots are close to a straight, sloping line; in others, the dots are more spread out. This spread corresponds to the strength of the relationship. In general, the relationship is strong when dots are close to the line; it is weak when dots are spread out.

FIGURE 5.3 Interrater reliability. (A) Interrater reliability is high. (B) Interrater reliability is low. Panel A plots Observer Mark’s ratings against Observer Matt’s ratings; Panel B plots Observer Mark’s ratings against Observer Peter’s ratings.

(A) If the data show this pattern, it means Matt and Mark have good interrater reliability. Mark rated Jackie as one of the least happy children in the sample, and so did Matt. Mark rated Jay as one of the happiest children in the sample, and so did Matt.

(B) If the data show this pattern, it means Mark and Peter have poor interrater reliability. For example, they disagree about Jackie: Mark rated Jackie as one of the least happy children in the sample, but Peter rated her as one of the happiest.

❯❯ For more on the slope of a scatterplot, see Chapter 3, pp. 63–66.

The numbers below the scatterplots are the correlation coefficients, or r. The r indicates the same two things as the scatterplot: the direction of the relationship and the strength of the relationship, both of which psychologists use in evaluating reliability evidence. Notice that when the slope is positive, r is positive; when the slope is negative, r is negative. The value of r can fall only between 1.0 and –1.0. When the relationship is strong, r is close to either 1 or –1; when the relationship is weak, r is closer to zero. An r of 1.0 represents the strongest possible positive relationship, and an r of –1.0 represents the strongest possible negative relationship. If there is no relationship between two variables, r will be .00 or close to .00 (e.g., .02 or –.04).
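The link between the slope of a scatterplot and the sign of r can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the three small data sets are invented for illustration and do not correspond to the panels of Figure 5.4, and `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical x values paired with three different y patterns.
x = [1, 2, 3, 4, 5]
patterns = {
    "upward slope": [2, 4, 5, 4, 6],
    "downward slope": [6, 4, 5, 4, 2],
    "no slope": [3, 5, 1, 5, 3],
}

results = {label: pearson_r(x, y) for label, y in patterns.items()}
for label, r in results.items():
    print(f"{label}: r = {r:.2f}")
```

The upward pattern yields a positive r, its mirror image yields a negative r of the same size, and the pattern with no slope yields an r of essentially zero. In every case r stays between –1.0 and 1.0.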

Those are the basics. How do psychologists use the strength and direction of r to evaluate reliability evidence?

❮❮ For more on how to compute r, see Statistics Review: Descriptive Statistics, pp. 470–472.

FIGURE 5.4 Correlation coefficients. Notice the differences in the correlation coefficients (r) in these scatterplots. The correlation coefficient describes both the direction and the strength of the association between the two variables, regardless of the scale on which the variables are measured.

(Scatterplot panels not reproduced. The correlation coefficients printed below the four panels are r = .93, r = .56, r = –.59, and r = .01.)


To assess the test-retest reliability of some measure, we would measure the same set of participants on that measure at least twice. First we’d give the set of participants the measure at Time 1. Then we’d wait a while (say, 2 months), and contact the same set of people again, at Time 2. After recording each person’s score at Time 1 and Time 2, we could compute r. If r turns out to be positive and strong (for test-retest, we might expect .5 or above), we would have very good test-retest reliability. If r is positive but weak, we would know that participants’ scores on the test changed from Time 1 to Time 2.

A low r would be a sign of poor reliability if we are measuring something that should stay the same over time. For example, a trait like intelligence is not usually expected to change over a few months, so if we assess the test-retest reliability of an IQ test and obtain a low r, we would be doubtful about the reliability of this test. In contrast, if we were measuring flu symptoms or seasonal stress, we would expect test-retest reliabilities to be low, simply because these constructs do not stay the same over time.
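A short calculation shows why test-retest reliability depends on the pattern of scores rather than on their absolute level. The sketch below uses hypothetical IQ scores for five people, and `pearson_r` is a helper written for this example. The .5 benchmark is the rough guideline mentioned in the text above, not a fixed rule.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

BENCHMARK = 0.5  # rough test-retest guideline taken from the text

# Hypothetical IQ scores for the same five people at Time 1.
time1 = [95, 100, 105, 110, 120]

# Scenario 1: everyone gains 5 points from practice; rank order is unchanged.
time2_practice = [score + 5 for score in time1]

# Scenario 2: same scores, but people no longer hold the same relative position.
time2_scrambled = [110, 95, 120, 100, 105]

r_practice = pearson_r(time1, time2_practice)
r_scrambled = pearson_r(time1, time2_scrambled)

for label, r in [("practice gain", r_practice), ("scrambled", r_scrambled)]:
    verdict = "meets" if r >= BENCHMARK else "falls below"
    print(f"{label}: r = {r:.2f} ({verdict} the benchmark)")
```

A uniform gain leaves r at its maximum because the highest scorers at Time 1 are still the highest scorers at Time 2. When relative positions change, r drops toward zero even though the set of scores is the same.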


To test interrater reliability of some measure, we might ask two observers to rate the same participants at the same time, and then we would compute r. If r is positive and strong (according to many researchers, r = .70 or higher), we would have very good interrater reliability. If r is positive but weak, we could not trust the observers’ ratings. We would retrain the coders or refine our operational definition so it can be more reliably coded. A negative r would indicate a big problem. In the daycare example, that would mean Observer Mark considered Jay very happy but Observer Peter considered Jay very unhappy, Observer Mark considered Jackie unhappy but Peter considered Jackie happy, and so on. When we’re assessing reliability, a negative correlation is rare and undesirable.
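The playground example can be turned into a small calculation. In the sketch below, the ratings for Jay and Jackie come from the text; the other four children (labeled Child 3 through Child 6) and their ratings are hypothetical, added only so that r can be computed. `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

THRESHOLD = 0.70  # guideline for interrater reliability cited in the text

children = ["Jay", "Jackie", "Child 3", "Child 4", "Child 5", "Child 6"]
# Jay and Jackie ratings follow the text; the remaining values are hypothetical.
mark = [9, 2, 5, 7, 4, 6]
matt = [9, 3, 5, 8, 4, 6]
peter = [6, 7, 5, 6, 4, 6]

r_mark_matt = pearson_r(mark, matt)
r_mark_peter = pearson_r(mark, peter)

for pair, r in [("Mark and Matt", r_mark_matt), ("Mark and Peter", r_mark_peter)]:
    status = "acceptable" if r >= THRESHOLD else "too low"
    print(f"{pair}: r = {r:.2f} ({status})")
```

With these values, Mark and Matt clear the .70 guideline easily, whereas Mark and Peter do not, which is the situation that would prompt retraining the coders or refining the operational definition.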

Although r can be used to evaluate interrater reliability when the observers are rating a quantitative variable, a more appropriate statistic, called kappa, is used when the observers are rating a categorical variable. Although the computations are beyond the scope of this book, kappa measures the extent to which two raters place participants into the same categories. As with r, a kappa close to 1.0 means that the two raters agreed.
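For readers curious about what such a statistic involves, here is a minimal sketch. It assumes the kappa in question is Cohen’s kappa for two raters, which compares the observed proportion of agreement with the agreement expected by chance. The categories and codings below are hypothetical, and `cohens_kappa` is a helper written for this example.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Undefined (division by zero) if chance agreement equals 1, which
    happens when both raters use a single category for every case.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings: two observers classify ten children's play.
coder_1 = ["solo", "solo", "group", "group", "group",
           "solo", "group", "solo", "group", "group"]
coder_2 = ["solo", "solo", "group", "group", "solo",
           "solo", "group", "solo", "group", "group"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

Here the coders agree on 9 of 10 children, and half of that agreement could be expected by chance given how often each coder used each category, so kappa works out to .80.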


Internal reliability is relevant for measures that use more than one item to get at the same construct. On self-report scales such as Diener’s five-item subjective well-being scale, people answer the same question phrased in multiple ways. Researchers rephrase the items because any one way of wording the question might introduce measurement error. Researchers predict any such errors will cancel each other out when the items are summed up to form each person’s score.

Before combining the items on a self-report scale, researchers need to assess the scale’s internal reliability to evaluate whether people responded consistently to each item, despite the different wordings. Internal reliability means people gave consistent answers every time, no matter how the researchers asked the questions.

Let’s consider the following version of Diener’s well-being scale. Would a group of people give consistent responses to all five items? Would people who agree with Item 1 also agree with Items 2, 3, and 4?

1. In most ways my life is close to my ideal.

2. The conditions of my life are excellent.

3. I am fond of polka dots.

4. I am a good swimmer.

5. If I could live my life over, I would change almost nothing.

Obviously, these items do not seem to go together, so we could not average them together for a meaningful well-being score. Items 1 and 2 are probably correlated, since they are similar to each other, but Items 1 and 3 are probably not correlated, since people can like polka dots whether or not they are living their ideal lives. Item 4 doesn’t seem to go with any other item, either. But how could we quantify these intuitions about internal reliability?

Researchers typically will run a correlation-based statistic called Cronbach’s alpha (or coefficient alpha) to see if their measurement scales have internal reliability. First, they collect data on the scale from a large sample of participants, and then they compute all possible correlations among the items. The formula for Cronbach’s alpha returns one number, computed from the average of the inter-item correlations and the number of items in the scale. The closer the Cronbach’s alpha is to 1.0, the better the scale’s reliability. (For self-report measures, researchers are looking for Cronbach’s alpha of .70 or higher.) If Cronbach’s alpha is high, there is good internal reliability and researchers can sum all the items together. If Cronbach’s alpha is less than .70, then internal reliability is poor and the researchers are not justified in combining all the items into one scale. They have to go back and revise the items, or they might select only those items that were found to correlate strongly with one another.
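The description above (an average inter-item correlation combined with the number of items) corresponds to what is often called the standardized form of alpha: alpha = k × mean r / (1 + (k − 1) × mean r), where k is the number of items. The sketch below applies that form to hypothetical responses from five participants; a real analysis would use a much larger sample, and statistical software may report a variance-based version of alpha that can differ slightly. The item names and the helpers `pearson_r` and `standardized_alpha` are written for this example.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def standardized_alpha(items):
    """Alpha from the mean inter-item correlation and the number of items."""
    k = len(items)
    pairs = list(combinations(items.values(), 2))
    mean_r = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Hypothetical 1-7 responses from five participants (one list per item).
well_being_items = {
    "close to ideal": [2, 4, 5, 6, 7],
    "conditions excellent": [1, 4, 4, 6, 7],
    "would change nothing": [2, 3, 5, 5, 7],
}
# The same items plus one that does not belong with the others.
with_polka_dots = dict(well_being_items, **{"fond of polka dots": [5, 2, 6, 1, 4]})

alpha_core = standardized_alpha(well_being_items)
alpha_mixed = standardized_alpha(with_polka_dots)
print(f"three related items: alpha = {alpha_core:.2f}")
print(f"with unrelated item: alpha = {alpha_mixed:.2f}")
```

With these made-up responses, the three related items produce an alpha well above .70, and adding the unrelated polka-dot item pulls alpha below .70. That drop is the kind of evidence that would lead researchers to revise or remove an item.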

Reading About Reliability in Journal Articles

Authors of empirical journal articles usually present reliability information for the measures they are using. One example of such evidence is in Figure 5.5, which comes from an actual journal article. According to the table, the subjective well-being scale, called Satisfaction with Life (SWL), was used in six studies. The table shows the internal reliability (labeled as coefficient alpha) from each of these studies, as well as test-retest reliability for each one. The table did not present interrater reliability because the scale is a self-report measure, and interrater reliability is relevant only when two or more observers are doing the ratings. Based on the evidence in this table, we can conclude the subjective well-being scale has excellent internal reliability and excellent test-retest reliability. You’ll see another example of how reliability is discussed in a journal article in the Working It Through section at the end of this chapter.

FIGURE 5.5 Reliability of the well-being scale. The researchers created this table to show how six studies supported the internal reliability and test-retest reliability of their SWL scale. (Source: Pavot & Diener, 1993, Table 2.) Annotations on the table: the first column lists the authors of each study using the SWL scale; a coefficient (Cronbach’s) alpha above .70 means the SWL scale has good internal reliability; a high correlation of r = .83 for retesting 2 weeks apart means the scale has good test-retest reliability.



1. Reliability is about consistency. Define the three kinds of reliability, using the word consistent in each of your definitions.

2. For each of the three common types of operationalizations (self-report, observational, and physiological), indicate which type(s) of reliability would be relevant.

3. Which of the following correlations is the strongest: r = .25, r = −.65, r = −.01, or r = .43?

1. See pp. 125–126. 2. Self-report: test-retest and internal may be relevant; observational: interrater would be relevant; physiological: interrater may be relevant. 3. r = −.65.
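The answer to question 3 rests on the idea that strength is judged by how far r is from zero, regardless of sign. A brief sketch, using the four values from the question:

```python
# The four correlations listed in review question 3.
candidates = [0.25, -0.65, -0.01, 0.43]

# Strength depends on the absolute value of r, not on its sign.
strongest = max(candidates, key=abs)
weakest = min(candidates, key=abs)

print(f"strongest: r = {strongest}, weakest: r = {weakest}")
```

Ranking by absolute value picks out r = −.65 as the strongest relationship, even though it is negative, and r = −.01 as the weakest.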

VALIDITY OF MEASUREMENT: DOES IT MEASURE WHAT IT’S SUPPOSED TO MEASURE?

Before using particular operationalizations in a study, researchers not only check to be sure the measures are reliable; they also want to be sure they get at the conceptual variables they were intended for. That’s construct validity. You might ask whether the five-item well-being scale Diener’s team uses really reflects how subjectively happy people are. You might ask if a self-report measure of gratitude really reflects how thankful people are. You might ask if recording the value of the car a person drives really reflects that person’s wealth.

Measurement reliability and measurement validity are separate steps in establishing construct validity. To demonstrate the difference between them, consider the example of head circumference as an operationalization of intelligence. Although head size measurements may be very reliable, almost all studies have shown that head circumference is not related to intelligence (Gould, 1996). Therefore, like a bathroom scale that always reads too light (Figure 5.6), the head circumference test may be reliable, but it is not valid as an intelligence test: It does not measure what it’s supposed to measure.

Measurement Validity of Abstract Constructs

Does anyone you know use an activity monitor? Your friends may feel proud when they reach a daily steps goal or boast about how many miles they’ve covered that day (Figure 5.7). How can you know for sure these pedometers are accurate? Of course, it’s straightforward to evaluate the validity of a pedometer: You’d simply walk around, counting your steps while wearing one, then compare your own count to that of your device. If you’re sure you walked 200 steps and your pedometer says you walked 200, then your device is valid. Similarly, if your pedometer counted the correct distance after you’ve walked around a track or some other path with a known mileage, it’s probably a valid monitor.

In the case of an activity monitor, we are lucky to have concrete, straightforward standards for accurate measurement. But psychological scientists often want to measure abstract constructs such as happiness, intelligence, stress, or self-esteem, which we can’t simply count (Cronbach & Meehl, 1955; Smith, 2005a, 2005b). Construct validity is therefore important in psychological research, especially when a construct is not directly observable. Take happiness: We have no means of directly measuring how happy a person is. We could estimate it in a number of ways, such as scores on a well-being inventory, daily smile rate, blood pressure, stress hormone levels, or even the activity levels of certain brain regions. Yet each of these measures of happiness is indirect. For some abstract constructs, there is no single, direct measure. And that is the challenge: How can we know if indirect operational measures of a construct are really measuring happiness and not something else?

We know by collecting a variety of data. Before using a measure in a study, researchers evaluate the measure’s validity, by either collecting their own data or reading about the data collected by others. Furthermore, the evidence for construct validity is always a matter of degree. Psychologists do not say a particular measure is or is not valid. Instead, they ask: What is the weight of evidence in favor of this measure’s validity? There are a number of kinds of evidence that can convince a researcher, and we’ll discuss them below. First, take a look at Figure 5.8, an overview of the reliability and validity concepts covered in this chapter.

FIGURE 5.6 Reliability is not the same as validity. This person’s bathroom scale may report that he weighs 50 pounds (22.7 kg) every time he steps on it. The scale is certainly reliable, but it is not valid.

FIGURE 5.7 Are activity monitors valid? A friend wore a pedometer during a hike and recorded these values. What data could you collect to know whether or not it accurately counted his steps?

Face Validity and Content Validity: Does It Look Like a Good Measure?

A measure has face validity if it is subjectively considered to be a plausible operationalization of the conceptual variable in question. If it looks like a good measure, it has face validity. Head circumference has high face validity as a measurement of hat size, but it has low face validity as an operationalization of intelligence. In contrast, speed of problem solving, vocabulary size, and creativity have higher face validity as operationalizations of intelligence. Researchers generally check face validity by consulting experts. For example, we might assess the face validity of Diener’s well-being scale by asking a panel of judges (such as personality psychologists) their opinion on how reasonable the scale is as a way of estimating happiness.

FIGURE 5.8 A concept map of measurement reliability and validity.

The map shows construct validity alongside internal validity, statistical validity, and external validity, and lists the following concepts:

Reliability: Do you get consistent scores every time? Reliability is necessary, but not sufficient, for validity.
  Test-retest reliability: People get consistent scores every time they take the test.
  Interrater reliability: Two coders’ ratings of a set of targets are consistent with each other.
  Internal reliability: People give consistent responses on every item of a questionnaire.

Two subjective ways to assess validity
  Face validity: It looks like what you want to measure.
  Content validity: The measure contains all the parts that your theory says it should contain.

Three empirical ways to assess validity
  Convergent validity: Your self-report measure is more strongly associated with self-report measures of similar constructs.
  Discriminant validity: Your self-report measure is less strongly associated with self-report measures of dissimilar constructs.
  Criterion validity: Your measure is correlated with a relevant behavioral outcome.

Content validity also involves subjective judgment. To ensure content validity, a measure must capture all parts of a defined construct. For example, consider this conceptual definition of intelligence, which contains distinct elements, including the ability to “reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience” (Gottfredson, 1997, p. 13). To have adequate content validity, any operationalization of intelligence should include questions or items to assess each of these seven components. Indeed, most IQ tests have multiple categories of items, such as memory span, vocabulary, and problem-solving sections.

Criterion Validity: Does It Correlate with Key Behaviors?

To evaluate a measurement’s validity, face and content validity are a good place to start, but most psychologists rely on more than a subjective judgment: They prefer to see empirical evidence. There are several ways to collect data on a measure, but in all cases, the point is to make sure the measurement is associated with something it theoretically should be associated with. In some cases, such relationships can be illustrated by using scatterplots and correlation coefficients. They can be illustrated with other kinds of evidence too, such as comparisons of groups with known properties.


Criterion validity evaluates whether the measure under consideration is associated with a concrete behavioral outcome that it should be associated with, according to the conceptual definition. Suppose you work for a company that wants to predict how well job applicants would perform as salespeople. Of the several commercially available tests of sales aptitude, which one should the company use? You have two choices, which we’ll call Aptitude Test A and Aptitude Test B. Both have items that look good in terms of face validity: they ask about a candidate’s motivation, optimism, and interest in sales. But do the test scores correlate with a key behavior: success in selling? It’s an empirical question. Your company can collect data to tell them how well each of the two aptitude tests is correlated with success in selling.

To assess criterion validity, your company could give each sales test to all the current sales representatives and then find out each person’s sales figures, a measure of their selling performance. You would then compute two correlations: one between Aptitude Test A and sales figures, and the other between Aptitude Test B and sales figures. Figure 5.9A shows scatterplot results for Test A. The score on Aptitude Test A is plotted on the x-axis, and actual sales figures are plotted on the y-axis. (Alex scored 39 on the test and brought in $38,000 in sales, whereas Irina scored 98 and brought in $100,000.) Figure 5.9B, in contrast, shows the association of sales performance with Aptitude Test B.

Looking at these two scatterplots, we can see that the correlation in the first one is much stronger than in the second one. In other words, future sales performance is correlated more strongly with scores on Aptitude Test A than with scores on Aptitude Test B. If the data looked like this, the company would conclude that Aptitude Test A has better criterion validity as a measure of selling ability, and this is the one they should use for selecting salespeople. In contrast, the other data show that scores on Aptitude Test B are a poorer indicator of future sales performance; it has poor criterion validity as a measure of sales aptitude.
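The comparison between the two aptitude tests amounts to computing two correlations and seeing which is larger. In the sketch below, the Test A scores and sales figures for Alex and Irina come from the text; every other value, including all of the Test B scores, is hypothetical and chosen only to mimic the contrast in Figure 5.9. `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Five sales representatives. Alex is first and Irina is last; the three
# people in between, and all Test B scores, are hypothetical.
sales_thousands = [38, 55, 62, 80, 100]
test_a_scores = [39, 52, 65, 79, 98]
test_b_scores = [120, 105, 135, 110, 125]

r_test_a = pearson_r(test_a_scores, sales_thousands)
r_test_b = pearson_r(test_b_scores, sales_thousands)

better = "Aptitude Test A" if r_test_a > r_test_b else "Aptitude Test B"
print(f"Test A vs. sales: r = {r_test_a:.2f}")
print(f"Test B vs. sales: r = {r_test_b:.2f}")
print(f"Stronger criterion validity evidence: {better}")
```

With these values, Test A scores track sales closely while Test B scores barely relate to sales, so the evidence favors Test A as a measure of selling ability.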

Criterion validity is especially important for self-report measures because the correlation can indicate how well people’s self-reports predict their actual behavior. Criterion validity provides some of the strongest evidence for a measure’s construct validity.

FIGURE 5.9 Correlational evidence for criterion validity. (A) Aptitude Test A strongly predicts sales performance, so criterion validity is high. (B) Aptitude Test B does not predict sales as well, so criterion validity is lower. A company would probably want to use Test A for identifying potential selling ability when selecting salespeople. In each panel, the aptitude test score is on the x-axis and sales figures (thousands of dollars) are on the y-axis. Callout for Panel B: Spread-out dots indicate lower correlation between Test B scores and sales figures, so this test has lower criterion validity as a measure of sales performance.

Here’s another example. Most colleges in the United States use standardized tests, such as the SAT and ACT, to measure the construct “aptitude for college-level work.” To demonstrate that these tests have criterion validity, an educational psychologist might want to show that scores on these measures are correlated with college grades (a behavioral outcome that represents “college-level work”).

Gallup presents criterion validity evidence for the 10-point Ladder of Life scale they use to measure happiness. They report that Ladder of Life scores correlate with key behavioral outcomes, such as becoming ill and missing work (Gallup, n.d.).

If an IQ test has criterion validity, it should be correlated with behaviors that capture the construct of intelligence, such as how fast people can learn a complex set of symbols (an outcome that represents the conceptual definition of intelligence). Of course, the ability to learn quickly is only one component of that definition. Further criterion validity evidence could show that IQ scores are correlated with other behavioral outcomes that are theoretically related to intelligence, such as the ability to solve problems and indicators of life success (e.g., graduating from college, being employed in a high-level job, earning a high income).


Another way to gather evidence for criterion validity is to use a known-groups paradigm, in which researchers see whether scores on the measure can discriminate among two or more groups whose behavior is already confirmed. For example, to validate the use of salivary cortisol as a measure of stress, a researcher could compare the salivary cortisol levels in two groups of people: those who are about to give a speech in front of a classroom, and those who are in the audience. Public speaking is recognized as being a stressful situation for almost everyone. Therefore, if salivary cortisol is a valid measure of stress, people in the speech group should have higher levels of cortisol than those in the audience group.

Lie detectors are another good example. These instruments record a set of physiological measures (such as skin conductance and heart rate) whose levels are supposed to indicate which of a person’s statements are truthful and which are lies. If skin conductance and heart rate are valid measures of lying, we could conduct a known-groups test in which we know in advance which of a person’s statements are true and which are false. The physiological measures should be elevated only for the lies, not for the true statements. (For a review of the mixed evidence on lie detection, see Saxe, 1991.)

The known-groups method can also be used to validate self-report measures. Psychiatrist Aaron Beck and his colleagues developed the Beck Depression Inventory (BDI), a 21-item self-report scale with items that ask about major symptoms of depression (Beck, Ward, Mendelson, Mock, & Erbaugh, 1961). Participants circle one of four choices, such as the following:

0 I do not feel sad.

1 I feel sad.

2 I am sad all the time and I can’t snap out of it.

3 I am so sad or unhappy that I can’t stand it.

0 I have not lost interest in other people.

1 I am less interested in other people than I used to be.

2 I have lost most of my interest in other people.

3 I have lost all of my interest in other people.

A clinical scientist adds up the scores on each of the 21 items for a total BDI score, which can range from a low of 0 (not at all depressed) to a high of 63.
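The scoring rule just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic (21 items, each scored 0 to 3, summed to a total between 0 and 63); the function name `bdi_total` is a hypothetical choice and is not part of any published scoring software.

```python
def bdi_total(item_scores):
    """Sum 21 item responses, each scored 0-3, into a total from 0 to 63."""
    if len(item_scores) != 21:
        raise ValueError("expected 21 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    return sum(item_scores)


# A respondent who circles 0 on every item gets the minimum total;
# one who circles 3 on every item gets the maximum.
lowest = bdi_total([0] * 21)
highest = bdi_total([3] * 21)
```

The range check makes the stated bounds explicit: 21 items times a maximum of 3 points each gives the ceiling of 63.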

To test the criterion validity of the BDI, Beck and his colleagues gave this self-report scale to two known groups of people. They knew one group was suffering from clinical depression and the other group was not because they had asked psychiatrists to conduct clinical interviews and diagnose each person. The researchers computed the mean BDI scores of the two groups and created a bar graph, shown in Figure 5.10. The evidence supports the criterion validity of the BDI. The graph shows the expected result: the average BDI score of the known group of depressed people was higher than the average score of the known group who were not depressed. Because its criterion validity was established in this way, the BDI is still widely used today when researchers need a quick and valid way to identify new people who are vulnerable to depression.

Beck also used the known-groups paradigm to calibrate low, medium, and high scores on the BDI. When the psychiatrists interviewed the people in the sample, they evaluated not only whether they were depressed but also the level of depression in each person: none, mild, moderate, or severe. As expected, the BDI scores of the groups rose as their level of depression (assessed by psychiatrists) was more severe (Figure 5.11). This result was even clearer evidence that the BDI was a valid measure of depression. With the BDI, clinicians and researchers can confidently use specific ranges of BDI scores to categorize how severe a person’s depression might be.
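A known-groups check of this kind amounts to comparing group means. The sketch below uses invented BDI totals (they are not Beck et al.’s data) and hypothetical helper names, `group_means` and `means_increase`, to show what “scores rose as the psychiatrists’ rating became more severe” looks like as a computation.

```python
from statistics import mean

# Hypothetical BDI totals, invented for illustration only.
# Keys are the psychiatrists' ratings described in the text.
scores_by_rating = {
    "none": [4, 7, 9, 11],
    "mild": [14, 17, 19, 22],
    "moderate": [23, 26, 28, 31],
    "severe": [30, 34, 38, 42],
}


def group_means(groups):
    """Return the mean score for each known group."""
    return {label: mean(values) for label, values in groups.items()}


def means_increase(groups, order):
    """True if the group means rise strictly in the given order of groups."""
    means = group_means(groups)
    ordered = [means[label] for label in order]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


pattern_ok = means_increase(
    scores_by_rating, ["none", "mild", "moderate", "severe"]
)
```

If `pattern_ok` were false for real data, the measure would fail to discriminate among the known groups, which would count against its criterion validity.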

FIGURE 5.10 BDI scores of two known groups. This pattern of results provides evidence for the criterion validity of the BDI using the known-groups method. Clients judged to be depressed by psychiatrists also scored higher. (Source: Adapted from Beck et al., 1961.)
[Bar graph. Y-axis: BDI score. X-axis: psychiatrists’ judgment (not depressed vs. depressed). Annotation: “If it’s a valid measure of depression, BDI should be higher here, as this group has already been diagnosed with depression.”]

Diener’s subjective well-being (SWB) scale is another example of using the known-groups paradigm for criterion validity. In one review article, he and his colleague presented the SWB scale averages from several different studies. Each study had given the SWB scale to different groups of people who could be expected to vary in happiness level (Pavot & Diener, 1993). For example, male prison inmates, a group that would be expected to have low subjective well-being, showed a lower mean score on the scale, compared with Canadian college students, who averaged much higher, as indicated by the M column in Table 5.3. Such known-groups patterns provide strong evidence for the criterion validity of the SWB scale. Researchers can use this scale in their studies with confidence.

What about the Ladder of Life scale, the measure of happiness used in the Gallup-Healthways Well-Being Index? This measure also has some known-groups evidence to support its criterion validity. For one, Gallup reported that Americans’ well-being was especially low in 2008 and 2009, a period corresponding to a significant downturn in the U.S. economy. Well-being is a little bit higher in American summer months, as well. These results fit what we would expect if the Ladder of Life is a valid measure of well-being.

Convergent Validity and Discriminant Validity: Does the Pattern Make Sense?

Criterion validity examines whether a measure correlates with key behavioral outcomes. Another form of validity evidence is whether there is a meaningful pattern of similarities and differences among self-report measures. A self-report measure should correlate more strongly with self-report measures of similar constructs than it does with those of dissimilar constructs. The pattern of correlations with measures of theoretically similar and dissimilar constructs is called convergent validity and discriminant validity (or divergent validity), respectively.


TABLE 5.3 Subjective Well-Being (SWB) Scores for Known Groups from Several Studies

Sample                                    N     M      SD    Study
American college students                 244   23.7   6.4   Pavot & Diener (1993)
French Canadian college students (male)   355   23.8   6.1   Blais et al. (1989)
Korean university students                413   19.8   5.8   Suh (1993)
Printing trade workers                    304   24.2   6.0   George (1991)
Veterans Affairs hospital inpatients      52    11.8   5.6   Frisch (1991)
Abused women                              70    20.7   7.4   Fisher (1991)
Male prison inmates                       75    12.3   7.0   Joy (1990)

Note: N = Number of people in group. M = Group mean on SWB. SD = Group standard deviation. Source: Adapted from Pavot & Diener, 1993, Table 1.

FIGURE 5.11 BDI scores of four known groups. This pattern of results means it is valid to use BDI cutoff scores to decide if a person has mild, moderate, or severe depression. (Source: Adapted from Beck et al., 1961.)
[Bar graph. Y-axis: BDI score. X-axis: psychiatrists’ rating (none, mild, moderate, severe).]

As an example of convergent validity, let’s consider Beck’s depression scale, the BDI, again. One team of researchers wanted to test the convergent and discriminant validity of the BDI (Segal, Coolidge, Cahill, & O’Riley, 2008). If the BDI really quantifies depression, the researchers reasoned, it should be correlated with (should converge with) other self-report measures of depression. Their sample of 376 adults completed the BDI and a number of other questionnaires, including a self-report instrument called the Center for Epidemiologic Studies Depression scale (CES-D).

As expected, BDI scores were positively, strongly correlated with CES-D scores (r = .68). People who scored as depressed on the BDI also scored as depressed on the CES-D; likewise, those who scored as not depressed on the BDI also scored as not depressed on the CES-D. Figure 5.12 shows a scatterplot of the results. (Notice that most of the dots fall in the lower left portion of the scatterplot because most people in the sample are not depressed; they score low on both the BDI and the CES-D.) This correlation between similar self-report measures of the same construct (depression) provided good evidence for the convergent validity of the BDI.

Testing for convergent validity can feel circular. Even if researchers validate the BDI with the CES-D, for instance, there is no assurance that the CES-D measure is a gold standard. Its validity would need to be established, too! The researchers might next try to validate the CES-D with a third measure, but that measure’s validity would also need to be supported with evidence. Eventually, however, they might be satisfied that a measure is valid after evaluating the weight and pattern of the evidence. Many researchers are most convinced when measures are shown to predict actual behaviors (using criterion validity). However, no single definitive outcome will establish validity (Smith, 2005a).

❯❯ For more on the strength of correlations, see Chapter 8, Table 8.4, and Statistics Review: Descriptive Statistics, pp. 468–472.

FIGURE 5.12 Evidence supporting the convergent validity of the BDI. The BDI is strongly correlated with another measure of depression, the CES-D (r = .68), providing evidence for convergent validity. (Source: Segal et al., 2008.)
[Scatterplot. Axes: BDI total score and CES-D total score.]

This example of convergent validity is somewhat obvious: A measure of depression should correlate with a different measure of the same construct, depression. But convergent validity evidence also includes similar constructs, not just the same one. The researchers showed, for instance, that the BDI scores were strongly correlated with a score quantifying psychological well-being (r = -.65). The observed strong, negative correlation made sense as a form of convergent validity because people who are depressed are expected to also have lower levels of well-being (Segal, Coolidge, Cahill, & O’Riley, 2008).


The BDI should not correlate strongly with measures of constructs that are very different from depression; it should show discriminant validity with them. For example, depression is not the same as a person’s perception of his or her overall physical health. Although mental health problems, including depression, do overlap somewhat with physical health problems, we would not expect the BDI to be strongly correlated with a measure of perceived physical health problems. More importantly, we would expect the BDI to be more strongly correlated with the CES-D and well-being than it is with physical health problems. Sure enough, Segal and his colleagues found a correlation of only r = .16 between the BDI and a measure of perceived physical health problems. This weak correlation shows that the BDI is different from people’s perceptions of their physical health, so we can say that the BDI has discriminant validity with physical health problems. Figure 5.13 shows a scatterplot of the results.

FIGURE 5.13 Evidence supporting the discriminant validity of the BDI. As expected, the BDI is only weakly correlated with perceived health problems (r = .16), providing evidence for discriminant validity. (Source: Segal et al., 2008.)
[Scatterplot. Axes: BDI total score and physical health problems.]

Notice also that most of the dots fall in the lower left portion of the scatterplot because most people in the sample reported few health problems and were not depressed: They score low on the BDI and on the perceived physical health problems scale.

As another example, consider that many developmental disorders have similar symptoms, but diagnoses can vary. It might be important to specify, for instance, whether a child has autism or whether she has a language delay only. Therefore, a screening instrument for identifying autism should have discriminant validity; it should not accidentally diagnose that same child as having a language delay. Similarly, a scale that is supposed to diagnose learning disabilities should not be correlated with IQ because learning disabilities are not related to general intelligence.

It is usually not necessary to establish discriminant validity between a measure and something that is completely unrelated. Because depression is not likely to be associated with how many movies you watch or the number of siblings you have, we would not need to examine its discriminant validity with these variables. Instead, researchers focus on discriminant validity when they want to be sure their measure is not accidentally capturing a similar but different construct. Does the BDI measure depression or perceived health problems? Does Diener’s SWB scale measure enduring happiness or just temporary mood? Does a screening technique identify autism or language delay?

Convergent validity and discriminant validity are usually evaluated together, as a pattern of correlations among self-report measures. A measurement should have higher correlations (higher r values) with similar traits (convergent validity) than it does with dissimilar traits (discriminant validity). There are no strict rules for what the correlations should be. Instead, the overall pattern of convergent and discriminant validity helps researchers decide whether their operationalization really measures the construct they want it to measure.
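The pattern described here can be checked by computing correlation coefficients and comparing their strength. The sketch below is a minimal illustration: `pearson_r` implements the standard Pearson formula, `shows_expected_pattern` is a hypothetical helper name, and the scores are invented for the example rather than taken from Segal et al. (2008). Strength is compared with absolute values because a strong negative correlation (such as the one between depression and well-being) also counts as convergent evidence.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def shows_expected_pattern(target, similar, dissimilar):
    """True if the target correlates more strongly (in absolute value)
    with the similar measure than with the dissimilar measure."""
    return abs(pearson_r(target, similar)) > abs(pearson_r(target, dissimilar))


# Invented scores for six hypothetical respondents.
depression_a = [2, 5, 9, 14, 20, 27]      # measure being validated
depression_b = [4, 6, 12, 15, 24, 30]     # similar construct
health_problems = [3, 1, 4, 1, 5, 2]      # dissimilar construct

pattern_ok = shows_expected_pattern(depression_a, depression_b, health_problems)
```

As the text notes, there is no fixed cutoff for either correlation; what the comparison captures is the relative ordering of the two.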

The Relationship Between Reliability and Validity

One essential point is worth reiterating: The validity of a measure is not the same as its reliability. A journalist might boast that some operationalization of behavior is “a very reliable test,” but to say that a measure is “reliable” is only half the story. Determining head circumference might be extremely reliable, but it still may not be valid for assessing intelligence.

Although a measure may be less valid than it is reliable, it cannot be more valid than it is reliable. Intuitively, this statement makes sense. Reliability has to do with how well a measure correlates with itself. For example, an IQ test is reliable if it is correlated with itself over time. Validity, however, has to do with how well a measure is associated with something else, such as a behavior that indicates intelligence. An IQ test is valid if it is associated with another variable, such as school grades or life success. If a measure does not even correlate with itself, then how can it be more strongly associated with some other variable?

As another example, suppose you used your pedometer to count how many steps there are in your daily walk from your parking spot to your building. If the pedometer reading is very different day to day, then the measure is unreliable—and of course, it also cannot be valid because the true distance of your walk has not changed. Therefore, reliability is necessary (but not sufficient) for validity.


1. What do face validity and content validity have in common?

2. Many researchers believe criterion validity is more important than convergent and discriminant validity. Can you see why?

3. Which requires stronger correlations for its evidence: convergent validity or discriminant validity? Which requires weaker correlations?

4. Can a measure be reliable but not valid? Can it be valid but unreliable?

1. See pp. 134–135. 2. Because only criterion validity establishes how well a measure correlates with a behavioral outcome, not simply with other self-report measures; see pp. 135–142. 3. Convergent validity. 4. It can be reliable but not valid, but a measure cannot be valid if it is unreliable; see pp. 142–143.

REVIEW: INTERPRETING CONSTRUCT VALIDITY EVIDENCE

Before using a stopwatch in a track meet, a coach wants to be sure the stopwatch is working well. Before taking a patient’s blood pressure, a nurse wants to be sure the cuff she’s using is reliable and accurate. Similarly, before conducting a study, researchers want to be sure the measures they plan to use are reliable and valid ones. When you read a research study, you should be asking: Did the researchers collect evidence that the measures they are using have construct validity? If they didn’t do it themselves, did they review construct validity evidence provided by others?

In empirical journal articles, you’ll usually find reliability and validity information in the Method section, where the authors describe their measures. How do you recognize this evidence, and how can you interpret it? The Working It Through section shows how such information might be presented, using a study conducted by Gordon, Impett, Kogan, Oveis, & Keltner (2012) as an example.


FIGURE 5.14 Items in the Appreciation in Relationships (AIR) Scale. These items were used by the researchers to measure how much people appreciate their relationship partner. Do you think these items have face validity as a measure of appreciation? (Source: Gordon et al., 2012.)

1. I tell my partner often that s/he is the best.

2. I often tell my partner how much I appreciate her/him.

3. At times I take my partner for granted. (reverse scored item)

4. I appreciate my partner.

5. Sometimes I don’t really acknowledge or treat my partner like s/he is someone special. (reverse scored item)

6. I make sure my partner feels appreciated.

7. My partner sometimes says that I fail to notice the nice things that s/he does for me. (reverse scored item)

8. I acknowledge the things that my partner does for me, even the really small things.

9. I am sometimes struck with a sense of awe and wonder when I think about my partner being in my life.
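Three of the items above are marked as reverse scored, which means their ratings are flipped before the items are combined. The sketch below shows one common way to do this. Everything specific in it is an assumption made for illustration: the excerpt does not state the response scale, so a 1-to-7 rating is assumed; averaging (rather than summing) the items is also an assumption; and the function names are hypothetical. Only the item numbers 3, 5, and 7 come from the figure.

```python
# Item numbers marked "(reverse scored item)" in the list above.
REVERSE_SCORED = {3, 5, 7}


def reverse_score(raw, scale_min=1, scale_max=7):
    """Flip a rating so the lowest value becomes the highest and vice versa.
    The 1-7 default is an assumed response scale, not taken from the source."""
    return scale_min + scale_max - raw


def air_mean(responses, scale_min=1, scale_max=7):
    """Average the item ratings after flipping the reverse-scored items.
    `responses` maps item number to the raw rating for that item."""
    adjusted = [
        reverse_score(rating, scale_min, scale_max)
        if item in REVERSE_SCORED
        else rating
        for item, rating in responses.items()
    ]
    return sum(adjusted) / len(adjusted)


# A hypothetical, maximally appreciative respondent: highest rating on the
# regular items and lowest rating on the reverse-scored items.
example = {1: 7, 2: 7, 3: 1, 4: 7, 5: 1, 6: 7, 7: 1, 8: 7, 9: 7}
example_mean = air_mean(example)
```

Reverse scoring ensures that a higher combined score always means more appreciation, even though agreeing with items 3, 5, and 7 signals less of it.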


The evidence reported by Gordon et al. (2012) supports the AIR scale as a reliable and valid scale (Figure 5.14). It has internal and test-retest reliability, and there is evidence of its convergent, discriminant, and criterion validity. The researchers were confident they could use AIR when they later tested their hypothesis that more appreciative couples would have healthier relationships. Many of their hypotheses about gratitude (operationalized by the AIR scale) were supported. One of the more dramatic results was from a study that followed couples over time. The authors reported: “We found that people who were more appreciative of their partners were significantly more likely to still be in their relationships at the 9-month follow-up” (Gordon et al., 2012, p. 268).

This empirical journal article illustrates how researchers use data to establish the construct validity of the measure they plan to use ahead of time, before going on to test their hypotheses. Their research helps support the headline, “Gratitude is for lovers.”


How Well Can We Measure the Amount of Gratitude Couples Express to Each Other?

What do partners bring to a healthy romantic relationship? One research team proposed that gratitude toward one’s partner would be important (Gordon et al., 2012). They predicted that when people are appreciative of their partners, close relationships are happier and last longer. In an empirical journal article, the researchers reported how they tested this hypothesis. But before they could study how the concept of gratitude contributes to relationship health, they needed to be able to measure the variable “gratitude” in a reliable and valid way. They created and tested the AIR scale, for Appreciation in Relationships. We will work through ways this example illustrates concepts from Chapter 5.


Conceptual and Operational Definitions

How did they operationalize the conceptual variable “gratitude”?

“In the first step, we created an initial pool of items based on lay knowledge, theory, and previous measures of appreciation and gratitude. . . . These items were designed to capture a broad conceptualization of appreciation by including items that assess the extent to which people recognize and value their partner as a person as well as the extent to which they are grateful for a partner’s kind deeds. . . .” (p. 260).

This quoted passage describes how Gordon and her colleagues developed and selected the AIR items. Notice how they wrote items to capture their conceptual definition of gratitude.





Was the AIR scale reliable? Did the scale give consistent scores?

Internal Reliability

When a self-report scale has multiple items, it should have good internal reliability. Did the AIR scale have good internal reliability?

“In the initial online survey, participants completed a questionnaire with basic demographic information. Participants completed the AIR scale … α = .87” (p. 266).

In this passage, the authors report the internal reliability of the AIR scale. The value α = .87 indicates that people in the Gordon study answered all the AIR items consistently. A Cronbach’s alpha of .87 is considered good internal reliability because it is close to 1.0.
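For readers curious how such a value is obtained, the sketch below computes Cronbach’s alpha with the standard formula: the number of items k, divided by k minus 1, times one minus the ratio of the summed item variances to the variance of respondents’ total scores. The function name and the small data sets are hypothetical and are not Gordon et al.’s data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a multi-item scale.

    `item_scores` is a list with one inner list per item; each inner list
    holds that item's ratings, with respondents in the same order.
    """
    k = len(item_scores)
    # Total score for each respondent across all items.
    totals = [sum(person) for person in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Three invented items answered identically by four respondents:
# perfectly consistent answers.
consistent_items = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
alpha_consistent = cronbach_alpha(consistent_items)

# Two invented items whose answers agree less well.
mixed_items = [[1, 2, 3, 4], [2, 1, 4, 3]]
alpha_mixed = cronbach_alpha(mixed_items)
```

The more consistently people answer the items, the larger the variance of the totals relative to the item variances, and the closer alpha gets to 1.0.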

Test-Retest Reliability

We might expect the AIR scale to have test-retest reliability because gratitude should be stable over time. Were the AIR scores stable over time?

“The AIR scale had strong test-retest reliability from baseline to the 9-month follow-up (… r = .61, p = .001)” (p. 267).

This passage reports the test-retest correlation, which was r = .61. Those who were the most appreciative at Time 1 were also the most appreciative at Time 2; similarly, those who were least appreciative at Time 1 were also least appreciative at Time 2.

Interrater Reliability

Because the AIR scale is a self-report measure, the researchers do not need to report interrater reliability evidence.

The evidence indicates the AIR scale has adequate internal and test-retest reliability. What evidence is there for its validity? Is the scale really measuring the concept of gratitude?

Convergent and Discriminant Validity

Do AIR scores correlate more strongly with measures similar to gratitude and less strongly with measures dissimilar to gratitude?

In a section on convergent and discriminant validity, the authors write: “As expected, . . . [the AIR scale was] positively correlated with the extent to which people had a grateful disposition [r = .25], as well as with people’s gratitude in response to their partners’ kind acts [r = .60]. In contrast, [the AIR scale was not] associated with people’s feelings of indebtedness to their partners [r = .19]” (p. 262).

In this passage, the authors give convergent and discriminant validity evidence. The AIR scale has convergent validity with other measures of gratitude, and discriminant validity with a measure of indebtedness. In other words, there was a pattern of higher correlations with gratitude than with indebtedness.

Criterion Validity

Does the AIR scale predict relevant behavioral outcomes?

The “final study allowed us to provide additional evidence for the validity of the AIR scale by examining cross-partner associations. . . . [P]eople who reported feeling more appreciative of their partners had partners who felt more appreciated by them, β = .50, t(66) = 5.87, p < .001, . . . suggesting that the AIR scale is capturing the interpersonal transmission of appreciation from one partner to the other” (p. 269).

In this passage, the authors present criterion validity evidence. If the AIR scale is a valid measure, you’d expect partners with higher AIR scores to also have partners who notice this appreciation. Because the results showed that AIR scores were associated with this relevant outcome, there is evidence for the AIR scale’s criterion validity.

Summary

The construct validity of a study’s measured variables is something you will interrogate for any type of claim.

Ways to Measure Variables

• Psychological scientists measure variables in every study they conduct. Three common types of measures are self-report, in which people report on their own behaviors, beliefs, or attitudes; observational measures, in which raters record the visible behaviors of people or animals; and physiological measures, in which researchers measure biological data, such as heart rate, brain activity, and hormone levels.

• Depending on how they are operationalized, variables may be categorical or quantitative. The levels of categorical variables are categories. The levels of quantitative variables are meaningful numbers, in which higher numbers represent more of some variable.

• Quantitative variables can be further classified in terms of ordinal, interval, or ratio scales.

Reliability of Measurement: Are the Scores Consistent?

• Both measurement reliability and measurement validity are important for establishing a measure’s construct validity.

• Researchers use scatterplots and correlation coefficients (among other methods) to evaluate evidence for a measure’s reliability and validity.

• To establish a measure’s reliability, researchers collect data to see whether the measure works consistently. There are three types of measurement reliability.

• Test-retest reliability establishes whether a sample gives a consistent pattern of scores at more than one testing.

• Interrater reliability establishes whether two observers give consistent ratings to a sample of targets.

• Internal reliability is established when people answer similarly worded items in a consistent way.

• Measurement reliability is necessary but not sufficient for measurement validity.

Validity of Measurement: Does It Measure What It’s Supposed to Measure?

• Measurement validity can be established with subjective judgments (face validity and content validity) or with empirical data.

• Criterion validity requires collecting data that show a measure is correlated with expected behavioral outcomes.

• Convergent and discriminant validity require collecting data that show a self-report measure is correlated more strongly with self-report measures of similar constructs than with measures of dissimilar constructs.

Review: Interpreting Construct Validity Evidence

• Measurement reliability and validity evidence are reported in the Method section of empirical journal articles. Details may be provided in the text, as a table of results, or through cross-reference to a longer article that presents full reliability and validity evidence.




Key Terms

self-report measure, p. 120
observational measure, p. 121
physiological measure, p. 121
categorical variable, p. 122
quantitative variable, p. 123
ordinal scale, p. 123
interval scale, p. 123
ratio scale, p. 123
reliability, p. 125
validity, p. 125
test-retest reliability, p. 125
interrater reliability, p. 125
internal reliability, p. 125
correlation coefficient r, p. 128
slope direction, p. 128
strength, p. 128
Cronbach’s alpha, p. 131
face validity, p. 134
content validity, p. 135
criterion validity, p. 135
known-groups paradigm, p. 137
convergent validity, p. 139
discriminant validity, p. 139

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 5.

Review Questions

1. Classify each operational variable below as categorical or quantitative. If the variable is quantitative, further classify it as ordinal, interval, or ratio.

a. Degree of pupil dilation in a person’s eyes in a study of romantic couples (measured in millimeters).

b. Number of books a person owns.

c. A book’s sales rank on

d. The language a person speaks at home.

e. Nationality of the participants in a cross-cultural study of Canadian, Ghanaian, and French students.

f. A student’s grade in school.

2. Which of the following correlation coefficients best describes the pictured scatterplot?

a. r = .78

b. r = −.95

c. r = .03

d. r = .45

[Scatterplot not reproduced.]

3. Classify each of the following results as an example of internal reliability, interrater reliability, or test-retest reliability.

a. A researcher finds that people’s scores on a measure of extroversion stay stable over 2 months.

b. An infancy researcher wants to measure how long a 3-month-old baby looks at a stimulus on the right and left sides of a screen. Two undergraduates watch a tape of the eye movements of ten infants and time how long each baby looks to the right and to the left. The two sets of timings are correlated r = .95.

c. A researcher asks a sample of 40 people a set of five items that are all capturing how extroverted they are. The Cronbach’s alpha for the five items is found to be .75.

4. Classify each result below as an example of face validity, content validity, convergent and discriminant validity, or criterion validity.

a. A professor gives a class of 40 people his five-item measure of conscientiousness (e.g., “I get chores done right away,” “I follow a schedule,” “I do not make a mess of things”). Average scores are correlated (r = −.20) with how many times each student has been late to class during the semester.

b. A professor gives a class of 40 people his five-item measure of conscientiousness (e.g., “I get chores done right away,” “I follow a schedule,” “I do not make a mess of things”). Average scores are more highly correlated with a self-report measure of tidiness (r = .50) than with a measure of general knowledge (r = .09).

c. The researcher e-mails his five-item measure of conscientiousness (e.g., “I get chores done right away,” “I follow a schedule,” “I do not make a mess of things”) to 20 experts in personality psychology, and asks them if they think his items are a good measure of conscientiousness.

d. The researcher e-mails his five-item measure of conscientiousness (e.g., “I get chores done right away,” “I follow a schedule,” “I do not make a mess of things”) to 20 experts in personality psychology, and asks them if they think he has included all the important aspects of conscientiousness.

Learning Actively

1. For each measure below, indicate which kinds of reliability would need to be evaluated. Then, draw a scatterplot indicating that the measure has good reliability and another one indicating the measure has poor reliability. (Pay special attention to how you label the axes of your scatterplots.)

a. Researchers place unobtrusive video recording devices in the hallway of a local high school. Later, coders view tapes and code how many students are using cell phones in the 4-minute period between classes.

b. Clinical psychologists have developed a seven-item self-report measure to quickly identify people who are at risk for panic disorder.

c. Psychologists measure how long it takes a mouse to learn an eyeblink response. For 60 trials, they present a mouse with a distinctive blue light followed immediately by a puff of air. The 5th, 10th, and 15th trials are test trials, in which they present the blue light alone (without the air puff). The mouse is said to have learned the eyeblink response if observers record that it blinked its eyes in response to a test trial. The earlier in the 60 trials the mouse shows the eyeblink response, the faster it has learned the response.

d. Educational psychologists use teacher ratings of classroom shyness (on a nine-point scale, where 1 = “not at all shy in class” and 9 = “very shy in class”) to measure children’s temperament.

2. Consider how you might validate the nine-point classroom shyness rating example in question 1d. First, what behaviors might be relevant for establishing this rating’s criterion validity? Draw a scatterplot showing the results of a study in which the classroom shyness rating has good criterion validity (be careful how you label the axes). Second, come up with ways to evaluate the convergent and discriminant validity of this rating system. What traits should correlate strongly with shyness? What traits should correlate only weakly or not at all? Explain why you chose those traits. Draw a scatterplot showing the results of a study in which the shyness rating has good convergent or discriminant validity (be careful how you label the axes).

3. This chapter included the example of a sales ability test. Search online for “sales ability assessment” and see what commercially available tests you can find. Do the websites present reliability or validity evidence for the measures? If so, what form does the evidence take? If not, what kind of evidence would you like to see? You might frame your predictions in this form: “If this sales ability test has convergent validity, I would expect it to be correlated with . . . .”; “If this sales ability test has discriminant validity, I would expect it not to be correlated with . . . .”; “If this sales ability test has criterion validity, I would expect it to be correlated with . . . .”



Tools for Evaluating Frequency Claims

“Should we eat at this restaurant? It got almost 5 stars on Yelp.”

“Should I take this class? The professor has a green smiley face on”

7 Secrets of Low-Stress Families, 2010



A year from now, you should still be able to:

1. Explain how carefully prepared questions improve the construct validity of a poll or survey.

2. Describe how researchers can make observations with good construct validity.

6 Surveys and Observations: Describing What People Do

YOU SHOULD BE ABLE to identify the three statements that open this chapter as single-variable frequency claims. Each claim is based on data from one variable: the rated quality of a restaurant, opinions about a professor, or the habits of families. Where do the data for such claims come from? This chapter focuses on the construct validity of surveys and polls, in which researchers ask people questions, as well as observational studies, in which researchers watch the behavior of people or other animals, often without asking them questions at all. Researchers use surveys, polls, and observations to measure variables for any type of claim. However, in this chapter and the next, many of the examples focus on how surveys and observations are used to measure one variable at a time, for frequency claims.

CONSTRUCT VALIDITY OF SURVEYS AND POLLS

Researchers use surveys and polls to ask people questions over the telephone, in door-to-door interviews, through the mail, or over the Internet. You may have been asked to take surveys in various situations. Perhaps after you purchased an item from an Internet retailer, you got an e-mail asking you to post a review. While you were reading an online newspaper, maybe a survey popped up. A polling organization, such as Gallup or Pew Research Center, may have called you at home.

The word survey is often used when people are asked about a consumer product, whereas the word poll is used when people are asked about their social or political opinions. However, these two terms can be interchangeable, and in this book, survey and poll both mean the same thing: a method of posing questions to people on the phone, in personal interviews, on written questionnaires, or online. Psychologists might conduct national polls as part of their research, or they may use the polling information they read (as consumers of information) to inspire further research.

How much can you learn about a phenomenon just by asking people questions? It depends on how well you ask. As you will learn, researchers who develop their questions carefully can support frequency, association, or causal claims that have excellent construct validity.

Choosing Question Formats

Survey questions can follow four basic formats. Researchers may ask open-ended questions that allow respondents to answer any way they like. They might ask people to name the public figure they admire the most, or ask a sample of people to describe their views on immigration. Departing overnight guests might be asked to submit comments about their experience at a hotel. Their various responses to open-ended questions provide researchers with spontaneous, rich information. The drawback is that the responses must be coded and categorized, a process that is often difficult and time-consuming. In the interest of efficiency, therefore, researchers in psychology often restrict the answers people can give.

One specific way to ask survey questions uses forced-choice questions, in which people give their opinion by picking the best of two or more options. Forced-choice questions are often used in political polls, such as asking which of two or three candidates respondents plan to vote for.

An example of a psychology measure that uses forced-choice questions is the Narcissistic Personality Inventory (NPI; Raskin & Terry, 1988). This instrument asks people to choose one statement from each of 40 pairs of items, such as the following:

1.    I really like to be the center of attention.

   It makes me uncomfortable to be the center of attention.

2.    I am going to be a great person.

   I hope I am going to be successful.

To score a survey like this, the researcher adds up the number of times people choose the “narcissistic” response over the “non-narcissistic” one (in the example items above, the narcissistic response is the first option).
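The scoring rule just described can be sketched in a few lines of Python. This is only an illustration, not the published NPI scoring key: the answer key below covers just the two sample items quoted above, and the respondents and the function name `score_forced_choice` are made up for the example.

```python
# Hypothetical answer key: for each item pair, which option
# (0 = first statement, 1 = second statement) is the "narcissistic" one.
# In the two sample items above, it is the first option both times.
NARCISSISTIC_OPTION = [0, 0]


def score_forced_choice(choices, key=NARCISSISTIC_OPTION):
    """Count how many times a respondent picked the keyed option."""
    if len(choices) != len(key):
        raise ValueError("one choice is required per item pair")
    return sum(1 for picked, keyed in zip(choices, key) if picked == keyed)


# Three made-up respondents answering the two sample items.
respondents = {
    "A": [0, 0],  # first statement both times
    "B": [0, 1],  # one of each
    "C": [1, 1],  # second statement both times
}

scores = {name: score_forced_choice(picks) for name, picks in respondents.items()}
print(scores)
```

With a full 40-pair inventory, the same function would apply as long as the key listed the keyed option for every pair; a higher count means more "narcissistic" choices.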


In another question format, people are presented with a statement and are asked to use a rating scale to indicate their degree of agreement. When such a scale contains more than one item and each response value is labeled with the specific terms strongly agree, agree, neither agree nor disagree, disagree, and strongly disagree, it is often called a Likert scale (Likert, 1932). If it does not follow this format exactly (e.g., if it has only one item, or if its response labels are a bit different from the original Likert labels) it may be called a Likert-type scale. Here is one of the ten items from the Rosenberg self-esteem inventory, a commonly used measure of self-esteem (Rosenberg, 1965). It can be considered a Likert scale:

I am able to do things as well as most other people.

1 2 3 4 5





Instead of degree of agreement, respondents might be asked to rate a target object using a numeric scale that is anchored with adjectives; this is called a semantic differential format. For example, on the Internet site, students assign ratings to a professor using the following adjective phrases:

Overall Quality:

Profs get F’s too 1 2 3 4 5 A real gem

Level of Difficulty:

Show up and pass 1 2 3 4 5 Hardest thing I’ve ever done

The five-star rating format that Internet rating sites (like Yelp) use is another example of this technique (Figure 6.1). Generally one star means “poor” or “I don’t like it,” and five stars means “outstanding” or even “Woohoo! As good as it gets!”

There are other question types, of course, and researchers might combine formats on a single survey. The point is that the format of a question (open-ended, forced-choice, or Likert scale) does not make or break its construct validity. The way the questions are worded, and the order in which they appear, are much more important.

FIGURE 6.1 A five-star restaurant rating on the Internet. Ratings of the products and services people might consult online are examples of frequency claims. Is a five-star rating a valid indicator of a restaurant’s quality?

Writing Well-Worded Questions

As with other research findings, when you interrogate a survey result, your first question is about construct validity: How well was that variable measured? The way a question is worded and presented in a survey can make a tremendous difference in how people answer. It is crucial that each question be clear and straightforward. Poll and survey creators work to ensure that the wording and order of the questions do not influence respondents’ answers.


An example of the way question wording can affect responses comes from survey research on a random sample of Delaware voters (Wilson & Brewer, 2016). The poll asked about people’s support for voter identification laws, which require voters to show a photo ID before casting a ballot. Participants heard one of several different versions of the question. Here are three of them:

1. What is your opinion? Do you strongly favor, mostly favor, mostly oppose, or strongly oppose voter ID laws?

2. Opponents of voter ID laws argue that they will prevent people who are eligible to vote from voting. What is your opinion? Do you strongly favor, mostly favor, mostly oppose, or strongly oppose voter ID laws?

3. Opponents of voter ID laws argue that they will prevent people who are eligible to vote from voting, and that the laws will affect African American voters especially hard. What is your opinion? Do you strongly favor, mostly favor, mostly oppose, or strongly oppose voter ID laws?

As you can see, the first version of the question simply asks people’s opinion about voter ID laws, but the second question is potentially a leading question, one whose wording leads people to a particular response because it explains why some people oppose the law. The third question specifies a group (African American voters) that will be affected by voter ID laws. The researchers found that presenting potentially leading information did, in fact, affect people’s support. Without the additional wording, 79% of people supported voter ID laws. In the second version, 69% supported them, and when African Americans were specified, 61% supported such laws. The study shows that the wording matters; when people answer questions that suggest a particular viewpoint, at least some people change their answers.

In general, if the intention of a survey is to capture respondents’ true opinions, the survey writers might attempt to word every question as neutrally as possible. When researchers want to measure how much the wording matters for their topic, they word each question more than one way. If the results are the same regardless of the wording, they can conclude that question wording does not affect people’s responses to that particular topic. If the results are different, then they may need to report the results separately for each version of the question.


The wording of a question is sometimes so complicated that respondents have trouble answering in a way that accurately reflects their opinions. In a survey, it is always best to ask a simple question. When people understand the question, they can give a clear, direct, and meaningful answer, but sometimes survey writers forget this basic guideline. For example, an online survey from the National Rifle Association asked this question:

Do you agree that the Second Amendment to our United States Constitution guarantees your individual right to own a gun and that the Second Amendment is just as important as your other Constitutional rights?



No opinion

This is called a double-barreled question; it asks two questions in one. Double-barreled questions have poor construct validity because people might be responding to the first half of the question, the second half, or both. Therefore, the item could be measuring the first construct, the second construct, or both. Careful researchers would have asked each question separately:

Do you agree that the Second Amendment guarantees your individual right to own a gun?



No opinion

Do you agree that the Second Amendment is just as important as your other Constitutional rights?



No opinion


Negatively worded questions are another way survey items can be unnecessarily complicated. Whenever a question contains negative phrasing, it can cause confusion, thereby reducing the construct validity of a survey or poll (Schwarz & Oyserman, 2001).

A classic example comes from a survey on Holocaust denial, which found that 20% of Americans denied that the Nazi Holocaust ever happened. In the months that followed the publication of this survey’s results, writers and journalists criticized and analyzed the “intensely disturbing” news (Kagay, 1994).

Upon further investigation, the Roper polling organization reported that the people in the original telephone poll had been asked, “Does it seem possible or does it seem impossible to you that the Nazi extermination of the Jews never happened?” Think for a minute about how you would answer that question. If you wanted to convey the opinion that the Holocaust did happen, you would have to say, “It’s impossible that it never happened.” In order to give your opinion about the Holocaust accurately, you must also be able to unpack the double negative of impossible and never. So instead of measuring people’s beliefs, the question may be measuring people’s working memory or their motivation to pay attention.

We know that this negatively worded question may have affected people’s responses because the same polling organization repeated the survey less than a year later, asking the question more clearly, with less negativity: “Does it seem possible to you that the Nazi extermination of the Jews never happened, or do you feel certain that it happened?” This time, only 1% responded that the Holocaust might not have happened, 8% did not know, and 91% said they were certain it happened (Kagay, 1994). This new result, as well as other polls reflecting similarly low levels of Holocaust denial, indicates that because of the original wording, the question had poor construct validity: It probably did not measure people’s true beliefs.

Sometimes even one negative word can make a question difficult to answer. For example, consider the following question:

Abortion should never be restricted.

1 2 3 4 5

Disagree Agree

To answer this question, those who oppose abortion must think in the double negative (“I disagree that abortion should never be restricted”), while those who support abortion rights would be able to answer more easily (“I agree—abortion should never be restricted”).

When possible, negative wording should be avoided, but researchers sometimes ask questions both ways, like this:

Abortion should never be restricted.

1 2 3 4 5

Disagree Agree

I favor strong restrictions on abortion.

1 2 3 4 5

Disagree Agree

After asking the question both ways, the researchers can study the items’ internal consistency (using Cronbach’s alpha) to see whether people respond similarly to both questions (in this case, agreement with the first item should correlate with disagreement with the second item). Like double-barreled questions, negatively worded ones can reduce construct validity because they might capture people’s ability or motivation to figure out the question rather than their true opinions.
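As a rough illustration of that internal consistency check, the sketch below reverse-scores the second item and then computes Cronbach's alpha from the standard formula, alpha = (k / (k - 1)) × (1 − sum of item variances / variance of total scores). The six respondents and their ratings are invented for the example, and the function names are assumptions, not part of any particular statistics package.

```python
from statistics import variance


def reverse_score(rating, low=1, high=5):
    """Flip a rating on a low-to-high scale (on 1-5: 5 -> 1, 4 -> 2, ...)."""
    return low + high - rating


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of ratings
    with one rating per respondent (same respondent order in every item)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(ratings) for ratings in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Made-up ratings from six respondents (1 = disagree, 5 = agree).
never_restricted = [5, 4, 2, 1, 3, 5]    # "Abortion should never be restricted."
favor_restrictions = [1, 2, 4, 5, 3, 2]  # "I favor strong restrictions on abortion."

# Reverse the second item so that high scores mean the same thing on both.
favor_reversed = [reverse_score(r) for r in favor_restrictions]

alpha = cronbach_alpha([never_restricted, favor_reversed])
print(round(alpha, 2))
```

In this invented data set, people who agree with the first item tend to disagree with the second, so alpha comes out high once the second item is reversed. A low alpha would suggest that respondents were not treating the two wordings as opposites.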



The order in which questions are asked can also affect the responses to a survey. The earlier questions can change the way respondents understand and answer the later questions. For example, a question on a parenting survey such as “How often do your children play?” would have different meanings if the previous questions had been about sports versus music versus daily activities.

Consider this example: Political opinion researcher David Wilson and his colleagues asked people whether they supported affirmative action for different groups (Wilson, Moore, McKay, & Avery, 2008). Half the participants were asked two forced-choice questions in this order:

1. Do you generally favor or oppose affirmative action programs for women?

2. Do you generally favor or oppose affirmative action for racial minorities?

The other half were asked the same two questions, but in the opposite order:

1. Do you generally favor or oppose affirmative action for racial minorities?

2. Do you generally favor or oppose affirmative action programs for women?

Wilson found that Whites reported more support for affirmative action for minorities when they had first been asked about affirmative action for women. Presumably, most Whites support affirmative action for women more than they do for minorities. To appear consistent, they might feel obligated to express support for affirmative action for racial minorities if they have just indicated their support for affirmative action for women.

The most direct way to control for the effect of question order is to prepare different versions of a survey, with the questions in different sequences. If the results for the first order differ from the results for the second order, researchers can report both sets of results separately. In addition, they might be safe in assuming that people’s endorsement of the first question on any survey is unaffected by previous questions.
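One way to prepare such versions is sketched below, using the two affirmative action questions from the example above. The code builds every possible question order and deals participants out to the versions at random. The participant IDs, the seed, and the function name `assign_versions` are hypothetical; a real study would follow its own randomization procedure.

```python
import random
from collections import Counter
from itertools import permutations

questions = [
    "Do you generally favor or oppose affirmative action programs for women?",
    "Do you generally favor or oppose affirmative action for racial minorities?",
]

# Every possible ordering of the questions; with two questions there are two.
versions = list(permutations(questions))


def assign_versions(participant_ids, n_versions, seed=0):
    """Shuffle the participants, then deal them out to the versions in turn
    so that each question order is used about equally often."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: i % n_versions for i, pid in enumerate(shuffled)}


# Ten made-up participant IDs.
assignment = assign_versions(range(1, 11), len(versions))
counts = Counter(assignment.values())
print(counts)
```

Because each person sees only one order, any difference between the two groups' answers to the same question can be attributed to question order rather than to the question itself.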

Encouraging Accurate Responses

Careful researchers pay attention to how they word and order their survey questions. But what about the people who answer them? Overall, people can give meaningful responses to many kinds of questions (Paulhus & Vazire, 2007; Schwarz & Oyserman, 2001). In certain situations, however, people are less likely to respond accurately. It’s not because they are intentionally being dishonest. People might give inaccurate answers because they don’t make an effort to think about each question, because they want to look good, or because they are simply unable to report accurately about their own motivations and memories.


Some students are skeptical that people can ever report accurately on surveys. Despite what you might think, though, self-reports are often ideal. People are able to report their own gender identity, socioeconomic status, ethnicity, and so on; there is no need to use expensive or difficult measures to collect such information. More importantly, self-reports often provide the most meaningful information you can get. Diener and his colleagues, in their studies of well-being (see Chapter 5), were specifically interested in subjective perspectives on happiness, so it made sense to ask participants to self-report on aspects of their life satisfaction (Diener, Emmons, Larsen, & Griffin, 1985).

In some cases, self-reports might be the only option. For example, researchers who study dreaming can monitor brain activity to identify when someone is dreaming, but they need to use self-reports to find out the content of the person’s dreams because only the dreamer experiences the dream. Other traits are not very observable, such as how anxious somebody is feeling. Therefore, it is meaningful and effective to ask people to self-report on their own subjective experiences (Vazire & Carlson, 2011).


Response sets, also known as nondifferentiation, are a type of shortcut respondents can take when answering survey questions. Although response sets do not cause many problems for answering a single, stand-alone item, people might adopt a consistent way of answering all the questions, especially toward the end of a long questionnaire (Lelkes, Krosnick, Marx, Judd, & Park, 2012). Rather than thinking carefully about each question, people might answer all of them positively, negatively, or neutrally. Response sets weaken construct validity because these survey respondents are not saying what they really think.

One common response set is acquiescence, or yea-saying; this occurs when people say “yes” or “strongly agree” to every item instead of thinking carefully about each one. For example, a respondent might answer “5” to every item on Diener’s scale of subjective well-being, not because he is a happy person, but because he is using a yea-saying shortcut (Figure 6.2). People apparently have a bias to agree with (say “yes” to) any item, no matter what it states (Krosnick, 1999). Acquiescence can threaten construct validity because instead of measuring the construct of true feelings of well-being, the survey could be measuring the tendency to agree, or the lack of motivation to think carefully.

FIGURE 6.2 Response sets. When people use an acquiescent response set, they agree with almost every question or statement. It can be hard to know whether they really mean it, or whether they’re just using a shortcut to respond to the questions.

How can researchers tell the difference between a respondent who is yea-saying and one who really does agree with all the items? The most common way is by including reverse-worded items. Diener might have changed the wording of some items to mean their opposite, for instance, “If I had my life to live over, I’d change almost everything.” One benefit is that reverse-worded items might slow people down so they answer more carefully. (Before computing a scale average for each person, the researchers rescore only the reverse-worded items such that, for example, “strongly disagree” becomes a 5 and “strongly agree” becomes a 1.) The scale with reverse-worded items would have more construct validity because high or low averages would be measuring true happiness or unhappiness, instead of acquiescence. A drawback of reverse-wording is that sometimes the result is negatively worded items, which are more difficult to answer.
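The rescoring step can be made concrete with a short sketch. It assumes a hypothetical four-item well-being scale rated from 1 (strongly disagree) to 5 (strongly agree), in which the last two items are reverse-worded. The item positions, the respondents, and the function names are invented for illustration.

```python
# Hypothetical scale: items at positions 2 and 3 (counting from 0)
# are reverse-worded, e.g. "I'd change almost everything."
REVERSE_WORDED = {2, 3}


def rescore(responses, reverse_worded=REVERSE_WORDED, low=1, high=5):
    """Flip only the reverse-worded items (5 -> 1, 4 -> 2, and so on)."""
    return [
        (low + high - r) if i in reverse_worded else r
        for i, r in enumerate(responses)
    ]


def scale_average(responses):
    """Average the items after the reverse-worded ones have been rescored."""
    rescored = rescore(responses)
    return sum(rescored) / len(rescored)


def gave_identical_answers(responses):
    """True when a respondent chose the same value for every item."""
    return len(set(responses)) == 1


yea_sayer = [5, 5, 5, 5]    # agrees with everything, even the reversed items
truly_happy = [5, 5, 1, 1]  # agrees with positive items, rejects reversed ones

print(scale_average(yea_sayer), gave_identical_answers(yea_sayer))
print(scale_average(truly_happy), gave_identical_answers(truly_happy))
```

After rescoring, the yea-sayer lands at the scale midpoint rather than at the top, while the respondent who read each item carefully keeps a high average. Identical answers across all items are only a warning sign, not proof of a response set.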

Another specific response set is fence sitting: playing it safe by answering in the middle of the scale, especially when survey items are controversial. People might also answer in the middle (or say “I don’t know”) when a question is confusing or unclear. Fence sitters can weaken a survey’s construct validity when middle-of-the-road scores suggest that some responders don’t have an opinion, when they actually do. Of course, some people honestly may have no opinion on the questions; in that case, they choose the middle option for a valid reason. It can be difficult to distinguish those who are unwilling to take a side from those who are truly ambivalent.

Researchers may try to jostle people out of this tendency. One approach is to take away the neutral option. Compare these two formats:

Race relations are going well in this country.

○ ○ ○ ○ ○





Race relations are going well in this country.

○ ○ ○ ○





When a scale contains an even number of response options, the person has to choose one side or the other because there is no neutral choice. The drawback of this approach is that sometimes people really do not have an opinion or an answer, so for them, having to choose a side is an invalid representation of their truly neutral stance. Therefore, researchers must carefully consider which format is best.

Another common way to get people off the fence is to use forced-choice questions, in which people must pick one of two answers. Although this reduces fence sitting, again it can frustrate people who feel their own opinion is somewhere in the middle of the two options. In some telephone surveys, interviewers will write down a response of “I don’t know” or “No opinion” if a person volunteers that response. Thus, more people get off the fence, but truly ambivalent people can also validly report their neutral opinions.



Most of us want to look good in the eyes of others, but when survey respondents give answers that make them look better than they really are, these responses decrease the survey’s construct validity. This phenomenon is known as socially desirable responding, or faking good. The idea is that because respondents are embarrassed, shy, or worried about giving an unpopular opinion, they will not tell the truth on a survey or other self-report measure. A similar, but less common, phenomenon is called faking bad.

To avoid socially desirable responding, a researcher might ensure that the participants know their responses are anonymous, perhaps by conducting the survey online, or in the case of an in-person interview, reminding people of their anonymity right before asking sensitive questions (Schwarz & Oyserman, 2001). However, anonymity may not be a perfect solution. Anonymous respondents may treat surveys less seriously. In one study, anonymous respondents were more likely to start using response sets in long surveys. In addition, anonymous people were less likely to accurately report a simple behavior, such as how many candies they had just eaten, which suggests they were paying less attention to things (Lelkes, Krosnick, Marx, Judd, & Park, 2012).

One way to minimize this problem is to include special survey items that identify socially desirable responders with target items like these (Crowne & Marlowe, 1960):

My table manners at home are as good as when I eat out in a restaurant.

I don’t find it particularly difficult to get along with loud-mouthed, obnoxious people.

If a respondent agrees with many such items, researchers may discard that individual’s data from the final set, under suspicion that the person is exaggerating on the other survey items or not paying close attention in general.

Researchers can also ask people’s friends to rate them. When it comes to domains where we want to look good (e.g., on how rude or how smart we are), others know us better than we know ourselves (Vazire & Carlson, 2011). Thus, researchers might be better off asking people’s friends to rate them on traits that are observable but desirable.

Finally, researchers increasingly use special, computerized measures to evaluate people’s implicit opinions about sensitive topics. One widely used test, the Implicit Association Test, asks people to respond quickly to positive and negative words on the right and left of a computer screen (Greenwald, Nosek, & Banaji, 2003). Intermixed with the positive and negative words may be faces from different social groups, such as Black and White faces. People respond to all possible combinations, including positive words with Black faces, negative words with White faces, negative words with Black faces, and positive words with White faces. When people respond more efficiently to the White-positive/Black-negative combination than to the White-negative/Black-positive combination, researchers infer that the person may hold negative attitudes on an implicit, or unconscious, level.



As researchers strive to encourage accurate responses, they also ask whether people are capable of reporting accurately on their own feelings, thoughts, and actions. Everyone knows his or her opinions better than anyone else does, right? Only I know my level of support for a political candidate. Only you know how much you liked a professor. Only the patron knows how much she liked that restaurant. In some cases, however, self-reports can be inaccurate, especially when people are asked to describe why they are thinking, behaving, or feeling the way they do. When asked, most people willingly provide an explanation or an opinion to a researcher, but sometimes they unintentionally give inaccurate responses.

Psychologists Richard Nisbett and Timothy Wilson (1977) conducted a set of studies to demonstrate this phenomenon. In one study, they put six pairs of nylon stockings on a table and asked female shoppers in a store to tell them which of the stockings they preferred. As it turned out, almost everyone selected the last pair on the right. The reason for this preference was something of a mystery, especially since all the stockings were exactly the same! Next, the researchers asked each woman why she selected the pair she did. Every participant reported that she selected the pair on the right for its excellent quality. Even when the researchers suggested they might have chosen the pair because it was on the far right side of the table, the women insisted they made their choices based on the quality of the stockings. In other words, the women easily formulated answers for the researchers, but their answers had nothing to do with the real reason they selected the one pair of stockings (Figure 6.3). Moreover, the women did not seem to be aware they were inventing a justification for their preference. They gave a sincere, reasonable response, one that just happened to be wrong. Therefore, researchers cannot assume the reasons people give for their own behavior are their actual reasons. People may not be able to accurately explain why they acted as they did.


Even if people can’t always accurately report the reasons behind their behaviors, surely they know what those behaviors were, right? In fact, psychological research has shown that people’s memories about events in which they participated are not very accurate. For example, many American adults can say exactly where they were when they heard the news that two planes had crashed into New York’s World Trade Center on September 11, 2001, and their memories are often startlingly vivid. Cognitive psychologists have checked the accuracy of such “flashbulb memories.”

FIGURE 6.3 The accuracy of self-reports. If you ask this shopper why she chooses one of these items, she will probably give you a reasonable answer. But does her answer represent the true reason for making her choice?


To conduct such a study, researchers administer a short questionnaire to their students on the day after a dramatic event, asking them to recall where they were, whom they were with, and so forth. A few years later, the researchers ask the same people the same questions as before, and also ask them to rate their confidence in their memories. Such studies have shown that overall accuracy is very low: For example, years later, about 73% of students recalling their memories of the 9/11 attacks remembered seeing the first plane hit the World Trade Center on TV, when in fact no such footage was available at that time (Pezdek, 2003).

The other important finding from these studies is that people’s confidence in the accuracy of their memories is virtually unrelated to how accurate the memories actually are. Three years later, people who are extremely confident in their memories are about as likely to be wrong as people who report their memories with little or no confidence. In one study, when researchers showed participants what they wrote years ago, on the day after significant events, they were genuinely stumped, saying, “I still think of it as the other way around” or “I mean, like I told you, I have no recollection of [that version] at all” (Neisser & Harsch, 1992, p. 21). Studies like these remind us to question the construct validity of even the most vivid and confidently held “memories” of the past. In other words, asking people what they remember is probably not the best operationalization for studying what really happened to them.


What about the special case of online product ratings? Online ratings constitute data that support frequency claims. Are consumers able to make good judgments about products they have purchased and used? One study found little correspondence between five-star ratings on and the ratings of the same products by Consumer Reports, an independent product rating firm (De Langhe, Fernbach, & Lichtenstein, 2016). The researchers found that consumers’ ratings were, instead, correlated with the cost of the product and the prestige of its brand. Studies like these suggest that people may not always be able to accurately report on the quality of products they buy (Figure 6.4).

FIGURE 6.4 Do consumer ratings match expert ratings? This camera’s online reviews were positive, but Consumer Reports (an independent rating firm) ranked it second to last. While consumers may be able to report their subjective experience with a product, their ratings might not accurately predict product quality.



1. What are three potential problems related to the wording of survey questions? Can they be avoided?

2. Name at least two ways to ensure that survey questions are answered accurately.

3. For which topics, and in what situations, are people most likely to answer accurately to survey questions?

1. See pp. 156–159. 2. See pp. 159–162. 3. See pp. 159–160.

CONSTRUCT VALIDITY OF BEHAVIORAL OBSERVATIONS

Survey and poll results are among the most common types of data used to support a frequency claim, the kind you read most often in newspapers or on websites. Researchers also study people simply by watching them in action. When a researcher watches people or animals and systematically records how they behave or what they are doing, it is called observational research. Some scientists believe observing behavior is better than collecting self-reports through surveys, because people cannot always report on their behavior or past events accurately, as we’ve discussed. Given the potential for the effect of question order, response sets, socially desirable responding, and other problems, many psychologists trust behavioral data more than survey data, at least for some variables.

Observational research can be the basis for frequency claims. Researchers might record how much people eat in fast-food restaurants or observe drivers, counting how many will stop for a pedestrian in a crosswalk. They might test the balance of athletes who have been hit on the head during practice, listen in on the comments of parents watching a hockey game, or watch families as they eat dinner. Observational research is not just for frequency claims: Observations can also be used to operationalize variables in association claims and causal claims. Regardless of the type of claim, it is important that observational measures have good construct validity.

Some Claims Based on Observational Data

Self-report questions can be excellent measures of what people think they are doing, and of what they think is influencing their behavior. But if you want to know what people are really doing or what really influences behavior, you should probably watch them. Here are three examples of how observational methods have been used to answer research questions in psychology.

❮❮ For more detail on statistical significance, see Chapter 3, pp. 71–72, and Statistics Review: Inferential Statistics, pp. 499–501.

166 CHAPTER 6 Surveys and Observations: Describing What People Do


Matthias Mehl and his colleagues kept track of what people say in everyday contexts (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007). The researchers recruited several samples of students and asked them to wear an electronically activated recorder (EAR) for 2–10 days (depending on the sample). This device contains a small, clip-on microphone and a digital sound recorder similar to an iPod (Figure 6.5A). At 12.5-minute intervals throughout the day, the EAR records 30 seconds of ambient sound. Later, research assistants transcribe everything the person says during the recorded time periods. The published data demonstrate that on average, women speak 16,215 words per day, while men speak 15,669 words per day (Figure 6.5B). This difference is not statistically significant, so despite stereotypes of women being the chattier gender, the data provide no evidence that women and men differ in how much they speak.
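The step from 30-second snippets to a words-per-day figure can be sketched in a few lines. This is only an illustration of the arithmetic, not the authors' published procedure: the 17-hour waking day, the function name `estimate_daily_words`, and the snippet counts are all assumptions made for the example.

```python
# Hypothetical sketch: scale words counted in sampled snippets up to a full day.
SNIPPET_SECONDS = 30      # from the text: each recording lasts 30 seconds
INTERVAL_MINUTES = 12.5   # from the text: one recording every 12.5 minutes
WAKING_HOURS = 17         # assumption for illustration only


def estimate_daily_words(words_per_snippet):
    """Estimate words spoken per waking day from per-snippet word counts."""
    if not words_per_snippet:
        raise ValueError("need at least one snippet")
    recorded_seconds = SNIPPET_SECONDS * len(words_per_snippet)
    words_per_second = sum(words_per_snippet) / recorded_seconds
    return words_per_second * WAKING_HOURS * 3600


# Fraction of each 12.5-minute interval that is actually recorded
sampled_fraction = SNIPPET_SECONDS / (INTERVAL_MINUTES * 60)

# Made-up word counts for five snippets (many snippets contain no speech)
example_counts = [0, 12, 30, 0, 8]
estimate = estimate_daily_words(example_counts)
```

With these settings the recorder captures only a small fraction of each interval, so the daily figure is an extrapolation whose accuracy depends on how representative the sampled moments are.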


FIGURE 6.5 Observational research on daily spoken words. (A) Study participants wore a small recording device to measure how many words they spoke per day. (B) This table shows the study’s results, as they were reported in the original empirical journal article. (Source: Mehl et al., 2007, Table 1.)

Canadian researchers investigated popular media stories about parents who had acted violently at youth ice hockey games (Bowker et al., 2009). To see how widespread this “problem” was, the researchers decided to watch a sample of hockey games and record the frequency of violent, negative behavior (as well as positive, supportive behavior) by parents. Although the media has reported dramatic stories about fights among parents at youth hockey games, these few instances seem to have been an exception. After sitting in the stands at 69 boys’ and girls’ hockey games in one Canadian city, the researchers found that 64% of the parents’ comments were positive, and only 4% were negative. The authors concluded that their results were “in stark contrast to media reports, which paint a grim picture of aggressive spectators and out-of-control parents” (Bowker et al., 2009, p. 311).
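A frequency claim such as "64% of comments were positive" comes from tallying coded observations. The sketch below shows that tally under stated assumptions: the category labels and the 25 example comments are invented for illustration (chosen so the percentages echo the reported ones) and are not the study's data or its actual coding scheme.

```python
from collections import Counter


def code_percentages(codes):
    """Return the percentage of observations falling in each coded category."""
    if not codes:
        raise ValueError("need at least one coded observation")
    counts = Counter(codes)
    total = len(codes)
    return {code: 100 * n / total for code, n in counts.items()}


# Made-up codes for 25 comments; "neutral" is a hypothetical third category
example_codes = ["positive"] * 16 + ["neutral"] * 8 + ["negative"] * 1
percentages = code_percentages(example_codes)
```

The percentages are only as meaningful as the coding behind them, which is why the construct validity of the codes matters for the claim.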


A third example comes from a study of families in which both parents work (Campos et al., 2013). The researchers had camera crews follow both parents from a sample of 30 dual-earner families, from the time they got home from work until 8:00 p.m. Later, teams of assistants coded a variety of behaviors from the resulting videotapes. The researchers studied two aspects of family life: the emotional tone of the parents, and the topics of conversation during dinner.

To code emotional tone, they watched the videos, rating each parent on a 7-point scale. The rating scale went from 1 (cold/hostile) to 4 (neutral) to 7 (warm/happy) (Figure 6.6). The results from the Campos study showed that emotional tone in the families was slightly positive in the evening hours (around 4.2 on the 7-point scale). In addition, they found that kids and parents differed in what they discussed at dinnertime. The kids were more likely to express distaste at the food, while the parents talked about how healthy it was (Figure 6.7). In addition to these frequency estimates, the researchers also studied associations. For example, they found that mothers’ (but not fathers’) emotional tone was more negative when children complained about the food at dinner.

FIGURE 6.6 Coding emotional tone. Here is how researchers reported the way they coded emotional tone in the Method section of their article. [Figure annotation: Interrater reliability of this observation.] (Source: Adapted from Campos et al., 2013.)

FIGURE 6.7 Coding dinnertime topics. (A) The coders assigned each piece of dinnertime conversation to one of six categories. (B) The results showed that children were most likely to express distaste at the food their parents had prepared. [Chart categories: Appreciation, Distaste, Reward, Negotiation, Health, Pleasure.] (Source: Adapted from Campos et al., 2013.)


The previous examples illustrate a variety of ways researchers have conducted observational studies—either through direct means, such as sitting in the stands during a hockey game, or by using technology, such as an EAR or a video camera. Let’s reflect on the benefits of behavioral observation in these cases. What might have happened if the researchers had asked the participants to self-report? The college students certainly would not have been able to state how many words they spoke each day. The hockey parents might have reported that their own comments at the rink were mostly positive, but they might have exaggerated their reports of other parents’ degree of negativity. And while parents could report on how they were feeling in the evening and at dinner, they might not have been able to describe how emotionally warm their expressions appeared to others—the part that matters to their partners and children. Observations can sometimes tell a more accurate story than self-reporting (Vazire & Carlson, 2011).

Making Reliable and Valid Observations

Observational research is a way to operationalize a conceptual variable, so when interrogating a study we need to ask about the construct validity of any observational measure. We ask: What is the variable of interest, and did the observations accurately measure that variable? Although observational research may seem straightforward, researchers must work quite diligently to be sure their observations are reliable and valid.

The construct validity of observations can be threatened by three problems: observer bias, observer effects, and reactivity. Observations have good construct validity to the extent that they can avoid these three problems.


Observer bias occurs when observers’ expectations influence their interpretation of the participants’ behaviors or the outcome of the study. Instead of rating behaviors objectively, observers rate behaviors according to their own expectations or hypotheses. In one study, psychoanalytic therapists were shown a videotape of a 26-year-old man talking to a professor about his feelings and work experiences (Langer & Abelson, 1974). Some of the therapists were told the young man was a patient, while others were told he was a job applicant. After seeing the videotape, the clinicians were asked for their observations. What kind of person was this young man?

Although all the therapists saw the same videotape, their reactions were not the same. Those who thought the man was a job applicant described him with terms such as “attractive,” “candid,” and “innovative.” Those who saw the videotape thinking the young man was a patient described him as a “tight, defensive person,” “frightened of his own aggressive impulses” (Langer & Abelson, 1974, p. 8). Since everyone saw the same tape, these striking differences can only have reflected the biases of the observers in interpreting what they saw.


It is problematic when observer biases affect researchers’ own interpretations of what they see. It is even worse when the observers inadvertently change the behavior of those they are observing, such that participant behavior changes to match observer expectations. Known as observer effects, or expectancy effects, this phenomenon can occur even in seemingly objective observations.


Bright and Dull Rats. In a classic study of observer effects, researchers Rosenthal and Fode (1963) gave each student in an advanced psychology course five rats to test as part of a final lab experience in the course. Each student timed how long it took for their rats to learn a simple maze, every day for several days. Although each student actually received a randomly selected group of rats, the researchers told half of them that their rats were bred to be “maze-bright” and the other half that their rats were bred to be “maze-dull.”

Even though all the rats were genetically similar, those that were believed to be maze-bright completed the maze a little faster each day and with fewer mistakes. In contrast, the rats believed to be maze-dull did not improve their performance over the testing days. This study showed that observers not only see what they expect to see; sometimes they even cause the behavior of those they are observing to conform to their expectations.

Clever Hans. A horse nicknamed Clever Hans provides another classic example of how observers’ subtle behavior changed a subject’s behavior, and how scientifically minded observers corrected the problem (Heinzen, Lilienfeld, & Nolan, 2015). More than 100 years ago, a retired schoolteacher named William von Osten tutored his horse, Hans, in mathematics. If he asked Hans to add 3 and 2, for example, the horse would tap his hoof five times and then stop. After 4 years of daily training, Clever Hans could perform math at least as well as an average fifth-grader, identify colors, and read German words (Figure 6.8). Von Osten allowed many scientists to test his horse’s abilities, and all were satisfied that von Osten was not giving Hans cues on the sly, because the horse apparently could do arithmetic even when his owner was not present.

Just when other scientists had concluded Clever Hans was truly capable of doing math, an experimental psychologist, Oskar Pfungst, came up with a more rigorous set of checks (Pfungst, 1911). Suspecting the animal was sensing subtle nonverbal cues from his human questioners, Pfungst showed the horse a series of cards printed with numbers. He alternated situations in which the questioner could or could not see each card. As Pfungst suspected, Hans was correct only when his questioner saw the card.

As it turned out, the horse was extremely clever—but not at math. He was smart at detecting the subtle head movements of the questioner. Pfungst noticed that a questioner would lean over to watch Hans tap his foot, raising his head a bit at the last correct tap. Clever Hans had learned this slight move was the cue to stop tapping (Heinzen et al., 2015).

FIGURE 6.8 William von Osten and Clever Hans. The horse Clever Hans could detect nonverbal gestures from anybody—not just his owner—so his behavior even convinced a special commission of experts in 1904.



Researchers must ensure the construct validity of observational measures by taking steps to avoid observer bias and observer effects. First and foremost, careful researchers train their observers well. They develop clear rating instructions, often called codebooks, so the observers can make reliable judgments with less bias. Codebooks are precise statements of how the variables are operationalized, and the more precise and clear the codebook statements are, the more valid the operationalizations will be. Figure 6.9 shows an example of how the parents’ comments were coded in the hockey games study.

Researchers can assess the construct validity of a coded measure by using multiple observers. Doing so allows the researchers to assess the interrater reliability of their measures. Refer to Figure 6.6, the excerpt from the Campos et al. (2013) article, in which the researchers discuss the interrater reliability of the emotional tone ratings. The abbreviation ICC stands for intraclass correlation coefficient, a statistic that quantifies the degree of agreement among raters. The closer the ICC is to 1.0, the more the observers agreed with one another. The coders in this case showed acceptable interrater reliability.

Using multiple observers does not eliminate anyone’s biases, of course, but if two observers of the same event agree on what happened, the researchers can be more confident. If there is disagreement, the researchers may need to train their observers better, develop a clearer coding system for rating the behaviors, or both.
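To make the idea of quantifying agreement concrete, here is a minimal sketch that compares two coders' ratings of the same six clips. It uses a plain Pearson correlation and a simple exact-agreement rate as stand-ins, which are not the same statistic as the ICC reported in the article; the ratings and function names are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rating lists of equal length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("ratings must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of clips on which both coders gave the identical rating."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Made-up ratings on a 7-point emotional tone scale
coder_a = [4, 5, 3, 6, 4, 2]
coder_b = [4, 5, 4, 6, 3, 2]

r = pearson_r(coder_a, coder_b)
agreement = exact_agreement(coder_a, coder_b)
```

A high correlation here would indicate that the two coders ordered the clips similarly; it would not show that either coder's ratings are valid.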

Even when an operationalization has good interrater reliability, it still might not be valid. When two observers agree with each other, they might share the same biases, so their common observations are not necessarily valid. Think about the therapists in the Langer and Abelson (1974) study. Those who were told the man in the videotape was a patient might have shown interrater reliability in their descriptions of how defensive or frightened he appeared. But because they shared similar biases, their reliable ratings were not valid descriptions of the man’s behavior. Therefore, interrater reliability is only half the story; researchers should employ methods that minimize observer bias and observer effects.

FIGURE 6.9 Clear codebooks can improve the construct validity of observations. This information was included in the empirical journal article’s Method section. (Source: Adapted from Bowker et al., 2009.)

❮❮ For more on interrater reliability, see Chapter 5, pp. 125–132.

Masked Research Design. The Rosenthal and Fode (1963) study and the Clever Hans effect both demonstrate that observers can give unintentional cues influencing the behavior of their subjects. A common way to prevent observer bias and observer effects is to use a masked design, or blind design, in which the observers are unaware of the purpose of the study and the conditions to which participants have been assigned.

If Rosenthal and Fode’s students had not known which rats were expected to be bright and dull, the students would not have evoked different behavior in their charges. Similarly, when Clever Hans’ observers did not know the right answer to the questions they were asking, the horse acted differently; he looked much less intelligent. These examples make it clear that coders and observers should not be aware of a study’s hypotheses, or should take steps to mask the conditions they are observing.
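One practical way to mask conditions is to give coders only neutral labels and keep the key elsewhere. The sketch below illustrates that idea under stated assumptions: the function name `mask_conditions`, the `clip_001` label format, and the participant IDs are hypothetical and do not come from any study described here.

```python
import random


def mask_conditions(participant_conditions, seed=None):
    """Return (masked_ids, key).

    masked_ids: neutral labels that are safe to hand to coders.
    key: private mapping from each label to (participant, condition),
         to be kept away from coders until coding is finished.
    """
    rng = random.Random(seed)
    participants = list(participant_conditions)
    rng.shuffle(participants)  # label order reveals nothing about condition
    key = {}
    for i, pid in enumerate(participants, start=1):
        key[f"clip_{i:03d}"] = (pid, participant_conditions[pid])
    return sorted(key), key


# Made-up assignment of three recordings to two conditions
assignments = {"p1": "patient", "p2": "job applicant", "p3": "patient"}
masked_ids, key = mask_conditions(assignments, seed=1)
```

Coders would rate `clip_001`, `clip_002`, and so on; only after all ratings are recorded would the researcher use `key` to attach conditions to the ratings.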


Sometimes the mere presence of an outsider is enough to change the behavior of those being observed. Suppose you’re visiting a first-grade classroom to observe the children. You walk quietly to the back of the room and sit down to watch what the children do. What will you see? A roomful of little heads swiveled around looking at you! Do first graders usually spend most of their time staring at the back of the room? Of course not. What you are witnessing is an example of reactivity.

Reactivity is a change in behavior when study participants know another person is watching. They might react by being on their best behavior—or in some cases, their worst—rather than displaying their typical behavior. Reactivity occurs not only with human participants but also with animal subjects. Psychologist Robert Zajonc once demonstrated that even cockroaches behave differently in the presence of other cockroaches (Zajonc, Heingartner, & Herman, 1969). If people and animals can change their behavior just because they are being watched, what should a careful researcher do?

Solution 1: Blend In. One way to avoid observer effects is to make unobtrusive observations—that is, make yourself less noticeable. A developmental psychologist doing research might sit behind a one-way mirror, like the one shown in Figure 6.10, in order to observe how children interact in a classroom without letting them know. In a public setting, a researcher might act like a casual onlooker—another face in the crowd—to watch how other people behave. In the Bowker hockey games study, observers collected data in plain sight by posing as fans in the stands. In 69 hockey games, only two parents ever asked the observer what he or she was doing, suggesting that the researcher’s presence was truly unobtrusive.

Solution 2: Wait It Out. Another solution is to wait out the situation. A researcher who plans to observe at a school might let the children get used to his or her presence until they forget they’re being watched. The anthropologist Jane Goodall, in her studies of chimpanzees in the wild, used a similar tactic. When she began introducing herself to the chimps in the Gombe National Park in Africa, they fled, or stopped whatever else they were doing to focus on her. After several months, the chimps got used to having her around and were no longer afraid to go about their usual activities in her presence. Similarly, participants in the Mehl EAR study reported that after a couple of days of wearing the device, they did not find it to be invasive (Mehl & Pennebaker, 2003).

Solution 3: Measure the Behavior’s Results. Another way to avoid reactivity is to use unobtrusive data. Instead of observing behavior directly, researchers measure the traces a particular behavior leaves behind. For example, in a museum, wear-and-tear on the flooring can signal which areas of the museum are the most popular, and the height of smudges on the windows can indicate the age of visitors. The number of empty liquor bottles in residential garbage cans indicates how much alcohol is being consumed in a community (Webb, Campbell, Schwartz, & Sechrest, 1966). Researchers can measure behavior without doing any direct participant observation.


Is it ethical for researchers to observe the behaviors of others? It depends. Most psychologists believe it is ethical to watch people in museums, classrooms, hockey games, or even at the sinks of public bathrooms because in those settings people can reasonably expect their activities to be public, not private. Of course, when psychologists report the results of such observational studies, they do not specifically identify any of the people who were observed.

More secretive methods, such as one-way mirrors and covert video recording, are also considered ethical in some conditions. In most cases, psychologists doing research must obtain permission in advance to watch or to record people’s private behavior. If hidden video recording is used, the researcher must explain the procedure at the conclusion of the study. If people object to having been recorded, the researcher must erase the file without watching it.

FIGURE 6.10 Unobtrusive observations. This one-way mirror lets the researcher unobtrusively record the behaviors of children in a preschool classroom.

Certain ethical decisions may be influenced by the policies of a university where a study is conducted. As discussed in Chapter 4, institutional review boards (IRBs) assess each study to decide whether it can be conducted ethically.


1. Sketch a concept map of observer bias, observer effects, and reactivity, and indicate the approaches researchers can take to minimize each problem.

2. Explain why each of these three problems can threaten construct validity, using this sentence structure for each issue:

If an observational study suffers from ________, then the researcher might be measuring ________ instead of ________.

1. See pp. 169–173. 2. See pp. 169 and 172.



Summary

• Surveys, polls, and observational methods are used to support frequency claims, but they also measure variables for association and causal claims. When interrogating a claim based on data from a survey or an observational study, we ask about the construct validity of the measurement.

Construct Validity of Surveys and Polls

• Survey question formats include open-ended, forced-choice, Likert scale, and semantic differential.

• Sometimes the way a survey question is worded can lead people to be more likely or less likely to agree with it.

• Double-barreled and negatively worded questions are difficult to answer in a valid way.

• People sometimes answer survey questions with an acquiescent or fence-sitting response tendency or in a way that makes them look good. Researchers can add items to a survey or change the way questions are written, in order to avoid some of these problems.

• Surveys are efficient and accurate ways to assess people’s subjective feelings and opinions; they may be less appropriate for assessing people’s actual behavior, motivations, or memories.

Construct Validity of Behavioral Observations

• Observational studies record people’s true behavior, rather than what people say about their behavior.

• Well-trained coders and clear codebooks help ensure that observations will be reliable and not influenced by observer expectations.

• Some observational studies are susceptible to reactivity. Masked designs and unobtrusive observations make it more likely that observers will not make biased ratings, and that participants will not change their behavior in reaction to being observed.

• Local IRB guidelines may vary, but in general, it is considered ethical to conduct observational research in public settings where people expect to be seen by others.

Key Terms

survey, p. 154
poll, p. 154
open-ended question, p. 154
forced-choice question, p. 154
Likert scale, p. 155
semantic differential format, p. 155
leading question, p. 156
double-barreled question, p. 157
negatively worded question, p. 157
response set, p. 160
acquiescence, p. 160
fence sitting, p. 161
socially desirable responding, p. 162
faking good, p. 162
faking bad, p. 162
observational research, p. 165
observer bias, p. 169
observer effect, p. 169
masked design, p. 172
reactivity, p. 172
unobtrusive observation, p. 172


Review Questions

1. The following item appears on a survey: “Was your cell phone purchased within the last two years, and have you downloaded the most recent updates?” What is the biggest problem with this wording?

a. It is a leading question.

b. It involves negative wording.

c. It is a double-barreled question.

d. It is not on a Likert scale.

2. When people are using an acquiescent response set they are:

a. Trying to give the researcher the responses they think he or she wants to hear.

b. Misrepresenting their views to appear more socially acceptable.

c. Giving the same, neutral answer to each question.

d. Tending to agree with every item, no matter what it says.

3. In which of the following situations do people most accurately answer survey questions?

a. When they are describing the reasons for their own behavior.

b. When they are describing what happened to them, especially after important events.

c. When they are describing their subjective experience; how they personally feel about something.

d. People almost never answer survey questions accurately.

4. Which of the following makes it more likely that behavioral observations will have good interrater reliability?

a. A masked study design

b. A clear codebook

c. Using naive, untrained coders

d. Open-ended responses

5. Which one of the following is a means of controlling for observer bias?

a. Using unobtrusive observations.

b. Waiting for the participants to become used to the observer.

c. Making sure the observer does not know the study’s hypotheses.

d. Measuring physical traces of behavior rather than observing behavior directly.

6. Which of the following is a way of preventing reactivity?

a. Waiting for the participants to become used to the observer.

b. Making sure the observers do not know the study’s hypotheses.

c. Making sure the observer uses a clear codebook.

d. Ensuring the observers have good interrater reliability.

Learning Actively

1. Consider the various survey question formats: open-ended, forced-choice, Likert scale, and semantic differential. For each of the following research topics, write a question in each format, keeping in mind some of the pitfalls in question writing. Which of the questions you wrote would have the best construct validity, and why?

a. A study that measures attitudes about women serving in combat roles in the military.

b. A customer service survey asking people about their satisfaction with their most recent shopping experience.

c. A poll that asks people which political party they have supported in the past.

2. As part of their Well-Being Index, the Gallup organization asks a daily sample of Americans, “In the last seven days, on how many days did you exercise for 30 or more minutes?” If people say they have exercised three or more days, Gallup classifies them as “frequent exercisers.” Gallup finds that between about 47% (in the winter months) and 55% (in the summer) report being frequent exercisers (Gallup, n.d.). What kind of question is this: forced-choice, Likert scale, semantic differential, or some other format? Does the item appear to be leading, negatively worded, or double-barreled? Do you think it leads to accurate responses?

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 6.

3. Plan an observational study to see which kind of drivers are more likely to stop for a pedestrian in a crosswalk: male or female drivers. Think about how to maximize your construct validity. Will observers be biased about what they record? How might they influence the people they’re watching, if at all? Where should they stand to observe driver behavior? How will you evaluate the interrater reliability of your observers? Write a two- to three-sentence operational definition of what it means to “stop for a pedestrian in a crosswalk.” The definition should be clear enough that if you asked two friends to use it to code “stopping for pedestrian” behavior, it would have good reliability and validity.

4. To study the kinds of faces babies usually see, researchers asked parents to place tiny video cameras on their 1-month-old and 3-month-old infants during their waking hours (Sugden, Mohamed-Ali, & Moulson, 2013). Coders viewed the resulting video footage frame by frame, categorizing the gender, race, and age of the faces each baby saw. The results revealed that babies are exposed to faces 25% of their waking hours. In addition, the babies in the sample were exposed to female faces 70% of the time, and 96% of the time they were exposed to faces that were the same race as themselves. What questions might you ask to decide whether the observational measures in this study were susceptible to observer bias, observer effects, or reactivity?

61% Said This Shoe “Felt True to Size”

8 out of 10 Drivers Say They Experience Road Rage CBS Local, 2016

Three in Four Women Worldwide Rate Their Lives as “Struggling” or “Suffering” Gallup, 2015


Sampling: Estimating the Frequency of Behaviors and Beliefs

THE CLAIMS THAT OPEN this chapter address a variety of topics: driving behavior, well-being, and the fit of a pair of shoes. The target population in each case is different. One example is about American drivers, one applies to women around the world, and the last represents online shoppers. In all three claims, we are being asked to believe something about a larger group of people (e.g., all U.S. drivers or all the world’s women—more than 3.5 billion), based on data from a smaller sample that was actually studied. In this chapter, you’ll learn about when we can use a sample to generalize to a population, and when we cannot. In addition, you’ll learn when we really care about being able to generalize to a population, and when it’s less important.

A year from now, you should still be able to:

1. Explain why external validity is often essential for frequency claims.

2. Describe which sampling techniques allow generalizing from a sample to a population of interest, and which ones do not.

GENERALIZABILITY: DOES THE SAMPLE REPRESENT THE POPULATION?

When interrogating external validity, we ask whether the results of a particular study can be generalized to some larger population of interest. External validity is often extremely important for frequency claims. To interrogate the external validity of frequency claims such as those discussed in Chapter 6 and also presented here, we might ask the following types of questions:

“Do the students who rated the professor on this website adequately represent all the professor’s former students?”

“Does the sample of drivers who were asked about road rage adequately represent American drivers?”

“Can feelings of the women in the sample generalize to all the world’s women?”

“Do the people who reviewed the fit of these shoes represent the population of people who wear them?”

and even . . .

“Can we predict the results of the presidential election if the polling sample consisted of 1,500 people?”

Recall that external validity concerns both samples and settings. A researcher may intend the results of a study to generalize to the other members of a certain population, as in the questions above. Or a researcher may intend the results to generalize to other settings, such as other shoes from the same manufacturer, other products, or other classes taught by the same professor. However, this chapter focuses primarily on the external validity of samples.

Populations and Samples

Have you ever been offered a free sample in a grocery store? Say you tried a sample of spinach mini-quiche and you loved it. You probably assumed that all 50 in the box would taste just the same. Maybe you liked one baked pita chip and assumed all the chips in the bag would be good, too. The single bite you tried is the sample. The box or bag it came from is the population. A population is the entire set of people or products in which you are interested. The sample is a smaller set, taken from that population. You don’t need to eat the whole bag (the whole population) to know whether you like the chips; you only need to test a small sample. If you did taste every chip in the population, you would be conducting a census.

Researchers usually don’t need to study every member of the population either—that is, they do not need to conduct a census. Instead, they study a sample of people, assuming that if the sample behaves a certain way, the population will do the same. The external validity of a study concerns whether the sample used in the study is adequate to represent the unstudied population. If the sample can generalize to the population, there is good external validity. If the sample is biased in some way, there is not. Therefore, when a sample has good external validity, we also say the sample “generalizes to” or “is representative of” a population of interest.


The world’s population is around 7.5 billion people, but researchers only rarely have that entire population in mind when they conduct a study. Before researchers can decide whether a sample is biased or unbiased, they have to specify a population to which they want to generalize: the population of interest. Instead of “the population” as a whole, a research study’s intended population is more limited. A population of interest might be laboratory mice. It might be undergraduate women. It might be men with dementia. At the grocery store, the population of interest might be the 50 mini-quiches in the box, or the 200 pita chips in the bag.

If a sample of people rated a style of shoe on how well they fit, we might be interested in generalizing to the population of people who have worn those shoes. If we are interrogating the results of a national election poll, we might care primarily about the population of people who will vote in the next election in the country. In order to say that a sample generalizes to a population, we have to first decide which population we are interested in.


For a sample to be representative of a population, the sample must come from the population. However, coming from the population is not sufficient by itself; that is, just because a sample comes from a population does not mean it generalizes to that population. Just because a sample consists of American drivers does not mean it represents all American drivers. Just because a sample contains women doesn’t mean the sample can generalize to the population of the world’s women.

Samples are either biased or representative. In a biased sample, also called an unrepresentative sample, some members of the population of interest have a much higher probability of being included in the sample compared to other members. In an unbiased sample, also called a representative sample, all members of the population have an equal chance of being included in the sample. Only unbiased samples allow us to make inferences about the population of interest. Table 7.1 lists a few examples of biased and unbiased samples.


TABLE 7.1 Biased and Unbiased Samples of Different Populations of Interest

Population of interest: Democrats in Texas
Biased sample: Recruiting people sitting in the front row at the Texas Democratic Convention.
Unbiased sample: Obtaining a list of all registered Texas Democrats from public records, and calling a sample of them through randomized digit dialing.

Population of interest: Drivers
Biased sample: Asking drivers to complete a survey when they stop to add coins to a parking meter.
Unbiased sample: Obtaining a list of licensed drivers in each state, and selecting a sample using a random number generator.

Population of interest: Students who have taken a class with Professor A
Biased sample: Including only students who have written comments about Professor A on an online website.
Unbiased sample: Obtaining a list of all of Professor A’s current and former students, and selecting every fifteenth student for study.


When Is a Sample Biased? Let’s return to the food examples to explore biased and unbiased samples further. If you reached all the way to the bottom of the bag to select your sample pita chip, that sample would be biased, or unrepresentative. Broken chips at the bottom of the bag are not representative of the population, and choosing a broken chip would cause you to draw the wrong conclusions about the quality of that bag of chips. Similarly, suppose the box of 50 quiches was a variety pack, containing various flavors of quiche. In that case, a sample spinach quiche would be unrepresentative, too. If the other types of quiche are not as tasty as the spinach, you would draw incorrect conclusions about the varied population.

In a consumer survey or an online opinion poll, a biased sample could be like getting a handful from the bottom of the bag, where the broken pita chips are more likely to be. In other words, a researcher’s sample might contain too many of the most unusual people. For instance, the students who rate a professor on a website might tend to be the ones who are angry or disgruntled, and they might not represent the rest of the professor’s students very well. A biased study sample could also be like an unrepresentative spinach quiche. A researcher’s sample might include only one kind of people, when the population of interest is more like a variety pack. Imagine a poll that sampled only Democrats when the population of interest contains Republicans, Democrats, and people with other political views (Figure 7.1). Or imagine a study that sampled only men when the population of interest contains both men and women.

Of course, the population of interest is what the researcher says it is, so if the population is only Democrats, it is appropriate to use only people who are registered Democrats in the sample. Even then, the researcher would want to be sure the Democrats in the sample are representative of the population of Democrats.

FIGURE 7.1 Biased, or unrepresentative, samples. If the population of interest includes members of all political parties, a sample from a single party’s political convention would not provide a representative sample.


A sample could be biased in at least two ways: Researchers might study only those they can contact conveniently, or only those who volunteer to respond. These two biases can threaten the external validity of a study because people who are convenient or more willing might have different opinions from those who are less handy and less willing.

Sampling Only Those Who Are Easy to Contact. Many studies incorporate convenience sampling, using a sample of people who are easy to contact and readily available to participate. Psychology studies are often conducted by psychology professors, and they find it handy to use college students as participants. The Mehl study on how much people talk is an example (see Chapter 6). However, those easy-to-reach college students may not be representative of other populations that are less educated, older, or younger (e.g., Connor, Snibbe & Markus, 2005).

Another form of convenience sampling is used in online studies. In the last 10 years, psychologists have been conducting research through websites such as Amazon’s Mechanical Turk and Prolific Academic (Figure 7.2). People who want to earn money for participating in research can do so online. Even though these samples are convenient, those who complete studies on websites sometimes show personality traits and political beliefs that differ slightly from other adult samples (Clifford, Jewell, & Waggoner, 2015; Goodman, Cryder, & Cheema, 2013).

FIGURE 7.2 Online studies normally use convenience samples. People who participate in online research for payment, such as at the MTurk website, are considered a convenience sample. How do they differ from college student samples, or from representative samples of people from the same country? (Source:

Here’s another example. Imagine you are conducting an exit poll during a presidential election, and you’ve hired interviewers to ask people who they voted for as they’re leaving the polling place. (Exit polls are widely used in the United States to help the media predict the results of an election before the votes are completely counted.) The sample for your exit poll might be biased in a couple of ways. For one, maybe you had only enough money to send pollsters to polling places that were nearby and easy to reach. The resulting sample might be biased because the neighboring precincts might be different from the district as a whole. Therefore, it would be better to send interviewers to a sample of precincts that represent the entire population of precincts. In addition, at a particular precinct, the pollsters might approach the population of exiting voters in a biased way. Untrained exit poll workers may feel most comfortable approaching voters who look friendly, look similar to themselves, or look as if they are not in a hurry. For instance, younger poll workers might find it easiest to approach younger voters. Yet because younger voters tend to be more liberal, that sample’s result might lead you to conclude the voters at that location voted for a Democratic candidate more often than they really did. In this case, sampling only the people who are convenient would lead to a biased sample. Effective pollsters train their staff to interview exiting voters according to a strict (usually randomized) schedule.

Researchers might also end up with a convenience sample if they are unable to contact an important subset of people. They might not be able to study those who live far away, who don’t show up to a study appointment, or who don’t answer the phone. Such circumstances may result in a biased sample when the people the researchers can contact are different from the population to which they want to generalize.

Many years ago, for instance, people conducting surveys in the U.S. selected their samples from landline telephone numbers. At the time, this approach made sense because almost all Americans had telephones in their homes. Yet the number of people who use only cell phones increases every year. If wireless-only people are different from people who have landlines, then a survey or poll that excluded wireless numbers could have inaccurate results.

The U.S. government’s Centers for Disease Control and Prevention conducts monthly surveys of people’s health behavior. They use the data to estimate important health indicators, such as psychological distress, vaccination rates, smoking, and alcohol use. They have found that wireless-only citizens (at last estimate, 48.3% of households) differ from those with both types of phones (41.2%), landline-only (7.2%), or no phones (3.1%) (Blumberg & Luke, 2015). People in wireless-only households tend to be younger, renting rather than owning their homes, and living in poverty.



TABLE 7.2 Health Behaviors of Samples with Different Phone Ownership
It’s important to include cell phone numbers in survey and poll samples.

Current smoker                                      11.5   18.8   20.8
At least one heavy drinking day in the past year    17.2   29.6   24.0
Received influenza vaccine in the past year         49.8   33.9   38.1

Note: N = 15,988.
Source: U.S. Department of Health and Human Services, 2016. Data collected from July 2015 to December 2015.

As Table 7.2 shows, some of their health behaviors also differ. If the CDC estimated the American population’s smoking behavior simply from calling landline phones, their estimates would be incorrect; they would be biased toward underestimating. Fortunately, the CDC calls both wireless and landline numbers for its monthly survey.

Sampling Only Those Who Volunteer. Another way a sample might be biased is through self-selection, a term used when a sample is known to contain only people who volunteer to participate. Self-selection is ubiquitous in online polls, and it can cause serious problems for external validity.

When Internet users choose to rate something (a product, an online quiz, a professor), they are self-selecting when doing so (Figure 7.3). This could lead to biased estimates because the people who rate the items are not necessarily representative of the population of all people who have bought the product, visited the website, or taken the class. Researchers do not always know how online “raters” differ from “nonraters,” but they speculate that the people who take the time to rate things might have stronger opinions or might be more willing to share ideas with others.

Not all Internet-based surveys are subject to self-selection bias. An example is the road rage survey conducted by the American Automobile Association (Figure 7.4). A panel of people were randomly selected by a market research firm to complete weekly surveys online. Members of the panel could not self-select; they were invited only if their home address had been randomly selected. In addition, the small portion of participants who did not have Internet access were provided with a laptop and Internet service so they could be represented in the sample.


FIGURE 7.3 Some Internet polls are based on self-selected samples. This online poll invited readers to vote on their plans for child car seats. Why can’t we generalize from this sample of 17,153 to the population of parents of small children? (Source:

FIGURE 7.4 Some Internet polls are based on random samples. The data in this figure came from an online survey on road rage. Respondents had been randomly sampled and invited to respond, so we can probably generalize from this poll to the population of American drivers. Respondents could endorse multiple behaviors on the list. (Source:

Obtaining a Representative Sample: Probability Sampling Techniques Samples that are convenient or self-selected are not likely to represent the population of interest. In contrast, when external validity is vital and researchers need an unbiased, representative sample from a population, probability sampling is the best option. There are several techniques for probability sampling, but they all involve an element of random selection. In probability sampling, also called random sampling, every member of the population of interest has an equal and known chance of being selected for the sample, regardless of whether they are convenient or motivated to volunteer. Therefore, probability samples have excellent external validity. They can generalize to the population of interest because all members of the population are equally likely to be represented. In contrast, nonprobability sampling techniques involve nonrandom sampling and result in a biased sample.


The most basic form of probability sampling is simple random sampling. To visualize this process, imagine that each member of the population of interest has his or her name written on a plastic ball. The balls are rolled around in a bowl, then a mechanism spits out a number of balls equal to the size of the desired sample. The people whose names are on the selected balls will make up the sample.

Another way to create a simple random sample is to assign a number to each individual in a population, and then select certain ones using a table of random num- bers. Professional researchers use software to generate random numbers (Figure 7.5). When pollsters need a random sample, they program computers to randomly select telephone numbers from a database of eligible cell phones and landlines.
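The numbering-and-drawing procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the fixed `seed` are assumptions made for the example, and the sizes (50 drawn from 650) simply mirror the randomizer example in Figure 7.5.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw ID numbers without replacement from 1..population_size."""
    rng = random.Random(seed)            # a fixed seed only makes the draw repeatable
    ids = range(1, population_size + 1)  # every member is assigned a number first
    return sorted(rng.sample(ids, sample_size))

# Hypothetical use: pick 50 of 650 numbered members.
sample_ids = simple_random_sample(population_size=650, sample_size=50, seed=1)
print(len(sample_ids), "members selected; first few IDs:", sample_ids[:5])
```

Because `random.sample` draws without replacement, no member can be picked twice, and every numbered member has the same chance of ending up in the sample.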

❮❮ For a sample table of random numbers, see Appendix A, pp. 547–550.

FIGURE 7.5 Computerized randomizers. This website generates lists of random numbers. In this example, the user requested a list of 50 random members of a population of 650. Each individual in the original population must first be assigned a number from 1 to 650. The randomizer tool determines which 50 of the individuals should be in the random sample. (Source:


Although simple random sampling works well in theory, it can be surprisingly difficult and time consuming. It can be nearly impossible to find and enumerate every member of the population of interest, so researchers usually use variants of the basic technique. The variants below are just as externally valid as simple random sampling because they all contain an element of random selection.


Cluster sampling is an option when people are already divided into arbitrary groups. Clusters of participants within a population of interest are randomly selected, and then all individuals in each selected cluster are used. If a researcher wanted to randomly sample high school students in the state of Pennsylvania, for example, he could start with a list of the 952 public high schools (clusters) in that state, randomly select 100 of those high schools (clusters), and then include every student from each of those 100 schools in the sample. The Bowker hockey games study (2009) used a version of cluster sampling (see Chapter 6). The researchers selected 69 games at random out of 630 possible hockey games in Ottawa, Canada, that they could have attended during the season. They then sampled every single comment at each game. Each game was a cluster, and every comment at the game was sampled.
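As a rough sketch of the high school example, the Python below randomly selects 100 of 952 clusters and then keeps every member of each selected cluster. The roster is invented for illustration (each hypothetical school has 20 made-up student IDs), and `cluster_sample` is a name chosen for this example only.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters cluster names, then keep every member of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # random step: which clusters
    members = [m for name in chosen for m in clusters[name]]  # no random step here
    return chosen, members

# Hypothetical roster: 952 schools, each with 20 made-up student IDs.
schools = {f"school_{s:03d}": [f"s{s:03d}_{i:02d}" for i in range(20)]
           for s in range(1, 953)}

chosen_schools, students = cluster_sample(schools, n_clusters=100, seed=1)
print(len(chosen_schools), "schools,", len(students), "students")
# prints: 100 schools, 2000 students
```

The only random step is the choice of clusters; once a school is selected, all of its students are included.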

In the related technique of multistage sampling, two random samples are selected: a random sample of clusters, then a random sample of people within those clusters. In the earlier example, the researcher starts with a list of high schools (clusters) in the state and selects a random 100 of those schools. Then, instead of selecting all students at each school, the researcher selects a random sample of students from each of the 100 selected schools. Both cluster sampling and multistage sampling are easier than sampling from all Pennsylvania high schools, and both should still produce a representative sample because they involve random selection.
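The two random stages can be made explicit in code. The sketch below assumes an invented roster (952 hypothetical schools with 20 made-up student IDs each) and an arbitrary choice of 5 students per school; neither number comes from a real study, and `multistage_sample` is a name chosen for illustration.

```python
import random

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: random clusters. Stage 2: random members within each cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)      # stage 1
    sample = []
    for name in chosen:
        roster = clusters[name]
        k = min(n_per_cluster, len(roster))                # small clusters give what they have
        sample.extend(rng.sample(roster, k))               # stage 2
    return chosen, sample

# Hypothetical roster: 952 schools, each with 20 made-up student IDs.
schools = {f"school_{s:03d}": [f"s{s:03d}_{i:02d}" for i in range(20)]
           for s in range(1, 953)}

chosen_schools, students = multistage_sample(schools, n_clusters=100,
                                             n_per_cluster=5, seed=1)
print(len(chosen_schools), "schools,", len(students), "students")
# prints: 100 schools, 500 students
```

Compared with cluster sampling, the only change is the second `rng.sample` call, which replaces "take everyone in the school" with "take a random subset of the school."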

Professional pollsters might use three-stage multistage sampling to select phone numbers for telephone polls. They first select a random sample of area codes out of all possible area codes in the country. Next, they select a random sample of the exchanges (the middle three digits of a U.S. phone number) out of all possible exchanges in each selected area code. Then, for each area code and exchange selected, they dial the last four digits at random, using a computer. The area codes and exchanges are considered clusters. At each stage of this sampling process, random selection is used.


Another multistage technique is stratified random sampling, in which the researcher purposefully selects particular demographic categories, or strata, and then randomly selects individuals within each of the categories, proportionate to their assumed membership in the population. For example, a group of researchers might want to be sure their sample of 1,000 Canadians includes people of South Asian descent in the same proportion as in the Canadian population (which is 4%). Thus, they might have two categories (strata) in their population: South Asian Canadians and other Canadians. In a sample of 1,000, they would make sure to include at least 40 members of the category of interest (South Asian Canadians). Importantly, however, all 1,000 members of both categories are selected at random.
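A minimal sketch of proportionate stratified sampling follows. The population list is invented (100,000 made-up members, 4% of them in the South Asian stratum, matching the percentage in the example), and `stratified_sample` is a hypothetical helper, not a standard library function.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """strata maps a stratum name to the list of its members.
    Each stratum contributes in proportion to its share of the population."""
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n = round(total_n * len(members) / pop_size)  # proportionate allocation
        sample[name] = rng.sample(members, n)         # random draw within the stratum
    return sample

# Hypothetical population: 4% South Asian Canadians, 96% other Canadians.
strata = {
    "South Asian Canadians": [f"sa_{i}" for i in range(4_000)],
    "Other Canadians": [f"other_{i}" for i in range(96_000)],
}
sample = stratified_sample(strata, total_n=1000, seed=1)
print({name: len(members) for name, members in sample.items()})
# prints: {'South Asian Canadians': 40, 'Other Canadians': 960}
```

With shares that do not divide evenly, the rounded stratum sizes may not add up exactly to the requested total, so a real design would need a rule for handling the remainder.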

Stratified random sampling differs from cluster sampling in two ways. First, strata are meaningful categories (such as ethnic or religious groups), whereas clusters are more arbitrary (any random set of hockey games or high schools would do). Second, the final sample sizes of the strata reflect their proportion in the population, whereas clusters are not selected with such proportions in mind.


A variation of stratified random sampling is called oversampling, in which the researcher intentionally overrepresents one or more groups. Perhaps a researcher wants to sample 1,000 people, making sure to include South Asians in the sample. Maybe the researcher’s population of interest has a low percentage of South Asians (say, 4%). Because 40 individuals may not be enough to make accurate statistical estimates, the researcher decides that of the 1,000 people she samples, a full 100 will be sampled at random from the Canadian South Asian community. In this example, the ethnicities of the participants are still the categories, but the researcher is oversampling the South Asian population: The South Asian group will constitute 10% of the sample, even though it represents only 4% of the population. A survey that includes an oversample adjusts the final results so members in the oversampled group are weighted to their actual proportion in the population. However, this is still a probability sample because the 100 South Asians in the final sample were sampled randomly from the population of South Asians.
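The adjustment can be worked through with the numbers in this example. The formula below (weight equals population share divided by sample share) is one common way to do such an adjustment and is shown here as an assumption; the text does not specify which formula a given survey uses.

```latex
\begin{align*}
w_{\text{South Asian}} &= \frac{\text{population share}}{\text{sample share}}
  = \frac{0.04}{0.10} = 0.4 \\
w_{\text{other}} &= \frac{0.96}{0.90} \approx 1.067 \\
100 \times 0.4 &= 40, \qquad 900 \times 1.067 \approx 960
\end{align*}
```

After weighting, the 100 oversampled respondents count as 40 and the other 900 count as about 960, so the two groups contribute to the overall estimate in their 4% and 96% population proportions.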


In systematic sampling, using a computer or a random number table, the researcher starts by selecting two random numbers (say, 4 and 7). If the population of interest is a roomful of students, the researcher would start with the fourth person in the room and then count off, choosing every seventh person until the sample was the desired size. Mehl and his colleagues (2007) used the EAR device to sample conversations every 12.5 minutes (see Chapter 6). Although they did not choose this value (12.5 min) at random, the effect is essentially the same as being a random sample of participants’ conversations. (Note that although external validity often involves generalizing to populations of people, researchers may also generalize to settings; in this case, to a population of conversations.)
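The counting-off procedure can be sketched as follows. The room of 60 students is invented for illustration, and the start (4) and interval (7) are fixed here only to mirror the example; in practice both numbers would come from a random number generator or table.

```python
def systematic_sample(members, start, interval, sample_size):
    """Take the start-th member (counting from 1), then every interval-th after it."""
    picks = members[start - 1::interval]   # slice step does the counting off
    return picks[:sample_size]             # stop once the sample is the desired size

# Hypothetical roomful of 60 students, listed in seating order.
room = [f"student_{i}" for i in range(1, 61)]
chosen = systematic_sample(room, start=4, interval=7, sample_size=5)
print(chosen)
# prints: ['student_4', 'student_11', 'student_18', 'student_25', 'student_32']
```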


When reading about studies in the news or in empirical journal articles, you’ll probably come across methods of sampling that combine the techniques described here. Researchers might do a combination of multistage sampling and oversampling, for example. As long as clusters or individuals were selected at random, the sample will represent the population of interest. It will have good external validity.

In addition, to control for bias, researchers might supplement random selection with a statistical technique called weighting. If they determine that the final sample contains fewer members of a subgroup than it should (such as fewer wireless-only respondents or fewer young adults), they adjust the data so responses from members of underrepresented categories count more, and overrepresented members count less.

❮❮ For more on random numbers and how to use them, see Appendix A, pp. 545–546.
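To make the idea of weighting concrete, here is a small sketch in Python. All numbers are hypothetical (a sample of 1,000 in which wireless-only respondents make up 30% although they are assumed to be 48% of the population, with made-up smoking rates), and the weight formula, population share divided by sample share, is one simple approach rather than the specific method any particular survey uses.

```python
def group_weights(sample_counts, population_shares):
    """Weight for each group = its population share / its share of the sample."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (count / n)
            for g, count in sample_counts.items()}

def weighted_rate(sample_counts, group_rates, weights):
    """Proportion showing a behavior, with each respondent counted by weight."""
    total = sum(weights[g] * sample_counts[g] for g in sample_counts)
    hits = sum(weights[g] * sample_counts[g] * group_rates[g] for g in sample_counts)
    return hits / total

# Hypothetical sample and population figures (not real survey data).
sample_counts = {"wireless_only": 300, "other": 700}
population_shares = {"wireless_only": 0.48, "other": 0.52}
smoking_rates = {"wireless_only": 0.19, "other": 0.12}

weights = group_weights(sample_counts, population_shares)
unweighted = weighted_rate(sample_counts, smoking_rates,
                           {g: 1.0 for g in sample_counts})
weighted = weighted_rate(sample_counts, smoking_rates, weights)
print(round(unweighted, 3), round(weighted, 3))
```

In this made-up case the underrepresented wireless-only group gets a weight above 1 and the overrepresented group a weight below 1, which pulls the overall estimate toward the wireless-only group's higher rate.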

In sum, there are many acceptable ways to obtain a representative sample. Because all these probability sampling techniques involve a component of randomness, they all ensure that each individual, cluster, or systematic interval has an equal and known chance of being selected. In other words, people are not excluded from the sample for any of the reasons that might lead to bias. Figure 7.6 provides a visual overview of the probability and nonprobability sampling techniques.


In conversation you might hear, “I have a random question . . .” for an unexpected comment. But in research, random has a more precise meaning: occurring without any order or pattern. Each coin flip in a series is random because you cannot predict (beyond 50% certainty) whether it will come up heads or tails; there’s no predictable order.

FIGURE 7.6 Probability and nonprobability sampling techniques. Probability sampling techniques all involve an element of random selection, and result in samples that resemble the population. Nonprobability sampling techniques are biased because they exclude systematic subsets of individuals; they cannot be generalized to the population.
[Diagram: a population of interest leads to two branches. Probability sampling (simple random sample, systematic sample, cluster sample, multistage sample, oversample, stratified random sample): all members of the population of interest have an equal and known chance. Nonprobability sampling (convenience sample, purposive sample, quota sample, snowball sample): some types of people are systematically left out.]

In the context of research methods, it’s important not to confuse random sampling and random assignment. With random sampling (probability sampling), researchers create a sample using some random method, such as drawing names from a hat or using a random-digit phone dialer, so that each member of the population has an equal chance of being in the sample. Random sampling enhances external validity.

Random assignment is used only in experimental designs. When researchers want to place participants into two different groups (such as a treatment group and a comparison group), they usually assign them at random. Random assignment enhances internal validity by helping ensure that the comparison group and the treatment group have the same kinds of people in them, thereby controlling for alternative explanations. For example, in an experiment testing how exercise affects one’s well-being, random assignment would make it likely that the people in the treatment and comparison groups are about equally happy at the start. (For more detail on random assignment, see Chapters 3 and 10.)
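The two ideas are separate steps, which a short sketch can make visible. The population of 10,000 made-up people, the sample size of 40, and the even split into two groups are all assumptions chosen for illustration.

```python
import random

rng = random.Random(1)  # fixed seed only so the illustration is repeatable

# Random sampling: decides WHO is in the study (supports external validity).
population = [f"person_{i}" for i in range(1, 10_001)]   # hypothetical population
sample = rng.sample(population, 40)

# Random assignment: decides WHICH GROUP each sampled person joins
# (supports internal validity).
shuffled = sample[:]          # copy so the original sample order is untouched
rng.shuffle(shuffled)
treatment, comparison = shuffled[:20], shuffled[20:]

print(len(sample), len(treatment), len(comparison))
# prints: 40 20 20
```

A study can use either step without the other: a convenience sample of volunteers can still be randomly assigned to groups, and a randomly sampled poll may involve no assignment at all.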

Settling for an Unrepresentative Sample: Nonprobability Sampling Techniques Samples obtained through random selection achieve excellent external validity, but such samples can be difficult to obtain. For example, the Gallup organization really does survey people in 160 countries, either by calling random samples of people in each country or traveling in person to randomly selected remote villages. You can appreciate the expense required to obtain the estimate that “three in four women worldwide rate their lives as ‘struggling’ or ‘suffering.’”

In cases where external validity is not vital to a study’s goals, researchers might be content with a nonprobability sampling technique. Depending on the type of study, they can choose among a number of techniques for gathering such a sample.


The most common sampling technique in behavioral research, convenience sampling (introduced earlier) uses samples that are chosen merely on the basis of who is easy to reach. Many psychologists study students on their own campuses because they are nearby. The researchers may ask for volunteers in an introductory psychology class or among residents of a dormitory.


If researchers want to study only certain kinds of people, they recruit only those particular participants. When this is done in a nonrandom way, it is called purposive sampling. Researchers wishing to study, for example, the effectiveness of a specific intervention to quit smoking would seek only smokers for their sample. Notice that limiting a sample to only one type of participant does not make a sample purposive. If smokers are recruited by phoning community members at random, that sample would not be considered purposive because it is a random sample. However, if researchers recruit the sample of smokers by posting flyers at a local tobacco store, that action makes it a purposive sample, because only smokers will participate, and because the smokers are not randomly selected. Researchers studying a weight management program might study only people in a diabetes clinic. Such a sample would not be, and might not need to be, representative of the population of obese people in some area.


One variation on purposive sampling that can help researchers find rare individuals is snowball sampling, in which participants are asked to recommend a few acquaintances for the study. For a study on coping behaviors in people who have Crohn’s disease, for example, a researcher might start with one or two who have the condition, and then ask them to recruit people from their support groups. Each of them might, in turn, recruit one or two more acquaintances, until the sample is large enough. Snowball sampling is unrepresentative because people are recruited via social networks, which are not random. You might be familiar with this approach from online surveys that urge you to forward the survey link to a few more people. (Many Facebook quizzes work like this, even though they are created for entertainment, not for research.)


Similar to stratified random sampling, in quota sampling the researcher identifies subsets of the population of interest and then sets a target number for each category in the sample (e.g., 80 Asian Americans, 80 African Americans, and 80 Latinos). Next, the researcher samples from the population of interest nonrandomly until the quotas are filled. As you can see, both quota sampling and stratified random sampling specify subcategories and attempt to fill targeted percentages or numbers for each subcategory. However, in quota sampling the participants are selected nonrandomly (perhaps through convenience or purposive sampling), and in stratified random sampling they are selected using a random selection technique.
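The contrast with stratified random sampling shows up clearly in code: a quota sample needs no random number generator at all. The sketch below fills the three quotas of 80 from the example by accepting volunteers in the order they arrive; the stream of 600 volunteers and the extra "Other" category are invented for illustration.

```python
def quota_sample(volunteers, quotas):
    """Accept people in arrival order until each category's quota is full.
    volunteers is a sequence of (person, category) pairs."""
    filled = {category: [] for category in quotas}
    for person, category in volunteers:
        if category in filled and len(filled[category]) < quotas[category]:
            filled[category].append(person)   # first come, first accepted
        if all(len(filled[c]) == quotas[c] for c in quotas):
            break                              # every quota met; stop recruiting
    return filled

quotas = {"Asian American": 80, "African American": 80, "Latino": 80}

# Hypothetical arrival stream of 600 volunteers (whoever shows up first).
categories = ["Asian American", "African American", "Latino", "Other"]
volunteers = [(f"volunteer_{i}", categories[i % 4]) for i in range(600)]

result = quota_sample(volunteers, quotas)
print({category: len(people) for category, people in result.items()})
# prints: {'Asian American': 80, 'African American': 80, 'Latino': 80}
```

The category counts match the targets, but who fills each category depends entirely on who happened to arrive first, which is why the result is not a probability sample.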


1. What are five techniques for selecting a probability sample of a population of interest? Where does randomness enter into each of these five selection processes?


2. In your own words, define the word random in the research methods context. Then describe the difference between random sampling and random assignment.

3. What are four ways of selecting a nonprobability sample? What types of people might be more likely to be selected in each case?

4. Why are convenience, purposive, snowball, and quota sampling not examples of representative sampling?

1. See pp. 186–191. 2. See pp. 190–191. 3. See pp. 191–192. 4. Because none of them involve selecting participants at random.


INTERROGATING EXTERNAL VALIDITY: WHAT MATTERS MOST? A sample is either externally valid for a population of interest, or it has unknown external validity. Table 7.3 organizes synonymous terms from this chapter under the two descriptors.

Although external validity is crucial for many frequency claims, it might not always matter. When researchers study association claims or causal claims, they are often comfortable with unknown external validity.

In a Frequency Claim, External Validity Is a Priority Frequency claims, as you know, are claims about how often something happens in a population. When you read headlines such as “8 out of 10 Drivers Say They Experience Road Rage” and “Three in Four Women Worldwide Rate Their Lives as ‘Struggling’ or ‘Suffering,’” it might be obvious to you that external validity is important. If the driving study used sampling techniques that contained mostly urban residents, the road rage estimate might be too high because urban driving may be more stressful. If the Gallup poll included too few women from impoverished countries, the three-in-four estimate might be too low. In such claims, external validity, which relies on probability sampling techniques, is crucial.

In certain cases, the external validity of surveys based on random samples can actually be confirmed. In political races, the accuracy of pre-election opinion polling can be compared with the final voting results. In most cases, however, researchers are not able to check the accuracy of their samples’ estimates because they hardly ever complete a full census of a population on the variable of interest. For example, we could never evaluate the well-being of all the women in the world to find out the true percentage of those who are struggling or suffering. Similarly, a researcher can’t find all the owners of a particular style of shoe to ask them whether their shoes “fit true to size.” Because you usually cannot directly check accuracy when interrogating a frequency claim, the best you can do is examine the method the researchers used. As long as it was a probability sampling technique, you can be more confident in the external validity of the result.

TABLE 7.3 Synonymous Sampling Terms Used in This Chapter

Externally valid          Unknown external validity
Unbiased sample           Biased sample
Probability sample        Nonprobability sample
Random sample             Nonrandom sample
Representative sample     Unrepresentative sample

When External Validity Is a Lower Priority Even though you need a probability sample to support a frequency claim, many associations and causes can still be accurately detected even in a nonprobability sample. Researchers might not have the funds to obtain random samples for their studies, and their priorities lie elsewhere. For example, as you will learn, random assignment is prioritized over random sampling when conducting an experiment.

What about a frequency claim that is not based on a probability sample? It might matter a lot, or it might not. You will need to carefully consider whether the reason for the sample’s bias is relevant to the claim.


Consider whether self-selection affects the results of an online shopping rating, as in the headline, “61% said this shoe felt true to size.” You can be pretty sure the people who rated the fit of these shoes are self-selected and therefore don’t represent all the people who own that model. The raters obviously have Internet access, whereas some of the shoe owners might not; the raters probably do more online shopping, whereas some of the shoe owners used bricks-and-mortar stores. More importantly, the raters cared enough to rate the shoes online; while many of them probably responded because they either loved or hated the shoes, those who are in-between might not be motivated enough to bother rating their new shoes online.

Another reason people respond might be that they are conscientious. They like to keep others informed, so they tend to rate everything they buy. In this case, the shopping rating sample is self-selected to include people who are more helpful than average.

The question is: Do the opinions of these nonrandom shoppers apply to other shoppers, and to how the shoes will fit you? Are the feet of opinionated or conscientious raters likely to be very different from those of the general population? Probably not, so their opinions about the fit of the shoes might generalize. The raters’ fashion sense might even be the same as yours. (After all, they were attracted to the same image online.) If you believe the characteristics of this self-selected sample are roughly the same as those of others who bought the shoes, their ratings might be valid for you after all.

Here’s another example. Let’s say a driver uses the Waze navigation app to report heavy traffic on a specific highway (Figure 7.7). This driver is not a randomly selected sample of drivers on that stretch of road; in fact, she is more conscientious and thus more likely to report problems. However, these traits are not that relevant. Traffic is the same for everybody, conscientious or not, so even though this driver is a nonrandom sample, her traffic report can probably generalize to the other drivers on that road. The feature that has biased the sample (being conscientious) is not relevant to the variable being measured (being in traffic).

❯❯ For more on when external validity may not be a priority, see Chapter 8, pp. 226–227; Chapter 10, pp. 301–303; and Chapter 14.

195 Interrogating External Validity: What Matters Most?

In short, when you know a sample is not representative, you should think carefully about how much it matters. Are the characteristics that make the sample biased actually relevant to what you are measuring? In certain cases, it’s reasonable to trust the reports of unrepresentative samples.


Let’s use this reasoning to work through a couple of other examples. Recall from Chapter 6 the 30 dual-earner families who allowed the researchers to videotape their evening activities (Campos et al., 2013). Only certain kinds of families will let researchers walk around the house and record everyone’s behavior. Does this affect the conclusions of the study? It seems possible that a family that volunteers for such intrusion has a warmer emotional tone than the full population of dual-earning families. Without more data on families who do not readily agree to be videotaped, we cannot know for sure. The researchers may have to live with some uncertainty about the generalizability of their data.

Now we’ll return to the Mehl study in Chapter 6 on how many words people speak in a day (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007). The sample of participants was not drawn randomly from a population of college students; instead, it was a convenience sample who participated because they were trying to earn class credit or a few extra dollars. Could the qualities that make these students volunteer for the study also be qualities that affect how many words they would say? Probably not, but it is possible. Again, we live with some uncertainty about whether the Mehl findings would generalize—not only to other college students, but to other populations outside the college setting as well. We know that Mehl found the same results among college students in Mexico, but we don’t know whether those results apply to samples of middle-aged or older adults. However, just because we don’t know whether the finding generalizes to other populations doesn’t mean Mehl’s results from college students are wrong or even uninteresting. Indeed, future research by Mehl and his colleagues could investigate this question in new populations.

FIGURE 7.7 Nonprobability samples might not matter. A driver who reports traffic is probably not representative of all the drivers on that stretch of road. Nevertheless, the report from this nonrandom sample might be accurate.

Larger Samples Are Not More Representative

In research, is a bigger sample always a better sample? The answer may surprise you: not necessarily. The idea that larger samples are more externally valid than smaller samples is perhaps one of the hardest misconceptions to dispel in a research methods course.

When a phenomenon is rare, we do need a large sample in order to locate enough instances of that phenomenon for valid statistical analysis. For example, in a study of religion in American life, the Pew Research Center phoned a random sample of 35,071 adults. The large size enabled them to obtain and analyze sufficiently large samples of small religious groups, such as Jehovah’s Witnesses, who comprise less than 1% of Americans (Pew Research Center, 2015). But for most variables, when researchers are striving to generalize from a sample to a population, the size of a sample is in fact much less important than how that sample was selected. When it comes to the external validity of the sample, it’s how, not how many.

Suppose you want to try to predict the outcome of the U.S. presidential election by polling 4,000 people at the Republican National Convention. You would have a grand old sample, but it would not tell you anything about the opinions of the entire country’s voting population because everyone you sampled would be a member of one political party. Similarly, many Internet polls are so popular that thousands of people choose to vote in them. Even so, 100,000 self-selected people are not likely to be representative of the population. Look back at the BabyCenter poll about car seats (see Figure 7.3). More than 17,000 people chose to vote, yet we have no idea to whom the results generalize.
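The "how, not how many" point can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the population of 100,000 voters, the 20,000 "attendees," and every support rate are hypothetical numbers chosen for the example, not data from any poll discussed here.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 voters (1 = supports candidate A).
# Assumption: 20,000 are convention attendees, 90% of whom support A;
# among the other 80,000, 40% support A. Overall support is exactly 50%.
attendees = [1] * 18_000 + [0] * 2_000
others = [1] * 32_000 + [0] * 48_000
population = attendees + others
true_support = sum(population) / len(population)

# A large but biased sample: 10,000 people, all drawn from the attendees.
big_biased = rng.sample(attendees, 10_000)
# A small simple random sample: 1,000 people drawn from the whole population.
small_random = rng.sample(population, 1_000)

biased_estimate = sum(big_biased) / len(big_biased)
random_estimate = sum(small_random) / len(small_random)

print(f"True support:                {true_support:.3f}")
print(f"Biased sample (n = 10,000):  {biased_estimate:.3f}")
print(f"Random sample (n = 1,000):   {random_estimate:.3f}")
```

In this made-up scenario the biased sample is ten times larger, yet its estimate sits near the attendees' 90% support rate, while the small random sample lands close to the true 50%.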

When researchers conduct public opinion polls, it turns out that 1,000–2,000 randomly selected people are all they usually need—even for populations as large as the U.S. population of 319 million. For reasons of statistical accuracy, many polls shoot for, at most, a sample of 2,000. A researcher chooses a sample size for the poll in order to optimize the margin of error of the estimate. As introduced in Chapter 3, the margin of error of the estimate (or just margin of error) is a statistic that quantifies the degree of sampling error in a study’s results. For instance, you might read that 46% of Canadians in some poll support the Liberal Party, plus or minus 3%. In this example, the margin of error (“plus or minus 3%”) means that if the researchers conducted the same poll many times and computed margins of error, 95% of the ranges would include the true value of support. In other words, it would mean that the range, 43% to 49%, probably contains the true percentage of Canadians who support the Liberal Party.

Table 7.4 shows the margin of error for samples of different sizes. You can see in the table that the larger the sample size, the smaller the margin of error—that is, the more accurately the sample’s results reflect the views of the population. However, after a random sample size of 1,000, it takes many more people to gain just a little more accuracy in the margin of error. That’s why many researchers consider 1,000 to be an optimal balance between statistical accuracy and polling effort. A sample of 1,000 people, as long as it is random, allows them to generalize to the population (even a population of 319 million) quite accurately. In effect, sample size is not an external validity issue; it is a statistical validity issue.


TABLE 7.4
Margins of Error Associated with Different Random Sample Sizes

Random sample size    Margin of error
2,000                 Plus or minus 2%
1,500                 Plus or minus 3%
1,000                 Plus or minus 3%
500                   Plus or minus 4%
200                   Plus or minus 7%
100                   Plus or minus 10%
50                    Plus or minus 14%

Note: Margin of error varies as a function of sample size and the percentage result of the poll. In this table, estimates were based on a 50% polling result (e.g., if 50% of people supported a candidate).
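The relationship between sample size and margin of error can be sketched with the usual normal-approximation formula for a proportion, z × sqrt(p(1 − p)/n). The table does not state which formula it used, so the formula, the 95% confidence level (z = 1.96), and the function name `margin_of_error` are assumptions made for this illustration; the rounded values can be compared against the table.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p estimated from a
    simple random sample of size n (z = 1.96 corresponds to 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

# Same sample sizes as the table, assuming a 50% polling result.
for n in (2000, 1500, 1000, 500, 200, 100, 50):
    moe = margin_of_error(n)
    print(f"n = {n:>5}: plus or minus {moe * 100:.1f}% (rounded: {round(moe * 100)}%)")
```

Under these assumptions, the margin shrinks with the square root of the sample size: quadrupling the sample only halves the margin of error, which is one way to see why gains beyond about 1,000 respondents are small.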


1. When might researchers decide to use a nonprobability sample, even though a probability sample would ensure external validity?

2. For what type of claim will it be most important for a researcher to use a representative sample?

3. Which of these samples is more likely to be representative of a population of 100,000?

a. A snowball sample of 50,000 people

b. A cluster sample of 500 people

4. Explain why a larger sample is not necessarily more externally valid than a smaller one.

1. See pp. 194–196. 2. A frequency claim; see p. 193. 3. b. 4. See pp. 196–197.


Generalizability: Does the Sample Represent the Population?

• The quality of a frequency claim usually depends on the ability to generalize from the sample to the population of interest. Researchers use samples to estimate the characteristics of a population.

• When a sample is externally valid, we can also say it is unbiased, generalizable, or representative.

• When generalization is the goal, random sampling techniques—rather than sample size—are vital because they lead to unbiased estimates of a population.

• Nonrandom and self-selected samples do not represent the population. Such biased samples may be obtained when researchers sample only those who are easy to reach or only those who are more willing to participate.

• Probability sampling techniques can result in a representative sample; they include simple random sampling, cluster sampling, multistage sampling, stratified random sampling, oversampling, systematic sampling, and combinations of these. All of them select people or clusters at random, so all members of the population of interest are equally likely to be included in the sample.

• Nonprobability sampling techniques include convenience sampling, purposive sampling, snowball sampling, and quota sampling. Such sampling methods do not allow generalizing from the sample to a population.

Interrogating External Validity: What Matters Most?

• When researchers intend to generalize from the sample to the population, probability sampling (random sampling) is essential.

• Random samples are crucial when researchers are estimating the frequency of a particular opinion, condition, or behavior in a population. Nonprobability (nonrandom) samples can occasionally be appropriate when the cause of the bias is not relevant to the survey topic. Representative samples may be of lower priority for association and causal claims.

• For external validity, the size of a sample is not as important as whether the sample was selected randomly.

Summary When a claim makes a statement about a population of interest, you can ask how well the sample that was studied (such as a sample of online shoppers) represents the population in the claim (all online shoppers).


Key Terms

population, p. 180 sample, p. 180 census, p. 180

biased sample, p. 181 unbiased sample, p. 181 convenience sampling, p. 183

self-selection, p. 185 probability sampling, p. 186 nonprobability sampling, p. 187

199Learning Actively

Review Questions

1. Which of the following four terms is not synonymous with the others?

a. Generalizable sample

b. Externally valid sample

c. Representative sample

d. Biased sample

2. A researcher’s population of interest is New York City dog owners. Which of the following samples is most likely to generalize to this population of interest?

a. A sample of 25 dog owners visiting dog-friendly New York City parks.

b. A sample of 25 dog owners who have appointments for their dogs at veterinarians in the New York City area.

c. A sample of 25 dog owners selected at random from New York City pet registration records.

d. A sample of 25 dog owners who visit New York City’s ASPCA website.

3. Which of the following samples is most likely to generalize to its population of interest?

a. A convenience sample of 12,000.

b. A quota sample of 120.

c. A stratified random sample of 120.

d. A self-selected sample of 120,000.

4. Externally valid samples are more important for some research questions than for others. For which of the following research questions will it be most important to use an externally valid sampling technique?

a. Estimating the proportion of U.S. teens who are depressed.

b. Testing the association between depression and illegal drug use in U.S. teens.

c. Testing the effectiveness of support groups for teens with depression.

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 7.

Learning Actively

1. During a recent U.S. election, the news media interviewed a group of women in Florida. Although opinion polls supported the liberal candidate, these conservative women were still optimistic that their own side would win. One woman said, “I don’t think those polls are very good—after all, they’ve never called me. Have they called any of you ladies?” Is this woman’s critique of polling techniques appropriate? Why or why not?

2. Imagine you’re planning to estimate the price of the average book at your college bookstore. The bookstore carries 13,000 titles, but you plan to sample only 200 books. You will select a sample of 200 books, record the price of each book, and use the average of the 200 books in the sample to estimate the average price of the 13,000 titles in the bookstore. Assume the bookstore can give you access to a database that lists all 13,000 titles it carries. Based on this information, answer the following questions:

a. What is the sample in this study, and what is the population of interest?

b. How might you collect a simple random sample of books?

c. How might you collect a stratified random sample? (What would your strata be?)

d. How might you collect a convenience sample?

e. How might you collect a systematic random sample?

f. How might you collect a cluster sample?

g. How might you collect a quota sample?

simple random sampling, p. 187 cluster sampling, p. 188 multistage sampling, p. 188 stratified random sampling, p. 188

oversampling, p. 189 systematic sampling, p. 189 random assignment, p. 191 purposive sampling, p. 191

snowball sampling, p. 192 quota sampling, p. 192


Tools for Evaluating Association Claims

Meaningful Conversations Linked to Happier People Scientific American, 2010

Couples Who Meet Online Have Better Marriages Freakonomics, 2013


Bivariate Correlational Research

THE TWO STATEMENTS ON the opposite page are examples of association claims that are supported by correlational studies. Each one is an association claim because it describes a relationship between variables: meaningful conversations and happiness; where people met their spouse and marriage quality.

The verbs used in each case are weaker, association verbs. In the first claim, the verb is linked, and the second claim’s verb is have. Neither statement argues that X causes Y, or X makes Y happen, or X increases rates of Y. (If they did, they would be causal claims, not association claims.)

What about the studies behind these statements? Even without reading the full details of each study, we might guess that the variables were all measured. Researchers can evaluate people’s meaningful conversations and their levels of happiness, but they can’t easily assign people to have deep conversations or assign people to have certain levels of happiness. Researchers can measure where people met their spouses, but they can’t reasonably assign people to meet their spouse either online or in person. They can measure marital satisfaction, but they can’t assign people to be satisfied or not. Because it’s a plausible assumption that the two variables in each claim were measured (rather than manipulated), we suspect the studies behind the claims are correlational.


A year from now, you should still be able to:

1. Explain that measured variables, not any particular statistic, make a study correlational.

2. Interrogate the construct validity and statistical validity (and, of lower priority, external validity) of an association claim.

3. Explain why a correlational study can support an association claim, but not a causal claim.

204 CHAPTER 8 Bivariate Correlational Research

This chapter describes the kinds of studies that can support association claims, explains what kinds of graphs and statistics are used to describe the associations, and shows how you can systematically interrogate an association claim using the four big validities framework. What kinds of questions should you ask when you encounter an association claim? What should you keep in mind if you plan to conduct a study to test such a claim?

INTRODUCING BIVARIATE CORRELATIONS

An association claim describes the relationship found between two measured variables. A bivariate correlation, or bivariate association, is an association that involves exactly two variables. Chapter 3 introduced the three types of associations: positive, negative, and zero. To investigate associations, researchers need to measure the first variable and the second variable—in the same group of people. Then they use graphs and simple statistics to describe the type of relationship the variables have with each other.

To investigate the association between meaningful, substantive conversations and happiness, Matthias Mehl and his colleagues (2010) measured people’s happiness by combining Pavot and Diener’s (1993) subjective well-being (SWB) scale (see Chapter 5) with a measure of overall happiness. Then they measured people’s level of “deep talk” by having them wear an electronically activated recorder (EAR) for 4 days. (The EAR, introduced in Chapter 6, is an observational measurement device, an unobtrusive microphone worn by a participant that records 30 seconds of ambient sound every 12.5 minutes.) After people’s daily conversations were recorded and transcribed, researchers coded the extent to which the recorded snippets represented “deep talk” or “substantive conversation.” Each participant was assigned a value representing the percentage of time spent on substantive conversation. Those with deeper conversations had higher well-being scores.

To test the relationship between meeting one’s spouse online and marital satisfaction, researcher John Cacioppo and his colleagues had e-mail surveys sent to thousands of people who participate in uSamp, an online market research project (Cacioppo, Cacioppo, Gonzaga, Ogburn, & VanderWeele, 2013). Respondents answered questions about where they met their spouse—online or not. Then, to evaluate marital satisfaction, the researchers used a 4-item measure called the Couples Satisfaction Index (CSI), which asks questions such as “Indicate the degree of happiness, all things considered, of your marriage,” with a 7-point rating scale from 1 (“extremely unhappy”) to 7 (“perfect”). People who met online scored a little higher on the CSI.

Another correlational study investigated this claim: “People who multitask the most are the worst at it” (introduced in Chapter 3). David Sanbonmatsu and his colleagues tested people on two variables: their frequency of media multitasking and their ability to do it (Sanbonmatsu, Strayer, Medeiros-Ward, & Watson, 2013). To measure frequency, the researchers had participants complete a Media Multitasking Inventory (MMI) indicating how many hours a day they spent using each of 12 kinds of media (such as web surfing, text messaging, music, computer video, TV) and also how often they used each one at the same time as doing another task. To measure multitasking ability, they gave participants the OSPAN (operation span) task. In this difficult task, participants had to alternately read letters on the computer screen and solve basic math problems in their heads. When prompted, they had to report all the letters and give the answers to all the math problems they’d recently seen. In the Sanbonmatsu study, those who reported doing the most media multitasking had the lowest ability on the OSPAN task.

205 Introducing Bivariate Correlations

Sample data from these three studies appear in Tables 8.1, 8.2, and 8.3. Notice that each row shows one person’s scores on two measured variables. Even though each study measured more than two variables, an analysis of bivariate correlations looks at only two variables at a time. Therefore, a correlational study might have measured multiple variables, but the authors present the bivariate correlations between different pairs of variables separately.

Review: Describing Associations Between Two Quantitative Variables

After recording the data, the next step in testing an association claim is to describe the relationship between the two measured variables using scatterplots and the correlation coefficient r. We could create a scatterplot for the relationship between deep talk and well-being, for example, by placing scores on the well-being scale on the x-axis and percentage of conversations that include deep talk on the y-axis, placing a dot on the graph to represent each person (Figure 8.1).

TABLE 8.1
Sample Data from the Mehl Study on Well-Being and Deep Talk

Person    Well-being score    Percentage of deep talk
A         4.5                 80
B         3.0                 52
C         3.2                 35
D         4.1                 42
…         …                   …
ZZ        2.8                 16

Note: Data are fabricated for illustration purposes. Source: Adapted from Mehl et al., 2010.

TABLE 8.2
Sample Data from the Cacioppo Study on Marital Satisfaction

Respondent    Where met spouse    Marital satisfaction
a             Online              6.2
b             Offline             5.5
c             Online              7.0
d             Offline             4.2
…             …                   …
yy            Online              7.0

Note: Data are fabricated for illustration purposes. Source: Adapted from Cacioppo et al., 2013.

TABLE 8.3
Sample Data from the Sanbonmatsu Study on Multitasking Frequency and Ability

Person     Multitasking frequency (MMI score)    Multitasking ability (OSPAN score)
Alek       3.65                                  27
Jade       4.21                                  48
Deangie    2.06                                  62
Max        8.44                                  25
…          …                                     …
Xiaxin     4.56                                  32

Source: Adapted from Sanbonmatsu et al., 2013.

In addition to creating the scatterplot, Mehl and his team computed the correlation coefficient for their data and came up with an r of .28. As discussed in Chapter 3, a positive r means that the relationship is positive: High scores on one variable go with high scores on the other. In other words, high percentages of substantive conversation go with high levels of well-being, and low percentages of substantive conversation go with low levels of well-being. The magnitude of r is .28, which indicates a relationship that is moderate in strength.
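To make the computation of r concrete, here is a minimal sketch of the Pearson correlation coefficient written out from its definition. The function name `pearson_r` is an assumption for this example, and the data are only the five fabricated rows printed in the Mehl sample-data table, so the result is illustrative and is not expected to match the study's r of .28.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    product of their separate variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

# The five fabricated rows from the sample-data table (persons A, B, C, D, ZZ).
well_being = [4.5, 3.0, 3.2, 4.1, 2.8]
deep_talk = [80, 52, 35, 42, 16]

r = pearson_r(well_being, deep_talk)
print(f"r = {r:.2f}")
```

The sign of the result gives the direction of the association and its distance from zero gives the strength; with only five invented rows, the value says nothing about the real data set.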

The reason we consider an association of .28 to be moderate is that psychology researchers typically follow a set of conventions once provided by the psychological statistician Jacob Cohen (1992). Recall that r has two qualities: direction and strength. Direction refers to whether the association is positive, negative, or zero; strength refers to how closely related the two variables are—how close r is to 1 or −1. Cohen provided benchmarks for labeling association strength, as shown in Table 8.4. According to these conventions, the magnitude of the deep talk/well-being association is medium. (These guidelines are discussed in more detail later in the chapter.)

FIGURE 8.1 Scatterplot of the association between deep talk and well-being. (Source: Adapted from Mehl et al., 2010.)

[Scatterplot. x-axis: Well-being scale (z scores); y-axis: Percentage of deep talk. Annotation: This dot represents one person who scored very high on the well-being scale and had a high percentage of deep talk.]


TABLE 8.4
Cohen’s Guidelines for Evaluating Strength of Association (Based on r)

An r of approximately    Strength of association
.10 (or −.10)            Small, or weak
.30 (or −.30)            Medium, or moderate
.50 (or −.50)            Large, or strong

Source: Cohen, 1992.


Figure 8.2 shows a scatterplot for the study correlating the frequency and ability of multitasking. When Sanbonmatsu’s team computed the correlation coefficient between those two variables, they found an r of −.19. The negative r means that more frequent media multitasking is associated with lower scores on multitasking ability, and less frequent media multitasking is associated with higher ability. According to Cohen’s conventions, the size of the correlation, .19, would be interpreted as small to medium in strength.

Describing Associations with Categorical Data

In the examples we have discussed so far, the nature of the association can be described with scatterplots and the correlation coefficient r. For the association between marital satisfaction and online dating, however, the dating variable is categorical; its values fall in either one category or another. A person meets his or her spouse either online or offline. The other variable in this association, marital satisfaction, is quantitative; 7 means more marital satisfaction than 6, 6 means more than 5, and so on.


When both variables in an association are measured on quantitative scales (as were number of substantive conversations and happiness), a scatterplot is usually the best way to represent the data. But is a scatterplot the best representation of an association in which one of the variables is measured categorically? Figure 8.3 shows how a scatterplot for the association between meeting location and marital satisfaction might look.

FIGURE 8.2 Scatterplot of the association between the frequency and ability of media multitasking. Does this cloud of points slope up or down? Is it a strong or a weak relationship? (Source: Adapted from Sanbonmatsu et al., 2013.)

[Scatterplot. x-axis: Frequency of media multitasking (MMI score); y-axis: Multitasking ability (OSPAN score).]

❮❮ For more on categorical and quantitative variables, see Chapter 5, pp. 122–123.

As in all scatterplots, one variable is plotted on the x-axis and the other on the y-axis, and one dot represents one person (in a study with a very large sample, one dot could represent several people who had the same scores). You can even look for an association in this graph: Do the scattered points slope up from left to right, do they slope down, or is the slope flat? If you answered that you see a very slight downward slope, you would be right. You’d conclude there’s an association between where people meet their spouse and marital satisfaction, and that those who met online have slightly happier marriages, just as the researchers found when they conducted their study (Cacioppo et al., 2013). If you computed the correlation between these two variables, you would get a very weak correlation: r = −.06.

Although you can make a scatterplot of such data, it is far more common for researchers to plot the results of an association with a categorical variable as a bar graph, as in Figure 8.4. Each person is not represented by one data point; instead, the graph shows the mean marital satisfaction rating (the arithmetic average) for all the people who met their spouses online and the mean marital satisfaction rating for those who met their spouses in person.

When you use a bar graph, you usually examine the difference between the group averages to see whether there is an association. In the graph of meeting location and marital satisfaction in Figure 8.4, you can see that the average satisfaction score is slightly higher in the online than the offline group. The difference in means indicates an association between where people met their spouse and marital satisfaction. Because the difference is small, the association would be considered weak.

FIGURE 8.3 Scatterplot of meeting location and marital satisfaction. Do you see an association here between meeting location and marital satisfaction? (Data are fabricated for illustration purposes.) (Source: Adapted from Cacioppo et al., 2013.)

[Scatterplot. x-axis: Where did you meet your spouse? (Online, Offline); y-axis: Marital satisfaction.]

FIGURE 8.4 Bar graph of meeting location and marital satisfaction. This is the same outcome as in Figure 8.3, graphed differently. Do you see an association here between meeting location and marital satisfaction? (Source: Adapted from Cacioppo et al., 2013.)

[Bar graph. x-axis: Where did you meet your spouse? (Online, Offline); y-axis: Marital satisfaction.]


When at least one of the variables in an association claim is categorical, as in the online dating example, researchers may use different statistics to analyze the data. Although they occasionally use r, it is more common to test whether the difference between means (group averages) is statistically significant, usually by using a statistic called the t test, or other statistical tests.
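As a rough illustration of comparing group averages, the sketch below computes the difference between two group means and a t statistic. The ratings are invented for the example, and the passage does not say which form of the t test the researchers used; Welch's version (which does not assume equal variances) is an assumption made here, as is the function name `welch_t`.

```python
import math
from statistics import mean, variance

# Hypothetical marital satisfaction ratings (1-7 scale), invented for illustration.
online = [6.2, 7.0, 7.0, 5.8, 6.5]
offline = [5.5, 4.2, 6.0, 5.9, 6.4]

def welch_t(a, b):
    """Welch's t statistic: difference between two independent group means
    divided by its estimated standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

difference = mean(online) - mean(offline)
t = welch_t(online, offline)

print(f"Mean (online):  {mean(online):.2f}")
print(f"Mean (offline): {mean(offline):.2f}")
print(f"Difference:     {difference:.2f}")
print(f"t statistic:    {t:.2f}")
```

Judging whether a t statistic is statistically significant also requires its degrees of freedom and a reference distribution, which this sketch leaves out.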

A Study with All Measured Variables Is Correlational

It might seem confusing that association claims can be supported by either scatterplots or bar graphs, using a variety of statistics, such as r or t tests. It’s important to remember that no matter what kind of graph you see, when the method of the study measured both variables, the study is correlational, and therefore it can support an association claim. (In contrast, recall from Chapter 3 that if one of the variables is manipulated, it’s an experiment, which is more appropriate for testing a causal claim.) An association claim is not supported by a particular kind of statistic or a particular kind of graph; it is supported by a study design—correlational research—in which all the variables are measured (Figure 8.5).

❮❮ For more detail about the t test, see Statistics Review: Inferential Statistics, pp. 491–495.

FIGURE 8.5 Correlational studies support association claims. When you look for the study behind an association claim, you should find a correlational study.

[Diagram: the claim “Couples who meet online have better marriages.” paired with a correlational study (two measured variables), shown as a bar graph of Marital satisfaction by Where did you meet your spouse? (Online, Offline).]



1. At minimum, how many variables are there in an association claim?

2. What characteristic of a study makes it correlational?

3. Sketch three scatterplots: one showing a positive correlation, one showing a negative correlation, and one showing a zero correlation.

4. Sketch two bar graphs: one that shows a correlation between the two variables, and one that shows no correlation.

5. When do researchers typically use a bar graph, as opposed to a scatterplot, to display correlational data?

1. Two. 2. All variables are measured; see p. 209. 3. Answers may vary; see Figures 8.1, 8.2, and 8.3 for models. 4. A bar graph that shows a correlation should have bars at different heights; a bar graph with a zero correlation would show two bars of the same height. 5. See pp. 208–209.

INTERROGATING ASSOCIATION CLAIMS

With an association claim, the two most important validities to interrogate are construct validity and statistical validity. You might also ask about the external validity of the association. Although internal validity is relevant for causal claims, not association claims, you need to be able to explain why correlational studies do not establish internal validity. We’ll now discuss the questions you’ll use to interrogate each of the four big validities specifically in the context of association claims.

Construct Validity: How Well Was Each Variable Measured?

An association claim describes the relationship between two measured variables, so it is relevant to ask about the construct validity of each variable. How well was each of the two variables measured?

To interrogate the Mehl study, for example, you would ask questions about the researchers’ operationalizations of deep talk and well-being. Recall that deep talk in this study was observed via the EAR recordings and coded later by research assistants, while well-being was measured using the SWB scale. Once you know what kind of measure was used for each variable, you can ask questions to assess each one’s construct validity: Does the measure have good reliability? Is it measuring what it’s intended to measure? What is the evidence for its face validity, its concurrent validity, its discriminant and convergent validity? For example, you could ask whether the 4-item measure of marital satisfaction used in the Cacioppo study had good internal reliability, and whether it had convergent validity. Does it correlate with other measures of marital happiness?

Statistical Validity: How Well Do the Data Support the Conclusion?

When you ask about the statistical validity of an association claim, you are asking about factors that might have affected the scatterplot, correlation coefficient r, bar graph, or difference score that led to your association claim. You need to consider the effect size and statistical significance of the relationship, any outliers that might have affected the overall findings, restriction of range, and whether a seemingly zero association might actually be curvilinear.


All associations are not equal; some are stronger than others. Recall that the effect size describes the strength of a relationship between two or more variables. As an example, Figure 8.6 depicts two associations: Both are positive, but the one in part B is stronger (its r is closer to 1). In other words, part B depicts a stronger effect size.

Recall the conventions for labeling correlations as small, medium, or large in strength. In the Mehl study, the association between deep talk and well-being was r = .28, a relationship of medium strength. In the Sanbonmatsu study, the size of the association between multitasking frequency and ability was r = −.19, a relationship of small to medium strength. In the Cacioppo study, the relationship between meeting location and marital satisfaction was very small: a correlation of about r = .06. Therefore, of the three examples in this chapter, the strongest one is the deep talk/well-being relationship. But how strong is .28 compared with −.19 or .06? What is the logic behind these conventions?

FIGURE 8.6 Two scatterplots depicting different association strengths. Both of these are positive associations. Which scatterplot shows the stronger relationship, part A or part B?

Larger Effect Sizes Allow More Accurate Predictions. One meaning of “strong” when applied to effect size is that strong effect sizes enable predictions that are more accurate. When two variables are correlated, we can make predictions of one variable based on information from another. The more strongly correlated two variables are (the larger the effect size), the more accurate our predictions can be.

To understand how an association can help us make more accurate predictions, suppose we want to guess how tall a 2-year-old (we’ll call him Hugo) will be as an 18-year-old. If we know absolutely nothing about Hugo, our best bet would be to predict that Hugo’s adult height will be exactly average. Hugo might be taller than average or shorter than average, and the mean splits the difference. In the United States, the average height (or 50th percentile) for an 18-year-old man is 175 centimeters, so we should guess that Hugo will be 175 cm tall at age 18.

Now suppose we happen to know that Hugo is a relatively short 2-year-old; his height is 83 cm, within the 25th percentile for that age group. Given this new information, we should lower our prediction of Hugo’s adult height accordingly. Why? Because we know there’s a strong correlation between 2-year-old height and adult height, and we can use a prediction line generated from this correlation (Figure 8.7A). Starting at Hugo’s 2-year-old height of 83 cm, we’d read up to the prediction line and predict 172 cm as his 18-year-old height.

Are our predictions of Hugo’s adult height likely to be perfect? Of course not. Let’s say we find out that Hugo actually grew up to be 170 cm tall at age 18. We guessed 172, so our prediction was off by 2 cm. That’s the error of prediction. Our 2 cm difference is an error, but it’s a smaller error than the 5 cm error we would have made for him before, using average adult height.

Errors of prediction get larger when associations get weaker. Suppose we want to predict Hugo’s adult height but we don’t know his 2-year-old height anymore; now all we know is the height of his mother. The correlation between mothers’ height and sons’ height is positive, but weaker than the correlation between one’s 2-year-old height and one’s adult height. As shown in Figure 8.7B, the scatterplot is more spread out. We can still use the prediction line associated with this correlation, but the fact that the correlation is weaker means our errors of prediction will be larger. If Hugo’s mother’s height is 163 cm, the prediction line indicates that Hugo’s adult height would be 174 cm. Our prediction is now off by 4 cm (recall that Hugo grew up to be 170 cm). Our prediction error was larger than when we used 2-year-old height, in part because the correlation behind our prediction was weaker.


In sum, positive and negative associations can allow us to predict one variable from another, and the stronger the effect size, the more accurate, on average, our predictions will be.
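The Hugo example can be sketched in a few lines of Python. The first part simply recomputes the three errors of prediction given in the text. The second part uses a standard regression result that the text itself does not state and that is brought in here as an assumption: when scores are expressed in standard-deviation units, the typical size of a prediction error is sqrt(1 − r²). The function name typical_error is hypothetical.

```python
import math

# Hugo's actual adult height and the three guesses from the text (cm).
actual = 170
guesses = {
    "average adult height only": 175,
    "2-year-old height (stronger correlation)": 172,
    "mother's height (weaker correlation)": 174,
}
errors = {label: abs(guess - actual) for label, guess in guesses.items()}

def typical_error(r):
    """Typical prediction error in standard-deviation units: sqrt(1 - r^2).

    This formula is a standard regression result, not something the
    chapter states; the sign of r does not matter, only its size.
    """
    return math.sqrt(1 - r ** 2)

for label, err in errors.items():
    print(f"{label}: off by {err} cm")

# Effect sizes mentioned in the chapter (magnitudes only).
for r in (0.06, 0.19, 0.28, 0.57):
    print(f"r = {r:.2f}: typical error = {typical_error(r):.3f} SD")
```

The larger the magnitude of r, the smaller the typical error, which is the sense in which stronger correlations allow more accurate predictions on average.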

Larger Effect Sizes Are Usually More Important. Effect sizes can also indicate the importance of a result. When all else is equal, a larger effect size is often considered more important than a small one. By this criterion, the association between deep talk and happiness is more important than the much weaker one between meeting online and having a happier marriage.

However, there are exceptions to this rule. Depending on the context, even a small effect size can be important. A medical study on heart disease provides one famous example in which a small r was considered extremely important. The study (reported in McCartney & Rosenthal, 2000) found that taking an aspirin a day was associated with a lower rate of heart attacks, though the strength was seemingly tiny: r = .03. According to the guidelines in Table 8.4, this is a very weak association, but in terms of the number of lives saved, even this small association was substantial. The full sample in the study consisted of about 22,000 people. Comparing the 11,000 in the aspirin group to the 11,000 in the placebo group, the study showed 85 fewer heart attacks in the aspirin group. An r of only .03 therefore represented 85 heart attacks avoided. This outcome was considered so dramatic that the doctors ended the study early and told everyone in the no-aspirin group to start taking aspirin (Figure 8.8). In such cases, even a tiny effect size, by Cohen’s standards, can be considered important, especially when it has life-or-death implications. (Keep in mind that the aspirin study was also experimental, not correlational, so we can support the claim that the aspirin caused the lower rate of heart attacks.)

FIGURE 8.7 Stronger correlations mean more accurate predictions. (A) If we use Hugo’s 2-year-old height to predict Hugo’s adult height, we would be off by 2 cm. (B) If we use Hugo’s mother’s height to predict Hugo’s adult height, we would be off by 4 cm. Weaker correlations allow predictions, too, but their errors of prediction are larger. (Data are fabricated for illustration purposes.) [Panel A plots height (cm) at age 2 against height (cm) at age 18; panel B plots mother’s height (cm) against son’s adult height (cm) at age 18. Each panel marks the prediction line, Hugo’s predicted height, and Hugo’s actual height of 170 cm.]

When the outcome is not as extreme as life or death, however, a very small effect size might indeed be negligible. For instance, at r = .06, the effect size of the association between meeting online and marital satisfaction corresponds to a difference on the 7-point satisfaction scale of .16 (5.64 versus 5.48). It’s hard to picture what sixteen one-hundredths of a point difference means in practical terms, but it doesn’t seem like a large effect. Similarly, the Cacioppo team also collected the divorce rates in the two groups. They found that the divorce rate for couples who met online was 5.87%, compared to 7.73% for couples who met offline, which corresponds to an effect size of r = .02. That is also a very small effect size, representing about two extra divorces per 100 people. In your opinion, is it important? What’s more, does the effect size correspond to the headlines used by journalists who covered the study in the press?


Whenever researchers obtain a correlation coefficient (r), they not only establish the direction and strength (effect size) of the relationship; they also determine whether the correlation is statistically significant. In the present context, statistical significance refers to the conclusion a researcher reaches regarding the likelihood of getting a correlation of that size just by chance, assuming there’s no correlation in the real world.

The Logic of Statistical Inference. Determining statistical significance is a process of inference. Researchers cannot study everybody in a population of interest, so they investigate one sample at a time, assuming the sample’s result mirrors what is happening in the population. If there is a correlation between two variables in a population, we will probably observe a similar correlation in the sample, too. Likewise, if there is no association in the population, we will probably observe little or none in the sample.

❯❯ For more on statistical significance, see Statistics Review: Inferential Statistics, pp. 499–500.

FIGURE 8.8 Effect size and importance. Larger effect sizes are usually more important than smaller ones. In some studies, however, such as those showing that an aspirin a day can reduce heart attack risk, even a very small effect size can be an important result.


Even if, in the full population, there is exactly zero association (r = .00) between two variables, a sample’s result can easily be a little larger or smaller than zero (such as r = .03 or r = –.08), just by chance. Therefore, when we find an association in a sample, we can never know for sure whether or not there really is one in the larger population.

Here’s an example. A researcher conducts a study on a sample of 310 college students and finds that ability to multitask correlates with frequency of multitasking at r = −.19. That correlation might really exist in the whole population of college students. On the other hand, even if there is zero correlation between multitasking ability and frequency in the real world, once in a while a sample may, for reasons of chance alone, find a correlation as strong as −.19.

Statistical significance calculations help researchers evaluate how likely a result such as r = −.19 would be if the sample came from a population in which the association is really zero. We can estimate whether our sample’s result is the kind we’d typically get from a zero-association population, or a result that would actually be quite rare in such a population. The calculations estimate the following: What kinds of r results would we typically get from a zero-correlation population if we (hypothetically) conducted the same study many, many times with samples of the same size? How rarely would we get an r of −.19 just by chance, even if there is no association in the population?
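The thought experiment of running the same study many times can be sketched as a small simulation. The version below is illustrative only: it assumes both variables are normally distributed and completely unrelated in the population, draws 1,000 hypothetical samples of 310 people each, and counts how often chance alone produces a correlation at least as strong as .19 in either direction. The helper pearson_r and the chosen number of repetitions are assumptions of this sketch, not part of any study described here.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)      # fixed seed so the sketch is repeatable
n_people = 310      # sample size from the multitasking example
n_studies = 1000    # hypothetical number of repeated studies

rs = []
for _ in range(n_studies):
    x = [random.gauss(0, 1) for _ in range(n_people)]
    y = [random.gauss(0, 1) for _ in range(n_people)]  # generated independently of x
    rs.append(pearson_r(x, y))

# How often does a zero-association population yield |r| of .19 or more?
share_as_extreme = sum(abs(r) >= 0.19 for r in rs) / n_studies
print(f"largest |r| seen by chance: {max(abs(r) for r in rs):.2f}")
print(f"share of studies with |r| >= .19: {share_as_extreme:.3f}")
```

Most simulated samples produce correlations close to zero, and very few reach .19, which is why a result that strong in a sample of 310 is treated as rare under the zero-association assumption.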

What Does a Statistically Significant Result Mean? Statistical significance calculations provide a probability estimate (p, sometimes abbreviated as sig for significance). The p value is the probability of obtaining an association at least as strong as the sample’s if the sample came from a population in which the association is zero. If the probability (p) associated with the result is very small (that is, less than 5%), we conclude that a result like this would be rare in a zero-association population. The correlation is considered statistically significant. The r of −.19 in the Sanbonmatsu study would have had a low probability of occurring in a zero-association population (p < .05), so we can conclude that their result is statistically significant.

What Does a Nonsignificant Result Mean? By contrast, sometimes we determine that the probability (p) of getting the sample’s correlation just by chance would be relatively high (i.e., higher than p = .05) in a zero-association population. In other words, the result is not that rare. It is considered to be “nonsignificant” (n.s.) or “not statistically significant.” This means we cannot rule out the possibility that the result came from a population in which the association is zero.

Effect Size, Sample Size, and Significance. Statistical significance is related to effect size; usually, the stronger a correlation (the larger its effect size), the more likely it will be statistically significant. That’s because the stronger an association is, the more rare it would be in a population in which the association is zero. But we can’t tell whether a particular correlation is statistically significant by looking at its effect size alone. We also have to look for the p values associated with it.

❮❮ For more detail on effect size, sample size, and statistical significance, see Statistics Review: Inferential Statistics, pp. 487–493 and pp. 499–500.

Statistical significance calculations depend not only on effect size but also on sample size. A very small effect size can be statistically significant if it is identified in a very large sample (1,000 or more). For example, in the Cacioppo study on meeting location and marriage satisfaction, the researchers found a very small effect size (r = .06), but it was statistically significant because the sample size was extremely large: more than 20,000. That same small effect size of r = .06 would not have been statistically significant if the study used a small sample (say, 30). A small sample is more easily affected by chance events than a large sample is. Therefore, a weak correlation based on a small sample is more likely to be the result of chance variation and is more likely to be judged “not significant.”
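To see how the same r can be significant in one sample and not in another, here is a minimal sketch that approximates the two-tailed p value for a correlation. It uses Fisher’s z transformation, a common textbook approximation chosen here for convenience; it is not necessarily the calculation used in the studies discussed. The function name approx_p_value is hypothetical, and the sample size of 20,000 is a round stand-in for “more than 20,000.”

```python
import math

def approx_p_value(r, n):
    """Approximate two-tailed p for the hypothesis that the population r is 0.

    Uses Fisher's z: atanh(r) * sqrt(n - 3) is treated as a standard
    normal score, and erfc gives the two-tailed normal probability.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

p_small_sample = approx_p_value(0.06, 30)       # r = .06 in a sample of 30
p_large_sample = approx_p_value(0.06, 20000)    # r = .06 in a very large sample
p_multitasking = approx_p_value(-0.19, 310)     # the multitasking example

for label, p in [("r = .06, n = 30", p_small_sample),
                 ("r = .06, n = 20,000", p_large_sample),
                 ("r = -.19, n = 310", p_multitasking)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: p is about {p:.4g} ({verdict} at .05)")
```

Under this approximation the identical effect size of .06 falls on opposite sides of the .05 cutoff depending only on the number of people in the sample.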

Reading About Significance in Journal Articles. In an empirical journal article, statistically significant associations are recognizable by their p values. Significance information may also be indicated by an asterisk (*), which usually means that an association is significant, or with the word sig, or with a notation such as p < .05 or p < .01. See, for example, Figure 8.9, from the Mehl et al. (2010) journal article. Some of the correlations have asterisks next to them indicating their statistical significance. In contrast, a popular media article usually will not specify whether a correlation is significant or not. The only way to know for sure is to track down the original study.

FIGURE 8.9 Statistical significance in an empirical journal article. This table presents a variety of bivariate correlations. It also presents interrater reliability information for the variables that were coded from the EAR. (The last column shows a multiple-regression analysis; see Chapter 9.) (Source: Mehl et al., 2010.) [Figure annotation: Correlation (r) between happiness and amount of time people were coded as being alone: a moderate, positive correlation. Asterisk directs you to notes below, saying p is less than .05, meaning correlation of .27 is statistically significant.]


An outlier is an extreme score: a single case (or a few cases) that stands out from the pack. Depending on where it sits in relation to the rest of the sample, a single outlier can have an effect on the correlation coefficient r. The two scatterplots in Figure 8.10 show the potential effect of an outlier, a single person who happened to score high on both x and y. Why would a single outlier be a problem? As it turns out, adding that one data point changes the correlation from r = .26 to r = .37. Depending on where the outlier is, it can make a medium-sized correlation appear stronger, or a strong one appear weaker, than it really is.

Outliers can be problematic because even though they are only one or two data points, they may exert disproportionate influence. Think of an association as a seesaw. If you sit close to the center of the seesaw, you don’t have much power to make it move, but if you sit way out on one end, you have a much larger influence on whether it moves. Outliers are like people on the far ends of a seesaw: They can have a large impact on the direction or strength of the correlation.

In a bivariate correlation, outliers are mainly problematic when they involve extreme scores on both of the variables. In evaluating the positive correlation between height and weight, for example, a person who is both extremely tall and extremely heavy would make the r appear stronger; a person who is extremely short and extremely heavy would make the r appear weaker. When interrogating an association claim, it is therefore important to ask whether a sample has any outliers. The best way to find them is to look at the scatterplots and see if one or a few data points stand out.
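The seesaw idea can be illustrated with a short computation. The ten data points below are fabricated for this sketch and do not come from any figure in the chapter; the code computes r for the ten points alone and again after adding one case that is extreme on both variables. The helper pearson_r is an assumption of the sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Fabricated scores for ten people (illustration only).
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [2, 2, 3, 3, 3, 3, 2, 3, 2, 4]

# One extra person who scores far above everyone else on both variables.
outlier_x, outlier_y = 10, 8

r_without = pearson_r(x, y)
r_with = pearson_r(x + [outlier_x], y + [outlier_y])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

Because the sample is small, the single added point pulls the correlation noticeably upward; with hundreds of points in the middle, the same outlier would matter much less.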











FIGURE 8.10 The effects of an outlier. These two scatterplots are identical, except for the outlier in the top-right corner of part A. (A) r = .37. (B) r = .26.


Outliers matter the most when a sample is small (Figure 8.11). If there are 500 points in a scatterplot (a whole bunch of people sitting in the middle of the seesaw), one outlier is not going to have as much impact. But if there are only 12 points in a scatterplot (only a few people in the middle of the seesaw), an outlier has much more influence on the pattern.


In a correlational study, if there is not a full range of scores on one of the variables in the association, it can make the correlation appear smaller than it really is. This situation is known as restriction of range.

To understand the problem, imagine a selective college (College S) that admits only students with high SAT scores. To support this admissions practice, the college might claim that SAT scores are associated with academic success. To support their claim with data, they would use the correlation between SAT scores and first-year college grades. (Those grades are an appropriate measure for such a study because for many students, first-year college courses are similar in content and difficulty.)

Suppose College S plots the correlation between its own students’ SAT scores and first-year college grades, getting the results shown in Figure 8.12. You’ll see that the scatterplot shows a wide cloud of points. It has a positive slope, but it does not appear very strong. In fact, in real analyses of similar data, the correlation between SAT and first-year college grades is about r = .33 (Camara & Echternacht, 2000). As you have learned, such a correlation is considered moderate in strength. Is this the evidence College S was looking for? Maybe not.

Here’s where restriction of range comes in. As you may know, student scores on the SAT currently range from 400 to 1600. But our selective College S admits only students who score 1200 or higher on their SATs, as shown in Figure 8.13A. Therefore, the true range of SAT scores is restricted in College S; it ranges only from 1200 to 1600 out of a possible 400 to 1600.

FIGURE 8.11 Outliers matter most when the sample is small. Again, these two scatterplots are identical except for the outlier. But in this case, removing the outlier changed the correlation from r = .49 to r = .15; this is a much bigger jump than in Figure 8.10, which has more data points.

FIGURE 8.12 Correlation between SAT scores and first-year college grades. College S might observe a scatterplot like this for its enrolled students. (Data are fabricated for illustration purposes.) [Axes: SAT score, from 1200 to 1600, on the x-axis; college grades (GPA) on the y-axis.]

If we assume the pattern in Figure 8.13A continues in a linear fashion, we can see what the scatterplot would look like if the range on SAT scores were not restricted, as shown in Figure 8.13B. The admitted students’ scatterplot points are in exactly the same pattern as they were before, but now we have scatterplot points for the unadmitted students. Compared to the range-restricted correlation in part A, the full sample’s correlation in part B appears much stronger. In other words, the restriction of range situation means College S originally underestimated the true correlation between SAT scores and grades.
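A small simulation can make the College S example concrete. Everything in it is fabricated: it assumes SAT scores are roughly normal with a mean of 1000 and a standard deviation of 200, and that GPA rises linearly with SAT plus random noise (so a few simulated GPAs may fall outside a realistic range). It then compares r in the full applicant pool with r among only those scoring 1200 or higher. This is a demonstration of the problem, not the statistical correction mentioned in the text, and pearson_r is a helper written for this sketch.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)  # fixed seed so the sketch is repeatable

# Fabricated applicant pool: (SAT score, first-year GPA) pairs.
students = []
for _ in range(5000):
    sat = min(1600, max(400, random.gauss(1000, 200)))  # keep within 400-1600
    gpa = 2.5 + 0.0015 * (sat - 1000) + random.gauss(0, 0.43)
    students.append((sat, gpa))

# College S only sees students who scored 1200 or higher.
admitted = [(sat, gpa) for sat, gpa in students if sat >= 1200]

r_full = pearson_r([s for s, _ in students], [g for _, g in students])
r_restricted = pearson_r([s for s, _ in admitted], [g for _, g in admitted])

print(f"full range of SAT scores: r = {r_full:.2f} (n = {len(students)})")
print(f"SAT of 1200 or higher:    r = {r_restricted:.2f} (n = {len(admitted)})")
```

The underlying relationship between SAT and GPA is the same for every simulated student; only the slice of the range that is observed changes, and the correlation in the restricted slice comes out weaker.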


FIGURE 8.13 Restriction of range underestimates the true correlation. (A) College S admits only those students whose SAT scores are above 1200, so its observed correlation between SAT and GPA is about r = .33. (B) If we could include estimates of the scores for students who were not admitted, the correlation between SAT and GPA would be stronger, about r = .57. (Data are fabricated for illustration purposes.) [Axes in both panels: SAT score on the x-axis; college grades (GPA) on the y-axis.]


What do researchers do when they suspect restriction of range? One option is to obtain the true correlation between SAT scores and college grades directly: admit all students to College S, regardless of their SAT scores, see what grades they obtain, and compute the correlation. Of course, College S would not be very keen on that idea. The second option is to use a statistical technique, correction for restriction of range. The formula is beyond the scope of this text, but it estimates the full set of scores based on what we know about an existing, restricted set, and then recomputes the correlation. Actual studies that have corrected for restriction of range have estimated a correlation of r = .57 between SAT scores and college grades, a much stronger association and more convincing evidence for the predictive validity of the SAT.

Restriction of range can apply when, for any reason, one of the variables has very little variance. For example, if researchers were testing the correlation between parental income and child school achievement, they would want to have a sample of parents that included all levels of income. If their sample of parents were entirely upper middle class, there would be restriction of range on parental income, and researchers would underestimate any true correlation. Similarly, to get at the true correlation between multitasking ability and frequency, researchers Sanbonmatsu et al. (2013) would need to include people who do a lot of media multitasking and those who do very little. In addition, the Mehl team (2010) would need to have people who have both a lot of meaningful conversations and very few, as well as people who are very happy and who are less happy.

Because restriction of range makes correlations appear smaller, we would ask about it primarily when the correlation is weak. When restriction of range might be a problem, researchers could either use statistical techniques that let them correct for restriction of range, or, if possible, recruit more people at the ends of the spectrum.


When a study reports that there is no relationship between two variables, the relationship might truly be zero. In rare cases, however, there might be a curvilinear association in which the relationship between two variables is not a straight line; it might be positive up to a point, and then become negative.

A curvilinear association exists, for example, between age and the use of health care services, as shown in Figure 8.14. As people get older, their use of the health care system decreases up to a point. Then, as they approach age 60 and beyond, health care use increases again. However, when we compute a simple bivariate correlation coefficient r on these data, we get only r = −.01 because r is designed to describe the slope of the best-fitting straight line through the scatterplot. When the slope of the scatterplot goes up and then down (or down and then up), r does not describe the pattern very well. The straight line that fits best through this set of points is flat and horizontal, with a slope of zero. Therefore, if we looked only at the r and not at the scatterplot, we might conclude there is no relationship between age and use of health care. When researchers suspect a curvilinear association, the statistically valid way to analyze it is to compute the correlation between one variable and the square of the other.

❯❯ Restriction of range is similar to ceiling and floor effects; see Chapter 11, pp. 333–334.

FIGURE 8.14 A curvilinear association. With increasing age, people’s use of the health care system decreases and then increases again. A curvilinear association is not captured adequately by the simple bivariate correlation coefficient r. In these data, r = .01, a value that does not describe the relationship. (Data are fabricated for illustration purposes.) [Axes: age in years, from 0 to 100, on the x-axis; use of health care system on the y-axis.]
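The following sketch shows why r misses a U-shaped pattern and how squaring one variable recovers it. The data are fabricated and perfectly symmetric, with health care use lowest at age 50, purely to keep the arithmetic transparent. One detail is an assumption of this sketch rather than something the text specifies: age is centered at its mean before it is squared, which is a common way to make the squared term line up with the bottom of the curve.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Fabricated U-shaped data: use of health care is high for the very young
# and the very old, and lowest in the middle of the age range.
ages = list(range(0, 101, 10))
use = [20 + (age - 50) ** 2 / 50 for age in ages]

# Ordinary bivariate correlation: describes the best-fitting straight line.
r_linear = pearson_r(ages, use)

# Correlation between use and the square of (centered) age.
mean_age = sum(ages) / len(ages)
age_squared = [(age - mean_age) ** 2 for age in ages]
r_squared_term = pearson_r(age_squared, use)

print(f"r between age and use:           {r_linear:.2f}")
print(f"r between (age - mean)^2 and use: {r_squared_term:.2f}")
```

The straight-line correlation is essentially zero even though age and use are tightly related, while the correlation with the squared term is very strong. Real data would be noisier, so neither value would be this clean.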

Internal Validity: Can We Make a Causal Inference from an Association?

Even though it’s not necessary to formally interrogate internal validity for an association claim, we must guard against the powerful temptation to make a causal inference from any association claim we read. We hear that couples who meet online have happier marriages, so we advise our single friends to sign up for an online dating site (thinking online dating will make their future marriages happier). In fact, a press release erroneously wrapped the dating study in the causal headline, “Meeting online leads to happier, more enduring marriages” (Harms, 2013; emphasis added). Or we hear that deep talk goes with greater well-being, and we vow to have more substantive conversations. A journalist’s report on the Mehl et al. (2010) finding included this sentence: “Deep conversations made people happier than small talk, one study found.” Oops; the strong verb made turned the claim into a causal one (Rabin, 2010). When we read a correlational result, the temptation to make a causal claim can be irresistible.


Because the causal temptation is so strong, we have to remind ourselves repeatedly that correlation is not causation. Why is a simple association insufficient to establish causality? As discussed in Chapter 3, to establish causation, a study has to satisfy three criteria:

1. Covariance of cause and effect. The results must show a correlation, or association, between the cause variable and the effect variable.

2. Temporal precedence. The cause variable must precede the effect variable; it must come first in time.

3. Internal validity. There must be no plausible alternative explanations for the relationship between the two variables.

The temporal precedence criterion is sometimes called the directionality problem because we don’t know which variable came first. The internal validity criterion is often called the third-variable problem: When we can come up with an alternative explanation for the association between two variables, that alternative is some lurking third variable. Figure 8.15 provides a shorthand description of these three criteria.

Let’s apply these criteria to the deep talk and well-being association, to see whether we can conclude that meaningful conversations cause an increase in well-being:

1. Covariance of cause and effect. From the study’s results, we already know deep talk is associated positively with well-being. As the percentage of deep talk goes up, well-being goes up, thus showing covariance of the proposed cause and the proposed effect.

2. Temporal precedence. The study’s method meant that deep talk and well-being were measured during the same, short time period, so we cannot be sure whether people used deep talk first, followed by an increase in well-being, or whether people were happy first and later engaged in more meaningful conversations.

3. Internal validity. The association between deep talk and well-being could be attributable to some third variable connected to both deep talk and well-being. For instance, a busy, stressful life might lead people to report lower well-being and have less time for substantive talks. Or perhaps in this college sample, having a strong college-preparatory background is associated with both deep conversations and having higher levels of well-being in college (because those students are better prepared). But be careful—not any third variable will do. The third variable, to be plausible, must correlate logically with both of the measured variables in the original association. For example, we might propose that income is an alternative explanation, arguing that people with higher incomes will have greater well-being. For income to work as a plausible third variable, though, we would have to explain how higher income is related to more deep talk, too.

As you can see, the bivariate correlation between well-being and deep talk doesn’t let us make the causal claim that high levels of substantive conversation cause high levels of well-being. We also cannot make a causal claim in the other direction: that high levels of well-being cause people to engage in more deep conversations. Although the two variables are associated, the study has met only one of the three causal criteria: covariance. Further research using a different kind of study would be needed to establish temporal precedence and internal validity before we would accept this relationship as causal.

FIGURE 8.15 The three criteria for establishing causation. When variable A is correlated with variable B, does that mean A causes B? To decide, apply the three criteria.

1. Covariance: Do the results show that the variables are correlated?

2. Temporal precedence (directionality problem): Does the method establish which variable came first in time? (If we cannot tell which came first, we cannot infer causation.)

3. Internal validity (third-variable problem): Is there a C variable that is associated with both A and B, independently? (If there is a plausible third variable, we cannot infer causation.)

What about the press release stating that meeting one’s spouse online is associated with a happier marriage? Consider whether this finding justifies the headline, “Meeting online leads to happier, more enduring marriages” (Harms, 2013). Let’s see how this study stands up to the three causal criteria:

1. Covariation of cause and effect. The study reported an association between meeting online and greater marital satisfaction. As discussed earlier, the association in the original study was very weak, but it was statistically significant.

2. Temporal precedence. We can be sure that the meeting location variable came first and marital satisfaction came later. People usually do have to meet somebody (either online or offline) before getting married!

3. Internal validity. This criterion is not met by the study. It is possible that certain types of people are more likely to both meet people online and be happier in their marriages. For example, people who are especially motivated to be in a relationship may be more likely to sign up for, and meet their spouses on, online dating sites. And these same relationship-motivated people may be especially prepared to feel happy in their marriages.

In this case, the two variables are associated, so the study has established covariance, and the temporal precedence criterion has also been satisfied. However, the study does not establish internal validity, so we cannot make a causal inference.


When we think of a reasonable third variable explanation for an association claim, how do we know if it’s an internal validity problem? In the Mehl study (2010) about deep talk and well-being, we thought level of education might be a third variable that explains this association. As mentioned earlier, it could be that better-educated people are both happier and have more meaningful conversations, and that’s why deep talk is correlated with well-being. Educational level makes a reasonable third variable here because well-educated people might have more substantive conversations, and research also shows that more educated people tend to be happier. But is education really responsible for the relationship the Mehl team found? We have to dig deeper.

What would it look like if education really was the third variable responsible for the correlation between deep talk and well-being? We can use a scatterplot to illustrate. Looking at Figure 8.16 overall, we see a moderate, positive relationship between deep talk (substantive conversations) and happiness, just as we know exists. But let’s think about separating people who are more and less educated into two subgroups. In the graph, the more-educated people are represented by green dots and the less-educated by blue dots. The more-educated (green dots) are generally higher on both happiness and substantive conversations, and the less-educated (blue dots) are generally lower on both variables.

Furthermore, if we study the scatterplot pattern of just the green dots, we see that within the subgroup of well-educated people, there is no positive relationship between deep talk and happiness. The cloud of green dots is spread out and has no positive slope at all. Similarly, if we study the pattern of just the blue dots, the same thing occurs; less-educated people are lower on both happiness and deep talk, and within this subgroup, the cloud of blue dots shows no positive relationship between the two.

The outcome shown in Figure 8.16 means that deep talk and happiness are correlated only because well-educated people are higher on both of these variables. In other words, education presents a third variable problem. In such situations, the original relationship is referred to as a spurious association; the bivariate correlation is there, but only because of some third variable.











[Scatterplot. x-axis: Well-being scale (z score); y-axis: Percentage of substantive conversations.]

FIGURE 8.16 A third variable, education: an internal validity problem. More-educated people (green dots) are both higher in happiness and higher in substantive conversations (deep talk), compared to less-educated people (blue dots). Within each educational level, there is no relationship between these two variables. This outcome means level of education is really the reason deep talk and happiness are correlated. (Data are fabricated for illustration purposes.)


Other times, proposed third variables come out differently. Suppose we suspect the relationship between substantive conversations and well-being is attributable to the third variable of leisure time. We might think that people with more leisure time have greater well-being (they’re more relaxed) and also have more meaningful conversations (they have more time for thinking). In Figure 8.17, the orange dots (more leisure time) are higher on both happiness and substantive conversations and the dark blue dots (less leisure time) are lower on both variables. But this time, when we study just the orange dots, we see that within this subgroup of people, there is still a positive association between deep talk and happiness; the cloud of orange dots still has a positive slope. Similarly, within the dark blue dots alone, there is an overall positive relationship between deep talk and happiness. Therefore, the situation in Figure 8.17 indicates that although we thought amount of leisure time might be a third variable explanation for Mehl’s result, a closer look at the data indicates we were wrong: Deep talk and well-being are still correlated within subgroups of people with more and less leisure time.

When we propose a third variable that could explain a bivariate correlation, it’s not necessarily going to present an internal validity problem. Instead, it’s a reason to dig deeper and ask more questions. We can ask the researchers if their bivariate correlation is still present within potential subgroups.
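The subgroup check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up (nine hypothetical people per subgroup, laid out on a small grid), and the helper names `pearson_r`, `make_group`, and `check_third_variable` are invented for this sketch rather than taken from any study. The idea is simply to compare the overall correlation with the correlation inside each subgroup.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def make_group(talk_center, wellbeing_center, within_slope):
    """Nine made-up people on a 3 x 3 grid around a subgroup's center.

    within_slope sets how much well-being rises with deep talk inside
    the subgroup; 0 means no relationship within the subgroup.
    """
    people = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            deep_talk = talk_center + 5 * dx
            wellbeing = wellbeing_center + within_slope * dx + 0.25 * dy
            people.append((deep_talk, wellbeing))
    return people


def check_third_variable(low_group, high_group):
    """Return (overall r, r within low group, r within high group)."""
    def r_of(people):
        return pearson_r([p[0] for p in people], [p[1] for p in people])
    return r_of(low_group + high_group), r_of(low_group), r_of(high_group)


# Pattern like Figure 8.16: subgroups differ, but no slope inside either one.
spurious = check_third_variable(make_group(20, -1.0, 0.0),
                                make_group(40, 1.0, 0.0))

# Pattern like Figure 8.17: subgroups differ AND the slope persists inside each.
robust = check_third_variable(make_group(20, -1.0, 0.5),
                              make_group(40, 1.0, 0.5))

print("spurious pattern:", [round(r, 2) for r in spurious])
print("robust pattern:  ", [round(r, 2) for r in robust])
```

In both made-up data sets the overall correlation is strongly positive. Only the within-subgroup correlations tell the two situations apart: they drop to zero when the third variable is responsible for the association and stay positive when it is not.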

[Scatterplot. x-axis: Well-being scale (z score); y-axis: Percentage of substantive conversations.]

FIGURE 8.17 A third variable, leisure time: not an internal validity problem. People with more leisure time (orange dots) are both higher in happiness and higher in substantive conversations (deep talk), compared to people with less leisure time (dark blue dots). However, within each leisure time group there is still a positive relationship between deep talk and happiness. This outcome means that amount of leisure time does not actually pose an internal validity problem; deep talk and happiness are still correlated even within the two leisure subgroups. (Data are fabricated for illustration purposes.)

❮❮ For more on subgroups and third variables, see Chapter 9, pp. 244–247.


In sum, when we’re interrogating a simple association claim, it is not necessary to focus on internal validity as long as it’s just that: an association claim. However, we must keep reminding ourselves that covariance satisfies only the first of the three criteria for causation. Before assuming that an association suggests a cause, we have to apply what we know about temporal precedence and internal validity.

External Validity: To Whom Can the Association Be Generalized?

When interrogating the external validity of an association claim, you ask whether the association can generalize to other people, places, and times. For example, consider again the association between media multitasking frequency and ability. To interrogate the external validity of this association, the first questions would be who the participants were and how they were selected. If you check the original article (Sanbonmatsu et al., 2013), you’ll find that the sample consisted of 310 undergraduates: 176 women and 134 men.

As you interrogate external validity, recall that the size of the sample does not matter as much as the way the sample was selected from the population of interest. Therefore, you would next ask whether the 310 students in the sample were selected using random sampling. If that was the case, you could then generalize the association from these 310 students to their population—college students at the University of Utah. If the students were not chosen by a random sample of the population of interest, you could not be sure the sample’s results would generalize to that population.

As it turns out, the Sanbonmatsu team does not say in their article whether the 310 students were a random sample of University of Utah students. And of course, because the college students in this sample were from only Utah, we don’t know if the study results will generalize to other college students in other areas of the country. Finally, because the sample consisted entirely of college students, the association may not generalize to nonstudents and older people. The external validity of the Sanbonmatsu study is unknown.


What should you conclude when a study does not use a random sample? Is it fair to disregard the entire study? In the case of the Sanbonmatsu study, the construct validity is excellent; the measures of multitasking frequency and ability have been used in other studies and have been shown to be valid and reliable measures of these concepts. In terms of statistical validity, the correlation is statistically significant, and the effect size is moderate. The sample was large enough to avoid the influence of outliers, and there did not seem to be a curvilinear association or restriction of range. The researchers did not make any causal claims that would render internal validity relevant. In most respects, this association claim stands up; it lacks only external validity.

A bivariate correlational study may not have used a random sample, but you should not automatically reject the association for that reason. Instead, you can accept the study’s results and leave the question of generalization to the next study, which might test the association between these two variables in some other population.

❯❯ For more on sampling techniques, see Chapter 7, pp. 183–192.

Furthermore, many associations do generalize—even to samples that are very different from the original one. Imagine a study of college students in the U.S. that found men to be taller than women. Would this finding generalize to people in the Netherlands, who are, overall, taller than Americans? Most likely it would: We’d still find the same association between sex and height because Dutch men are taller than Dutch women.

Similarly, you might think the multitasking result would not generalize to older adults, ages 70–80, because you assume they are less likely to multitask with many forms of media, and perhaps they’re less capable of multitasking, compared to a younger, college-aged population. You would probably be right about these mean (average) differences between the samples. However, within a sample of people in the 70–80 age range, those who do tend to multitask the most may still be the ones who are the worst at it. The new sample of people might score lower, on average, on both variables in the association claim, but even so, the association might still hold true within that new sample. In a scatterplot that includes both these samples, the association holds true within each subgroup (Figure 8.18).









































[Scatterplot. x-axis: % time spent multitasking; y-axis: Multitasking test score (OSPAN).]

FIGURE 8.18 An association in two different samples. Older adults (represented by O) might engage in less multitasking than college students (Y), and they might perform worse on multitasking tests, such as OSPAN. But the association between the two variables within each sample of people may exist. (Data are fabricated for illustration purposes.)



When the relationship between two variables changes depending on the level of another variable, that other variable is called a moderator. Let’s consider a study on the correlation between professional sports games attendance and the success of the team. You might expect to see a positive correlation: The more the team wins, the more people will attend the games. However, research by Shige Oishi and his colleagues shows that this association is moderated by the franchise city’s level of residential mobility (Oishi et al., 2007). In cities with high residential mobility (such as Phoenix, Arizona), people move in and out of the city frequently. Because these residents don’t develop ties to their community, the researchers theorized, they would be “fair weather” fans whose interest in the local team depends on whether the team wins. In contrast, in cities with low residential mobility (such as Pittsburgh, Pennsylvania), people live in the city for a long time. They develop strong community ties and are loyal to their sports team even when it loses.

Using data gathered over many major league baseball seasons, Oishi and his team determined that in cities with high residential mobility, there is a positive correlation between success and attendance, indicating a fair-weather fan base. In cities with low residential mobility, there is not a significant correlation between success and attendance. We say that the degree of residential mobility moderates the relationship between success and attendance (Table 8.5 and Figure 8.19).

When we identify a moderator, we are not saying Arizona’s team wins more than Pittsburgh’s, or that Arizona’s games are better attended than Pittsburgh’s. Instead, the relationship differs: When the Arizona Diamondbacks lose, people are less likely to attend games, but when the Pittsburgh Pirates lose, people still attend.
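To make the idea of a moderator concrete, here is a minimal Python sketch that computes the success-attendance correlation separately at each level of the moderator. The season records are hypothetical numbers invented for illustration (they are not the Oishi et al. data), and `pearson_r` and `correlation_by_level` are helper names assumed for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_by_level(rows):
    """Group (level, x, y) rows by moderator level and return {level: r}."""
    grouped = {}
    for level, x, y in rows:
        xs, ys = grouped.setdefault(level, ([], []))
        xs.append(x)
        ys.append(y)
    return {level: pearson_r(xs, ys) for level, (xs, ys) in grouped.items()}


# Hypothetical seasons: (residential mobility, percentage of wins, attendance).
seasons = [
    ("high mobility", 40, 22000),
    ("high mobility", 45, 25000),
    ("high mobility", 50, 27000),
    ("high mobility", 55, 31000),
    ("high mobility", 60, 34000),
    ("low mobility", 41, 20000),
    ("low mobility", 44, 21000),
    ("low mobility", 47, 20500),
    ("low mobility", 50, 21000),
    ("low mobility", 53, 20000),
]

by_mobility = correlation_by_level(seasons)
for level, r in by_mobility.items():
    print(f"{level}: r = {r:.2f}")
```

The sketch compares the size of r across levels, not the average attendance or the average winning percentage. A moderator is indicated when the correlations themselves differ from one level to the next.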

For another example, consider a study introduced in Chapter 3, in which people with higher incomes were found to spend less time socializing (Bianchi & Vohs, 2016). In follow-up analyses, Bianchi and Vohs separated people’s socializing into categories: friends, relatives, and neighbors. They found that income was negatively associated with time spent socializing with relatives and neighbors, but income was positively associated with time spent socializing with friends. Therefore, the variable “type of relationship” moderated the association between number of hours spent socializing and income, such that the relationship was positive for friends, and negative for relatives and neighbors.


TABLE 8.5 A City’s Residential Mobility Moderates the Relationship Between Sports Team Success and Game Attendance


Phoenix, AZ (high residential mobility)


Pittsburgh, PA (low residential mobility)


Note: *p < .05; correlation is statistically significant. Source: Adapted from Oishi et al., 2007.


Finally, Mehl et al. (2010) looked for moderators in the relationship they found between deep talk and well-being. They wondered if the relationship would differ depending on whether substantive conversations took place on a weekend or a weekday. However, the results suggested that weekend/weekday status did not moderate the relationship between deep talk and well-being: The relationship was positive and of equal strength in both time periods (Table 8.6; see also Figure 8.9).

In correlational research, moderators can inform external validity. When an association is moderated by residential mobility, type of relationship, day of the week, or some other variable, we know it does not generalize from one of these situations to the others. For example, in asking whether the association between multitasking frequency and ability would generalize to 70–80-year-olds, you were asking whether that association would be moderated by age. Similarly, the Mehl team found that the association between deep talk and well-being does generalize well from the weekends to weekdays: The strength of the association is almost the same in the two contexts.


TABLE 8.6 Weekend/Weekday Status Does Not Moderate the Relationship Between Deep Talk and Well-Being


Weekday .28*

Weekend .27*

Note: Substantive conversations are associated with happiness on both weekdays and weekends. *p < .05; result is statistically significant. Source: Adapted from Mehl et al., 2006.





[Two scatterplots, one per team. x-axis in each panel: Percentage of wins.]

FIGURE 8.19 A moderating variable. A city’s degree of residential mobility moderates the relationship between local sports team success and attendance at games. Each dot represents one major league baseball season. The Arizona Diamondbacks are based in Phoenix, a high residential mobility city; that pattern shows people are more likely to attend games there when the team is having a winning season. Pittsburgh is a low residential mobility city; that pattern shows Pittsburgh Pirates fans attend games regardless of how winning the season is. (Source: Adapted from Oishi et al., 2007.)

❮❮ In Chapter 12, you will learn that another way of understanding moderators is to describe them as interactions; see p. 358.


Review what you’ve learned in this chapter by studying the Working It Through section.


1. In one or two brief sentences, explain how you would interrogate the construct validity of a bivariate correlation.

2. What are five questions you can ask about the statistical validity of a bivariate correlation? Do all the statistical validity questions apply the same way when bivariate correlations are represented as bar graphs?

3. Which of the three rules of causation is almost always met by a bivariate correlation? Which two rules might not be met by a correlational study?

4. Give examples of some questions you can ask to evaluate the external validity of a correlational study.

5. If we found that gender moderates the relationship between deep talk and well-being, what might that mean?

1. See pp. 210–211. 2. See pp. 211–221; questions about outliers and curvilinear associations may not be relevant for correlations represented as bar graphs. 3. See pp. 221–223. 4. See pp. 226–227. 5. It would mean the relationship between deep talk and well-being is different for men than for women. For example, the relationship might be stronger for women than it is for men.


Are Parents Happier Than People with No Children?

Some researchers have found that people with children are less happy than people who don’t have kids. In contrast, a popular media story reports on a study in which parents are, in fact, happier. We will work through this example of a bivariate correlational study to illustrate the concepts from Chapter 8.



What kind of claim is being made in the journalist’s headline?

What are the two variables in the headline?

Association claims can be supported by correlational studies. Why can we assume the study was correlational?

“Parents are happier than non-parents” (Welsh, 2012).

The journalist reported: “Parents may not be the overtired, overworked and all-around miserable individuals they are sometimes made out to be, suggests new research finding Mom and Dad (particularly fathers) experience greater levels of happiness and meaning from life than nonparents” (Welsh, 2012).

The simple verb “are” made this an association claim. Parenting goes with being happier.

The two variables are “being a parent or not” and “level of happiness.”

We can assume this is a correlational study because parenting and level of happiness are probably measured variables (it’s not really possible to manipulate them). In a correlational study all variables are measured.

The journal article contains three studies on parenting and happiness, and we’ll focus on the second study (Nelson, Kushlev, English, Dunn, & Lyubomirsky, 2013).

Are the variables categorical or quantitative?

The Method section described how the researchers distributed pagers to 329 adults, about half of whom were parents. They reported on their happiness (on a 7-point scale) at different times of the day.

One variable is categorical; people were either parents or not.

The other variable, happiness, was quantitative because people could range from low to high.






What were the results?

[Bar graph comparing parents and nonparents.]


Because one variable was categorical and the other was quantitative, the researchers presented the results as a bar graph. Parents had a higher average well-being than nonparents.

Construct Validity How well was each variable measured?

To measure parenting status, the researchers simply asked people if they were parents.

To measure happiness, participants wore devices that paged them five times daily. At each page, the participant rated 8 positive emotions (such as pride, joy) and 11 negative emotions (such as anger, guilt). The researchers subtracted the ratings of negative emotions from positive ones at each time point, and averaged across each person’s 35 reports for a week.

It seems unlikely that people would lie about their parental status, so we can assume a self-report was a valid measure of parenting.

It seems reasonable that people who have positive emotions throughout the day are happier, so this operationalization of happiness has face validity. Notably, happiness was based on an “in-the-moment” response, not a global judgment, so it may be more accurate. However, the researchers don’t present any criterion validity or convergent validity evidence for the measure.

Statistical Validity What is the effect size? Is the difference statistically significant?

The researchers reported: “As in Study 1, we first examined the relationship between parenthood and happiness with t tests. Parents reported higher levels of global well-being, including more happiness, t(325) = 2.68, p = .008, r = .15…” (Nelson et al., 2013, p. 7).

The authors report an effect size of r = .15, and p = .008. Since p is less than .05, we can conclude the difference in happiness between parents and nonparents is statistically significant, and the effect size is small.

Are there outliers?

Is the association curvilinear?

Could there be restriction of range?

The researchers don’t mention outliers, but in a sample this large, outliers are unlikely to have an impact on the overall pattern.

Parenting is a two-category variable. Without some middle category we can’t have a curvilinear association.

We look for restriction of range when the results are smaller than expected. Perhaps because these researchers found a significant effect, they did not test for restriction of range.

Internal Validity Can the study possibly support the claim that parenthood causes people to be happy?

We can’t support a causal claim. The results show covariance, but temporal precedence is not present because happiness and parenting were measured at the same time. The correlational method could not rule out third variables. For example, parents are more likely to be married, and married people may also be happier.

External Validity To whom can we generalize this association?

The article reports that the sample of 329 adults came from “a study on emotional experience in adulthood” (p. 4).

The authors did not state that the sample was selected at random, so we do not know to whom this association can generalize.
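The statistical validity entry above quotes both a t test, t(325) = 2.68, and an effect size, r = .15. As a rough check on how the two relate, the following Python sketch applies the standard conversion from a two-group t statistic to r, namely r = sqrt(t² / (t² + df)). The function name `r_from_t` is an assumption made for this illustration, and the sketch does not claim to reproduce how the original authors computed their effect size.

```python
from math import sqrt


def r_from_t(t, df):
    """Convert a two-group t statistic and its degrees of freedom to r.

    Uses r = sqrt(t^2 / (t^2 + df)). The square root discards the sign,
    so the direction of the difference must be read from the group means.
    """
    return sqrt(t ** 2 / (t ** 2 + df))


# Values quoted from Nelson et al. (2013): t(325) = 2.68.
effect_size = r_from_t(2.68, 325)
print(f"r = {effect_size:.2f}")  # about .15, matching the reported value
```

Because the conversion depends on the degrees of freedom, the same t value would correspond to a larger r in a smaller sample, which is one reason to look at effect size and statistical significance separately.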

Summary

• Association claims state that two variables are linked, but they do not state that one causes the other.

• Association claims are supported by correlational research, in which both variables are measured in a set of participants. (If either of the variables is manipulated, the study is an experiment, which could potentially support a causal claim.)

Introducing Bivariate Correlations

• The variables in a bivariate correlational study can be either quantitative or categorical. If both variables are quantitative, the data are usually depicted in a scatterplot; if one variable is categorical, the data are usually depicted in a bar graph.

• For a scatterplot, the correlation coefficient r can be used to describe the relationship. For a bar graph, the difference between the two group means is used to describe the relationship.

• Regardless of whether an association is analyzed with scatterplots or bar graphs, if both variables are measured, the study is correlational.

Interrogating Association Claims

• Because a correlational study involves two measured variables, the construct validity of each measure must be interrogated in a bivariate correlation study.

• Interrogating the statistical validity of an association claim involves five areas of inquiry: effect size (strength of r), statistical significance, the presence of outliers, possible restriction of range, and whether the association is curvilinear.

• Internal validity addresses the degree to which a study supports a causal claim. Although it is not necessary to interrogate internal validity for an association claim because it does not make a causal statement, it can be tempting to assume causality from a correlational study.

• Correlational studies do not satisfy all three criteria for a causal claim: They may show covariance but do not usually satisfy temporal precedence, and they cannot establish internal validity.

• Interrogating the external validity of an association claim involves asking whether the sample is representative of some population. If a correlational study does not use a random sample of people or contexts, the results cannot necessarily generalize to the population from which the sample was taken.

• A lack of external validity should not disqualify an entire study. If the study fulfills the other three validities and its results are sound, the question of generalizability can be left for a future investigation.

• A bivariate correlation is sometimes moderated, which means the relationship changes depending on the levels of another variable, such as gender, age, or location.



Key Terms

bivariate correlation, p. 204 mean, p. 208 t test, p. 209 effect size, p. 211

statistical significance, p. 214 outlier, p. 217 restriction of range, p. 218 curvilinear association, p. 220

directionality problem, p. 221 third-variable problem, p. 222 spurious association, p. 224 moderator, p. 228

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 8.

Review Questions

1. Suppose you hear that conscientious people are more likely to get regular health checkups. Which of the following correlations between conscientiousness and getting checkups would probably support this claim?

a. r = .03

b. r = .45

c. r = −.35

d. r = −1.0

2. Which of these associations will probably be plotted as a bar graph rather than a scatterplot?

a. The more conscientious people are, the greater the likelihood they’ll get regular health checkups.

b. Level of depression is linked to the amount of exercise people get.

c. Students at private colleges get higher GPAs than those at public colleges.

d. Level of chronic stomach pain in kids is linked to later anxiety as adults.

3. A study found that people who like spicy foods are generally risk takers. Which of the following questions interrogates the construct validity of this correlation?

a. Is the result statistically significant?

b. Did the study use a random sample of people?

c. Were there any outliers in the relationship?

d. How well did they measure each variable, risk taking and liking spicy foods?

4. Darrin reads a story reporting that students at private colleges get higher GPAs than those at public colleges. He wonders if this means going to a private college causes you to have a higher GPA; if so, he’ll go to a private college! Applying the three causal criteria, Darrin knows there is covariance here. He also knows there is temporal precedence because you choose a college first, and then you get your GPA. Which of the following questions would help Darrin ask about the third criterion, internal validity?

a. Could there be restriction of range?

b. Is the link between private college and high grades the same for both men and women?

c. How did they decide what qualifies a college as private or public?

d. Is there some other reason these two are related? Maybe better students are more likely to go to private colleges, and they are also going to get better grades.

5. Which of the following sentences describes a moderator for the relationship between risk taking and liking spicy foods?

a. There is a positive relationship between liking spicy foods and risk taking for men, but no relationship for women.

b. Older adults tend to like spicy foods less than younger adults.

c. The relationship between liking spicy foods and risk taking is the same for people in cities and in rural areas.

Learning Actively

1. For each of the following examples, sketch a graph of the result (either a bar graph or a scatterplot). Then, interrogate the construct validity, the statistical validity, and the external validity of each association claim. What questions would you ask? What answers would you expect?

a. “Chronic stomach pain in kids is linked to adult anxiety disorders in later life.” In this study, the researchers “followed 332 children between the ages of 8 and 17 who were diagnosed with functional abdominal pain and 147 with no pain for an average of eight years. . . . On follow-up, the researchers interviewed the volunteers—who were on average age 20 at that point—either in person or by phone. . . . Of adults who had abdominal pain as children, 51 percent had experienced an anxiety disorder during their lives, compared to 20 percent of those who didn’t experience tummy aches as children” (Carroll, 2013).

b. “Kids with ADHD may be more likely to bully.” In this study, the researchers “followed 577 children—the entire population of fourth graders from a municipality near Stockholm—for a year. The researchers interviewed parents, teachers and children to determine which kids were likely to have ADHD. Children showing signs of the disorder were then seen by a child neurologist for diagnosis. The researchers also asked the kids about bullying. [The study found that] children with attention deficit hyperactivity disorder are almost four times as likely as others to be bullies” (Carroll, 2008).

2. A researcher conducted a study of 34 scientists (Grim, 2008). He reported a correlation between the amount of beer each scientist drank per year and the likelihood of that scientist publishing a scientific paper. The correlation was reported as r = −.55, p < .01.

a. What does a negative correlation mean in this example? Is this relationship strong or weak?

b. What does p < .01 mean in this result?

c. Draw a scatterplot of this association. What might happen to this correlation if you added one person in the sample who drank much more beer than other scientists and also published far fewer papers than other scientists?

d. A popular media report about this article was headlined, “Suds seem to skew scientific success” (San Diego Union-Tribune, 2008). Is such a causal claim justified?

e. Perhaps gender is a moderator of this relationship. Create a moderator table, using Table 8.5 as a model, showing that the association between beer drinking and publications is stronger for female scientists (perhaps because alcohol affects women more strongly) than for male scientists.


The Origins of Narcissism: Children More Likely to Be Self-Centered If They Are Praised Too Much Independent, 2015

Study Links Teen Pregnancy to Sex on TV Shows, 2008


Multivariate Correlational Research

CORRELATIONAL STUDIES CAN PROVIDE interesting new information in their own right. The opening headlines provide examples. It might interest us to read that children who are praised too much are also self-centered or narcissistic. We might be surprised to learn that watching sex on TV shows is linked to teen pregnancy. Often, however, a correlational result is an early step in establishing a causal relationship between two variables. Psychological scientists (among many others) want to know about causes and effects, not just correlations, because they may suggest treatments. If praise is linked to narcissism, we might wonder whether or not the praise makes kids narcissistic. If it does, parents might change how they express approval. When reading that sexual content on TV is linked to teenage pregnancy, we may wonder whether watching sexual material causes behavior that leads to pregnancy. If it does, then pediatricians, teachers, or advocacy groups could argue for restricting teens’ exposure to certain kinds of TV shows. However, if the relationships are not causal, such interventions would not work.

Because correlation is not causation, what are the options? Researchers have developed some techniques that enable them to test for cause. The best of these is experimentation: Instead of measuring both variables, researchers manipulate one variable and measure the other. (Experimental designs are covered in Chapters 10–12.) Even without setting up an experiment, however, researchers can use some advanced correlational techniques to get a bit closer to making a causal claim. This chapter outlines three such techniques: longitudinal designs, which allow researchers to establish temporal precedence in their data; multiple-regression analyses, which help researchers rule out certain third-variable explanations; and the “pattern and parsimony” approach, in which the results of a variety of correlational studies all support a single, causal theory. In the three techniques, as in all correlational studies, the variables are measured; that is, none are manipulated.

A year from now, you should still be able to:

1. State why simple bivariate correlations are not sufficient for establishing causation.

2. Explain how longitudinal correlational designs can establish temporal precedence.

3. Explain how multiple-regression analyses help address internal validity (the third-variable problem).

4. Describe the value of pattern and parsimony, in which a variety of research results support a single, parsimonious causal theory.

5. Explain the function of a mediating variable.

REVIEWING THE THREE CAUSAL CRITERIA

Unlike the bivariate examples in Chapter 8, which involved only two measured variables, longitudinal designs, multiple-regression designs, and the pattern and parsimony approach are multivariate designs, involving more than two measured variables. While these techniques are not perfect solutions to the causality conundrum, they are extremely useful and widely used tools, especially when experiments are impossible to run.

Remember that the three criteria for establishing causation are covariance, temporal precedence, and internal validity. We might apply these criteria to correlational research on the association between parental praise and narcissism.

In the research you’ll read about in this chapter, narcissism is studied as a personality trait in which people feel superior to others, believe they deserve special treatment, and respond strongly when others put them down; narcissists are difficult relationship partners. Parental overpraise, the other variable discussed in this example, is telling kids they are exceptional or more special than other children. It’s important to note that childhood narcissism is different from high self-esteem (a trait that is considered healthy). Similarly, overpraising is different from parents expressing warmth and love for their children.

Let’s examine the three criteria:

1. Is there covariance? One study did find covariance (Otway & Vignoles, 2006). Adults who were narcissistic remembered their parents praising them for almost everything they did. The correlation was weak, but statistically significant (r values around .20).

2. Is there temporal precedence? A correlational study like Otway and Vignoles’ does not establish temporal precedence. Adults reflected on their parents’ behavior during childhood, so their current self-views could have colored their recall of the past. It’s not clear which variable came first in time.

3. Is there internal validity? The association between parental praise and child narcissism might be explained by a third variable. Perhaps parents praise boys more than girls, and boys are also more likely to have narcissistic traits. Or perhaps parents who are themselves narcissistic simply overpraise their children and, independently, tend to be mimicked by their kids.


ESTABLISHING TEMPORAL PRECEDENCE WITH LONGITUDINAL DESIGNS

A longitudinal design can provide evidence for temporal precedence by measuring the same variables in the same people at several points in time. Longitudinal research is used in developmental psychology to study changes in a trait or an ability as a person grows older. In addition, this type of design can be adapted to test causal claims.

Researchers conducted such a study on a sample of 565 children and their mothers and fathers living in the Netherlands (Brummelman et al., 2015). The parents and children were contacted four times, every 6 months. Each time, the children completed questionnaires in school, responding to items about narcissism (e.g., “Kids like me deserve something extra”). Parents also completed questionnaires about overpraising their children, which was referred to in the study as overvaluation (e.g., “My child is more special than other children”).

This study was longitudinal because the researchers measured the same variables in the same group of people across time—every 6 months. It is also a multivariate correlational study because eight variables were considered: child narcissism at Time 1, 2, 3, and 4, and parental overvaluation at Time 1, 2, 3, and 4.

Interpreting Results from Longitudinal Designs

Because there are more than two variables involved, a multivariate design gives several individual correlations, referred to as cross-sectional correlations, autocorrelations, and cross-lag correlations. The Brummelman researchers conducted their analyses on mothers’ and fathers’ overvaluation separately, in order to investigate the causal paths for each parent separately. We present the results for mothers here, but the results are similar for fathers.


1. Why can’t a simple bivariate correlational study meet all three criteria for establishing causation?

1. See p. 238.

The first set of correlations consists of cross-sectional correlations; they test whether two variables, measured at the same point in time, are correlated. For example, the study reports that the correlation between mothers’ overvaluation at Time 4 and children’s narcissism at Time 4 was r = .099. This is a weak correlation, but consistent with the hypothesis. However, because both variables in a cross-sectional correlation were measured at the same time, this result alone cannot establish temporal precedence. Either one of these variables might have led to changes in the other. Figure 9.1 depicts how this study was designed and shows all of the cross-sectional correlations.


The next step was to evaluate the associations of each variable with itself across time. For example, the Brummelman team asked whether mothers’ overvaluation at Time 1 was associated with mothers’ overvaluation at Time 2, Time 3, and so on; they also asked whether children’s narcissism at Time 1 was associated with their scores at Time 2, Time 3, and so on. Such correlations are sometimes called autocorrelations because they determine the correlation of one variable with itself, measured on two different occasions. The results in Figure 9.2 suggest that both overvaluation and narcissism are fairly consistent over time.

FIGURE 9.1 Cross-sectional correlations. Look at the correlations of the variables when measured at the same time. Within each time period, the mothers’ overvaluation is weakly associated with child narcissism. Notice that the arrows point in both directions because in these cross-sectional correlations, the two variables were measured at the same time. The figure shows zero-order (bivariate) correlations. (Source: Adapted from Brummelman et al., 2015.)
[Diagram: overvaluation and narcissism at Times 1 to 4. Correlation between the two variables at each time point: Time 1, r = .007; Time 2, r = .070; Time 3, r = .138; Time 4, r = .099.]

FIGURE 9.2 Autocorrelation. In a longitudinal study, researchers also investigate the autocorrelations. These results indicate that both variables seem to be relatively stable over time. Notice that the arrows point in only one direction because the Time 1 measurements came before the Time 2 measurements. Overvaluation values are based on mothers’ results. (Source: Adapted from Brummelman et al., 2015.)
[Diagram: overvaluation and narcissism at Times 1 to 4, with arrows linking each variable to itself at the next time point. Path values as printed: .695, .603, .608; .500, .489, .520.]



So far so good. However, cross-sectional correlations and autocorrelations are generally not the researchers’ primary interest. They are usually most interested in cross-lag correlations, which show whether the earlier measure of one variable is associated with the later measure of the other variable. Cross-lag correlations thus address the directionality problem and help establish temporal precedence.

In the Brummelman study, the cross-lag correlations show how strongly mothers’ overvaluation at Time 1 is correlated with child narcissism later on, compared to how strongly child narcissism at Time 1 is correlated with mothers’ overvaluation later on. By inspecting the cross-lag correlations in a longitudinal design, we can investigate how one variable correlates with another one over time, and therefore establish temporal precedence. In Brummelman’s results, only one set of the cross-lag correlations was statistically significant; the other set was not significant (Figure 9.3). Mothers who overvalued their children at one time period had children who were higher in narcissism 6 months later. In contrast, children who were higher in narcissism at a particular time period did not have mothers who overvalued them 6 months later. Because the “overvaluation to narcissism” correlations are significant and the “narcissism to overvaluation” correlations are not, this suggests the overvaluation, not the narcissism, came first.
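The three kinds of correlations can be made concrete with a small simulation. The sketch below is an illustration only and is not the Brummelman analysis: it uses two waves instead of four, and the sample size, random seed, and effect sizes (0.7, 0.5, 0.3) are invented so that earlier overvaluation feeds later narcissism but not the reverse. The helper `pearson_r` and all variable names are hypothetical.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 2000                 # invented sample size

# Time 1: the two variables start out unrelated (an assumption of this sketch).
over_t1 = [rng.gauss(0, 1) for _ in range(n)]
narc_t1 = [rng.gauss(0, 1) for _ in range(n)]

# Time 2: each variable is partly stable; earlier overvaluation also
# contributes to later narcissism, but not the other way around.
over_t2 = [0.7 * o + rng.gauss(0, 0.7) for o in over_t1]
narc_t2 = [0.5 * c + 0.3 * o + rng.gauss(0, 0.8)
           for c, o in zip(narc_t1, over_t1)]

results = {
    "cross-sectional, Time 1": pearson_r(over_t1, narc_t1),
    "cross-sectional, Time 2": pearson_r(over_t2, narc_t2),
    "autocorrelation, overvaluation": pearson_r(over_t1, over_t2),
    "autocorrelation, narcissism": pearson_r(narc_t1, narc_t2),
    "cross-lag, overvaluation T1 -> narcissism T2": pearson_r(over_t1, narc_t2),
    "cross-lag, narcissism T1 -> overvaluation T2": pearson_r(narc_t1, over_t2),
}
for label, r in results.items():
    print(f"{label}: r = {r:.3f}")
```

With data generated this way, the overvaluation-to-narcissism cross-lag correlation should come out clearly positive while the reverse cross-lag stays near zero. A real analysis would also test each correlation for statistical significance before drawing any conclusion about temporal precedence.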

Three Possible Patterns from a Cross-Lag Study. The results of the cross-lag correlations in the Brummelman study could have followed one of three patterns. The study did show that parental overpraise (overvaluation) at earlier time periods was significantly correlated with child narcissism at the later time periods. Such a pattern was consistent with the argument that overpraise leads to increases in narcissism over time. However, the study could have shown the opposite result— that narcissism at earlier time periods was significantly correlated with overpraise later. Such a pattern would have indicated that the childhood narcissistic tendency came first, leading parents to change their type of praise later.

FIGURE 9.3 Results of a cross-lag study. The cross-lag correlations in this study are consistent with the conclusion that parental overpraise comes before narcissism because overpraise in early time periods significantly predicts later narcissism, but narcissism in earlier time periods was not significantly (n.s.) related to later overpraise. The arrows point in only one direction because in each case the method makes clear which variable came first in time; Time 1 always comes before Time 2, and so on. Values shown are associated with mothers’ overpraise. (Source: Adapted from Brummelman et al., 2015.)
[Diagram: overvaluation and narcissism at Times 1 to 4, with cross-lag arrows between the two variables. Path values as printed: n.s., n.s., n.s.; .063, .068.]

Finally, the study could have shown that both correlations are significant: that overpraise at Time 1 predicted narcissism at Time 2 and that narcissism at Time 1 predicted overpraise at Time 2. If that had been the result, it would mean excessive praise and narcissistic tendencies are mutually reinforcing. In other words, there is a cycle in which overpraise leads to narcissism, which leads parents to overpraise, and so on.
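The three patterns amount to a small decision rule, which can be written out as a sketch. The function name `describe_cross_lag` and its arguments are hypothetical, and the final branch (neither path significant) is an addition for completeness that the text does not discuss.

```python
def describe_cross_lag(x_to_y_significant, y_to_x_significant,
                       x="overpraise", y="narcissism"):
    """Summarize which temporal-precedence pattern a cross-lag result fits.

    Each flag says whether the earlier measure of one variable
    significantly predicts the later measure of the other.
    """
    if x_to_y_significant and y_to_x_significant:
        return f"{x} and {y} are mutually reinforcing"
    if x_to_y_significant:
        return f"{x} comes before {y}"
    if y_to_x_significant:
        return f"{y} comes before {x}"
    # Not one of the three patterns in the text; added for completeness.
    return "no evidence of temporal precedence in either direction"

# The pattern reported for mothers in the Brummelman study:
print(describe_cross_lag(True, False))
```

The rule only labels a pattern of significance; it says nothing about third variables, which is why temporal precedence alone does not establish causation.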

Longitudinal Studies and the Three Criteria for Causation

Longitudinal designs can provide some evidence for a causal relationship by means of the three criteria for causation:

1. Covariance. Significant relationships in longitudinal designs help establish covariance. When two variables are significantly correlated (as in the cross-lag correlations in Figure 9.3), there is covariance.

2. Temporal precedence. A longitudinal design can help researchers make inferences about temporal precedence. Because each variable is measured at two or more points in time, researchers know which measurement came first. By comparing the relative strength of the two cross-lag correlations, the researchers can see which path is stronger. If only one of them is statistically significant (as in the Brummelman overvaluation and narcissism study), the researchers move a little closer to determining which variable comes first and might therefore be causing the other.

3. Internal validity. When conducted simply (by measuring only the two key variables), longitudinal studies do not help rule out third variables. For example, the Brummelman results presented in Figure 9.3 cannot clearly rule out the possible third variable of socioeconomic status. It’s possible that parents in higher income brackets overpraise their children, and also that children in upper-income families are more likely to think they’re better than other kids.

However, researchers can sometimes design their studies or conduct subsequent analyses in ways that address some third variables. For example, in the Brummelman study, one possible third variable is gender. What if boys show higher levels of narcissism than girls, and what if parents of boys are also more likely to overpraise them? Gender might be associated with both variables. Participant gender does not threaten internal validity here, however, because Brummelman and his colleagues report that the pattern was the same when boys and girls were examined separately. Thus, gender is a potential third variable, but by studying the longitudinal patterns of boys and girls separately, the Brummelman team was able to rule it out.

Why Not Just Do an Experiment?

Why would Brummelman and his team go to the trouble of tracking children every 6 months for 2 years? Why didn’t they just do an experiment? After all, conducting experiments is the only certain way to confirm or disconfirm causal claims. The problem is that in some cases people cannot be randomly assigned to a causal variable of interest. For example, we cannot manipulate personality traits, such as narcissism in children. Similarly, while parents might be able to learn new ways to praise their children, they can’t easily be assigned to daily parenting styles, so it’s hard to manipulate this variable.

In addition, it could be unethical to assign some people, especially children, to a condition in which they receive a certain type of praise, especially over a long time period. Particularly if we suspect that one type of praise might make children narcissistic, it would not be ethical to expose children to it in an experimental setting. Similarly, if researchers suspect that smoking causes lung cancer or sexual content on TV causes pregnancy, it would be unethical (and difficult) to ask study participants to smoke cigarettes or watch certain TV shows for several years. When an experiment is not practical or ethical, a longitudinal correlational design is a good option.

Nevertheless, researchers who investigate how children react to different types of praise have not relied solely on correlational data. They have developed ethical experiments to study such reactions, at least over a short time period (Brummelman, Crocker, & Bushman, 2016; Mueller & Dweck, 1998). By randomly assigning children to receive praise for who they are (e.g., “You are so smart”) versus praise for how hard they worked (e.g., “You must have worked hard at these problems”), researchers have produced some solid evidence that children really do change their behavior and attitudes in response to adult praise (Figure 9.4). Because it is ethically questionable to expose children to potentially harmful feedback, such studies had to pass strict ethical review and approval before they were conducted (see Chapter 4). In addition, the exposure time was short (only one instance of praise per study, and no instances of criticism). It would be much more challenging to do an ethical experimental study of the effects of long-term exposure to potentially maladaptive praise at home. That makes longitudinal correlational designs an attractive alternative.

FIGURE 9.4 Praising children. Correlational and experimental studies suggest that when adults praise children’s learning strategies and efforts (compared to praising the type of person they are), kids respond favorably and continue to work hard.


1. Why is a longitudinal design considered a multivariate design?

2. What are the three kinds of correlations obtained from a longitudinal design? What does each correlation represent?

3. Describe which patterns of temporal precedence are indicated by different cross-lag correlational results.

1. See p. 239. 2. See pp. 239–242. 3. See p. 241.



RULING OUT THIRD VARIABLES WITH MULTIPLE-REGRESSION ANALYSES

Groundbreaking research suggests that pregnancy rates are much higher among teens who watch a lot of TV with sexual dialogue and behavior than among those who have tamer viewing tastes. (CBSNews, 2008)

This news item, referring to a study on TV content and teenage pregnancy, reports a simple association between the amount of sexual content teens watch on TV and their likelihood of becoming pregnant (Chandra et al., 2008). But is there a causal link? Does sexual TV content cause pregnancy? Apparently there is covariance: According to the published study, teens who watched more sexual material on TV were more likely to get pregnant. What about temporal precedence? Did the TV watching come before the pregnancy? According to the report, this study did establish temporal precedence because the researchers first asked teens to report the types of TV shows they liked to watch, and then followed up with the very same teens 3 years later to find out if they had experienced a pregnancy.

What about internal validity? Third variables could explain the association. Perhaps one is age: Older teenagers might watch more mature TV programs, and they’re also more likely to be sexually active. Or perhaps parenting is a third variable: Stricter parents might monitor their teens’ TV use and also put tighter limits on their behavior.

How do we know whether one of these variables, or some other one, is the true explanation for the association? This study used a statistical technique called multiple regression (or multivariate regression), which can help rule out some third variables, thereby addressing some internal validity concerns.

Measuring More Than Two Variables

In the sexual TV content and pregnancy study, the researchers investigated a sample of 1,461 teenagers on the two key variables (Chandra et al., 2008). To measure the amount of sexual TV content viewed, they had participants report how often they watched 23 different programs popular with teens. Then coders watched 14 episodes of each show, counting how many scenes involved sex, including passionate kissing, sexually explicit talk, or intercourse. To assess pregnancy rates 3 years later, they asked girls “Have you ever been pregnant?” and asked boys “Have you ever gotten a girl pregnant?” The two variables were positively correlated: Higher amounts of sex on TV were associated with a higher risk of pregnancy (Figure 9.5).

If the researchers had stopped there and measured only these two variables, they would have conducted a bivariate correlational study. However, they also measured several other variables, including the total amount of time teenage participants watched any kind of TV, their age, their academic grades, and whether they lived with both parents. By measuring all these variables instead of just two (with the goal of testing the interrelationships among them all), they conducted a multivariate correlational study.


FIGURE 9.5 Correlating sexual TV content with pregnancy risk. Higher rates of sexual content on TV go with higher risk of pregnancy, and lower rates of sexual content go with lower risk of pregnancy. (Data are fabricated for illustration purposes.)
[Scatterplot. Axis labels: exposure to sexual TV content; pregnancy risk; scale marked Low to High.]

By conducting a multivariate design, researchers can evaluate whether a relationship between two key variables still holds when they control for another variable. To introduce what “controlling for” means, let’s focus on one potential third variable: age. Perhaps sexual content and pregnancy are correlated only because older teens are both more likely to watch more mature shows and more likely to be sexually active. If this is the case, all three variables are correlated with one another: Viewing sex on TV and getting pregnant are correlated, as we already determined, but sex on TV and age are also correlated with each other, and age and pregnancy are correlated, too. The researchers want to know whether age, as a third variable correlated with both the original variables, can account for the relationship between sexual TV content and pregnancy rates. To answer the question, they see what happens when they control for age.

You’ll learn more about multiple-regression computations in a full-semester statistics course; this book will focus on a conceptual understanding of what these analyses mean. The most statistically accurate way to describe the phrase “control for age” is to talk about proportions of variability. Researchers are asking whether, after they take the relationship between age and pregnancy into account, there is still a portion of variability in pregnancy that is attributable to watching sexy TV. But this is extremely abstract language. The meaning is a bit like asking about the overall movement (the variance) of your wiggling, happy dog when you return home. You can ask, “What portion of the variability in my dog’s overall movement is attributable to his tail moving? To his shoulders moving? To his back legs moving?” You can ask, “Will the dog still be moving when he greets me, even if I were to hold his tail constant, that is, hold it still?”

An easier way to understand “controlling for” is to recognize that testing a third variable with multiple regression is similar to identifying subgroups. We can think of it like this: We start by looking only at the oldest age group (say, 20-year-olds) and see whether viewing sexual TV content and pregnancy are still correlated. Then we move to the next oldest group (age 18), then the youngest group (age 16). We ask whether the bivariate relationship still holds at all levels.

There are a couple of possible outcomes from such a subgroup analysis, and one is shown in the scatterplot in Figure 9.6. Here, the overall association is positive: the more sexual TV programs teens watch, the higher the chance of getting pregnant. In addition, the oldest teens (the 20 symbols) are, overall, higher on sexual TV content and higher in chance of pregnancy. The youngest teens (16 symbols) are, overall, lower on sexual TV content and lower in chance of pregnancy. If we look only at the 20-year-olds, or only at the 16-year-olds, however, we still find the key relationship between sexy TV and pregnancy: It remains positive even within these age subgroups. Therefore, the relationship is still there, even when we hold age constant.

FIGURE 9.6 The association between sexual TV content and pregnancy remains positive, even controlling for age. The overall relationship shown here is positive, and this holds even within the three subgroups: age 20, age 18, and age 16. (Data are fabricated for illustration purposes.)
[Scatterplot with each point labeled by age (20, 18, or 16). Axis labels: exposure to sexual TV content; pregnancy risk; scale marked Low to High.]

In contrast, the second possible outcome is shown in Figure 9.7. Here, the overall relationship is still positive, just as before: the more sexual content teens watch on TV, the higher the chance of pregnancy. In addition, just as before, the 20-year-olds watch more sexy TV and are more likely to become pregnant. However, this time, when we look only at the age 20 subgroup or only the age 16 subgroup, the key relationship between sexy TV and pregnancy is absent. The scatterplots within the age subgroups do not show the relationship anymore. Therefore, the association between watching sexual TV content and getting pregnant goes away when we control for age. In this case, age was, indeed, the third variable that was responsible for the relationship.
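The two possible outcomes of a subgroup analysis can be reproduced with fabricated data. The sketch below is a rough illustration under stated assumptions: the group sizes, seed, and slopes are invented, and `pearson_r`, `simulate`, and `correlations` are hypothetical helper names. In both datasets, older teens score higher on both variables; the only difference is whether TV exposure still relates to pregnancy risk inside each age group.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate(within_age_slope, rng, n_per_group=400):
    """Fabricate (age, tv, risk) rows; age raises both tv and risk."""
    rows = []
    for age in (16, 18, 20):
        age_effect = (age - 16) * 0.5          # 0, 1, 2 for the three groups
        for _ in range(n_per_group):
            tv = age_effect + rng.gauss(0, 1)
            risk = age_effect + within_age_slope * tv + rng.gauss(0, 1)
            rows.append((age, tv, risk))
    return rows

def correlations(rows):
    """Return the overall r and a dict of r values within each age group."""
    overall = pearson_r([r[1] for r in rows], [r[2] for r in rows])
    by_age = {}
    for age in (16, 18, 20):
        sub = [r for r in rows if r[0] == age]
        by_age[age] = pearson_r([r[1] for r in sub], [r[2] for r in sub])
    return overall, by_age

rng = random.Random(1)
persists_overall, persists_by_age = correlations(simulate(0.5, rng))  # like Figure 9.6
vanishes_overall, vanishes_by_age = correlations(simulate(0.0, rng))  # like Figure 9.7

print("Relationship persists:", round(persists_overall, 2), persists_by_age)
print("Relationship vanishes:", round(vanishes_overall, 2), vanishes_by_age)
```

In the second dataset the overall correlation is positive even though TV exposure has no effect at all within any age group, which is exactly the situation in which age is the third variable responsible for the bivariate association.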

Regression Results Indicate If a Third Variable Affects the Relationship

Which one of the two scatterplots, Figure 9.6 or Figure 9.7, best describes the relationship between sexual content on TV and pregnancy? The statistical technique of multiple regression can tell us. When researchers use regression, they are testing whether some key relationship holds true even when a suspected third variable is statistically controlled for.

As a consumer of information, you’ll probably work with the end result of this process, when you encounter regression results in tables in empirical journal articles. Suppose you’re reading an article and you come across a table showing what the regression results would look like for the sexy TV/pregnancy example. What do the numbers mean? What steps did the researchers follow to come up with them?


FIGURE 9.7 The association between sexual TV content and pregnancy goes away, controlling for age. The overall association shown here is positive, but if we separately consider the subgroups of age 20, age 18, or age 16, there is no relationship between the two variables. (Data are fabricated for illustration purposes.)
[Scatterplot with each point labeled by age (20, 18, or 16). Axis labels: exposure to sexual TV content; pregnancy risk; scale marked Low to High.]

When researchers use multiple regression, they are studying three or more variables. The first step is to choose the variable they are most interested in understanding or predicting; this is known as the criterion variable, or dependent variable. The Chandra team was primarily interested in predicting pregnancy, so they chose that as their criterion variable. The criterion (dependent) variable is almost always specified in either the top row or the title of a regression table, like Table 9.1.

The rest of the variables measured in a regression analysis are called predictor variables, or independent variables. In the sexy TV/pregnancy study, the predictor variables are the amount of sexual content teenagers reported viewing on TV and the age of each teen. In Table 9.1, the two predictor variables are listed below the criterion variable.


The point of the multiple-regression results in Table 9.1 is to see whether the relationship between exposure to sex on TV and pregnancy might be explained by a third variable: age. Does the association remain, even within each age group (as in Figure 9.6)? Or does the relationship between sexy TV and pregnancy go away within each age group (as in Figure 9.7)? The betas in Table 9.1 help answer this central question.

Beta Basics. In a regression table like Table 9.1, there is often a column labeled beta (or β, or even standardized beta). There will be one beta value for each predictor variable. Beta is similar to r, but it reveals more than r does. A positive beta, like a positive r, indicates a positive relationship between that predictor variable and the criterion variable, when the other predictor variables are statistically controlled for. A negative beta, like a negative r, indicates a negative relationship between two variables (when the other predictors are controlled for). A beta that is zero, or not significantly different from zero, represents no relationship (when the other predictors are controlled for). Therefore, betas are similar to correlations in that they denote the direction and strength of a relationship. The farther beta is from zero, in either direction, the stronger the relationship is between that predictor variable and the criterion variable. The closer beta is to zero, the weaker the relationship.

Within a single regression table, we can usually compare predictor variables that show larger betas to predictor variables with smaller betas—the larger the beta, the stronger the relationship. For example, in Table 9.1 we can say that the beta for the age predictor is stronger than the beta for the exposure to sex on TV predictor. (However, it is not appropriate to compare the strengths of betas from one regression table to the strengths of betas from another one.)

Unlike r, there are no quick guidelines for beta to indicate effect sizes that are weak, moderate, or strong. The reason is that betas change, depending on what other predictor variables are being used—being controlled for—in the regression.
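To see why a beta depends on what else is in the regression, consider the special case of exactly two predictors, where each beta can be computed from the three bivariate correlations with a standard formula. The sketch below uses invented correlation values (not the Chandra data), and the function name `standardized_betas` is hypothetical.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Betas for a regression with two predictors, from zero-order correlations.

    r_y1, r_y2: correlation of each predictor with the criterion variable.
    r_12: correlation between the two predictors (must not be +1 or -1).
    Formula: beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2), and likewise
    for beta_2 with the roles of the predictors swapped.
    """
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2

# Invented values: r(TV, pregnancy) = .35 and r(age, pregnancy) = .40.
r_tv_preg, r_age_preg = 0.35, 0.40

# Vary how strongly the two predictors overlap with each other.
for r_tv_age in (0.0, 0.30, 0.80):
    beta_tv, beta_age = standardized_betas(r_tv_preg, r_age_preg, r_tv_age)
    print(f"r(TV, age) = {r_tv_age:.2f}: "
          f"beta_TV = {beta_tv:.3f}, beta_age = {beta_age:.3f}")
```

When the predictors are uncorrelated, each beta simply equals the corresponding r. As the overlap between TV exposure and age grows, the beta for TV exposure shrinks (to about 0.25 at an overlap of .30 and about 0.08 at .80 in this example), even though its bivariate correlation with pregnancy never changed.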

TABLE 9.1
Multiple-Regression Results from a Study Predicting Pregnancy from Sexual Content on TV and Age

Criterion (dependent) variable: Pregnancy risk      Beta
Predictor (independent) variables:
  Exposure to sex on TV                             0.25*
  Age                                               0.33*

Note: Data are fabricated, based on imagined results if the researchers had used only two predictor variables. *p < .001.

Sometimes a regression table will include the symbol b instead of beta. The coefficient b represents an unstandardized coefficient. A b is similar to beta in that the sign of b, positive or negative, denotes a positive or negative association (when the other predictors are controlled for). But unlike two betas, we cannot compare two b values within the same table to each other. The reason is that b values are computed from the original measurements of the predictor variables (such as dollars, centimeters, or inches), whereas betas are computed from predictor variables that have been changed to standardized units. A predictor variable that shows a large b may not actually denote a stronger relationship to the criterion variable than a predictor variable with a smaller b.
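The dependence of b on measurement units can be checked with a few lines of code. The sketch below is a simplification: it uses a single predictor at a time rather than a full multiple regression, and the height and outcome values are made up for illustration. The function name `simple_regression` is hypothetical.

```python
def simple_regression(xs, ys):
    """Return (b, beta) for predicting ys from a single predictor xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # unstandardized: outcome units per x unit
    beta = b * (sxx / syy) ** 0.5    # standardized: b * (sd_x / sd_y)
    return b, beta

# Made-up data: the same heights expressed in two different units.
height_cm = [150.0, 160.0, 165.0, 170.0, 180.0, 185.0]
height_in = [cm / 2.54 for cm in height_cm]
outcome = [52.0, 55.0, 61.0, 60.0, 68.0, 70.0]

b_cm, beta_cm = simple_regression(height_cm, outcome)
b_in, beta_in = simple_regression(height_in, outcome)

print(f"centimeters: b = {b_cm:.3f}, beta = {beta_cm:.3f}")
print(f"inches:      b = {b_in:.3f}, beta = {beta_in:.3f}")
```

The b for inches is 2.54 times the b for centimeters, although the underlying relationship is identical; the beta is the same in both cases. This is why the size of b reflects the predictor's units as much as the strength of the relationship.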

Interpreting Beta. In Table 9.1, notice that the predictor variable “exposure to sex on TV” has a beta of 0.25. This positive beta, like a positive r, means higher levels of sex on TV go with higher pregnancy risk (and lower levels of sex on TV go with lower pregnancy risk), even when we statistically control for the other predictor on this table—age. In other words, even when we hold age constant statistically, the relationship between exposure to TV sex and pregnancy is still there. This result is consistent with the relationship depicted in Figure 9.6, not the one in Figure 9.7.

The other beta in Table 9.1, the one associated with the age predictor variable, is also positive. This beta means that older age is associated with higher pregnancy rates, when exposure to sex on TV is controlled for. In other words, when we hold exposure to sex on TV constant, age predicts pregnancy, too. In sum, the beta that is associated with a predictor variable represents the relationship between that predictor variable and the criterion variable, when the other predictor variables in the table are controlled for.

Statistical Significance of Beta. The regression tables in empirical journal articles often have a column labeled sig or p, or an asterisked footnote giving the p value for each beta. Such data indicate whether each beta is statistically significantly different from zero. As introduced in Chapter 8, the p value gives the probability of obtaining a beta at least this far from zero if the sample came from a population in which the relationship is zero. When p is less than .05, the beta (i.e., the relationship between that predictor variable and the criterion variable, when the other predictor variables are controlled for) is considered statistically significant. When p is greater than .05, the beta is considered not significant, meaning we cannot conclude beta is different from zero.

In Table 9.1, both of the betas reported are statistically significant. Table 9.2 gives several appropriate ways to explain what the significant beta for the TV variable means.

❮❮ For more on statistical significance, see Statistics Review: Inferential Statistics, pp. 499–500.


TABLE 9.2 Describing the Significant Beta of 0.25 in Table 9.1


• The relationship between exposure to sex on TV and pregnancy is positive (high levels of sex on TV are associated with higher levels of pregnancy risk), even when age is controlled for.

• The relationship between exposure to sex on TV and pregnancy is positive even when age is controlled for.

• The relationship between exposure to sex on TV and pregnancy is positive (high levels of sex on TV are associated with higher pregnancy risk), and is not attributable to the third variable of age because it holds even when age is held constant.

250 CHAPTER 9 Multivariate Correlational Research

What If Beta Is Not Significant? To answer this question, we’ll use an example from a different line of research: family meals and child academic achievement. When these two variables are studied as a bivariate relationship, researchers find that children in families that eat many meals together (dinners and breakfasts) tend to be more academically successful, compared to kids in families that eat only a few meals together.

Once again, this simple bivariate relationship is not enough to show causation. In many studies, family meal habits and academic success are measured at the same time, so there is a temporal precedence problem: Did family meals come first and reinforce key academic skills, leading to higher achievement? Or did high academic success come first, perhaps making it more pleasant for parents to have meals with their kids? In addition, there are third variables that present an internal validity concern. For instance, more involved parents might arrange more family meals, and more involved parents might also have higher-achieving children.

A multiple-regression analysis could hold parental involvement constant and see if family meal frequency is still associated with academic success. In one such study, the researchers found that when parental involvement was held constant (along with other variables), family meal frequency was no longer a significant predictor of school success (Miller, Waldfogel, & Han, 2012). This pattern of results suggests that family meals correlated with academic success only because of the third-variable problem of parental involvement (Table 9.3).

In other words, although frequency of family meals and academic success are significantly related in their bivariate relationship, that relationship goes away when potential third variables, such as parental involvement, are controlled for.
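The following Python sketch illustrates how a bivariate relationship can "go away" once a third variable is controlled for. It uses simulated data, not the Miller et al. data: the names `involvement`, `meals`, and `success` and the 0.6 effect sizes are assumptions, and the simulation is built so that parental involvement drives both of the other variables.

```python
import random
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def beta_controlling_for(y, x, control):
    """Standardized beta for x predicting y, with one control variable."""
    r_yx, r_yc, r_xc = pearson_r(y, x), pearson_r(y, control), pearson_r(x, control)
    return (r_yx - r_yc * r_xc) / (1 - r_xc ** 2)

# Hypothetical scenario: involvement causes both meals and success;
# meals have no effect of their own on success.
rng = random.Random(42)
n = 5000
involvement = [rng.gauss(0, 1) for _ in range(n)]
meals = [0.6 * p + rng.gauss(0, 1) for p in involvement]
success = [0.6 * p + rng.gauss(0, 1) for p in involvement]

r_bivariate = pearson_r(meals, success)
beta_meals = beta_controlling_for(success, meals, involvement)
beta_involvement = beta_controlling_for(success, involvement, meals)

print(f"bivariate r(meals, success)              = {r_bivariate:.2f}")
print(f"beta for meals, involvement controlled   = {beta_meals:.2f}")
print(f"beta for involvement, meals controlled   = {beta_involvement:.2f}")
```

In this sketch the bivariate correlation is clearly positive, while the beta for meals is close to zero once involvement is held constant. That is the same qualitative pattern as a nonsignificant beta for family meals alongside a significant beta for parental involvement.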


TABLE 9.3 Multiple-Regression Results from a Study Predicting Academic Success from Frequency of Family Meals and Parental Involvement

Predictor (independent) variables     Beta     Sig
Frequency of family meals            −0.01     Not significant
Parental involvement                  0.09     *

Note: Data are fabricated, but reflect actual research. The study controlled for not only parental involvement, but also income, family structure, school quality, birth weight, school type, and many other possible third variables. When controlling for all these in a sample of more than 20,000 children, the researchers found that the beta for frequency of family meals was not significant. *p < .001. Source: Adapted from Miller et al., 2012.

251Ruling Out Third Variables with Multiple-Regression Analyses


TABLE 9.5 Multiple-Regression Results from a Study Predicting Pregnancy from Exposure to Sex on TV and Other Variables

Predictor (independent) variables                         Coefficient   Sig
Exposure to sex on TV                                        0.44       *
Total television exposure                                   −0.42       *
Age                                                          0.28       *
Lower grades                                                 0.21       n.s.
Parent education                                             0.00       n.s.
Educational aspirations
  (highest level of school you plan to finish)              −0.14       n.s.
Being Hispanic (vs. other ethnicities)                       0.86       n.s.
Being Black (vs. other ethnicities)                          1.20       *
Being female                                                 1.20       *
Living in a 2-parent household                              −1.50       *
History of deviant or problem behavior
  (e.g., skipping school, stealing, cheating on a test)      0.43       *
Intention to have children before age 22                     0.61       n.s.

*p ≤ .001. Source: Adapted from Chandra et al., 2008, Table 2.



TABLE 9.4

• The relationship between family meal frequency and child academic success is not significant when controlling for parental involvement.

• The relationship between family meal frequency and child academic success can be explained by the third variable of parental involvement.

• The relationship between family meal frequency and child academic success goes away when parental involvement is held constant.

When you hold parental involvement constant, there is no longer a relationship between frequency of family meals and academic success (Table 9.4).

Adding More Predictors to a Regression

Up to now, when considering the relationship between sexual TV content and pregnancy, we’ve focused on only one potential internal validity problem—age. But remember there are many other possible third variables. What about participation in school activities? What about living with one versus two parents? In fact, the Chandra team measured each of those third variables and even added a few more, such as parental education, ethnicity, and having a history of problem behaviors (Chandra et al., 2008). Table 9.5 shows every variable tested, as well as the multiple-regression results for all the other variables.

Even when there are many more predictor variables in the table, beta still means the same thing. The beta for exposure to sex on TV is positive: High levels of sex on TV are associated with a higher pregnancy rate, when the researchers controlled for age, total TV exposure, lower grades, parent education, educational aspirations, and so on, down to intention to have children before age 22. Even after controlling for all variables listed in Table 9.5, the researchers found that more exposure to sex on TV predicts a higher chance of pregnancy.

Adding several predictors to a regression analysis can help answer two kinds of questions. First, it helps control for several third variables at once. In the Chandra study, even after all other variables were controlled for, exposure to sex on TV still predicted pregnancy. A result like that gets the researchers a bit closer to making a causal claim because the relationship between the suspected cause (sexy TV) and the suspected effect (pregnancy) does not appear to be attributable to any of the other variables that were measured.

Second, by looking at the betas for all the other predictor variables, we can get a sense of which factors most strongly predict chance of pregnancy. One strong predictor is gender, which, as you can see, has a beta of 1.20, even when the other variables are controlled for. This result means girls are more likely to report becoming pregnant than boys are to report getting a girl pregnant. (Even though it takes two to cause a pregnancy, presumably boys are sometimes unaware of getting a girl pregnant, whereas a girl is more certain.) We also notice that teens with a history of deviant behavior have a higher risk of pregnancy, controlling for exposure to sex on TV, age, grades, and the other variables in the table. In fact, the predictive power of history of deviant behavior is about the same magnitude as that of exposure to sex on TV. Even though the authors of this study were most interested in describing the potential risk of viewing sexual content on TV, they were also able to evaluate which other variables are important in predicting pregnancy. (Recall, however, that when a table presents b values, or unstandardized coefficients, it is not appropriate to compare their relative strength. We can only do so with beta, and even then, remember that betas change depending on what other predictor variables are used.)
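The caution about b values can be made concrete with a small Python sketch. It is simplified to a single predictor, and the six data points are hypothetical. It shows that an unstandardized coefficient b depends on the units of the predictor (age in years versus months), whereas the standardized beta does not.

```python
from statistics import mean, pstdev

def slope_and_beta(x, y):
    """Return (b, beta) for a one-predictor regression of y on x."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    b = cov / pstdev(x) ** 2             # unstandardized: y units per x unit
    beta = b * pstdev(x) / pstdev(y)     # standardized: SD units of y per SD of x
    return b, beta

# Hypothetical data: the same ages expressed in two different units.
age_years = [13, 14, 15, 16, 17, 18]
risk = [1.0, 1.4, 1.5, 2.1, 2.2, 2.8]
age_months = [12 * a for a in age_years]

b_years, beta_years = slope_and_beta(age_years, risk)
b_months, beta_months = slope_and_beta(age_months, risk)

print(f"b (per year)  = {b_years:.3f}, beta = {beta_years:.3f}")
print(f"b (per month) = {b_months:.3f}, beta = {beta_months:.3f}")
```

Because b shrinks by a factor of 12 when the unit changes from years to months, its size says little about how strong a predictor is relative to others measured on different scales. Beta is unchanged by the rescaling, which is why it is the coefficient used for such comparisons.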

Regression in Popular Media Articles

When making association claims in the popular media—magazines, newspapers, websites—journalists seldom discuss betas, p values, or predictor variables. Because they’re writing for a general audience, they assume most of their readers will not be familiar with these concepts. However, if you read carefully, you can detect when a multiple regression has been used if a journalist uses one of the phrases in the sections that follow.


The phrase “controlled for” is one common sign of a regression analysis. For example, when journalists covered the story about family meals and academic success, they stated the findings like this:

Researchers . . . determined that there wasn’t any relationship between family meals and a child’s academic outcomes or behavior. . . . Miller and his team also controlled for factors such as parental employment, television-watching, the quality of school facilities, and the years of experience the children’s teachers had, among others. (Family Dinner Benefits, 2012; emphasis added)



Here’s another example from an article about a study of male military veterans. This is the headline: “Perk of a good job: aging mind is sharp.”

Mentally demanding jobs come with a hidden benefit: less mental decline with age. Work that requires decision making, negotiating with others, analysis, and making judgments may not necessarily pad your bank account. But it does build up your “cognitive reserve”—a level of mental function that helps you avoid or compensate for age-related mental decline. (DeNoon, 2008)

In this story, the central association is between how mentally demanding a man’s job is and his cognitive functioning as he ages. The more demanding the job, the less cognitive decline he suffers. But could there be a third variable, such as intelligence or level of education? Perhaps the veterans in the study who were better educated were more likely to have a mentally challenging job and to experience less cognitive decline. However, the story goes on:

After taking into account both intelligence and education, [the researchers] found that men with more complex jobs—in terms of general intellectual demands and human interaction and communication—performed significantly better on tests of mental function. (DeNoon, 2008; emphasis added)

The phrase “taking into account” means the researchers conducted multiple-regression analyses. Even when they controlled for education and intelligence, they still found a relationship between job complexity and cognitive decline.


When the sexy TV/pregnancy study was reported online, the journalist mentioned the simple relationship between exposure to sexual TV content and getting pregnant, and then wrote:

Chandra said TV watching was strongly connected with teen pregnancy even when other factors were considered, including grades, family structure, and parents’ education level. (CBSnews, 2008)

The phrase “even when other factors were considered” indicates the researchers used multiple regression.


Similar terminology such as “adjusting for” can also indicate multiple regression. Here’s a study that found a relationship between eating chocolate and body mass (Figure 9.8):

The people who ate chocolate the most frequently, despite eating more calories and exercising no differently from those who ate the least chocolate, tended to have lower B.M.I.’s. . . . The researchers adjusted their results for a number of variables, including age, gender, depression, vegetable consumption, and fat and calorie intake. “It didn’t matter which of those you added, the relationship remained very stably significant.” (O’Connor, 2012; emphasis added)

In sum, journalists can use a variety of phrases to describe a study’s use of multiple regression. When you encounter an association claim in a magazine, newspaper, or online, one of your questions should be whether the researchers controlled for possible third variables. If you can’t tell from the story what the researchers controlled for, it’s reasonable to suspect that certain third variables cannot be ruled out.

Regression Does Not Establish Causation

Multiple regression might seem to be a foolproof way to rule out all kinds of third variables. If you look at the data in Table 9.5 on exposure to TV sex and pregnancy, for example, you might think you can safely make a causal statement now, since the researchers controlled for so many internal validity problems. They seem to have thought of everything! Not so fast. One problem is that even though multivariate designs analyzed with regression statistics can control for third variables, they are not always able to establish temporal precedence. Of course, the Chandra study did measure viewing sexual TV content 3 years before pregnancies occurred. But others, such as the study on family meals and academic achievement, may not.

Even when a study takes place over time (longitudinally), another very important problem is that researchers cannot control for variables they do not measure. Even though multiple regression controls for any third variables the researchers do measure in the study, some other variable they did not consider could account for the association. In the sexy TV/pregnancy study, some unmeasured variable—maybe the teenagers’ level of religiosity or the geographic area where they live—might account for the relationship between watching sex on TV and getting pregnant. But since those possible third variables were not measured (or even considered), there is no way of knowing (Figure 9.9).

[Figure text: The Chocolate Diet? “Since so many complicating factors can influence results, it is difficult to pinpoint cause and effect. But the researchers adjusted their results for a number of variables, including age, gender, depression, vegetable consumption, and fat and calorie intake.”]

FIGURE 9.8 Multiple regression in the popular media. This journalist wrote that people who ate more chocolate had lower body mass index, and that the researchers adjusted their results for several variables. The phrase “adjusted for” signals a regression analysis, thereby ruling out those variables as internal validity problems. (Source: O’Connor, 2012.)

In fact, some psychological scientists have critiqued media studies like this one, arguing that certain types of teenagers are predisposed to watching sexual TV content, and these same teens are also more likely to be sexually active (Steinberg & Monahan, 2011). These critics contend that the relationship between sexual media content and sexual activity is attributable to this predisposition (see Collins, Martino, Elliott, & Miu, 2011).

This unknown third-variable problem is one reason that a well-run experimental study is ultimately more convincing in establishing causation than a correlational study. An experimental study on TV, for example, would randomly assign a sample of people to watch either sexy TV shows or programs without sexual content. The power of random assignment would make the two groups likely to be equal on any third variables the researchers did not happen to measure, such as religiosity, social class, or parenting styles. But of course, just like randomly assigning children to get one type of praise or another, it is ethically questionable to conduct an experiment on sexual TV content.
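A short Python simulation can illustrate why random assignment tends to equalize groups on variables nobody measured. Everything here is hypothetical: the sample size, the `religiosity` scores, and the group names are assumptions made for the sketch, not details of any actual experiment.

```python
import random
from statistics import mean

rng = random.Random(1)
n = 10000

# An unmeasured third variable: the (hypothetical) researchers never see it.
religiosity = [rng.gauss(0, 1) for _ in range(n)]

# Random assignment: shuffle the participants, then split them in half.
people = list(range(n))
rng.shuffle(people)
sexy_tv_group = people[: n // 2]
control_group = people[n // 2 :]

mean_sexy = mean(religiosity[i] for i in sexy_tv_group)
mean_control = mean(religiosity[i] for i in control_group)
difference = mean_sexy - mean_control

print(f"mean religiosity, sexy TV group = {mean_sexy:.3f}")
print(f"mean religiosity, control group = {mean_control:.3f}")
print(f"difference                      = {difference:.3f}")
```

With a large sample, the two group means come out nearly identical even though religiosity played no part in the assignment. The same logic applies to any other unmeasured variable, which is what regression on measured variables cannot offer.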

FIGURE 9.9 Possible third variables in the association between sexual TV content and pregnancy. What additional third variables, not already measured by the researchers, might be associated with both watching sex on TV and pregnancy?


A randomized experiment is the gold standard for determining causation. Multiple regression, in contrast, allows researchers to control for potential third variables, but only for the variables they choose to measure.


1. Describe what it means to say that some variable “was controlled for” in a multivariate study.

2. How many criterion variables are there in a multiple-regression analysis? How many predictor variables?

3. What does a significant beta mean? What does a nonsignificant beta mean?

4. Give at least two phrases indicating that a study used a multiple-regression analysis.

5. What are two reasons that multiple-regression analyses cannot completely establish causation?

1. See pp. 245–247. 2. One criterion variable, and at least two predictor variables. See pp. 247–248. 3. See pp. 248–251. 4. See pp. 252–254. 5. See pp. 254–256.

GETTING AT CAUSALITY WITH PATTERN AND PARSIMONY

So far this chapter has focused on two multivariate techniques that help researchers investigate causation, even when they’re working with correlations among measured variables. Longitudinal correlational designs can satisfy the temporal precedence criterion. And multiple-regression analyses statistically control for some potential internal validity problems (third variables).

In this section, we explore how researchers can investigate causality by using a variety of correlational studies that all point in a single, causal direction. This approach can be called “pattern and parsimony” because there’s a pattern of results best explained by a single, parsimonious causal theory. As discussed in Chapter 1, parsimony is the degree to which a scientific theory provides the simplest explanation of some phenomenon. In the context of investigating a causal claim, parsimony means the simplest explanation of a pattern of data—the theory that requires making the fewest exceptions or qualifications.

The Power of Pattern and Parsimony

A great example of pattern and parsimony is the case of smoking and lung cancer. This example was first articulated by the psychological scientist Robert Abelson. Decades ago, it started becoming clear that smokers had higher rates of lung cancer than nonsmokers (the correlation has been estimated at about r = .40). Did the smoking cause the cancer? Cigarette manufacturers certainly did not want people to think so. If someone argued that this correlation was causal, a critic might counter that the cigarettes were not the cause; perhaps people who smoked were more stressed, which predisposed them to lung cancer. Or perhaps smokers also drank a lot of coffee, and it was the coffee, not the cigarettes, that caused cancer. The list of third-variable explanations could go on and on. Even though multiple-regression analyses could control for these alternative explanations, critics could always argue that regression cannot control for every possible third variable.

Another problem, of course, is that even though an experiment could rule out third-variable explanations, a smoking experiment would not be ethical or practical. A researcher could not reasonably assign a sample of volunteers to become lifetime smokers or nonsmokers. The only data researchers had to work with were correlational.

The solution to this problem, Abelson explains, is to specify a mechanism for the causal path. Specifically, in the case of cigarettes, researchers proposed that cigarette smoke contains chemicals that are toxic when they come into contact with human tissue. The more contact a person has with these chemicals, the greater the toxicity exposure. This simple theory leads to a set of predictions, all of which could be explained by the single, parsimonious theory that chemicals in cigarettes cause cancer (Abelson, 1995, p. 184):

1. The longer a person has smoked cigarettes, the greater his or her chances of getting cancer.

2. People who stop smoking have lower cancer rates than people who keep smoking.

3. Smokers’ cancers tend to be in the lungs and of a particular type.

4. Smokers who use filtered cigarettes have a somewhat lower rate of cancer than those who use unfiltered cigarettes.

5. People who live with smokers would have higher rates of cancer, too, because of their passive exposure to the same chemicals.

This process exemplifies the theory-data cycle (see Chapter 1). A theory—cigarette toxicity—led to a particular set of research questions. The theory also led researchers to frame hypotheses about what the data should show.

Indeed, converging evidence from several individual studies conducted by medical researchers has supported each of these separate predictions (their evidence became part of the U.S. Surgeon General’s warning in 1964), and that’s where parsimony comes in. Because all five of these diverse predictions are tied back to one central principle (the toxicity of the chemicals in cigarette smoke), there is a strong case for parsimony (Figure 9.10).

FIGURE 9.10 Pattern and parsimony. Many studies, using a variety of methods, provide converging evidence to support the causal claim that cigarettes contain toxic chemicals that are harmful to humans. Although each of the individual studies has methodological weaknesses, taken together, they all support the same, parsimonious conclusion.


Notice, also, that the diversity of these five empirical findings makes it much harder to raise third-variable explanations. Suppose a critic argued that coffee drinking was a third variable. Coffee drinking could certainly explain the first result (the longer one smokes—and presumably drinks coffee, too—the higher the rates of cancer). But it cannot explain the effect of filtered cigarettes or the cancer rates among secondhand smokers. The most parsimonious explanation of this entire pattern of data—and the weight of the evidence—is the toxicity of cigarettes.

It is hard to overstate the strength of the pattern and parsimony technique. In psychology, researchers commonly use a variety of methods and many studies to explore the strength and limits of a particular research question. Another example comes from research on TV violence and aggression. Many studies have investigated the relationship between watching violence on TV and violent behavior. Some studies are correlational; some are experimental. Some are on children; others on adults. Some are longitudinal; others are not. But in general, the evidence all points to a single, parsimonious conclusion that watching violence on TV causes people to behave aggressively (Anderson et al., 2003).

Many psychological scientists build their careers by doing study after study devoted to one research question. As discussed in Chapter 1, scientists dig deeper: They use a variety of methods, combining results to develop their causal theories and to support them with converging evidence.

Pattern, Parsimony, and the Popular Media

When journalists write about science, they do not always fairly represent pattern and parsimony in research. Instead, they may report only the results of the latest study. For example, they might present a news story on the most recent nutrition research, without describing the other studies done in that area. They might report that people who multitask the most are the worst at it, but fail to cover the full pattern of studies on media multitasking. They might report on a single study that showed an association between eating chocolate and body mass, without mentioning the rest of the studies on that same topic, and without tying the results to the theory they are supporting.

When journalists report only one study at a time, they selectively present only a part of the scientific process. They might not describe the context of the research, such as what previous studies have revealed, or what theory the study was testing. Reporting on the latest study without giving the full context can make it seem as though scientists conduct unconnected studies on a whim. It might even give the impression that one study can reverse decades of previous research. In addition, skeptics who read such science stories might find it easy to criticize the results of a single, correlational study. But in fact, science accumulates incrementally. Ideally, journalists should report on the entire body of evidence, as well as the theoretical background, for a particular claim.

❯❯ To review the concept of weight of the evidence, see Chapter 1, p. 15.


MEDIATION

We have discussed the research designs and statistical tools researchers use to get closer to making causal claims. Once a relationship between two variables has been established, we often want to explore it further by thinking about why. For example, we might ask why watching sexual content on TV predicts a higher pregnancy risk, or why people who engage in meaningful conversations are happier. Many times, these explanations suggest a mediator, or mediating variable. Researchers may propose a mediating step between two of the variables. A study does not have to be correlational to include a mediator; experimental studies can also test them. However, mediation analyses often rely on multivariate tools such as multiple regression, so it makes sense to learn about mediators here.

Consider this example. We know conscientious people are more physically healthy than less conscientious people. But why? The mediator of this relationship might be the fact that conscientious people are more likely to follow medical advice and instructions, and that’s why they’re healthier. Following doctor’s orders would be the mediator of the relationship between the trait, conscientiousness, and the outcome, better health (Hill & Roberts, 2011).

Similarly, we know there’s an association between having deep conversations and feelings of well-being (see Chapter 8). Researchers might next propose a reason—a mediator of this relationship. One likely mediator could be social ties: Deeper conversations might help build social connections, which in turn can lead to increased well-being. The researchers could draw this mediation hypothesis, as shown in Figure 9.11.


1. Why do many researchers find pattern and parsimony an effective way to support a causal claim?

2. What is a responsible way for journalists to cover single studies on a specific topic?

1. See pp. 256–258. 2. See p. 258.

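[Figure: path a runs from Amount of deep talk to Quality of social ties; path b runs from Quality of social ties to Well-being; path c runs directly from Amount of deep talk to Well-being.]

FIGURE 9.11 A proposed mediation model. We could propose that deep talk leads to stronger social ties, which leads to increased well-being. To test this model, a researcher follows five steps (see text).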


They would propose an overall relationship, c, between deep talk and well-being. However, this overall relationship exists only because there are two other relationships: a (between deep talk and social ties) and b (between social ties and well-being). In other words, social ties mediate the relationship between deep talk and well-being. (Of course, there are other possible mediators, such as intellectual growth or taking a break from technology. Those mediators could be tested, too, in another study.)

The researchers could examine this mediation hypothesis by following five steps (Kenny, 2008):

1. Test for relationship c. Is deep talk associated with well-being? (If it is not, there is no relationship to mediate.)

2. Test for relationship a. Is deep talk associated with the proposed mediator, strength of social ties? Do people who have deeper conversations actually have stronger ties than people who have more shallow conversations? (If social tie strength is the aspect of deep talk that explains why deep talk leads to well-being, then, logically, people who have more meaningful conversations must also have stronger social ties.)

3. Test for relationship b. Do people who have stronger social ties have higher levels of well-being? (Again, if social tie strength explains well-being, then, logically, people with stronger social connections must also have higher well-being.)

4. Run a regression test, using both strength of social ties and deep talk as predictor variables to predict well-being, to see whether relationship c goes away. (If social tie strength is the mediator of relationship c, the relationship between deep talk and well-being should drop when social tie strength is controlled for. Here we would be using regression as a tool to show that deep talk was associated with well-being in the first place because social tie strength was responsible.)

Because mediation hypotheses are causal claims, a fifth important step establishes temporal precedence:

5. Mediation is definitively established only when the proposed causal variable is measured or manipulated first in a study, followed some time later by the mediating variable, followed by the proposed outcome variable.

In other words, to establish mediation in this example, the researchers must conduct a study in which the amount of deep talk is measured (or manipulated) first, followed shortly afterwards by a measure of social tie strength. They have to measure well-being last of all, to rule out the possibility that the well-being led to having deeper conversations.
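Steps 1 through 4 can be sketched in Python on simulated data. This is an illustration under stated assumptions, not a full mediation analysis: the variables `deep_talk`, `ties`, and `well_being` are generated so that social ties fully carry the effect, the 0.6 effect sizes are arbitrary, and step 5 (temporal precedence) is a matter of study design that no calculation on cross-sectional data can supply.

```python
import random
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def beta_controlling_for(y, x, control):
    """Standardized beta for x predicting y, with one control variable."""
    r_yx, r_yc, r_xc = pearson_r(y, x), pearson_r(y, control), pearson_r(x, control)
    return (r_yx - r_yc * r_xc) / (1 - r_xc ** 2)

# Hypothetical causal chain: deep talk -> social ties -> well-being.
rng = random.Random(3)
n = 5000
deep_talk = [rng.gauss(0, 1) for _ in range(n)]
ties = [0.6 * d + rng.gauss(0, 1) for d in deep_talk]
well_being = [0.6 * t + rng.gauss(0, 1) for t in ties]

c = pearson_r(deep_talk, well_being)                        # step 1
a = pearson_r(deep_talk, ties)                              # step 2
b = pearson_r(ties, well_being)                             # step 3
c_prime = beta_controlling_for(well_being, deep_talk, ties) # step 4

print(f"step 1, relationship c               = {c:.2f}")
print(f"step 2, relationship a               = {a:.2f}")
print(f"step 3, relationship b               = {b:.2f}")
print(f"step 4, c with social ties controlled = {c_prime:.2f}")
```

In this simulation relationships c, a, and b are all positive, and the relationship between deep talk and well-being drops to about zero once social tie strength is controlled for, which is the pattern step 4 looks for.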


If researchers want to examine whether following doctor’s orders is the mediator of the relationship between conscientiousness and good health, the design of the study should ideally measure conscientiousness first, and then later measure medical compliance, and then later measure health. If the design establishes temporal precedence and the results support the steps above, there is evidence for mediation.

Mediators vs. Third Variables

Mediators appear similar to third-variable explanations. Both of them involve multivariate research designs, and researchers use the same statistical tool (multiple regression) to detect them. However, they function differently.

In a third-variable explanation, the proposed third variable is external to the two variables in the original bivariate correlation; it might even be seen as an accident—a problematic “lurking variable” that potentially distracts from the relationship of interest. For example, if we propose that education level is a third variable responsible for the deep talk/well-being relationship, we’re saying deep talk and well-being are correlated with each other only because each one is correlated separately with education, as shown in Figure 9.12. In other words, the relationship between deep talk and well-being is there only because both of those variables happen to vary with the outside third variable, education level. The third variable may seem like a nuisance; it might not be of central interest to the researchers. (If they are really interested in deep talk and well-being, they have to control for education level first.)

In contrast, when researchers propose a mediator, they are interested in isolating which aspect of the presumed causal variable is responsible for that relationship. A mediator variable is internal to the causal variable and often of direct interest to the researchers, rather than a nuisance. In the deep talk example, the researchers believe stronger social ties is the important aspect, or outcome, of deep talk that is responsible for increasing well-being.

[Figure: Education level is linked to both Amount of deep talk and Well-being.]

FIGURE 9.12 A third variable. In a third-variable scenario, the third variable is seen as external to the original two variables. Here, deep talk and higher well-being might both be associated with education level.

Mediators vs. Moderators Recall that moderators were introduced in Chapter 8. Similar-sounding names can make them confusing at first. However, testing for mediation versus moderation involves asking different questions (Baron & Kenny, 1986). When researchers test for mediating variables, they ask: Why are these two variables linked? When they test for moderating variables, they ask: Are these two variables linked the same way for everyone, or in every situation? Mediators ask: Why? Moderators ask: Who is most vulnerable? For whom is the association strongest?

A mediation hypothesis could propose, for instance, that medical compli- ance is the reason conscientiousness is related to better health. In contrast, a moderation hypothesis could propose that the link between conscientiousness and good health is strongest among older people (perhaps because their health problems are more severe, and most likely to benefit from medical compliance) and weakest among younger people (whose health problems are less serious anyway).

As the name implies, the mediating variable comes in the middle of the other two variables. The word moderate can mean “to change,” and a moderating variable can change the relationship between the other two variables (making it more intense or less intense). Figure 9.13 diagrams the differences between mediation, moderation, and third variables.
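A moderation question ("for whom is the association strongest?") can be sketched by computing the same correlation separately within each subgroup. The example below uses the conscientiousness and health hypothesis from the text, but the scores are invented for illustration and `pearson_r` is a hypothetical helper, not part of any study's analysis.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical (conscientiousness, health) scores, split by the proposed moderator: age group.
groups = {
    "older":   {"conscientiousness": [1, 2, 3, 4, 5], "health": [2, 4, 5, 7, 9]},
    "younger": {"conscientiousness": [1, 2, 3, 4, 5], "health": [7, 8, 6, 8, 8]},
}

# One correlation per level of the moderator.
r_by_group = {
    name: pearson_r(scores["conscientiousness"], scores["health"])
    for name, scores in groups.items()
}

for name, r in r_by_group.items():
    print(f"{name}: r = {r:.2f}")
```

If the correlation is clearly stronger in one group than in the other, as it is in these made-up numbers, age would be described as moderating the relationship. A formal test of the difference between the two correlations is a separate step not shown here.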


1. Explain why each of the five steps examining a mediation hypothesis is important to establishing evidence for a mediator.

2. Think of a possible mediator for the relationship between exposure to sex on TV and chance of pregnancy. Sketch a diagram of the mediator you propose, following Figure 9.11.

1. See pp. 259–261. 2. Diagram should resemble Figure 9.11, with exposure to sex on TV in the left box, pregnancy risk in the right box, and your proposed mediator in the middle box.


FIGURE 9.13 Mediation, moderation, and third variables. How are they different?

Mediation
Definition: Why are two variables related?
Diagram: A is related to B because A leads to C, which leads to B. (Viewing violent TV leads to becoming desensitized to violence, which leads to aggressive behavior.)
Sentence: Level of desensitization mediates the relationship between TV violence and aggressive behavior.

Moderation
Definition: Are there certain groups or situations for which the two variables are more strongly related?
Diagram: A is related to B for one type of C but not for the other type of C. (Viewing violent TV and aggressive behavior: r = .10 when parents discuss TV content with kids; r = .35* when there is no parental discussion.)
Sentence: Parental discussion moderates the relationship between TV violence and aggressive behavior. Children are more vulnerable when parents do not discuss TV violence with them.

Third-variable Problem
Definition: Two variables are correlated, but only because they are both linked to a third variable.
Diagram: A is related to B, but only because C is related to A and C is related to B. (Having lenient parents is related to both viewing violent TV and aggressive behavior.)
Sentence: The relationship between viewing violent TV and aggressive behavior may be attributable to the third variable of parental leniency.

264 CHAPTER 9 Multivariate Correlational Research

MULTIVARIATE DESIGNS AND THE FOUR VALIDITIES Researchers use multivariate correlational research, such as longitudinal designs and multiple-regression analyses, to get closer to making causal claims. Longitudinal designs help establish temporal precedence, and multiple-regression analysis helps rule out third variables, thus providing some evidence for internal validity. We must remember, however, to interrogate the other three major validities—construct, external, and statistical validity—as well.

For any multivariate design, as for any bivariate design, it is appropriate to interrogate the construct validity of the variables in the study by asking how well each variable was measured. In the Brummelman study (2015) on overpraise and narcissism, is asking parents what they say to their kids a reliable and valid way to measure their actual level of overpraise? Similarly, is self-report a reliable and valid way to measure a child’s levels of narcissism? In the Chandra study (2008), what about the measures of exposure to sex on TV and pregnancy? Did the coding of TV show content have interrater reliability? Did coders identify sexual content in a valid way?

We can also interrogate the external validity of a multivariate design. In the Brummelman study on narcissism, the researchers invited all children from 17 schools in the Netherlands to participate, and 565 (75%) of them agreed. Volunteers are not a random sample, so we are uncertain about whether we can generalize from this sample to the population of children in the 17 schools. We might also ask whether the association generalizes to other kinds of praise, such as praise from teachers or other adults.

To interrogate the external validity of the sexual TV content and pregnancy study, we can ask whether the teenagers were sampled randomly, and from what kind of population. In fact, the Chandra study came from a sample of U.S. teens from all states, and the sample’s demographic characteristics were similar to those for the entire U.S. However, the researchers do not report whether or not their sample was selected randomly (Chandra et al., 2008).

For interrogating a multivariate correlational research study’s statistical validity, we can ask about the effect size and statistical significance (see Chapter 8). In the case of the sexy TV/pregnancy study, we know the beta was 0.44 and was statistically significant. However, there are no guidelines for what constitutes a “large” or “small” beta. The authors of the study also presented the pregnancy risk of low, compared to medium and high, sexual content viewers (Figure 9.14). These data show that among 20-year-olds, those who had watched the most sexual TV had a pregnancy risk two times higher than those who had watched the least. Because the risk of pregnancy doubled, it was interpreted as a strong effect size by the authors.

FIGURE 9.14 Statistical validity in the sexual TV content and pregnancy study. The researchers calculated the pregnancy risk among teens who reported watching the lowest levels of sexual content (low viewing), as well as medium and high viewing levels. Different age groups were calculated separately. The graph depicts a large effect size for sexual content because the pregnancy risk for the highest viewers is double that for the lowest viewers. (Source: Adapted from Chandra et al., 2008.)

[Bar graph: chance of pregnancy at follow-up (y-axis) by age at follow-up (x-axis: age 16, age 18, age 20), with separate bars for low, medium, and high viewing. Groups of bars get taller from left to right. Chance of pregnancy is lower in 16-year-olds. Age 20 pregnancy risk for those who watched the least TV sex: 12%. Compare to age 20 pregnancy risk for those who watched the most TV sex: 25%.]
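The "risk doubled" interpretation can be checked with the two age-20 percentages reported with Figure 9.14. The short sketch below computes a risk ratio and a risk difference from those two numbers; the variable names are illustrative, and nothing here reproduces the original authors' analysis.

```python
# Age-20 pregnancy risks reported with Figure 9.14.
low_viewing_risk = 0.12    # teens who watched the least sexual content
high_viewing_risk = 0.25   # teens who watched the most sexual content

# Risk ratio: how many times higher the risk is for the highest viewers.
risk_ratio = high_viewing_risk / low_viewing_risk

# Risk difference: how many percentage points separate the two groups.
risk_difference = high_viewing_risk - low_viewing_risk

print(f"Risk ratio: {risk_ratio:.2f}")
print(f"Risk difference: {risk_difference:.0%}")
```

The ratio comes out at roughly 2.08, which is where the description "two times higher" comes from, while the absolute gap between the groups is 13 percentage points. Reporting both helps readers judge how large the effect is in practical terms.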


Other statistical validity questions apply to multivariate designs, too. When researchers use multivariate designs, they need to take precautions to look for subgroups, outliers, and curvilinear associations, all of which can be more complicated to detect when there are more than two variables.

1. Give an example of a question you would ask to interrogate each of the four validities for a multivariate study.

1. See pp. 264–265.


Summary

Research often begins with a simple bivariate correlation, which cannot establish causation. Researchers can use multivariate techniques to help them get closer to making a causal claim.

Reviewing the Three Causal Criteria

• In a multivariate design, researchers measure more than two variables and look for the relationships among them.

• A simple, bivariate correlation indicates that there is covariance, but cannot always indicate temporal precedence or internal validity, so it cannot establish causation.

Establishing Temporal Precedence with Longitudinal Designs

• Longitudinal designs start with two key variables, on which the same group of people are measured at multiple points in time. Researchers can tell which variable came first in time, thus helping establish temporal precedence.

• Longitudinal designs produce cross-sectional correlations (correlations between the two key variables at any one time period) and autocorrelations (correlations between one variable and itself, over time).

• Longitudinal designs also produce cross-lag correlations. By comparing the relative strengths of the two cross-lag correlations, researchers can infer which of the variables probably came first in time (or if they are mutually reinforcing each other).

Ruling Out Third Variables with Multiple-Regression Analyses

• In a regression design, researchers start with a bivariate correlation and then measure other potential third variables that might affect it.

• Using multiple-regression analysis, researchers can see whether the basic relationship is still present, even when they statistically control for one or more third variables. If the beta is still significant for the key variable when the researchers control for the third variables, it means the key relationship is not explained by those third variables.

• If the beta becomes nonsignificant when the researchers control for a third variable, then the key relationship can be attributed to that third variable.

• Even though regression analyses can rule out third variables, they cannot definitively establish causation because they can only control for possible third variables that the researchers happened to measure. An experiment is the only design that definitively establishes causation.

Getting at Causality with Pattern and Parsimony

• Researchers can approach causal certainty through pattern and parsimony; they specify a mechanism for the causal relationship and combine the results from a variety of research questions. When a single causal theory explains all of the disparate results, researchers are closer to supporting a causal claim.

Mediation

• In a mediation hypothesis, researchers specify a variable that comes between the two variables of interest as a possible reason the two variables are associated. After collecting data on all three variables (the original two, plus the mediator), they follow specific steps to evaluate how well the data support the mediation hypothesis.

Multivariate Designs and the Four Validities

• Interrogating multivariate correlational designs involves investigating not only internal validity and temporal precedence, but also construct validity, external validity, and statistical validity. While no single study is perfect, exploring each validity in turn is a good way to systematically assess a study’s strengths and weaknesses.

Key Terms

multivariate design, p. 238 longitudinal design, p. 239 cross-sectional correlation, p. 239 autocorrelation, p. 240

cross-lag correlation, p. 241 multiple regression, p. 244 control for, p. 245 criterion variable, p. 247

predictor variable, p. 248 parsimony, p. 256 mediator, p. 259

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 9.

Review Questions

1. A headline in Yahoo! News made the following (bivariate) association claim: “Facebook users get worse grades in college” (Hsu, 2009). The two variables in this headline are:

a. Level of Facebook use and college grades.

b. High grades and low grades.

c. High Facebook use and low Facebook use.

2. Suppose a researcher uses a longitudinal design to study the relationship between Facebook use and grades over time. She measures both of these variables in Year 1, and then measures both variables again in Year 2. Which of the following is an example of an autocorrelation in the results?

a. The correlation between Facebook use in Year 1 and Facebook use in Year 2.

b. The correlation between Facebook use in Year 1 and grades in Year 2.

c. The correlation between grades in Year 1 and Facebook use in Year 2.

d. The correlation between grades in Year 1 and Facebook use in Year 1.

3. In the longitudinal study described in question 2, which pattern of cross-lag correlations would indicate that Facebook use leads to lower grades (rather than the reverse)?

a. Grades at Year 1 shows a strong correlation with Facebook use at Year 2, but Facebook use at Year 1 shows a weak correlation with grades at Year 2.

b. Grades at Year 1 shows a weak correlation with Facebook use at Year 2, but Facebook use at Year 1 shows a strong correlation with grades at Year 2.

c. Grades at Year 1 shows a strong correlation with Facebook use at Year 2, and Facebook use at Year 1 shows a strong correlation with grades at Year 2.

4. Consider this statement: “People who use Facebook got worse grades in college, even when the researchers controlled for the level of college preparation (operationalized by SAT scores) of the students.” What does it mean?

a. Facebook use and grades are correlated only because both of these are associated with SAT score.

b. SAT score is a third variable that seems to explain the association between Facebook use and grades.

c. SAT score can be ruled out as a third variable explanation for the correlation between Facebook use and college grades.

5. Which of the following statements is an example of a mediator of the relationship between Facebook use and college grades?

a. Facebook use and college grades are more strongly correlated among nonathletes, and less strongly correlated among athletes.

b. Facebook use and college grades are only correlated with each other because they are both related to the difficulty of the major. Students in more difficult majors get worse grades, and those in difficult majors have less time to use Facebook.

c. Facebook use and college grades are correlated because Facebook use leads to less time studying, which leads to lower grades.

6. A news outlet reported on a study of people with dementia. The study found that among patients with dementia, bilingual people had been diagnosed 3-4 years later than those who were monolingual. What are the variables in this bivariate association?

a. Being bilingual or monolingual

b. Being bilingual or not, and age at dementia diagnosis

c. Age at dementia diagnosis

7. The journalist reported that the relationship between bilingualism and age at diagnosis did not change, even when the researchers controlled for level of education. What does this suggest?

a. That the relationship between bilingualism and dementia onset is probably attributable to the third variable: level of education.

b. That the relationship between bilingualism and dementia onset is not attributable to the third variable: level of education.

c. That being bilingual can prevent dementia.

8. Researchers speculated that the reason bilingualism is associated with later onset of dementia is that bilingual people develop richer connections in the brain through their experiences in managing two languages; these connections help stave off dementia symptoms. This statement describes:

a. A mediator

b. A moderator

c. A third variable

Learning Actively

1. The accompanying figure shows the result of a cross-lag panel study on a sample of Dutch children aged 7–11 (Brummelman et al., 2015). The study collected several variables at four time points, each about 6 months apart. At each wave, they measured the child’s self-esteem using a self-report measure (a sample item was “Kids like me are happy with themselves as a person”). It also measured the child’s perception of each parent’s warmth (a sample question was “My father/mother lets me know he/she loves me”). The results in the figure are only for the mother’s warmth (as rated by the child). All results in the figure are statistically significant.

[Figure: cross-lag panel diagram with boxes labeled Overvaluation Time 1 through Time 4 and Narcissism Time 1 through Time 4. Coefficients shown: .489, .430, .446; .466, .486, .471; .063, .062, .060; .052, .053.]

a. Point to the autocorrelations in the figure.

b. Are there cross-sectional correlations in the figure?

c. Overall, what do the cross-lag correlations suggest? Does parental warmth lead to higher self-esteem, or does higher self-esteem lead to parental warmth, or is there a mutually reinforcing relationship?

2. Indicate whether each statement below is describing a mediation hypothesis, a third-variable argument, or a moderator result. First, identify the key bivariate relationship. Next, decide whether the extra variable comes between the two key variables or is causing the two key variables simultaneously. Then, draw a sketch of each explanation, following the examples in Figure 9.13.

a. Having a mentally demanding job is associated with cognitive benefits in later years, because people who are highly educated take mentally demanding jobs, and people who are highly educated have better cognitive skills.

b. Having a mentally demanding job is associated with cognitive benefits in later years, but only in men, not women.

c. Having a mentally demanding job is associated with cognitive benefits in later years because cognitive challenges build lasting connections in the brain.

d. Being a victim of sibling aggression is associated with poor mental health in childhood, but the link is especially strong for later-born children and weaker in firstborn children.

e. Sibling aggression is associated with poor childhood mental health because child victims of sibling aggression are more likely to feel lonely at home. Sibling aggression leads to loneliness, which leads to mental health problems.

f. Sibling aggression is associated with poor childhood mental health only because of parental conflict. Sibling aggression is more likely among parents who argue frequently, and arguing also affects kids’ mental health.

3. Do victims of sibling aggression suffer worse mental health? A recent study investigated this question (Tucker, Finkelhor, Turner, & Shattuck, 2013). The researchers wondered whether sibling aggression was linked to poor mental health in children, and whether sibling victimization was as bad for kids as peer victimization. In a large sample of children and youths, ages 2–17, they measured several kinds of sibling aggression (e.g., physical assault, taking something away from the child, breaking the child’s toys on purpose, calling names). They also measured mental health using a trauma symptom checklist, on which high scores indicate the child has more symptoms of anxiety, depression, and other signs of mental disturbances. The researchers also measured parents’ education, child’s age, and so on. The regression table in Table 9.6 comes from their article.

a. What is the criterion (dependent) variable in this study, and where do you find it?

b. How many predictor variables are there in this study?

c. Write a sentence that describes what the beta for the “Total types of sibling victimization” predictor means. (Use the sentences in Table 9.2 as a model.)

d. Write a sentence that describes what the beta for the “Total types of peer victimization” predictor variable means.

e. Write a sentence that describes what the beta for the “Child maltreatment” predictor variable means.

f. Write a sentence that describes what the beta for the “Internet victimization” predictor means.

g. Using the magnitude of the betas to decide, which of the predictors is most strongly associated with poor childhood mental health? What about the researchers’ initial question: Is sibling victimization just as bad for kids as peer victimization?


TABLE 9.6 Multiple Regression Predicting Children’s and Adolescents’ Mental Health

Predictor                                      Beta
Parent education: some college                 −0.02
  College degree or more                       −0.04
  Black                                        −0.05 b
  Hispanic, any race                           −0.01
  Other or mixed                               −0.00
Language of interview in Spanish               −0.01
Child age 10 plus                              −0.13 a
Child gender male                               0.00
Child maltreatment                              0.15 a
Sexual victimization                            0.06 b
School victimization                            0.05 c
Internet victimization                          0.02
Witness family violence                         0.17 a
Witness community violence                      0.07 a
Total types of sibling victimization            0.15 a
Total types of peer victimization               0.25 a
Total sibling × peer types of victimization
R²                                              0.27

a p < .001. b p < .01. c p < .05. Source: Tucker et al., 2013.


Tools for Evaluating Causal Claims

Serving Food on a Larger Plate “Makes People Eat More” Independent, 2015

A Learning Secret: Don’t Take Notes with a Laptop Scientific American, 2014


Introduction to Simple Experiments A CAUSAL CLAIM IS the boldest kind of claim a scientist can make. A causal claim replaces verb phrases such as related to, is associated with, or linked to with powerful verbs such as makes, influences, or affects. Causal claims are special: When researchers make a causal claim, they are also stating something about interventions and treatments. The advice to not take notes with a laptop is based on a causal inference: Taking notes on a laptop causes something negative. Similarly, if serving food in a larger bowl makes people eat more, then dieters can be advised to serve foods in smaller bowls or use smaller individual plates. Interventions are often the ultimate goal of scientific studies, and they must be based on sound experimental research. Experiments are the only way to investigate such causal issues.

TWO EXAMPLES OF SIMPLE EXPERIMENTS Let’s begin with two examples of experiments that supported valid causal claims. As you read the two studies, consider how each one differs from the bivariate correlational studies in Chapter 8. What makes each of these studies an experiment? How does the experimental design allow the researchers to support a causal claim rather than an association claim?


A year from now, you should still be able to:

1. Apply the three criteria for establishing causation to experiments, and explain why experiments can support causal claims.

2. Identify an experiment’s independent, dependent, and control variables.

3. Classify experiments as independent-groups and within-groups designs, and explain why researchers might conduct each type of study.

4. Evaluate three potential threats to internal validity in an experiment—design confounds, selection effects, and order effects—and explain how experimenters usually avoid them.

5. Interrogate an experimental design using the four validities.

274 CHAPTER 10 Introduction to Simple Experiments

Example 1: Taking Notes Do you bring a pen to class for taking notes on what your professor is saying? Or do you open your laptop and type? If you’re like most students, you use the notetaking habit you think works for you. But should you trust your own experience? Maybe one way of taking notes is actually better than the other (Figure 10.1).

Researchers Pam Mueller and Daniel Oppenheimer (2014) decided to conduct an experiment that compared the two practices. When they considered the processes involved, both approaches seemed to have advantages. When typing on a laptop, they reasoned, students can easily transcribe the exact words and phrases a professor is saying, resulting in seemingly more complete notes. However, students might not have to think about the material when they’re typing. When taking handwritten notes, in contrast, students can summarize, paraphrase, or make drawings to connect ideas—even if fewer words are used than on a computer. Longhand notes could result in deeper processing of the material and more effective comprehension. Which way would be better?

Sixty-seven college students were recruited to come to a laboratory classroom, usually in pairs. The classroom was prepared in advance: Half the time it contained laptops; the other half, notebooks and pens. Having selected five different TED talks on interesting topics, the researchers showed one of the lectures on a video screen. They told the students to take notes on the lectures using their assigned method (Mueller & Oppenheimer, 2014). After the lecture, students spent 30 minutes doing another activity meant to distract them from thinking about the lecture. Then they were tested on what they had learned from the TED talk.

FIGURE 10.1 Take note. Which form of notetaking would lead to better learning?


The essay questions asked about straightforward factual information (e.g., “Approximately how many years ago did the Indus civilization exist?”) as well as conceptual information (e.g., “How do Japan and Sweden differ in their approaches to equality in their societies?”). Their answers were scored by a research assistant who did not know which form of taking notes each participant had used.

The results Mueller and Oppenheimer obtained are shown in Figure 10.2. Students in both the laptop and the longhand groups scored about equally on the factual questions, but the longhand group scored higher on the conceptual questions.

Mueller and Oppenheimer didn’t stop at just one study. They wanted to demonstrate that the original result could happen again. Their journal article reports two other studies, each of which compared longhand to laptop notetaking, and each of which showed the same effect: The longhand group performed better on conceptual test questions. (The two other studies, unlike the first, showed that longhand notetakers did better on factual questions, too.) The authors made a causal claim: Taking notes in longhand causes students to do better. Do you think their study supports the causal claim?

Example 2: Eating Pasta An article with this headline—“Serving food on a larger plate makes people eat more”—summarized studies on plate size and eating patterns. One such study was conducted at Cornell University’s Food and Brand Lab, by researchers Ellen van Kleef, Mitsuru Shimizu, and Brian Wansink (2012). They invited 68 college students to come to a kitchen laboratory during the lunch hour, where they participated in smaller groups.

Behind the scenes, the researchers had assigned the students to one of two experimental sessions by flipping a coin. Half were assigned to a “large bowl” session and half were assigned to a “medium bowl” session. They were invited to serve themselves pasta from a bowl at the buffet. The bowl was continually refilled when about half the pasta was gone, so nobody felt the food was getting scarce. After they filled their plates, a research assistant weighed the plates to measure the amount they took. Participants were allowed to eat their pasta lunches at a comfortable pace. When they were finished, the assistants weighed each plate again, to determine how much food each person had actually eaten.




FIGURE 10.2 The effect of laptop and longhand notetaking on test performance. In this study, performance on factual questions was the same in the laptop and longhand groups, but performance on conceptual questions was better for those who took handwritten notes. (Source: Adapted from Mueller & Oppenheimer, 2014.)

[Bar graph: score on essay questions (standardized) on the y-axis; factual and conceptual questions on the x-axis, with separate bars for longhand and laptop.]

The results are shown in Figure 10.3. On average, the participants took more pasta from the large serving bowl than the medium one (Figure 10.3A). When the researchers converted the amount of consumed pasta into calories, it was clear that the large-bowl participants had eaten about 140 calories more than the medium-bowl ones (Figure 10.3B). The researchers used causal language in their article’s conclusion: “The size of the serving bowl had a substantial influence” on the amount of food people ate (van Kleef et al., 2012, p. 70).

EXPERIMENTAL VARIABLES The word experiment is common in everyday use. Colloquially, “to experiment” means to try something out. A cook might say he experimented with a recipe by replacing the eggs with applesauce. A friend might say she experimented with a different driving route to the beach. In psychological science, the term experiment specifically means that the researchers manipulated at least one variable and measured another (as you learned in Chapter 3). Experiments can take place in a laboratory and just about anywhere else: movie theaters, conference halls, zoos, daycare centers, and even online environments—anywhere a researcher can manipulate one variable and measure another.

A manipulated variable is a variable that is controlled, such as when the researchers assign participants to a particular level (value) of the variable. For example, Mueller and Oppenheimer (2014) manipulated notetaking by flipping a coin to determine whether a person would take notes with a laptop or in longhand. (In other words, the participants did not get to choose which form they would use.) Notetaking method was a variable because it had more than one level (laptop and longhand), and it was a manipulated variable because the experimenter assigned each participant to a particular level. The van Kleef team (2012) similarly manipulated the size of the pasta serving bowl by flipping a coin ahead of time to decide which session participants were in. (Participants did not choose the bowl size from which they would serve themselves.)

FIGURE 10.3 The effect of serving bowl size on amount eaten. Participants who served themselves from a large bowl took more and ate more, compared to those who served themselves from a medium bowl. (Source: Adapted from van Kleef et al., 2012.)

[Panel A: pasta served (g) by serving bowl size (medium bowl, large bowl). Panel B: estimated calories consumed (kcal) by serving bowl size (medium bowl, large bowl).]
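The coin-flip assignment described above can be sketched in a few lines. This is only an illustration of the idea that the experimenter, not the participant, decides the level of the manipulated variable; the participant IDs, the `assign_conditions` helper, and the fixed seed are assumptions made for the example, not details of either study.

```python
import random

def assign_conditions(participant_ids, levels, seed=None):
    """Assign each participant to one level of the manipulated variable by a virtual coin flip."""
    rng = random.Random(seed)  # a fixed seed makes the illustration repeatable
    return {pid: rng.choice(levels) for pid in participant_ids}

# Hypothetical participant IDs (invented for illustration only).
participants = [f"P{i:02d}" for i in range(1, 11)]
levels = ["laptop", "longhand"]

assignment = assign_conditions(participants, levels, seed=1)
for pid, level in assignment.items():
    print(pid, level)
```

Because each flip is independent, the two groups will not necessarily end up the same size. Researchers who need equal groups typically shuffle a list containing equal numbers of each level instead.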

Measured variables take the form of records of behavior or attitudes, such as self-reports, behavioral observations, or physiological measures (see Chapter 5). After an experimental situation is set up, the researchers simply record what happens. In their first study, Mueller and Oppenheimer measured student performance on the essay questions. After manipulating the notetaking method, they watched and recorded—that is, they measured—how well people answered the factual and conceptual questions. The van Kleef team manipulated the serving bowl size, and then measured two variables: how much pasta people took and how much they ate.

Independent and Dependent Variables In an experiment, the manipulated (causal) variable is the independent variable. The name comes from the fact that the researcher has some “independence” in assigning people to different levels of this variable. A study’s independent variable should not be confused with its levels, which are also referred to as conditions. The independent variable in the van Kleef study was serving bowl size, which had two conditions: medium and large.

The measured variable is the dependent variable, or outcome variable. How a participant acts on the measured variable depends on the level of the independent variable. Researchers have less control over the dependent variable; they manipulate the independent variable and then watch what happens to people’s self-reports, behaviors, or physiological responses. A dependent variable is not the same as its levels, either. The dependent variable in the van Kleef study was the amount of pasta eaten (not “200 calories”).

Experiments must have at least one independent variable and one dependent variable, but they often have more than one dependent variable. For example, the notetaking study had two dependent variables: performance on factual questions and performance on conceptual questions. Similarly, the pasta bowl study’s dependent variables were the grams of pasta taken from the bowl and the calories of pasta consumed. When the dependent variables are measured on different scales (e.g., grams and calories), they are usually presented on separate graphs (see Figure 10.3). (Chapter 12 introduces experiments that have more than one independent variable.)

Here’s a way to tell the two kinds of variables apart. When researchers graph their results, the independent variable is almost always on the x-axis, and the dependent variable is almost always on the y-axis (see Figures 10.2 and 10.3 for examples). A mnemonic for remembering the two types of variables is that the independent variable comes first in time (and the letter I looks like the number 1), and the dependent variable is measured afterward (or second).


Control Variables When researchers are manipulating an independent variable, they need to make sure they are varying only one thing at a time—the potential causal force or proposed “active ingredient” (e.g., only the form of notetaking, or only the size of the serving bowl). Therefore, besides the independent variable, researchers also control potential third variables (or nuisance variables) in their studies by holding all other factors constant between the levels of the independent variable. For example, Mueller and Oppenheimer (2014) manipulated the method people used to take notes, but they held constant a number of other potential variables: People in both groups watched lectures in the same room and had the same experimenter. They watched the same videos and answered the same questions about them, and so on. Any variable that an experimenter holds constant on purpose is called a control variable.

In the van Kleef et al. study (2012), one control variable was the quality of the food: It was always the same kind of pasta. The researchers also controlled the size of the serving spoon and the size of the plates (each participant served pasta onto a 9-inch plate).

Control variables are not really variables at all because they do not vary; experimenters keep the levels the same for all participants. Clearly, control variables are essential in experiments. They allow researchers to separate one potential cause from another and thus eliminate alternative explanations for results. Control variables are therefore important for establishing internal validity.


1. What are the minimum requirements for a study to be an experiment?

2. Define independent variable, dependent variable, and control variable, using your own words.

1. A manipulated variable and a measured variable; see p. 276. 2. See pp. 276–278.

WHY EXPERIMENTS SUPPORT CAUSAL CLAIMS

In both of the examples above, the researchers manipulated one variable and measured another, so both studies can be considered experiments. But are these researchers really justified in making causal claims on the basis of these experiments? Yes. To understand how experiments support causal claims, you can first apply the three rules for causation to the pasta bowl study. The three rules should be familiar to you by now:

1. Covariance. Do the results show that the causal variable is related to the effect variable? Are distinct levels of the independent variable associated with different levels of the dependent variable?

2. Temporal precedence. Does the study design ensure that the causal variable comes before the outcome variable in time?

3. Internal validity. Does the study design rule out alternative explanations for the results?

Experiments Establish Covariance

The results of the experiment by van Kleef and her colleagues did show covariance between the causal (independent) variable (size of bowl) and the outcome (dependent) variable (amount of pasta eaten). On average, students who were in the large-bowl condition ate 425 calories’ worth of pasta, and students in the medium-bowl condition ate 283 calories (see Figure 10.3). In this case, covariance is indicated by a difference in the group means: The large-bowl calories were different from the medium-bowl calories. The notetaking study’s results also showed covariance, at least for conceptual questions: Longhand notetakers had higher scores on conceptual questions than laptop notetakers.
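As a small worked example, the covariance in the pasta bowl study can be expressed as the difference between the two reported group means. The sketch below uses only the two means given in the text; the function name `mean_difference` is an illustrative choice, not something from the original study.

```python
# Reported group means from the pasta bowl study (calories of pasta eaten).
large_bowl_mean = 425
medium_bowl_mean = 283

def mean_difference(condition_mean, comparison_mean):
    """Return how far one condition's mean is from the comparison mean."""
    return condition_mean - comparison_mean

# A nonzero difference between conditions is what covariance looks like here.
difference = mean_difference(large_bowl_mean, medium_bowl_mean)
print(difference)  # 142
```

A difference of zero would mean no covariance, and therefore no causal effect to explain. Whether a nonzero difference is large enough to be statistically significant is a separate question that this arithmetic does not answer.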


The covariance criterion might seem obvious. In our everyday reasoning, though, we tend to ignore its importance because most of our personal experiences do not have the benefit of a comparison group, or comparison condition. For instance, you might suspect that your mom’s giant pasta bowl is making you eat too much, but without a comparison bowl, you cannot know for sure. An experiment, in contrast, provides the comparison group you need. Therefore, an experiment is a better source of information than your own experience because an experiment allows you to ask and answer: Compared to what? (For a review of experience versus empiricism, see Chapter 2.)

If independent variables did not vary, a study could not establish covariance. For example, in Chapter 1, you read about a non-peer-reviewed study that concluded dogs don’t like being hugged (Coren, 2016). Having collected Internet photos of people hugging their dogs, the researchers reported that 82% of the dogs showed signs of stress. However, this study did not have a comparison group: There were no photos of dogs not being hugged. Therefore, we cannot know, based on this study, if signs of stress are actually higher in hugged dogs than not-hugged dogs. In contrast, true experiments manipulate an independent variable. Because every independent variable has at least two levels, true experiments are always set up to look for covariance.



Manipulating the independent (causal) variable is necessary for establishing covariance, but the results matter, too. Suppose the van Kleef researchers had found no difference in how much pasta people consumed in the two groups. In that case, the study would have found no covariance, and the experimenters would have had to conclude that serving bowl size does not cause people to eat more pasta. After all, if pasta consumption does not vary with serving bowl size, there is no causal impact to explain.


There are a couple of ways an independent variable might be designed to show covariance. Your early science classes may have emphasized the importance of a control group in an experiment. A control group is a level of an independent variable that is intended to represent “no treatment” or a neutral condition. When a study has a control group, the other level or levels of the independent variable are usually called the treatment group(s). For example, if an experiment is testing the effectiveness of a new medication, the researchers might assign some participants to take the medication (the treatment group) and other participants to take an inert sugar pill (the control group). When the control group is exposed to an inert treatment such as a sugar pill, it is called a placebo group, or a placebo control group.

Not every experiment has, or needs, a control group, and often, a clear control group does not even exist. The Mueller and Oppenheimer notetaking study (2014) had two comparison groups (laptop and longhand), but neither was a control group, in the sense that neither of them clearly established a “no notetaking” condition. The van Kleef pasta eating study (2012) did not have a true control group either; the researchers simply used two different serving bowl sizes.

Also consider the experiment by Harry Harlow (1958), discussed in Chapter 1, in which baby monkeys were put in cages with artificial “mothers” made of either cold wire or warm cloth. There was no control group, just a carefully designed comparison condition. When a study uses comparison groups, the levels of the independent variable differ in some intended and meaningful way. All experiments need a comparison group so the researchers can compare one condition to another, but the comparison group does not need to be a control group.

Experiments Establish Temporal Precedence

The experiment by van Kleef’s team also established temporal precedence. The experimenters manipulated the causal (independent) variable (serving bowl size) to ensure that it came first in time. Then the students picked up the spoon to serve their own pasta. The causal variable clearly did come before the outcome (dependent) variable. This ability to establish temporal precedence, by controlling which variable comes first, is a strong advantage of experimental designs. By manipulating the independent variable, the experimenter virtually ensures that the cause comes before the effect (or outcome).

❯❯ For more details on the placebo effect and how researchers control for it, see Chapter 11, pp. 323–325.


The ability to establish temporal precedence is a feature that makes experiments superior to correlational designs. A simple correlational study is a snapshot: all variables are measured at the same time, so when two variables covary (such as multitasking frequency and multitasking ability, or deep conversations and well-being), it’s impossible to tell which variable came first. In contrast, experiments unfold over time, and the experimenter makes sure the independent variable comes first.

Well-Designed Experiments Establish Internal Validity

Did the van Kleef study establish internal validity? Are there any alternative explanations for why people in the large-bowl condition took more pasta than people in the medium-bowl condition?

A well-designed experiment establishes internal validity, which is one of the most important validities to interrogate when you encounter causal claims. To be internally valid, a study must ensure that the causal variable (the active ingredient), and not other factors, is responsible for the change in the outcome variable. You can interrogate internal validity by exploring potential alternative explanations. For example, you might ask whether the participants in the large-bowl group were served tastier-looking pasta than those in the medium-bowl group. If so, the quality of the pasta would be an alternative explanation for why people took more. However, the researchers put the same type of pasta in both serving bowls (Figure 10.4). In fact, the quality of the pasta was a control variable: It was held constant for all participants, for just this reason.

You might be wondering whether the experimenters treated the large-bowl group differently than the other group. Maybe the research assistants acted in a more generous or welcoming fashion with participants in the large-bowl group than the medium-bowl group. That would have been another threat to internal validity, so it’s important to know whether the assistants knew the hypothesis of the study.

For any given research question, there can be several possible alternative explanations, which are known as confounds, or potential threats to internal validity. The word confound can mean “confuse”: When a study has a confound, you are confused about what is causing the change in the dependent variable. Is it the intended causal variable (such as bowl size)? Or is there some alternative explanation (such as the generous attitude of the research assistants)? Internal validity is subject to a number of distinct threats, three of which are discussed in this chapter. Design confounds and selection effects are described next, and order effects are described in a later section. The rest are covered in Chapter 11. As experimenters design and interpret studies, they keep these threats to internal validity in mind and try to avoid them.

❮❮ For a discussion about how researchers use blind and double-blind designs to control internal validity, see Chapter 11, p. 323.

FIGURE 10.4 A threat to internal validity. If the pasta in the large bowl had been more appetizing than the pasta in the medium bowl, that would have been an internal validity problem in this study. (Study design is fabricated for illustration purposes.) [Bar graph. x-axis: medium bowl with less appetizing pasta, large bowl with more appetizing pasta. y-axis: estimated calories consumed (kcal).]


A design confound is an experimenter’s mistake in designing the independent variable; it is a second variable that happens to vary systematically along with the intended independent variable and therefore is an alternative explanation for the results. As such, a design confound is a classic threat to internal validity. If the van Kleef team had accidentally served a more appetizing pasta in the large bowl than the medium bowl, the study would have a design confound because the second variable (pasta quality) would have systematically varied along with the independent variable. If the research assistants had treated the large-bowl group with a more generous attitude, the treatment of each participant would have been a design confound, too.

Consider the study on notetaking. If all of the students in the laptop group had to answer more difficult essay questions than the longhand group, that would be a design confound. We would not know whether the difference in conceptual performance was caused by the question difficulty or the notetaking method. However, the researchers did not make this error; they gave the same questions to all participants, no matter what notetaking condition they were in, so there would be no systematic differences between the groups.

When an experiment has a design confound, it has poor internal validity and cannot support a causal claim. Because the van Kleef study did not have any apparent design confounds, its internal validity is sound. The researchers carefully thought about confounds in advance and turned them into control variables instead. Similarly, Mueller and Oppenheimer controlled for a number of potential design confounds, such as question difficulty, experimenter expectations, room conditions, and so on. In both cases, the researchers took steps to help them justify making a causal claim.

Systematic Variability Is the Problem. You need to be careful before accusing a study of having a design confound. Not every potentially problematic variable is a confound. Consider the example of the pasta bowl experimenters. It might be the case that some of the research assistants were generous and welcoming, and others were reserved. The attitude of the research assistants is a problem for internal validity only if it shows systematic variability with the independent variable. Did the generous assistants work only with the large-bowl group and the reserved ones only with the medium-bowl group? Then it would be a design confound. However, if the research assistants’ demeanor showed unsystematic variability (random or haphazard) across both groups, then their attitude would not be a confound.

Here’s another example. Perhaps some of the participants in the notetaking study were interested in the video lectures and others were not. This variability in interest would not be a design confound unless it varied systematically with the notetaking condition to which they were assigned. If those in the longhand group all happened to be very interested in the lectures and those in the laptop group were all uninterested, that would vary systematically with the notetaking condition, and would be a confound. But if some participants in each condition were interested and some were not, that would be unsystematic variability and would not be a confound.

Unsystematic variability can lead to other problems in an experiment. Specifically, it can obscure differences in the dependent variable, making them difficult to detect, as discussed fully in Chapter 11. However, unsystematic variability should not be called a design confound (Figure 10.5).
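The distinction between systematic and unsystematic variability can be illustrated with a small simulation. The sketch below is hypothetical: it assumes 34 participants per condition and two kinds of research assistant demeanor, and the function names are invented for illustration. It compares a case where demeanor is tied to the bowl condition with a case where demeanor is haphazard.

```python
import random

def share_generous(assign_demeanor, n_per_group=34, seed=0):
    """Return the share of participants in each condition who met a generous assistant."""
    rng = random.Random(seed)
    shares = {}
    for condition in ("medium", "large"):
        demeanors = [assign_demeanor(condition, rng) for _ in range(n_per_group)]
        shares[condition] = demeanors.count("generous") / n_per_group
    return shares

def systematic(condition, rng):
    # Demeanor is tied to the condition: this would be a design confound.
    return "generous" if condition == "large" else "reserved"

def unsystematic(condition, rng):
    # Demeanor is haphazard and unrelated to the condition: not a confound.
    return rng.choice(["generous", "reserved"])

systematic_shares = share_generous(systematic)
unsystematic_shares = share_generous(unsystematic)
print("systematic:", systematic_shares)
print("unsystematic:", unsystematic_shares)
```

In the systematic case, every large-bowl participant meets a generous assistant and no medium-bowl participant does, so demeanor is an alternative explanation. In the unsystematic case, generous assistants turn up in both conditions at roughly similar rates, so demeanor adds noise but does not explain a difference between groups.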

FIGURE 10.5 Unsystematic variability is not the same as a confound. Some people eat more than others, some like pasta more than others, and some eat breakfast while others do not. But individual differences don’t become a confound unless one type of people end up in one group systematically more than another group. If individual differences are distributed evenly in both groups, they are not a confound. [Graph. x-axis: medium bowl, large bowl. y-axis: pasta served (g). Individual participants in both groups are labeled with traits such as hates pasta, loves pasta, big eater, light eater, dieting, gluten free, and didn’t eat breakfast.]



In an experiment, when the kinds of participants in one level of the independent variable are systematically different from those in the other, selection effects can result. They can also happen when the experimenters let participants choose (select) which group they want to be in. A selection effect may result if the experimenters assign one type of person (e.g., all the women, or all who sign up early in the semester) to one condition, and another type of person (e.g., all the men, or all those who wait until later in the semester) to another condition.

Here’s a real-world example. A study was designed to test a new intensive therapy for autism, involving one-on-one sessions with a therapist for 40 hours per week (Lovaas, 1987; see Gernsbacher, 2003). To determine whether this therapy would cause a significant improvement in children’s autism symptoms, the researchers recruited 38 families that had children with autism, and arranged for some children to receive the new intensive treatment while others received their usual treatment. The researchers assigned families to either the intensive-treatment group or the treatment-as-usual group. However, some of the families lived too far away to receive the new treatment; other parents protested that they’d prefer to be in the intensive-treatment group. Thus, not all the families were randomly assigned to the two groups.

At the end of the study, the researchers found that the symptoms of the children in the intensive-treatment group had improved more than the symptoms of those who received their usual treatment. However, this study suffered from a clear selection effect: The families in the intensive-treatment group were probably systematically different from the treatment-as-usual group because the groups self-selected. Many parents in the intensive-treatment group were placed there because of their eagerness to try a focused, 40-hour-per-week treatment regimen. Therefore, parents in that group may have been more motivated to help their children, so there was a clear threat to internal validity.

Because of the selection effect, it’s impossible to tell the reason for the results (Figure 10.6). Did the children in that group improve because of the intensive treatment? Or did they improve because the families who selected the new therapy were simply more engaged in their children’s treatment? Of course, in any study that tests a therapy, some participants will be more motivated than others. This variability in motivation becomes a confound only when the more motivated folks tend to be in one group—that is, when the variability is systematic.

Avoiding Selection Effects with Random Assignment. Well-designed experiments often use random assignment to avoid selection effects. In the pasta bowl study, an experimenter flipped a coin to decide which participants would be in each group, so each one had an equal chance of being in the large-bowl or medium-bowl condition. What does this mean? Suppose that, of the 68 participants who volunteered for the study, 20 were exceptionally hungry that day. Probabilistically speaking, the coin flips would have placed about 10 of the hungry people in the medium-bowl condition and about 10 in the large-bowl condition. Similarly, if 12 of the participants were dieting, random assignment would place about 6 of them in each group. In other words, because people were assigned to each group in a random (deliberately unsystematic) way, it’s very unlikely that all the hungry people, dieters, and so on would have been clustered in the same group.

FIGURE 10.6 Selection effects. In a study for treating autism, some parents insisted that their children be in the new intensive-treatment group rather than the treatment-as-usual group. Because they had this choice, it’s not possible to determine whether the improvement in the intensive group was caused by the treatment itself or by the fact that the more motivated parents chose it. (Data are fabricated for illustration purposes.) [Graph comparing two groups: treatment as usual and intensive treatment.]

Assigning participants at random to different levels of the independent variable (by flipping a coin, rolling a die, or using a random number generator) controls for all sorts of potential selection effects (Figure 10.7). Of course, in practice, random assignment will not usually create groups that are perfectly even. The 20 exceptionally hungry people may be distributed as 9 and 11, or 12 and 8, rather than exactly 10 and 10. However, random assignment almost always works. In fact, simulations have shown that random assignment creates similar groups up to 98% of the time, even when there are as few as 4 people in each group (Sawilowsky, 2005; Strube, 1991).
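To see why random assignment tends to spread a trait across groups, it can help to simulate it. The following is a minimal sketch, not the procedure the researchers used: it assumes 68 participants, 20 of whom are exceptionally hungry, and it assigns people by shuffling the list and splitting it into two equal groups (an assumption made here so that group sizes stay equal; a coin flip per person would give slightly unequal groups). All names are hypothetical.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle a copy of the participants and split it into two equal groups."""
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def simulate_hungry_counts(n_total=68, n_hungry=20, n_runs=2000, seed=1):
    """Repeat random assignment and record how many hungry people land in the first group."""
    rng = random.Random(seed)
    # True marks an exceptionally hungry participant.
    participants = [True] * n_hungry + [False] * (n_total - n_hungry)
    counts = []
    for _ in range(n_runs):
        medium_group, large_group = randomly_assign(participants, rng)
        counts.append(sum(medium_group))
    return counts

counts = simulate_hungry_counts()
average = sum(counts) / len(counts)
print("average hungry people in the medium-bowl group:", round(average, 2))
print("smallest and largest count seen:", min(counts), max(counts))
```

Across repeated runs, the count of hungry people in one group should hover around 10, with splits such as 9 and 11 or 12 and 8 being common and a split of 20 and 0 being extremely unlikely.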

Random assignment is a way of desystematizing the types of participants who end up in each level of the independent variable. Of course, some people are more motivated than others; some are hungrier than others; some are more extroverted. Successful random assignment spreads these differences out more evenly. It creates a situation in which the experimental groups will become virtually equal, on average, before the independent variable is applied. After random assignment (and before manipulating the independent variable), researchers should be able to test the experimental groups for intelligence, extroversion, motivation, and so on, and averages of each group should be comparable on these traits.

FIGURE 10.7 Random assignment. Random assignment ensures that every participant in an experiment has an equal chance to be in each group.

❮❮ To review the difference between random assignment and random sampling, see Chapter 7, pp. 190–191.

Avoiding Selection Effects with Matched Groups. In the simplest type of random assignment, researchers assign participants at random to one condition or another in the experiment. In certain situations, researchers may wish to be absolutely sure the experimental groups are as equal as possible before they administer the independent variable. In these cases, they may choose to use matched groups, or matching.

To create matched groups from a sample of 30, the researchers would first measure the participants on a particular variable that might matter to the dependent variable. Student ability, operationalized by GPA, for instance, might matter in a study of notetaking. They would next match participants up in pairs, starting with the two having the highest GPAs, and within that matched set, randomly assign one of them to each of the two notetaking conditions. They would then take the pair with the next-highest GPAs and within that set again assign randomly to the two groups. They would continue this process until they reach the participants with the lowest GPAs and assign them at random, too (Figure 10.8).

Matching has the advantage of randomness. Because each member of the matched pair is randomly assigned, the technique prevents selection effects. This method also ensures that the groups are equal on some important variable, such as GPA, before the manipulation of the independent variable. The disadvantage is that the matching process requires an extra step—in this case, finding out people’s GPA before assigning to groups. Matching, therefore, requires more time and often more resources than random assignment.
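The matching procedure described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the participant labels and GPA values are invented, the function name `matched_assignment` is hypothetical, and the sketch assumes an even number of participants (with an odd number, the last person would be left unassigned).

```python
import random

def matched_assignment(gpa_by_participant, seed=None):
    """Pair participants by GPA rank, then randomly split each pair across two groups."""
    rng = random.Random(seed)
    # Rank participants from highest to lowest GPA.
    ranked = sorted(gpa_by_participant, key=gpa_by_participant.get, reverse=True)
    group_1, group_2 = [], []
    # Walk down the ranking two at a time: (1st, 2nd), (3rd, 4th), and so on.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random assignment within the matched pair
        group_1.append(pair[0])
        group_2.append(pair[1])
    return group_1, group_2

# Hypothetical sample of 30 students with made-up GPAs from 2.00 to 3.45.
gpas = {f"P{i:02d}": round(2.0 + 0.05 * i, 2) for i in range(30)}
group_1, group_2 = matched_assignment(gpas, seed=42)

mean_1 = sum(gpas[p] for p in group_1) / len(group_1)
mean_2 = sum(gpas[p] for p in group_2) / len(group_2)
print(len(group_1), len(group_2))
print(round(abs(mean_1 - mean_2), 3))
```

Because each pair contributes one member to each group, the two groups end up with nearly identical average GPAs, while the shuffle within each pair preserves the randomness that guards against selection effects.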


1. Why do experiments usually satisfy the three causal criteria?

2. How are design confounds and control variables related?

3. How does random assignment prevent selection effects?

4. How does using matched groups prevent selection effects?

1. See pp. 279–282. 2. See pp. 282–283; control variables are used to eliminate potential design confounds. 3. See pp. 284–285. 4. See p. 286.

FIGURE 10.8 Matching groups to eliminate selection effects. To create matched groups, participants are sorted from lowest to highest on some variable and grouped into sets of two. Individuals within each set are then assigned at random to the two experimental groups. [Diagram: the two participants with the highest GPA (ranked 1 and 2), the next two highest (3 and 4), and so on down to the two lowest (29 and 30) are each split between Group 1 and Group 2.]


INDEPENDENT-GROUPS DESIGNS

Although the minimum requirement for an experiment is that researchers manipulate one variable and measure another, experiments can take many forms. One of the most basic distinctions is between independent-groups designs and within-groups designs.

Independent-Groups vs. Within-Groups Designs

In the notetaking and pasta bowl studies, there were different participants at each level of the independent variable. In the notetaking study, some participants took notes on laptops and others took notes in longhand. In the pasta bowl study, some participants were in the large-bowl condition and others were in the medium-bowl condition. Both of these studies used an independent-groups design, in which different groups of participants are placed into different levels of the independent variable. This type of design is also called a between-subjects design or between-groups design.

In a within-groups design, or within-subjects design, there is only one group of participants, and each person is presented with all levels of the independent variable. For example, Mueller and Oppenheimer (2014) might have run their study as a within-groups design if they had asked each participant to take notes twice—once using a laptop and another time handwritten.

Two basic forms of independent-groups designs are the posttest-only design and the pretest/posttest design. The two types of designs are used in different situations.

Posttest-Only Design

The posttest-only design is one of the simplest independent-groups experimental designs. In the posttest-only design, also known as an equivalent groups, posttest-only design, participants are randomly assigned to independent variable groups and are tested on the dependent variable once (Figure 10.9).

FIGURE 10.9 A posttest-only design. [Diagram: participants are randomly assigned to independent variable Group 1 or Group 2, and each group is then measured on the dependent variable.]


The notetaking study is an example of a posttest-only design, with two independent variable levels (Mueller & Oppenheimer, 2014). Participants were randomly assigned to a laptop condition or a longhand condition (Figure 10.10).

Posttest-only designs satisfy all three criteria for causation. They allow researchers to test for covariance by detecting differences in the dependent variable. (Having at least two groups makes it possible to do so.) They establish temporal precedence because the independent variable comes first in time. And when they are conducted well, they establish internal validity. When researchers use appropriate control variables, there should be no design confounds, and random assignment takes care of selection effects.

Pretest/Posttest Design

In a pretest/posttest design, or equivalent groups, pretest/posttest design, participants are randomly assigned to at least two different groups and are tested on the key dependent variable twice: once before and once after exposure to the independent variable (Figure 10.11).

A study on the effects of mindfulness training, introduced in Chapter 1, is an example of a pretest/posttest design. In this study, 48 students were randomly assigned to participate in either a 2-week mindfulness class or a 2-week nutrition class (Mrazek, Franklin, Phillips, Baird, & Schooler, 2013). One week before starting their respective classes, all students completed a verbal-reasoning section of a GRE test. One week after their classes ended, all students completed another verbal-reasoning GRE test of the same difficulty. The results, shown in Figure 10.12, revealed that, while the nutrition group did not improve significantly from pretest to posttest, the mindfulness group scored significantly higher at posttest than at pretest.

FIGURE 10.10 Studying notetaking: a posttest-only design. [Diagram: participants are randomly assigned to take laptop notes or longhand notes, and each group then takes a comprehension test.]

FIGURE 10.11 A pretest/posttest design. [Diagram: participants are randomly assigned to a mindfulness class or a nutrition class; each group’s verbal GRE score is measured before and after the class.]

Researchers might use a pretest/posttest design when they want to demonstrate that random assignment made groups equal. In this case, a pretest/posttest design means researchers can be absolutely sure there is no selection effect in a study. If you examine the white pretest bars in Figure 10.12, you’ll see the nutrition and mindfulness groups had almost identical pretest scores, indicating that random assignment worked as expected.

In addition, pretest/posttest designs can enable researchers to track people’s change in performance over time. Although the two groups started out, as expected, with about the same GRE ability, only the mindfulness group improved their GRE scores.
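The two uses of a pretest described above (checking that groups start out equal, and tracking change over time) can be sketched with a few lines of code. The scores below are fabricated for illustration only and are not the data from Mrazek et al. (2013); the function name `mean_change` is likewise hypothetical.

```python
from statistics import mean

def mean_change(pretest, posttest):
    """Average within-person change from pretest to posttest."""
    return mean(post - pre for pre, post in zip(pretest, posttest))

# Fabricated verbal GRE scores for four students per group (illustration only).
mindfulness_pre = [450, 470, 440, 480]
mindfulness_post = [510, 520, 500, 530]
nutrition_pre = [455, 465, 445, 475]
nutrition_post = [450, 470, 450, 470]

# Use 1: check whether random assignment produced comparable groups at pretest.
print(mean(mindfulness_pre), mean(nutrition_pre))  # 460 460

# Use 2: compare how much each group changed from pretest to posttest.
print(mean_change(mindfulness_pre, mindfulness_post))  # 55
print(mean_change(nutrition_pre, nutrition_post))  # 0
```

In this made-up example the groups have the same pretest average, so a difference in change scores cannot be blamed on a selection effect. A real analysis would also test whether the change is statistically significant.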

Which Design Is Better?

Why might researchers choose to do a posttest-only experiment rather than use a pretest/posttest design? Shouldn’t they always make sure groups are equal on GRE ability or pasta appetite before they experience a manipulation? Not necessarily.

In some situations, it is problematic to use a pretest/posttest design. Imagine how the van Kleef team might have done this. Maybe they would want to pretest participants to see how much pasta they usually eat. But if they did that, people would have felt too full to participate in the rest of the study. Instead, the researchers trusted in random assignment to create equivalent groups. Big eaters and light eaters all had an equal chance of being in either one of the serving bowl groups, and if they were distributed evenly across both groups, their effects would cancel each other out. Therefore, any observed difference in overall eating behavior between these two groups should be attributable only to the two different bowl sizes. In other words, “being a big eater” could have been a selection effect, but random assignment helped avoid it.

In contrast, a pretest/posttest design made sense for the Mrazek team’s study. They could justify giving their sample of students the GRE test two times because they had told participants they were studying ways of “improving cognitive performance.”

FIGURE 10.12 Results using a pretest/posttest design. In this study, mindfulness training caused students to improve their GRE verbal scores. (Source: Mrazek et al., 2013, Fig. 1A.) [Bar graph with annotations: The star means the posttest score of the mindfulness group was statistically significantly higher than the pretest score. The two pretest bars are at equal heights; as expected, before taking their respective classes, both groups scored equally on verbal GRE. The independent variable is almost always on the x-axis, and the dependent variable is almost always on the y-axis.]


In short, the posttest-only design may be the most basic type of independent-groups experiment, but its combination of random assignment plus a manipulated variable can lead to powerful causal conclusions. The pretest/posttest design adds a pretesting step to the most basic independent-groups design. Researchers might use a pretest/posttest design if they want to study improvement over time, or to be extra sure that two groups were equivalent at the start, as long as the pretest does not make the participants change their more spontaneous behavior.

1. What is the difference between independent-groups and within-groups designs?

2. Describe how posttest-only and pretest/posttest designs are both independent-groups designs. Explain how they differ.

1. See p. 287. 2. See pp. 287–289.

WITHIN-GROUPS DESIGNS

There are two basic types of within-groups design. When researchers expose participants to all levels of the independent variable, they might do so by repeated exposures, over time, to different levels, or they might do so concurrently.

Repeated-Measures Design

A repeated-measures design is a type of within-groups design in which participants are measured on a dependent variable more than once, after exposure to each level of the independent variable. Here’s an example. Humans are social animals, and we know that many of our thoughts and behaviors are influenced by the presence of other people. Happy times may be happier, and sad times sadder, when experienced with others. Researchers Erica Boothby and her colleagues used a repeated-measures design to investigate whether a shared experience would be intensified even when people do not interact with the other person (Boothby, Clark, & Bargh, 2014). They hypothesized that sharing a good experience with another person makes it even better than it would have been if experienced alone.

They recruited 23 college women to a laboratory. Each participant was joined by a female confederate. The two sat side-by-side, facing forward, and never spoke to each other. The experimenter explained that each person in the pair would do a variety of activities, including tasting some dark chocolates and viewing some paintings. During the experiment, the order of activities was determined by drawing cards. The drawings were rigged so that the real participant’s first two activities were always tasting chocolates. In addition, the real participant tasted the first chocolate at the same time the confederate was also tasting it, but tasted the second chocolate while the confederate was viewing a painting. The participant was told that the two chocolates were different, but in fact they were exactly the same. After tasting each chocolate, participants rated how much they liked it. The results showed that people liked the chocolate more when the confederate was also tasting it (Figure 10.13).

In this study, the independent variable had two levels: Sharing and not sharing an experience. Participants experienced both levels, making it a within-groups design. The dependent variable was rating of the chocolate. It was a repeated-measures design because people rated the chocolate twice (i.e., repeatedly).

FIGURE 10.13 Testing the effect of sharing an experience using a repeated-measures design. (A) The design of the study: one group tasted chocolate with the confederate and tasted chocolate alone, rating the chocolate after each tasting. (B) The results of the study: liking of chocolate in the shared versus the unshared experience. (Source: Adapted from Boothby et al., 2014.)

Concurrent-Measures Design In a concurrent-measures design, participants are exposed to all the levels of an independent variable at roughly the same time, and a single attitudinal or behavioral preference is the dependent variable. An example is a study investigating infant cognition, in which infants were shown two faces at the same time, a male face and a female face; an experimenter recorded which face they looked at the longest (Quinn, Yahr, Kuhn, Slater, & Pascalis, 2002). The independent variable is the gender of the face, and babies experience both levels (male and female) at the same time. The baby’s looking preference is the dependent variable (Figure 10.14). This study found that babies show a preference for looking at female faces, unless their primary caretaker is male.

Harlow also used a concurrent-measures design when he presented baby monkeys with both a wire and a cloth “mother” (Harlow, 1958). The monkeys indicated their preference by spending more time with one mother than the other. In Harlow’s study, the type of mother was the independent variable (manipulated as within-groups), and each baby monkey’s clinging behavior was the dependent variable.

Advantages of Within-Groups Designs The main advantage of a within-groups design is that it ensures the participants in the two groups will be equivalent. After all, they are the same participants! For example, some people really like dark chocolate, and others do not. But in a repeated-measures design, people bring their same level of chocolate affection to both conditions, so their individual liking for the chocolate stays the same. The only difference between the two conditions will be attributable to the independent variable (whether people were sharing the experience with the confederate or not). In a within-groups design such as the chocolate study, researchers say that each woman “acted as her own control” because individual or personal variables are kept constant.

Similarly, when the Quinn team (2002) studied whether infants prefer to look at male or female faces as a within-groups design, they did not have to worry (for instance) that all the girl babies would be in one group or the other, or that babies with older siblings or who go to daycare would be in one group or the other. Every baby saw both types of faces, which kept any extraneous personal variables constant across the two facial gender conditions.

The idea of “treating each participant as his or her own control” also means matched-groups designs can be treated as within-groups designs. As discussed earlier, in a matched-groups design, researchers carefully match sets of participants on some key control variable (such as GPA) and assign each member of a set to a different group. The matched participants in the groups are assumed to be more similar to each other than in a more traditional independent-groups design, which uses random assignment.

❯❯ To review matched-groups design, see p. 286.

FIGURE 10.14 A concurrent-measures design for an infant cognition study. Babies saw two faces simultaneously, and the experimenters recorded which face they looked at the most.

Besides providing the ability to use each participant as his or her own control, within-groups designs also give researchers more power to notice differences between conditions. Statistically speaking, when extraneous differences (unsystematic variability) in personality, food preferences, gender, ability, and so on are held constant across all conditions, researchers will be more likely to detect an effect of the independent variable manipulation if there is one. In this context, the term power refers to the probability that a study will show a statistically significant result when an independent variable truly has an effect in the population. For example, if mindfulness training really does improve GRE scores, will the study’s results find a difference? Maybe not. If extraneous differences exist between two groups, too much unsystematic variability may be obscuring a true difference. It’s like being at a noisy party: your ability to detect somebody’s words is hampered when many other conversations are going on around you.
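The point about unsystematic variability can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function name standard_errors, the baseline of 50, the effect of 1 point, and the standard deviations are all hypothetical and are not taken from any study described here. It compares how precisely the same true effect is estimated when stable individual differences cancel out (within-groups) versus when they remain in the comparison (independent-groups).

```python
import random
import statistics


def standard_errors(n=20, effect=1.0, person_sd=10.0, noise_sd=1.0, seed=0):
    """Return (within-groups SE, independent-groups SE) for one simulated study.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Within-groups: the same n people are measured in both conditions,
    # so each person's stable baseline cancels out of the difference score.
    baselines = [rng.gauss(50, person_sd) for _ in range(n)]
    cond_1 = [b + rng.gauss(0, noise_sd) for b in baselines]
    cond_2 = [b + effect + rng.gauss(0, noise_sd) for b in baselines]
    diffs = [y - x for x, y in zip(cond_1, cond_2)]
    se_within = statistics.stdev(diffs) / n ** 0.5

    # Independent-groups: two different sets of n people, so individual
    # differences stay in the comparison as unsystematic variability.
    group_1 = [rng.gauss(50, person_sd) + rng.gauss(0, noise_sd) for _ in range(n)]
    group_2 = [rng.gauss(50, person_sd) + effect + rng.gauss(0, noise_sd)
               for _ in range(n)]
    se_between = (statistics.variance(group_1) / n
                  + statistics.variance(group_2) / n) ** 0.5
    return se_within, se_between


se_within, se_between = standard_errors()
print(f"within-groups SE: {se_within:.2f}; independent-groups SE: {se_between:.2f}")
```

A smaller standard error for the same true effect means a larger test statistic, which is one way to see why the within-groups version of a study tends to have more power.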

A within-groups design can also be attractive because it generally requires fewer participants overall. Suppose a team of researchers is running a study with two conditions. If they want 20 participants in each condition, they will need a total of 40 people for an independent-groups design. However, if they run the same study as a within-groups design, they will need only 20 participants because each participant experiences all levels of the independent variable (Figure 10.15). In this way, a repeated-measures design can be much more efficient.

❮❮ For more on power, see Chapter 11, pp. 340–341, and Statistics Review: Inferential Statistics, pp. 487–490.

FIGURE 10.15 Within-groups designs require fewer participants. If researchers want to use 20 people in each of two experimental conditions, a within-groups design is more efficient than an independent-groups design. (Independent-groups design: participants 1–20 in Condition 1 plus participants 21–40 in Condition 2 = 40. Within-groups design: the same participants 1–20 in both conditions = 20.)

Covariance, Temporal Precedence, and Internal Validity in Within-Groups Designs Do within-groups designs allow researchers to make causal claims? In other words, do they stand up to the three criteria for causation?

Because within-groups designs enable researchers to manipulate an independent variable and incorporate comparison conditions, they provide an opportunity for establishing covariance. The Boothby team (2014) observed, for example, that the chocolate ratings covaried with whether people shared the tasting experience or not.

A repeated-measures design also establishes temporal precedence. The experimenter controls the independent variable and can ensure that it comes first. In the chocolate study, each person tasted chocolate as either a shared or an unshared experience, and then rated the chocolate. In the infant cognition study, the researchers presented the faces first, and then measured looking time.

What about internal validity? With a within-groups design, researchers don’t have to worry about selection effects because participants are exactly the same in the two conditions. They might be concerned about design confounds. For example, Boothby’s team made sure both chocolates were exactly the same. If the chocolate that people tasted in the shared condition was of better quality, the experimenters would not know if it was the chocolate quality, or the shared experience, that was responsible for higher ratings. Similarly, Quinn’s team made sure the male and female faces they presented to the babies were equally attractive and of the same ethnicity.


Within-groups designs have the potential for a particular threat to internal validity: Sometimes, being exposed to one condition changes how participants react to the other condition. Such responses are called order effects, and they happen when exposure to one level of the independent variable influences responses to the next level. An order effect in a within-groups design is a confound, meaning that behavior at later levels of the independent variable might be caused not by the experimental manipulation, but rather by the sequence in which the conditions were experienced.

Order effects can include practice effects and fatigue effects, in which a long sequence might lead participants to get better at the task, or to get tired or bored toward the end. Order effects also include carryover effects, in which some form of contamination carries over from one condition to the next. For example, imagine sipping orange juice right after brushing your teeth; the first taste contaminates your experience of the second one.

An order effect in the chocolate-tasting study could have occurred if people rated the first chocolate higher than the second simply because the first bite of chocolate is always the best; subsequent bites are never quite as good. That would be an order effect, and a threat to internal validity because the order of tasting chocolate is confounded with the condition (shared versus unshared experiences).



Because order effects are potential internal validity problems in a within-groups design, experimenters want to avoid them. When researchers use counterbalancing, they present the levels of the independent variable to participants in different sequences. With counterbalancing, any order effects should cancel each other out when all the data are collected.

Boothby and her colleagues (2014) used counterbalancing in their experiment (Figure 10.16). Half the participants tasted the first chocolate in the shared condition, followed by a second chocolate in the unshared condition. The other half tasted chocolate in the unshared followed by the shared condition. Therefore, the effect of “first taste of chocolate” was present for half of the people in each condition. When the data were combined from these two sequences, any order effect dropped out of the comparison between the shared and unshared conditions. As a result, the researchers knew that the difference they noticed was attributable only to the shared (versus unshared) experiences, and not to practice, carryover, or some other order effect.
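The cancelling-out logic can be checked with simple arithmetic. The sketch below assumes made-up numbers that do not come from Boothby et al. (2014): a baseline rating of 5, a hypothetical sharing effect of 1 point, and a hypothetical "first taste" bonus of 2 points. The function name observed_difference is likewise only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def observed_difference(orders, base=5.0, sharing_effect=1.0, first_taste_bonus=2.0):
    """Mean shared rating minus mean unshared rating for a list of condition orders.

    The ratings are generated from hypothetical effects, not real data.
    """
    shared, unshared = [], []
    for order in orders:
        for position, condition in enumerate(order):
            rating = base
            if position == 0:
                rating += first_taste_bonus  # order effect: the first bite tastes best
            if condition == "shared":
                rating += sharing_effect     # effect of the independent variable
                shared.append(rating)
            else:
                unshared.append(rating)
    return mean(shared) - mean(unshared)


# Every participant tastes in the shared condition first: order is confounded.
not_counterbalanced = [("shared", "unshared")] * 20

# Half the participants get each order: the first-taste bonus is spread evenly.
counterbalanced = [("shared", "unshared")] * 10 + [("unshared", "shared")] * 10

print(observed_difference(not_counterbalanced))  # sharing effect plus order effect
print(observed_difference(counterbalanced))      # sharing effect only
```

Under these assumptions, the uncounterbalanced comparison mixes the 1-point sharing effect with the 2-point order effect, while the counterbalanced comparison recovers the sharing effect alone.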

Procedures Behind Counterbalancing. When researchers counterbalance conditions (or levels) in a within-groups design, they have to split their participants into groups; each group receives one of the condition sequences. How do the experimenters decide which participants receive the first order of presentation and which ones receive the second? Through random assignment, of course! They might recruit, say, 30 participants to a study and randomly assign 15 of them to receive the order A then B, and assign 15 of them to the order B then A.
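One way to carry out this random assignment is to shuffle the participant list and then deal participants into the two orders in turn. This is a minimal sketch; the function name assign_orders and the participant ID numbers are hypothetical.

```python
import random
from collections import Counter


def assign_orders(participant_ids, orders=(("A", "B"), ("B", "A")), seed=None):
    """Randomly assign participants to condition orders in equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)  # random assignment: chance alone decides who gets which order
    # Deal the shuffled participants into the orders in turn to keep group sizes equal.
    return {pid: orders[i % len(orders)] for i, pid in enumerate(shuffled)}


assignment = assign_orders(range(1, 31), seed=42)
print(Counter(assignment.values()))
```

With 30 participants and two orders, 15 people end up in each order, and which 15 depends only on the shuffle.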

FIGURE 10.16 Counterbalanced design. Using counterbalancing in an experiment will cancel out any order effects in a repeated-measures design. (Participants are randomly assigned to one of two sequences: taste chocolate with confederate, rate chocolate, taste chocolate alone, rate chocolate; or taste chocolate alone, rate chocolate, taste chocolate with confederate, rate chocolate.)

There are two methods for counterbalancing an experiment: full and partial. When a within-groups experiment has only two or three levels of an independent variable, researchers can use full counterbalancing, in which all possible condition orders are represented. For example, a repeated-measures design with two conditions is easy to counterbalance because there are only two orders (A → B and B → A). In a repeated-measures design with three conditions (A, B, and C), each group of participants could be randomly assigned to one of the six following sequences:

A → B → C
B → C → A
A → C → B
C → A → B
B → A → C
C → B → A

As the number of conditions increases, however, the number of possible orders needed for full counterbalancing increases dramatically. For example, a study with four conditions requires 24 possible sequences! If experimenters want to put at least a few participants in each order, the need for participants can quickly increase, counteracting the typical efficiency of a repeated-measures design. Therefore, they might use partial counterbalancing, in which only some of the possible condition orders are represented. One way to partially counterbalance is to present the conditions in a randomized order for every subject. (This is easy to do when an experiment is administered by a computer; the computer delivers conditions in a new random order for each participant.)
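The growth in the number of orders is just the number of permutations of the conditions (2, 6, 24, and so on). The sketch below lists every order for full counterbalancing and draws a fresh random order for each participant as one form of partial counterbalancing. The function names full_counterbalancing and random_order are hypothetical.

```python
import itertools
import random


def full_counterbalancing(conditions):
    """Every possible order of the conditions (n! orders for n conditions)."""
    return list(itertools.permutations(conditions))


def random_order(conditions, rng):
    """Partial counterbalancing: one freshly randomized order for a participant."""
    order = list(conditions)
    rng.shuffle(order)
    return order


for conditions in ("AB", "ABC", "ABCD"):
    print(len(conditions), "conditions:", len(full_counterbalancing(conditions)), "orders")

rng = random.Random(7)
for participant in range(1, 4):
    print("participant", participant, random_order("ABCD", rng))
```

With randomized orders, not every sequence is guaranteed to occur, but no sequence is systematically favored.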

Another technique for partial counterbalancing is to use a Latin square, a formal system to ensure that every condition appears in each position at least once. A Latin square for six conditions (conditions 1 through 6) looks like this:

1 2 6 3 5 4
2 3 1 4 6 5
3 4 2 5 1 6
4 5 3 6 2 1
5 6 4 1 3 2
6 1 5 2 4 3

The first row is set up according to a formula, and then the conditions simply go in numerical order down each column. Latin squares work differently for odd and even numbers of conditions. If you wish to create your own, you can find formulas for setting up the first rows of a Latin square online.
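The six-condition square above can be reproduced in a few lines. The sketch assumes one commonly used first-row formula for an even number of conditions (1, 2, n, 3, n − 1, 4, ...); the text does not state which formula it used, although this one matches the first row shown. The function name latin_square is hypothetical. For an odd number of conditions the same code still places every condition in every position once, but it does not balance which condition immediately follows which.

```python
def latin_square(n):
    """Build an n x n Latin square of conditions 1..n.

    Assumes the first-row pattern 1, 2, n, 3, n-1, 4, ... (an assumption, see above).
    """
    first_row = [1]
    low, high = 2, n
    take_low = True
    while len(first_row) < n:
        if take_low:
            first_row.append(low)
            low += 1
        else:
            first_row.append(high)
            high -= 1
        take_low = not take_low

    # Each later row adds 1 to every entry of the row above, wrapping n back to 1,
    # so the conditions run in numerical order down each column.
    return [[(c - 1 + shift) % n + 1 for c in first_row] for shift in range(n)]


for row in latin_square(6):
    print(*row)
```

Each row is the condition order for one group of participants, so a six-condition study needs only six orders instead of the 720 required for full counterbalancing.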

Disadvantages of Within-Groups Designs Within-groups designs are true experiments because they involve a manipulated variable and a measured variable. They potentially establish covariance, they ensure temporal precedence, and when experimenters control for order effects and design confounds, they can establish internal validity, too. So why wouldn’t a researcher choose a within-groups design all the time?

Within-groups designs have three main disadvantages. First, as noted earlier, repeated-measures designs have the potential for order effects, which can threaten internal validity. But a researcher can usually control for order effects by using counterbalancing, so they may not be much of a concern.

A second possible disadvantage is that a within-groups design might not be possible or practical. Suppose someone has devised a new way of teaching children how to ride a bike, called Method A. She wants to compare Method A with the older method, Method B. Obviously, she cannot teach a group of children to ride a bike with Method A and then return them to baseline and teach them again with Method B. Once taught, the children are permanently changed. In such a case, a within-groups design, with or without counterbalancing, would make no sense. The study on mindfulness training on GRE scores fits in this category. Once people had participated in mindfulness training, they presumably could apply their new skill indefinitely.

A third problem occurs when people see all levels of the independent variable and then change the way they would normally act. If participants in the van Kleef pasta bowl study had seen both the medium and large serving bowls (instead of just one or the other), they might have thought, “I know I’m participating in a study at the moment; seeing these two bowls makes me wonder whether it has something to do with serving bowl size.” As a result, they might have changed their spontaneous behavior. A cue that can lead participants to guess an experiment’s hypothesis is known as a demand characteristic, or an experimental demand. Demand characteristics create an alternative explanation for a study’s results. You would have to ask: Did the manipulation really work, or did the participants simply guess what the researchers expected them to do, and act accordingly?

Is Pretest/Posttest a Repeated-Measures Design? You might wonder whether a pretest/posttest independent-groups design should be considered a repeated-measures design. After all, in both designs, participants are tested on the dependent variable twice.

In a true repeated-measures design, however, participants are exposed to all levels of a meaningful independent variable, such as a shared or unshared experience, or the gender of the face they’re looking at. The levels of such independent variables can also be counterbalanced. In contrast, in a pretest/posttest design, participants see only one level of the independent variable, not all levels (Figure 10.17).

FIGURE 10.17 Pretest/posttest design versus repeated-measures design. In a pretest/posttest design, participants see only one level of the independent variable, but in a repeated-measures design, they see all the levels. (DV = dependent variable. IV = independent variable.) (Pretest/posttest design: participants are randomly assigned to a group; each group completes a pretest (DV), is exposed to one IV level (A or B), and completes a posttest (DV). Repeated-measures design: participants are randomly assigned to a sequence; each is exposed to IV level A and IV level B, in one order or the other, with the DV measured after each exposure.)


1. What are the two basic types of within-groups design?

2. Describe how counterbalancing improves the internal validity of a within-groups design.

3. Summarize the three advantages and the three potential disadvantages of within-groups designs.

1. Concurrent measures and repeated measures; see pp. 290–292. 2. See pp. 295–296. 3. See pp. 292–293 and pp. 296–297.

Table 10.1 summarizes the four types of experimental designs covered in this chapter.

TABLE 10.1
Two Independent-Groups Designs and Two Within-Groups Designs

Independent-groups designs: Posttest-only design; Pretest/posttest design
Within-groups designs: Concurrent-measures design; Repeated-measures design

INTERROGATING CAUSAL CLAIMS WITH THE FOUR VALIDITIES Let’s use Mueller and Oppenheimer’s (2014) study on notetaking to illustrate how to interrogate an experimental design using the four big validities as a framework. What questions should you ask, and what do the answers mean?

Construct Validity: How Well Were the Variables Measured and Manipulated? In an experiment, researchers operationalize two constructs: the independent variable and the dependent variable. When you interrogate the construct validity of an experiment, you should ask about the construct validity of each of these variables.


Chapters 5 and 6 explained in detail how to interrogate the construct validity of a dependent (measured) variable. To interrogate construct validity in the notetaking study, you would start by asking how well the researchers measured their dependent variables: factual knowledge and conceptual knowledge.

One aspect of good measurement is face validity. Mueller and Oppenheimer (2014) provided examples of the factual and conceptual questions they used, so you could examine these and evaluate whether they actually do constitute good measures of factual learning (e.g., “What is the purpose of adding calcium propionate to bread?”) and conceptual learning (e.g., “If a person’s epiglottis was not working properly, what would be likely to happen?”). These two examples do seem to be appropriate types of questions because the first asks for direct recall of a lecture’s factual information, and the second requires people to understand the epiglottis and make an inference. The researchers also noted that each of these open-ended questions was graded by two coders. The two sets of scores, they reported, showed good interrater reliability (.89). In this study, the strong interrater reliability indicates that the two coders agreed about which participants got the right answers and which ones did not. (To review interrater reliability, see Chapter 5.)
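To make the idea of interrater reliability concrete, the sketch below correlates two coders' scores for the same set of answers. It rests on two assumptions: that reliability is summarized with a Pearson correlation (the passage does not say which statistic produced the .89), and that the scores shown are invented for illustration rather than taken from the study. The names pearson_r, coder_1, and coder_2 are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariation / (spread_x * spread_y)


# Hypothetical scores (0-2 points) that two coders gave to the same eight answers.
coder_1 = [2, 0, 1, 2, 1, 0, 2, 1]
coder_2 = [2, 0, 1, 1, 1, 0, 2, 2]

print(round(pearson_r(coder_1, coder_2), 2))
```

The closer the value is to 1, the more closely the two coders' scores rise and fall together across participants.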


To interrogate the construct validity of the independent variables, you would ask how well the researchers manipulated (or operationalized) them. In the Mueller and Oppenheimer study, this was straightforward: People were given either a pen or a laptop. This operationalization clearly manipulated the intended independent variable.

Manipulation Checks and Pilot Studies. In other studies, researchers need to use manipulation checks to collect empirical data on the construct validity of their independent variables. A manipulation check is an extra dependent variable that researchers can insert into an experiment to convince them that their experimental manipulation worked.

A manipulation check was not necessary in the notetaking study because research assistants could simply observe participants to make sure they were actually using the laptops or pens they had been assigned. Manipulation checks are more likely to be used when the intention is to make participants think or feel certain ways. For example, researchers may want to manipulate feelings of anxiety by telling some students they have to give a public speech. Or they may wish to manipulate people’s empathy by showing a poignant film. They may manipulate amusement by telling jokes. In these cases, a manipulation check can help researchers determine whether the operationalization worked as intended.


Here’s an example. Researchers were interested in investigating whether humor would improve students’ memory of a college lecture (Kaplan & Pascoe, 1977). Students were randomly assigned to listen to a serious lecture or one punctuated by humorous examples. To ensure they actually found the humorous lecture funnier than the serious one, students rated the lecture on how “funny” and “light” it was. These items were in addition to the key dependent variable, which was their memory for the material. As expected, the students in the humorous lecture condition rated the speaker as funnier and lighter than students in the serious lecture condition. The researchers concluded that the manipulation worked as expected.

The same procedure might also be used in a pilot study. A pilot study is a simple study, using a separate group of participants, that is completed before (or sometimes after) conducting the study of primary interest. Kaplan and Pascoe (1977) might have exposed a separate group of students to either a serious or a humorous lecture, and then asked them how amusing they found it. Researchers may use pilot study data to confirm the effectiveness of their manipulations before using them in a target study.


Experiments are designed to test theories. Therefore, interrogating the construct validity of an experiment requires you to evaluate how well the measures and manipulations researchers used in their study capture the conceptual variables in their theory.

Recall that Mueller and Oppenheimer (2014) originally proposed that laptop notetaking would let students more easily take notes verbatim, compared to taking handwritten notes. In fact, their study included measures of “verbatim overlap” so they could test their theory about why laptop notetakers might perform worse. After transcribing each person’s notes, they measured how closely the notes overlapped verbatim with the video lecture narration. It turned out that people in the laptop condition had, in fact, written more verbatim notes than people in the longhand condition. In addition, the more people wrote verbatim notes, the worse they did on the essay test. The researchers supported their theory by measuring key constructs that their theory proposed.

Here’s another example of how theory guides the variables researchers manipulate and measure in an experiment. Recall that the chocolate-tasting study was designed to test the theory that sharing an experience makes it more intense (Boothby et al., 2014). In addition to showing that good-tasting chocolate tastes better when another person is tasting it, the researchers also needed to demonstrate the same effect in response to a negative experience. Using the same repeated-measures design, in a second study they used squares of 90% dark chocolate, containing almost no sugar, so it was bitter as opposed to the sweeter chocolate in the first study. People rated their liking for the bitter chocolate lower when the experience was shared, compared to unshared (Figure 10.18).


Two main results of the chocolate studies support their construct validity. (1) People in the first study rated the chocolate higher overall than those in the second study, which is what you’d expect if one was supposed to represent a positive experience and the other a negative experience. (2) People reported being more absorbed in the shared experience than the unshared one. This result supports the theory that shared experiences should be more intense (absorbing) than unshared ones.

External Validity: To Whom or What Can the Causal Claim Generalize? Chapters 7 and 8 discussed external validity in the context of frequency claims and association claims. Interrogating external validity in the context of causal claims is similar. You ask whether the causal relationship can generalize to other people, places, and times. (Chapter 14 goes into even more detail about external validity questions.)








FIGURE 10.18 Construct validity is theory-driven. (A) When people tasted bitter chocolate in this study, they rated it more negatively when the experience was shared, compared to unshared. They also rated both of the bitter chocolates lower than the sweet chocolates in the first study, providing construct validity evidence that the experience in the second study was negative. (B) People were more absorbed in the shared experience, evidence that the shared versus unshared experience was manipulated as intended. (Panels: Study 1, tasting sweet chocolate, liking of chocolate; Study 2, tasting bitter chocolate, liking of chocolate and absorption in experience; each compares the shared with the unshared experience.) (Source: Adapted from Boothby et al., 2014.)


As with an association claim or a frequency claim, when interrogating a causal claim’s external validity, you should ask how the experimenters recruited their participants. Remember that when you interrogate external validity, you ask about random sampling: randomly gathering a sample from a population. (In contrast, when you interrogate internal validity, you ask about random assignment: randomly assigning each participant in a sample into one experimental group or another.) Were the participants in a study sampled randomly from the population of interest? If they were, you can be relatively sure the results can be generalized, at least to the population of participants from which the sample came.

In the Mueller and Oppenheimer study (2014), the 67 students were a convenience sample (rather than a random sample) of undergraduates from Princeton University. Because they were a convenience sample, you can’t be sure if the results would generalize to all Princeton University students, not to mention to college students in general. In addition, because the study was run only on college students, you can’t assume the results would apply to middle school or high school students.


External validity also applies to the types of situations to which an experiment might generalize. For example, the van Kleef study used pasta, but other researchers in the same lab found that large serving containers also cause people to consume more soup, popcorn, and snack chips (Wansink, 2006). The notetaking study used five videotaped TED talk lectures. In their published article, Mueller and Oppenheimer (2014) reported two additional experiments, each of which used new video lectures. All three experiments found the same pattern, so you can infer that the effect of laptop notetaking does generalize to other TED talks. However, you can’t be sure from this study if laptop notetaking would generalize to a live lecture class. You also don’t know if the effect of laptop notetaking would generalize to other kinds of college teaching, such as team-based learning or lab courses.

To decide whether an experiment’s results can generalize to other situations, it is sometimes necessary to consider the results of other research. One experiment, conducted after Mueller and Oppenheimer’s three studies, helped demonstrate that the laptop notetaking effect can generalize to live lecture classes (Carter, Greenberg, & Walker, 2016). College students at West Point were randomly assigned to their real, semester-long economics classes. There were 30 sections of the class, which all followed the same syllabus, used the same textbook, and gave almost the same exams. In 10 of the sections, students were not allowed to use laptops or tablets, and in another 10 sections, they were allowed to use them. In the last 10 sections, students could use tablets as long as they were kept flat on their desk during the class. The results indicated that students in the two computerized sections scored lower on exams than students in the computer-free classrooms. This study helps us generalize from Mueller and Oppenheimer’s short-term lecture situation to a real, semester-long college class. Similarly, you might ask if the hypothesis about shared experiences might generalize to other experiences besides tasting chocolate (Figure 10.19).


Should you be concerned that Mueller and Oppenheimer did not select their participants at random from the population of college students? Should you be concerned that all three of their studies used TED talks instead of other kinds of classroom material?

Remember from Chapter 3 that in an experiment, researchers usually prioritize experimental control, that is, internal validity. To get a clean, confound-free manipulation, they may have to conduct their study in an artificial environment like a university laboratory. Such locations may not represent situations in the real world. Although it’s possible to achieve both internal and external validity in a single study, doing so can be difficult. Therefore, many experimenters decide to sacrifice real-world representativeness for internal validity.

Testing their theory and teasing out the causal variable from potential confounds were the steps Mueller and Oppenheimer, like most experimenters, took care of first. In addition, running an experiment on a relatively homogeneous sample (such as college students) meant that the unsystematic variability was less likely to obscure the effect of the independent variable (see Chapter 11). Replicating the study using several samples in a variety of contexts is a step saved for later. Although Mueller and Oppenheimer sampled only college students and ran their studies in a laboratory, at least one other study demonstrated that taking notes by computer can cause lower grades even in real, semester-long courses. Future researchers might also be interested in testing the effect of using laptops in younger students or for other subjects (such as psychology or literature courses). Such studies would demonstrate whether longhand notetaking is more effective than laptop notetaking for all subjects and for all types of students.

❮❮ For more discussion on prioritizing validities, see Chapter 14, pp. 438–452.

FIGURE 10.19 Generalizing to other situations. The chocolate-tasting study showed that flavors are more intense when the experience is shared. A future study might explore whether the shared experiences effect generalizes to other situations, such as watching a happy or sad movie.

Interrogating Causal Claims with the Four Validities

304 CHAPTER 10 Introduction to Simple Experiments

Statistical Validity: How Well Do the Data Support the Causal Claim?

For the present context, interrogating the statistical validity of an experiment involves two basic concerns: statistical significance and effect size. In your statistics class, you will learn how to ask other questions about experimental designs, such as whether the researchers conducted the right statistical tests.


The first question to ask is whether the difference between means obtained in the study is statistically significant. Recall from Chapter 8 that when a result is statistically significant, it is unlikely to have been obtained by chance from a population in which nothing is happening. When the difference (say, between a laptop group and a longhand group) in a study is statistically significant, you can be more confident the difference is not a fluke result. In other words, a statistically significant result suggests covariance exists between the variables in the population from which the sample was drawn.

When the difference between conditions is not statistically significant, you cannot conclude there is covariance; that is, you cannot conclude that the independent variable had a detectable effect on the dependent variable. Any observed difference between the groups found in the study is similar to the kinds of differences you would find just by chance when there is no covariance. And if there is no covariance, the study does not support a causal claim.


Knowing a result is statistically significant tells you the result probably was not drawn by chance from a population in which there is no difference between groups. However, if a study used a very large sample, even tiny differences might be statistically significant. Therefore, asking about effect size can help you evaluate the strength of the covariance (i.e., the difference). In general, the larger the effect size, the more important, and the stronger, the causal effect. A statistically significant result does not necessarily have a large effect size.
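A rough sketch in Python can make the sample-size point concrete. The means, standard deviations, and group sizes below are hypothetical and are not taken from any study in this chapter; the function name welch_t is only an illustrative label. The t statistic divides the difference between two group means by its standard error, and with reasonably large samples a t beyond roughly 2 is conventionally treated as statistically significant at the .05 level.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for the difference between two independent group means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical data: the groups differ by 0.5 points on a test whose
# standard deviation is 10, so the standardized difference is tiny (d = 0.05).
small_study = welch_t(70.5, 10, 50, 70.0, 10, 50)
large_study = welch_t(70.5, 10, 50_000, 70.0, 10, 50_000)

print(f"n = 50 per group:     t = {small_study:.2f}")
print(f"n = 50,000 per group: t = {large_study:.2f}")
```

With these made-up numbers, the same tiny difference gives t = 0.25 with 50 participants per group but a t of about 7.9 with 50,000 per group. The second result would be statistically significant, yet the effect size is just as small as in the first.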

As discussed in Chapter 8, the correlation coefficient r can help researchers evaluate the effect size of an association. In experiments, they often use a different indicator of standardized effect size, called d. This measure represents how far apart two experimental groups are on the dependent variable. It indicates not only the distance between the means, but also how much the scores within the groups overlap. The standardized effect size, d, takes into account both the difference between means and the spread of scores within each group (the standard deviation). When d is larger, it usually means the independent variable caused the dependent variable to change for more of the participants in the study. When d is smaller, it usually means the scores of participants in the two experimental groups overlap more. Figure 10.20 shows what two d values might look like when a study’s results are graphed showing all participants. Even though the difference between means is exactly the same in the two graphs, the effect sizes reflect the different degrees of overlap between the group participants.

❯❯ For more detail on standard deviation and effect size, see Statistics Review: Descriptive Statistics, pp. 462–465 and pp. 472–477.

In Mueller and Oppenheimer’s first study (2014), the effect size for the difference in conceptual test question performance between the longhand and laptop groups was d = 0.77. Table 10.2 shows how the conventions apply to d as well as r. According to these guidelines, a d of 0.77 would be considered fairly strong. It means the longhand group scored about 0.77 of a standard deviation higher than the laptop group. Therefore, if you were interrogating the statistical validity of Mueller and Oppenheimer’s causal claim, you would conclude that the effect of the notetaking method on essay performance was strong, and you may be more convinced of the study’s importance. By comparison, the effect size for the difference between the shared and unshared experience in the second, bitter chocolate study was d = 0.31. According to Cohen’s guidelines, a d of 0.31 represents a small to moderate effect of shared experience on people’s rating of a negative experience.

❮❮ For more questions to ask when interrogating statistical validity, such as whether the researchers used the appropriate tests or whether they made any inferential errors, see Statistics Review: Inferential Statistics, pp. 479–503.

[Figure 10.20 panels: left, d = 0.56; right, d = 0.24. Each panel plots the individual scores of Group 1 and Group 2 (axis label: Score), with group means of M = 9 and M = 11.]

FIGURE 10.20 Effect size and overlap between groups. Effect sizes are larger when the scores in the two experimental groups overlap less. Overlap is a function of how far apart the group means are, as well as how variable the scores are within each group. On both sides of the graph, the two group means (M) are the same distance apart (about 2 units), but the overlap of the scores between groups is greater in the blue scores on the right. Because there is more overlap between groups, the effect size is smaller.
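The idea in Figure 10.20 can be sketched in a few lines of Python. The scores below are hypothetical (they are not the data behind the figure), and the sketch assumes the common pooled-standard-deviation version of d; the function name cohens_d is an illustrative label. In both comparisons the group means are 9 and 11, but the scores are spread out much more in the second comparison.

```python
import math
from statistics import mean, variance

def cohens_d(group1, group2):
    """Standardized mean difference, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_variance = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group2) - mean(group1)) / math.sqrt(pooled_variance)

# Hypothetical scores: means of 9 and 11 in both comparisons.
tight_1 = [8, 9, 9, 9, 10]       # little spread within the group
tight_2 = [10, 11, 11, 11, 12]
wide_1 = [3, 6, 9, 12, 15]       # much more spread within the group
wide_2 = [5, 8, 11, 14, 17]

d_tight = cohens_d(tight_1, tight_2)
d_wide = cohens_d(wide_1, wide_2)

print(f"Less overlap: d = {d_tight:.2f}")
print(f"More overlap: d = {d_wide:.2f}")
```

In this sketch the first comparison yields a much larger d than the second, even though both pairs of means are exactly 2 points apart. The only thing that changed is how much the scores vary within each group, and therefore how much the two groups overlap.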

TABLE 10.2

Cohen’s Guidelines for Effect Size Strength

d       Strength                Comparable r
0.20    Small, or weak          .10
0.50    Medium, or moderate     .30
0.80    Large, or strong        .50


Internal Validity: Are There Alternative Explanations for the Results?

When interrogating causal claims, internal validity is often the priority. Experimenters isolate and manipulate a key causal variable, while controlling for all possible other variables, precisely so they can achieve internal validity. If the internal validity of an experiment is sound, you know that a causal claim is almost certainly appropriate. But if the internal validity is flawed (if there is some confound), a causal claim would be inappropriate. It should instead be demoted to an association claim.

Three potential threats to internal validity have already been discussed in this chapter. These fundamental internal validity questions are worth asking of any experiment:

1. Did the experimental design ensure that there were no design confounds, or did some other variable accidentally covary along with the intended independent variable? (Mueller and Oppenheimer made sure people in both groups saw the same video lectures, in the same room, and so on.)

2. If the experimenters used an independent-groups design, did they control for selection effects by using random assignment or matching? (Random assignment controlled for selection effects in the notetaking study.)

3. If the experimenters used a within-groups design, did they control for order effects by counterbalancing? (Counterbalancing is not relevant in Mueller and Oppenheimer’s design because it was an independent-groups design.)

Chapter 11 goes into further detail on these threats to internal validity. In addition, nine more threats are covered.


1. How do manipulation checks provide evidence for the construct validity of an experiment’s independent variable? Why does theory matter in evaluating construct validity?

2. Besides generalization to other people, what other aspect of generalization does external validity address?

3. What does it mean when an effect size is large (as opposed to small) in an experiment?

4. Summarize the three threats to internal validity discussed in this chapter.

1. See pp. 299–301. 2. Generalization to other situations; see pp. 302–303. 3. See pp. 304–305. 4. See p. 306.



Summary

Causal claims are special because they can lead to advice, treatments, and interventions. The only way to support a causal claim is to conduct a well-designed experiment.

Two Examples of Simple Experiments

• An experiment showed that taking notes on a laptop rather than in longhand caused students to do worse on a conceptual test of lecture material.

• An experiment showed that providing a large serving bowl caused people to serve themselves more pasta, and to eat more of it, than a medium serving bowl.

Experimental Variables

• Experiments study the effect of an independent (manipulated) variable on a dependent (measured) variable.

• Experiments deliberately keep all extraneous variables constant as control variables.

Why Experiments Support Causal Claims

• Experiments support causal claims because they potentially allow researchers to establish covariance, temporal precedence, and internal validity.

• The three potential internal validity threats covered in this chapter that researchers work to avoid are design confounds, selection effects, and order effects.

Independent-Groups Designs

• In an independent-groups design, different participants are exposed to each level of the independent variable.

• In a posttest-only design, participants are randomly assigned to one of at least two levels of an independent variable and then measured once on the dependent variable.

• In a pretest/posttest design, participants are randomly assigned to one of at least two levels of an independent variable, and are then measured on a dependent variable twice: once before and once after they experience the independent variable.

• Random assignment or matched groups can help establish internal validity in independent-groups designs by minimizing selection effects.

Within-Groups Designs

• In a within-groups design, the same participants are exposed to all levels of the independent variable.

• In a repeated-measures design, participants are tested on the dependent variable after each exposure to an independent variable condition.

• In a concurrent-measures design, participants are exposed to at least two levels of an independent variable at the same time, and then indicate a preference for one level (the dependent variable).

• Within-groups designs allow researchers to treat each participant as his or her own control, and require fewer participants than independent-groups designs. Within-groups designs also present the potential for order effects and demand characteristics.


Review Questions

Max ran an experiment in which he asked people to shake hands with an experimenter (played by a female friend) and rate the experimenter’s friendliness using a self-report measure. The experimenter was always the same person, and used the same standard greeting for all participants. People were randomly assigned to shake hands with her either after she had cooled her hands under cold water or after she had warmed her hands under warm water. Max’s results found that people rated the experimenter as more friendly when her hands were warm than when they were cold.

1. Why does Max’s experiment satisfy the causal criterion of temporal precedence?

a. Because Max found a difference in rated friendliness between the two conditions, cold hands and warm hands.

b. Because the participants shook the experimenter’s hand before rating her friendliness.

c. Because the experimenter acted the same in all conditions, except having cold or warm hands.

d. Because Max randomly assigned people to the warm hands or cold hands condition.

2. In Max’s experiment, what was a control variable?

a. The participants’ rating of the friendliness of the experimenter.

b. The temperature of the experimenter’s hands (warm or cold).

c. The gender of the students in the study.

d. The standard greeting the experimenter used while shaking hands.

Key Terms

experiment, p. 276 manipulated variable, p. 276 measured variable, p. 277 independent variable, p. 277 condition, p. 277 dependent variable, p. 277 control variable, p. 278 comparison group, p. 279 control group, p. 280 treatment group, p. 280 placebo group, p. 280 confound, p. 281

design confound, p. 282 systematic variability, p. 282 unsystematic variability, p. 282 selection effect, p. 284 random assignment, p. 284 matched groups, p. 286 independent-groups design, p. 287 within-groups design, p. 287 posttest-only design, p. 287 pretest/posttest design, p. 288 repeated-measures design, p. 290 concurrent-measures design, p. 291

power, p. 293 order effect, p. 294 practice effect, p. 294 carryover effect, p. 294 counterbalancing, p. 295 full counterbalancing, p. 295 partial counterbalancing, p. 296 Latin square, p. 296 demand characteristic, p. 297 manipulation check, p. 299 pilot study, p. 300

Interrogating Causal Claims with the Four Validities

• Interrogating construct validity involves evaluating whether the variables were manipulated and measured in ways consistent with the theory behind the experiment.

• Interrogating external validity involves asking whether the experiment’s results can be generalized to other people or to other situations and settings.

• Interrogating statistical validity starts by asking how strongly the independent variable covaries with the dependent variable (effect size), and whether the effect is statistically significant.

• Interrogating internal validity involves looking for design confounds and seeing whether the researchers used techniques such as random assignment and counterbalancing.



3. What type of design is Max’s experiment?

a. Posttest-only design

b. Pretest/posttest design

c. Concurrent-measures design

d. Repeated-measures design

4. Max randomly assigned people to shake hands either with the “warm hands” experimenter or the “cold hands” experimenter. Why did he randomly assign participants?

a. Because he had a within-groups design.

b. Because he wanted to avoid selection effects.

c. Because he wanted to avoid an order effect.

d. Because he wanted to generalize the results to the population of students at his university.

5. Which of the following questions would be interrogating the construct validity of Max’s experiment?

a. How large is the effect size comparing the rated friendliness of the warm hands and cold hands conditions?

b. How well did Max’s “experimenter friendliness” rating capture participants’ actual impressions of the experimenter?

c. Were there any confounds in the experiment?

d. Can we generalize the results from Max’s friend to other experimenters with whom people might shake hands?

Learning Actively

1. Design a posttest-only experiment that would test each of the following causal claims. For each one, identify the study’s independent variable(s), identify its dependent variable(s), and suggest some important control variables. Then, sketch a bar graph of the results you would predict (remember to put the dependent variable on the y-axis). Finally, apply the three causal criteria to each study.

a. Having a friendly (versus a stern) teacher for a brief lesson causes children to score better on a test of material for that lesson.

b. Practicing the piano for 30 minutes a day (compared with 10 minutes a day) causes new neural connections in the temporal region of the brain.

c. Drinking sugared lemonade (compared to sugar-free lemonade) makes people better able to perform well on a task that requires self-control.

2. For each of the following independent variables, how would you design a manipulation that used an independent-groups design? How would you design a manipulation that used a within-groups design? Explain the advantages and disadvantages of manipulating each independent variable as independent-groups versus within-groups.

a. Listening to a lesson from a friendly teacher versus a stern teacher.

b. Practicing the piano for 30 minutes a day versus 10 minutes a day.

c. Drinking sugared versus sugar-free lemonade.

3. To study people’s willingness to help others, social psychologists Latané and Darley (1969) invited people to complete questionnaires in a lab room. After handing out the questionnaires, the female experimenter went next door and staged a loud accident: She pretended to fall off a chair and get hurt (she actually played an audio recording of this accident). Then the experimenters observed whether each participant stopped filling out the questionnaire and went to try to help the “victim.”

Behind the scenes, the experimenters had flipped a coin to assign participants randomly to either an “alone” group, in which they were in the questionnaire room by themselves, or a “passive confederate” group, in which they were in the questionnaire room with a confederate (an actor) who sat impassively during the “accident” and did not attempt to help the “victim.”

In the end, Latané and Darley found that when participants were alone, 70% reacted, but when participants were with a passive confederate, only 7% reacted. This experiment supported the researchers’ theory that during an accident, people take cues from others, looking to others to decide how to interpret the situation.

a. What are the independent, dependent, and control variables in this study?

b. Sketch a graph of the results of this study.

c. Is the independent variable in this study manipulated as independent-groups or as repeated-measures? How do you know?

d. For this study, ask at least one question for each of the four validities.

“Was it really the therapy, or something else, that caused symptoms to improve?”

“How should we interpret a null result?”


More on Experiments: Confounding and Obscuring Variables

Chapter 10 covered the basic structure of an experiment, and this chapter addresses a number of questions about experimental design. Why is it so important to use a comparison group? Why do many experimenters create a standardized, controlled, seemingly artificial environment? Why do they use so many (or so few) participants? Why do researchers often use computers to measure their variables? Why do they insist on double-blind study designs? For the clearest possible results, responsible researchers specifically design their experiments with many factors in mind. They want to detect differences that are really there, and they want to determine conclusively when their predictions are wrong.

The first main section describes potential internal validity problems and how researchers usually avoid them. The second main section discusses some of the reasons experiments may yield null results.


A year from now, you should still be able to:

1. Interrogate a study and decide whether it rules out twelve potential threats to internal validity.

2. Describe how researchers can design studies to prevent internal validity threats.

3. Interrogate an experiment with a null result to decide whether the study design obscured an effect or whether there is truly no effect to find.

4. Describe how researchers can design studies to minimize possible obscuring factors.

312 CHAPTER 11 More on Experiments: Confounding and Obscuring Variables

THREATS TO INTERNAL VALIDITY: DID THE INDEPENDENT VARIABLE REALLY CAUSE THE DIFFERENCE?

When you interrogate an experiment, internal validity is the priority. As discussed in Chapter 10, three possible threats to internal validity include design confounds, selection effects, and order effects. All three of these threats involve an alternative explanation for the results.

With a design confound, there is an alternative explanation because the experiment was poorly designed; another variable happened to vary systematically along with the intended independent variable. Chapter 10 presented the study on pasta serving bowl size and amount of pasta eaten. If the pasta served to the large-bowl group had looked more appetizing than the pasta served to the medium-bowl group, that would have been a design confound (see Figure 10.4). It would not be clear whether the bowl size or the appearance of the pasta caused the large-bowl group to take more.

With a selection effect, a confound exists because the different independent variable groups have systematically different types of participants. In Chapter 10, the example was a study of an intensive therapy for autism, in which children who received the intensive treatment did improve over time. However, we are not sure if their improvement was caused by the therapy or by greater overall involvement on the part of the parents who elected to be in the intensive-treatment group. Those parents’ greater motivation could have been an alternative explanation for the improvement of children in the intensive-treatment group.

With an order effect (in a within-groups design), there is an alternative explanation because the outcome might be caused by the independent variable, but it also might be caused by the order in which the levels of the variable are presented. When there is an order effect, we do not know whether the independent variable is really having an effect, or whether the participants are just getting tired, bored, or well-practiced.

These types of threats are just the beginning. There are other ways (about twelve in total) in which a study might be at risk for a confound. Experimenters think about all of them, and they plan studies to avoid them. Normally, a well-designed experiment can prevent these threats and make strong causal statements.

The Really Bad Experiment (A Cautionary Tale)

Previous chapters have used examples of published studies to illustrate the material. In contrast, this chapter presents three fictional experiments. You will rarely encounter published studies like these because, unlike the designs in Chapter 10, the basic design behind these examples has so many internal validity problems.

Nikhil, a summer camp counselor and psychology major, has noticed that his current cabin of 15 boys is an especially rowdy bunch. He’s heard a change in diet might help them calm down, so he eliminates the sugary snacks and desserts from their meals for 2 days. As he expected, the boys are much quieter and calmer by the end of the week, after refined sugar has been eliminated from their diets.

Dr. Yuki has recruited a sample of 40 depressed women, all of whom are interested in receiving psychotherapy to treat their depression. She measures their level of depression using a standard depression inventory at the start of therapy. For 12 weeks, all the women participate in Dr. Yuki’s style of cognitive therapy. At the end of the 12-week session, she measures the women again and finds that on the whole, their levels of depression have significantly decreased.

A dormitory on a university campus has started a Go Green Facebook campaign, focused on persuading students to turn out the lights in their rooms when they’re not needed. Dorm residents receive e-mails and messages on Facebook that encourage energy-saving behaviors. At the start of the campaign, the head resident noted how many kilowatt hours the dorm was using by checking the electric meters on the building. At the end of the 2-month campaign, the head resident checks the meters again and finds that the usage has dropped. He compares the two measures (pretest and posttest) and finds they are significantly different.

Notice that all three of these examples fit the same template, as shown in Figure 11.1. If you graphed the data of the first two studies, they would look something like the two graphs in Figure 11.2. Consider the three examples: What alternative explanations can you think of for the results of each one?

The formal name for this kind of design is the one-group, pretest/posttest design. A researcher recruits one group of participants, measures them on a pretest, exposes them to a treatment, intervention, or change, and then measures them on a posttest. This design differs from the true pretest/posttest design you learned in Chapter 10, because it has only one group, not two. There is no comparison group. Therefore, a better name for this design might be “the really bad experiment.” Understanding why this design is problematic can help you learn about threats to internal validity and how to avoid them with better designs.

[Figure 11.1 diagrams: (A) Participants → Pretest measure (DV) → Treatment → Posttest measure (DV). (B) Campers → Measure rowdy behavior → Reduce sugar → Measure rowdy behavior. (C) Depressed women → Measure depression → Cognitive therapy → Measure depression.]

FIGURE 11.1 The really bad experiment. (A) A general diagram of the really bad experiment, or the one-group, pretest/posttest design. Unlike the pretest/posttest design, it has only one group: no comparison condition. (B, C) Possible ways to diagram two of the examples given in the text. Using these as a model, try sketching a diagram of the Go Green example.


Six Potential Internal Validity Threats in One-Group, Pretest/Posttest Designs

By the end of this chapter, you will have learned a total of twelve internal validity threats. Three of them we just reviewed: design confounds, selection effects, and order effects. Several of the internal validity threats apply especially to the really bad experiment, but are prevented with a good experimental design. These include maturation threats, history threats, regression threats, attrition threats, testing threats, and instrumentation threats. And the final three threats (observer bias, demand characteristics, and placebo effects) potentially apply to any study.


Why did the boys in Nikhil’s cabin start behaving better? Was it because they had eaten less sugar? Perhaps. An alternative explanation, however, is that most of them simply settled in, or “matured into,” the camp setting after they got used to the place. The boys’ behavior improved on its own; the low-sugar diet may have had nothing to do with it. Such an effect is called a maturation threat, a change in behavior that emerges more or less spontaneously over time. People adapt to changed environments; children get better at walking and talking; plants grow taller—but not because of any outside intervention. It just happens.

Similarly, the depressed women may have improved because the cognitive therapy was effective, but an alternative explanation is that a systematically high portion of them simply improved on their own. Sometimes the symptoms of depression or other disorders disappear, for no known reason, with time. This phenomenon, known as spontaneous remission, is a specific type of maturation.

FIGURE 11.2 Graphing the really bad experiment. The first two examples can be graphed this way. Using these as a model, try sketching a graph of the Go Green example.

[Figure 11.2 panels: rowdy behavior score at the beginning of the week and at the end of the week (after sugar-free diet); depression score pretherapy and posttherapy.]

Preventing Maturation Threats. Because the studies both Nikhil and Dr. Yuki conducted followed the model of the really bad experiment, there is no way of knowing whether the improvements they noticed were caused by maturation or by the treatments they administered. In contrast, if the two researchers had conducted true experiments (such as a pretest/posttest design, which, as you learned in Chapter 10, has at least two groups, not one), they would also have included an appropriate comparison group. Nikhil would have observed a comparison group of equally lively campers who did not switch to a low-sugar diet. Dr. Yuki would have studied a comparison group of women who started out equally depressed but did not receive the cognitive therapy. If the treatment groups improved significantly more than the comparison groups did, each researcher could essentially subtract out the effect of maturation when they interpret their results. Figure 11.3 illustrates the benefits of a comparison group in preventing a maturation threat for the depression study.
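The logic of subtracting out maturation can be sketched with simple arithmetic in Python. The depression scores below are invented for illustration (higher numbers mean more depression) and do not come from any study; the variable and function names are hypothetical as well.

```python
from statistics import mean

# Hypothetical depression scores for a two-group pretest/posttest design.
therapy_pre = [22, 25, 19, 24, 20]
therapy_post = [12, 15, 11, 14, 13]
no_therapy_pre = [23, 24, 20, 22, 21]
no_therapy_post = [19, 21, 17, 18, 20]

def mean_change(pre, post):
    """Average posttest score minus average pretest score."""
    return mean(post) - mean(pre)

therapy_change = mean_change(therapy_pre, therapy_post)
no_therapy_change = mean_change(no_therapy_pre, no_therapy_post)

# Improvement the comparison group shows without treatment is what
# maturation alone could explain; the remainder is what is left over.
beyond_maturation = therapy_change - no_therapy_change

print("Change in therapy group:   ", therapy_change)
print("Change in no-therapy group:", no_therapy_change)
print("Change beyond maturation:  ", beyond_maturation)
```

In this made-up example, both groups start at an average of 22. The no-therapy group drops 3 points on its own, and the therapy group drops 9 points, so 6 points of improvement remain after the comparison group's change is subtracted. A researcher would still need an inferential test to decide whether that leftover difference is statistically significant.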


Sometimes a threat to internal validity occurs not just because time has passed, but because something specific has happened between the pretest and posttest. In the third example, why did the dorm residents use less electricity? Was it the Go Green campaign? Perhaps. But a plausible alternative explanation is that the weather got cooler and most residents did not use air conditioning as much.

Why did the campers’ behavior improve? It could have been the low-sugar diet, but maybe they all started a difficult swimming course in the middle of the week and the exercise tired most of them out.

These alternative explanations are examples of history threats, which result from a “historical” or external factor that systematically affects most members of the treatment group at the same time as the treatment itself, making it unclear whether the change is caused by the treatment received. To be a history threat, the external factor must affect most people in the group in the same direction (systematically), not just a few people (unsystematically).

[Figure 11.3 graph: depression score over time for a no-therapy group and a therapy group. Annotations: the no-therapy comparison group’s depression level decreased over time, suggesting simple maturation or spontaneous improvement; the therapy group’s depression level decreased even more, indicating cognitive therapy worked in addition to maturation.]

FIGURE 11.3 Maturation threats. A pretest/posttest design would help control for the maturation threat in Dr. Yuki’s depression study.

Preventing History Threats. As with maturation threats, a comparison group can help control for history threats. In the Go Green study, the students would need to measure the kilowatt usage in another, comparable dormitory during the same 2 months, but not give the students in the second dorm the Go Green campaign materials. (This would be a pretest/posttest design rather than a one-group pretest/posttest design.) If both groups decreased their electricity usage about the same over time (Figure 11.4A), the decrease probably resulted from the change of seasons, not from the Go Green campaign. However, if the treatment group decreased its usage more than the comparison group did (Figure 11.4B), you can rule out the history threat. Both the comparison group and the treatment group should experience the same seasonal “historical” changes, so including the comparison group controls for this threat.


A regression threat refers to a statistical concept called regression to the mean. When a group average (mean) is unusually extreme at Time 1, the next time that group is measured (Time 2), it is likely to be less extreme—closer to its typical or average performance.

Everyday Regression to the Mean. Real-world situations can help illustrate regression to the mean. For example, during the 2014 World Cup semifinal, the men’s team from Germany outscored the team from Brazil 7–1. That’s a huge score; soccer (football) teams hardly ever score 7 goals in a game. Without being familiar with either team, people who know about soccer would predict that in their next game, Germany would score fewer than 7 goals. Why? Simply because most people have an intuitive understanding of regression to the mean.

Here’s the statistical explanation. Germany’s score in the semifinal was exceptionally high partly because of the team’s talent, and partly because of a unique combination of random factors that happened to come out in Germany’s favor. The German team’s injury level was, just by chance, much lower than usual, while Brazil had one star player out with an injury and their captain had been benched for yellow cards in previous games. The weather may have favored Germany as well, and Brazil may have felt unusual pressure on their home field. Therefore, despite Germany’s legitimate talent as a team, they also benefited from randomness—a chance combination of lucky events that would probably never happen in the same combination again. Overall, the team’s score

❯❯ For more detail on arithmetic mean, see Statistics Review: Descriptive Statistics, p. 461.

FIGURE 11.4 History threats. A comparison group would help control for the history threat of seasonal differences in electricity usage. Each panel plots kilowatt hours used in September and November for a dorm exposed to the Go Green campaign in October and a dorm with no campaign. (A) All dorms reduced energy usage by the same amount during fall, indicating the Go Green campaign did not work. (B) Energy usage decreased for both groups, but the Go Green dorm’s usage decreased even more, indicating the campaign worked in addition to the weather.

❯❯ For more on pretest/posttest design, see Chapter 10, pp. 288–289.

317Threats to Internal Validity: Did the Independent Variable Really Cause the Difference?

in the subsequent game would almost necessarily be worse than in this game. Indeed, the team did regress; they beat Argentina in the final, but the score was only 1–0. In other words, Germany finished closer to their average level of performance.

Here’s another example. Suppose you’re normally cheerful and happy. On any given day, though, your usual upbeat mood can be affected by random factors, such as the weather, your friends’ moods, and even parking problems. Every once in a while, just by chance, several of these random factors will affect you negatively: It will pour rain, your friends will be grumpy, and you won’t be able to find a parking space. Your day is terrible! The good news is that tomorrow will almost certainly be better because those random factors are unlikely to occur in that same, unlucky combination again. It might still be raining, but your friends won’t be grumpy, and you’ll quickly find a good parking space. If even one of these factors is different, your day will go better and you will regress toward your average, happy mean.

Regression works at both extremes. An unusually good performance or outcome is likely to regress downward (toward its mean) the next time. And an unusually bad performance or outcome is likely to regress upward (toward its mean) the next time. Either extreme is explainable by an unusually lucky, or an unusually unlucky, combination of random events.
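A small simulation can make this idea concrete. The sketch below is only an illustration under stated assumptions: each score is modeled as a stable component plus random factors that are drawn fresh at each measurement, and the function name `simulate_regression`, the population mean of 50, and the standard deviations are all hypothetical choices, not values from any study.

```python
import random
import statistics


def simulate_regression(n=10_000, fraction=0.05, extreme="high", seed=1):
    """Return (time1_mean, time2_mean) for the cases that were extreme at Time 1."""
    rng = random.Random(seed)
    # Assumed model: score = stable component + random factors that change each time.
    stable = [rng.gauss(50, 10) for _ in range(n)]
    time1 = [s + rng.gauss(0, 10) for s in stable]
    time2 = [s + rng.gauss(0, 10) for s in stable]

    # Select cases using their Time 1 scores only (highest or lowest fraction).
    order = sorted(range(n), key=lambda i: time1[i], reverse=(extreme == "high"))
    selected = order[: int(n * fraction)]

    mean1 = statistics.mean([time1[i] for i in selected])
    mean2 = statistics.mean([time2[i] for i in selected])
    return mean1, mean2


high1, high2 = simulate_regression(extreme="high")
low1, low2 = simulate_regression(extreme="low")
print(f"Selected as high: Time 1 = {high1:.1f}, Time 2 = {high2:.1f}")
print(f"Selected as low:  Time 1 = {low1:.1f}, Time 2 = {low2:.1f}")
```

In this sketch nothing is done to the selected cases between the two measurements, yet the high group drifts down and the low group drifts up toward the overall mean, because the lucky or unlucky random factors are not repeated.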

Regression and Internal Validity. Regression threats occur only when a group is measured twice, and only when the group has an extreme score at pretest. If the group has been selected because of its unusually high or low group mean at pretest, you can expect them to regress toward the mean somewhat when it comes time for the posttest.

You might suspect that the 40 depressed women Dr. Yuki studied were, as a group, quite depressed. Their group average at pretest may have been partly due to their true, baseline level of depression, but the group was also selected because they were extreme on the pretest. In a group of people who are seeking treatment for depression, a large proportion are feeling especially depressed at that moment, partly because of random events (e.g., the winter blues, a recent illness, family or relationship problems, job loss, divorce). At the posttest, the same unlucky combination of random events probably would not recur (maybe some saw their relationships get better, or the job situation improved for a few), so the posttest depression average would go down. The group’s change would occur not because of the treatment, but simply because of regression to the mean, so in this case there would be an internal validity threat.

Preventing Regression Threats. Once again, comparison groups can help researchers prevent regression threats, along with a careful inspection of the pattern of results. If the comparison group and the experimental group are equally extreme at pretest, the researchers can account for any regression effects in their results.


In Figure 11.5A, you can rule out regression and conclude that the therapy really does work: If regression played a role, it would have done so for both groups because they were equally at risk for regression at the start. In contrast, if you saw the pattern of results shown in Figure 11.5B, you would suspect that regression had occurred. Regression is a particular threat in exactly this situation—when one group has been selected for its extreme mean. In Figure 11.5C, in contrast, the therapy group started out more extreme on depression, and therefore probably

FIGURE 11.5 Regression threats to internal validity. Regression to the mean can be analyzed by inspecting different patterns of results. Each panel plots depression score at pretherapy and posttherapy for a therapy group and a no-therapy group. (A) Regression effects can be ruled out; both groups started out equally extreme at pretest, but the therapy group improved even more than the no-therapy group. (B) Regression is possible because the therapy group started out more extreme than the no-therapy group. Extreme pretest scores can be influenced by chance events. (C) The therapy group started out more extreme, but regression alone does not make an extreme group cross over the mean toward the other extreme. Therapy must have also had some influence.


regressed to the mean. However, regression probably can’t make a group cross over the comparison group, so the pattern shows an effect of therapy, in addition to a little help from regression effects.


Why did the average level of rowdy behavior in Nikhil’s campers decrease over the course of the week? It could have been because of the low-sugar diet, but maybe it was because the most unruly camper had to leave camp early.

Similarly, the level of depression among Dr. Yuki’s patients might have decreased because of the cognitive therapy, but it might have been because three of the most depressed women in the study could not maintain the treatment regimen and dropped out of the study. The posttest average is lower only because these extra-high scores are not included.

In studies that have a pretest and a posttest, attrition (sometimes referred to as mortality) is a reduction in participant numbers that occurs when people drop out before the end. Attrition can happen when a pretest and posttest are administered on separate days and some participants are not available on the second day. An attrition threat becomes a problem for internal validity when attrition is systematic; that is, when only a certain kind of participant drops out. If any random camper leaves midweek, it might not be a problem for Nikhil’s research, but it is a problem when the rowdiest camper leaves early. His departure creates an alternative explanation for Nikhil’s results: Was the posttest average lower because the low-sugar diet worked, or because one extreme score is gone?

Similarly, as shown in Figure 11.6, it would not be unusual if two of 40 women in the depression therapy study dropped out over time. However, if the two most depressed women systematically drop out, the mean for the posttest is going to be lower, only because it does not include these two extreme scores (not because of the therapy). Therefore, if the depression score goes down from pretest to posttest, you wouldn’t know whether the decrease occurred because












FIGURE 11.6 Attrition threats. (A) If two people (noted by blue dots) drop out of a study, both of whom scored at the high end of the distribution on the pretest, the group mean changes substantially when their scores are omitted, even if all other scores stay the same (Time 1 scores, full sample: M = 5.71; Time 2 scores with two fewer people: M = 4.32). (B) If the dropouts’ scores on the pretest are close to the group mean, removing their scores does not change the group mean as much (Time 1 scores, full sample: M = 5.71; Time 2 scores with two fewer people: M = 5.77).


of the therapy or because of the alternative explanation—that the highest-scoring women had dropped out.

Preventing Attrition Threats. An attrition threat is fairly easy for researchers to identify and correct. When participants drop out of a study, most researchers will remove those participants’ scores from the pretest average too. That way, they look only at the scores of those who completed both parts of the study. Another approach is to check the pretest scores of the dropouts. If they have extreme scores on the pretest, their attrition is more of a threat to internal validity than if their scores are closer to the group average.
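The two checks described above can be sketched in a few lines. This is a hedged illustration, not a standard procedure from any particular package: the scores, the participant IDs, and the helper names `mean_change` and `dropout_pretest_scores` are all hypothetical, and the example assumes that the remaining participants' scores did not change at all.

```python
import statistics

# Hypothetical depression scores keyed by participant ID.
pretest = {"p1": 9, "p2": 8, "p3": 5, "p4": 4, "p5": 4}
# p1 and p2 (the two highest pretest scorers) dropped out before the posttest.
posttest = {"p3": 5, "p4": 4, "p5": 4}


def mean_change(pre, post, completers_only):
    """Posttest mean minus pretest mean, optionally using completers' pretest scores only."""
    if completers_only:
        pre_scores = [pre[pid] for pid in post]
    else:
        pre_scores = list(pre.values())
    return statistics.mean(list(post.values())) - statistics.mean(pre_scores)


def dropout_pretest_scores(pre, post):
    """Pretest scores of everyone who has no posttest score."""
    return {pid: score for pid, score in pre.items() if pid not in post}


print("Change using the full pretest sample:", round(mean_change(pretest, posttest, False), 2))
print("Change using completers only:", round(mean_change(pretest, posttest, True), 2))
print("Dropouts' pretest scores:", dropout_pretest_scores(pretest, posttest))
```

With these made-up numbers, the full-sample comparison shows an apparent drop of about 1.67 points even though no completer changed, while the completers-only comparison shows no change. Inspecting the dropouts' pretest scores (9 and 8, against a pretest mean of 6) shows why.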


A testing threat, a specific kind of order effect, refers to a change in the participants as a result of taking a test (dependent measure) more than once. People might have become more practiced at taking the test, leading to improved scores, or they may become fatigued or bored, which could lead to worse scores over time. Therefore, testing threats include practice effects (see Chapter 10).

In an educational setting, for example, students might perform better on a posttest than on a pretest, but not because of any educational intervention. Instead, perhaps they were inexperienced the first time they took the test, and they did better on the posttest simply because they had more practice the second time around.

Preventing Testing Threats. To avoid testing threats, researchers might abandon a pretest altogether and use a posttest-only design (see Chapter 10). If they do use a pretest, they might opt to use alternative forms of the test for the two measurements. The two forms might both measure depression, for example, but use different items to do so. A comparison group can also help. If the comparison group takes the same pretest and posttest but the treatment group shows an even larger change, testing threats can be ruled out (Figure 11.7).
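The comparison-group logic amounts to subtracting one group's gain from the other's. The sketch below uses hypothetical ability scores and hypothetical helper names (`gain`, `gain_beyond_comparison`); it compares means only and is not a substitute for a statistical test.

```python
import statistics

# Hypothetical ability scores; both groups took the same pretest and posttest.
treatment = {"pre": [50, 52, 48, 50], "post": [60, 63, 57, 60]}
comparison = {"pre": [51, 49, 50, 50], "post": [54, 52, 53, 53]}


def gain(group):
    """Mean posttest score minus mean pretest score for one group."""
    return statistics.mean(group["post"]) - statistics.mean(group["pre"])


def gain_beyond_comparison(treated, compared):
    """How much more the treatment group gained than the comparison group."""
    return gain(treated) - gain(compared)


print("Treatment gain:", gain(treatment))
print("Comparison gain (practice and other shared influences):", gain(comparison))
print("Gain beyond the comparison group:", gain_beyond_comparison(treatment, comparison))
```

Here the comparison group's 3-point gain estimates what practice alone might produce, so only the remaining 7 points of the treatment group's 10-point gain would be a candidate for a treatment effect.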

FIGURE 11.7 Testing threats. (A) If there is no comparison group, it’s hard to know whether the improvement from pretest to posttest is caused by the treatment or simply by practice. (B) The results from a comparison group can help rule out testing threats. Both groups might improve, but the treatment group improves even more, suggesting that both practice and a true effect of the treatment are causing the improvement. Both panels plot ability score at pretest and posttest; panel B shows a treatment group and a comparison group.



An instrumentation threat occurs when a measuring instrument changes over time. In observational research, the people who are coding behaviors are the measuring instrument, and over a period of time, they might change their standards for judging behavior by becoming more strict or more lenient. Thus, maybe Nikhil’s campers did not really become less disruptive; instead, the people judging the campers’ behavior became more tolerant of shoving and hitting.

Another case of an instrumentation threat would be when a researcher uses different forms for the pretest and posttest, but the two forms are not sufficiently equivalent. Dr. Yuki might have used a measure of depression at pretest on which people tend to score a little higher, and another measure of depression at posttest that tends to yield lower scores. As a result, the pattern she observed was not a sign of how good the cognitive therapy is, but merely reflected the way the alternative forms of the test are calibrated.

Preventing Instrumentation Threats. To prevent instrumentation threats, researchers can switch to a posttest-only design, in which behavior is measured only once. If they keep the pretest, they should take steps to ensure that the pretest and posttest measures are equivalent. To do so, they might collect data from each instrument to be sure the two are calibrated the same. To avoid shifting standards of behavioral coders, researchers might retrain their coders throughout the experiment, establishing their reliability and validity at both pretest and posttest. Using clear coding manuals would be an important part of this process.

Finally, to control for the problem of different forms, Dr. Yuki could also counterbalance the versions of the test, giving some participants version A at pretest and version B at posttest, and giving other participants version B at pretest and version A at posttest.
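One way to set up such counterbalancing is to split participants at random into two form orders. The function name `counterbalance`, the participant numbering, and the fixed seed below are assumptions made for illustration only.

```python
import random


def counterbalance(participant_ids, seed=0):
    """Randomly give half the participants form A then B, and the other half B then A."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # random split, reproducible for this sketch
    half = len(ids) // 2
    orders = {}
    for pid in ids[:half]:
        orders[pid] = {"pretest": "A", "posttest": "B"}
    for pid in ids[half:]:
        orders[pid] = {"pretest": "B", "posttest": "A"}
    return orders


# Hypothetical sample of 40 participants numbered 1 to 40.
orders = counterbalance(range(1, 41))
print(orders[1], orders[2])
```

Because each form appears equally often at pretest and at posttest, any difference in how the two forms are calibrated should cancel out of the average pretest-to-posttest change.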

Instrumentation vs. Testing Threats. Because these two threats are pretty similar, here’s a way to remember the difference. An instrumentation threat means the measuring instrument has changed from Time 1 to Time 2. A testing threat means the participants change over time from having been tested before.


You have learned throughout this discussion that true pretest/posttest designs (those with two or more groups) normally take care of many internal validity threats. However, in some cases, a study with a pretest/posttest design might combine selection threats with history or attrition threats. In a selection-history threat, an outside event or factor affects only those at one level of the independent variable. For example, perhaps the dorm that was used as a comparison group was undergoing construction, and the construction crew used electric tools that drew on only that dorm’s power supply. Therefore, if the Go Green dorm used less energy than the comparison dorm, the researcher won’t be sure why: Was it because the Go Green campaign reduced student energy usage? Or was it only because the comparison group dorm used so many power tools?


Similarly, in a selection-attrition threat, only one of the experimental groups experiences attrition. If Dr. Yuki conducted her depression therapy experiment as a pretest/posttest design, it might be the case that the most severely depressed people dropped out, but only from the treatment group, not the control group. The treatment might have been especially arduous for the most depressed people, so they dropped out of the study. Because the control group was not undergoing treatment, they were not susceptible to the same level of attrition. Therefore, selection and attrition can combine to make Dr. Yuki unsure: Did the cognitive therapy really work, compared to the control group? Or is it just that the most severely depressed people dropped out of the treatment group?

Three Potential Internal Validity Threats in Any Study

Many internal validity threats are likely to occur in the one-group pretest/posttest design, and these threats can often be examined simply by adding a comparison group. Doing so would result in a pretest/posttest design. The posttest-only design is another option (see Chapter 10). However, three more threats to internal validity (observer bias, demand characteristics, and placebo effects) might apply even for designs with a clear comparison group.


Observer bias can be a threat to internal validity in almost any study in which there is a behavioral dependent variable. Observer bias occurs when researchers’ expectations influence their interpretation of the results. For example, Dr. Yuki might be a biased observer of her patients’ depression: She expects to see her patients improve, whether they do or do not. Nikhil may be a biased observer of his campers: He may expect the low-sugar diet to work, so he views the boys’ posttest behavior more positively.

Although comparison groups can prevent many threats to internal validity, they do not necessarily control for observer bias. Even if Dr. Yuki used a no-therapy comparison group, observer bias could still occur: If she knew which participants were in which group, her biases could lead her to see more improvement in the therapy group than in the comparison group.

Observer bias can threaten two kinds of validity in an experiment. It threatens internal validity because an alternative explanation exists for the results. Did the therapy work, or was Dr. Yuki biased? It can also threaten the construct validity of the dependent variable because it means the depression ratings given by Dr. Yuki do not represent the true levels of depression of her participants.


Demand characteristics are a problem when participants guess what the study is supposed to be about and change their behavior in the expected direction.

❯❯ For more on observer bias, see Chapter 6, p. 169.


For example, Dr. Yuki’s patients know they are getting therapy. If they think Dr. Yuki expects them to get better, they might change their self-reports of symptoms in the expected direction. Nikhil’s campers, too, might realize something fishy is going on when they’re not given their usual snacks. Their awareness of a menu change could certainly change the way they behave.

Controlling for Observer Bias and Demand Characteristics. To avoid observer bias and demand characteristics, researchers must do more than add a comparison group to their studies. The most appropriate way to avoid such problems is to conduct a double-blind study, in which neither the participants nor the researchers who evaluate them know who is in the treatment group and who is in the comparison group.

Suppose Nikhil decides to test his hypothesis as a double-blind study. He could arrange to have two cabins of equally lively campers and, for only one group, replace their sugary snacks with good-tasting low-sugar versions. The boys would not know which kind of snacks they were eating, and the people observing their behavior would also be blind to which boys were in which group.

When a double-blind study is not possible, a variation might be an acceptable alternative. In some studies, participants know which group they are in, but the observers do not; this is called a masked design, or blind design (see Chapter 6). The students exposed to the Go Green campaign would certainly be aware that someone was trying to influence their behavior. Ideally, however, the raters who were recording their electrical energy usage should not know which dorm was exposed to the campaign and which was not. Of course, keeping observers unaware is even more important when they are rating behaviors that are more difficult to code, such as symptoms of depression or behavior problems at camp.
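In practice, keeping observers unaware often comes down to bookkeeping: raters see only opaque codes, and the key linking codes to conditions is held by someone else until rating is finished. The sketch below illustrates that idea under stated assumptions; the function name `mask_conditions`, the code format, and the camper labels are hypothetical.

```python
import random


def mask_conditions(assignments, seed=0):
    """Return (codes shown to raters, sealed key linking codes to participants and conditions)."""
    rng = random.Random(seed)
    participants = list(assignments)
    rng.shuffle(participants)  # so the order of codes does not hint at condition
    key = {}
    for number, pid in enumerate(participants, start=1):
        key[f"R{number:03d}"] = {"participant": pid, "condition": assignments[pid]}
    rater_codes = sorted(key)  # raters receive the codes only, never the key
    return rater_codes, key


# Hypothetical assignment of four campers to snack conditions.
assignments = {
    "camper_1": "low_sugar",
    "camper_2": "usual_snacks",
    "camper_3": "low_sugar",
    "camper_4": "usual_snacks",
}
rater_codes, sealed_key = mask_conditions(assignments)
print(rater_codes)
```

The raters would record behavior against the codes, and the researcher would merge in the sealed key only after all ratings were complete.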

Recall the Chapter 10 study by Mueller and Oppenheimer (2014) in which people took notes in longhand or on laptops. The research assistants in that study were blind to the condition each participant was in when they graded their tests on the lectures. The participants themselves were not blind to their notetaking method. However, since the test-takers participated in only one condition (an independent-groups design), they were not aware that the form of notetaking was an important feature of the experiment. Therefore, they were blind to the reason they were taking notes in longhand or on a laptop.


The women who received Dr. Yuki’s cognitive therapy may have improved because her therapeutic approach really works. An alternative explanation is that there was a placebo effect: The women improved simply because they believed that they were receiving an effective treatment.

A placebo effect occurs when people receive a treatment and really improve, but only because the recipients believe they are receiving a valid treatment. In most studies on the effectiveness of medications, for example, one group receives a pill or an injection with the real drug, while another group receives a

❮❮ For more on demand characteristics, see Chapter 10, p. 297.


pill or an injection with no active ingredients, such as a sugar pill or a saline solution. People can even receive placebo psychotherapy, in which they simply talk to a friendly listener about their problems; these placebo conversations have no therapeutic structure. The inert pill, injection, or therapy is the placebo. Often people who receive the placebo see their symptoms improve because they believe the treatment they are receiving is supposed to be effective. In fact, the placebo effect can occur whenever any kind of treatment is used to control symptoms, such as an herbal remedy to enhance wellness (Figure 11.8).

Placebo effects are not imaginary. Placebos have been shown to reduce real symptoms, both psychological and physical, including depression (Kirsch & Sapirstein, 1998); postoperative pain or anxiety (Benedetti, Amanzio, Vighetti, & Asteggiano, 2006); terminal cancer pain; and epilepsy (Beecher, 1955). They are not always beneficial or harmless; physical side effects, including skin rashes and headaches, can be caused by placebos, too. People’s symptoms appear to respond not just to the active ingredients in medications or to psychotherapy, but also to their belief in what the treatment can do to improve their situation.

A placebo can be strong medicine. Kirsch and Sapirstein (1998) reviewed studies that gave either antidepressant medication, such as Prozac, or a placebo to depressed patients, and concluded that the placebo groups improved almost as much as groups that received real medicine. In fact, up to 75% of the depression improvement in the Prozac groups was also achieved in placebo groups.

Designing Studies to Rule Out the Placebo Effect. To determine whether an effect is caused by a therapeutic treatment or by placebo effects, the standard approach is to include a special kind of comparison group. As usual, one group receives the real drug or real therapy, and the second group receives the placebo drug or placebo therapy. Crucially, however, neither the people treating the patients nor the patients themselves know whether they are in the real group or the placebo group. This experimental design is called a double-blind placebo control study.

The results of such a study might look like the graph in Figure 11.9. Notice that both groups improved, but the group receiving the real drug improved even more,

FIGURE 11.8 Are herbal remedies placebos? It is possible that perceived improvements in mood, joint pain, or wellness promised by herbal supplements are simply due to the belief that they will work, not because of the specific ingredients they contain.


FIGURE 11.9 A double-blind placebo control study. Adding a placebo comparison group can help researchers separate a potential placebo effect from the true effect of a particular therapy. The graph compares pretest and posttest ratings for a true therapy group and a placebo therapy group.


showing placebo effects plus the effects of the real drug. If the results turn out like this, the researchers can conclude that the treatment they are testing does cause improvement above and beyond a placebo effect. Once again, an internal validity threat (a placebo effect) can be avoided with a careful research design.

Is That Really a Placebo Effect? If you thought about it carefully, you probably noticed that the results in Figure 11.9 do not definitively show a placebo effect pattern. Both the group receiving the real drug and the group receiving the placebo improved over time. However, some of the improvement in both groups could have been caused by maturation, history, regression, testing, or instrumentation threats (Kienle & Kiene, 1997). If you were interested in showing a placebo effect specifically, you would have to include a no-treatment comparison group, one that receives neither drug nor placebo.

Suppose your results looked something like those in Figure 11.10. Because the placebo group improved over time, even more than the no-therapy/no-placebo group, you can attribute the improvement to placebo and not just to maturation, history, regression, testing, or instrumentation.
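The three-group logic can be written out as simple subtraction. The sketch below uses hypothetical group means and a hypothetical helper, `decompose`, and it assumes the influences simply add together, which is a simplification for illustration.

```python
# Hypothetical mean symptom ratings (higher = more symptoms) for three groups.
means = {
    "true_therapy": {"pre": 20.0, "post": 10.0},
    "placebo": {"pre": 20.0, "post": 14.0},
    "no_treatment": {"pre": 20.0, "post": 18.0},
}


def improvement(group):
    """Pretest mean minus posttest mean (positive = fewer symptoms at posttest)."""
    return group["pre"] - group["post"]


def decompose(group_means):
    """Split improvement into parts, assuming the influences are additive."""
    none = improvement(group_means["no_treatment"])
    placebo = improvement(group_means["placebo"])
    therapy = improvement(group_means["true_therapy"])
    return {
        # Change seen with no treatment at all: maturation, history, regression,
        # testing, or instrumentation could all contribute here.
        "other_threats": none,
        # Extra improvement from believing one is being treated.
        "placebo_effect": placebo - none,
        # Extra improvement from the therapy itself, beyond the placebo.
        "therapy_beyond_placebo": therapy - placebo,
    }


print(decompose(means))
```

With these invented means, 2 points of improvement occur with no treatment, the placebo adds 4 more, and the true therapy adds another 4 beyond the placebo.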

With So Many Threats, Are Experiments Still Useful? After reading about a dozen ways a good experiment can go wrong, you might be tempted to assume that most experiments you read about are faulty. However, responsible researchers consciously avoid internal validity threats when they design and interpret their work. Many of the threats discussed in this chapter are a problem only in one-group pretest/posttest studies (those with no comparison group). As shown in the Working It Through section (p. 328), a carefully designed comparison group will correct for many of these threats. The section analyzes the study on mindfulness (Mrazek, Franklin, Phillips, Baird, & Schooler, 2013), discussed in Chapter 10 and presented again here in Figure 11.11.


FIGURE 11.10 Identifying a placebo effect. Definitively showing a placebo effect requires three groups: one receiving the true therapy, one receiving the placebo, and one receiving no therapy. If there is a placebo effect, the pattern of results will show that the no-therapy group does not improve as much as the placebo group. The graph compares pretest and posttest ratings for the true therapy, placebo therapy, and no therapy groups.

FIGURE 11.11 Mindfulness study results. This study showed that mindfulness classes, but not nutrition classes, were associated with an increase in GRE scores. Can the study rule out all twelve internal validity threats and support a causal claim? (Source: Mrazek et al., 2013, Fig. 1A.)



TABLE 11.1
Asking About Internal Validity Threats in Experiments

Design confound
Definition: A second variable that unintentionally varies systematically with the independent variable.
Example: From Chapter 10: If pasta served in a large bowl appeared more appetizing than pasta served in a medium bowl.
Questions to ask: Did the researchers turn potential third variables into control variables, for example, keeping the pasta recipe constant?

Selection effect
Definition: In an independent-groups design, when the two independent variable groups have systematically different kinds of participants in them.
Example: From Chapter 10: In the autism study, some parents insisted they wanted their children to be in the intensive-treatment group rather than the control group.
Questions to ask: Did the researchers use random assignment or matched groups to equalize groups?

Order effect
Definition: In a repeated-measures design, when the effect of the independent variable is confounded with carryover from one level to the other, or with practice, fatigue, or boredom.
Example: From Chapter 10: People rated the shared chocolate higher only because the first taste of chocolate is always more delicious than the second one.
Questions to ask: Did the researchers counterbalance the orders of presentation of the levels of the independent variable?

Maturation
Definition: An experimental group improves over time only because of natural development or spontaneous improvement.
Example: Disruptive boys settle down as they get used to the camp setting.
Questions to ask: Did the researchers use a comparison group of boys who had an equal amount of time to mature but who did not receive the treatment?

History
Definition: An experimental group changes over time because of an external factor that affects all or most members of the group.
Example: Dorm residents use less air conditioning in November than September because the weather is cooler.
Questions to ask: Did the researchers include a comparison group that had an equal exposure to the external factor but did not receive the treatment?

Regression to the mean
Definition: An experimental group whose average is extremely low (or high) at pretest will get better (or worse) over time because the random events that caused the extreme pretest scores do not recur the same way at posttest.
Example: A group’s average is extremely depressed at pretest, in part because some members volunteered for therapy when they were feeling much more depressed than usual.
Questions to ask: Did the researchers include a comparison group that was equally extreme at pretest but did not receive the therapy?

Attrition
Definition: An experimental group changes over time, but only because the most extreme cases have systematically dropped out and their scores are not included in the posttest.
Example: Because the rowdiest boy in the cabin leaves camp early, his unruly behavior affects the pretest mean but not the posttest mean.
Questions to ask: Did the researchers compute the pretest and posttest scores with only the final sample included, removing any dropouts’ data from the pretest group average?

Table 11.1 summarizes the internal validity threats in Chapters 10 and 11, and suggests ways to find out whether a particular study is vulnerable.



Testing
Definition: A type of order effect: An experimental group changes over time because repeated testing has affected the participants. Practice effects (fatigue effects) are one subtype.
Example: GRE verbal scores improve only because students take the same version of the test both times and therefore are more practiced at posttest.
Questions to ask: Did the researchers have a comparison group take the same two tests? Did they use a posttest-only design, or did they use alternative forms of the measure for the pretest and posttest?

Instrumentation
Definition: An experimental group changes over time, but only because the measurement instrument has changed.
Example: Coders get more lenient over time, so the same behavior is coded as less disruptive at posttest than at pretest.
Questions to ask: Did the researchers train coders to use the same standards when coding? Are pretest and posttest measures demonstrably equivalent?

Observer bias
Definition: An experimental group’s ratings differ from a comparison group’s, but only because the researcher expects the groups’ ratings to differ.
Example: The researcher expects a low-sugar diet to decrease the campers’ unruly behavior, so he notices only calm behavior and ignores wild behavior.
Questions to ask: Were the observers of the dependent variable unaware of which condition participants were in?

Demand characteristic
Definition: Participants guess what the study’s purpose is and change their behavior in the expected direction.
Example: Campers guess that the low-sugar diet is supposed to make them calmer, so they change their behavior accordingly.
Questions to ask: Were the participants kept unaware of the purpose of the study? Was it an independent-groups design, which makes participants less able to guess the study’s purpose?

Placebo effect
Definition: Participants in an experimental group improve only because they believe in the efficacy of the therapy or drug they receive.
Example: Women receiving cognitive therapy improve simply because they believe the therapy will work for them.
Questions to ask: Did a comparison group receive a placebo (inert) drug or a placebo therapy?


1. How does a one-group pretest/posttest design differ from a pretest/posttest design, and which threats to internal validity are especially applicable to this design?

2. Using Table 11.1 as a guide, indicate which of the internal validity threats would be relevant even to a (two-group) posttest-only design.

1. See pp. 312–322. 2. See pp. 326–327.


Did Mindfulness Training Really Cause GRE Scores to Improve?

In Chapter 10, you read about a pretest/posttest design in which students were randomly assigned to a mindfulness training course or to a nutrition course (Mrazek et al., 2013). Students took GRE verbal tests both before and after their assigned training course. Those assigned to the mindfulness course scored significantly higher on the GRE posttest than pretest. The authors would like to claim that the mindfulness course caused the improvement in GRE scores. Does this study rule out internal validity threats?


Is the study susceptible to any of these internal validity threats?

Design confound

The paper reports that classes met for 45 minutes four times a week for 2 weeks and were taught by professionals with extensive teaching experience in their respective fields. “Both classes were taught by expert instructors, were composed of similar numbers of students, were held in comparable classrooms during the late afternoon, and used a similar class format, including both lectures and group discussions” (p. 778).

These passages indicate that the classes were equal in their time commitment, the quality of the instructors used, and other factors, so these are not design confounds. It appears the two classes did not accidentally vary on anything besides their mindfulness or nutrition content.

Selection effect The article reports that “students . . . were randomly assigned to either a mindfulness class . . . or a nutrition class” (p. 777).

Random assignment controls for selection effects, so selection is not a threat in the study.


329Threats to Internal Validity: Did the Independent Variable Really Cause the Difference?


Order effect Order effects are relevant only for repeated-measures designs, not independent-groups designs like this one.

Maturation threat While it’s possible that people could simply get better at the GRE over time, maturation would have happened to the nutrition group as well (but it did not). We can rule out maturation.

History threat Could some outside event, such as a free GRE prep course on campus, have improved people’s GRE scores? We can rule out such a history threat because of the comparison group: It’s unlikely a campus GRE program would just happen to be offered only to students in the mindfulness group.

Regression threat A regression threat is unlikely here. First, the students were randomly assigned to the mindfulness group, not selected on the basis of extremely low GRE scores. Second, the mindfulness group and the nutrition group had the same pretest means. They were equally extreme, so if regression had affected one group, it would also have affected the other.

Attrition threat There’s no indication in the paper that any participants dropped out between pretest and posttest.

Because all participants apparently completed the study, attrition is not a threat.

Testing threat Participants did take the verbal GRE two times, but if their improvement was simply due to practice, we would see a similar increase in the nutrition group, and we do not.

Instrumentation threat The study reports, “We used two versions of the verbal GRE measure that were matched for difficulty and counterbalanced within each condition” (p. 777).

The described procedure controls for any difference in test difficulty from pretest to posttest.

Observer bias “We minimized experimenter expectancy effects by testing participants in mixed-condition groups in which nearly all task instructions were provided by computers” (p. 778).

Experimenter expectancy is another name for observer bias. These procedures seem to be reasonable ways to prevent an experimenter from leading participants in one group to be more motivated to do well on the dependent measure.

Demand characteristics or placebo effects

“All participants were recruited under the pretense that the study was a direct comparison of two equally viable programs for improving cognitive performance, which minimized motivation and placebo effects” (p. 778).

This statement argues that all students expected their assigned program to be effective. If true, then placebo effects and demand characteristics were equal in both conditions.

This study’s design and results have controlled for virtually all the internal validity threats in Table 11.1, so we can conclude its internal validity is strong and the study supports the claim that mindfulness training improved students’ GRE verbal scores. (Next you could interrogate this study’s construct, statistical, and external validity!)


INTERROGATING NULL EFFECTS: WHAT IF THE INDEPENDENT VARIABLE DOES NOT MAKE A DIFFERENCE?

So far, this chapter has discussed cases in which a researcher works to ensure that any covariance found in an experiment was caused by the independent variable, not by a threat to internal validity. But what if the independent variable did not make a difference in the dependent variable; that is, what if there is no significant covariance between the two? That outcome is known as a null effect, also referred to as a null result.

You might not read about null effects very often. Journals, newspapers, and websites are much more likely to report the results of a study in which the independent variable does have an effect. However, studies that find null effects are surprisingly common, something many students learn when they start to conduct their own studies. Often, researchers who get a null result will say their study “didn’t work.” What might null effects mean?

Here are three hypothetical examples:

Many people believe having more money will make them happy. But will it? A researcher designed an experiment in which he randomly assigned people to three groups. He gave one group nothing, gave the second group a little money, and gave the third group a lot of money. The next day, he asked each group to report their happiness on a mood scale. The group who received cash (either a little or a lot) was not significantly happier, or in a better mood, than the group who received nothing.

Do online reading games make kids better readers? An educational psychologist recruited a sample of 5-year-olds, all of whom did not yet know how to read. She randomly assigned the children to two groups. One group played with a commercially available online reading game for 1 week (about 30 minutes per day), and the other group continued “treatment as usual,” attending their normal kindergarten classes. Afterward, the children were tested on their reading ability. The reading game group’s scores were a little higher than those of the kindergarten-as-usual group, but the difference was not statistically significant.

Researchers have hypothesized that feeling anxious can cause people to reason less carefully and logically. To test this hypothesis, a research team randomly assigned people to three groups: low, medium, and high anxiety. After a few minutes of being exposed to the anxiety manipulation, the participants solved problems requiring logic, rather than emotional reasoning. Although the researchers had predicted the anxious people would do worse on the problems, participants in the three groups scored roughly the same.


These three examples of null effects, shown as graphs in Figure 11.12, are all posttest-only designs. However, a null effect can happen in a within-groups design or a pretest/posttest design, too (and even in a correlational study). In all three of these cases, the independent variable manipulated by the experimenters did not result in a change in the dependent variable. Why didn’t these experiments show covariance between the independent and dependent variables?

Any time an experiment gives a null result, it might be the case that the independent variable really does not affect the dependent variable. In the real world, perhaps money does not make people happier, online reading games do not improve kids’ reading skill, and being anxious does not affect logical reasoning. In other words, the experiment gave an accurate result, showing that the manipulation the researchers used did not cause a change in the dependent variable. Importantly, therefore, when we obtain a null result, it can mean our theory is incorrect.

Another possible reason for a null effect is that the study was not designed or conducted carefully enough. The independent variable actually does cause a change in the dependent variable, but some obscuring factor in the study prevented the researchers from detecting the true difference. Such obscuring factors can take two general forms: There might not have been enough between-groups difference, or there might have been too much within-groups variability.

To illustrate these two types of problems, suppose you prepared two bowls of salsa: one containing two shakes of hot sauce and the other containing four shakes of hot sauce. People might not taste any difference between the two bowls. One reason is that four shakes is not different enough from two; there’s not enough between-groups difference. A second reason is that each bowl contains many other ingredients (tomatoes, onions, jalapeños, cilantro, lime juice), so it’s hard to detect any change in hot sauce intensity, with all those other flavors getting in the way. This is a problem of too much within-groups variability. Now let’s see how this analogy plays out in psychological research.

FIGURE 11.12 Results of three hypothetical experiments showing a null effect. (A) Why might cash not have made people significantly happy? (B) Why might the online reading games not have worked? (C) Why might anxiety not have affected logical reasoning?
[Panels: (A) happiness one day later (mood scale) for the no cash, some cash, and a lot of cash groups; (B) reading score for the control group and the online reading games group; (C) logical reasoning score for the low-, medium-, and high-anxiety groups.]

Perhaps There Is Not Enough Between-Groups Difference

When a study returns a null result, sometimes the culprit is not enough between-groups difference. Weak manipulations, insensitive measures, ceiling and floor effects, and reverse design confounds might prevent study results from revealing a true difference that exists between two or more experimental groups.


Why did the study show that money did not affect people’s moods? You might ask how much money the researcher gave each group. What if the amounts were $0.00, $0.25, and $1.00? In that case, it would be no surprise that the manipulation didn’t work; a dollar doesn’t seem like enough money to affect most people’s mood. Like the difference between two shakes and four shakes of hot sauce, it’s not enough of an increase to matter. Similarly, perhaps a 1-week exposure to reading games is not sufficient to cause any change in reading scores. Both of these would be examples of weak manipulations, which can obscure a true causal relationship.

When you interrogate a null result, then, it’s important to ask how the researchers operationalized the independent variable. In other words, you have to ask about construct validity. The researcher might have obtained a very different pattern of results if he had given $0.00, $5.00, and $150.00 to the three groups. The educational psychologist might have found reading games improve scores if done daily for 3 months rather than just a week.


Sometimes a study finds a null result because the researchers have not used an operationalization of the dependent variable with enough sensitivity. It would be like asking a friend who hates spicy food to taste your two bowls of salsa; he’d simply call both of them “way too spicy.” If a medication reduces fever by a tenth of a degree, you wouldn’t be able to detect it with a thermometer that was calibrated in one-degree increments; it wouldn’t be sensitive enough. Similarly, if online reading games improve reading scores by about 2 points, you wouldn’t be able to detect the improvement with a simple pass/fail reading test (either passing or failing, nothing in between). When it comes to dependent measures, it’s smart to use ones that have detailed, quantitative increments, not just two or three levels.
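The thermometer example can be made concrete with a short sketch. The two temperatures are hypothetical, and each "thermometer" is simply Python's built-in rounding, to whole degrees or to tenths of a degree.

```python
# Hypothetical fever readings (in degrees) before and after a medication
# that lowers temperature by one tenth of a degree.
before = 101.3
after = 101.2

# A coarse thermometer calibrated in one-degree increments.
coarse_before, coarse_after = round(before), round(after)

# A sensitive thermometer calibrated in tenth-of-a-degree increments.
fine_before, fine_after = round(before, 1), round(after, 1)

coarse_detects_change = coarse_before != coarse_after
fine_detects_change = fine_before != fine_after

print("Coarse thermometer:", coarse_before, "->", coarse_after)
print("Sensitive thermometer:", fine_before, "->", fine_after)
```

The true change is identical in both cases; only the sensitive measure registers it.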

❯❯ For more on scales of measurement, see Chapter 5, pp. 122–124.



In a ceiling effect, all the scores are squeezed together at the high end. In a floor effect, all the scores cluster at the low end. As special cases of weak manipulations and insensitive measures, ceiling and floor effects can cause independent variable groups to score almost the same on the dependent variable.

Ceilings, Floors, and Independent Variables. Ceiling and floor effects can be the result of a problematic independent variable. For example, if the researcher really did manipulate his independent variable by giving people $0.00, $0.25, or $1.00, that would be a floor effect because these three amounts are all low—they’re squeezed close to a floor of $0.00.

Consider the example of the anxiety and reasoning study. Suppose the researcher manipulated anxiety by telling the groups they were about to receive an electric shock. The low-anxiety group was told to expect a 10-volt shock, the medium-anxiety group a 50-volt shock, and the high-anxiety group a 100-volt shock. This manipulation would probably result in a ceiling effect because expecting any amount of shock would cause anxiety, regardless of the shock’s intensity. As a result, the various levels of the independent variable would appear to make no difference.

Ceilings, Floors, and Dependent Variables. Poorly designed dependent variables can also lead to ceiling and floor effects. Imagine if the logical reasoning test in the anxiety study was so difficult that nobody could solve the problems. That would cause a floor effect: The three anxiety groups would score the same, but only because the measure for the dependent variable results in low scores in all groups. Similarly, your friend has a ceiling effect on spiciness; he rates both bowls as extremely spicy.

In the money and mood study, participants rated their happiness on the following scale:

1 = I feel horrible. 2 = I feel awful. 3 = I feel bad. 4 = I feel fine.

Because there is only one option on this measure to indicate feeling good (and people generally tend to feel good, rather than bad), the majority would report the maximum, 4. Money would appear to have no effect on their mood, but only because the dependent measure of happiness used was subject to a ceiling effect.
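A small sketch can show how this ceiling hides a real difference. The "true" mood values below are invented for illustration and sit on an imagined 1-to-7 continuum (an assumption; the text reports no data). The function `report_on_scale` is a hypothetical helper that maps each true mood onto the 4-point scale above.

```python
def report_on_scale(true_mood, scale_min=1, scale_max=4):
    """Return the closest response the 1-4 mood scale allows."""
    return max(scale_min, min(scale_max, true_mood))

def average(values):
    return sum(values) / len(values)

# Hypothetical true moods; the money group is 1 point happier on average.
no_money_true = [4, 5, 5, 6, 4]
money_true = [5, 6, 6, 7, 5]

# Everyone at or above "I feel fine" can only answer 4.
no_money_reported = [report_on_scale(m) for m in no_money_true]
money_reported = [report_on_scale(m) for m in money_true]

true_difference = average(money_true) - average(no_money_true)
reported_difference = average(money_reported) - average(no_money_reported)

print("True difference:", round(true_difference, 2))
print("Difference on the 1-4 scale:", round(reported_difference, 2))
```

Because every participant reports the maximum, the group means on the scale are identical even though the underlying moods differ.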

Or suppose the reading test used in the online game study asked the children to point to the first letter of their own name. Almost all 5-year-olds can do this, so the measure would result in a ceiling effect. All children would get a perfect score; there would be no room for between-group variability on this measure. Similarly, if the reading test asked children to analyze a passage of Tolstoy, almost all children would fail, creating a floor effect (Figure 11.13).

❮❮ Ceiling and floor effects are examples of restriction of range; see Chapter 8, pp. 218–220.


When you interrogate a study with a null effect, it is important to ask how the independent and dependent variables were operationalized. Was the independent variable manipulation strong enough to cause a difference between groups? And was the dependent variable measure sensitive enough to detect that difference?

Recall from Chapter 10 that a manipulation check is a separate dependent variable that experimenters include in a study, specifically to make sure the manipulation worked. For example, in the anxiety study, after telling people they were going to receive a 10-volt, 50-volt, or 100-volt shock, the researchers might have asked: How anxious are you right now, on a scale of 1 to 10? If the manipulation check showed that participants in all three groups felt nearly the same level of anxiety (Figure 11.14A), you’d know the researchers did not effectively manipulate what they intended to manipulate. If the manipulation check showed that the independent variable levels differed in an expected way (participants in the high-anxiety group really felt more anxious than those in the other two groups; Figure 11.14B), then you’d know the researchers did effectively manipulate anxiety, the independent variable. If the manipulation check worked, the researchers would have to look for another reason for the null effect of anxiety on logical reasoning. Perhaps the dependent measure has a floor effect; that is, the logical reasoning test might be too difficult, so everyone scores low (see Figure 11.13). Or perhaps there really is no effect of anxiety on logical reasoning.


Confounds are usually considered to be internal validity threats—alternative explanations for some observed difference in a study. However, they can apply to null effects, too. A study might be designed in such a way that a design confound actually counteracts, or reverses, some true effect of an independent variable.

FIGURE 11.13 Ceiling and floor effects. A ceiling or floor effect on the dependent variable can obscure a true difference between groups. If all the questions on a test are too easy, everyone will get a perfect score. If the questions are too hard, everyone will score low.
[Panels: “Ceiling effect” (questions too easy: everyone gets them right) and “Floor effect” (questions too hard: everyone gets them wrong); vertical axis: percent correct; bars: reading games and control.]

In the money and happiness study, for example, perhaps the students who received the most money happened to be given the money by a grumpy experimenter, while those who received the least money were exposed to a more cheerful person; this confound would have worked against any true effect of money on mood.

Perhaps Within-Groups Variability Obscured the Group Differences

Another reason a study might return a null effect is that there is too much unsystematic variability within each group. This is referred to as noise (also known as error variance or unsystematic variance). In the salsa example, noise refers to the great number of the other flavors in the two bowls. Noisy within-group variability can get in the way of detecting a true difference between groups.

Consider the sets of scores in Figure 11.15. The bar graphs and scatterplots depict the same data, but in two graphing formats. In each case, the mean difference between the two groups is the same. However, the variability within each group is much larger in part A than part B. You can see that when there is more variability within groups, it obscures the differences between the groups because more overlap exists between the members of the two groups. It’s a statistical validity problem: The greater the overlap, the smaller the effect size, and the less likely the difference between the two group means will be statistically significant; that is, the less likely the study will detect covariance.

FIGURE 11.14 Possible results of a manipulation check. (A) These results suggest the anxiety manipulation did not work because people at all three levels of the independent variable reported being equally anxious. (B) These results suggest the manipulation did work because the anxiety of people in the three independent variable groups did vary in the expected way.
[Panels A and B: responses to the manipulation check “How anxious are you?” for the low-, medium-, and high-anxiety groups.]

When the data show less variability within the groups (see Figure 11.15B), the effect size will be larger, and it’s more likely the mean difference will be statistically significant. The less within-group variability, the less likely it is to obscure a true group difference. If the two bowls of salsa contained nothing but tomatoes, the difference between two and four shakes of hot sauce would be more easily detectable because there would be fewer competing, “noisy” flavors within bowls.
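The link between within-group variability and effect size can be sketched numerically. The scores below are hypothetical, and Cohen's d (the mean difference divided by the pooled standard deviation) is used here as one common effect size measure; the text does not specify which measure it has in mind.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

# Hypothetical reading scores. In both versions the games group averages
# exactly 2 points higher than the control group.
noisy_control = [40, 45, 50, 55, 60]   # scores widely spread out
noisy_games = [42, 47, 52, 57, 62]
tight_control = [48, 49, 50, 51, 52]   # scores tightly clustered
tight_games = [50, 51, 52, 53, 54]

d_noisy = cohens_d(noisy_control, noisy_games)
d_tight = cohens_d(tight_control, tight_games)

print("Effect size with high within-group variability:", round(d_noisy, 2))
print("Effect size with low within-group variability:", round(d_tight, 2))
```

The 2-point mean difference is the same in both versions, but it counts for far more when the scores within each group are tightly clustered.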

In sum, the more unsystematic variability there is within each group, the more the scores in the two groups overlap with each other. The greater the overlap, the less apparent the average difference. As described next, most researchers prefer to keep within-group variability to a minimum, so they can more easily detect between-group differences. They keep in mind a few common culprits: measurement error, irrelevant individual differences, and situation noise.

❯❯ For more on statistical significance, see Chapter 10, p. 304; and Statistics Review: Inferential Statistics.

FIGURE 11.15 Within-group variability can obscure group differences. Notice that the group averages are the same in both versions, but the variability within each group is greater in part A than part B. Part B is the situation researchers prefer because it enables them to better detect true differences in the independent variable.
[Panels A and B, each shown as a bar graph and a scatterplot: reading score for the control group and the online reading games group.]


One reason for high within-group variability is measurement error, a human or instrument factor that can inflate or deflate a person’s true score on the dependent variable. For example, a man who is 160 centimeters tall might be measured at 160.5 cm because of the angle of vision of the person using the meter stick, or he might be recorded as 159.5 cm because he slouched a bit.

All dependent variables involve a certain amount of measurement error, and researchers try to keep those errors as small as possible. For example, the reading test used as a dependent variable in the educational psychologist’s study is not perfect. Indeed, a group’s score on the reading test represents the group’s “true” reading ability—that is, the actual level of the construct in a group—plus or minus some random measurement error. Maybe one child’s batch of questions happened to be more difficult than average. Perhaps another student just happened to be exposed to the tested words at home. Maybe one child was especially distracted during the test, and another was especially focused. When these distortions of measurement are random, they cancel each other out across a sample of people and will not affect the group’s average, or mean. Nevertheless, an operationalization with a lot of measurement error will result in a set of scores that are more spread out around the group mean (see Figure 11.15A).

A child’s score on the reading measure can be represented with the following formula:

child’s reading score = child’s true reading ability +/− random error of measurement

Or, more generally:

dependent variable score = participant’s true score +/− random error of measurement

The more sources of random error there are in a dependent variable’s measurement, the more variability there will be within each group in an experiment (see Figure 11.15A). In contrast, the more precisely and carefully a dependent variable is measured, the less variability there will be within each group (see Figure 11.15B). And lower within-groups variability is better, making it easier to detect a difference (if one exists) between the different independent variable groups.
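The formula above can be simulated directly. In this sketch every child is given the same hypothetical true reading ability (50), so any spread in the observed scores comes from measurement error alone. The error sizes, the sample size, and the normal distribution of errors are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

TRUE_ABILITY = 50.0  # hypothetical true score, identical for every child
N_CHILDREN = 5000

def observed_scores(error_sd):
    """Observed score = true score + random error of measurement."""
    return [TRUE_ABILITY + rng.gauss(0, error_sd) for _ in range(N_CHILDREN)]

precise_test = observed_scores(error_sd=1.0)  # little measurement error
sloppy_test = observed_scores(error_sd=5.0)   # a lot of measurement error

for label, scores in (("Precise test", precise_test), ("Sloppy test", sloppy_test)):
    print(label, "mean:", round(mean(scores), 2), "spread (SD):", round(stdev(scores), 2))
```

Both group means land close to the true value because random errors cancel out, but the scores from the less precise test are much more spread out around that mean.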

Solution 1: Use Reliable, Precise Tools. When researchers use measurement tools that have excellent reliability (internal, interrater, and test-retest), they can reduce measurement error (see Chapter 5). When such tools also have good construct validity, there will be a lower error rate as well. More precise and accurate measurements have less error.

Solution 2: Measure More Instances. A precise, reliable measurement tool is sometimes impossible to find. What then? In this case, the best alternative is to use a larger sample (e.g., more people, more animals). In other words, one solution to measuring badly is to take more measurements. When a tool potentially causes a great deal of random error, the researcher can cancel out many errors simply by including more people in the sample.

Is one person’s score 10 points too high because of a random measurement error? If so, it’s not a problem, as long as another participant’s score is 10 points too low because of a random measurement error. The more participants there are, the better the chances of having a full representation of all the possible errors. The errors cancel each other out, and the result is a better estimate of the “true” average for that group. The reverse applies as well: When a measurement tool is known to have a very low error rate, the researcher might be able to use fewer participants in the study.
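A simulation can illustrate how errors cancel out as the sample grows. This is a minimal sketch under stated assumptions: the true group mean (50), the size of the random error, and the two sample sizes are hypothetical, and `typical_miss` is an illustrative helper, not a standard statistic.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 50.0
ERROR_SD = 10.0  # hypothetical, deliberately noisy measurement tool

def typical_miss(sample_size, repetitions=500):
    """Average distance between a sample's mean and the true mean."""
    total = 0.0
    for _ in range(repetitions):
        scores = [TRUE_MEAN + rng.gauss(0, ERROR_SD) for _ in range(sample_size)]
        total += abs(sum(scores) / sample_size - TRUE_MEAN)
    return total / repetitions

miss_small = typical_miss(25)
miss_large = typical_miss(400)

print("Typical miss with 25 participants:", round(miss_small, 2))
print("Typical miss with 400 participants:", round(miss_large, 2))
```

With the same noisy tool, the larger sample's average sits much closer to the true value.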


Individual differences can be another source of within-group variability. They can be a problem in independent-groups designs. In the experiment on money and mood, for example, the normal mood of the participants must have varied. Some people are naturally more cheerful than others, and these individual differences have the effect of spreading out the scores of the students within each group, as Figure 11.16 shows. In the $1.00 condition is Candace, who is typically unhappy. The $1.00 gift might have made her happier, but her mood would still be relatively low because of her normal level of grumpiness. Michael, a cheerful guy, was in the no-money control condition, but he still scored high on the mood measure.

Looking over the data, you’ll notice that, on average, the participants in the experimental condition did score a little higher than those in the control condition. But the data are mixed and far from consistent; there’s a lot of overlap between the scores in the money group and the control group. Because of this overlap, the effect of a money gift might not reach statistical significance. It is hard to detect the effect of money above and beyond these individual differences in mood. The effect of the gift would be small compared to the variability within each group.

Solution 1: Change the Design. One way to accommodate individual differences is to use a within-groups design instead of an independent-groups design. In Figure 11.17, each pair of points, connected by a line, represents a single person whose mood was measured under both conditions. The top pair of points represents Michael’s mood after a money gift and after no gift. Another pair of points represents Candace’s mood after a money gift and after no gift. Do you see what happens? The individual data points are exactly where they were in Figure 11.16, but the pairing process has turned a scrambled set of data into a clear and very consistent finding: Every participant was happier after receiving a money gift than after no gift. This included Michael, who is always cheerful, and Candace, who is usually unhappy, as well as others in between.

FIGURE 11.16 Individual differences. Overall, students who received money were slightly more cheerful than students in the control group, but the scores in the two groups overlapped a great deal.
[Plot: happy mood (vertical axis) for the no-money and $1.00 conditions.]

A within-groups design, which compares each participant with himself or herself, controls for irrelevant individual differences. Finally, notice that the study required only half as many participants as the original independent-groups experiment. You can see again the two strengths of within-groups designs (introduced in Chapter 10): They control for irrelevant individual differences, and they require fewer participants than independent-groups designs.
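The advantage of pairing can be sketched by analyzing one set of numbers two ways. The mood scores are hypothetical, and the two functions use the standard textbook formulas for an independent-groups t (pooled variance) and a dependent-groups t. They are written out by hand for illustration and are not taken from any particular statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical mood scores for five people who differ a lot in baseline mood.
no_gift = [2.0, 4.0, 6.0, 8.0, 10.0]
gift = [3.0, 4.8, 7.2, 9.0, 11.0]  # each person is about 1 point happier

def independent_t(group_a, group_b):
    """Treats the two lists as scores from different people."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / standard_error

def paired_t(first, second):
    """Treats each position in the lists as the same person measured twice."""
    differences = [b - a for a, b in zip(first, second)]
    return mean(differences) / (stdev(differences) / sqrt(len(differences)))

t_independent = independent_t(no_gift, gift)
t_paired = paired_t(no_gift, gift)

print("t if analyzed as independent groups:", round(t_independent, 2))
print("t if analyzed as a within-groups design:", round(t_paired, 2))
```

The data points are the same in both analyses. Pairing removes the large person-to-person differences in baseline mood, so the consistent 1-point gain stands out.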

Solution 2: Add More Participants. If within-groups or matched-groups designs are inappropriate (and sometimes they are, because of order effects, demand characteristics, or other practical concerns), another solution to individual difference variability is to measure more people. The principle is the same as it is for measurement error: When a great deal of variability exists because of individual differences, a simple solution is to increase the sample size. The more people you measure, the less impact any single person will have on the group’s average. Adding more participants reduces the influence of individual differences within groups, thereby enhancing the study’s ability to detect differences between groups.

Another reason larger samples reduce the impact of irrelevant individual differences is mathematical. The number of people in a sample goes in the denominator of the statistical formula for a t test, which is used for detecting a difference between two related means. As you will learn in your statistics class, the formula for a t test for dependent groups (one possible t test) is:

t = mean difference / (standard deviation of the difference / √n)



The larger the number of participants (n), the smaller the denominator of t. And the smaller that denominator is, the larger t can get, and the easier it is to find a significant t. A significant t means you do not have a null result.
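A quick calculation shows how n drives the size of t in the dependent-groups formula, where t is the mean difference divided by (the standard deviation of the difference divided by the square root of n). The mean difference (2 points) and standard deviation of the difference (10 points) are hypothetical values held constant while only n changes.

```python
from math import sqrt

def dependent_t(mean_difference, sd_of_difference, n):
    """t = mean difference / (SD of the difference / square root of n)."""
    return mean_difference / (sd_of_difference / sqrt(n))

# Same hypothetical effect and variability; only the sample size changes.
t_values = {n: dependent_t(2.0, 10.0, n) for n in (25, 100, 400)}

for n, t in t_values.items():
    print("n =", n, "gives t =", round(t, 2))
```

Each fourfold increase in n doubles t, because n enters the formula through its square root.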

❮❮ For more on t tests, see Statistics Review: Inferential Statistics, pp. 491–495.

FIGURE 11.17 Within-groups designs control for individual differences. When each person participates in both levels of the independent variable, the individual differences are controlled for, and it is easier to see the effect of the independent variable.
[Plot: happy mood (vertical axis) for the no-money and $1.00 conditions, with each person’s two scores connected by a line.]


Besides measurement error and individual differences, situation noise (external distractions) is a third factor that could cause variability within groups and obscure true group differences. Suppose the money and mood researcher had conducted his study in the middle of the student union on campus. The sheer number of distractions in this setting would make a mess of the data. The smell of the nearby coffee shop might make some participants feel peaceful, seeing friends at the next table might make some feel extra happy, and seeing the cute guy from sociology class might make some feel nervous or self-conscious. The kind and amount of distractions in the student union would vary from participant to participant and from moment to moment. The result, once again, would be unsystematic variability within each group.

Situation noise, therefore, can add unsystematic variability to each group in an experiment. Unsystematic variability, like that caused by random measurement error or irrelevant individual differences, will obscure true differences between groups.

Researchers often attempt to minimize situation noise by carefully controlling the surroundings of an experiment. The investigator might choose to distribute money and measure people’s moods in a consistently undistracting laboratory room, far from coffee shops and classmates. Similarly, the researcher studying anxiety and logical reasoning might reduce situation noise by administering the logical reasoning test on a computer in a standardized classroom environment.

Sometimes the controls for situation noise have to be extreme. Consider one study on smell (cited in Mook, 2001), in which the researchers had to control all extraneous odors that might reach the participants’ noses. The researchers dressed the participants in steam-cleaned plastic parkas fastened tightly under the chin to trap odors from their clothes, and placed them in a steam-cleaned plastic enclosure. A layer of petroleum jelly over the face trapped odors from the skin. Only then did the researchers introduce the odors being studied by means of tubes placed directly in the participants’ nostrils.

Obviously, researchers do not usually go to such extremes. But they do typically try to control the potential distractions that might affect the dependent variable. To control the situation so it doesn’t induce unsystematic variability in mood (his dependent variable), the researcher would not have a TV turned on in the lab. To control the situation to avoid unsystematic variability in her dependent variable, reading performance, the educational psychologist may limit children’s exposure to alternative reading activities. The researchers in the anxiety and reasoning study would want to control any kind of unsystematic situational factor that might add variability to people’s scores on the logical reasoning test.


When researchers use a within-groups design, employ a strong manipulation, carefully control the experimental situation, or add more participants to a study, they are increasing the power of their study. Recall from Chapter 10 that power, an aspect of statistical validity, is the likelihood that a study will return a statistically significant result when the independent variable really has an effect. If online reading games really do make a difference, even a small one, will the experiment reveal it? If anxiety really affects problem solving, will the study find a significant result? A within-groups design, a strong manipulation, a larger number of participants, and less situation noise are all things that will increase the power of an experiment. Of these, the easiest way to increase power is to add more participants.

When researchers design a study with a lot of power, they are more likely to detect true patterns—even small ones. Consider the analogy of looking for an object in a dark room. If you go into the room with a big, powerful flashlight, you have a better chance of finding what you’re looking for—even if it’s something small, like an earring. But if you have just a candle, you’ll probably miss finding smaller objects. A study with a lot of participants is like having a strong flashlight: It can detect even small differences in reading scores or happiness. Importantly, a study with a lot of participants is also desirable because it prevents a few extreme ones from having undue influence on the group averages, which could cause misleading results.
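The point about extreme scores can be made concrete with simple arithmetic. The sketch below uses made-up happiness ratings (the scores 5 and 10 and the function name are illustrative assumptions, not data from any study) to show how far one extreme participant pulls a group average in a small group versus a large one.

```python
from statistics import mean

def mean_shift_from_one_extreme(n_typical, typical_score=5.0, extreme_score=10.0):
    """Return how far one extreme score pulls the group mean away from the typical score."""
    scores = [typical_score] * n_typical + [extreme_score]
    return mean(scores) - typical_score

# One extreme participant among 10 people versus among 100 people (hypothetical groups).
small_group_shift = mean_shift_from_one_extreme(9)
large_group_shift = mean_shift_from_one_extreme(99)
print(small_group_shift, large_group_shift)
```

Under these assumptions, the same extreme score moves the average of the small group ten times as far as it moves the average of the large group.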

Similarly, a study with a strong manipulation is analogous to increasing the size of the object you’re looking for; you’ll be able to find a skateboard in the room more easily than an earring—even if you have only a candle for light (Figure 11.18). Good experimenters try to maximize the power of their experimental designs by strengthening their “light source” (i.e., their sample size) or increasing the size of their effects.
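The flashlight analogy can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: scores are drawn from normal distributions, the function name and parameters are hypothetical, and a result counts as "significant" when the test statistic exceeds 1.96 in absolute value (a rough large-sample cutoff rather than an exact t test). It estimates power as the share of simulated experiments that detect a true effect.

```python
import random
import statistics

def estimate_power(n_per_group, effect, noise_sd=1.0, n_sims=1000, seed=0):
    """Share of simulated two-group experiments that reach |t| > 1.96.

    The 1.96 cutoff is a simplifying assumption; a real analysis would use
    the exact t distribution for the sample size at hand.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, noise_sd) for _ in range(n_per_group)]
        difference = statistics.mean(treated) - statistics.mean(control)
        standard_error = (
            (statistics.variance(control) + statistics.variance(treated)) / n_per_group
        ) ** 0.5
        if abs(difference / standard_error) > 1.96:
            significant += 1
    return significant / n_sims

# A stronger "light source": more participants per group, same small effect.
for n in (10, 50, 100):
    print(n, estimate_power(n, effect=0.3))
```

Calling the same function with a larger `effect` (a stronger manipulation) or a smaller `noise_sd` (less unsystematic variability) should raise the estimated power in the same way.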

❮❮ For more on power, see Statistics Review: Inferential Statistics, pp. 487–490.

Studies with low power can find only large effects

Studies with high power can find both large and small effects

FIGURE 11.18 Studies with more power can detect small effects. Experimenters can increase a study’s power by strengthening the “light source” (through large samples or accurate measurements), or by increasing the size of the effects (through strong manipulations). If a study has high power, you can be more confident it has detected any result worth finding. If a study has low power, it might return a null result inconclusively.

342 CHAPTER 11 More on Experiments: Confounding and Obscuring Variables

Sometimes There Really Is No Effect to Find

When an experiment reveals that the independent variable conditions are not significantly different, what should you conclude? The study might be flawed in some way, so you might first ask whether it was designed to elicit and detect between-group differences. Was the manipulation strong? Was the dependent measure sensitive enough? Could either variable be limited to a ceiling or floor effect? Are any design confounds working against the independent variable?

You would also ask about the study’s ability to minimize within-group differences. Was the dependent variable measured as precisely as possible, to reduce measurement error? Could individual differences be obscuring the effect of the independent variable? Did the study include enough participants to detect an effect? Was the study conducted with appropriate situational controls? Any of these factors, if problematic, could explain why an experiment showed a null effect. Table 11.2 summarizes the possible reasons for a null result in an experiment.

If, after interrogating these possible obscuring factors, you find the experiment was conducted in ways that maximized its power and yet still yielded a nonsignificant result, you can probably conclude the independent variable truly does not affect the dependent variable. Perhaps money really doesn’t buy happiness. Maybe online reading games just don’t help students score higher. Or perhaps anxiety really doesn’t affect logical reasoning. In other words, if you read about a study that used a really strong flashlight, and yet still didn’t find anything, that’s a sign there’s probably no effect to be found. And if their theory had predicted a difference between groups but found none, then the theory is incorrect. The Working It Through section provides an example.

There are many occurrences of true null effects in psychological science. An analysis of 1.2 million children concluded that vaccinating children does not cause autism (Taylor, Swerdfeger, & Eslick, 2014). Certain therapeutic programs apparently do not have the intended effect (such as the Scared Straight program, discussed in Chapter 1). After a certain level of income, money does not appear to be related to happiness (Diener, Horwitz, & Emmons, 1985; Lyubomirsky, King, & Diener, 2005; Lucas & Schimmack, 2009; Myers, 2000). And despite stereotypes to the contrary, women and men apparently do not differ in how much they talk (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007).

TABLE 11.2

Reasons for a Null Result

Not enough between-groups difference

Obscuring factor: Ineffective manipulation of independent variable
Example: One week of reading games might not improve reading skill (compared with a control group), but 3 months might improve scores.
Questions to ask: How did the researchers manipulate the independent variable? Was the manipulation strong? Do manipulation checks suggest the manipulation did what it was intended to do?

Obscuring factor: Insufficiently sensitive measurement of dependent variable
Example: Researchers used a pass/fail measure, when the improvement was detectable only by using a finer-grained measurement scale.
Questions to ask: How did the researchers measure the dependent variable? Was the measure sensitive enough to detect group differences?

Obscuring factor: Ceiling or floor effects on independent variable
Example: Researchers manipulated three levels of anxiety by threatening people with 10-volt, 50-volt, or 100-volt shocks (all of which make people very anxious).
Questions to ask: Are there meaningful differences between the levels of the independent variable? Do manipulation checks suggest the manipulation did what it was intended to do?

Obscuring factor: Ceiling or floor effects on dependent variable
Example: Researchers measured logical reasoning ability with a very hard test (a floor effect on logical reasoning ability).
Questions to ask: How did the researchers measure the dependent variable? Do participants cluster near the top or near the bottom of the distribution?

Too much within-groups variability

Obscuring factor: Measurement error
Example: Logical reasoning test scores are affected by multiple sources of random error, such as item selection, participant’s mood, fatigue, etc.
Questions to ask: Is the dependent variable measured precisely and reliably? Does the measure have good construct validity? If measurements are imprecise, did the experiment include enough participants to counteract this obscuring effect?

Obscuring factor: Individual differences
Example: Reading scores are affected by irrelevant individual differences in motivation and ability.
Questions to ask: Did the researchers use a within-groups design to better control for individual differences? If an independent-groups design is used, larger sample size can reduce the impact of individual differences.

Obscuring factor: Situation noise
Example: The money and happiness study was run in a distracting location, which introduced several external influences on the participants’ mood.
Questions to ask: Did the researchers attempt to control any situational influences on the dependent variable? Did they run the study in a standardized setting?

True null effect

Obscuring factor: The independent variable, in truth, has no effect on the dependent variable
Questions to ask: Did the researchers take precautions to maximize between-group variability and minimize within-group variability? In other words, does the study have adequate power? If so, and they still don’t find a group difference, it’s reasonable to conclude that the independent variable does not affect the dependent variable.
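The pass/fail example in Table 11.2 can be illustrated with a few lines of arithmetic. The sketch below uses invented reading scores and an assumed pass mark of 60 (none of these numbers come from a real study) to show how a coarse measure can hide a difference that a finer-grained scale reveals.

```python
from statistics import mean

# Hypothetical reading scores (0-100) for two small groups; invented for illustration.
control_scores = [62, 65, 68, 70, 71, 74]
games_scores = [68, 71, 73, 76, 77, 80]
PASS_MARK = 60  # assumed cutoff for the pass/fail version of the measure

def pass_rate(scores, cutoff=PASS_MARK):
    """Proportion of scores at or above the cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)

# Difference between groups on the fine-grained scale versus the pass/fail scale.
fine_grained_difference = mean(games_scores) - mean(control_scores)
pass_fail_difference = pass_rate(games_scores) - pass_rate(control_scores)
print(fine_grained_difference, pass_fail_difference)
```

In this made-up data set every child passes, so the pass/fail measure shows no group difference even though the average scores differ by several points.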


Will People Get More Involved in Local Government If They Know They’ll Be Publicly Honored?

A group of researchers tested ways to get citizens more involved in local government (Arceneaux & Butler, 2016). They sent a survey to citizens of a small town, 340 of whom replied. Near the end of the survey, residents were invited to volunteer for local city committees. They were randomly assigned to read one of two messages embedded in the survey: a baseline message simply asking people to volunteer, or another promising to “publicly honor” volunteers on the city website. The dependent variable was whether people signed up to volunteer at the end of the survey. The results showed that 18.4% of people in the baseline message group expressed interest in volunteering, while 17.8% of people in the “publicly honor” message group expressed interest. The difference was not statistically significant. Can the researchers conclude that publicly honoring people doesn’t make them volunteer? We’ll work through the questions in Table 11.2 to find out.
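To see why a difference of 18.4% versus 17.8% is not statistically significant, it can help to run the numbers. The sketch below is not the authors’ analysis: it applies a standard two-proportion z test, and because the text does not report the exact group sizes, the even 170/170 split of the 340 respondents is an assumption made purely for illustration. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(p1, n1, p2, n2):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Reported percentages; the 170/170 group sizes are an assumption, not reported data.
z, p_value = two_proportion_z_test(0.184, 170, 0.178, 170)
print(round(z, 2), round(p_value, 2))
```

Under these assumed group sizes, the test statistic falls far short of the conventional 1.96 cutoff, which is consistent with the nonsignificant result the researchers reported.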


Was there enough variability between levels?

Question to ask: Were the baseline message and the experimental message different enough from each other?
Evidence from the study: The title used in the baseline message was “Serve on a City Committee!” while the title used in the experimental message added, “Be a Hero to Your Community!”
Interpretation: These titles seem clearly different, but it’s possible people did not read the text of the appeal very carefully, weakening the manipulation.

Question to ask: Could there have been a floor effect on the dependent variable?
Evidence from the study: The dependent variable was a simple “yes” or “no” to volunteering.
Interpretation: Perhaps measure the dependent variable with a finer scale, such as interest from 1 to 9.

Question to ask: Could there be a confound acting in reverse?
Evidence from the study: The experimental group not only read about being thanked, they also were told volunteering was heroic. This “may have reinforced the notion that participation in this context is special rather than a common expectation of democratic citizens” (p. 137).
Interpretation: A true effect of being publicly honored could have been counteracted by the impression that volunteering is too difficult.

Was there too much variability within levels?

Question to ask: Was there situation noise or individual differences?
Evidence from the study: People completed surveys online, so there could be a lot of distracting situation noise. In addition, individuals may differ wildly in their civic engagement and their ability to volunteer.
Interpretation: There was likely some within-group variability due to situation noise and irrelevant individual differences.

Question to ask: Was the sample size large enough to counteract situation noise and individual differences?
Evidence from the study: The sample sizes in the two groups were fairly large (more than 100 in each group).
Interpretation: The sample size was probably large enough to counteract these influences.

Could there be no effect of the message on volunteering?

Interpretation: An improved study (with a more sensitive dependent measure and a cleaner manipulation of public gratitude) might show an effect of gratitude on volunteering. However, we have learned that if a town uses the exact experimental message tested here, it will not increase volunteering.

Null Effects May Be Published Less Often

When studies are conducted with adequate power, null results can be just as interesting and just as informative as experiments that show group differences. However, if you’re looking for studies that yielded null effects, you won’t find many. There is a publication bias about what gets published in scientific journals and which stories are picked up by magazines and newspapers. Most readers are more interested in independent variables that matter than in those that do not. It’s more interesting to learn dark chocolate has health benefits rather than that it doesn’t, and that women and men differ on a particular trait, as opposed to being the same. Differences seem more interesting than null effects, so a publication bias, in both journals and journalism, favors differences. (For more on this publication bias, see Chapter 14.)


1. How can a study maximize variability between independent variable groups? (There are four ways.)

2. How can a study minimize variability within groups? (There are three ways.)

3. In your own words, describe how within-groups designs minimize unsystematic variability.

1. See pp. 332–335 and Table 11.2. 2. See pp. 335–341 and Table 11.2. 3. See pp. 338–339.



Summary

Responsible experimenters may conduct double-blind studies, measure variables precisely, or put people in controlled environments to eliminate internal validity threats and increase a study’s power to avoid false null effects.

Threats to Internal Validity: Did the Independent Variable Really Cause the Difference?

• When an experiment finds that an independent variable affected a dependent variable, you can interrogate the study for twelve possible internal validity threats.

• The first three threats to internal validity to consider are design confounds, selection effects, and order effects (introduced in Chapter 10).

• Six threats to internal validity are especially relevant to the one-group, pretest/posttest design: maturation, history, regression, attrition, testing, and instrumentation threats. All of them can usually be ruled out if an experimenter conducts the study using a comparison group (either a posttest-only design or a pretest/posttest design).

• Three more internal validity threats could potentially apply to any experiment: observer bias, demand characteristics, and placebo effects.

• By interrogating a study’s design and results, you can decide whether the study has ruled out all twelve threats. If it passes all your internal validity queries, you can conclude with confidence that the study was a strong one: You can trust the result and make a causal claim.

Interrogating Null Effects: What If the Independent Variable Does Not Make a Difference?

• If you encounter a study in which the independent variable had no effect on the dependent variable (a null effect), you can review the possible obscuring factors.

• Obscuring factors can be sorted into two categories of problems. One is the problem of not enough between-groups difference, which results from weak manipulations, insensitive measures, ceiling or floor effects, or a design confound acting in reverse.

• The second problem is too much within-groups variability, caused by measurement error, irrelevant individual differences, or situation noise. These problems can be counteracted by using multiple measurements, more precise measurements, within-groups designs, large samples, and very controlled experimental environments.

• If you can be reasonably sure a study avoided all the obscuring factors, then you can probably trust the result and conclude that the independent variable really does not cause a change in the dependent variable.


Key Terms

one-group, pretest/posttest design, p. 313
maturation threat, p. 314
history threat, p. 315
regression threat, p. 316
regression to the mean, p. 316
attrition threat, p. 319
testing threat, p. 320
instrumentation threat, p. 321
selection-history threat, p. 321
selection-attrition threat, p. 322
observer bias, p. 322
demand characteristics, p. 322
double-blind study, p. 323
masked design, p. 323
placebo effect, p. 323
double-blind placebo control study, p. 324
null effect, p. 330
ceiling effect, p. 333
floor effect, p. 333
manipulation check, p. 334
noise, p. 335
measurement error, p. 337
situation noise, p. 340
power, p. 341

To see samples of chapter concepts in the popular media, visit and click the box for Chapter 11.

Review Questions

1. Dr. Weber conducted a long-term study in which people were tested on happiness, asked to make two new friends, and then tested on happiness 1 month later. He noticed that six of the most introverted people dropped out by the last session. Therefore, his study might have which of the following internal validity threats?

a. Attrition

b. Maturation

c. Selection

d. Regression

2. How is a testing threat to internal validity different from an instrumentation threat?

a. A testing threat can be prevented with random assignment; an instrumentation threat cannot.

b. A testing threat applies only to within-groups designs; an instrumentation threat applies to any type of study design.

c. A testing threat can be prevented with a double-blind study; an instrumentation threat can be prevented with a placebo control.

d. A testing threat refers to a change in the participants over time; an instrumentation threat refers to a change in the measuring instrument over time.

3. A regression threat applies especially:

a. When there are two groups in the study: an experimental group and a control group.

b. When the researcher recruits a sample whose average is extremely low or high at pretest.

c. In a posttest-only design.

d. When there is a small sample in the study.

4. Dr. Banks tests to see how many training sessions it takes for dogs to learn to “Sit and stay.” She randomly assigns 60 dogs to two reward conditions: one is miniature hot dogs, the other is small pieces of steak. Surprisingly, she finds the dogs in each group learn “Sit and stay” in about the same number of sessions. Given the design of her study, what is the most likely explanation for this null effect?

a. The dogs loved both treats (her reward manipulation has a ceiling effect).

b. She used too many dogs.

c. She didn’t use a manipulation check.

d. There were too many individual differences among the dogs.

5. Dr. Banks modifies her design and conducts a second study. She uses the same number of dogs and the same design, except now she rewards one group of dogs with miniature hot dogs and another group with pieces of apple. She finds a big difference, with the hot-dogs group learning the command faster. Dr. Banks avoided a null result this time because her design:

a. Increased the between-groups variability.

b. Decreased the within-groups variability.

c. Improved the study’s internal validity.

6. When a study has a large number of participants and a small amount of unsystematic variability (low measurement error, low levels of situation noise), then it has a lot of:

a. Internal validity

b. Manipulation checks

c. Dependent variables

d. Power

Learning Actively

The scenarios described in items 1–3 below contain threats to internal validity. For each scenario:

a. Identify the independent variable (IV) and dependent variable (DV).

b. Identify the design (posttest-only, pretest/posttest, repeated measures, one-group pretest/posttest).

c. Sketch a graph of the results. (Reminder: Put the dependent variable on the y-axis.)

d. Decide whether the study is subject to any of the internal validity threats listed in Table 11.1.

e. Indicate whether you could redesign the study to correct or prevent any of the internal validity threats.

1. For his senior thesis, Jack was interested in whether viewing alcohol advertising would cause college students to drink more alcohol. He recruited 25 seniors for a week-long study. On Monday and Tuesday, he had them use a secure website and record how many alcoholic beverages they had consumed the day before. On Wednesday, he invited them to the lab, where he showed them a 30-minute TV show interspersed with entertaining ads for alcoholic products. Thursday and Friday were the follow-up measures: Students logged in to the website and recorded their alcoholic beverage consumption again. Jack found that students reported increased drinking after seeing the alcohol advertising, and he concluded the advertising caused them to drink more.

2. In a cognitive psychology class, a group of student presenters wanted to demonstrate the power of retrieval cues. First, the presenters had the class memorize a list of 20 words that were read aloud to them in a random order. One minute later, the class members wrote down as many words as they could remember. On average, the class recalled 6 words. Second, the presenters told the class to try sorting the words into categories as the words were read (color words, vehicle words, and sports words). The presenters read the same words again, in a different random order. On the second test of recall, the class remembered, on average, 14 words. The presenters told the class this experiment demonstrated that categorizing helps people remember words because of the connections they can develop between various words.

3. A group of researchers investigated the effect of mindfulness meditation on mental health workers, 10 weeks after a major hurricane. A sample of 15 mental health workers were pretested on their depression and anxiety symptoms. Then they engaged in meditation training for 8 weeks. After the training was completed, they were tested on their symptoms again, using the same test. The study found that anxiety and depression symptoms were significantly lower at posttest. The researchers concluded the meditation training helped the participants (based on Waelde et al., 2008).

4. Dr. Dove was interested in the effects of eating chocolate on well-being. She randomly assigned 20 participants to two groups. Both groups ate as they normally would, but one group was instructed to eat a 1-ounce square of dark chocolate after lunch. After 4 weeks on this diet, they completed a questionnaire measuring their level of well-being (happiness, contentment). Dr. Dove was surprised to find the chocolate had no effect: Both groups, on average, scored the same on the well-being measure. Help Dr. Dove troubleshoot her study. What should she do next time to improve her chances of finding a significant effect for the chocolate-enhanced diet, if eating chocolate really does improve well-being?

The Reason Why You’re an Angry Drunk Men’s Health, 2012

New California Law Prohibits All Cell Phone Use While Driving KSBW8 News, 2016


Experiments with More Than One Independent Variable

SO FAR, YOU HAVE read two chapters about evaluating causal claims. Chapters 10 and 11 introduced experiments with one independent variable and one dependent variable. Now you’re ready for experiments with more than one independent variable. What happens when more independent variables are added to the mix?

A year from now, you should still be able to:

1. Explain why researchers combine independent variables in a factorial design.

2. Describe an interaction effect in both everyday terms and arithmetic terms.

3. Identify and interpret the main effects and interactions from a factorial design.

REVIEW: EXPERIMENTS WITH ONE INDEPENDENT VARIABLE

Let’s start with the first headline on the opposite page: Is it true that certain people can be angry drunks? According to research, there’s almost no doubt that drunk people are more aggressive than sober folks. In several studies, psychologists have brought participants into comfortable laboratory settings, had them drink various amounts of alcohol, and then placed them in different settings to measure their aggressive tendencies. For example, a team of researchers led by Aaron Duke invited community members into their lab (Duke, Giancola, Morris, Holt, & Gunn, 2011). After screening out volunteers who had problem drinking behaviors, were pregnant, or had other risky health conditions, they randomly assigned them to drink a glass of orange juice that contained different amounts of alcohol. The “active placebo” group drank orange juice with a very small amount of vodka, enough to smell and to taste, but not enough to make them drunk. Another group was assigned to drink enough vodka to get drunk, by reaching a blood alcohol concentration (BAC) of 0.10% (legally impaired is BAC 0.08% or higher).

After confirming the two groups’ intoxication levels with a breathalyzer test, the researchers had the volunteers play a computer game with an opponent who was supposedly in another room (the opponent was actually a computer programmed in advance). The players took turns, and when one made a mistake, the opponent was allowed to deliver a shock as punishment. Players chose the intensity of the shock their opponents would receive for each mistake (on a scale of 1 to 10), and they could hold the shock delivery button down for different lengths of time. The researchers measured the intensity and duration of the shocks each participant delivered. The more intense the shocks and the longer their duration, the more aggressive the participants were said to be. Results showed a difference: drunk participants were more aggressive (Figure 12.1).

The new California law in the second headline responds to research showing that using a cell phone while behind the wheel impairs a person’s ability to drive. Some of the evidence comes from experiments by David Strayer and his colleagues (Strayer & Drews, 2004), who asked people to talk on hands-free cell phones in a driving simulator that looked almost exactly like a real car. As the participants drove, the researchers recorded several dependent variables, including driving speed, braking time, and following distance. In a repeated-measures (within-groups) design, they had participants drive on several 10-mile segments of highway in the simulator. For two of the segments, the drivers carried on a conversation on a hands-free cell phone. For the other two segments, drivers were not on the phone (of course, the order of the different segments was counterbalanced). The results showed that when drivers were simply talking on cell phones (not even texting or using apps), their reactions to road hazards were 18% slower. Drivers on cell phones also took longer to regain their speed after slowing down and got into more (virtual) accidents (Figure 12.2).

The Strayer and Drews study, like the Duke team’s study, had one independent variable (cell phone use, manipulated as a within-groups variable) and one dependent variable (driving quality). Their study also showed a difference: People drove more poorly while using cell phones. Studies with one independent variable can demonstrate a difference between conditions. These two studies were analyzed with a simple difference score: placebo minus drunk conditions, or cell phone minus control.

FIGURE 12.1 Alcohol intake and aggressive tendencies. [Two graphs. Panel A: intensity of shock (1-10 scale) by alcohol intake (placebo group vs. drunk group, BAC 0.10). Panel B: duration of shock (sec) by alcohol intake (placebo group vs. drunk group, BAC 0.10).] Compared to a placebo group in this study, drunk participants delivered (A) higher-intensity shocks and (B) shocks for longer duration. These results demonstrated that alcohol causes people to behave aggressively. (Source: Adapted from Duke et al., 2011.)

❯❯ To review counterbalancing, see Chapter 10, pp. 295–296.


Experiments with Two Independent Variables Can Show Interactions

The Strayer and Drews study found that hands-free cell phones cause people to drive badly. These researchers also wondered whether that overall difference would apply in all situations and to all people. For example, might younger drivers be less distracted by using cell phones than older drivers? On the one hand, they might, because they grew up using cell phones and are more accustomed to them. On the other hand, older drivers might be less distracted because they have more years of driving experience. By asking these questions, the researchers were thinking about adding another independent variable to the original study: driver age, and the levels could be old and young. Would the effect of driving while using a cell phone depend on age?

Adding another independent variable allows researchers to look for an interaction effect (or interaction), that is, whether the effect of the original independent variable (cell phone use) depends on the level of another independent variable (driver age). Therefore, an interaction of two independent variables allows researchers to establish whether or not “it depends.” They can now ask: Does the effect of cell phones depend on age?

The mathematical way to describe an interaction of two independent variables is to say that there is a “difference in differences.” In the driving example, the difference between the cell phone and control conditions (cell phone minus control) might be different for older drivers than younger drivers.
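The “difference in differences” idea can be worked through with a handful of numbers. The sketch below uses invented mean braking times (they are not results from Strayer and Drews or any other study, and the names are hypothetical) to compute the cell phone effect separately for each age group and then subtract one effect from the other.

```python
# Hypothetical mean braking onset times in milliseconds; invented for illustration.
braking_ms = {
    ("younger", "cell phone"): 900,
    ("younger", "control"): 800,
    ("older", "cell phone"): 1150,
    ("older", "control"): 900,
}

def phone_effect(age_group, cell_means=braking_ms):
    """Cell phone minus control for one age group."""
    return cell_means[(age_group, "cell phone")] - cell_means[(age_group, "control")]

younger_effect = phone_effect("younger")
older_effect = phone_effect("older")

# An interaction is indicated when the two differences are themselves different.
difference_in_differences = older_effect - younger_effect
print(younger_effect, older_effect, difference_in_differences)
```

If the two effects were identical, the difference in differences would be zero, which would indicate no interaction in the pattern of means; whether a nonzero value is statistically significant is a separate question.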


Variables are the core unit of psychological research. A variable, as the word implies, is something that varies, so it must have at least two levels, or values.

$$\frac{\text{mean difference}}{\left(\dfrac{\text{standard deviation of the difference}}{\sqrt{n}}\right)}$$


Take this headline: “72% of the world smiled yesterday.” Here, “smiling yes- terality, would be a constant, not a variable. A constant is something that could potentially vary but that has only one level in the study in question. (In contrast, in this example, “smoking” would be a variable, and its levels would be “smoker” or “non-smoker”).



A person’s score on the GRE measure can be represented with the following formula:

  student’s GRE score = student’s true GRE ability ± random error of measurement

The researchers in any study either measure or manipulate each variable. The distinction is important because some claims are tested with measured variables, while other claims must be tested with both measured and manipulated variables. A measured variable is one whose levels are simply

Interaction = a difference in differences
            = the effect of one independent variable depends on the level of the other independent variable

Some variables, such as height, IQ, and blood pressure, are typically measured. For other variables, such as depression and stress, researchers might devise a special set of questions to represent the various levels.

Intuitive Interactions

Behaviors, thoughts, motivations, and emotions are rarely simple; they usually involve interactions between two or more influences. Therefore, much of the most important research in psychology explores interactions among multiple independent variables. What’s the best way to understand what an interaction means?

Here’s one example of an interaction: Do you like hot foods or cold foods? It probably depends on the food. You probably like your ice cream cold, but you like your pancakes hot. In this example, there are two independent variables: the food you are judging (ice cream or pancakes) and the temperature of the food (cold or hot).


FIGURE 12.2 Cell phone use and driver reaction time. [Graph: braking onset time (ms) by cell phone condition (on cell phone vs. not on phone).] In this study, drivers using hands-free cell phones were slower to hit the brakes in response to a road hazard. (Source: Adapted from Strayer & Drews, 2004.)


The dependent variable is how much you like the food. A graph of the interaction is shown in Figure 12.3. Notice that the lines cross each other; this kind of interaction is sometimes called a crossover interaction, and the results can be described with the phrase “it depends.” People’s preferred food temperature depends on the type of food.

To describe this interaction, you could say that when people eat ice cream, they like their food cold more than hot; when people eat pancakes, they like their food hot more than cold. You could also apply the mathematical definition by saying that there is a difference in differences. You like ice cream cold more than you like it hot (cold minus hot is a positive value), but you like pancakes cold less than you like them hot (cold minus hot is a negative value).

Here’s another example: the behavior of my dog, Fig. Does he sit down? It depends on whether I say “Sit,” and on whether I have a treat in my hand. When I don’t have a treat, Fig will not sit, even if I tell him to sit. If I do hold a treat, he will sit, but only when I say “Sit.” (In other words, my stubborn dog has to be bribed.) In this example, the probability that my dog will sit is the dependent variable, and the two independent variables are what I say (“Sit” or nothing) and what I am holding (a treat or nothing). Figure 12.4 shows a graph of this interaction. Notice that the lines are not parallel, and they do not cross over each other. This kind of interaction is sometimes called a spreading interaction, and the pattern can be described with the phrase “only when.” My dog sits when I say “Sit,” but only when I’m holding a treat.

Here is the mathematical description of this interaction: When I say nothing, there is zero difference between the treat and no-treat conditions (treat minus no treat equals zero). When I say “Sit,” there is a large difference between the treat and no-treat conditions (treat minus no treat equals a positive value). There is a difference in differences.

You can graph the interaction accurately either way—by putting the “What I say” independent variable on the x-axis, as in Figure 12.4, or by putting the “What I’m holding” independent variable on the x-axis, as shown in Figure 12.5. Although the two graphs may look a little different, each one is an accurate representation of the data.
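The “difference in differences” arithmetic for the dog example can be sketched in a few lines of Python. The proportions below are hypothetical values chosen only to mimic the “only when” pattern described for Fig; they are not read from Figure 12.4, and the name `sit_rate` is illustrative. The sketch also suggests why either independent variable can go on the x-axis: the difference in differences comes out the same whichever way the subtraction is organized.

```python
# Hypothetical cell means: proportion of time the dog sits.
# Keys are (what I say, what I'm holding). These numbers are assumptions
# that mimic the "only when" pattern; they are not data from the book.
sit_rate = {
    ("say nothing", "no treat"): 0.0,
    ("say nothing", "treat"): 0.0,
    ("say sit", "no treat"): 0.0,
    ("say sit", "treat"): 0.9,
}

# Organized by "what I say": treat minus no treat at each level.
treat_effect_when_silent = (
    sit_rate[("say nothing", "treat")] - sit_rate[("say nothing", "no treat")]
)
treat_effect_when_sit = (
    sit_rate[("say sit", "treat")] - sit_rate[("say sit", "no treat")]
)
diff_by_say = treat_effect_when_sit - treat_effect_when_silent

# Organized by "what I'm holding": "Sit" minus nothing at each level.
sit_effect_without_treat = (
    sit_rate[("say sit", "no treat")] - sit_rate[("say nothing", "no treat")]
)
sit_effect_with_treat = (
    sit_rate[("say sit", "treat")] - sit_rate[("say nothing", "treat")]
)
diff_by_holding = sit_effect_with_treat - sit_effect_without_treat

print("Treat effect when I say nothing:", treat_effect_when_silent)
print("Treat effect when I say Sit:", treat_effect_when_sit)
print("Difference in differences (by what I say):", diff_by_say)
print("Difference in differences (by what I hold):", diff_by_holding)
```

With these assumed values, one simple difference is zero and the other is positive, which is the spreading pattern; both ways of organizing the subtraction give the same difference in differences.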

FIGURE 12.3 A crossover interaction: “It depends.” How much you like certain foods depends on the temperature at which they are served. It’s equally correct to say that the temperature you prefer depends on which food you’re eating.
[Line graph. y-axis: How much do you like it? Labels: Ice cream, Pancakes.]

FIGURE 12.4 A spreading interaction: “Only when…” My dog sits when I say “Sit,” but only when I’m holding a treat.
[Line graph. x-axis: what I say (say nothing, say “Sit”); y-axis: proportion of time dog sits; lines: holding a treat, no treat.]

355 Review: Experiments with One Independent Variable

When psychological scientists think about behavior, they might start with a simple link between an independent and a dependent variable, but often they find they need a second independent variable to tell the full story. For example, in a romantic relationship, are positive attitudes, such as forgiveness, healthy? (In other words, does the independent variable of positive versus negative attitudes affect the dependent variable, relationship health?) The answer depends on how serious the disagreements are. Research shows that when difficulties are minor, positive attitudes are healthy for the relationship, but when the issues are major (e.g., one partner is abusive to the other or is drug-dependent), positive attitudes seem to prevent a couple from addressing their problems (McNulty, 2010). Thus, the degree of severity of the problems (minor versus major) is the second independent variable.

Does going to daycare hurt children’s social and intellectual development? It seems to depend on the quality of care. According to one study, high-quality daycare can benefit the social and intellectual development of kids (compared to children who have only parental care); when the quality of daycare is poor, development might be impaired (Vandell, Henderson, & Wilson, 1988). Reflect for a moment: What would the dependent and independent variables be in this example?

FIGURE 12.5 The same interaction, graphed the other way. The data in Figure 12.4 can be graphed equally accurately with the other independent variable on the x-axis.
[Line graph. x-axis: what I’m holding (holding a treat, no treat); y-axis: proportion of time dog sits; lines: say nothing, say “Sit.”]

Factorial Designs Study Two Independent Variables

When researchers want to test for interactions, they do so with factorial designs. A factorial design is one in which there are two or more independent variables (also referred to as factors). In the most common factorial design, researchers cross the two independent variables; that is, they study each possible combination of the independent variables. Strayer and Drews (2004) created a factorial design to test whether the effect of driving while talking on a cell phone depended on the driver’s age. They used two independent variables (cell phone use and driver age), creating a condition representing each possible combination of the two. As shown in Figure 12.6, to cross the two independent variables, they essentially overlaid one independent variable on top of another. This overlay process created four unique conditions, or cells: younger drivers using cell phones, younger drivers not using cell phones, older drivers using cell phones, and older drivers not using cell phones.

FIGURE 12.6 Factorial designs cross two independent variables. A second independent variable was overlaid on top of a first independent variable, creating (in this case) four new experimental conditions, or cells.

                   On cell phone                     Not on phone
Younger drivers    Younger drivers on cell phones    Younger drivers not on phones
Older drivers      Older drivers on cell phones      Older drivers not on phones

Figure 12.6 shows the simplest possible factorial design. There are two independent variables (two factors), cell phone use and age, and each one has two levels (driving while using a cell phone or not; younger or older driver). This particular design is called a 2 × 2 (two-by-two) factorial design, meaning that two levels of one independent variable are crossed with two levels of another independent variable. Since 2 × 2 = 4, there are four cells in this design.
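Crossing two independent variables amounts to pairing every level of one with every level of the other. As a minimal sketch (the variable names are illustrative, not from the study), Python's standard `itertools.product` can enumerate the cells of the 2 × 2 design described above:

```python
from itertools import product

# The two independent variables (factors) and their levels,
# as described for the Strayer and Drews (2004) design.
age_levels = ["younger drivers", "older drivers"]
cell_phone_levels = ["on cell phone", "not on phone"]

# Crossing the factors: each level of one is paired with each level of the other.
cells = list(product(age_levels, cell_phone_levels))

for age, phone in cells:
    print(f"{age}, {phone}")

# Number of cells = levels of factor 1 x levels of factor 2.
print(len(cells))
```

Adding a third level to either list would yield a 2 × 3 design with six cells, since the number of cells is the product of the numbers of levels.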


You might have noticed that one of the variables, cell phone use, was truly manipulated; the researchers had participants either talk or not talk on cell phones while driving (Strayer & Drews, 2004). The other variable, age, was not manipulated; it was a measured variable. The researchers did not assign people to be older or younger; they simply selected participants who fit those levels. Age is an example of a participant variable: a variable whose levels are selected (i.e., measured), not manipulated. Because the levels are not manipulated, variables such as age, gender, and ethnicity are not truly “independent” variables. However, when they are studied in a factorial design, researchers often call them independent variables, for the sake of simplicity.

Factorial Designs Can Test Limits

One reason researchers conduct studies with factorial designs is to test whether an independent variable affects different kinds of people, or people in different situations, in the same way. The study on cell phone use while driving is a good example of this purpose. By crossing age and cell phone use, the researchers were asking whether the effect of using a cell phone was limited to one age group only, or whether it would have the same effect on people of different ages.

This research team observed two samples of drivers: ages 18–25 and ages 65–74 (Strayer & Drews, 2004). Each participant drove in the simulator for a warm-up period and then drove 10-mile stretches in simulated traffic four times. During two of the four segments, drivers carried on a conversation using a hands-free phone, chatting with a research assistant about their day (Figure 12.7). The researchers collected data on a variety of dependent variables, including accidents, following distance, and braking onset time (how long it takes, in milliseconds, for a driver to brake for an upcoming road hazard). Figure 12.8 shows the results for braking onset time. Notice that the same results are presented in two ways: as a table and as a graph.

FIGURE 12.7 A young driver using a hands-free cell phone while driving in a simulator.

The results might surprise you. The primary conclusion from this study is that the effect of talking on a cell phone did not depend on age. Older drivers did tend to brake more slowly than younger ones, overall; that finding is consistent with past research on aging drivers. However, Strayer and Drews wanted to know whether the difference between the cell phone and control conditions would be different for older drivers. The answer was no. The effect of using a cell phone (i.e., the simple difference between the cell phone condition and the control condition) was about the same in both age groups. In other words, cell phone use did not interact with (did not depend on) age. At least for these two age groups, the harmful effect of cell phone use was the same.
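The comparison described above can be worked through with the group means reported in Figure 12.8. The sketch below only does the subtraction; the dictionary layout and variable names are illustrative. Note that the two simple differences are not numerically identical, so the conclusion that the effect was “about the same” rests on the researchers' statistical analysis, which this arithmetic does not reproduce.

```python
# Group means for braking onset time (ms), taken from Figure 12.8
# (adapted from Strayer & Drews, 2004).
braking_ms = {
    ("younger", "on phone"): 912,
    ("younger", "not on phone"): 780,
    ("older", "on phone"): 1086,
    ("older", "not on phone"): 912,
}

# Simple difference within each age group: on phone minus not on phone.
phone_effect_younger = (
    braking_ms[("younger", "on phone")] - braking_ms[("younger", "not on phone")]
)
phone_effect_older = (
    braking_ms[("older", "on phone")] - braking_ms[("older", "not on phone")]
)

# The interaction question: do the two simple differences differ?
difference_in_differences = phone_effect_older - phone_effect_younger

print("Cell phone effect, younger drivers (ms):", phone_effect_younger)
print("Cell phone effect, older drivers (ms):", phone_effect_older)
print("Difference in differences (ms):", difference_in_differences)
```

Both simple differences point the same way (slower braking on the phone), and the gap between them is small relative to the effects themselves, which fits the description of no interaction between cell phone use and age.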


You might have recognized this goal of testing limits as being related to external validity. When researchers test an independent variable in more than one group at once, they are testing whether the effect generalizes. Sometimes, as in the example of age and cell phone use while driving, the independent variable affects the groups in the same way, suggesting that the effect of cell phone use generalizes to drivers of all ages.

FIGURE 12.8 Factorial design results in table and graph formats. Table values depict group means. (Source: Adapted from Strayer & Drews, 2004.)

DV: Braking onset time (ms)
                    IV1: Cell phone condition
IV2: Driver age     On phone    Not on phone
Younger drivers     912         780
Older drivers       1086        912

[Graph of the same means. x-axis: cell phone condition (on cell phone, not on phone); y-axis: braking onset time (ms); lines: older drivers, younger drivers.]

In other cases, groups might respond differently to an independent variable. In one study, for instance, researchers tested whether the effect of alcohol intake on aggressive behavior depends on body weight (DeWall, Bushman, Giancola, & Webster, 2010). Using a procedure similar to that of Duke et al. (2011), they randomly assigned men to a placebo group and a drunk group and then measured their aggression in the shock game. As shown in Figure 12.9, they found the effect of alcohol was especially strong for the heavier men. In other words, there may be some truth to the stereotype of the “big, drunk, aggressive guy.”


The process of using a factorial design to test limits is sometimes called testing for moderators. Recall from Chapter 8 that a moderator is a variable that changes the relationship between two other variables (Kenny, 2009). In factorial design language, a moderator is an independent variable that changes the relationship between another independent variable and a dependent variable. In other words, a moderator results in an interaction; the effect of one independent variable depends on (is moderated by) the level of another independent variable. When Strayer and Drews studied whether driver age would interact with cell phone use, they found that driver age did not moderate the impact of cell phone use on braking onset time. However, DeWall and his colleagues showed that body weight moderates the effect of alcohol on aggression.

Factorial Designs Can Test Theories

Researchers can use factorial designs not only to test the generalizability of a causal variable but also to test theories. The goal of most experiments in psychological science is to test hypotheses derived from theories. Indeed, many theories make statements about how variables interact with one another. The best way to study how variables interact is to combine them in a factorial design and measure whether the results are consistent with the theory.

FIGURE 12.9 Testing limits with factorial design. According to this study, the effect of alcohol on aggression is stronger in heavier men than lighter men. The same results are graphed two ways, with different independent variables on the x-axis. (Source: Adapted from DeWall et al., 2010.)
[Two graphs of the same results, each with shock intensity (1-10 scale) on the y-axis: one with body weight (light men, 151 lb; heavy men, 215 lb) on the x-axis, the other with alcohol intake (placebo group, drunk group) on the x-axis.]

❯❯ To review how moderators work in correlational designs, see Chapter 8, pp. 228–230.

❯❯ To review the theory-data cycle, see Chapter 1, pp. 11–15.



Once studies established that alcohol intake can lead to aggressive behavior, researchers wanted to dig deeper. They theorized about why drinking alcohol causes aggression. One idea is that alcohol impairs the brain’s executive functioning; it interferes with a person’s ability to consider the consequences of his or her actions (Giancola, 2000). In addition to pharmacological effects, another theory suggests that through exposure to cultural messages and stereotypes about drinking behavior, people learn to cognitively associate alcohol with aggression. Merely thinking about alcohol might prime people to think about aggression. Researchers Bruce Bartholow and Adrienne Heinz (2006) sought to test the theory that alcohol can become cognitively associated with thoughts of aggression. They didn’t get anybody drunk in their research; they simply exposed them to pictures of alcohol.

In the lab, participants viewed a series of images and words on a computer screen. Their task was to indicate whether a string of letters was a word or a nonword. For example, the letter string EDVIAN would be classified as a nonword, and the letter string INVADE would be classified as a word. Some of the words were aggression-related (e.g., hit, combat, or fight) and others were neutral (e.g., sit, wonder, or caught).

Before seeing each of the word strings, participants viewed a photograph on the computer screen for a brief period (300 ms). Sometimes the photograph was related to alcohol, perhaps a beer bottle or a martini glass. Other times the photograph was not related to alcohol; it was a photo of a plant. The researchers hypothesized that people would be faster to identify aggression-related words after seeing the photos of alcohol. They used the computer to measure how quickly people responded to the words.

As shown in Figure 12.10, Bartholow and Heinz were interested in the interaction of two independent variables: photo type (alcohol or plant) and word type (aggressive or neutral). The results told the story they hypothesized: When people had just seen a photo of alcohol, they were quicker to identify an aggressive word. When people had just seen a photo of a plant, they were slower to identify an aggressive word (Figure 12.11).

FIGURE 12.10 Theory testing by crossing two independent variables. This type of design creates all possible combinations of the independent variables. Here, one independent variable (photo type) is crossed with another independent variable (word type) to create all four possible combinations.
[Diagram of the four cells created by crossing IV1, photo type (alcohol photo, plant photo), with IV2, word type (neutral words, aggression-related words). DV: reaction time (ms).]

This study is a good example of how a researcher can test a theory using a factorial design. The resulting interaction supported one theory of why alcohol intake causes aggressive behavior: People cognitively associate alcohol cues with aggressive concepts.
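The interaction in the Bartholow and Heinz (2006) results can be checked with the same difference-in-differences arithmetic, using the cell means reported in Figure 12.11. The sketch reads the table's two columns as alcohol photo, then plant photo, which matches the description in the text; the variable names are illustrative, and the code says nothing about whether the differences are statistically significant.

```python
# Mean reaction times (ms) from Figure 12.11
# (adapted from Bartholow & Heinz, 2006). Keys are (photo type, word type).
reaction_ms = {
    ("alcohol", "aggressive"): 551,
    ("plant", "aggressive"): 559,
    ("alcohol", "neutral"): 562,
    ("plant", "neutral"): 552,
}

# Simple difference for each word type: alcohol photo minus plant photo.
# A negative value means faster responses after the alcohol photo.
photo_effect_aggressive = (
    reaction_ms[("alcohol", "aggressive")] - reaction_ms[("plant", "aggressive")]
)
photo_effect_neutral = (
    reaction_ms[("alcohol", "neutral")] - reaction_ms[("plant", "neutral")]
)

difference_in_differences = photo_effect_aggressive - photo_effect_neutral

# In a crossover pattern, the two simple differences have opposite signs.
is_crossover = photo_effect_aggressive * photo_effect_neutral < 0

print("Photo effect, aggressive words (ms):", photo_effect_aggressive)
print("Photo effect, neutral words (ms):", photo_effect_neutral)
print("Difference in differences (ms):", difference_in_differences)
print("Opposite signs (crossover pattern):", is_crossover)
```

In these means, the alcohol photo speeds responses to aggressive words and slows responses to neutral words, so the two simple differences have opposite signs.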


Another study used a factorial design to test why memory capacity develops as children get older. One theory stated that adults remember more than children do simply because of accumulated knowledge. A richer knowledge structure enables adults to make more mental connections for storing new information. One researcher tested this theory by comparing the memory abilities of two groups: children who were chess experts (recruited from a chess tournament) and adults who were chess novices (Chi, 1978). Both groups performed two memory tasks: recalling digits (numbers) read in random order, and recalling the placement of pieces on a chessboard during a game in progress. Over a series of trials, participants were asked to remember more and more numbers and more and more chess pieces.

This study had a 2 × 2 design, with a participant variable (child experts or adult novices) and an independent variable (digits versus chess pieces). The number of items recalled was the dependent variable. The results, shown in Figure 12.12, clearly demonstrate that while the adults had better memory than children for digits, the children had a better memory than adults for the domain in which they had more knowledge: chess pieces.

In this study, the researchers used a factorial design to test their theory about why memory develops with age. The results showed the interaction predicted by the theory: Children’s memory capacity can be better than adults’ when they know a lot about the topic.

FIGURE 12.11 Factorial study results in table and graph formats. (Source: Adapted from Bartholow & Heinz, 2006.)

DV: Reaction time (ms)
                    IV1: Photo type
IV2: Word type      Alcohol    Plant
Aggressive          551        559
Neutral             562        552

[Graph of the same means. x-axis: photo type (alcohol photo, plant photo); y-axis: reaction time (ms); series: aggression-related words, neutral words.]

Interpreting Factorial Results: Main Effects and Interactions

After running a study with a factorial design with two independent variables, researchers, of course, want to analyze the results. In a design with two independent variables, there will be three results to inspect: two main effects and one interaction effect.


In a factorial design, researchers test each independent variable to look for a main effect—the overall effect of one independent variable on the dependent variable, averaging over the level