Part II: Consider Threats to Validity

In their landmark text Experimental and Quasi-Experimental Designs for Generalized Causal Inference, Shadish, Cook, and Campbell identify four types of validity that every researcher must consider: statistical conclusion validity (can we trust the statistical relationship we’ve found?), internal validity (is the relationship actually causal?), construct validity (are we measuring what we think we’re measuring?), and external validity (do the findings generalize beyond this specific study?). These four validity types form the conceptual backbone of Part II.

Understanding these validity types matters for two reasons. First, when you’re designing a study, you need to anticipate threats to each type of validity and build in protections. Second, when you’re reading others’ work, you need to assess where validity might be compromised and what that means for the conclusions. Design and appraisal are two sides of the same coin.

We’ll start with statistical inference, the process of using sample data to draw conclusions about broader populations. You’ll meet the p-value, that much-maligned and much-misunderstood creature, and learn what it actually tells you (spoiler: not what you probably think). We’ll explore why a result labeled “not significant” doesn’t mean “no effect,” why a result labeled “significant” doesn’t mean “important,” and why the scientific community is in something of a crisis over how we’ve been doing inference for decades. I’ll introduce you to alternatives like confidence intervals and Bayesian approaches that might help you reason more clearly about uncertainty.
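To make the sample-size point concrete before we get there, here is a quick illustrative simulation in Python. It's a sketch of my own, with made-up numbers rather than data from any study discussed later: the same small, real effect gets "missed" by a small study and flagged as highly significant by a large one, which is exactly why the significance label alone tells you neither "no effect" nor "important effect."

```python
# Illustrative sketch (made-up effect size, not from any study in this book):
# the same small, real difference looks "not significant" in a small study
# and "highly significant" in a large one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.05  # a tiny but real difference in means (in SD units)

for n in (25, 100_000):  # a small pilot study vs. a huge registry study
    treated = rng.normal(loc=true_effect, scale=1.0, size=n)
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_ind(treated, control)
    diff = treated.mean() - control.mean()
    print(f"n per arm = {n:>7}, observed difference = {diff:+.3f}, p = {p:.4f}")

# Typical output: the n=25 study "misses" the effect (p well above 0.05),
# while the n=100,000 study flags the very same tiny effect as significant.
```

Nothing about the underlying reality changed between the two runs; only the sample size did. Keep that in mind whenever a headline hangs on the word "significant."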

From there, we turn to causal inference—arguably the heart of most research questions we care about. When we ask whether bed nets prevent malaria, whether a therapy reduces depression, or whether a policy improves health outcomes, we’re asking causal questions. But the path from “these things are associated” to “this causes that” is treacherous. You’ll learn about confounding, that pesky problem where some third variable explains the apparent relationship between your cause and effect. You’ll draw DAGs (directed acyclic graphs), trace backdoor paths, and discover that sometimes adding more variables to your model makes things worse, not better. Most importantly, you’ll develop the habit of thinking counterfactually: what would have happened if things had been different?
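The claim that adding variables can make things worse may sound paradoxical, so here is a small illustrative simulation, again mine rather than an exercise from the chapter. Two variables with no causal connection at all become associated the moment you "control for" a common effect of both, the classic collider.

```python
# Illustrative sketch of collider bias (simulated data, not a chapter exercise):
# exposure x and outcome y are truly independent, but both cause c.
# Conditioning on c (here, by stratifying on it) manufactures an association.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)            # exposure
y = rng.normal(size=n)            # outcome, causally unrelated to x
c = x + y + rng.normal(size=n)    # collider: a common *effect* of x and y

print("Crude correlation of x and y:        ",
      round(np.corrcoef(x, y)[0, 1], 3))      # ~0, as it should be

high_c = c > np.median(c)                     # "adjusting" by stratifying on c
print("Correlation of x and y within high c:",
      round(np.corrcoef(x[high_c], y[high_c])[0, 1], 3))  # clearly negative
```

The DAG tells you in advance which variables to adjust for and which to leave alone; the simulation just shows what happens when you ignore its advice.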

The third chapter tackles generalizability and external validity. Suppose you’ve established that your intervention works—in your specific sample, in your specific setting, with your specific measures. Can you extend that finding to other populations, other contexts, other ways of implementing the intervention? This is the question of external validity, and it’s where much research falls short. The UTOS framework (Units, Treatments, Outcomes, Settings) gives you a systematic way to think about what might or might not generalize, and why “this worked in a trial” doesn’t automatically mean “this will work in your context.”

Finally, we confront measurement and construct validity. Before you can ask whether coffee causes longevity, you have to ask: what exactly do we mean by “coffee consumption”? One cup a day? Three? Espresso or drip? And what about “longevity”—are we talking about all-cause mortality, cardiovascular death, quality-adjusted life years? The gap between the abstract concept you want to study (the construct) and the specific thing you actually measure (the indicator) is where a lot of research goes quietly wrong. You’ll learn to evaluate whether instruments measure what they claim to measure, how to think about reliability and validity of measurement, and why sometimes the most sophisticated analysis in the world can’t rescue a study built on flawed measures.
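As a preview of why this matters quantitatively, here is one more illustrative sketch with assumed numbers: classical measurement error in the exposure attenuates the observed association, and no amount of downstream modeling of the noisy variable recovers the truth on its own.

```python
# Illustrative sketch (assumed reliability, not from the text): classical
# measurement error shrinks an observed association toward zero.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_exposure = rng.normal(size=n)
true_outcome = 0.5 * true_exposure + rng.normal(size=n)   # a real association

# Measured exposure = truth + noise; with equal variances, reliability is 0.5
noisy_exposure = true_exposure + rng.normal(scale=1.0, size=n)

print("Correlation using the true exposure:    ",
      round(np.corrcoef(true_exposure, true_outcome)[0, 1], 2))
print("Correlation using the noisy measurement:",
      round(np.corrcoef(noisy_exposure, true_outcome)[0, 1], 2))

# The observed correlation shrinks by roughly sqrt(reliability),
# no matter how sophisticated the analysis applied afterward.
```

That shrinkage is invisible if you only ever see the noisy measure, which is precisely why construct validity has to be argued before the analysis, not patched up after it.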

If this sounds like a lot of ways for things to go wrong, well, it is. But here’s the empowering part: once you learn to see these threats, you can design studies that minimize them, read studies with appropriate skepticism, and evaluate claims with nuance rather than naive acceptance or cynical rejection. You’re not trying to find the mythical perfect study—you’re trying to understand the specific ways a study might mislead you and whether those ways matter for the question at hand.

A note about what’s ahead

The next few chapters are denser than what we covered in Part I. They introduce notation, frameworks, and ways of thinking that may feel unfamiliar at first. That’s okay. You’re building the intellectual infrastructure that will serve you for the rest of your career. Take your time with these chapters. Return to them. The concepts will click with practice.